This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

(not a bug) question about bert create_pretraining_data.tokenize_lines() #1592

Open
@kiukchung

Description


The function scripts.pretraining.bert.create_pretraining_data.tokenize_lines() contains the following snippet:

    for line in lines:
        if not line:
            break
        line = line.strip()
        # Empty lines are used as document delimiters
        if not line:
            results.append([])
        else:
            #<OMITTED FOR BREVITY...>
    return results

This suggests that an empty or null line (e.g. "" or None) breaks out of the for-loop, so only the lines processed up to that point are returned, whereas a line that is empty only after stripping (e.g. " ") is used as a document delimiter.

Could someone shed light on what the empty-line check followed by the break is meant to accomplish? Are empty/null lines used as delimiters?
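
For reference, here is a minimal sketch that reproduces the control flow in question. It is not the actual GluonNLP code: the function name tokenize_lines_sketch is hypothetical, and the omitted branch is replaced by a plain whitespace split purely as a stand-in for the real tokenizer. It shows that a whitespace-only line yields an empty list (a delimiter), while a truly empty string or None stops processing.

```python
def tokenize_lines_sketch(lines):
    """Mimic the loop from the issue; str.split is an assumed stand-in tokenizer."""
    results = []
    for line in lines:
        # True only for "" or None: stop and return what has been processed
        if not line:
            break
        line = line.strip()
        # True for whitespace-only input such as " " or "\n": document delimiter
        if not line:
            results.append([])
        else:
            # Assumption: placeholder for the code omitted in the issue
            results.append(line.split())
    return results


# A blank line that still carries its newline acts as a delimiter
print(tokenize_lines_sketch(["a b\n", "\n", "c d\n"]))  # [['a', 'b'], [], ['c', 'd']]

# A truly empty string ends processing early
print(tokenize_lines_sketch(["a b\n", "", "c d\n"]))    # [['a', 'b']]
```

Note that lines obtained by iterating over a text file keep their trailing newline, so a blank line in a file arrives as "\n" rather than "". The break would therefore only trigger for input produced some other way, for example a list that contains "" or None.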

