(not a bug) question about bert `create_pretraining_data.tokenize_lines()` #1592

## Description
The function [`scripts.pretraining.bert.create_pretraining_data.tokenize_lines()`](https://github.com/dmlc/gluon-nlp/blob/5ff0519aa5a89e7e2a2c0afab164e17de55231b4/scripts/pretraining/bert/create_pretraining_data.py#L141-L145) contains the following snippet:

```
    for line in lines:
        if not line:
            break
        line = line.strip()
        # Empty lines are used as document delimiters
        if not line:
            results.append([])
        else:
            # <OMITTED FOR BREVITY...>
    return results
```

This suggests that an empty or null line (e.g. `""` or `None`) breaks out of the for-loop, so only the lines processed up to that point are returned, whereas a line that is empty only after stripping (e.g. `"  "`) is used as a document delimiter.
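To make the distinction concrete, here is a minimal sketch that mirrors the control flow of the snippet. The name `toy_tokenize_lines` is hypothetical, and `line.split()` is only a stand-in for the omitted tokenizer logic, so this shows the branching behaviour rather than the actual gluon-nlp implementation.

```python
def toy_tokenize_lines(lines):
    """Hypothetical stand-in that mirrors the control flow quoted above."""
    results = []
    for line in lines:
        # A falsy line ("" or None) stops processing altogether.
        if not line:
            break
        line = line.strip()
        # A whitespace-only line becomes an empty document delimiter.
        if not line:
            results.append([])
        else:
            # Assumption: whitespace splitting replaces the omitted tokenizer logic.
            results.append(line.split())
    return results


lines = ["first doc\n", "  \n", "second doc\n", "", "never reached\n"]
print(toy_tokenize_lines(lines))
# [['first', 'doc'], [], ['second', 'doc']]
```

Note that a blank line read from a file normally arrives as `"\n"`, which is truthy and therefore takes the delimiter branch, while a truly empty string `""` takes the `break` branch.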

Could someone shed light on what the empty-line check followed by `break` is meant to accomplish? Are empty/null lines used as `<EOF>` delimiters?
