How to build a dataset from a large text file without getting a memory error?

Question

I have a text file with size > 7.02 GB. I have already built a tokenizer based on this text file. I want to build a dataset like so:

from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data.txt", block_size=128,)

Since the size of my data is very large, a memory error occurs. This is the source code:

with open(file_path, encoding="utf-8") as f:
        lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

    batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
    print(batch_encoding)
    self.examples = batch_encoding["input_ids"]
    self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]

Supposing that my text file has only 4 lines, the following will be printed:

{'input_ids': [[49, 93, 1136, 1685, 973, 363, 72, 3130, 16502, 18], [44, 73, 1685, 279, 7982, 18, 225], [56, 13005, 1685, 4511, 3450, 18], [56, 19030, 1685, 7544, 18]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

I have changed the source code as the following so that the memory error doesn't appear:

for line in open(file_path, encoding="utf-8"):
        if (len(line) > 0 and not line.isspace()):
            new_line = line.split()

            batch_encoding = tokenizer(new_line, add_special_tokens=True, truncation=True, max_length=block_size)
            print(batch_encoding)
            print(type(batch_encoding))
            self.examples = batch_encoding["input_ids"]
            self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]
print(batch_encoding)

However, the following will be printed:

{'input_ids': [[49, 93], [3074], [329], [2451, 363, 72, 3130, 16502, 18]], 'token_type_ids': [[0, 0], [0], [0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1], [1], [1], [1, 1, 1, 1, 1, 1]]}

{'input_ids': [[44, 73], [329], [69], [23788, 18]], 'token_type_ids': [[0, 0], [0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1], [1, 1]]}

{'input_ids': [[56, 13005], [329], [7522], [7958, 18]], 'token_type_ids': [[0, 0], [0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1], [1, 1]]}

{'input_ids': [[56, 19030], [329], [11639, 18]], 'token_type_ids': [[0, 0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1, 1]]}
{'input_ids': [[56, 19030], [329], [11639, 18]], 'token_type_ids': [[0, 0], [0], [0, 0]], 'attention_mask': [[1, 1], [1], [1, 1]]}

How can I change the source code in order to be able to read the large text file line by line but get the same output as desired without a memory error?

How to build a dataset from a large text file without getting a memory error?

Answers (1)

Related Questions