Reputation: 468
Trying to tokenize and encode data to feed to a neural network.
I only have 25GB RAM and every time I try to run the code below my Google Colab crashes with "Your session crashed after using all available RAM". Any idea how to prevent this from happening?
I thought tokenizing/encoding chunks of 50,000 sentences would work, but unfortunately it doesn't. The code works on a dataset of length 1.3 million; the current dataset has a length of 5 million.
max_q_len = 128
max_a_len = 64
trainq_list = train_q.tolist()
batch_size = 50000
def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
    return encoded_sent
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)
The tokenizer comes from HuggingFace:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
Upvotes: 2
Views: 4509
Reputation: 24814
You should use generators and pass data to tokenizer.batch_encode_plus, no matter the size.
Conceptually, something like this:
Your trainq_list probably holds a list of sentences read from some file(s). If this is a single large file, you could follow this answer to lazily read parts of the input (preferably batch_size lines at a time):
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
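Note that read_in_chunks above yields raw character chunks, while the tokenizer wants lists of sentences. If your data is one sentence per line, a minimal sketch of a line-based variant (read_in_line_chunks is a hypothetical name, yielding batch_size lines at a time) could be:
from itertools import islice

def read_in_line_chunks(file_object, chunk_size=50000):
    """Lazily yield lists of chunk_size lines (sentences) at a time."""
    while True:
        lines = list(islice(file_object, chunk_size))
        if not lines:
            break
        yield [line.rstrip("\n") for line in lines]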
Otherwise, open the files one at a time (each should be much smaller than memory, because the data will get way larger after encoding with BERT), something like this:
import pathlib
def read_in_chunks(directory: pathlib.Path):
    # Use "*.txt" or any other extension your files might have
    for file in directory.glob("*"):
        with open(file, "r") as f:
            yield f.readlines()
The encoder should take this generator and yield back encoded parts, something like this:
# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
    for text in generator:
        yield tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False,
        )
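Each yielded item is a BatchEncoding that behaves like a dict; with return_token_type_ids=False it contains input_ids and attention_mask. A small consumption sketch (the directory path is just an example), using the directory-based reader above:
chunks = read_in_chunks(pathlib.Path("data/questions"))  # hypothetical directory
for batch in batch_encode(chunks, max_seq_len=128):
    input_ids = batch["input_ids"]            # lists of token ids, padded/truncated to max_seq_len
    attention_mask = batch["attention_mask"]  # 1 for real tokens, 0 for padding
    # process or persist each batch here instead of keeping everything in RAM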
As the encoded data will be too large to fit in RAM, you should save it to disk (or consume it somehow as it is generated).
Something along those lines:
import numpy as np
# I assume np.arrays are created, adjust to PyTorch Tensors or anything if needed
def save(encoding_generator):
    for i, encoded in enumerate(encoding_generator):
        np.save(str(i), encoded)
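Putting the pieces together, an end-to-end sketch (directory name hypothetical) might look like:
encoding_generator = batch_encode(
    read_in_chunks(pathlib.Path("data/questions")), max_seq_len=128
)
save(encoding_generator)  # writes 0.npy, 1.npy, ... one file per encoded batch
The saved batches can later be read back one at a time with np.load(..., allow_pickle=True), so the full encoded dataset never has to sit in RAM.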
Upvotes: 4