Kathiravan Natarajan

Reputation: 3498

Reading a big language corpus without a MemoryError on a 16 GB RAM computer

I found that Google NMT uses codecs for reading the input data file:

import codecs
import tensorflow as tf

# excerpt from inside a helper function in the NMT code
with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    return f.read().splitlines()

I have two questions.

  1. Does the above approach support reading huge datasets (5 GB or more) without a MemoryError on a personal computer with 16 GB of RAM, given that it uses tf.gfile.GFile? I would really appreciate a solution that lets me read a huge language corpus without getting a MemoryError.

  2. I have imported codecs in the code, so why am I getting the error "NameError: name 'codecs' is not defined"?

EDIT 1 :

For question 2, I now get:

 OutOfRangeError                           Traceback (most recent call last)
    <ipython-input-7-e78786c1f151> in <module>()
          6 input_file = os.path.join(source_path)
          7 with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    ----> 8     source_text = f.read().splitlines()

OutOfRangeError is raised when an operation iterates past the valid input range. How can I fix this?

Upvotes: 2

Views: 160

Answers (1)

Prabhav

Reputation: 457

If the file is very large, it is recommended to process it line by line. The code below will do the trick:

with open(input_file, encoding="utf-8") as infile:
    # iterating over the file object reads one line at a time,
    # so the whole corpus never has to fit in memory
    for line in infile:
        do_something_with(line)
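
If you want to keep the UTF-8 decoding and the tf.gfile.GFile wrapper from the question, the same line-by-line idea applies, since the stream reader can be iterated instead of calling f.read(). A minimal sketch, assuming input_file is defined as in the question and do_something_with is a placeholder for your own processing:

import codecs
import tensorflow as tf

# Iterating over the reader yields one decoded line at a time,
# so the multi-GB corpus is never loaded into memory in one go.
with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    for line in f:
        do_something_with(line.rstrip("\n"))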

Upvotes: 2
