Reputation: 3498
I found that Google NMT uses codecs for reading the input data file:
import codecs
import tensorflow as tf

with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    return f.read().splitlines()
I have two questions.

1. Can I read a file of size more than 5 GB or so without a MemoryError on a personal computer with 16 GB RAM, given that it uses tf.gfile.GFile? I would really appreciate a solution that helps me read a huge language corpus without getting a MemoryError.

2. I have imported codecs in the code, so why am I getting the error "NameError: name 'codecs' is not defined"?
EDIT 1:

For 2, I now get:

OutOfRangeError                           Traceback (most recent call last)
<ipython-input-7-e78786c1f151> in <module>()
      6 input_file = os.path.join(source_path)
      7 with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
----> 8     source_text = f.read().splitlines()

OutOfRangeError is raised when an operation iterates past the valid input range. How can I fix this?
Upvotes: 2
Views: 160
Reputation: 457
If the file is very large, it is recommended to process it line by line rather than loading the whole thing into memory. The code below will do the trick:

with open(input_file) as infile:
    for line in infile:
        do_something_with(line)
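Since the question's snippet decodes the corpus as UTF-8, here is a minimal sketch of the same line-by-line idea with an explicit encoding. It uses a generator so only one line is held in memory at a time; `read_lines` is a hypothetical helper name, not part of the NMT code:

```python
import io

def read_lines(path):
    # Stream the corpus one line at a time instead of calling
    # f.read().splitlines(), which loads the entire file into RAM.
    # io.open with an explicit encoding replaces the codecs wrapper.
    with io.open(path, "r", encoding="utf-8") as infile:
        for line in infile:
            yield line.rstrip("\n")

# Usage: iterate lazily over a huge corpus
# for sentence in read_lines("corpus.txt"):
#     do_something_with(sentence)
```

The memory footprint stays constant regardless of file size, because the generator never materializes the full list of lines.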
Upvotes: 2