Reputation: 125
I am trying to read a text file that is about 7 GB. Whenever I read it, it takes much longer than I expect.
For example, a 350 MB text file takes my laptop about a minute or less, so 7 GB (20 times as much data) should ideally take about 20 minutes. Mine takes much longer than that, and I want to shorten the time spent reading and processing the data.
I am using the following code for reading:
import json

records = []  # a descriptive name; calling it `list` would shadow the built-in type
for line in open(filename, 'r'):
    try:
        records.append(json.loads(line))
    except ValueError:  # skip lines that are not valid JSON
        pass
After reading the file, I filter out unnecessary data by building a second list and deleting the first one. If you have any suggestions, please let me know.
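Roughly, that post-processing step looks like this (a sketch; is_needed stands in for my actual filter condition):

def is_needed(record):
    # placeholder for the real condition that decides whether to keep a record
    return "ERROR" not in record

filtered = [record for record in records if is_needed(record)]
del records  # drop the original list so it can be garbage-collected
records = filtered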
Upvotes: 2
Views: 2226
Reputation: 154454
The 7 GB file is likely taking significantly longer than 20 times the 350 MB file because you don't have enough memory to hold all the data at once. This means that, at some point, your operating system will start swapping some of the data out (writing it from memory onto disk) so that the memory can be re-used.
This is slow because your hard disk is significantly slower than RAM, and at 7GB there will be a lot of data being read from your hard disk, put into RAM, then moved back to your page file (the file on disk your operating system uses to store data that has been copied out of RAM).
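One way to confirm that memory is the bottleneck is to sample a slice of the file and measure how much RAM the parsed objects occupy; here is a minimal sketch using the standard-library tracemalloc module (the filename and sample size are placeholders):

import json
import tracemalloc

# Rough check of how much memory the parsed objects take, by sampling
# the first 100,000 lines ("myfile" stands in for the real path).
tracemalloc.start()
records = []
with open("myfile") as f:
    for i, line in enumerate(f):
        try:
            records.append(json.loads(line))
        except ValueError:
            continue
        if i >= 100000:
            break
current, peak = tracemalloc.get_traced_memory()
print("peak: about %d MB for 100k lines" % (peak // 1000000))

Scaling that peak figure up to the full file will give a rough idea of whether the whole list can fit in RAM.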
My suggestion would be to re-work your program so that it only needs to store a small portion of the file in memory at a time. Depending on your problem, you can likely do this by moving some of the logic into the loop that reads the file. For example, if your program is trying to find and print all the lines which contain "ERROR", you could re-write it from:
import json

lines = []
for line in open("myfile"):
    lines.append(json.loads(line))

for line in lines:
    if "ERROR" in line:
        print(line)
To:
import json

for line_str in open("myfile"):
    line_obj = json.loads(line_str)
    if "ERROR" in line_obj:
        print(line_obj)
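More generally, you can wrap the read-and-parse loop in a generator so the rest of the program never holds more than one record at a time; a sketch along the same lines (iter_records and the filename are illustrative):

import json

def iter_records(path):
    # Yield one parsed JSON object per line, skipping malformed lines,
    # so only a single record is in memory at any moment.
    with open(path) as f:
        for line in f:
            try:
                yield json.loads(line)
            except ValueError:
                continue

for record in iter_records("myfile"):
    if "ERROR" in record:
        print(record)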
Upvotes: 7