Reputation: 21
I need to work with a huge bz2 file (5+ GB) in Python. With my current code, I always get a memory error. I read somewhere that I could use sqlite3 to handle the problem. Is this right? If so, how should I adapt my code? (I'm not very experienced with sqlite3...)
Here is the beginning of my current code:
import csv, bz2
names = ('ID', 'FORM')
filename = "huge-file.bz2"
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2 file, so any help is very welcome! Thank you very much for any advice!
Upvotes: 1
Views: 2609
Reputation: 8855
The file is huge, and reading all of it at once won't work because your process will run out of memory.
The solution is to read the file in chunks or lines, and to process each one before reading the next.
The list comprehension line
tokens = [sentence for sentence in reader]
reads the whole file into tokens, and that is what may cause the process to run out of memory.
csv.DictReader can read the CSV records line by line, so on each iteration only one record is loaded into memory.
Like this:
# bz2.open in text mode ('rt') decompresses on the fly and yields str lines for csv (Python 3)
with bz2.open(filename, 'rt', newline='') as f:
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
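As an illustration of "do something with sentence", here is a minimal sketch that aggregates while streaming, so only running totals stay in memory. The function name count_forms, the choice to count FORM values, and the UTF-8 encoding are assumptions made for the example; they are not from the question.

```python
import bz2
import csv
from collections import Counter


def count_forms(path, names=('ID', 'FORM')):
    """Stream a tab-separated .bz2 file and count FORM values (hypothetical helper)."""
    counts = Counter()
    # 'rt' decompresses on the fly; encoding='utf-8' is an assumption about the data
    with bz2.open(path, 'rt', encoding='utf-8', newline='') as f:
        reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
        for sentence in reader:
            # only the running totals are kept, never the rows themselves
            counts[sentence['FORM']] += 1
    return counts
```

Memory use then depends on the number of distinct FORM values rather than on the size of the file.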
Please note that if, inside the added loop, the data from each sentence is again stored in another variable (like tokens), a lot of memory may still be consumed, depending on how big the data is. So it's better to aggregate the data, or to use another type of storage for it.
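Since the question mentions sqlite3, here is a rough sketch of using it as that other storage: rows are streamed from the bz2 file and written to a database in batches. The function name load_into_sqlite, the table name tokens, the column types, the batch size, and the UTF-8 encoding are all assumptions made for the example.

```python
import bz2
import csv
import sqlite3
from itertools import islice


def load_into_sqlite(bz2_path, db_path, batch_size=10000):
    """Copy (ID, FORM) rows from a tab-separated .bz2 file into SQLite in batches."""
    conn = sqlite3.connect(db_path)
    # table and column names are hypothetical
    conn.execute("CREATE TABLE IF NOT EXISTS tokens (id TEXT, form TEXT)")
    with bz2.open(bz2_path, 'rt', encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        while True:
            rows = list(islice(reader, batch_size))  # at most batch_size rows in memory
            if not rows:
                break
            conn.executemany(
                "INSERT INTO tokens (id, form) VALUES (?, ?)",
                (row[:2] for row in rows if len(row) >= 2),  # skip malformed lines
            )
            conn.commit()
    return conn  # the caller is responsible for closing the connection
```

Once the data is in the database, later passes can query it with SQL instead of holding all tokens in a Python list.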
About having some of the previous lines available in your process (as discussed in the comments): you can store the previous line in another variable, which gets replaced on each iteration.
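A minimal sketch of the single-previous-line idea, written as a generator so it works with any iterable, including the reader. The name pairs_with_previous is made up for the example.

```python
def pairs_with_previous(rows):
    """Yield (previous, current) pairs; previous is None for the first row."""
    previous = None
    for current in rows:
        yield previous, current
        previous = current  # replaced on each iteration, so only one old row is kept
```

It could be used as: for previous, sentence in pairs_with_previous(reader): ...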
Or, if you need multiple previous lines, you can keep a list of the last n lines.
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.
from collections import deque
# rest of the code ...
last_sentences = deque(maxlen=5) # keep the previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself using a list and manually keep track of its size.
Define an empty list before the loop. At the end of each iteration, append the current line, then check whether the list is longer than what you need and, if so, drop the older items.
last_sentences = [] # keep the previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
    if len(last_sentences) > 5: # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-5:]
Upvotes: 2