Marc

Reputation: 21

Handle huge bz2-file

I need to work with a huge bz2 file (5+ GB) using Python. With my current code, I always get a memory error. I read somewhere that I could use sqlite3 to handle the problem. Is this right? If so, how should I adapt my code? (I'm not very experienced with sqlite3...)

Here is the current beginning of my code:

import csv, bz2

names = ('ID', 'FORM')

filename = "huge-file.bz2"

with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]

After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2 file, so any help is very welcome. Thank you very much for any advice!

Upvotes: 1

Views: 2609

Answers (1)

farzad

Reputation: 8855

The file is huge, and reading all of it at once won't work: your process will run out of memory.

The solution is to read the file in chunks/lines, and process each one before reading the next.

The list comprehension line

tokens = [sentence for sentence in reader]

reads the whole file into tokens, and may cause the process to run out of memory.

csv.DictReader can read the CSV records line by line, meaning that on each iteration only one line of data is loaded into memory.

Like this (opening the compressed file directly with BZ2File):

with bz2.BZ2File(filename, 'rb') as f:
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass

Note that if, inside this loop, the data from each sentence is again accumulated in another variable (like tokens), the process may still consume a lot of memory, depending on how big the data is. So it's better to aggregate as you go, or to use some other type of storage for that data.
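Since you mentioned sqlite3: as one example of "other storage", here is a minimal sketch that streams the rows into an on-disk sqlite3 database instead of keeping them in a list. It assumes Python 3 (where the csv module expects a text-mode file, hence bz2.open(..., 'rt')), a hypothetical tokens.db database file, and a two-column table matching the ID/FORM fields from your question:

import bz2
import csv
import sqlite3

names = ('ID', 'FORM')

conn = sqlite3.connect('tokens.db')  # on-disk database, hypothetical name
conn.execute('CREATE TABLE IF NOT EXISTS tokens (id TEXT, form TEXT)')

# 'rt' opens the compressed file in text mode, which csv expects in Python 3
with bz2.open('huge-file.bz2', 'rt') as f:
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # each row goes straight to disk, so memory use stays flat
        conn.execute('INSERT INTO tokens (id, form) VALUES (?, ?)',
                     (sentence['ID'], sentence['FORM']))

conn.commit()
conn.close()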

Update

Regarding having some of the previous lines available while processing (as discussed in the comments), there are two options.

You can store the previous line in another variable, which gets replaced on each iteration.
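For instance, a minimal sketch of that single-variable approach (reusing the reader from above):

prev_sentence = None  # no previous line exists yet on the first iteration
for sentence in reader:
    # process sentence; prev_sentence holds the line before it (or None)
    prev_sentence = sentence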

Or, if you need multiple previous lines, you can keep a collection of the last n lines.

How

Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the standard library's collections module at the top of your file.

from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5)  # keep the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)

I suggest the above solution, but you can also implement it yourself using a plain list, manually keeping track of its size.

Define an empty list before the loop. At the end of each iteration, append the current line, and if the list has grown beyond the number of lines you need, drop the oldest ones:

last_sentences = []  # keep the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
    if len(last_sentences) > 5:  # make sure we won't keep all the previous sentences
        del last_sentences[0]  # drop the oldest line

Upvotes: 2
