Reputation: 123
I've worked with decompressing and reading files on the fly in memory with the bz2
library. However, I've read through the documentation and can't seem to find a way to simply decompress the file into a brand new file on the file system, without holding the decompressed data in memory. Sure, you could read line by line using BZ2Decompressor and write each line to a file, but that would be insanely slow (I'm decompressing massive files, 50 GB+). Is there some method or library I have overlooked that achieves the same functionality as the terminal command bz2 -d myfile.ext.bz2
in Python, without a hacky solution involving a subprocess to call that terminal command?
Example of why bz2 is so slow:

Decompressing that file via bz2 -d: 104 seconds

Analytics on a decompressed file (just involves reading line by line): 183 seconds

    with open(file_src) as x:
        for l in x:
            # ... analytics ...

Decompressing the file and running analytics in one pass: over 600 seconds (this time should be at most 104 + 183)

    if file_src.endswith(".bz2"):
        bz_file = bz2.BZ2File(file_src)
        for l in bz_file:
            # ... analytics ...
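For reference, the incremental approach mentioned in the question can be sketched with bz2.BZ2Decompressor streaming straight to disk. This is only a sketch: the function name and chunk size are illustrative, and it assumes a single-stream .bz2 file.

```python
import bz2

def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    # Feed compressed bytes to BZ2Decompressor in fixed-size chunks and
    # write the output straight to disk, so memory use stays bounded
    # regardless of the total file size.
    decompressor = bz2.BZ2Decompressor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(decompressor.decompress(chunk))
```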
Upvotes: 5
Views: 4752
Reputation: 725
For smaller files that you can store in memory before you save them to a file, you can use bz2.open
to decompress the file and save it as an uncompressed new file:

    import bz2

    # decompress data
    with bz2.open('compressed_file.bz2', 'rb') as f:
        uncompressed_content = f.read()

    # store decompressed file (the with block closes the file for you)
    with open('new_uncompressed_file.dat', 'wb') as f:
        f.write(uncompressed_content)
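For files too large to hold in memory, the same bz2.open handle can be drained in fixed-size chunks instead of a single f.read(). A sketch (the function name and the 1 MB chunk size are arbitrary):

```python
import bz2

def decompress_in_chunks(src_path, dst_path, chunk_size=1024 * 1024):
    # Copy decompressed bytes one chunk at a time, so only a single
    # chunk is ever held in memory.
    with bz2.open(src_path, "rb") as f, open(dst_path, "wb") as out:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            out.write(chunk)
```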
Upvotes: -2
Reputation: 140168
You could use the bz2.BZ2File
object, which provides a transparent file-like handle.
(edit: you seem to use that already, but don't use readlines()
on a binary file, or on a text file either, because in your case the block size isn't big enough, which explains why it's slow)
Then use shutil.copyfileobj
to copy to the write handle of your output file (you can adjust the block size if you can afford the memory):

    import bz2, shutil

    with bz2.BZ2File("file.bz2") as fr, open("output.bin", "wb") as fw:
        shutil.copyfileobj(fr, fw)

Even if the file is big, it doesn't take more memory than the block size. Adjust the block size like this:

    shutil.copyfileobj(fr, fw, length=1000000)  # read in 1 MB chunks
Upvotes: 10