Reputation: 123
I've worked with decompressing and reading files on the fly in memory with the bz2
library. However, I've read through the documentation and can't seem to find a way to simply decompress the file into a brand new file on the file system, without holding the decompressed data in memory. Sure, you could read line by line using BZ2Decompressor and write each line to a file, but that would be insanely slow (I'm decompressing massive files, 50 GB+). Is there some method or library I have overlooked that achieves the same functionality as the terminal command bz2 -d myfile.ext.bz2
in Python, without a hacky solution involving a subprocess to call that terminal command?
Example of why bz2 is so slow:

Decompressing that file via bz2 -d: 104 seconds

Analytics on a decompressed file (just involves reading line by line): 183 seconds

    with open(file_src) as x:
        for l in x:
            # ... analytics ...

Decompressing the file and running analytics in one pass: over 600 seconds (this time should be at most 104 + 183)

    if file_src.endswith(".bz2"):
        bz_file = bz2.BZ2File(file_src)
        for l in bz_file:
            # ... analytics ...
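For reference, the incremental approach mentioned in the question can be sketched with bz2.BZ2Decompressor streaming straight to disk. This is only a sketch: the function name and chunk size are illustrative, and it assumes a single-stream .bz2 file.

```python
import bz2

def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    # Feed compressed bytes to BZ2Decompressor in fixed-size chunks and
    # write the output straight to disk, so memory use stays bounded
    # regardless of the total file size.
    decompressor = bz2.BZ2Decompressor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(decompressor.decompress(chunk))
```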
Upvotes: 5
Views: 4752
Reputation: 725
For smaller files that you can store in memory before you save them to a file, you can use bz2.open
to decompress the file and save it as an uncompressed new file:

    import bz2

    # decompress data
    with bz2.open('compressed_file.bz2', 'rb') as f:
        uncompressed_content = f.read()

    # store decompressed file (the with block closes the file for you)
    with open('new_uncompressed_file.dat', 'wb') as f:
        f.write(uncompressed_content)
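For files too large to hold in memory, the same bz2.open handle can be drained in fixed-size chunks instead of a single f.read(). A sketch (the function name and the 1 MB chunk size are arbitrary):

```python
import bz2

def decompress_in_chunks(src_path, dst_path, chunk_size=1024 * 1024):
    # Copy decompressed bytes one chunk at a time, so only a single
    # chunk is ever held in memory.
    with bz2.open(src_path, "rb") as f, open(dst_path, "wb") as out:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            out.write(chunk)
```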
Upvotes: -2
Reputation: 140168
You could use the bz2.BZ2File
object, which provides a transparent file-like handle.
(edit: you seem to use that already, but don't use readlines()
on a binary file, or on a text file either, because in your case the block size isn't big enough, which explains why it's slow)
Then use shutil.copyfileobj
to copy to the write handle of your output file (you can adjust the block size if you can afford the memory):

    import bz2, shutil

    with bz2.BZ2File("file.bz2") as fr, open("output.bin", "wb") as fw:
        shutil.copyfileobj(fr, fw)

Even if the file is big, it doesn't take more memory than the block size. Adjust the block size like this:

    shutil.copyfileobj(fr, fw, length=1000000)  # read in 1 MB chunks
Upvotes: 10