Reputation: 1201
In search of a solution similar to this, but in Python, using gzip or zlib.
The solution from the SO question How to inflate a partial zlib file does not work (see the first test case).
Not a duplicate of Unzipping part of a .gz file using python; that one does not work (and is outdated).
These two: Unzip part of a file using python gzip module and Is it possible to figure how to decompress a file, knowing its first bytes? are close to this question (though different), but unfortunately the first one doesn't have a working solution and the second one has no answers at all...
I am iterating over chunked pieces of gzip-compressed bytes received from a remote server. It looks something like this:
async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(FILE, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                # Write the decompressed chunk
                # to `f`
                ...
The following are the non-working solutions:
1)
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(FILE, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                # Write the decompressed chunk
                r = decompressor.decompress(chunk, chunk_size)
                # for some reason `r` is always empty,
                # so writing to `f` is pointless
                print(f"{len(chunk) = }, {r = }, {len(r) = }")
And here, r always comes back empty.
stdout:
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
len(chunk) = 64, r = b'', len(r) = 0
...
2)
Calling zlib.decompress(...) doesn't seem to work on partial data either:
async with aiohttp.ClientSession() as session:
    async with session.get(LINK) as response:
        with open(DIR, "wb") as f:
            async for chunk in response.content.iter_chunked(chunk_size):
                f.write(zlib.decompress(chunk))
This raises:
Traceback (most recent call last):
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 54, in <module>
    asyncio.run(main())
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 51, in main
    await download_content(0)
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 47, in download_content
    f.write(zlib.decompress(chunk))
zlib.error: Error -3 while decompressing data: incorrect header check
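For reference, this error reproduces in isolation: zlib.decompress() defaults to wbits=15, which expects a zlib header, while this data carries a gzip header (a minimal demonstration, not part of my script):

import gzip
import zlib

blob = gzip.compress(b"hello world")

try:
    zlib.decompress(blob)  # default wbits expects a zlib header, not gzip
except zlib.error as e:
    print(e)  # Error -3 while decompressing data: incorrect header check

# wbits=31 accepts the gzip wrapper, but only for a complete stream:
print(zlib.decompress(blob, wbits=31))  # b'hello world'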
3)
Passing in gzip.decompress(chunk) like this:
with open(DIR, "wb") as f:
    async for chunk in response.content.iter_chunked(chunk_size):
        f.write(gzip.decompress(chunk))
Causes this:
Traceback (most recent call last):
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 54, in <module>
    asyncio.run(main())
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 51, in main
    await download_content(0)
  File "c:\Users\lumin\Desktop\rplace\get_data.py", line 47, in download_content
    f.write(gzip.decompress(chunk))
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 557, in decompress
    return f.read()
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 301, in read
    return self._buffer.read(size)
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 479, in read
    self._read_eof()
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 523, in _read_eof
    crc32, isize = struct.unpack("<II", self._read_exact(8))
  File "C:\Users\lumin\AppData\Local\Programs\Python\Python310\lib\gzip.py", line 425, in _read_exact
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
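Again for reference, this EOFError reproduces with any gzip stream cut off mid-member, since gzip.decompress() insists on reading through to the 8-byte CRC32/size trailer (a minimal demonstration):

import gzip
import os

# os.urandom data is incompressible, so the result is well over 64 bytes
blob = gzip.compress(os.urandom(1024))

try:
    gzip.decompress(blob[:64])  # a lone 64-byte chunk, as in the loop above
except EOFError as e:
    print(e)  # Compressed file ended before the end-of-stream marker was reached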
The full code looks something like this:
from typing import Final

import aiohttp
import asyncio
import os

if os.name == "nt":
    # Prevent noisy exit on Windows
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())


async def download_content(
    number: int, *, directory: str | None = None, chunk_size: int = 64
) -> None:
    """
    Download the content of an archived canvas history file
    and extract it immediately.

    Args:
        number: The number associated with the archive.
        directory: The directory to extract the file to, defaults to root.
        chunk_size: The size of the chunks to download and extract, defaults to 64.

    Raises:
        TypeError: An argument got an invalid type.
        ValueError: number wasn't between 0 and 77.
    """
    if not isinstance(number, int):
        raise TypeError(f"'number' must be of type 'int' got {type(number)}")
    if not isinstance(directory, str) and directory is not None:
        raise TypeError(f"'directory' must be of type 'str' got {type(directory)}")
    if not isinstance(chunk_size, int):
        raise TypeError(f"'chunk_size' must be of type 'int' got {type(chunk_size)}")
    if not 0 <= number <= 77:
        raise ValueError(f"'number' must be between 0 and 77 got {number}")

    LINK: Final[str] = "https://placedata.reddit.com/data/canvas-history/2022_place_canvas_history-"
    FILE_LOCATION: Final[str] = f"{number:012}.csv.gzip"  # zero-padded to 12 digits
    DIR: Final[str] = directory if directory is not None else "./"

    async with aiohttp.ClientSession() as session:
        async with session.get(LINK + FILE_LOCATION) as response:
            with open(DIR + FILE_LOCATION[:-5], "wb") as f:
                async for chunk in response.content.iter_chunked(chunk_size):
                    # Write the decompressed chunk to the file
                    ...


async def main():
    await download_content(0)


asyncio.run(main())
TL;DR: We receive a gzip file in chunks and want to decompress each partial chunk as it arrives and write the result to a file.
Upvotes: 1
Views: 1611
Reputation: 112339
I don't think that the second argument of decompress() means what you think it means. It is not the length of the input (which is already available from the byte array itself), but rather a constraint on the length of the decompressed data returned. You should not even specify it, allowing decompress() to return all of the decompressed data it has so far.
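To illustrate that buffering behavior (a minimal sketch, not part of the program below): with a max_length of 10, the rest of the input ends up in unconsumed_tail and would have to be fed back in by hand on the next call.

import gzip
import zlib

blob = gzip.compress(b"x" * 1000)

capped = zlib.decompressobj(31)
part = capped.decompress(blob, 10)
# Output is capped at 10 bytes; the unprocessed input is buffered:
print(len(part), len(capped.unconsumed_tail))

uncapped = zlib.decompressobj(31)
print(len(uncapped.decompress(blob)))  # 1000 -- everything at once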
The complete program below works for me. I used split -b 64 to split a gzip file into 64-byte chunks xaa, xab, etc., and then ran it with the arguments x?? to provide those chunks in order. The combined decompressed result was correctly written to stdout.
#!/usr/bin/python3
import sys
import zlib

gz = zlib.decompressobj(31)
for arg in sys.argv[1:]:
    with open(arg, "rb") as f:
        chunk = f.read()
        sys.stdout.buffer.write(gz.decompress(chunk))
sys.stdout.buffer.write(gz.flush())
(The final flush isn't really needed, as the last decompress() will return all of the decompressed data from the last chunk. I include it for completeness, to effectively close the decompression object and release any resources it has sequestered.)
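Mapped onto the aiohttp loop from the question, the same approach would look roughly like this (a sketch under the question's setup; the function name is mine, link/path/chunk_size stand in for the question's placeholders, and aiohttp is assumed not to be transparently decompressing the response already):

import zlib

import aiohttp

async def download_and_decompress(link: str, path: str, chunk_size: int = 64) -> None:
    gz = zlib.decompressobj(31)  # 31 = 16 + zlib.MAX_WBITS: gzip wrapper
    async with aiohttp.ClientSession() as session:
        async with session.get(link) as response:
            with open(path, "wb") as f:
                async for chunk in response.content.iter_chunked(chunk_size):
                    # No max_length argument: write out everything available
                    f.write(gz.decompress(chunk))
                f.write(gz.flush())  # not strictly needed, see above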
Upvotes: 2