Reputation: 6914
I am trying to read a bunch of gzip-compressed csv files from S3 using pyarrow.
The documentation page of pyarrow.csv.read_csv
says
If a string or path, and if it ends with a recognized compressed file extension (e.g. ".gz" or ".bz2"), the data is automatically decompressed in reading.
Unfortunately, I cannot provide a string value as the input path, so the CSV reader assumes no compression.
import s3fs
import pyarrow.csv as pv
s3 = s3fs.core.S3FileSystem(anon=False)
csv_path = 's3://bucket_name/path/to/file.csv.gz'
with s3.open(csv_path) as s3fp:
    table = pv.read_csv(s3fp)
I tried digging into the pyarrow internals, but I wasn't able to find a way to pass the compression type explicitly.
Upvotes: 0
Views: 2511
Reputation: 6914
Found a workaround: it is possible to insert a gzip decompression step between the S3 file handle and the CSV reader:
import gzip
import s3fs
import pyarrow.csv as pv
s3 = s3fs.core.S3FileSystem(anon=False)
csv_path = 's3://bucket_name/path/to/file.csv.gz'
with s3.open(csv_path) as s3fp:
    with gzip.open(s3fp) as fp:
        table = pv.read_csv(fp)
Upvotes: 3