Reputation: 2573
I want to load a .csv.zst file into a dataframe:
for ex in examples:
    path = root + "f=" + ex + "/" + date
    data = os.listdir(path)
    for d in data:
        zst_datapath = path + "/" + d
        with open(zst_datapath, 'rb') as fh:
            data = fh.read()
            dctx = zstd.ZstdDecompressor(max_window_size=2147483648)
            decompressed = dctx.decompress(data)
What I want to do is read the decompressed data as a CSV file:
with open(decompressed, 'rb') as f:
    csv_data = f.read()
    csv = pd.read_csv(csv_data)
However, I get a "File name too long" error. How do I load the decompressed data into a pandas dataframe?
Upvotes: 1
Views: 1581
Reputation: 1136
Your main problem is that after running:
decompressed = dctx.decompress(data)
The variable decompressed now contains the whole uncompressed data (that is, the content of the csv.zst file itself).
And then when you do:
with open(decompressed, 'rb') as f:
You are trying to open a file whose name is "{content of your csv}".
What you are after is an input stream over the decompressed data, and the io module's StringIO is what you are looking for. You pass it text and you get back a file-like object that behaves like a file opened with open(). Since decompress() returns bytes, decode it first (or use io.BytesIO instead):
import io
# decompress() returns bytes, but StringIO needs str (UTF-8 assumed)
with io.StringIO(decompressed.decode("utf-8")) as f:
    csv_data = f.read()
    csv = pd.read_csv(csv_data)
    # crashes here:---^
Except that THIS WILL crash too, because read_csv() treats a string as a "path", so again it will be looking for a file whose name is "{content of your csv}".
If you want to pass a block of text to read_csv, you need to pass the f object itself:
import io
# decompress() returns bytes, but StringIO needs str (UTF-8 assumed)
with io.StringIO(decompressed.decode("utf-8")) as f:
    csv = pd.read_csv(f)
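To make the difference between a string and a file-like object concrete, here is a minimal sketch. The sample_bytes value and its column names are made up for illustration and only stand in for whatever dctx.decompress(data) returns; UTF-8 is an assumption about the encoding.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bytes returned by dctx.decompress(data).
sample_bytes = b"a,b\n1,2\n3,4\n"

# Bytes can be wrapped directly in BytesIO; read_csv decodes them itself.
df_from_bytes = pd.read_csv(io.BytesIO(sample_bytes))

# Text needs StringIO, so decode first (UTF-8 is an assumption here).
df_from_text = pd.read_csv(io.StringIO(sample_bytes.decode("utf-8")))

# A plain str is interpreted as a path, so this fails with an OSError
# subclass (typically FileNotFoundError, or "File name too long").
try:
    pd.read_csv(sample_bytes.decode("utf-8"))
    path_error = None
except OSError as exc:
    path_error = exc
```

Both file-like variants should give the same dataframe, so io.BytesIO is probably the simpler choice when you already hold bytes.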
This will work, except that read_csv can also decompress files itself. So with a recent pandas you can skip the manual decompression entirely and pass the file name directly. Pandas will take care of decompressing:
csv = pd.read_csv(zst_datapath)
Note that each compression scheme needs its own dependency installed to work; for zstd, that is the zstandard package.
Hope that this helps.
Upvotes: 2