MoneyBall

Reputation: 2573

Reading a .csv.zst file with pandas

I want to load .csv.zst into a dataframe:

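import os
import zstandard as zstd  # assumed: the zstandard package, which provides ZstdDecompressor

for ex in examples:
    path = root + "f=" + ex + "/" + date
    filenames = os.listdir(path)

    for d in filenames:
        zst_datapath = path + "/" + d
        with open(zst_datapath, 'rb') as fh:
            data = fh.read()
            # allow frames with a window of up to 2 GiB
            dctx = zstd.ZstdDecompressor(max_window_size=2147483648)
            decompressed = dctx.decompress(data)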

What I want to do is read the decompressed data as a CSV file:

with open(decompressed, 'rb') as f:
    csv_data = f.read()
    csv = pd.read_csv(csv_data)

However, I get a "File name too long" error. How do I load the decompressed data into a pandas dataframe?

Upvotes: 1

Views: 1581

Answers (1)

DrYak

Reputation: 1136

Your main problem is that after going:

decompressed = dctx.decompress(data)

The variable decompressed now contains the whole uncompressed data (that is, the content of the .csv.zst file itself). And then when you do:

with open(decompressed, 'rb') as f:

You are trying to open a file whose name is "{content of your csv}".

What you are thinking of is making an input stream out of the decompressed data. The io module's StringIO is what you are looking for: you pass it text, and you get back a file-like object that behaves as if it came from open(). Note that decompress() returns bytes, so the data has to be decoded first (UTF-8 is assumed here):

import io

with io.StringIO(decompressed.decode("utf-8")) as f:
    csv_data = f.read()
    csv = pd.read_csv(csv_data)
    # crashes here:---^

Except that this will crash too, because read_csv() treats a string as a path, so it will again look for a file whose name is "{content of your csv}".

If you want to pass a block of text to read_csv(), you need to pass the f object itself:

import io

with io.StringIO(decompressed.decode("utf-8")) as f:
    csv = pd.read_csv(f)
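
As a self-contained sketch of the file-like-object approach: decompress() returns bytes, so this version wraps them in io.BytesIO rather than StringIO, which avoids a decode step because read_csv() also accepts binary buffers. The helper name csv_bytes_to_frame and the sample data are made up for illustration; in the question's loop, the bytes would come from dctx.decompress(data).

```python
import io

import pandas as pd


def csv_bytes_to_frame(raw: bytes) -> pd.DataFrame:
    """Parse CSV content that is already in memory (hypothetical helper)."""
    # BytesIO turns the bytes into a file-like object, so read_csv()
    # reads them as content instead of interpreting them as a path.
    with io.BytesIO(raw) as buffer:
        return pd.read_csv(buffer)


# Stand-in for the bytes returned by dctx.decompress(data).
decompressed = b"name,score\nalice,1\nbob,2\n"
frame = csv_bytes_to_frame(decompressed)
print(frame)
```

If the file is not UTF-8, read_csv()'s encoding argument can be set accordingly.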

This will work, except that read_csv() can also decompress files. So with recent pandas you can skip the whole "decompression" part and give it the file name directly; pandas will take care of decompressing:

csv = pd.read_csv(zst_datapath)

Note that different compression schemes require different dependencies to be installed; for zstd, pandas relies on the zstandard package.
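
One caveat with the direct route: the question's code raises max_window_size to 2 GiB, and a plain read_csv(zst_datapath) call will use the decompressor's default limit instead. The sketch below assumes pandas 1.4 or later with zstandard installed, and relies on the documented dict form of the compression argument, where extra keys are forwarded to zstandard.ZstdDecompressor. The helper name read_zst_csv is hypothetical.

```python
import pandas as pd


def read_zst_csv(path, max_window_size=2**31):
    """Read a zstd-compressed CSV straight from disk (hypothetical helper).

    Assumes pandas >= 1.4 with the zstandard package installed.
    """
    return pd.read_csv(
        path,
        # "method" selects zstd explicitly; the remaining key is passed
        # on to zstandard.ZstdDecompressor, mirroring the question's setting.
        compression={"method": "zstd", "max_window_size": max_window_size},
    )


# Usage inside the question's loop would look like:
#     csv = read_zst_csv(zst_datapath)
```

If the files were compressed with default settings, the dict is unnecessary and read_csv(zst_datapath) alone should infer the compression from the .zst extension.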

Hope that this helps.

Upvotes: 2
