Pain

Reputation: 75

Contradictory error when using Polars read_csv() with multiple files for csv.gz

I'm trying to read multiple csv.gz files into a dataframe but it's not working as I expect.

When I use this globbing pattern:

pl.read_csv('folder_1\*.csv.gz')

It returns this error:

ComputeError: cannot scan compressed csv; use read_csv for compressed data

This error occurred with the following context stack:
[1] 'csv scan' failed
[2] 'select' input failed to resolve

Which is strange considering I'm using the very function they suggest. However, passing this globbing pattern for csv works completely fine:

pl.read_csv('folder_1\*.csv')

How can I get around this? I'm currently just using glob.glob() and iterating through the list, but I thought it would look neater without it.

Upvotes: 2

Views: 328

Answers (1)

ifly6

Reputation: 5331

When you pass a glob string such as blah/blah/blah/*.csv.gz to pl.read_csv, the function hands it off to pl.scan_csv internally because it is a glob string. See polars.io.csv.functions line 514 et seq in version 1.1.0.

There are two separate questions here:

  • How do you read multiple CSVs? You can read multiple CSVs by passing a glob string to pl.scan_csv. It returns a lazy data frame that you can then evaluate with .collect().

  • How do you read a compressed CSV? You can read certain types of compressed CSVs with pl.read_csv (a notable exception being .csv.xz files, which do not work).

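But put the two questions together and it turns out that pl.scan_csv, which is where pl.read_csv sends a glob string, does not support compressed files at all. That is why the error tells you to use the function you are already calling. This is an open issue.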

If you want a one liner for reading your CSVs, you will have to fall back on something like a list comprehension with eager execution:

from glob import glob
import polars as pl
# Read each compressed file eagerly, one DataFrame per file
l = [pl.read_csv(i) for i in glob('*.csv.gz')]

Then do what you will with the list of data frames (e.g. combine them with pl.concat).

Upvotes: 2
