Contradictory error when using Polars read_csv() with multiple files for csv.gz

Question

I'm trying to read multiple csv.gz files into a dataframe but it's not working as I expect.

When I use this globbing pattern:

pl.read_csv('folder_1\*.csv.gz')

It returns this error:

ComputeError: cannot scan compressed csv; use read_csv for compressed data

This error occurred with the following context stack: >[1] 'csv scan' failed [2] 'select' input failed to resolve

Which is strange considering I'm using the very function they suggest. However, passing this globbing pattern for csv works completely fine:

pl.read_csv('folder_1\*.csv')

How can I get around this? I'm currently working around it by calling glob.glob() and iterating through the resulting list, but I thought the code would look neater without that.

ifly6 · Accepted Answer

There are two separate questions here:

How do you read multiple CSVs? You can read multiple CSVs by passing a glob string to pl.scan_csv. It returns a lazy data frame that you can then evaluate with .collect().
How do you read a compressed CSV? You can read certain types of compressed CSVs with pl.read_csv (a notable exception being .csv.xz files, which do not work).

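But put the two questions together and it turns out that pl.scan_csv does not support compressed files at all. The 'csv scan' entry in your error's context stack suggests that pl.read_csv with a glob pattern goes through that same scan path, which would explain the seemingly contradictory message. This is an open issue at the time of writing.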

If you want a one-liner for reading your CSVs, you will have to fall back on something like a list comprehension with eager execution:

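from glob import glob
import polars as pl
# Expand the pattern with glob, then read each compressed file eagerly
frames = [pl.read_csv(path) for path in glob('*.csv.gz')]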

Then do what you will with the resulting list of data frames (e.g. pl.concat).
