Reputation: 75
I'm trying to read multiple csv.gz files into a dataframe but it's not working as I expect.
When I use this globbing pattern:
pl.read_csv('folder_1\*.csv.gz')
It returns this error:
ComputeError: cannot scan compressed csv; use read_csv for compressed data
This error occurred with the following context stack: >[1] 'csv scan' failed [2] 'select' input failed to resolve
Which is strange considering I'm using the very function they suggest. However, passing this globbing pattern for csv works completely fine:
pl.read_csv('folder_1\*.csv')
How can I get around this? I'm currently just using glob.glob() and iterating through the list, but I thought it would look neater without it.
Upvotes: 2
Views: 328
Reputation: 5331
When I pass a glob string such as blah/blah/blah/*.csv.gz to pl.read_csv, it passes this on to pl.scan_csv because it is a glob string. See polars.io.csv.functions, line 514 et seq., in version 1.1.0.
There are two separate questions here:
How do you read multiple CSVs? You can read multiple CSVs by passing a glob string to pl.scan_csv. It returns a lazy data frame that you can then evaluate with .collect().
How do you read a compressed CSV? You can read certain types of compressed CSVs with pl.read_csv (a notable exception being .csv.xz files, which do not work).
But put the two questions together and it turns out pl.scan_csv does not support compressed files at all, at least as of version 1.1.0. This is an open issue.
If you want a one-liner for reading your CSVs, you will have to fall back on something like a list comprehension with eager execution:
import polars as pl
from glob import glob
dfs = [pl.read_csv(path) for path in glob('*.csv.gz')]
Then do what you will with the list of data frames (e.g. pl.concat).
Upvotes: 2