Reputation: 2719
In Polars, how can one specify a single dtype for all columns in read_csv?
According to the docs, the schema_overrides argument to read_csv can take either a mapping (dict) in the form of {'column_name': dtype}, or a list of dtypes, one for each column.
However, it is not clear how to specify "I want all columns to be a single dtype".
If you wanted all columns to be String, for example, and you knew the total number of columns, you could do:
pl.read_csv('sample.csv', schema_overrides=[pl.String]*number_of_columns)
However, this doesn't work if you don't know the total number of columns. In Pandas, you could do something like:
pd.read_csv('sample.csv', dtype=str)
But this doesn't work in Polars.
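One workaround would be to count the columns first by reading only the header, though that means touching the file twice. A minimal sketch, assuming sample.csv is the file in question:

import polars as pl

# Read zero data rows just to learn the column count from the header
number_of_columns = pl.read_csv('sample.csv', n_rows=0).width
df = pl.read_csv('sample.csv', schema_overrides=[pl.String] * number_of_columns)

Ideally, though, there would be a single argument that applies one dtype to every column.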
Upvotes: 15
Views: 26049
Reputation: 8286
If you want to read all columns as str (pl.String in Polars), set infer_schema=False, since Polars uses String as the default type when reading CSVs without schema inference.
pl.read_csv('sample.csv', infer_schema=False)
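As a quick check with a made-up in-memory CSV (read_csv also accepts file-like objects), every column comes back as String:

import io
import polars as pl

csv = io.StringIO('a,b\n1,x\n2,y\n')
df = pl.read_csv(csv, infer_schema=False)
print(df.schema)  # both columns are String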
This is the TLDR of ritchie46's more detailed answer. I broke it out into a separate answer as his code snippet solves the general case for any datatype and not the special but common case of reading all as strings.
Upvotes: 7
Reputation: 14730
Reading all data in a CSV as any type other than pl.String will likely fail with a lot of null values. We can use expressions to declare how we want to deal with those null values.
If you read a CSV with infer_schema_length=0, Polars does not know the schema and will read all columns as pl.String, as that is a supertype of all Polars types. Once everything is read as String, we can use expressions to cast all columns.
(pl.read_csv("test.csv", infer_schema_length=0)
.with_columns(pl.all().cast(pl.Int32, strict=False))
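As a hedged illustration of that pattern with made-up data (rather than the test.csv above): strict=False turns values that cannot be parsed as Int32 into null instead of raising an error.

import io
import polars as pl

csv = io.StringIO('a,b\n1,x\n2,y\n')
df = (pl.read_csv(csv, infer_schema_length=0)
      .with_columns(pl.all().cast(pl.Int32, strict=False)))
print(df)  # 'a' parses to Int32; the non-numeric values in 'b' become null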
Update: infer_schema=False was added in 1.2.0 as a more user-friendly name for this feature.
pl.read_csv("test.csv", infer_schema=False) # read all as pl.String
Upvotes: 21