Benjamin Du

Reputation: 1881

Can I specify a schema when reading/writing a Parquet file using Polars in Python?

When reading a CSV file using Polars in Python, we can use the dtypes parameter to specify the schema to use (for some columns). I wonder whether we can do the same when reading or writing a Parquet file. I tried specifying the dtypes parameter, but it doesn't work.
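
For reference, this is the CSV pattern I mean (the file name and column name here are just placeholders):

import polars as pl

# read_csv accepts a partial schema via dtypes;
# "data.csv" is a placeholder file name
df = pl.read_csv("data.csv", dtypes={"id0": pl.UInt64})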

I have some Parquet files generated from PySpark and want to load those Parquet files into Rust. The Rust code requires unsigned integers, while Spark/PySpark has no unsigned integer types and writes signed integers into Parquet files. To make things simpler, I'd like to convert the column types of the Parquet files before loading them into Rust. I know there are several ways to achieve this (in both pandas and polars), but I wonder whether there's an easy and efficient way to do it using polars.

The code I used to cast column types using polars in Python is below.

import polars as pl

...
df["id0"] = df.id0.cast(pl.datatypes.UInt64)

Upvotes: 5

Views: 7141

Answers (1)

ritchie46

Reputation: 14670

Parquet files have a schema. We respect the schema of:

  • the parquet file upon reading
  • the DataFrame upon writing

If you want to change the schema on read/write, you need to cast the columns of the DataFrame.

That's what we would do internally if we accepted a schema argument, so the efficiency is the same.
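
A minimal sketch of that round trip, casting the id0 column from the question (the file names are hypothetical):

import polars as pl

# read the Parquet file; its embedded schema is respected
df = pl.read_parquet("data.parquet")

# cast the signed column to the unsigned type Rust expects
df = df.with_columns(pl.col("id0").cast(pl.UInt64))

# write back; the DataFrame's (now casted) schema is respected
df.write_parquet("data_unsigned.parquet")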

Upvotes: 4
