Lucas

Reputation: 1448

How to define types of columns while loading dataframe in polars?

I'm using polars and I would like to define the type of the columns while loading a dataframe. In pandas, I can use dtype:

df=pd.read_csv("iris.csv", dtype={'petal_length':str})

I'm trying to do the same thing in polars, but so far without success. Here is what I have tried:

use polars::prelude::*;
use std::fs::File;
use std::collections::HashMap;


fn main() {
    let df = example();
    println!("{:?}", df.expect("Cannot find dataframe").head(Some(10)))
}

fn example() -> Result<DataFrame> {
    let file = File::open("iris.csv")
                    .expect("could not read file");
    let mut myschema = HashMap::new();
    myschema.insert("sepal_length", f64);
    myschema.insert("sepal_width", f64); 
    myschema.insert("petal_length",String); 
    myschema.insert("petal_width", f64); 
    myschema.insert("species", String); 

    CsvReader::new(file)
            .with_schema(myschema)
            .has_header(true)
            .finish()
}

My question is: what type does with_schema expect? I printed the schema of a DataFrame loaded using infer_schema(None). It prints an object that looks like a dictionary:

Schema { fields: [Field { name: "sepal_length", data_type: Float64 }, Field { name: "sepal_width", data_type: Float64 }, Field { name: "petal_length", data_type: Float64 }, Field { name: "petal_width", data_type: Float64 }, Field { name: "species", data_type: Utf8 }] }

But I cannot figure out what object I should use to build my schema.

Also, is there a way to specify the type of one variable, instead of all of them?

Upvotes: 5

Views: 4890

Answers (3)

Panagiotis Kokolis

Reputation: 11

The code that uses Schema::new (see ritchie46's answer) no longer compiles as of today. The solution is to use Schema::from_iter instead:

    let myschema = Schema::from_iter(
        vec![
            Field::new("sepal_length", DataType::Float64),
            Field::new("sepal_width", DataType::Float64),
            // Newer polars releases name the string dtype DataType::String (formerly Utf8).
            Field::new("petal_length", DataType::String),
            Field::new("petal_width", DataType::Float64),
            Field::new("species", DataType::String),
        ]
    );

Upvotes: 1

C. Thomas Brittain

Reputation: 386

A slight update to ritchie46's answer. As Robert stated, the vector needs to be converted to an iterator, and it looks like we should now use Schema::from instead of Schema::new. I have not run the code below, but it compiles.

...
        let myschema = Schema::from(
            vec![
                Field::new("sepal_length", DataType::Float64),
                Field::new("sepal_width", DataType::Float64),
                Field::new("petal_length", DataType::Utf8),
                Field::new("petal_width", DataType::Float64),
                Field::new("species", DataType::Utf8),
            ]
            .into_iter(),
        );
...

Upvotes: 1

ritchie46

Reputation: 14630

The with_schema method expects an Arc<Schema>, not a HashMap.

The following code works:

use polars::prelude::*;
use std::sync::Arc;

fn example() -> Result<DataFrame> {
    let file = "iris.csv";

    let myschema = Schema::new(
        vec![
            Field::new("sepal_length", DataType::Float64),
            Field::new("sepal_width", DataType::Float64),
            Field::new("petal_length", DataType::Utf8),
            Field::new("petal_width", DataType::Float64),
            Field::new("species", DataType::Utf8),
        ]
    );

    CsvReader::from_path(file)?
        .with_schema(Arc::new(myschema))
        .has_header(true)
        .finish()
}

Also, is there a way to specify the type of one variable, instead of all of them?

Yes, you can use with_dtype_overwrite, which expects a partial schema.
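
To make the idea of a "partial schema" concrete, here is a minimal, standard-library-only sketch of the concept. It is not polars code and does not show how polars implements this: ColType, infer and schema_with_overrides are hypothetical names invented for this illustration. It only shows that columns named in the partial schema get the requested dtype, while every other column keeps its inferred one.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a column dtype; NOT the polars `DataType` enum.
#[derive(Debug, Clone, PartialEq)]
enum ColType {
    Float64,
    Text,
}

/// Infer a dtype from one sample value: Float64 if it parses as f64, else Text.
fn infer(value: &str) -> ColType {
    if value.trim().parse::<f64>().is_ok() {
        ColType::Float64
    } else {
        ColType::Text
    }
}

/// Build a full schema from a header and one sample row. Any column named
/// in `overrides` (the "partial schema") wins over the inferred dtype.
fn schema_with_overrides(
    header: &[&str],
    sample_row: &[&str],
    overrides: &HashMap<&str, ColType>,
) -> Vec<(String, ColType)> {
    header
        .iter()
        .zip(sample_row.iter())
        .map(|(&name, &value)| {
            let dtype = overrides
                .get(name)
                .cloned()
                .unwrap_or_else(|| infer(value));
            (name.to_string(), dtype)
        })
        .collect()
}

fn main() {
    let header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"];
    let sample_row = ["5.1", "3.5", "1.4", "0.2", "setosa"];

    // Partial schema: only the one column we want to force to text.
    let mut overrides = HashMap::new();
    overrides.insert("petal_length", ColType::Text);

    let schema = schema_with_overrides(&header, &sample_row, &overrides);
    for (name, dtype) in &schema {
        println!("{name}: {dtype:?}");
    }
}
```

In this sketch only petal_length is listed in the overrides, so it is the only column whose dtype differs from what inference alone would produce.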

Upvotes: 4
