How to use polars dataframes with scikit-learn?

I'm unable to use polars dataframes with scikit-learn for ML training.

Currently, I preprocess all dataframes in polars and then convert them to pandas for model training so that it works.

Is there any method to directly use polars dataframes with the scikit-learn API (without converting to pandas first)?

Upvotes: 13

Answers (4)

Hericks

Reputation: 10279

Since this question was asked, scikit-learn 1.4 has been released with improved polars compatibility.

For example, see the set_output() method of an instance of the sklearn.compose.ColumnTransformer. It can be used as follows.

We start with some sample data

import polars as pl

df = pl.DataFrame({
    "num": [1, 2, 3],
    "cat": ["a", "b", "c"],
})

and apply the ColumnTransformer as follows.

from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# create column transformer
transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["num"]),
        ("cat", OrdinalEncoder(), ["cat"]),
    ]
)

# enable polars output
transformer.set_output(transform="polars")

# fit and transform polars dataframe
transformer.fit_transform(df)

The output again is a pl.DataFrame object.

shape: (3, 2)
┌───────────┬──────────┐
│ num__num  ┆ cat__cat │
│ ---       ┆ ---      │
│ f64       ┆ f64      │
╞═══════════╪══════════╡
│ -1.224745 ┆ 0.0      │
│ 0.0       ┆ 1.0      │
│ 1.224745  ┆ 2.0      │
└───────────┴──────────┘

Upvotes: 15

Chrstfer CB

Reputation: 65

The upcoming development version of scikit-learn, 1.4, has added polars output support (and added support for the __dataframe__ protocol to estimators). See GitHub PRs 26464 and 27315 for more info.

(Note: I just happened across this question and then saw this in the scikit-learn changelog, credit goes to Thomas J. Fan)

Edit: ColumnTransformer for preprocessing may (hopefully!) be coming as well, but is not yet available in the nightlies.

Upvotes: 1

Regular Tech Guy

Reputation: 469

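from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

# xtrain and xtest are polars DataFrames
encoding_transformer1 = ColumnTransformer(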
    [("Normalizer", Normalizer(), ['Age', 'Fare']),
     ("One-hot encoder",
      OneHotEncoder(dtype=int, handle_unknown='infrequent_if_exist'),
      ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'])],
    n_jobs=-1,
    verbose=True,
    verbose_feature_names_out=True)

encoding_transformer1.fit(xtrain)
train_data = encoding_transformer1.transform(xtrain).tocsr()
test_data = encoding_transformer1.transform(xtest).tocsr()

I'm getting this error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

What should I do?

Upvotes: 1

ritchie46

Reputation: 14730

You must call to_numpy when passing a polars DataFrame to sklearn. Although sklearn can sometimes work directly on polars Series, it is still good type hygiene to convert to the type the host library expects.

import numpy as np
import polars as pl
from sklearn.linear_model import LinearRegression

data = pl.DataFrame(
    np.random.randn(100, 5)
)

x = data.select([
    pl.all().exclude("column_0"),
])

y = data.select(pl.col("column_0").alias("y"))


x_train = x[:80]
y_train = y[:80]

x_test = x[80:]
y_test = y[80:]


m = LinearRegression()

m.fit(X=x_train.to_numpy(), y=y_train.to_numpy())
m.predict(x_test.to_numpy())

Upvotes: 11
