Reputation: 469
I'm unable to use polars dataframes with scikit-learn for ML training.
Currently, I preprocess all dataframes in polars and then convert them to pandas for model training so that it works.
Is there any method to directly use polars dataframes with the scikit-learn API (without converting to pandas first)?
Upvotes: 13
Views: 12844
Reputation: 10279
Since this question was asked, scikit-learn 1.4 has been released, improving compatibility with polars.
For example, see the set_output() method of an instance of sklearn.compose.ColumnTransformer. It can be used as follows.
We start with some sample data
import polars as pl
df = pl.DataFrame({
    "num": [1, 2, 3],
    "cat": ["a", "b", "c"],
})
and apply the ColumnTransformer as follows.
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
# create column transformer
transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["num"]),
        ("cat", OrdinalEncoder(), ["cat"]),
    ]
)
# enable polars output
transformer.set_output(transform="polars")
# fit and transform polars dataframe
transformer.fit_transform(df)
The output is again a pl.DataFrame object.
shape: (3, 2)
┌───────────┬──────────┐
│ num__num  ┆ cat__cat │
│ ---       ┆ ---      │
│ f64       ┆ f64      │
╞═══════════╪══════════╡
│ -1.224745 ┆ 0.0      │
│ 0.0       ┆ 1.0      │
│ 1.224745  ┆ 2.0      │
└───────────┴──────────┘
Upvotes: 15
Reputation: 65
The upcoming release of scikit-learn, 1.4 (currently in development), adds polars output support (and support for the __dataframe__ protocol in estimators). See GitHub PRs 26464 and 27315 for more info.
(Note: I just happened across this question and then saw this in the scikit-learn changelog, credit goes to Thomas J. Fan)
Edit: ColumnTransformer for preprocessing may (hopefully!) be coming as well, but is not yet available in the nightlies.
Upvotes: 1
Reputation: 469
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder
encoding_transformer1 = ColumnTransformer(
    [("Normalizer", Normalizer(), ['Age', 'Fare']),
     ("One-hot encoder",
      OneHotEncoder(dtype=int, handle_unknown='infrequent_if_exist'),
      ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'])],
    n_jobs=-1,
    verbose=True,
    verbose_feature_names_out=True)
encoding_transformer1.fit(xtrain)
train_data = encoding_transformer1.transform(xtrain).tocsr()
test_data = encoding_transformer1.transform(xtest).tocsr()
I'm getting this error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
What should I do?
Upvotes: 1
Reputation: 14730
You must call to_numpy when passing a DataFrame to sklearn. Although sklearn can sometimes work on a polars Series, it is still good type hygiene to convert to the type the host library expects.
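import numpy as np
import polars as pl
from sklearn.linear_model import LinearRegression
# 100 rows, 5 columns named column_0 ... column_4
data = pl.DataFrame(
    np.random.randn(100, 5)
)
# features: every column except column_0
x = data.select([
    pl.all().exclude("column_0"),
])
# target: column_0, renamed to "y"
y = data.select(pl.col("column_0").alias("y"))
# simple 80/20 split by row position
x_train = x[:80]
y_train = y[:80]
x_test = x[80:]
y_test = y[80:]
m = LinearRegression()
# convert to NumPy before handing the data to sklearn
m.fit(X=x_train.to_numpy(), y=y_train.to_numpy())
m.predict(x_test.to_numpy())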
Upvotes: 11