echo55

Reputation: 329

Why isn't the MLJ OneHotEncoder transforming the data frame?

Sorry if I'm missing something, but I don't understand why this doesn't work:

using DataFrames, MLJ

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ M      │
│ 2   │ 2     │ F      │
│ 3   │ 3     │ F      │
│ 4   │ 4     │ M      │

julia> hot_model = OneHotEncoder()
julia> hot = machine(hot_model, df)
julia> fit!(hot)
julia> Xt = MLJ.transform(hot, df)

Xt is exactly the same as df; the columns were not transformed. I tried specifying the features in OneHotEncoder(), but that doesn't change anything. I also saw that you can put the encoder in a pipeline, wrapping it and fitting only at the end together with the model, but it should work on its own like this, no? Is it maybe because of the type of the columns? What scitype should they be? Categorical? How can I change them to that?

Upvotes: 1

Views: 264

Answers (1)

Cameron Bieganek

Reputation: 7694

Yes, you will need to change the scitypes of the columns. You can check the scitype of each column by using schema on the data frame:

julia> schema(df)
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ A       │ Int64   │ Count      │
│ B       │ String  │ Textual    │
└─────────┴─────────┴────────────┘
_.nrows = 4

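Here you can see that the scitype of column B is Textual. OneHotEncoder only encodes columns with a Finite scitype (Multiclass or OrderedFactor) and passes all other columns through unchanged, which is why Xt came out identical to df. So you will need to change column B to Multiclass. You can use the coerce function (or its in-place variant coerce!) to change the scitypes of the columns. Note that in MLJ integer columns are interpreted as count data, so you will also need to coerce column A if you want it to represent continuous data. coerce! can be used as follows: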

julia> coerce!(df, :A => Continuous, :B => Multiclass)
4×2 DataFrame
│ Row │ A       │ B    │
│     │ Float64 │ Cat… │
├─────┼─────────┼──────┤
│ 1   │ 1.0     │ M    │
│ 2   │ 2.0     │ F    │
│ 3   │ 3.0     │ F    │
│ 4   │ 4.0     │ M    │

Now the one-hot encoder will work properly.

julia> ohe = machine(OneHotEncoder(), df)
julia> fit!(ohe)
julia> Xt = MLJ.transform(ohe, df)
4×3 DataFrame
│ Row │ A       │ B__F    │ B__M    │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 0.0     │ 1.0     │
│ 2   │ 2.0     │ 1.0     │ 0.0     │
│ 3   │ 3.0     │ 1.0     │ 0.0     │
│ 4   │ 4.0     │ 0.0     │ 1.0     │

See the section of the MLJ manual on working with categorical data for more information.
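
If you would rather not modify df in place, the same steps can be sketched with the non-mutating coerce, which returns a new table. This is only a minimal sketch of the workflow above; the names df2, ohe and the final size check are mine, not part of the question, and it assumes a recent MLJ where schema(...) exposes a scitypes field.

```julia
using DataFrames, MLJ

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# Non-mutating coerce: df keeps its original Count/Textual scitypes,
# df2 gets Continuous for A and Multiclass for B.
df2 = coerce(df, :A => Continuous, :B => Multiclass)

# Confirm the scitypes before encoding; column B should now be Finite.
println(schema(df2).scitypes)

# Fit the encoder on the coerced table and transform it.
ohe = machine(OneHotEncoder(), df2)
fit!(ohe)
Xt = MLJ.transform(ohe, df2)

# Column A plus one indicator column per level of B.
println(size(Xt))
```

Checking schema right before fitting is a cheap way to catch a column that is still Textual or Count, since the encoder will leave such columns untouched without raising an error.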

Upvotes: 1
