teratoulis

Reputation: 67

Python: Random forest regression with discrete (categorical) features?

I am using RandomForestRegressor because my target values are not categorical. However, the features are.

When I run the algorithm it treats them as continuous variables.

Is there any way to treat them as categorical?

For example, when I run RandomForestRegressor it treats user ID as a continuous variable (splitting on values such as 1.5).

The dtype in the data frame is int64.

Could you help me with that?

thanks

Here is the code I have tried:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

df = pd.read_excel('Data_frame.xlsx', sheet_name=5)
print(df.head())   # df.head without parentheses is just a method reference; call it
print(df.dtypes)



X = df.drop('productivity', axis='columns')
y = df['productivity']


X_train, X_test, y_train, y_test = train_test_split(X, y)
rf = RandomForestRegressor(bootstrap=False, n_estimators=1000, criterion='squared_error', max_depth=5, max_features='sqrt')
rf.fit(X_train.values, y_train)

plt.figure(figsize=(15,20))
_ = tree.plot_tree(rf.estimators_[1], feature_names=X.columns, filled=True, fontsize=8)

y_predict = rf.predict(X_test.values)
mae = mean_absolute_error(y_test, y_predict)
print(mae)

Upvotes: 1

Views: 639

Answers (1)

Alex Serra Marrugat

Reputation: 2042

First of all, RandomForestRegressor only accepts numerical inputs, so simply casting the columns to a categorical dtype is not a solution: you would not be able to train your model.

The way to deal with this type of problem is one-hot encoding. OneHotEncoder creates one binary column for every distinct value in the specified feature, so the model can no longer split on a meaningless numeric ordering.

Here is an example:

import pandas as pd

# creating the initial dataframe
values = (1, 10, 1, 2, 2, 3, 4)
df = pd.DataFrame(values, columns=['Numerical_data'])

The DataFrame will look like this:

    Numerical_data
0   1
1   10
2   1
3   2
4   2
5   3
6   4

Now, one-hot encode it:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame(enc.fit_transform(df[['Numerical_data']]).toarray())
enc_df

    0   1   2   3   4
0   1.0 0.0 0.0 0.0 0.0
1   0.0 0.0 0.0 0.0 1.0
2   1.0 0.0 0.0 0.0 0.0
3   0.0 1.0 0.0 0.0 0.0
4   0.0 1.0 0.0 0.0 0.0
5   0.0 0.0 1.0 0.0 0.0
6   0.0 0.0 0.0 1.0 0.0

Then, depending on your needs, you can join the encoded frame back onto your dataset. Be aware that you should drop the original feature before training:

# join the encoded columns back onto the original frame
df = df.join(enc_df)
df

    Numerical_data 0    1   2   3   4
0   1   1.0 0.0 0.0 0.0 0.0
1   10  0.0 0.0 0.0 0.0 1.0
2   1   1.0 0.0 0.0 0.0 0.0
3   2   0.0 1.0 0.0 0.0 0.0
4   2   0.0 1.0 0.0 0.0 0.0
5   3   0.0 0.0 1.0 0.0 0.0
6   4   0.0 0.0 0.0 1.0 0.0

Of course, if the feature has hundreds of distinct values, this creates hundreds of columns. But this is the standard way to proceed.
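To tie the steps together, here is a minimal end-to-end sketch using `pd.get_dummies`, a pandas shortcut for the same one-hot encoding, applied before fitting the regressor. The column names (`user_id`, `productivity`) and the toy values are hypothetical stand-ins for your data:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame: 'user_id' is categorical even though it is stored as int64;
# 'productivity' is the continuous target (names are made up for illustration).
df = pd.DataFrame({
    'user_id':      [1, 10, 1, 2, 2, 3, 4],
    'productivity': [0.5, 0.9, 0.4, 0.7, 0.6, 0.8, 0.3],
})

# get_dummies one-hot encodes the listed columns and drops the originals,
# so the forest sees indicator columns instead of an ordered integer ID.
X = pd.get_dummies(df.drop('productivity', axis='columns'), columns=['user_id'])
y = df['productivity']

rf = RandomForestRegressor(n_estimators=10, random_state=0)
rf.fit(X, y)
print(X.columns.tolist())  # one indicator column per distinct user_id
```

With this encoding the trees split on questions like "is user_id_10 true?", which is the categorical behaviour you were after, rather than "is user_id < 1.5?".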

Upvotes: 1
