How to use multiple categorical input variables of different dimension in a random forest regressor model?

Question

I have data that describes an item going through a release process. The item has different variables such as "Product category", "Design_country", "product line" and so on. In total I have 18 different variables, each either binary or categorical, and they have different dimensions. For instance, there are 3 different design countries but 8 different product categories. The output variable is the time it takes for an item to go through the release process, which is a continuous variable. I want to predict how long it will take for an item to go through the process.

 Design_cntry      Prod_category    prod_line    ...   time_minutes
     A                  A1             A11       ...     43.2
     B                  B1             A11       ...     20.1    
     C                  E1             B11       ...     15.0
    ...                ...             ...       ...     ....

In order for me to use these as input into a random forest regressor, how do I handle the different input variables?

I know that categorical variables can be one-hot encoded. But do I do this on each variable separately?

 X_des_country = pd.get_dummies(data['design_cntry'], prefix="design_country")
 X_prod_cat = pd.get_dummies(data['prod_cat'], prefix="prod_cat")

Then I would have 18 different input dataframes of varying number of columns. How do I then use these variables as input when training my model? Do I put all of them inside one dataframe "X" by merging with respect to the index?

Or is it better to apply one hot encoding on the original dataframe directly?

 X = data.drop("time_minutes", axis=1)
 X = pd.get_dummies(X)

Loïc L. · Accepted Answer

As an (important) side note: to avoid the dummy variable trap, you need to drop one level of each categorical variable. You can do that by passing drop_first=True to pd.get_dummies.

For your regression model, you can just combine all these new features and use them to train your model. You don't have to create 18 different DataFrames, though; you can do it all at once:

>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> df
   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3

>>> pd.get_dummies(df, drop_first=True)
   C  A_b  B_b  B_c
0  1    0    1    0
1  2    1    0    0
2  3    0    0    1

This will create dummy variables only for the categorical variables (i.e. the string columns) and leave the integer columns as they are (cf. column C above). If one of your variables contains only integers but you want it to be treated as categorical, simply convert it to a string column beforehand.
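Putting the pieces together, here is a minimal end-to-end sketch of the approach described above. The data values and the integer-coded `priority` column are made up for illustration, and the `RandomForestRegressor` settings are arbitrary choices rather than recommendations:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the release-process table (all values are made up).
data = pd.DataFrame({
    "design_cntry": ["A", "B", "C", "A", "B", "C"],
    "prod_cat": ["A1", "B1", "E1", "A1", "E1", "B1"],
    "priority": [1, 2, 3, 1, 2, 3],  # hypothetical column: integer codes that are really categories
    "time_minutes": [43.2, 20.1, 15.0, 40.5, 18.3, 22.7],
})

# Separate the target from the inputs.
y = data["time_minutes"]
X = data.drop("time_minutes", axis=1)

# Cast the integer-coded column to strings so get_dummies treats it as categorical.
X["priority"] = X["priority"].astype(str)

# A single call encodes every categorical column, whatever its number of levels.
X_encoded = pd.get_dummies(X, drop_first=True)

# Fit the regressor on the single encoded DataFrame.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_encoded, y)
predictions = model.predict(X_encoded)
```

When predicting on new items, the new data has to end up with the same columns as the training data; one way to do that is to encode it and then call `reindex(columns=X_encoded.columns, fill_value=0)` on the result.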
