Jeroen

Reputation: 957

Create dummy variables for interdependent categories in pandas

I'm trying to set up a linear regression model to predict traffic counts from the day of the week and the time of day. Since both are categorical variables, I have to create dummy variables. The get_dummies function makes this very easy when applied to each variable individually. However, when predicting traffic volumes, the interaction between the day and the time of day is important. Therefore, I need dummies for every combination of day and time interval.

I made a small example to avoid troubling you with a big dataset:

import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                    'Time': [11,15,9,15,17,10,20],
                    'Count': [100,150,150,150,180,60,50]})

df_dummies = pd.get_dummies(df.Day)

print(df_dummies)

This results in a nice DataFrame of dummies:

   Fri  Mon  Sat  Sun  Thu  Tue  Wed
0    0    1    0    0    0    0    0
1    0    0    0    0    0    1    0
2    0    0    0    0    0    0    1
3    0    0    0    0    1    0    0
4    1    0    0    0    0    0    0
5    0    0    1    0    0    0    0
6    0    0    0    1    0    0    0

So what I'm after is something like this:

import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                    'Time': [11,15,9,15,17,10,20],
                    'Count': [100,150,150,150,180,60,50]})

df_dummies = pd.get_dummies(df.Day * df.Time)

print(df_dummies)

With a result like this:

   Fri_9  Fri_15  Mon_9  Mon_15  Sat_9  Sat_15  Sun_9 ...
0      0       1      0       0      0       0      0 ...
1      0       0      0       0      0       1      0 ...
2      0       0      0       0      0       0      1 ...
3      0       0      0       0      1       0      0 ...
4      1       0      0       0      0       0      0 ...
5      0       0      1       0      0       0      0 ...
6      0       0      0       1      0       0      0 ...
7      0       0      0       0      0       0      0 ...
[...]

Is there any way in which this can be done elegantly?

Upvotes: 1

Views: 965

Answers (2)

Ami Tavory

Reputation: 76386

"I'm trying to set up a linear regression model in order to predict [...]"

Technically, you can create dummies from the (Day, Time) tuples:

>>> pd.get_dummies(df[['Day', 'Time']].apply(tuple, axis=1))

   (Fri, 17)  (Mon, 11)  (Sat, 10)  (Sun, 20)  (Thu, 15)  (Tue, 15)  (Wed, 9)
0          0          1          0          0          0          0         0
1          0          0          0          0          0          1         0
2          0          0          0          0          0          0         1
3          0          0          0          0          1          0         0
4          1          0          0          0          0          0         0
5          0          0          1          0          0          0         0
6          0          0          0          1          0          0         0
...

However, I don't think this approach is the best at the ML level. It will probably fragment the data heavily, since every (day, time) combination gets its own column with only a few observations behind it, and that makes things hard for your regressor. If you're after the interactions, you might consider using a gradient-boosted decision tree instead.
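To make the fragmentation concern concrete, here is a minimal sketch on the question's toy data. It compares separate (additive) dummies for Day and Time with one dummy per combination. The names additive and interaction are only illustrative, and the combinations are labelled with strings rather than tuples purely for readability.

```python
import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})

# Additive encoding: one block of dummies per variable
# (7 day columns + 6 distinct time columns).
additive = pd.get_dummies(df[['Day', 'Time']], columns=['Day', 'Time'], dtype=int)

# Interaction encoding: one dummy per observed (Day, Time) combination.
interaction = pd.get_dummies(df['Day'] + '_' + df['Time'].astype(str), dtype=int)

print(additive.shape[1])        # 13 columns
print(interaction.shape[1])     # 7 columns, one per observed combination
print(interaction.sum().max())  # 1: no combination is backed by more than one row
```

On a real dataset the number of interaction columns can grow up to (number of days) x (number of time intervals), so it is worth checking how many rows support each column before fitting a linear model on them.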

Upvotes: 1

jezrael

Reputation: 863361

I believe you need to join the columns together, casting Time to strings:

df_dummies = pd.get_dummies(df.Day + '_' + df.Time.astype(str))

# Alternative using str.cat:
# df_dummies = pd.get_dummies(df.Day.str.cat(df.Time.astype(str), sep='_'))

print(df_dummies)
   Fri_17  Mon_11  Sat_10  Sun_20  Thu_15  Tue_15  Wed_9
0       0       1       0       0       0       0      0
1       0       0       0       0       0       1      0
2       0       0       0       0       0       0      1
3       0       0       0       0       1       0      0
4       1       0       0       0       0       0      0
5       0       0       1       0       0       0      0
6       0       0       0       1       0       0      0
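The joined-string approach only produces columns for combinations that actually occur in the data. If a column is wanted for every day/time combination, as in the desired output in the question, one possible sketch is to declare the full set of labels as categories first. The names days, times and all_labels are illustrative, and taking the time intervals from the observed values is an assumption. dtype=int is passed because recent pandas versions return boolean dummies by default.

```python
import itertools
import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Assumption: the relevant time intervals are exactly the observed ones.
times = sorted(df['Time'].unique())

# Every possible Day_Time label, observed or not.
all_labels = [f'{d}_{t}' for d, t in itertools.product(days, times)]

# A categorical column keeps unobserved categories, so get_dummies
# emits an (all-zero) column for them as well.
combined = pd.Series(
    pd.Categorical(df['Day'] + '_' + df['Time'].astype(str), categories=all_labels),
    index=df.index,
)
df_dummies = pd.get_dummies(combined, dtype=int)

print(df_dummies.shape)  # (7, 42): 7 rows, 7 days x 6 distinct times
```

Each row still has exactly one 1, and combinations that never occur simply stay at zero.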

Upvotes: 2
