Create dummy variables for interdependent categories in pandas

Question

I'm trying to set up a linear regression model to predict traffic counts based on the day and the time of day. Since both are categorical variables, I have to create dummy variables. The get_dummies function makes this very easy when it is done for each variable individually. However, when predicting traffic volumes, the interdependence between the day and the time of day is important. Therefore, I need dummies for every combination of day and time interval.

I made a small example to avoid troubling you with a big dataset:

import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                    'Time': [11,15,9,15,17,10,20],
                    'Count': [100,150,150,150,180,60,50]})

df_dummies = pd.get_dummies(df.Day)

print(df_dummies)

This results in a nice dataframe with dummies:

   Fri  Mon  Sat  Sun  Thu  Tue  Wed
0    0    1    0    0    0    0    0
1    0    0    0    0    0    1    0
2    0    0    0    0    0    0    1
3    0    0    0    0    1    0    0
4    1    0    0    0    0    0    0
5    0    0    1    0    0    0    0
6    0    0    0    1    0    0    0

So what I'm after is something like this (pseudocode):

import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                    'Time': [11,15,9,15,17,10,20],
                    'Count': [100,150,150,150,180,60,50]})

df_dummies = pd.get_dummies(df.Day * df.Time)

print(df_dummies)

With a result like this:

   Fri_9  Fri_15  Mon_9  Mon_15  Sat_9  Sat_15  Sun_9 ...
0      0       1      0       0      0       0      0 ...
1      0       0      0       0      0       1      0 ...
2      0       0      0       0      0       0      1 ...
3      0       0      0       0      1       0      0 ...
4      1       0      0       0      0       0      0 ...
5      0       0      1       0      0       0      0 ...
6      0       0      0       1      0       0      0 ...
7      0       0      0       0      0       0      0 ...
[...]

Is there any way in which this can be done elegantly?

Ami Tavory · Accepted Answer

I'm trying to set up a linear regression model in order to predict

Technically, you can make a dummy of the tuples:

>>> pd.get_dummies(df[['Day', 'Time']].apply(tuple, axis=1))

   (Fri, 17)  (Mon, 11)  (Sat, 10)  (Sun, 20)  (Thu, 15)  (Tue, 15)  (Wed, 9)
0          0          1          0          0          0          0         0
1          0          0          0          0          0          1         0
2          0          0          0          0          0          0         1
3          0          0          0          0          1          0         0
4          1          0          0          0          0          0         0
5          0          0          1          0          0          0         0
6          0          0          0          1          0          0         0
...

However, I don't think this approach is the best one from a machine-learning standpoint. It will probably fragment the data heavily, since every day/time combination gets its own column backed by very few observations, which makes things hard for your regressor. If you're after the interactions, you might consider using a gradient-boosted decision tree instead.
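
If you still want the combined dummies with column names like Mon_11, one possible sketch is to join the two columns into a single string label before calling get_dummies. The names day_time, all_labels, and df_full are illustrative and not part of the original post; the reindex step assumes you want a column for every day/time pair, including pairs that never occur in the data.

```python
import pandas as pd

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})

# One combined label per row, e.g. 'Mon_11', then one-hot encode it.
day_time = df['Day'] + '_' + df['Time'].astype(str)
df_dummies = pd.get_dummies(day_time, dtype=int)

# Optionally add a zero-filled column for every Day x Time pair,
# including combinations that do not appear in the data.
all_labels = [f'{d}_{t}'
              for d in df['Day'].unique()
              for t in sorted(df['Time'].unique())]
df_full = df_dummies.reindex(columns=all_labels, fill_value=0)

print(df_dummies.shape)  # (7, 7): only the observed combinations
print(df_full.shape)     # (7, 42): 7 days x 6 distinct times
```

Even in this toy example the full grid has 42 columns for 7 rows, which illustrates the fragmentation concern: most combination columns have no observations at all.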
