Reputation: 340
I'm working with the Zomato Bangalore Restaurant dataset found here. One of my pre-processing steps is to create dummy variables for the types of cuisine each restaurant serves. I've used pandas' explode
to split the cuisines, and I've created lists for the top 30 cuisines and the remaining (not top 30) cuisines. I've created a sample dataframe below.
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian']
    ]
})
I've created the top and not-top lists. In the actual data I'm using the top 30, but for the sake of the example it's the top 2 and not-top 2.
top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[0:2].tolist()
not_top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[2:].tolist()
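On this sample, North Indian and Chinese each appear twice and everything else once, so the lists come out roughly like this (the ordering among tied counts isn't guaranteed):

>>> top2
['North Indian', 'Chinese']
>>> not_top2
['Thai', 'Cafe', 'Mexican', 'Italian']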
What I would like is to create a dummy variable, with the suffix _bin, for each cuisine in the top list, plus a final dummy variable Other that is 1 if a restaurant has any cuisine from the not-top list. The desired output looks like this:
| name | cuisines_lst | Chinese_bin | North Indian_bin | Other |
|---|---|---|---|---|
| Jalsa | [North Indian, Chinese] | 1 | 1 | 0 |
| Spice Elephant | [Chinese, North Indian, Thai] | 1 | 1 | 1 |
| San Churro Cafe | [Cafe, Mexican, Italian] | 0 | 0 | 1 |
Upvotes: 0
Views: 1205
Reputation: 5331
Create the dummies, then reduce by duplicated indices to get your columns for the top 2:
a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
.reset_index().groupby('index')[top2].sum().add_suffix('_bin')
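On the sample data, a comes out to something like:

>>> a
       North Indian_bin  Chinese_bin
index
0                     1            1
1                     1            1
2                     0            0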
If you want the columns in alphabetical order (in this case, Chinese followed by North Indian), add an intermediate step to sort them with a.sort_index(axis=1).
Do the same for the other values, but reduce across the columns as well by passing axis=1 to any:
b = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
.reset_index().groupby('index')[not_top2].sum() \
.any(axis=1).astype(int).rename('Other')
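This gives a Series of 0/1 flags, something like:

>>> b
index
0    0
1    1
2    1
Name: Other, dtype: int64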
Concatenating on indices:
>>> print(pd.concat([sample_df, a, b], axis=1).to_string())
name cuisines_lst North Indian_bin Chinese_bin Other
0 Jalsa [North Indian, Chinese] 1 1 0
1 Spice Elephant [Chinese, North Indian, Thai] 1 1 1
2 San Churro Cafe [Cafe, Mexican, Italian] 0 0 1
If you are operating on a lot of data, it may be worth building an intermediate data frame containing the exploded dummies once and performing both group-by reductions on it, as sketched below.
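A minimal sketch of that idea, reusing sample_df, top2 and not_top2 from the question (here the duplicated index is grouped on directly with groupby(level=0) instead of resetting it):

# Build the exploded dummies once, collapse the duplicated index back to
# one row per restaurant, then reuse the result for both reductions.
dummies = pd.get_dummies(sample_df['cuisines_lst'].explode()).groupby(level=0).sum()

a = dummies[top2].add_suffix('_bin')
b = dummies[not_top2].any(axis=1).astype(int).rename('Other')

result = pd.concat([sample_df, a, b], axis=1)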
Upvotes: 1