gernworm

Reputation: 340

Create dummy variables if value is in list

I'm working with the Zomato Bangalore Restaurant data set found here. One of my pre-processing steps is to create dummy variables for the types of cuisine each restaurant serves. I've used pandas' explode to split the cuisines, and I've built two lists: the top 30 cuisines and all cuisines outside the top 30. A sample dataframe is below.


    import pandas as pd

    sample_df = pd.DataFrame({
        'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
        'cuisines_lst': [
            ['North Indian', 'Chinese'],
            ['Chinese', 'North Indian', 'Thai'],
            ['Cafe', 'Mexican', 'Italian']
        ]
    })

I've then created the top and not-top lists. In the actual data I'm using the top 30, but for the sake of the example it's the top 2 and everything outside the top 2.


top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[0:2].tolist()
not_top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[2:].tolist()

What I would like is to create a dummy variable for each cuisine in the top list, with the suffix _bin, plus a final dummy variable Other that is 1 if a restaurant serves any cuisine from the not-top list. The desired output looks like this:

           name                   cuisines_lst  Chinese_bin  North Indian_bin  Other
          Jalsa        [North Indian, Chinese]            1                 1      0
 Spice Elephant  [Chinese, North Indian, Thai]            1                 1      1
San Churro Cafe       [Cafe, Mexican, Italian]            0                 0      1

Upvotes: 0

Views: 1205

Answers (1)

ifly6

Reputation: 5331

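Create the dummies from the exploded column, then group by the duplicated index values and sum, which collapses the result back to one row per restaurant. Selecting top2 keeps only the columns for the top 2: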

a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
    .reset_index().groupby('index')[top2].sum().add_suffix('_bin')

If you want it in alphabetical order (in this case, Chinese followed by North Indian), add an intermediate step to sort columns with a.sort_index(axis=1).
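As a minimal sketch of that sorting step, assuming the sample_df and top2 from the question (the name a_sorted is only illustrative):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian'],
    ],
})
top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[0:2].tolist()

# Same reduction as above: one row per restaurant, top-2 columns only.
a = (pd.get_dummies(sample_df['cuisines_lst'].explode())
     .reset_index().groupby('index')[top2].sum().add_suffix('_bin'))

# Sort the columns by label so they come out alphabetically.
a_sorted = a.sort_index(axis=1)
print(a_sorted.columns.tolist())
```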

Do the same for the remaining cuisines, but also reduce across columns by calling any with axis=1, so that a single Other column flags whether the restaurant serves any of them:

b = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
    .reset_index().groupby('index')[not_top2].sum() \
    .any(axis=1).astype(int).rename('Other')

Concatenating on indices:

>>> print(pd.concat([sample_df, a, b], axis=1).to_string())
              name                   cuisines_lst  North Indian_bin  Chinese_bin  Other
0            Jalsa        [North Indian, Chinese]                 1            1      0
1   Spice Elephant  [Chinese, North Indian, Thai]                 1            1      1
2  San Churro Cafe       [Cafe, Mexican, Italian]                 0            0      1

If you are working with a lot of data, it may be worth building the exploded dummies once as an intermediate data frame and running both group-by reductions on that, rather than calling get_dummies twice.
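A rough sketch of that idea, using the sample data from the question. The names exploded, counts and dummies are only illustrative, and grouping with level=0 is a substitute for the reset_index().groupby('index') step above that should give the same result:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian'],
    ],
})

# Explode once and reuse the result for the counts and for the dummies.
exploded = sample_df['cuisines_lst'].explode()
counts = exploded.value_counts()
top2 = counts.index[:2].tolist()
not_top2 = counts.index[2:].tolist()

# Intermediate frame: one row per restaurant, one column per cuisine.
dummies = pd.get_dummies(exploded).groupby(level=0).sum()

# Both outputs are now plain column selections on the same frame.
a = dummies[top2].add_suffix('_bin').astype(int)
b = dummies[not_top2].any(axis=1).astype(int).rename('Other')

result = pd.concat([sample_df, a, b], axis=1)
print(result.to_string())
```

Whether this is noticeably faster depends on the size of the data and the number of distinct cuisines, so it is worth timing on the real data set.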

Upvotes: 1
