Reputation: 340
I'm working with the Zomato Bangalore Restaurant dataset found here. One of my pre-processing steps is to create dummy variables for the types of cuisine each restaurant serves. I've used pandas' explode
to split the cuisines, and I've created lists for the top 30 cuisines and the remaining (not top 30) cuisines. I've created a sample dataframe below.
import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Jalsa', 'Spice Elephant', 'San Churro Cafe'],
    'cuisines_lst': [
        ['North Indian', 'Chinese'],
        ['Chinese', 'North Indian', 'Thai'],
        ['Cafe', 'Mexican', 'Italian']
    ]
})
I've created the top and not-top lists. In the actual data I'm using the top 30, but for the sake of the example it's the top 2 and not-top 2.
top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[0:2].tolist()
not_top2 = sample_df.explode('cuisines_lst')['cuisines_lst'].value_counts().index[2:].tolist()
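On this sample, North Indian and Chinese each appear twice and everything else once, so the lists come out roughly like this (the ordering among tied counts isn't guaranteed):

>>> top2
['North Indian', 'Chinese']
>>> not_top2
['Thai', 'Cafe', 'Mexican', 'Italian']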
What I would like is to create a dummy variable, with the suffix _bin, for each cuisine in the top list, plus a final dummy variable Other that is 1 if a restaurant has any cuisine from the not-top list. The desired output looks like this:
| name | cuisines_lst | Chinese_bin | North Indian_bin | Other |
|---|---|---|---|---|
| Jalsa | [North Indian, Chinese] | 1 | 1 | 0 |
| Spice Elephant | [Chinese, North Indian, Thai] | 1 | 1 | 1 |
| San Churro Cafe | [Cafe, Mexican, Italian] | 0 | 0 | 1 |
Upvotes: 0
Views: 1205
Reputation: 5331
Create the dummies, then reduce by duplicated indices to get your columns for the top 2:
a = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
.reset_index().groupby('index')[top2].sum().add_suffix('_bin')
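On the sample data, a comes out to something like:

>>> a
       North Indian_bin  Chinese_bin
index
0                     1            1
1                     1            1
2                     0            0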
If you want the columns in alphabetical order (in this case, Chinese followed by North Indian), add an intermediate step to sort them with a.sort_index(axis=1).
Do the same for the other values, but reduce across the columns as well by passing axis=1 to any:
b = pd.get_dummies(sample_df['cuisines_lst'].explode()) \
.reset_index().groupby('index')[not_top2].sum() \
.any(axis=1).astype(int).rename('Other')
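This gives a Series of 0/1 flags, something like:

>>> b
index
0    0
1    1
2    1
Name: Other, dtype: int64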
Concatenating on indices:
>>> print(pd.concat([sample_df, a, b], axis=1).to_string())
name cuisines_lst North Indian_bin Chinese_bin Other
0 Jalsa [North Indian, Chinese] 1 1 0
1 Spice Elephant [Chinese, North Indian, Thai] 1 1 1
2 San Churro Cafe [Cafe, Mexican, Italian] 0 0 1
If you are operating on a lot of data, it may be worth building an intermediate data frame containing the exploded dummies once and performing both group-by reductions on it, as sketched below.
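A minimal sketch of that idea, reusing sample_df, top2 and not_top2 from the question (here the duplicated index is grouped on directly with groupby(level=0) instead of resetting it):

# Build the exploded dummies once, collapse the duplicated index back to
# one row per restaurant, then reuse the result for both reductions.
dummies = pd.get_dummies(sample_df['cuisines_lst'].explode()).groupby(level=0).sum()

a = dummies[top2].add_suffix('_bin')
b = dummies[not_top2].any(axis=1).astype(int).rename('Other')

result = pd.concat([sample_df, a, b], axis=1)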
Upvotes: 1