Reputation: 156
I am working in pandas and I have a dataset/dataframe which looks like this.
venueId venueCategoryId venueCategory
v1 vc1 Airport
v1 vc2 Park
v1 vc1 Airport
v2 vc3 American Restaurant
v3 vc4 Italian Restaurant
v3 vc5 Restaurant
venueId represents unique venues and category represents what type of venue that is. CategoryId and CategoryName values are related.
I am trying to replace the venueCategoryId and venueCategory of a venue with the Category which was highest in that venue.
The Result I am trying to get:
venueId venueCategoryId venueCategory
v1 vc1 Airport
v2 vc3 American Restaurant
v3 vc4 Italian Restaurant
I was trying to use groupby on 'venueId' column to group and find the count and replace with highest count. But I couldn't figure out how to venueCategoryId and venueCategory as a dependent column in pandas and find max of 2 column and replace them.
Upvotes: 0
Views: 86
Reputation: 164673
Here's one solution via pd.Series.mode
. There is likely a more Pandorable method.
# combine id and category series
df['venueIdCat'] = list(zip(df.venueCategoryId, df.venueCategory))
# groupby venueId and extract mode
res = df.groupby('venueId')['venueIdCat'].apply(pd.Series.mode).reset_index()
# unsplit id and category
res = res.join(pd.DataFrame(res['venueIdCat'].values.tolist(),
columns=['venueCategoryId', 'venueCategory']))
# select required columns
res = res[['venueId', 'venueCategoryId', 'venueCategory']]
print(res)
venueId venueCategoryId venueCategory
0 v1 vc1 Airport
1 v2 vc3 AmericanRestaurant
2 v3 vc4 ItalianRestaurant
Alternative solution using collections.Counter
:
from collections import Counter
# combine id and category series
df['venueIdCat'] = list(zip(df.venueCategoryId, df.venueCategory))
# groupby venueId and extract mode
res = df.groupby('venueId')['venueIdCat'].apply(lambda x: Counter(x).most_common()[0][0]).reset_index()
# unsplit id and category
res = res.join(pd.DataFrame(res['venueIdCat'].values.tolist(),
columns=['venueCategoryId', 'venueCategory']))
# select required columns
res = res[['venueId', 'venueCategoryId', 'venueCategory']]
print(res)
venueId venueCategoryId venueCategory
0 v1 vc1 Airport
1 v2 vc3 AmericanRestaurant
2 v3 vc4 ItalianRestaurant
Upvotes: 1