Kevin Stephen Biswas
Kevin Stephen Biswas

Reputation: 156

Pandas - Find Max Count of 2 Dependent Attribute and Replace Duplicate Rows with that Value

I am working in pandas and I have a dataset/dataframe which looks like this.

venueId                     venueCategoryId             venueCategory
v1                          vc1                         Airport
v1                          vc2                         Park     
v1                          vc1                         Airport
v2                          vc3                         American Restaurant
v3                          vc4                         Italian Restaurant
v3                          vc5                         Restaurant

venueId represents unique venues and category represents what type of venue that is. CategoryId and CategoryName values are related.

I am trying to replace the venueCategoryId and venueCategory of a venue with the Category which was highest in that venue.

The Result I am trying to get:

venueId                     venueCategoryId             venueCategory
v1                          vc1                         Airport
v2                          vc3                         American Restaurant
v3                          vc4                         Italian Restaurant

I was trying to use groupby on 'venueId' column to group and find the count and replace with highest count. But I couldn't figure out how to venueCategoryId and venueCategory as a dependent column in pandas and find max of 2 column and replace them.

Upvotes: 0

Views: 86

Answers (1)

jpp
jpp

Reputation: 164673

Here's one solution via pd.Series.mode. There is likely a more Pandorable method.

# combine id and category series
df['venueIdCat'] = list(zip(df.venueCategoryId, df.venueCategory))

# groupby venueId and extract mode
res = df.groupby('venueId')['venueIdCat'].apply(pd.Series.mode).reset_index()

# unsplit id and category
res = res.join(pd.DataFrame(res['venueIdCat'].values.tolist(),
                            columns=['venueCategoryId', 'venueCategory']))

# select required columns
res = res[['venueId', 'venueCategoryId', 'venueCategory']]

print(res)

  venueId venueCategoryId       venueCategory
0      v1             vc1             Airport
1      v2             vc3  AmericanRestaurant
2      v3             vc4   ItalianRestaurant

Alternative solution using collections.Counter:

from collections import Counter

# combine id and category series
df['venueIdCat'] = list(zip(df.venueCategoryId, df.venueCategory))

# groupby venueId and extract mode
res = df.groupby('venueId')['venueIdCat'].apply(lambda x: Counter(x).most_common()[0][0]).reset_index()

# unsplit id and category
res = res.join(pd.DataFrame(res['venueIdCat'].values.tolist(),
                            columns=['venueCategoryId', 'venueCategory']))

# select required columns
res = res[['venueId', 'venueCategoryId', 'venueCategory']]

print(res)

  venueId venueCategoryId       venueCategory
0      v1             vc1             Airport
1      v2             vc3  AmericanRestaurant
2      v3             vc4   ItalianRestaurant

Upvotes: 1

Related Questions