Pepka

Reputation: 93

Most common element for pandas series

I have this dataset:

Dataset

The cuisine countries recur across rows, and the output I would like is a list of, say, the 5 most popular food ingredients for each country.

The code until now:

import pandas as pd
from collections import Counter


filename="food.json"
food_dataset = pd.read_json(filename)

# getting separate columns
country = food_dataset.loc[:,"country"]
ingredients = food_dataset.loc[:,"ingredients"]


Counter = Counter(ingredients) 

most_occur = Counter.most_common(3) 

print(most_occur)

Upvotes: 1

Views: 372

Answers (1)

jezrael

Reputation: 862661

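For pandas 0.25+, use DataFrame.explode, then GroupBy.apply with a lambda function that takes the first N index values of the counts returned by Series.value_counts, which sorts in descending order by default (N = 3 here; set N = 5 for the top five):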

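import pandas as pd
from collections import Counter  # used by the Counter-based variants

# small sample standing in for the real data
food_dataset = pd.DataFrame({'cuisine':['greek','southern_us'],
                             'ingredients':[list('andnsndnfndn'),
                                            list('ndnsndnfnsnd')]})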
print (food_dataset)
       cuisine                           ingredients
0        greek  [a, n, d, n, s, n, d, n, f, n, d, n]
1  southern_us  [n, d, n, s, n, d, n, f, n, s, n, d]

N = 3
df = (food_dataset.explode("ingredients")
                  .groupby('cuisine')['ingredients']
                  .apply(lambda x: x.value_counts().index[:N].tolist())
                  .reset_index())
print (df)
       cuisine ingredients
0        greek   [n, d, a]
1  southern_us   [n, d, s]
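
Because the question says the cuisines recur across rows, here is a minimal sketch of the same explode/groupby approach on data where one cuisine appears twice. The sample rows and the names `recipes` and `top_by_cuisine` are made up for illustration; only the method chain comes from the solution above.

```python
import pandas as pd

# Hypothetical sample data: "greek" appears in two rows
recipes = pd.DataFrame({
    "cuisine": ["greek", "southern_us", "greek"],
    "ingredients": [
        ["feta", "olive oil", "oregano"],
        ["butter", "flour", "butter"],
        ["feta", "olive oil", "feta"],
    ],
})

N = 2  # number of top ingredients to keep per cuisine

top_by_cuisine = (
    recipes.explode("ingredients")        # one row per (cuisine, ingredient)
           .groupby("cuisine")["ingredients"]
           # value_counts sorts by frequency, so the first N index labels
           # are the N most common ingredients of the group
           .apply(lambda s: s.value_counts().index[:N].tolist())
           .to_dict()
)
print(top_by_cuisine)
```

The counts are pooled over both "greek" rows before the top N are taken. When several ingredients share the same count, which of them ends up in the top N depends on how ties are ordered, so treat the order among equal counts as arbitrary.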

Alternative solution:

df = (food_dataset.explode("ingredients")
                  .groupby('cuisine')['ingredients']
                  .apply(lambda x: [y[0] for y in Counter(x).most_common(N)])
                  .reset_index())
print (df)
       cuisine ingredients
0        greek   [n, d, a]
1  southern_us   [n, d, s]
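
If DataFrame.explode is not available (pandas older than 0.25), one possible variant is to flatten each group's lists with itertools.chain before counting. This is only a sketch under that assumption: the sample rows and the helper `top_n` are hypothetical names, not part of the original data.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample data; the ingredient lists stay unexploded
recipes = pd.DataFrame({
    "cuisine": ["greek", "southern_us", "greek"],
    "ingredients": [
        ["feta", "olive oil", "oregano"],
        ["butter", "flour", "butter"],
        ["feta", "olive oil", "feta"],
    ],
})

N = 2  # number of top ingredients to keep per cuisine


def top_n(lists, n=N):
    """Return the n most common items across all lists of one group."""
    # chain.from_iterable flattens the group's lists into one stream
    counts = Counter(chain.from_iterable(lists))
    return [name for name, _ in counts.most_common(n)]


top_no_explode = recipes.groupby("cuisine")["ingredients"].apply(top_n).to_dict()
print(top_no_explode)
```

This avoids building the long exploded frame, at the cost of doing the counting in Python for each group.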

If every value in the cuisine column is unique, no grouping is needed; count within each row instead:

food_dataset['top'] = (food_dataset['ingredients']
                          .apply(lambda x: [y[0] for y in Counter(x).most_common(N)]))
print (food_dataset)
       cuisine                           ingredients        top
0        greek  [a, n, d, n, s, n, d, n, f, n, d, n]  [n, d, a]
1  southern_us  [n, d, n, s, n, d, n, f, n, s, n, d]  [n, d, s]

Upvotes: 1
