GroupBy and then Mean/Median/Mode using str.contains Criteria

Question

I have the following dataset example:

Name | Year | Score | 2nd Score | % of People | Country | Fruit | Export Countries | Language | Transit Duration | Quality | Taste | Freshness | Packaging
Andes, The | 2021 | 8 | 8.8 | 87% | The Netherlands | The Apple | United States,United Kingdom | English,Japanese,French | 148.0 | 1.0 | 0.0 | 0.0 | 0.0
Phil | 2021 | 8 | 8.4 | 87% | Spain | The Banana | United Kingdom, Germany | English,German,French,Italian | 165.0 | 1.0 | 0.0 | 0.0 | 0.0
Sarah | 2021 | 9 | 8.3 | 89% | Greece | The Plum | Germany,United States | English,German,French,Italian | 153.0 | 1.0 | 0.0 | 0.0 | 0.0

I want to be able to take the 'Language' column and for any entry that contains 'German', do a mean/median/mode on the Score column. The expected output for Mean would be something like:

English  8.33  
German  8.5
French  8.33
Italian  8.5
Japanese  8

I've got a syntax that will split them out into into their own columns and then I know I could do a logic on that way, however I would like to learn if the way that I want to analyse the dataset is even possible.

EDIT: Code as requested

[{'Name': 'Andes, The',
  'Year': 2021,
  'Score': '8',
  '2nd Score': 8.8,
  '% of People': '87%',
  'Country': 'The Netherlands',
  'Fruit': 'The Apple',
  'Export Countries': 'United States,United Kingdom',
  'Language': 'English,Japanese,French',
  'Transit Duration': 148.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},
 {'Name': 'Phil',
  'Year': 2021,
  'Score': '8',
  '2nd Score': 8.8,
  '% of People': '87%',
  'Country': 'Spain',
  'Fruit': 'The Banana',
  'Export Countries': 'United Kingdom, Germany',
  'Language': 'English,German,French,Italian',
  'Transit Duration': 118.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},
{'Name': 'Sarah',
  'Year': 2021,
  'Score': '9',
  '2nd Score': 8.8,
  '% of People': '89%',
  'Country': 'Greece',
  'Fruit': 'The Plum',
  'Export Countries': 'Germany,United States',
  'Language': 'English,German,French,Italian',
  'Transit Duration': 165.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},
{'Name': 'William',
  'Year': 2021,
  'Score': '6',
  '2nd Score': 8.8,
  '% of People': '65%',
  'Country': 'Brazil',
  'Fruit': 'Strawberries',
  'Export Countries': 'Spain,Greece',
  'Language': 'Spanish, French',
  'Transit Duration': 153.0,
  'Quality': 1.0,
  'Taste': 0.0,
  'Freshness': 0.0,
  'Packaging': 0.0},

akuiper · Accepted Answer

You can split and explode Language column to make it flat and then aggregate by each individual language:

(df[['Score', 'Language']]
  .assign(Language=lambda x: x.Language.str.split(','))
  .explode('Language')
  .groupby('Language')
  .Score.mean()
  .reset_index())

   Language     Score
0   English  8.333333
1    French  8.333333
2    German  8.500000
3   Italian  8.500000
4  Japanese  8.000000

GroupBy and then Mean/Median/Mode using str.contains Criteria

Answers (2)

Related Questions