Reputation: 63
I have the following dataset example:
Name | Year | Score | 2nd Score | % of People | Country | Fruit | Export Countries | Language | Transit Duration | Quality | Taste | Freshness | Packaging
Andes, The | 2021 | 8 | 8.8 | 87% | The Netherlands | The Apple | United States,United Kingdom | English,Japanese,French | 148.0 | 1.0 | 0.0 | 0.0 | 0.0
Phil | 2021 | 8 | 8.4 | 87% | Spain | The Banana | United Kingdom, Germany | English,German,French,Italian | 165.0 | 1.0 | 0.0 | 0.0 | 0.0
Sarah | 2021 | 9 | 8.3 | 89% | Greece | The Plum | Germany,United States | English,German,French,Italian | 153.0 | 1.0 | 0.0 | 0.0 | 0.0
I want to be able to take the 'Language' column and for any entry that contains 'German', do a mean/median/mode on the Score column. The expected output for Mean would be something like:
English 8.33
German 8.5
French 8.33
Italian 8.5
Japanese 8
I've got a syntax that will split them out into into their own columns and then I know I could do a logic on that way, however I would like to learn if the way that I want to analyse the dataset is even possible.
EDIT: Code as requested
[{'Name': 'Andes, The',
'Year': 2021,
'Score': '8',
'2nd Score': 8.8,
'% of People': '87%',
'Country': 'The Netherlands',
'Fruit': 'The Apple',
'Export Countries': 'United States,United Kingdom',
'Language': 'English,Japanese,French',
'Transit Duration': 148.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
{'Name': 'Phil',
'Year': 2021,
'Score': '8',
'2nd Score': 8.8,
'% of People': '87%',
'Country': 'Spain',
'Fruit': 'The Banana',
'Export Countries': 'United Kingdom, Germany',
'Language': 'English,German,French,Italian',
'Transit Duration': 118.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
{'Name': 'Sarah',
'Year': 2021,
'Score': '9',
'2nd Score': 8.8,
'% of People': '89%',
'Country': 'Greece',
'Fruit': 'The Plum',
'Export Countries': 'Germany,United States',
'Language': 'English,German,French,Italian',
'Transit Duration': 165.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
{'Name': 'William',
'Year': 2021,
'Score': '6',
'2nd Score': 8.8,
'% of People': '65%',
'Country': 'Brazil',
'Fruit': 'Strawberries',
'Export Countries': 'Spain,Greece',
'Language': 'Spanish, French',
'Transit Duration': 153.0,
'Quality': 1.0,
'Taste': 0.0,
'Freshness': 0.0,
'Packaging': 0.0},
Upvotes: 1
Views: 131
Reputation: 260845
To get various aggregators on the Score column after filtering, you can use the following. NB. "mode" is not an aggregation function as there can be several modes, so I built one to only get the first mode.
def mode(s):
return pd.Series.mode(s)[0]
(df[df['Language'].str.contains('German')]
.assign(Language=lambda d: d['Language'].str.split(','))
.explode('Language')
['Score'].agg(['mean',
'median',
mode,
])
)
Output:
mean 8.5
median 8.5
mode 8.0
Name: Score, dtype: float64
Per language:
def mode(s):
return pd.Series.mode(s)[0]
(df[df['Language'].str.contains('German')]
.assign(Language=lambda d: d['Language'].str.split(','))
.explode('Language')
.groupby('Language')
['Score'].agg(['mean',
'median',
mode,
])
)
Output:
mean median mode
Language
English 8.5 8.5 8
French 8.5 8.5 8
German 8.5 8.5 8
Italian 8.5 8.5 8
Upvotes: 0
Reputation: 214977
You can split
and explode
Language
column to make it flat and then aggregate by each individual language:
(df[['Score', 'Language']]
.assign(Language=lambda x: x.Language.str.split(','))
.explode('Language')
.groupby('Language')
.Score.mean()
.reset_index())
Language Score
0 English 8.333333
1 French 8.333333
2 German 8.500000
3 Italian 8.500000
4 Japanese 8.000000
Upvotes: 1