Reputation: 357
This is a question on Udacity Data Science Nanodegree and I can't figure it out. The instructions are:
Using the dataframe's apply method, create a new Series called avg_medal_count
that indicates the average number of gold, silver, and bronze medals earned amongst countries who earned at least one medal of any kind at the 2014 Sochi Olympics.
The code I have currently is:
import numpy
from pandas import DataFrame, Series
def avg_medal_count():
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
'Netherlands', 'Germany', 'Switzerland', 'Belarus',
'Austria', 'France', 'Poland', 'China', 'Korea',
'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']
gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
olympic_medal_counts = {'country_name':countries,
'gold': Series(gold),
'silver': Series(silver),
'bronze': Series(bronze)}
df = DataFrame(olympic_medal_counts)
# YOUR CODE HERE
return avg_medal_count
I have tried a couple different things such as:
avg_medal_count = df.apply(numpy.mean)
, but get the error saying it could not convert the first column to numeric which makes sense because the first column is a list of countries. How can I use df.apply
on only gold, silver and bronze columns? I have tried other variations, but nothing worked. I am pretty sure that I need to use a combination of df.apply
and numpy.mean
, because that is what I just learned about. Any thoughts?
Thanks!
Upvotes: 3
Views: 18471
Reputation: 11
avg_medal_count = df.mean()
All countries have more than 1 medal in this set, so there is no need to filter it. In case you need it:
avg_medal_count = df[(df.gold + df.silver + df.bronze) > 0].mean()
Panda 0.22.0 https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html
DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
...
numeric_only : boolean, default None Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Upvotes: 1
Reputation: 11
avg_medal_count = df[['gold', 'silver', 'bronze']].apply(numpy.mean)
Gotta do this because the mean will only apply to the numerical columns and not the countries, which is a string...
Upvotes: 1
Reputation: 11
Neither solution above uses apply as stated in the problem. Use the following:
# YOUR CODE HERE
sub_series = {'gold': df.gold,
'silver': df.silver,
'bronze': df.bronze
}
sub_df = DataFrame(sub_series)
avg_medal_count = sub_df.apply(numpy.mean)
return avg_medal_count
The applying numpy.mean on the original df will always return errors due to the text column 'countries'
Upvotes: 1
Reputation: 988
I have just used the concept of R language in pandas to solve it and it works. Try this code under # your code here
sub_df = df[(df.gold >= 1) | (df.silver >= 1) | (df.bronze >= 1)] ### subsetting the data frame
avg_count = sub_df.mean(axis=0) ### axis 0 for column wise mean
return avg_count
In python 3 IDE (like pycharm) you should use
return print(avg_count)
then put the main function outside of the indentation to find the answer
avg_medal_count()
Upvotes: 3
Reputation: 3871
I would first modify how you import the data to:
df = DataFrame(olympic_medal_counts).set_index('country_name')
I would then calculate a new column containing the sum of the rows for the toal number of medals per country.
df['medal total'] = df.sum(axis=1)
Results:
bronze gold silver medal total
country_name
Russian Fed. 9 13 11 33
Norway 10 11 5 26
Canada 5 10 10 25
United States 12 9 7 28
Netherlands 9 8 7 24
Germany 5 8 6 19
Switzerland 2 6 3 11
Belarus 1 5 0 6
Austria 5 4 8 17
France 7 4 4 15
Poland 1 4 1 6
China 2 3 4 9
Korea 2 3 3 8
Sweden 6 2 7 15
Czech Republic 2 2 4 8
Slovenia 4 2 2 8
Japan 3 1 4 8
Finland 1 1 3 5
Great Britain 2 1 1 4
Ukraine 1 1 0 2
Slovakia 0 1 0 1
Italy 6 0 2 8
Latvia 2 0 2 4
Australia 1 0 2 3
Croatia 0 0 1 1
Kazakhstan 1 0 0 1
Finally, subset the the DataFrame for rows with medal totals greater than or equal to 1 and find the average of the columns.
df[df['medal total'] >= 1].apply(np.mean)
Results:
bronze 3.807692
gold 3.807692
silver 3.730769
medal total 11.346154
This could also be accomplished in one line using:
df[ df.sum(axis=1) >= 1 ].apply(np.mean)
Upvotes: 4