Reputation: 974
I have a table like:
country | name | medals_won | year
-----------------------------------
US | sarah | 1 | 2010
US | sarah | 2 | 2011
US | sarah | 5 | 2015
US | alice | 3 | 2010
US | alice | 4 | 2012
US | alice | 1 | 2015
AU | jones | 2 | 2013
AU | jones | 8 | 2015
I want it like:
country | name | 2010 | 2011 | 2012 | 2013 | 2014 | 2015
---------------------------------------------------------
US | sarah | 1 | 2 | 0 | 0 | 0 | 5
US | alice | 3 | 0 | 4 | 0 | 0 | 1
AU | jones | 0 | 0 | 0 | 2 | 0 | 8
I've tinkered with df.apply
, or even brute-force iteration, but you can probably guess that the tricky part is that these row values aren't strictly sequential, so this isn't a simple transpose operation (nobody won any medals in 2014, for e.g., but I want the resulting table to show that in a column full of zeros).
Upvotes: 3
Views: 2615
Reputation: 39
You can use the pivot_table() function of pandas and fill nan values with zero using pd.fillna(0)
df = pd.DataFrame({
'country' : pd.Series(['US', 'US', 'US', 'US', 'US', 'US', 'AU', 'AU']),
'name' : pd.Series(['sarah', 'sarah','sarah','alice','alice','alice','jones','jones']),
'medals_won' : pd.Series([1,2,5,3,4,1,2,8]),
'year': pd.Series([2010,2011,2015,2010,2012,2015,2013,2015])
})
pd.pivot_table(df, index=['country','name'], columns='year', aggfunc='sum').fillna(0)
Upvotes: -1
Reputation: 862601
You can use set_index
+ unstack
:
df = df.set_index(['country','name','year'])['medals_won'].unstack(fill_value=0)
print (df)
year 2010 2011 2012 2013 2015
country name
AU jones 0 0 0 2 8
US alice 3 0 4 0 1
sarah 1 2 0 0 5
If duplicates need aggregation like mean
, sum
... with pivot_table
or groupby
+ aggregate function
+ unstack
:
print (df)
country name medals_won year
0 US sarah 1 2010 <-same US sarah 2010, different 1
1 US sarah 4 2010 <-same US sarah 2010, different 4
2 US sarah 2 2011
3 US sarah 5 2015
4 US alice 3 2010
5 US alice 4 2012
6 US alice 1 2015
7 AU jones 2 2013
8 AU jones 8 2015
df = df.pivot_table(index=['country','name'],
columns='year',
values='medals_won',
fill_value=0,
aggfunc='mean')
print (df)
year 2010 2011 2012 2013 2015
country name
AU jones 0.0 0 0 2 8
US alice 3.0 0 4 0 1
sarah 2.5 2 0 0 5 <- (1+4)/2 = 2.5
Alternatively:
df = df.groupby(['country','name','year'])['medals_won'].mean().unstack(fill_value=0)
print (df)
year 2010 2011 2012 2013 2015
country name
AU jones 0.0 0.0 0.0 2.0 8.0
US alice 3.0 0.0 4.0 0.0 1.0
sarah 2.5 2.0 0.0 0.0 5.0
Last:
df = df.reset_index().rename_axis(None, axis=1)
print (df)
country name 2010 2011 2012 2013 2015
0 AU jones 0 0 0 2 8
1 US alice 3 0 4 0 1
2 US sarah 1 2 0 0 5
Upvotes: 6