Reputation: 604
I am a Python beginner trying to figure out how the group labels from a groupby operation can be used as the index of a new DataFrame. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'UK', 'China', 'Canada', 'Australia', 'UK', 'China', 'USA'],
                   'Year': [1979, 1983, 1987, 1991, 1995, 1999, 2003, 2007, 2011],
                   'Medals': [52, 30, 25, 41, 19, 17, 9, 14, 12]})
df:
Country Medals Year
0 USA 52 1979
1 USA 30 1983
2 UK 25 1987
3 China 41 1991
4 Canada 19 1995
5 Australia 17 1999
6 UK 9 2003
7 China 14 2007
8 USA 12 2011
c1 = df.groupby(df['Country'], as_index=True, sort=False, group_keys=True).size()
c1:
Country
USA 3
UK 2
China 2
Canada 1
Australia 1
I want to create a new dataframe with the above c1 results exactly in that format but I have not been able to do that. Below is what I get:
d1 = pd.DataFrame(np.array(c1), columns=['Frequency'])
d1:
Frequency
0 3
1 2
2 2
3 1
4 1
I want the group labels as index and not the default 0, 1, 2, 3 and 4. This is exactly what I want:
Desired Output:
Frequency
USA 3
UK 2
China 2
Canada 1
Australia 1
How can I achieve this? I guess that if I build a list of the country labels and assign it as the index, it might work. However, the original data I'm practising with has so many rows that building such a list by hand would be impractical. Any ideas will be highly appreciated.
Upvotes: 5
Views: 29395
Reputation: 2710
Edit: let's see how you like this one!
c1 = pd.DataFrame(c1.values, index=c1.index.values, columns=['Frequency'])
print(c1)
Frequency
USA 3
UK 2
China 2
Canada 1
Australia 1
c1.values is roughly equivalent (for our purposes) to np.array(c1) but avoids needing to import numpy.
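To see what is lost when only the values are passed, here is a minimal sketch using the question's data. It assumes pandas and NumPy are installed; the names `same_values`, `lost` and `kept` are only illustrative and do not appear in the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'China', 'Canada', 'Australia', 'UK', 'China', 'USA'],
    'Year': [1979, 1983, 1987, 1991, 1995, 1999, 2003, 2007, 2011],
    'Medals': [52, 30, 25, 41, 19, 17, 9, 14, 12],
})
c1 = df.groupby('Country', sort=False).size()

# Both expressions give a bare NumPy array of counts, without the group labels.
same_values = np.array_equal(c1.values, np.array(c1))

# Without index=..., pandas falls back to the default 0..n-1 index.
lost = pd.DataFrame(c1.values, columns=['Frequency'])

# Passing the Series' own index keeps the group labels.
kept = pd.DataFrame(c1.values, index=c1.index.values, columns=['Frequency'])

print(lost.index.tolist())
print(kept.index.tolist())
```

The labels live in c1.index, not in the values, which is why they have to be handed over separately.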
Original response (doesn't work, because c1 is a Series and set_index is a DataFrame method; left for posterity): You are likely looking for the set_index method.
It should work something like this:
c1 = df.groupby(df['Country'], as_index=True, sort=False, group_keys=True).size()
c2 = c1.set_index(['Country'])
Let me know if this works for you!
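For reference, here is a sketch of a variant of the set_index idea that does run. Since c1 is a Series, it first has to become a DataFrame, for example via reset_index(name=...), before set_index is available. The name c2 follows the snippet above; this is one possible route under that assumption, not necessarily what was originally intended.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'China', 'Canada', 'Australia', 'UK', 'China', 'USA'],
    'Year': [1979, 1983, 1987, 1991, 1995, 1999, 2003, 2007, 2011],
    'Medals': [52, 30, 25, 41, 19, 17, 9, 14, 12],
})
c1 = df.groupby('Country', sort=False).size()

# reset_index(name=...) turns the Series into a DataFrame: the group labels
# move into a 'Country' column and the counts column is named 'Frequency'.
# set_index then promotes 'Country' back to the index.
c2 = c1.reset_index(name='Frequency').set_index('Country')

print(c2)
```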
Upvotes: 2
Reputation: 604
Finally, I figured out what seems to be a working solution. I realized that c1 is a Series and not a DataFrame, and that its index is accessible via c1.index. So I improved the code by passing that index explicitly:
d1 = pd.DataFrame(np.array(c1), index=c1.index, columns=['Frequency'])
d1:
Frequency
Country
USA 3
UK 2
China 2
Canada 1
Australia 1
I don't know if this is the best solution. Better ideas are still welcome.
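One commonly used alternative, offered as a sketch rather than a definitive answer: Series.to_frame converts the groupby result directly and keeps its index, so no NumPy round trip is needed. The rename_axis(None) step is optional and only drops the "Country" header so the output matches the desired layout; the name d2 is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'UK', 'China', 'Canada', 'Australia', 'UK', 'China', 'USA'],
    'Year': [1979, 1983, 1987, 1991, 1995, 1999, 2003, 2007, 2011],
    'Medals': [52, 30, 25, 41, 19, 17, 9, 14, 12],
})
c1 = df.groupby('Country', sort=False).size()

# to_frame keeps the Series index (the group labels) and names the single column.
d1 = c1.to_frame(name='Frequency')

# Optional: clear the index name so no 'Country' header row is printed.
d2 = d1.rename_axis(None)

print(d2)
```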
Upvotes: 2