Reputation: 1494
I have a multilevel index pandas DataFrame
where the first level is year
and the second level is username
. I only have one column which is already sorted in a descending manner. I want to show the first 2 rows of each index level 0.
What I have:
count
year username
2010 b 677
a 505
c 400
d 300
...
2014 a 100
b 80
What I want:
count
year username
2010 b 677
a 505
2011 c 677
d 505
2012 e 677
f 505
2013 g 677
i 505
2014 h 677
j 505
Upvotes: 7
Views: 9021
Reputation: 395
I ran into the same problem and found a neater answer in the docs (pandas version 1.0.1): GroupBy: taking the first rows of each group. Here is the trick, assuming your dataframe is called df
:
df.groupby(level=0).head(2)
Upvotes: 7
Reputation: 291
If you have a giant data frame you may not want to do a groupby
just to get a glimpse of the data. Here is another solution for getting the first five rows of the outer index and the first two rows of the inner index.
df = pd.DataFrame({'year': [2010, 2010, 2010, 2011,2011,2011, 2012, 2012, 2013, 2013, 2014, 2014],
'username': ['b','a','a','c','c','d','e','f','g','i','h','j'],
'count': [400, 505, 678, 677, 505, 505, 677, 505, 677, 505, 677, 505]})
df = df.set_index(['year','username'])
Note that the DataFrame
has to be sorted.
df = df.sort_index(level=[0,1])
df
count
year username
2010 a 505
a 678
b 400
2011 c 677
c 505
d 505
2012 e 677
f 505
2013 g 677
i 505
2014 h 677
j 505
Now for the magic:
def head_mi(df, n1=5, n2=2):
#get top n of outer index
top_lev_0 = df.index.levels[0].values[:n1]
#get top n of inner index
top_lev_1 = [df.loc[ind].index.values[:n2] for ind in top_lev_0 ]
#top_lev_1 is a list of the inner index values
#iterate over outer index and get slice from inner index
acc = []
for count0, ind0 in enumerate(top_lev_0):
acc.append(df.loc[(top_lev_0[count0], slice(top_lev_1[count0][0], top_lev_1[count0][-1])),:])
return pd.concat(acc)
head_mi(df)
This gives:
count
year username
2010 a 505
a 678
2011 c 677
c 505
2012 e 677
f 505
2013 g 677
i 505
2014 h 677
j 505
Upvotes: 1
Reputation: 21908
Here is an answer. Maybe there is a better way to do that (with indexing ?), but I thing it works. The principle seems complex but is quite simple:
DataFrame
by year and username.DataFrame
by year which is the first level (=0
) of the indexDataFrame
obtained by the groupby
(one for each year)
sort_index(by='count')
-> the row with more counts will be at the tail of the DataFrame
top
rows (2 in this case) by using the negative slicing notation ([-top:]
). The tail
method could also be used (tail(top)
) to improve readability.droplevel(0)
# Test data
df = pd.DataFrame({'year': [2010, 2010, 2010, 2011,2011,2011, 2012, 2012, 2013, 2013, 2014, 2014],
'username': ['b','a','a','c','c','d','e','f','g','i','h','j'],
'count': [400, 505, 678, 677, 505, 505, 677, 505, 677, 505, 677, 505]})
df = df.set_index(['year','username'])
top = 2
df = df.groupby(level=0).apply(lambda df: df.sort_index(by='count')[-top:])
df.index = df.index.droplevel(0)
df
count
year username
2010 a 505
a 678
2011 d 505
c 677
2012 f 505
e 677
2013 i 505
g 677
2014 j 505
h 677
Upvotes: 6