Reputation: 7625
I have a data frame that looks like the following image:
Here uid and id are the indexes. This data frame was converted from a single-index data frame, so some columns contain duplicate values. For each uid, all avg_diff values are the same, but different uids have different values for this field. I want to get the 10 largest avg_diff values, each from a different uid.
Note: this is a huge data frame, so I am looking for the most optimized way.
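For reference, a minimal frame with that shape can be sketched like this (the values are invented for illustration, since the image is not included):

```python
import pandas as pd

# Hypothetical reconstruction of the frame described above:
# MultiIndex on (uid, id), with avg_diff repeated within each uid.
df = pd.DataFrame({'uid': [1, 1, 1, 2, 2, 3, 3],
                   'id':  [2, 3, 4, 5, 6, 1, 3],
                   'avg_diff': [0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3]})
df = df.set_index(['uid', 'id'])
print(df)
```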
Upvotes: 1
Views: 528
Reputation: 862741
I think you can first remove the duplicates with get_level_values and duplicated, using boolean indexing (~ inverts the boolean mask). Then use DataFrame.nlargest, or sort_values + head:
df = pd.DataFrame({'uid':[1,1,1,2,2,3,3], 'id':[2,3,4,5,6,1,3],
'avg_diff':[0.1,0.1,0.1,0.2,0.2,0.3,0.3]})
df = df.set_index('uid').set_index('id', drop=False, append=True)
print (df)
        avg_diff  id
uid id              
1   2        0.1   2
    3        0.1   3
    4        0.1   4
2   5        0.2   5
    6        0.2   6
3   1        0.3   1
    3        0.3   3
mask = df.index.get_level_values('uid').duplicated()
print (~mask)
[ True False False True False True False]
df = df[~mask].nlargest(2, 'avg_diff')
print (df)
        avg_diff  id
uid id              
3   1        0.3   1
2   5        0.2   5
Another solution:
mask = df.index.get_level_values('uid').duplicated()
print (~mask)
[ True False False True False True False]
df = df[~mask].sort_values('avg_diff', ascending=False).head(2)
print (df)
        avg_diff  id
uid id              
3   1        0.3   1
2   5        0.2   5
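Since avg_diff is constant within each uid, a groupby-based variant is also possible. This is only a sketch under that assumption, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'uid': [1, 1, 1, 2, 2, 3, 3],
                   'id':  [2, 3, 4, 5, 6, 1, 3],
                   'avg_diff': [0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3]})
df = df.set_index('uid').set_index('id', drop=False, append=True)

# Take the first avg_diff per uid (all equal within a uid),
# then pick the largest values with Series.nlargest.
top = df.groupby(level='uid')['avg_diff'].first().nlargest(2)
print(top)
```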
Upvotes: 1
Reputation: 973
If I understood you correctly, you just need to drop duplicates of "uid" and then sort by avg_diff:
unique_uid = data.reset_index().drop_duplicates("uid").set_index("uid")
print(unique_uid["avg_diff"].sort_values(ascending=False)[:10])
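As a sketch of the same idea (the data here is invented), Series.nlargest can replace the full sort when you only need the top 10:

```python
import pandas as pd

# Hypothetical frame standing in for `data` from the answer above.
data = pd.DataFrame({'uid': [1, 1, 2, 2, 3],
                     'id':  [2, 3, 5, 6, 1],
                     'avg_diff': [0.1, 0.1, 0.2, 0.2, 0.3]}).set_index(['uid', 'id'])

unique_uid = data.reset_index().drop_duplicates("uid").set_index("uid")
# Partial selection of the 10 largest values instead of sorting the whole column.
print(unique_uid["avg_diff"].nlargest(10))
```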
Upvotes: 0