Reputation: 899
I have the following sample CSV.
,cid1,cid2,cid3
rid1,0.1,0.4,0.3
rid2,1.0,0.1,0.5
rid3,0.2,0.5,0.1
rid4,0.3,0.4,0.8
rid5,0.2,0.3,0.7
rid6,0.9,0.2,0.1
rid7,0.4,0.8,0.9
rid8,0.6,0.5,0.7
rid9,0.3,0.9,0.4
I want to show n rows with the highest value for every column in the file. Ideally, I would like to get the following output (for n = 3).
cid1 rid2 1.0
cid1 rid6 0.9
cid1 rid8 0.6
# Blank lines separating columns are optional.
cid2 rid9 0.9
cid2 rid7 0.8
cid2 rid8 0.5
cid3 rid7 0.9
cid3 rid4 0.8
cid3 rid8 0.7
This is what I have so far.
import pandas as pd
df = pd.read_csv("input.csv", index_col=0)
n = 3
for col in df:
    print(col)
    print(df.sort_values(col, ascending=False)[:n][col].to_string())
which gives me almost what I want.
cid1
rid2 1.0
rid6 0.9
rid8 0.6
cid2
rid9 0.9
rid7 0.8
rid3 0.5
cid3
rid7 0.9
rid4 0.8
rid5 0.7
Also, there is a small issue with cid2, for which rid8 and rid3 have the same value of 0.5. In that case, it would be perfect to show both of them on separate lines, but I'm aware that will probably complicate the code considerably, so I would be happy with the basic solution.
Upvotes: 1
Views: 955
Reputation: 403278
Are you looking for GroupBy.nlargest?
df.unstack().groupby(level=0, group_keys=False).nlargest(3)
cid1  rid2    1.0
      rid6    0.9
      rid8    0.6
cid2  rid9    0.9
      rid7    0.8
      rid3    0.5
cid3  rid7    0.9
      rid4    0.8
      rid5    0.7
dtype: float64
Trying to mirror OP's "expected" result:
print(df.unstack()
        .groupby(level=0, group_keys=False)
        .nlargest(3)
        .reset_index()
        .to_string(header=None, index=False))
cid1 rid2 1.0
cid1 rid6 0.9
cid1 rid8 0.6
cid2 rid9 0.9
cid2 rid7 0.8
cid2 rid3 0.5
cid3 rid7 0.9
cid3 rid4 0.8
cid3 rid5 0.7
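
As for the tie the question mentions (rid8 and rid3 both at 0.5 under cid2), a minimal sketch is to build on the original per-column loop rather than the groupby above and use Series.nlargest with keep='all', which retains every row tied with the n-th largest value, so a column may print more than n lines. This assumes the same input.csv and n as in the question.

import pandas as pd

df = pd.read_csv("input.csv", index_col=0)
n = 3

for col in df:
    # keep='all' keeps every row tied with the n-th largest value,
    # so cid2 would list both rid8 and rid3 at 0.5
    for rid, val in df[col].nlargest(n, keep='all').items():
        print(col, rid, val)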
Upvotes: 3