sherlock85
sherlock85

Reputation: 899

Show top n rows for every column in Pandas data frame

I have the following sample CSV.

,cid1,cid2,cid3
rid1,0.1,0.4,0.3
rid2,1.0,0.1,0.5
rid3,0.2,0.5,0.1
rid4,0.3,0.4,0.8
rid5,0.2,0.3,0.7
rid6,0.9,0.2,0.1
rid7,0.4,0.8,0.9
rid8,0.6,0.5,0.7
rid9,0.3,0.9,0.4

I want to show n rows with the highest value for every column in the file. Ideally, I would like to get the following output (for n = 3).

cid1  rid2  1.0
cid1  rid6  0.9
cid1  rid8  0.6
                    # Blank lines separating columns are optional.
cid2  rid9  0.9
cid2  rid7  0.8
cid2  rid8  0.5

cid3  rid7  0.9
cid3  rid4  0.8
cid3  rid8  0.7

This is what I have so far.

import pandas as pd

df = pd.read_csv("input.csv", index_col=0)
n = 3

for col in df:
    print(col)
    print(df.sort_values(col, ascending=False)[:n][col].to_string())

which gives me almost what I want.

cid1
rid2    1.0
rid6    0.9
rid8    0.6
cid2
rid9    0.9
rid7    0.8
rid3    0.5
cid3
rid7    0.9
rid4    0.8
rid5    0.7

Also, there is a small issue with cid2, for which rid8 and rid3 have the same value of 0.5. In that case, it would be perfect to show them both in separate lines, but I'm aware it will probably complicate the code much further so I would be happy with the basic solution.

Upvotes: 1

Views: 955

Answers (1)

cs95
cs95

Reputation: 403278

Are you looking for GroupBy.nlargest?

df.unstack().groupby(level=0, group_keys=False).nlargest(3)
 
cid1  rid2    1.0
      rid6    0.9
      rid8    0.6
cid2  rid9    0.9
      rid7    0.8
      rid3    0.5
cid3  rid7    0.9
      rid4    0.8
      rid5    0.7
dtype: float64

Trying to mirror OP's "expected" result:

print(df.unstack()
        .groupby(level=0, group_keys=False)
        .nlargest(3)
        .reset_index()
        .to_string(header=None, index=False))

cid1 rid2 1.0
cid1 rid6 0.9
cid1 rid8 0.6
cid2 rid9 0.9
cid2 rid7 0.8
cid2 rid3 0.5
cid3 rid7 0.9
cid3 rid4 0.8
cid3 rid5 0.7

Upvotes: 3

Related Questions