R Walser

Reputation: 494

Find names of n largest values in each row of dataframe

I can find the n largest values in each row of a NumPy array, but doing so loses the column information, which is what I actually want. Say I have some data:

import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data

          a         b         c         d         e
0  0.374540  0.950714  0.731994  0.598658  0.156019
1  0.155995  0.058084  0.866176  0.601115  0.708073
2  0.020584  0.969910  0.832443  0.212339  0.181825
3  0.183405  0.304242  0.524756  0.431945  0.291229
4  0.611853  0.139494  0.292145  0.366362  0.456070

I want the names of the largest contributors in each row. So for n = 2 the output would be:

0  b  c
1  c  e
2  b  c
3  c  d
4  a  e

I can do it by looping over the DataFrame, but that would be inefficient. Is there a more Pythonic way?

Upvotes: 2

Views: 602

Answers (3)

jqurious

Reputation: 21239

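A dense ranking could be used for this, assuming the values in each row are distinct (with ties, the dense ranks top out below the column count, so fewer than N values may pass the threshold):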

N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
          a         b         c         d         e
0       NaN  0.950714  0.731994       NaN       NaN
1       NaN       NaN  0.866176       NaN  0.708073
2       NaN  0.969910  0.832443       NaN       NaN
3       NaN       NaN  0.524756  0.431945       NaN
4  0.611853       NaN       NaN       NaN  0.456070
>>> nlargest.stack()
0  b    0.950714
   c    0.731994
1  c    0.866176
   e    0.708073
2  b    0.969910
   c    0.832443
3  c    0.524756
   d    0.431945
4  a    0.611853
   e    0.456070
dtype: float64
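
To get from the ranking mask to the column names the question asks for, something like the following might work. It is only a sketch: it rebuilds the question's data, and the names mask and names are made up for illustration. The names come out in column order, not sorted by value.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question
np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"))

N = 2
threshold = len(data.columns) - N

# True where a value is among the N largest of its row (assumes no ties)
mask = data.rank(method="dense", axis=1) > threshold

# Collect the flagged column names per row, in column order
names = mask.apply(lambda row: list(row.index[row.to_numpy()]), axis=1)
print(names)
```

If the names need to be ordered by value within each row, the mask alone is not enough and the values have to be sorted as well.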

Upvotes: 1

akuiper

Reputation: 214957

Another option is to use numpy.argpartition to find the indices of the top n values in each row and then look up the column names by index:

import numpy as np

n = 2
# positions of the n largest values per row (order within the n is not guaranteed)
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]

#array([['c', 'b'],
#       ['e', 'c'],
#       ['c', 'b'],
#       ['d', 'c'],
#       ['e', 'a']], dtype=object)
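
Note that argpartition only guarantees that the last n positions hold the n largest values, not that they are in any particular order. If the names should run from largest to smallest, a follow-up sort of just those n entries is one way to do it. This is a sketch under that assumption; top, order, top_sorted and result are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question
np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"))
n = 2

values = data.to_numpy()

# Unordered positions of the n largest entries in each row
top = np.argpartition(values, values.shape[1] - n, axis=1)[:, -n:]

# Sort only those n positions by their values, largest first
order = np.argsort(-np.take_along_axis(values, top, axis=1), axis=1)
top_sorted = np.take_along_axis(top, order, axis=1)

# Map positions back to column names, keeping the original row index
result = pd.DataFrame(data.columns.to_numpy()[top_sorted], index=data.index)
print(result)
```

This keeps the partial-sort advantage of argpartition, since only n values per row are fully sorted.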

Upvotes: 1

RomanPerekhrest

Reputation: 92854

With the pandas.Series.nlargest function (data is the DataFrame from the question):

data.apply(lambda x: x.nlargest(2).index.values, axis=1)

0    [b, c]
1    [c, e]
2    [b, c]
3    [c, d]
4    [a, e]
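
If a two-column table like the one in the question is preferred over a Series of arrays, passing result_type="expand" to apply is one possible variation. This is a sketch: n and result are illustrative names, and the data is rebuilt from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question
np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"))
n = 2

# nlargest returns the values sorted in descending order, so the index
# labels are the column names from largest to smallest contributor;
# result_type="expand" spreads each returned list across columns
result = data.apply(
    lambda row: row.nlargest(n).index.tolist(),
    axis=1,
    result_type="expand",
)
print(result)
```

This still calls a Python function once per row, so it is likely to be slower than the NumPy-based approach on large frames.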

Upvotes: 3
