Reputation: 494
I can find the n largest values in each row of a NumPy array (link), but doing so loses the column information, which is what I want. Say I have some data:
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data
a b c d e
0 0.374540 0.950714 0.731994 0.598658 0.156019
1 0.155995 0.058084 0.866176 0.601115 0.708073
2 0.020584 0.969910 0.832443 0.212339 0.181825
3 0.183405 0.304242 0.524756 0.431945 0.291229
4 0.611853 0.139494 0.292145 0.366362 0.456070
I want the names of the largest contributors in each row. So for n = 2
the output would be:
0 b c
1 c e
2 b c
3 c d
4 a e
I can do it by looping over the DataFrame, but that would be inefficient. Is there a more Pythonic way?
Upvotes: 2
Views: 602
Reputation: 21239
Can a dense ranking be used for this?
N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
a b c d e
0 NaN 0.950714 0.731994 NaN NaN
1 NaN NaN 0.866176 NaN 0.708073
2 NaN 0.969910 0.832443 NaN NaN
3 NaN NaN 0.524756 0.431945 NaN
4 0.611853 NaN NaN NaN 0.456070
>>> nlargest.stack()
0 b 0.950714
c 0.731994
1 c 0.866176
e 0.708073
2 b 0.969910
c 0.832443
3 c 0.524756
d 0.431945
4 a 0.611853
e 0.456070
dtype: float64
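To turn the stacked result into the per-row lists of column names the question asks for, the second index level can be collected per row. A minimal sketch building on the answer above (the variable name `names` is just illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))

N = 2
threshold = len(data.columns) - N
nlargest = data[data.rank(method="dense", axis=1) > threshold]

# The stacked Series has a (row, column) MultiIndex; collect the
# column level into one list of names per row
names = nlargest.stack().groupby(level=0).apply(
    lambda s: list(s.index.get_level_values(1))
)
```

Here `names[0]` is `['b', 'c']`, matching the requested output; the names come back in column order because `stack` preserves it.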
Upvotes: 1
Reputation: 214957
Another option using numpy.argpartition
to find the top-n indices per row and then extract the column names by index:
import numpy as np
n = 2
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]
#array([['c', 'b'],
# ['e', 'c'],
# ['c', 'b'],
# ['d', 'c'],
# ['e', 'a']], dtype=object)
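One caveat: `argpartition` makes no ordering guarantee inside the top-n slice, so the names above come back in arbitrary order. If the largest value should come first, the top-n slice can be argsorted afterwards; a sketch building on the question's `data` (`sorted_names` is an illustrative name):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
n = 2

vals = data.values
# Indices of the top-n values per row, in no particular order
top = np.argpartition(vals, vals.shape[1] - n)[:, -n:]
# Sort those n indices by their values, largest first
order = np.take_along_axis(vals, top, axis=1).argsort(axis=1)[:, ::-1]
sorted_names = data.columns.values[np.take_along_axis(top, order, axis=1)]
```

The extra argsort only touches n elements per row, so the overall cost stays dominated by the O(columns) partition step.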
Upvotes: 1
Reputation: 92854
With the pandas.Series.nlargest
function:
data.apply(lambda x: x.nlargest(2).index.values, axis=1)
0 [b, c]
1 [c, e]
2 [b, c]
3 [c, d]
4 [a, e]
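Note that `apply` with `axis=1` runs a Python-level function per row, so it may be slower on wide or long frames than the rank/argpartition approaches. For presentation, the resulting arrays can be spread into one column per rank; a sketch on the question's data (the column names `top1`/`top2` are illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))

# nlargest sorts descending, so the first name is the row maximum
result = data.apply(lambda x: x.nlargest(2).index.values, axis=1)
top_cols = pd.DataFrame(result.tolist(), index=data.index,
                        columns=['top1', 'top2'])
```

Here `top_cols.loc[0]` is `top1 = 'b'`, `top2 = 'c'`.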
Upvotes: 3