Reputation: 983
I would like to retrieve the n column names corresponding to the columns holding the n largest values in each row. These names should then be kept in a list, but only if the corresponding value is above a given threshold.
As an example, given the three columns 'Paul', 'Eric' and 'Marc', let's say I want at most 2 column names and a threshold value of 0.5. I would then obtain the column 'Bests' as follows:
import pandas as pd
import numpy as np
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 05:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')
r = len(pr1h)
df = pd.DataFrame(np.random.rand(r,3), index=pr1h, columns=['Paul', 'Marc', 'Eric'])
After processing:
df
Paul Marc Eric Bests
2020-01-01 00:00 0.124974 0.525182 0.415339 ['Marc']
2020-01-01 01:00 0.991917 0.489479 0.668359 ['Paul', 'Eric']
2020-01-01 02:00 0.204156 0.610034 0.644715 ['Eric', 'Marc']
2020-01-01 03:00 0.385546 0.981641 0.089667 ['Marc']
2020-01-01 04:00 0.912330 0.711822 0.148064 ['Paul', 'Marc']
2020-01-01 05:00 0.301186 0.313572 0.323487 []
I found this question/answer on SO that shows a way to get the column name depending on the rank of the value in a given row. I guess it could be a starting point (maybe not optimal in terms of speed, because it has to be run several times, but it seems a good start).
I could then have:
order = df.values.argsort(1)   # per-row ranking of the values
df['1st_largest'] = df.columns[order[:, -1]]
df['2nd_largest'] = df.columns[order[:, -2]]
My array ought not to have more than 20 to 50 columns, so I stuck with argsort instead of argpartition.
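For reference, here is roughly how the argpartition variant would look (just a sketch on the same df, not benchmarked):
# argpartition only guarantees that the last 2 positions per row hold the
# indices of the 2 largest values (in no particular order), which is enough
# to look up the column names
idx = np.argpartition(df.values, -2, axis=1)[:, -2:]
largest_names = df.columns.to_numpy()[idx]  # shape (n_rows, 2), unordered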
But now I am stuck. I have no idea how to check that the value related to one of these columns is above 0.5 so that I can put its name into the list.
Any help will be welcome, thanks!
Upvotes: 2
Views: 1637
Reputation: 42916
Same idea as Quang, but using GroupBy.agg(list):
# keep values above the threshold, stack to long format, take the 2 largest per timestamp
dfg = df.where(df > 0.5).stack().groupby(level=0).nlargest(2)
# the column names sit in the unnamed third index level ('level_2' after reset_index)
df['Bests'] = dfg.reset_index(level=2).groupby(level=0)['level_2'].agg(list)
Paul Marc Eric Bests
2020-01-01 00:00 0.494089 0.500048 0.398106 [Marc]
2020-01-01 01:00 0.571067 0.896135 0.445951 [Marc, Paul]
2020-01-01 02:00 0.769473 0.830661 0.909551 [Eric, Marc]
2020-01-01 03:00 0.828074 0.025853 0.670196 [Paul, Eric]
2020-01-01 04:00 0.651157 0.641126 0.346411 [Paul, Marc]
2020-01-01 05:00 0.752359 0.970789 0.231323 [Marc, Paul]
Upvotes: 2
Reputation: 150745
One way is to mask the dataframe with where and stack:
df['Bests'] = (df.where(df.gt(0.5))        # change 0.5 to your threshold
                 .stack()
                 .groupby(level=0)
                 .apply(lambda x: x.nlargest(2).index
                                   .get_level_values(1)
                                   .to_list())
              )
Output:
Paul Marc Eric Bests
2020-01-01 00:00 0.124974 0.525182 0.415339 [Marc]
2020-01-01 01:00 0.991917 0.489479 0.668359 [Paul, Eric]
2020-01-01 02:00 0.204156 0.610034 0.644715 [Eric, Marc]
2020-01-01 03:00 0.385546 0.981641 0.089667 [Marc]
2020-01-01 04:00 0.912330 0.711822 0.148064 [Paul, Marc]
2020-01-01 05:00 0.301186 0.313572 0.323487 NaN
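If you prefer an empty list instead of NaN for rows where nothing passes the threshold (as in your expected output), you could post-process the column, for example:
# replace NaN (no value above the threshold) with an empty list
df['Bests'] = df['Bests'].apply(lambda x: x if isinstance(x, list) else [])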
Upvotes: 3