Reputation: 59933
It is pandas/Dataframe, for every row, I want to keep only the top N (N=3) values and set others to nan
,
import pandas as pd
import numpy as np
data = np.array([['','day1','day2','day3','day4','day5'],
['larry',1,4,4,3,5],
['gunnar',2,-1,3,4,4],
['tin',-2,5,5, 6,7]])
df = pd.DataFrame(data=data[1:,1:],
index=data[1:,0],
columns=data[0,1:])
print(df)
output is
day1 day2 day3 day4 day5
larry 1 4 4 3 5
gunnar 2 -1 3 4 4
tin -2 5 5 6 7
I want to get
day1 day2 day3 day4 day5
larry NaN 4 4 NaN 5
gunnar NaN NaN 3 4 4
tin NaN 5 NaN 6 7
Similar to pandas: Keep only top n values and set others to 0, but I need to keep only N highest available values, otherwise the average is not correct
For the result above I want to keep first 5
only
Upvotes: 8
Views: 2488
Reputation: 59519
stack
, sort
, keep the top 3 per row with a groupby
, and then unstack
. We need a mergesort
to ensure we always keep the first value per row in the case of ties. Reindex ensures we have all the original columns if there are no largest values in any of the rows for that column.
If you require Int
in the end, the best you can do is 'Int64'
, pandas nullable-integer dtype.
# df = df.apply(pd.to_numeric) # To make numeric, not object.
N = 3
(df.stack()
.sort_values(ascending=False, , kind='mergesort')
.groupby(level=0).head(N)
.unstack()
.reindex(df.columns, axis=1)
.astype('Int64'))
day1 day2 day3 day4 day5
larry <NA> 4 4 <NA> 5
gunnar <NA> <NA> 3 4 4
tin <NA> 5 <NA> 6 7
Here are some timings vs the number of rows, and proof that these three methods are equivalent (before any type-casting). @anky's solution is very fast and simple.
import perfplot
import pandas as pd
import numpy as np
def alollz_stack(df, N=3):
return (df.stack()
.sort_values(ascending=False, kind='mergesort')
.groupby(level=0).head(N)
.unstack()
.reindex(df.columns, axis=1))
def quang_nlargest(df, N=3):
return df.apply(pd.Series.nlargest, n=N, axis=1).reindex(df.columns, axis=1)
def anky_rank(df, N=3):
return df[df.iloc[:,::-1].rank(1,'first').ge(df.shape[1]-N+1)]
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.randint(0, 100, (n, 5)),
columns=[f'col{i}' for i in range(1,6)]),
kernels=[
lambda df: alollz_stack(df),
lambda df: quang_nlargest(df),
lambda df: anky_rank(df)],
labels=['Stack', 'Apply_Nlargest', 'Rank'],
n_range=[2 ** k for k in range(18)],
equality_check=lambda x,y: np.allclose(x,y, equal_nan=True),
xlabel='N_rows'
)
Upvotes: 4
Reputation: 75080
Here is another way using df.rank
on axis=1
, we reverse the columns and calculate the rank since on a duplicate you want to retain the first value.
df[df.astype(float).iloc[:,::-1].rank(1,'first').ge(3)]
day1 day2 day3 day4 day5
larry NaN 4 4 NaN 5
gunnar NaN NaN 3 4 4
tin NaN 5 NaN 6 7
however as @Allolz rightly directed, for a general usecase based on the shape of the df, one can use:
N=3
n = df.shape[1]-N+1
df[df.astype(float).iloc[:,::-1].rank(1,'first').ge(n)]
Upvotes: 4
Reputation: 150735
You can use np.unique
to sort and find the 5th largest value, and use where
:
uniques = np.unique(df)
# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)
Output:
day1 day2 day3 day4 day5
larry NaN 4.0 4 3 5
gunnar NaN NaN 3 4 4
tin NaN 5.0 5 6 7
Update: On the second look, I think you can do:
df.apply(pd.Series.nlargest, n=3,axis=1).reindex(df.columns, axis=1)
Output:
day1 day2 day3 day4 day5
larry NaN 4.0 4.0 NaN 5.0
gunnar NaN NaN 3.0 4.0 4.0
tin NaN 5.0 NaN 6.0 7.0
Upvotes: 6
Reputation: 1581
To keep, for every row, the top 3 values you can use:
df = (df * df.astype(float).apply(lambda x: x.isin(x.nlargest(3)), axis=1)).replace('', np.nan)
You can migrate nlargest
's parameter into a variable if needed.
Output:
day1 day2 day3 day4 day5
larry NaN 4 4 NaN 5
gunnar NaN NaN 3 4 4
tin NaN 5 5 6 7
Upvotes: 4