Larry Cai
Larry Cai

Reputation: 59933

How to keep the only the top N values in a dataframe

It is pandas/Dataframe, for every row, I want to keep only the top N (N=3) values and set others to nan,

import pandas as pd
import numpy as np

data = np.array([['','day1','day2','day3','day4','day5'],
                ['larry',1,4,4,3,5],
                ['gunnar',2,-1,3,4,4],
                ['tin',-2,5,5, 6,7]])
                
df = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])
print(df) 

output is

       day1 day2 day3 day4 day5
larry     1    4    4    3    5
gunnar    2   -1    3    4    4
tin      -2    5    5    6    7

I want to get

       day1 day2 day3 day4 day5
larry   NaN    4    4  NaN    5
gunnar  NaN  NaN    3    4    4
tin     NaN    5  NaN    6    7

Similar to pandas: Keep only top n values and set others to 0, but I need to keep only N highest available values, otherwise the average is not correct

For the result above I want to keep first 5 only

Upvotes: 8

Views: 2488

Answers (4)

ALollz
ALollz

Reputation: 59519

stack, sort, keep the top 3 per row with a groupby, and then unstack. We need a mergesort to ensure we always keep the first value per row in the case of ties. Reindex ensures we have all the original columns if there are no largest values in any of the rows for that column.

If you require Int in the end, the best you can do is 'Int64', pandas nullable-integer dtype.

# df = df.apply(pd.to_numeric)  # To make numeric, not object. 

N = 3
(df.stack()
   .sort_values(ascending=False, , kind='mergesort')
   .groupby(level=0).head(N)
   .unstack()
   .reindex(df.columns, axis=1)
   .astype('Int64'))

        day1  day2  day3  day4  day5
larry   <NA>     4     4  <NA>     5
gunnar  <NA>  <NA>     3     4     4
tin     <NA>     5  <NA>     6     7

Here are some timings vs the number of rows, and proof that these three methods are equivalent (before any type-casting). @anky's solution is very fast and simple.

import perfplot
import pandas as pd
import numpy as np


def alollz_stack(df, N=3):
    return (df.stack()
              .sort_values(ascending=False, kind='mergesort')
              .groupby(level=0).head(N)
              .unstack()
              .reindex(df.columns, axis=1))
    
def quang_nlargest(df, N=3):
    return df.apply(pd.Series.nlargest, n=N, axis=1).reindex(df.columns, axis=1)

def anky_rank(df, N=3):
    return df[df.iloc[:,::-1].rank(1,'first').ge(df.shape[1]-N+1)]
    

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 100, (n, 5)),
                                 columns=[f'col{i}' for i in range(1,6)]), 
    kernels=[
        lambda df: alollz_stack(df),
        lambda df: quang_nlargest(df),
        lambda df: anky_rank(df)],
    labels=['Stack', 'Apply_Nlargest', 'Rank'],
    n_range=[2 ** k for k in range(18)],
    equality_check=lambda x,y: np.allclose(x,y, equal_nan=True),
    xlabel='N_rows'
)

enter image description here

Upvotes: 4

anky
anky

Reputation: 75080

Here is another way using df.rank on axis=1 , we reverse the columns and calculate the rank since on a duplicate you want to retain the first value.

df[df.astype(float).iloc[:,::-1].rank(1,'first').ge(3)]

       day1 day2 day3 day4 day5
larry   NaN    4    4  NaN    5
gunnar  NaN  NaN    3    4    4
tin     NaN    5  NaN    6    7

however as @Allolz rightly directed, for a general usecase based on the shape of the df, one can use:

N=3
n = df.shape[1]-N+1
df[df.astype(float).iloc[:,::-1].rank(1,'first').ge(n)]

Upvotes: 4

Quang Hoang
Quang Hoang

Reputation: 150735

You can use np.unique to sort and find the 5th largest value, and use where:

uniques = np.unique(df)

# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)

Output:

        day1  day2  day3  day4  day5
larry    NaN   4.0     4     3     5
gunnar   NaN   NaN     3     4     4
tin      NaN   5.0     5     6     7

Update: On the second look, I think you can do:

df.apply(pd.Series.nlargest, n=3,axis=1).reindex(df.columns, axis=1)

Output:

        day1  day2  day3  day4  day5
larry    NaN   4.0   4.0   NaN   5.0
gunnar   NaN   NaN   3.0   4.0   4.0
tin      NaN   5.0   NaN   6.0   7.0

Upvotes: 6

arhr
arhr

Reputation: 1581

To keep, for every row, the top 3 values you can use:

df = (df * df.astype(float).apply(lambda x: x.isin(x.nlargest(3)), axis=1)).replace('', np.nan)

You can migrate nlargest's parameter into a variable if needed.

Output:

        day1 day2 day3 day4 day5
larry    NaN    4    4  NaN    5
gunnar   NaN  NaN    3    4    4
tin      NaN    5    5    6    7

Upvotes: 4

Related Questions