lindo

Reputation: 37

How to iterate rank normalisation over all input variables in pandas dataframe

I want to rank-normalise every variable in a pandas DataFrame to the range [0, 1]. So far I can only do this for a single variable (var1). How can I apply the same code to many variables in the DataFrame (var1 through varn)? Below is what I have for var1 in a small example:

import pandas as pd
# Create dataframe
data = {'year': [1990, 1990, 1991, 1991, 1991],
        'var1': [10, 20, 45, 40, 55]}
df = pd.DataFrame(data)
obsperyearvar1 = df.groupby('year')['var1'].transform('size')
df['rankvar1'] = df.groupby('year')['var1'].rank().div(obsperyearvar1)
print(df)
   year  var1  rankvar1
0  1990    10  0.500000
1  1990    20  1.000000
2  1991    45  0.666667
3  1991    40  0.333333
4  1991    55  1.000000

Thank you in advance!

Upvotes: 0

Views: 98

Answers (2)

Nk03

Reputation: 14949

IIUC, you can try:

df = pd.concat([df, df.groupby('year').apply(pd.Series.rank, pct=True).filter(
    like='var').add_prefix('rank')], axis=1)

Complete example:

import pandas as pd

# Create dataframe
data = {'year': [1990, 1990, 1991, 1991, 1991],
        'var1': [10, 20, 45, 40, 55],
        'var2': [10, 1, 5, 40, 5]}
df = pd.DataFrame(data)
df = pd.concat([df, df.groupby('year').apply(pd.Series.rank, pct=True).filter(
    like='var').add_prefix('rank')], axis=1)

OUTPUT:

   year  var1  var2  rankvar1  rankvar2
0  1990    10    10  0.500000       1.0
1  1990    20     1  1.000000       0.5
2  1991    45     5  0.666667       0.5
3  1991    40    40  0.333333       1.0
4  1991    55     5  1.000000       0.5
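
Since the question asks about iterating, here is a minimal sketch of an explicit loop over the columns. It assumes every column to rank has a name starting with 'var'; the name value_cols is only for illustration and is not from the original post.

```python
import pandas as pd

# Sample data with two variables to rank within each year
data = {'year': [1990, 1990, 1991, 1991, 1991],
        'var1': [10, 20, 45, 40, 55],
        'var2': [10, 1, 5, 40, 5]}
df = pd.DataFrame(data)

# Assumption: the columns to rank are those whose names start with 'var'
value_cols = [col for col in df.columns if col.startswith('var')]

# Rank each selected column within its year; pct=True scales the ranks into (0, 1]
for col in value_cols:
    df['rank' + col] = df.groupby('year')[col].rank(pct=True)

print(df)
```

The loop is mostly useful when each column needs different handling; otherwise the one-liner gives the same columns in a single call.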

Upvotes: 1

Henry Ecker

Reputation: 35636

Another option via join + groupby rank:

new_df = df.join(df.groupby('year').rank(pct=True).add_prefix('rank'))

new_df:

   year  var1  rankvar1
0  1990    10  0.500000
1  1990    20  1.000000
2  1991    45  0.666667
3  1991    40  0.333333
4  1991    55  1.000000

Sample data, thanks to @Nk03:

import pandas as pd

# Create dataframe
data = {'year': [1990, 1990, 1991, 1991, 1991],
        'var1': [10, 20, 45, 40, 55],
        'var2': [10, 1, 5, 40, 5]}
df = pd.DataFrame(data)

new_df = df.join(df.groupby('year').rank(pct=True).add_prefix('rank'))

print(new_df)

new_df:

   year  var1  var2  rankvar1  rankvar2
0  1990    10    10  0.500000       1.0
1  1990    20     1  1.000000       0.5
2  1991    45     5  0.666667       0.5
3  1991    40    40  0.333333       1.0
4  1991    55     5  1.000000       0.5
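
As a sanity check, the sketch below compares rank(pct=True) with the manual rank-divided-by-group-size approach from the question. It assumes the data has no missing values; the names value_cols, manual and shortcut are only for illustration.

```python
import pandas as pd

data = {'year': [1990, 1990, 1991, 1991, 1991],
        'var1': [10, 20, 45, 40, 55],
        'var2': [10, 1, 5, 40, 5]}
df = pd.DataFrame(data)

# Assumption: these are the columns to rank-normalise
value_cols = ['var1', 'var2']

# Manual approach from the question: rank within year, divide by observations per year
obs_per_year = df.groupby('year')['var1'].transform('size')
manual = df.groupby('year')[value_cols].rank().div(obs_per_year, axis=0)

# Shortcut: let rank() do the scaling itself
shortcut = df.groupby('year')[value_cols].rank(pct=True)

# Raises an AssertionError if the two results differ beyond floating-point tolerance
pd.testing.assert_frame_equal(manual, shortcut)
```

If the assertion passes silently, the two approaches agree on this data, so the manual division can be dropped.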

Upvotes: 1
