data_person
data_person

Reputation: 4490

Replace pandas column with sorted index

I have a sample DF, trying to replace the list of column values with ascending sorted index:

DF:

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

    a   b   c    d1      d2       date
0   2   7   9   Apple   Orange  2002-01-01
1   6   0   9   Mango   lemon   2002-01-01
2   8   0   0   Apple   lemon   2002-01-01
3   4   4   4   Mango   Orange  2002-01-01
4   5   0   8   Mango   lemon   2002-02-01
5   6   1   6   Mango   Orange  2002-02-01
6   7   2   7   Apple   lemon   2002-02-01

Step 1:

Group the DF by "date" column, sample group on "2002-01-01"


        a   b   c    d1      d2       date
    0   2   7   9   Apple   Orange  2002-01-01
    1   6   0   9   Mango   lemon   2002-01-01
    2   8   0   0   Apple   lemon   2002-01-01
    3   4   4   4   Mango   Orange  2002-01-01

Step 2:

In this group, replace the values of columns ["d1","d2"] with index (not the DF index) of sorted mean values based on c.

For example in the above group mean(c, d1="Apple") = [9+0]/2 => 4.5 and mean(c, d1="Mango") = [9+4]/2 => 6.5 so the ascending sorted index is Apple:0 and Mango:1

so the value of column d1 will be replaced like the following:

            a   b   c   d1       d2       date
        0   2   7   9   0      Orange   2002-01-01
        1   6   0   9   1      lemon    2002-01-01
        2   8   0   0   0      lemon    2002-01-01
        3   4   4   4   1      Orange   2002-01-01

Apply this for the entire df. I have a brute force approach of iterating through the groups and each rows, any suggestions for a more pandas based solution will be helpful in improving the efficiency.

Upvotes: 7

Views: 312

Answers (2)

Andy L.
Andy L.

Reputation: 25259

You may use pivot_table and groupby.rank to create rank. After that use map to assign values back

df1 = df.pivot_table('c', ['date','d1']).groupby(level=0).rank(method='dense')-1
df['d1'] = df[['date','d1']].agg(tuple, axis=1).map(df1.c).astype('int')

Out[255]:
   a  b  c  d1      d2        date
0  2  7  9   0  Orange  2002-01-01
1  6  0  9   1   lemon  2002-01-01
2  8  0  0   0   lemon  2002-01-01
3  4  4  4   1  Orange  2002-01-01
4  5  0  8   0   lemon  2002-02-01
5  6  1  6   0  Orange  2002-02-01
6  7  2  7   0   lemon  2002-02-01

Note: group 2002-02-01 has the same mean value 7 for both Mango and Apple so the rank is all 0

Upvotes: 1

eNc
eNc

Reputation: 1081

Is this what you are looking for in column d1? You could apply some similar technique to d2 as well. Its not the most elegant solution though.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

df['mean_value'] = df.groupby(['date', 'd1'])['c'].transform(lambda x: np.mean(x))
df['rank_value'] = (df.groupby(['date'])['mean_value'].rank(ascending=True, method='dense') - 1).astype(int)
df['d1'] = df['rank_value']
df.drop(labels=['rank_value', 'mean_value'], axis=1, inplace=True)

df

   a  b  c  d1      d2       date
0  3  1  4   1  Orange 2002-01-01
1  9  7  5   0   lemon 2002-01-01
2  9  9  5   1   lemon 2002-01-01
3  8  1  2   0  Orange 2002-01-01
4  8  0  1   0   lemon 2002-02-01
5  1  8  3   0  Orange 2002-02-01
6  8  0  4   1   lemon 2002-02-01

Upvotes: 2

Related Questions