Replace pandas column with sorted index

Question

I have a sample DF, trying to replace the list of column values with ascending sorted index:

DF:

df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])

    a   b   c    d1      d2       date
0   2   7   9   Apple   Orange  2002-01-01
1   6   0   9   Mango   lemon   2002-01-01
2   8   0   0   Apple   lemon   2002-01-01
3   4   4   4   Mango   Orange  2002-01-01
4   5   0   8   Mango   lemon   2002-02-01
5   6   1   6   Mango   Orange  2002-02-01
6   7   2   7   Apple   lemon   2002-02-01

Step 1:

Group the DF by "date" column, sample group on "2002-01-01"


        a   b   c    d1      d2       date
    0   2   7   9   Apple   Orange  2002-01-01
    1   6   0   9   Mango   lemon   2002-01-01
    2   8   0   0   Apple   lemon   2002-01-01
    3   4   4   4   Mango   Orange  2002-01-01

Step 2:

In this group, replace the values of columns ["d1","d2"] with index (not the DF index) of sorted mean values based on c.

For example in the above group mean(c, d1="Apple") = [9+0]/2 => 4.5 and mean(c, d1="Mango") = [9+4]/2 => 6.5 so the ascending sorted index is Apple:0 and Mango:1

so the value of column d1 will be replaced like the following:

            a   b   c   d1       d2       date
        0   2   7   9   0      Orange   2002-01-01
        1   6   0   9   1      lemon    2002-01-01
        2   8   0   0   0      lemon    2002-01-01
        3   4   4   4   1      Orange   2002-01-01

Apply this for the entire df. I have a brute force approach of iterating through the groups and each rows, any suggestions for a more pandas based solution will be helpful in improving the efficiency.

Andy L. · Accepted Answer

You may use pivot_table and groupby.rank to create rank. After that use map to assign values back

df1 = df.pivot_table('c', ['date','d1']).groupby(level=0).rank(method='dense')-1
df['d1'] = df[['date','d1']].agg(tuple, axis=1).map(df1.c).astype('int')

Out[255]:
   a  b  c  d1      d2        date
0  2  7  9   0  Orange  2002-01-01
1  6  0  9   1   lemon  2002-01-01
2  8  0  0   0   lemon  2002-01-01
3  4  4  4   1  Orange  2002-01-01
4  5  0  8   0   lemon  2002-02-01
5  6  1  6   0  Orange  2002-02-01
6  7  2  7   0   lemon  2002-02-01

Note: group 2002-02-01 has the same mean value 7 for both Mango and Apple so the rank is all 0

Replace pandas column with sorted index

Answers (2)

Related Questions