data_person
data_person

Reputation: 4470

Incremental GroupBy in pandas

I am trying to do incremental groupby and ranking in pandas.

Sample DF:

df = pd.DataFrame(np.random.randint(0,20,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df


    a   b   c     d1      d2     date
0   7   1   4   Apple   Orange  2002-01-01
1   3   7   6   Mango   lemon   2002-01-01
2   9   6   9   Apple   lemon   2002-01-01
3   0   5   8   Mango   Orange  2002-01-01
4   4   6   7   Mango   lemon   2002-02-01
5   4   3   8   Mango   Orange  2002-02-01
6   0   2   8   Apple   lemon   2002-02-01

Trying to incremental Group by d1 and rank every row of column d1 based on another column c.

For loc [0,"d1"], the value Apple will be rank 0 since there is only row and no comparison exits.

For loc [1,"d1"] , the value Mango will be 1 because considering the first two rows the corresponding value of Apple in column C, i.e value of [0,"c"] is 4 (Apple) and the value of [1,"C"] is 6 (for Mango) so Mango has higher rank in this sliced DF

For loc [2,"d1"] , the value Apple will be 1 because considering the first three rows the corresponding value of Apple in column C, i.e value of [0,"c"] is 4 (Apple) and the value of [1,"C"] is 6 (for Mango), the value of [2,"c"] is 9 (Apple) so the average of 2 values for Apple is (4+9)/2 =6.5 and the value for Mangois 6 so Apple will be rank 1.

Following the same pattern incrementally and updating the value of column d1 at the last index of the incrementally sliced DF.

Expected values for column d1:

0
1
1
1 => since for Apple (4+9)/2 and for Mango (6+8)/2
1 => since for Apple (4+9)/2 and for Mango (6+8+7)/3
1 => since for Apple (4+9)/2 and for Mango (6+8+7+8)/4
0 => since for Apple (4+9+8)/2 and for Mango (6+8+7+8)/4

I can do this in a for loop by slicing iteratively df[:i], but for large DFs this is taking forever, any suggestions on a more pandas based approach will be great.

Applying the first solution to the following random DF:

    a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   15  Apple   Orange  2002-01-01
4   4   6   8   Mango   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Apple   lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

I get the following values for d1:

0
0
0
1
0
0
1
0
1
0
      

the last value is wrong because at that point the value of Apple is 12.33 (19+4+15+14+10+12)/6 and Mango is 9 (17+8+1+10)/4 so the last value of d1 should be 1.

Upvotes: 0

Views: 714

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

Updated for second dataframe:

   a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   15  Apple   Orange  2002-01-01
4   4   6   8   Mango   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Apple   lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

s = df.groupby('d1')['c'].expanding().mean().sort_index(level=1)

Outputs:

Apple  0    19.000000
Mango  1    17.000000
Apple  2    11.500000
       3    12.666667
Mango  4    12.500000
       5     8.666667
Apple  6    13.000000
Mango  7     9.000000
Apple  8    12.400000
       9    12.333333

What do we need to do at this point? Are these averages correct?

And if I use s.diff().ge(0) to compare averages you get:

Apple  0    0
Mango  1    0
Apple  2    0
       3    1
Mango  4    0
       5    0
Apple  6    1
Mango  7    0
Apple  8    1
       9    0

IIUC,

Look at this:

df.groupby('d1')['c'].expanding().mean().sort_index(level=1)

Output:

Apple  0    4.00  #4
Mango  1    6.00  #6
Apple  2    6.50  #9+4 / 2
Mango  3    7.00  #6 + 8 / 2
       4    7.00  #6 + 8 + 7 / 3
       5    7.25  #6 + 8 + 7 + 8 / 4
Apple  6    7.00  #4 + 9 + 8 / 3
Name: c, dtype: float64

Now, let's compare to previous row:

df.groupby('d1')['c'].expanding().mean().sort_index(level=1).diff().ge(0).astype(int)

Output:

d1      
Apple  0    0
Mango  1    1
Apple  2    1
Mango  3    1
       4    1
       5    1
Apple  6    0
Name: c, dtype: int32

Or maybe you need to compare Mango to last value of apple....

df.groupby('d1')['c'].expanding().mean().sort_index(level=1).unstack(0).ffill()

Output:

d1  Apple  Mango
0     4.0    NaN
1     4.0   6.00
2     6.5   6.00
3     6.5   7.00
4     6.5   7.00
5     6.5   7.25
6     7.0   7.25

However, I can't match you expected output:

df.groupby('d1')['c'].expanding().mean().sort_index(level=1).unstack(0).ffill().eval('rank= Mango >= Apple')

Output:

d1  Apple  Mango   rank
0     4.0    NaN  False
1     4.0   6.00   True
2     6.5   6.00  False
3     6.5   7.00   True
4     6.5   7.00   True
5     6.5   7.25   True
6     7.0   7.25   True

Upvotes: 1

Related Questions