Incremental GroupBy in pandas

Question

I am trying to do incremental groupby and ranking in pandas.

Sample DF:

df = pd.DataFrame(np.random.randint(0,20,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df


    a   b   c     d1      d2     date
0   7   1   4   Apple   Orange  2002-01-01
1   3   7   6   Mango   lemon   2002-01-01
2   9   6   9   Apple   lemon   2002-01-01
3   0   5   8   Mango   Orange  2002-01-01
4   4   6   7   Mango   lemon   2002-02-01
5   4   3   8   Mango   Orange  2002-02-01
6   0   2   8   Apple   lemon   2002-02-01

Trying to incremental Group by d1 and rank every row of column d1 based on another column c.

For loc [0,"d1"], the value Apple will be rank 0 since there is only row and no comparison exits.

For loc [1,"d1"] , the value Mango will be 1 because considering the first two rows the corresponding value of Apple in column C, i.e value of [0,"c"] is 4 (Apple) and the value of [1,"C"] is 6 (for Mango) so Mango has higher rank in this sliced DF

For loc [2,"d1"] , the value Apple will be 1 because considering the first three rows the corresponding value of Apple in column C, i.e value of [0,"c"] is 4 (Apple) and the value of [1,"C"] is 6 (for Mango), the value of [2,"c"] is 9 (Apple) so the average of 2 values for Apple is (4+9)/2 =6.5 and the value for Mangois 6 so Apple will be rank 1.

Following the same pattern incrementally and updating the value of column d1 at the last index of the incrementally sliced DF.

Expected values for column d1:

0
1
1
1 => since for Apple (4+9)/2 and for Mango (6+8)/2
1 => since for Apple (4+9)/2 and for Mango (6+8+7)/3
1 => since for Apple (4+9)/2 and for Mango (6+8+7+8)/4
0 => since for Apple (4+9+8)/2 and for Mango (6+8+7+8)/4

I can do this in a for loop by slicing iteratively df[:i], but for large DFs this is taking forever, any suggestions on a more pandas based approach will be great.

Applying the first solution to the following random DF:

    a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   15  Apple   Orange  2002-01-01
4   4   6   8   Mango   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Apple   lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

I get the following values for d1:

the last value is wrong because at that point the value of Apple is 12.33 (19+4+15+14+10+12)/6 and Mango is 9 (17+8+1+10)/4 so the last value of d1 should be 1.

Incremental GroupBy in pandas

Answers (1)

Updated for second dataframe:

Related Questions