Reputation: 4470
I am trying to do incremental groupby and ranking in pandas.
Sample DF:
df = pd.DataFrame(np.random.randint(0,20,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
a b c d1 d2 date
0 7 1 4 Apple Orange 2002-01-01
1 3 7 6 Mango lemon 2002-01-01
2 9 6 9 Apple lemon 2002-01-01
3 0 5 8 Mango Orange 2002-01-01
4 4 6 7 Mango lemon 2002-02-01
5 4 3 8 Mango Orange 2002-02-01
6 0 2 8 Apple lemon 2002-02-01
Trying to incremental Group by d1
and rank every row of column d1
based on another column c
.
For loc [0,"d1"]
, the value Apple
will be rank 0
since there is only row and no comparison exits.
For loc [1,"d1"]
, the value Mango
will be 1
because considering the first two rows the corresponding value of Apple
in column C
, i.e value of [0,"c"]
is 4
(Apple) and the value of [1,"C"]
is 6
(for Mango) so Mango has higher rank in this sliced DF
For loc [2,"d1"]
, the value Apple
will be 1
because considering the first three rows the corresponding value of Apple
in column C
, i.e value of [0,"c"]
is 4
(Apple) and the value of [1,"C"]
is 6
(for Mango), the value of [2,"c"]
is 9
(Apple) so the average of 2 values for Apple
is (4+9)/2 =6.5
and the value for Mango
is 6
so Apple
will be rank 1
.
Following the same pattern incrementally and updating the value of column d1
at the last index of the incrementally sliced DF.
Expected values for column d1
:
0
1
1
1 => since for Apple (4+9)/2 and for Mango (6+8)/2
1 => since for Apple (4+9)/2 and for Mango (6+8+7)/3
1 => since for Apple (4+9)/2 and for Mango (6+8+7+8)/4
0 => since for Apple (4+9+8)/2 and for Mango (6+8+7+8)/4
I can do this in a for loop by slicing iteratively df[:i]
, but for large DFs this is taking forever, any suggestions on a more pandas based approach will be great.
Applying the first solution to the following random DF:
a b c d1 d2 date
0 7 1 19 Apple Orange 2002-01-01
1 3 7 17 Mango lemon 2002-01-01
2 9 6 4 Apple lemon 2002-01-01
3 0 5 15 Apple Orange 2002-01-01
4 4 6 8 Mango lemon 2002-02-01
5 4 3 1 Mango Orange 2002-02-01
6 2 2 14 Apple lemon 2002-02-01
7 5 15 10 Mango Orange 2002-01-01
8 1 2 10 Apple lemon 2002-02-01
9 2 1 12 Apple Orange 2002-02-01
I get the following values for d1
:
0
0
0
1
0
0
1
0
1
0
the last value is wrong because at that point the value of Apple
is 12.33
(19+4+15+14+10+12)/6 and Mango
is 9
(17+8+1+10)/4 so the last value of d1
should be 1
.
Upvotes: 0
Views: 714
Reputation: 153460
a b c d1 d2 date
0 7 1 19 Apple Orange 2002-01-01
1 3 7 17 Mango lemon 2002-01-01
2 9 6 4 Apple lemon 2002-01-01
3 0 5 15 Apple Orange 2002-01-01
4 4 6 8 Mango lemon 2002-02-01
5 4 3 1 Mango Orange 2002-02-01
6 2 2 14 Apple lemon 2002-02-01
7 5 15 10 Mango Orange 2002-01-01
8 1 2 10 Apple lemon 2002-02-01
9 2 1 12 Apple Orange 2002-02-01
s = df.groupby('d1')['c'].expanding().mean().sort_index(level=1)
Outputs:
Apple 0 19.000000
Mango 1 17.000000
Apple 2 11.500000
3 12.666667
Mango 4 12.500000
5 8.666667
Apple 6 13.000000
Mango 7 9.000000
Apple 8 12.400000
9 12.333333
What do we need to do at this point? Are these averages correct?
And if I use s.diff().ge(0)
to compare averages you get:
Apple 0 0
Mango 1 0
Apple 2 0
3 1
Mango 4 0
5 0
Apple 6 1
Mango 7 0
Apple 8 1
9 0
IIUC,
Look at this:
df.groupby('d1')['c'].expanding().mean().sort_index(level=1)
Output:
Apple 0 4.00 #4
Mango 1 6.00 #6
Apple 2 6.50 #9+4 / 2
Mango 3 7.00 #6 + 8 / 2
4 7.00 #6 + 8 + 7 / 3
5 7.25 #6 + 8 + 7 + 8 / 4
Apple 6 7.00 #4 + 9 + 8 / 3
Name: c, dtype: float64
Now, let's compare to previous row:
df.groupby('d1')['c'].expanding().mean().sort_index(level=1).diff().ge(0).astype(int)
Output:
d1
Apple 0 0
Mango 1 1
Apple 2 1
Mango 3 1
4 1
5 1
Apple 6 0
Name: c, dtype: int32
Or maybe you need to compare Mango to last value of apple....
df.groupby('d1')['c'].expanding().mean().sort_index(level=1).unstack(0).ffill()
Output:
d1 Apple Mango
0 4.0 NaN
1 4.0 6.00
2 6.5 6.00
3 6.5 7.00
4 6.5 7.00
5 6.5 7.25
6 7.0 7.25
However, I can't match you expected output:
df.groupby('d1')['c'].expanding().mean().sort_index(level=1).unstack(0).ffill().eval('rank= Mango >= Apple')
Output:
d1 Apple Mango rank
0 4.0 NaN False
1 4.0 6.00 True
2 6.5 6.00 False
3 6.5 7.00 True
4 6.5 7.00 True
5 6.5 7.25 True
6 7.0 7.25 True
Upvotes: 1