Reputation: 439

How to calculate difference between non-consecutive rows in Python DataFrame?

I have following pandas DataFrame and trying to create a new "Value_Diff" column where it calculates the difference between current "value" if lable=0 with the previous value where label=1. If label=1 it sets the "Value_Diff" equals to 0. This process needs to be repeated for each group and if first label in the group is equal to 0 it should leave the "Value_Diff" equal to 0 until it reaches first label=1 and then follow the same logic ( group C in this example)

I can write a for loop and if statement for each individual group to do this, however was wondering if there is a better way to do this with using groupby, lambda or any other function.

here is the input:

 group  Date         Value   label   
 A     2020-03-01   -117    1           
 A     2020-03-02   -121    0             
 A     2020-03-03   -122    0           
 A     2020-03-04   -122    1           
 B     2020-03-05   -118    1           
 B     2020-03-06   -122    0           
 B     2020-03-07   -124    0           
 B     2020-03-08   -126    0           
 B     2020-03-09   -126    1           
 C     2020-03-10   -130    0           
 C     2020-03-11   -140    0           
 C     2020-03-12   -150    1           
 C     2020-03-13   -160    0

Answer should look like this:

 group  Date        Value  label   Value_Diff
 A     2020-03-01   -117    1         0
 A     2020-03-02   -121    0         4 (-117-(-121)=4)
 A     2020-03-03   -122    0         1
 A     2020-03-04   -122    1         0
 B     2020-03-05   -118    1         0
 B     2020-03-06   -122    0         4
 B     2020-03-07   -124    0         2
 B     2020-03-08   -126    0         2
 B     2020-03-09   -126    1         0
 C     2020-03-10   -130    0         0
 C     2020-03-11   -140    0         0  
 C     2020-03-12   -150    1         0  
 C     2020-03-13   -160    0         10

Sorry my first output didn't actually reflect what I wanted, because @BENY provided the solution to this output I'll leave this here to help others with the same question. Here is how the actual output should look like.

 group  Date        Value  label   Value_Diff
 A     2020-03-01   -117    1         0
 A     2020-03-02   -121    0         4 (-117-(-121)=4)
 A     2020-03-03   -122    0         5 (-117-(-122)=5)
 A     2020-03-04   -122    1         0
 B     2020-03-05   -118    1         0
 B     2020-03-06   -122    0         4 (-122-(-118)=4)
 B     2020-03-07   -124    0         6 (-124-(-118)=6)
 B     2020-03-08   -126    0         8 (-126-(-118)=8)
 B     2020-03-09   -126    1         0
 C     2020-03-10   -130    0         0
 C     2020-03-11   -140    0         0  
 C     2020-03-12   -150    1         0  
 C     2020-03-13   -160    0         10 (-150-(-160)=10)

Upvotes: 2

Answers (2)

Bill

Reputation: 11613

I find it's best to do these things one step at a time. I think this does (the second version) of what you want:

df['Value_Last'] = np.nan
df.loc[df.label == 1, 'Value_Last'] = df.loc[df.label == 1, 'Value']
df_grouped = df.groupby('group')
df['Value_Last'] = df_grouped['Value_Last'].ffill()
df['Value_Diff'] = (df['Value_Last'] - df['Value']).fillna(0)


   group        Date  Value  label  Value_Last  Value_Diff
0      A  2020-03-01   -117      1      -117.0         0.0
1      A  2020-03-02   -121      0      -117.0         4.0
2      A  2020-03-03   -122      0      -117.0         5.0
3      A  2020-03-04   -122      1      -122.0         0.0
4      B  2020-03-05   -118      1      -118.0         0.0
5      B  2020-03-06   -122      0      -118.0         4.0
6      B  2020-03-07   -124      0      -118.0         6.0
7      B  2020-03-08   -126      0      -118.0         8.0
8      B  2020-03-09   -126      1      -126.0         0.0
9      C  2020-03-10   -130      0         NaN         0.0
10     C  2020-03-11   -140      0         NaN         0.0
11     C  2020-03-12   -150      1      -150.0         0.0
12     C  2020-03-13   -160      0      -150.0        10.0

PS: If the 'Value' column contains only integers then you can use .astype(int) on the end of the last line to convert 'Value_Diff' back to integers.

Upvotes: 1

BENY

Reputation: 323226

We can try with cummax create the 1st condition, then use groupby with diff

s = df.groupby('group').label.cummax()
df['new'] = -df.groupby('group').Value.diff().fillna(0).where(s==1 & df.label.ne(1),0)
df
Out[135]: 
   group        Date  Value  label   new
0      A  2020-03-01   -117      1  -0.0
1      A  2020-03-02   -121      0   4.0
2      A  2020-03-03   -122      0   1.0
3      A  2020-03-04   -122      1  -0.0
4      B  2020-03-05   -118      1  -0.0
5      B  2020-03-06   -122      0   4.0
6      B  2020-03-07   -124      0   2.0
7      B  2020-03-08   -126      0   2.0
8      B  2020-03-09   -126      1  -0.0
9      C  2020-03-10   -130      0  -0.0
10     C  2020-03-11   -140      0  -0.0
11     C  2020-03-12   -150      1  -0.0
12     C  2020-03-13   -160      0  10.0

Upvotes: 2

How to calculate difference between non-consecutive rows in Python DataFrame?

Answers (2)

Related Questions