ANmike
ANmike

Reputation: 267

Finding difference between two columns based on condition in another column

Below is my data framedf, which has 3 variables and approx 100k data points,

   ID   A   B   C
   1    35  0   0
   2    28  0   0
   3    36  0   0
   4    99  0   0 
   5    25  1   0
   6    65  1   0
   7    98  1   0
   8    95  1   0
   9    67  0   65
  10    95  0   65 
  11    94  0   65
  12    4   0   65
  13    2   -1  0
  14    62  -1  0
  15    95  -1  0
  16    25  -1  0
  17    36  0   19.5
  18    3   0   19.5

Now, I would like to create another variables D which contains the difference between the Last data point of A where B == 1 or -1and the immediate next value in C where B == 1 or -1.

Expected output

    ID  A   B   C      D
     1  35  0   0      0
     2  28  0   0      0
     3  36  0   0      0
     4  99  0   0      0
     5  25  1   0      0
     6  65  1   0      0
     7  98  1   0      0
     8  95  1   0     30
     9  67  0   65     0
    10  95  0   65     0
    11  94  0   65     0
    12  4   0   65     0
    13  2   -1  0      0
    14  62  -1  0      0
    15  95  -1  0      0
    16  25  -1  0    5.5
    17  36  0   19.5    0
    18  3   0   19.5    0

Note
Eg. D8 = A8-c9, similarly D16 = A16 - C17 and this continues till the end of the data frame.

Update on the expected output

    ID  A   B   C       D
     1  35  0   0       0
     2  28  0   0       0
     3  36  0   0       0
     4  99  0   0       0
     5  25  1   0       0
     6  65  1   0       0
     7  98  1   0       0
     8  95  1   0       30
     9  67  0   65      0
    10  95  0   65      0
    11  94  1   65      0
    12  4   0   65      0
    13  2   -1  0       0
    14  62  -1  0       0
    15  95  -1  0       0
    16  25  -1  0      5.5
    17  36  0   19.5    0
    18  3   0   19.5    0  
    19  5   0   19.5    0  
    20  68  1   19.5    0 
    21  17  0   0       0 

Updated question

Another condition is that if the value in the column C is same for the next 4 values as then the program should not consider to subtract the values A and C Eg, B11 ==1 it has the change of value to 1 but program should not consider that data point to find the difference between A11 and C12 because value of C is still within the count of 4 when B has changed. Similarly for the Data point at B20 .

Upvotes: 2

Views: 587

Answers (2)

count
count

Reputation: 1338

Or with base R you could try:

  dat$new <- 0

  ind <- cumsum(rle(dat$B)$lengths)[rle(dat$B)$values%in%c(1,-1)]

  dat[ind,"new"] <- dat[ind,"A"] - lead(dat[,"C"])[ind]

Upvotes: 1

akrun
akrun

Reputation: 887891

We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), shift the 'C' column to get the next value ('C1'), grouped by the run-length-id of 'B' and specifying i (abs(B)==1), we assign 'D' as the difference between the last values of 'A' and 'C1' (A[.N] - C1[.N]), assign the columns that are not needed to NULL, and change the values in 'D' to 0 except for the last element determined by the grouping variable (rleid(B))

library(data.table)
setDT(df1)[, C1 := shift(C, type = 'lead')][abs(B)==1, 
             D := A[.N] - C1[.N], .(grp = rleid(B))][, C1 := NULL]
df1[df1[,.I[seq_len(.N) != .N] , rleid(B)]$V1, D := NA][is.na(D), D := 0][]
#    ID  A  B    C    D
# 1:  1 35  0  0.0  0.0
# 2:  2 28  0  0.0  0.0
# 3:  3 36  0  0.0  0.0
# 4:  4 99  0  0.0  0.0
# 5:  5 25  1  0.0  0.0
# 6:  6 65  1  0.0  0.0
# 7:  7 98  1  0.0  0.0
# 8:  8 95  1  0.0 30.0
# 9:  9 67  0 65.0  0.0
#10: 10 95  0 65.0  0.0
#11: 11 94  0 65.0  0.0
#12: 12  4  0 65.0  0.0
#13: 13  2 -1  0.0  0.0
#14: 14 62 -1  0.0  0.0
#15: 15 95 -1  0.0  0.0
#16: 16 25 -1  0.0  5.5
#17: 17 36  0 19.5  0.0
#18: 18  3  0 19.5  0.0

Upvotes: 3

Related Questions