Bo_0
Bo_0

Reputation: 25

Find the first incidence (row) of a value that is x amount greater/less than the current value (iterated through each row in a data frame)

I've been trying my best, but not quite getting there. I'm trying to iterate through the value in a vector (df$sample) and find the first proceeding incidence of a value that is 20% less than the current value. I am trying to find this for each row (sample) and print the date of the found value to a new column.

Here's my df:

    date       sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279
...

My shotty attempts have been to use Position() or which(). I thought maybe I could wrap either of them in a for loop, but my attempts are not quite right.

for(i in length(df){

df$conc20 <- Position(function(x) x < df$sample[i]*0.80, df$sample)
}

or

for(i in length(df){

df$conc20 <- min(which(df$sample < df$sample[i]*0.8)

}

I even found a dply example that got close to what I was looking for.

Ideally:

    date       sample   conc20
591 2020-02-14 0.008470 2020-02-25
590 2020-02-15 0.008460 ...
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
...

Any clarification I'm happy to provide. I really do appreciate the help!

Upvotes: 1

Views: 456

Answers (3)

Uwe
Uwe

Reputation: 42564

If I understand correctly, this can be solved by a non-equi self join using two helper columns:

library(data.table)
setDT(df)[, rn := .I][, threshold := 0.8 * sample][
  , conc20 := df[df, on = .(rn > rn, sample < threshold), mult = "first", x.date]][
    , c("rn", "threshold") := NULL][]
          date   sample     conc20
 1: 2020-02-14 0.008470 2020-02-20
 2: 2020-02-15 0.008460 2020-02-20
 3: 2020-02-16 0.007681 2020-02-27
 4: 2020-02-17 0.007144 2020-02-27
 5: 2020-02-18 0.007262 2020-02-27
 6: 2020-02-19 0.007300 2020-02-27
 7: 2020-02-20 0.006604       <NA>
 8: 2020-02-21 0.006843 2020-02-27
 9: 2020-02-22 0.006687 2020-02-27
10: 2020-02-23 0.006991 2020-02-27
11: 2020-02-24 0.007333 2020-02-27
12: 2020-02-25 0.006738 2020-02-27
13: 2020-02-26 0.006279       <NA>
14: 2020-02-27 0.005300       <NA>

Explanation

The first condition in the on = clause ensures that only succeeding rows are considered, the second condition looks for sample < threshold where threshold has been defined beforehand as 80% of sample. The helper column rn contains row numbers (created via the special symbol .I). In addition, mult = "first" tells to pick the first occurrence in case of multiple matches.

The result is appended as additional column conc20 by reference, i.e., without copying the whole dataset. Finally, the two helper columns are removed by reference.

Note that chaining is used.

For demonstration, the result of the non-equi self join including all helper columns can be shown:

setDT(df)[, rn := .I][, threshold := 0.8 * sample][
  df, on = .(rn > rn, sample < threshold), mult = "first"]
          date    sample rn threshold     i.date i.sample
 1: 2020-02-20 0.0067760  1 0.0052832 2020-02-14 0.008470
 2: 2020-02-20 0.0067680  2 0.0052832 2020-02-15 0.008460
 3: 2020-02-27 0.0061448  3 0.0042400 2020-02-16 0.007681
 4: 2020-02-27 0.0057152  4 0.0042400 2020-02-17 0.007144
 5: 2020-02-27 0.0058096  5 0.0042400 2020-02-18 0.007262
 6: 2020-02-27 0.0058400  6 0.0042400 2020-02-19 0.007300
 7:       <NA> 0.0052832  7        NA 2020-02-20 0.006604
 8: 2020-02-27 0.0054744  8 0.0042400 2020-02-21 0.006843
 9: 2020-02-27 0.0053496  9 0.0042400 2020-02-22 0.006687
10: 2020-02-27 0.0055928 10 0.0042400 2020-02-23 0.006991
11: 2020-02-27 0.0058664 11 0.0042400 2020-02-24 0.007333
12: 2020-02-27 0.0053904 12 0.0042400 2020-02-25 0.006738
13:       <NA> 0.0050232 13        NA 2020-02-26 0.006279
14:       <NA> 0.0042400 14        NA 2020-02-27 0.005300

Data

library(data.table)
df <- fread("
i   date       sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279
580 2020-02-27 0.005300
", drop = 1L)

Upvotes: 1

Jarn Sch&#246;ber
Jarn Sch&#246;ber

Reputation: 319

Quite messy, but this should do the trick

library(dplyr)
df<- read.csv( sep = " ",  text=
                 "row date sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279", 
               
)

x <- 1.05

df <- df %>%
  mutate(id =  1:n()) %>% 
  rowwise %>% 
  mutate(greater_row = 
           first(which(sample*x <
                         df$sample[id:nrow(df)]) + 
                   id-1))
df$greater_row <- df$date[df$greater_row]

This should allow you to set x to any factor you want you want

Upvotes: 1

R.S.
R.S.

Reputation: 2140

Edited Answer

df<- read.csv( sep = " ",  text=
                 "row date sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279",                    
)
df$date=as.Date(as.character(df$date))
df   

#there is no row 20% below, so I am just using 2% below 
# and multiplying 0.98 instead of 0.8

# Finding cross-over before current row    
f_crossover_before<- function(  i  ){
  cutoff= 0.98* df$sample[i]
  res<- max(which( df$sample[1:i]<= cutoff), -1)
  ifelse ( (res>0) , res , NA )  # sapply cannot return dates !
}

# Finding cross-over after  current row   
f_crossover_after<- function(  i  ){
  cutoff<- 0.98* df$sample[i]
  res<- min( i+which( df$sample[(i+1):nrow(df)]<= cutoff), 
        .Machine$integer.max )
  ifelse ( (res<.Machine$integer.max) , res , NA )
}



# A column for  comparison. Only for visual inspection 
df$cutoff<- df$sample*0.98 


df$crossover_before<- sapply( seq_along(df$sample) ,  FUN = f_crossover_before )
df$crossover_before<- df$date[df$crossover_before]

df$crossover_after<- sapply( seq_along(df$sample) ,  FUN = f_crossover_after)
df$crossover_after<- df$date[df$crossover_after]




#View(df)

Output :

#   row       date   sample     cutoff crossover_before crossover_after
#1  591 2020-02-14 0.008470 0.00830060             <NA>      2020-02-16
#2  590 2020-02-15 0.008460 0.00829080             <NA>      2020-02-16
#3  589 2020-02-16 0.007681 0.00752738             <NA>      2020-02-17
#4  588 2020-02-17 0.007144 0.00700112             <NA>      2020-02-20
#5  587 2020-02-18 0.007262 0.00711676             <NA>      2020-02-20
#6  586 2020-02-19 0.007300 0.00715400       2020-02-17      2020-02-20
#7  585 2020-02-20 0.006604 0.00647192             <NA>      2020-02-26
#8  584 2020-02-21 0.006843 0.00670614       2020-02-20      2020-02-22
#9  583 2020-02-22 0.006687 0.00655326             <NA>      2020-02-26
#10 582 2020-02-23 0.006991 0.00685118       2020-02-22      2020-02-25
#11 581 2020-02-24 0.007333 0.00718634       2020-02-23      2020-02-25
#12 580 2020-02-25 0.006738 0.00660324             <NA>      2020-02-26
#13 579 2020-02-26 0.006279 0.00615342             <NA>            <NA>

Upvotes: 1

Related Questions