Reputation: 57
I am trying to loop over each group in the location column, across consecutive dates (the date column is of Date type, d.m.y), and find the latest date (row) that does not meet the criterion. The criterion is violated when daily_percentage is < 10 for 3 consecutive days.
Here is an example of df:
date(d.m.y) | location | daily_percentage |
---|---|---|
02.01.2021 | x | 3.2 |
02.01.2021 | y | 15.6 |
02.01.2021 | z | 1.2 |
02.01.2021 | m | 7.6 |
02.01.2021 | n | 8 |
03.01.2021 | x | 10.5 |
03.01.2021 | y | 3.4 |
03.01.2021 | z | 7 |
03.01.2021 | m | 3.6 |
03.01.2021 | n | 6 |
04.01.2021 | x | 11.6 |
04.01.2021 | y | 3.5 |
04.01.2021 | z | 1.1 |
04.01.2021 | m | 5.8 |
04.01.2021 | n | 3.2 |
05.01.2021 | x | 0.4 |
05.01.2021 | y | 2 |
05.01.2021 | z | 12 |
05.01.2021 | m | 9.1 |
05.01.2021 | n | 17.9 |
06.01.2021 | x | 4.9 |
06.01.2021 | y | 15.8 |
06.01.2021 | z | 9 |
06.01.2021 | m | 16 |
06.01.2021 | n | 3 |
So the output should print the rows that do not meet the criterion. The output should be:
date(d.m.y) | location | daily_percentage |
---|---|---|
05.01.2021 | m | 9.1 |
05.01.2021 | y | 2 |
04.01.2021 | z | 1.1 |
04.01.2021 | n | 3.2 |
02.01.2021 | x | 3.2 |
The code that I wrote does not work properly when looping through each group:
location_names <- unique(df$location)
df$last_three_days <- FALSE
first_row_with_false <- 0
three_consequtive_out_of_order <- FALSE
for (locat in location_names){
  for(row in 5:nrow(df)) { # ignoring the first few rows
    if (three_consequtive_out_of_order == FALSE) {
      today_below_10 <- (df[row,]$daily_percentage) < 10
      yesterday_below_10 <- (df[row-1,]$daily_percentage) < 10
      day_minus_2_below_10 <- (df[row-2,]$daily_percentage) < 10
      three_consequtive <- today_below_10 & yesterday_below_10 & day_minus_2_below_10
      df[row,]$last_three_days <- three_consequtive
      if ((!is.na(three_consequtive)) & (three_consequtive == FALSE) ) {
        first_row_with_false <- row
        three_consequtive_out_of_order <- TRUE
      }
    }
  }
}
cutoff_date <- df[first_row_with_false-1,]$date
My code, without the part `three_consequtive_out_of_order <- FALSE` and `for (locat in location_names)`, produces the desired output if we assume the location column is not grouped.
Is there an efficient way to write this code so that it works on grouped data? Sorry for the very naive question; I am new to for loops and R.
TIA!
P.S. The date column is of Date type (d.m.y).
Upvotes: 0
Views: 82
Reputation: 28705
Mostly the same as @rawr's answer, but avoiding data.table::rleid.
df <- structure(list(date.d.m.y. = structure(c(18629, 18629, 18629, 18629, 18629, 18630, 18630, 18630, 18630, 18630, 18631, 18631, 18631, 18631, 18631, 18632, 18632, 18632, 18632, 18632, 18633, 18633, 18633, 18633, 18633), class = "Date"), location = c("x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n"), daily_percentage = c(3.2, 15.6, 1.2, 7.6, 8, 10.5, 3.4, 7, 3.6, 6, 11.6, 3.5, 1.1, 5.8, 3.2, 0.4, 2, 12, 9.1, 17.9, 4.9, 15.8, 9, 16, 3)), row.names = c(NA, -25L), class = "data.frame")
f <- function(data, nday = 3, pct = 10) {
  i <-
    with(rle(data$daily_percentage < pct), {
      n_fails <- values & lengths >= nday
      if (!any(n_fails)) 0 # return nothing if no matching row
      else cumsum(lengths)[tail(which(n_fails), 1)]
    })
  data[i,]
}
do.call(rbind, by(df, list(df$location), FUN = f))
#>   date.d.m.y. location daily_percentage
#> m  2021-01-05        m              9.1
#> n  2021-01-04        n              3.2
#> y  2021-01-05        y              2.0
#> z  2021-01-04        z              1.1
Created on 2022-01-27 by the reprex package (v2.0.1)
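To see why `rle()` plus `cumsum()` lands on the right row, here is a minimal walk-through for a single location. It is only a sketch: the vector `pct_y` (the `y` values from the example data) and the names `long_runs`, `run_ends` and `last_row` are introduced here for illustration and are not part of the function above.

```r
# Illustrative single-location vector: the `y` values from the example data
pct_y <- c(15.6, 3.4, 3.5, 2, 15.8)

# Run-length encode the "below 10" indicator
r <- rle(pct_y < 10)
r$values   # FALSE  TRUE FALSE
r$lengths  # 1 3 1

# Runs that are below the threshold AND at least 3 days long
long_runs <- r$values & r$lengths >= 3

# cumsum(lengths) gives the last row index of every run
run_ends <- cumsum(r$lengths)   # 1 4 5

# Last row of the most recent qualifying run
last_row <- run_ends[tail(which(long_runs), 1)]
last_row   # 4, i.e. the 05.01.2021 row for `y`
```

For a location with no qualifying run (such as `x`), `which(long_runs)` is empty, which is the case the `if (!any(n_fails)) 0` branch handles by returning zero rows.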
Upvotes: 1
Reputation: 20811
I know for loops are easy to grasp and carry over to other languages, but for a new R user I would strongly suggest reducing a complex task to its simplest part and writing a function that does only that one thing.
So, for example, suppose you have only one location and write the code that handles just that case.
First, the function that does one thing:
f <- function(data, nday = 3, pct = 10) {
  ii <- data$daily_percentage < pct
  ## turn sequences of true/false into group IDs
  id <- data.table::rleid(ii)
  ## only care about those that are at least nday long and fail pct criterion
  idx <- (ave(id, id, FUN = length) >= nday) & ii
  ## take the last one that is true
  idx <- tail(which(idx), 1L)
  ## if none are found, take the first row that failed pct criterion
  if (!length(idx))
    idx <- which(ii)[1L]
  data[idx, ]
}
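To make the intermediate vectors concrete, here is a small sketch for one location where the fallback branch is taken (the `x` values from the example data). It uses a base R stand-in for `data.table::rleid()` so it runs without extra packages; that stand-in and the names `pct_x` and `run_len` are assumptions made for illustration, not part of the function above.

```r
# Illustrative single-location vector: the `x` values from the example data
pct_x <- c(3.2, 10.5, 11.6, 0.4, 4.9)
ii <- pct_x < 10                       # TRUE FALSE FALSE TRUE TRUE

# Base R stand-in for data.table::rleid(): the ID increases whenever the value changes
id <- cumsum(c(TRUE, ii[-1] != ii[-length(ii)]))   # 1 2 2 3 3

# Length of the run each row belongs to
run_len <- ave(id, id, FUN = length)   # 1 2 2 2 2

# Rows inside a below-threshold run of at least 3 days: none for `x`
idx <- which(run_len >= 3 & ii)

# Fallback, as in f(): first row below the threshold
if (!length(idx)) idx <- which(ii)[1L]
idx   # 1, i.e. the 02.01.2021 row for `x`
```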
I am also assuming you have formatted your data properly--as it appears now, your dates are character strings.
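If the dates are still d.m.y character strings, a conversion along these lines may be needed first. This is only a sketch on a hypothetical data frame called `raw`; it also sorts by location and date, on the assumption that the rows are not already in date order, because the run detection depends on row order.

```r
# Hypothetical input where the dates are still d.m.y character strings
raw <- data.frame(
  date = c("03.01.2021", "02.01.2021", "04.01.2021"),
  location = "x",
  daily_percentage = c(10.5, 3.2, 11.6)
)

# Parse day.month.year strings into Date objects
raw$date <- as.Date(raw$date, format = "%d.%m.%Y")

# Consecutive-day logic relies on row order, so sort by location and date
raw <- raw[order(raw$location, raw$date), ]
class(raw$date)   # "Date"
```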
So take your one-thing function, split your data into the parts that will work with the function, apply it, and combine the results
df <- structure(list(date.d.m.y. = structure(c(18629, 18629, 18629, 18629, 18629, 18630, 18630, 18630, 18630, 18630, 18631, 18631, 18631, 18631, 18631, 18632, 18632, 18632, 18632, 18632, 18633, 18633, 18633, 18633, 18633), class = "Date"), location = c("x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n"), daily_percentage = c(3.2, 15.6, 1.2, 7.6, 8, 10.5, 3.4, 7, 3.6, 6, 11.6, 3.5, 1.1, 5.8, 3.2, 0.4, 2, 12, 9.1, 17.9, 4.9, 15.8, 9, 16, 3)), row.names = c(NA, -25L), class = "data.frame")
sp <- split(df, df$location)
sp <- lapply(sp, function(x) f(x, 3, 10))
do.call('rbind', sp)
#   date.d.m.y. location daily_percentage
# m  2021-01-05        m              9.1
# n  2021-01-04        n              3.2
# x  2021-01-02        x              3.2
# y  2021-01-05        y              2.0
# z  2021-01-04        z              1.1
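Since the question asked about for loops, the split/apply/combine steps above could also be written as an explicit loop over locations. The following is a self-contained sketch under a few assumptions: the helper `last_run_row()` is a hypothetical name, it uses base R `rle()` instead of `data.table::rleid()`, and the data frame is rebuilt with a plain `date` column rather than `date.d.m.y.`.

```r
# Example data rebuilt compactly (same values as the dput() above)
df <- data.frame(
  date = rep(as.Date("2021-01-02") + 0:4, each = 5),
  location = rep(c("x", "y", "z", "m", "n"), times = 5),
  daily_percentage = c(3.2, 15.6, 1.2, 7.6, 8,
                       10.5, 3.4, 7, 3.6, 6,
                       11.6, 3.5, 1.1, 5.8, 3.2,
                       0.4, 2, 12, 9.1, 17.9,
                       4.9, 15.8, 9, 16, 3)
)

# Hypothetical helper: per-location logic for one already-sorted group
last_run_row <- function(data, nday = 3, pct = 10) {
  below <- data$daily_percentage < pct
  r <- rle(below)
  ok <- which(r$values & r$lengths >= nday)
  if (length(ok)) {
    i <- cumsum(r$lengths)[tail(ok, 1)]   # last row of the latest long run
  } else {
    i <- which(below)[1L]                 # fallback: first day below pct
  }
  data[i, ]
}

# Explicit loop over groups: one result row per location
pieces <- list()
for (loc in unique(df$location)) {
  pieces[[loc]] <- last_run_row(df[df$location == loc, ])
}
result <- do.call(rbind, pieces)
result
```

The loop does the same job as `split()` plus `lapply()`; the only difference is that the rows come back in order of first appearance (`x`, `y`, `z`, `m`, `n`) rather than alphabetically.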
Upvotes: 2