Reputation: 57

For loop over column groups dataframe R

I am trying to loop over each group in the location column, across consecutive dates (the date column is of Date type, d.m.y), and find the latest date (row) that does not meet the criteria (daily_percentage < 10 for 3 consecutive days).

Here is an example of df:

date(d.m.y)	location	daily_percentage
02.01.2021	x	3.2
02.01.2021	y	15.6
02.01.2021	z	1.2
02.01.2021	m	7.6
02.01.2021	n	8
03.01.2021	x	10.5
03.01.2021	y	3.4
03.01.2021	z	7
03.01.2021	m	3.6
03.01.2021	n	6
04.01.2021	x	11.6
04.01.2021	y	3.5
04.01.2021	z	1.1
04.01.2021	m	5.8
04.01.2021	n	3.2
05.01.2021	x	0.4
05.01.2021	y	2
05.01.2021	z	12
05.01.2021	m	9.1
05.01.2021	n	17.9
06.01.2021	x	4.9
06.01.2021	y	15.8
06.01.2021	z	9
06.01.2021	m	16
06.01.2021	n	3

So the output should contain the rows that do not meet the criteria:

date(d.m.y)	location	daily_percentage
05.01.2021	m	9.1
05.01.2021	y	2
04.01.2021	z	1.1
04.01.2021	n	3.2
02.01.2021	x	3.2

The code that I wrote does not work properly when looping through each group:

location_names <- unique(df$location)

df$last_three_days <- FALSE
first_row_with_false <- 0
three_consequtive_out_of_order <- FALSE
for (locat in location_names){


for(row in 5:nrow(df)) { # ignoring the first few rows

  if (three_consequtive_out_of_order == FALSE) {

     today_below_10 <- (df[row,]$daily_percentage) < 10

    yesterday_below_10 <- (df[row-1,]$daily_percentage) < 10

    day_minus_2_below_10 <- (df[row-2,]$daily_percentage) < 10

    three_consequtive <- today_below_10 & yesterday_below_10 & day_minus_2_below_10

    df[row,]$last_three_days <- three_consequtive

    if ((!is.na(three_consequtive)) & (three_consequtive == FALSE) ) {

      first_row_with_false <- row

      three_consequtive_out_of_order <- TRUE

    }
   }
  }
}

cutoff_date <- df[first_row_with_false-1,]$date

My code (without this part: three_consequtive_out_of_order <- FALSE; for (locat in location_names)) produces the desired output if we assume the location column is not grouped.

Is there an efficient way to write this code so that it works on grouped data? Sorry for the very naive question; I am new to for loops and R.

TIA!

P.S. The date column is of Date type (d.m.y).

Upvotes: 0

Answers (2)

IceCreamToucan

Reputation: 28705

Mostly the same as @rawr, but avoiding data.table::rleid

df <- structure(list(date.d.m.y. = structure(c(18629, 18629, 18629, 18629, 18629, 18630, 18630, 18630, 18630, 18630, 18631, 18631, 18631, 18631, 18631, 18632, 18632, 18632, 18632, 18632, 18633, 18633, 18633, 18633, 18633), class = "Date"), location = c("x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n"), daily_percentage = c(3.2, 15.6, 1.2, 7.6, 8, 10.5, 3.4, 7, 3.6, 6, 11.6, 3.5, 1.1, 5.8, 3.2, 0.4, 2, 12, 9.1, 17.9, 4.9, 15.8, 9, 16, 3)), row.names = c(NA, -25L), class = "data.frame")

f <- function(data, nday = 3, pct = 10) {
  i <- 
    with(rle(data$daily_percentage < pct), {
      n_fails <- values & lengths >= nday
      if (!any(n_fails)) 0 # return nothing if no matching row
      else cumsum(lengths)[tail(which(n_fails), 1)]
    })
  data[i,]
}

do.call(rbind, by(df, list(df$location), FUN = f))
#>   date.d.m.y. location daily_percentage
#> m  2021-01-05        m              9.1
#> n  2021-01-04        n              3.2
#> y  2021-01-05        y              2.0
#> z  2021-01-04        z              1.1

Created on 2022-01-27 by the reprex package (v2.0.1)
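To see why the `rle()` approach lands on the right row, here is a minimal sketch that walks through the same steps for a single location. The values are the `daily_percentage` entries for location "z" from the question; the variable names (`pct_z`, `runs`, `run_end`, `qualifies`, `last_row`) are illustrative and not part of the answer's function.

```r
# daily_percentage for location "z", ordered by date (02.01 to 06.01)
pct_z <- c(1.2, 7, 1.1, 12, 9)

below <- pct_z < 10          # TRUE TRUE TRUE FALSE TRUE
runs  <- rle(below)          # lengths: 3 1 1; values: TRUE FALSE TRUE

# index of the last element of each run
run_end <- cumsum(runs$lengths)               # 3 4 5

# runs that are below the threshold and at least 3 rows long
qualifies <- runs$values & runs$lengths >= 3  # TRUE FALSE FALSE

# last row of the latest qualifying run (integer(0) if there is none)
last_row <- run_end[tail(which(qualifies), 1)]
last_row
#> [1] 3
```

Row 3 of location "z" is 04.01.2021 with 1.1, which matches the expected output in the question.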

Upvotes: 1

rawr

Reputation: 20811

I know for loops are easy to grasp and applicable in other languages, but as a new R user, I would highly suggest that you try to reduce complex tasks down to the simplest part and write a function that does only that one thing.

So, for example, assume you have only one location, and write the code that does the one thing:

test for at least three < 10% in a row and on consecutive days; if that happens, take the last row
if none of that happens, take the first row that is < 10% (you did not explicitly state this but it seems to be what you want based on the output)

First, the function that does one thing

f <- function(data, nday = 3, pct = 10) {
  ii <- data$daily_percentage < pct
  ## turn sequences of true/false into group IDs
  id <- data.table::rleid(ii)
  ## only care about those that are at least nday long and fail pct criterion
  idx <- (ave(id, id, FUN = length) >= nday) & ii
  ## take the last one that is true
  idx <- tail(which(idx), 1L)
  
  ## if none are found, take the first row that failed pct criterion
  if (!length(idx))
    idx <- which(ii)[1L]
  
  data[idx, ]
}

I am also assuming you have formatted your data properly--as it appears now, your dates are character strings.
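If the dates do arrive as d.m.y text, a conversion along these lines should work. This is only a sketch: the `raw` data frame and its column names are made up for illustration, and the format string assumes four-digit years with dots as separators.

```r
# Hypothetical raw input where the dates arrive as d.m.y text
raw <- data.frame(
  date = c("02.01.2021", "03.01.2021", "04.01.2021"),
  daily_percentage = c(3.2, 10.5, 11.6)
)

# %d = day, %m = month, %Y = four-digit year
raw$date <- as.Date(raw$date, format = "%d.%m.%Y")

class(raw$date)
#> [1] "Date"

# gaps between successive rows, in days
diff(raw$date)
```

Once the column is a real Date, sorting and day-gap checks behave as expected, which is not the case for d.m.y character strings.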

So take your one-thing function, split your data into the parts that will work with the function, apply it, and combine the results

df <- structure(list(date.d.m.y. = structure(c(18629, 18629, 18629, 18629, 18629, 18630, 18630, 18630, 18630, 18630, 18631, 18631, 18631, 18631, 18631, 18632, 18632, 18632, 18632, 18632, 18633, 18633, 18633, 18633, 18633), class = "Date"), location = c("x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n", "x", "y", "z", "m", "n"), daily_percentage = c(3.2, 15.6, 1.2, 7.6, 8, 10.5, 3.4, 7, 3.6, 6, 11.6, 3.5, 1.1, 5.8, 3.2, 0.4, 2, 12, 9.1, 17.9, 4.9, 15.8, 9, 16, 3)), row.names = c(NA, -25L), class = "data.frame")
sp <- split(df, df$location)
sp <- lapply(sp, function(x) f(x, 3, 10))
do.call('rbind', sp)

#   date.d.m.y. location daily_percentage
# m  2021-01-05        m              9.1
# n  2021-01-04        n              3.2
# x  2021-01-02        x              3.2
# y  2021-01-05        y              2.0
# z  2021-01-04        z              1.1
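Note that the function above counts consecutive rows, so it relies on each location having one row per day with no missing dates. If that cannot be guaranteed, a variant like the following could also break a run at a date gap. This is a base R sketch under assumptions: `last_of_run` and its `date_col` parameter are hypothetical names, the date column must already be of class Date, and the fallback to the first row below the threshold is deliberately left out.

```r
# Hypothetical variant of the per-location function: a run is broken by a
# value >= pct OR by a gap of more than one day between successive rows.
last_of_run <- function(data, date_col = "date.d.m.y.", nday = 3, pct = 10) {
  data  <- data[order(data[[date_col]]), ]
  below <- data$daily_percentage < pct
  # TRUE where a new run starts: the flag changes or the dates are not adjacent
  new_run <- c(TRUE,
               diff(below) != 0 | as.numeric(diff(data[[date_col]])) != 1)
  id <- cumsum(new_run)
  # rows that belong to a below-threshold run of at least nday rows
  ok <- below & ave(id, id, FUN = length) >= nday
  # last such row; a zero-row data frame if there is none
  data[tail(which(ok), 1L), ]
}

# Three rows below 10, but 04.01 is missing, so no run of 3 consecutive days
gappy <- data.frame(
  date.d.m.y. = as.Date(c("2021-01-02", "2021-01-03", "2021-01-05")),
  location = "x",
  daily_percentage = c(1, 2, 3)
)
nrow(last_of_run(gappy))
#> [1] 0
```

It can be applied per location in the same way as above, for example with `lapply(split(df, df$location), last_of_run)` followed by `do.call('rbind', ...)`.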

Upvotes: 2
