Divide R-dataframe-column into blocks of n-rows and provide an id starting at the last value of another group

Question

This title is really horrible, but I just cant find a solution to this. I have a dataframe that looks this this:

structure(list(PIFF_ID = c(371700, 371700, 371700, 371700, 371700, 
371700, 371700, 371700, 371700, 371700, 371700, 371700, 371700, 
371700, 371700, 371700, 371700, 371700, 371700, 371700, 450200, 
450200, 450200, 450200, 450200, 450200, 450200, 450200, 450200, 
450200, 450200, 450200, 450200, 450200, 450200, 450200, 450200, 
450200, 450200, 450200, 1600200, 1600200, 1600200, 1600200, 1600200, 
1600200, 1600200, 1600200, 1600200, 1600200, 1600200, 1600200, 
1600200, 1600200, 1600200, 1600200, 1600200, 1600200, 1600200, 
1600200, 2060200, 2060200, 2060200, 2060200, 2060200, 2060200, 
2060200, 2060200, 2060200, 2060200, 2060200, 2060200, 2060200, 
2060200, 2060200, 2060200, 2060200, 2060200, 2060200, 2060200
), days_before_event = c(180L, 179L, 178L, 177L, 176L, 175L, 
174L, 173L, 172L, 171L, 170L, 169L, 168L, 167L, 166L, 165L, 164L, 
163L, 162L, 161L, 180L, 179L, 178L, 177L, 176L, 175L, 174L, 173L, 
172L, 171L, 170L, 169L, 168L, 167L, 166L, 165L, 164L, 163L, 162L, 
161L, 180L, 179L, 178L, 177L, 176L, 175L, 174L, 173L, 172L, 171L, 
170L, 169L, 168L, 167L, 166L, 165L, 164L, 163L, 162L, 161L, 180L, 
179L, 178L, 177L, 176L, 175L, 174L, 173L, 172L, 171L, 170L, 169L, 
168L, 167L, 166L, 165L, 164L, 163L, 162L, 161L)), row.names = c(NA, 
-80L), groups = structure(list(PIFF_ID = c(371700, 450200, 1600200, 
2060200), .rows = structure(list(1:20, 21:40, 41:60, 61:80), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

What I want to do, and I cannot figure out how, is the following. For each group of PIFF_ID I want to add a column that joins together 18-consequent rows and gives them an id. This is supposed to start (having the lowest ID for each PIFF_ID) at the lowest value of days_before_event.

So the result should look more or less like this (which I created manually):

df_manual = data.frame(
  PIFF_ID = rep(c("a", "b"), each = 36),
  days_before_event = rep(36:1, 2),
  window_18_days = rep(2:1, each = 18)
)

   PIFF_ID days_before_event window_18_days
1        a                36              2
2        a                35              2
3        a                34              2
4        a                33              2
5        a                32              2
6        a                31              2
7        a                30              2
8        a                29              2
9        a                28              2
10       a                27              2
11       a                26              2
12       a                25              2
13       a                24              2
14       a                23              2
15       a                22              2
16       a                21              2
17       a                20              2
18       a                19              2
19       a                18              1
20       a                17              1
21       a                16              1
22       a                15              1
23       a                14              1
24       a                13              1
25       a                12              1
26       a                11              1
27       a                10              1
28       a                 9              1
29       a                 8              1
30       a                 7              1
31       a                 6              1
32       a                 5              1
33       a                 4              1
34       a                 3              1
35       a                 2              1
36       a                 1              1

In my real dataset the row-numbers are also not always a divisible without remainder by 18. So the last rows could end with a window_18_days-value of anything between 1 and 18.

PPK · Accepted Answer

This is one possible solution using only base R

Basically I split the data frame by PIFF_ID and then see in how many 18 line chunks I can split the data. Then I repeat the numbers from 1 to the number of chunks 18 times and add it to the result. In the end everything is merged back into a df

This should also work if one of your PIFF_ID goups contains not a multiple of 18 data points

added_automatic_annotation <- bind_rows(lapply(unique(test$PIFF_ID), function(x){
  # selecting the current PIFF_ID 
  sub_df <- test[test$PIFF_ID == x, ]
  # how often to repeat each label (allows for the final group to be shorter than 18)
  reps <- sapply(split(sub_df$PIFF_ID, ceiling(seq_along(sub_df$PIFF_ID)/18)), function(x) length(x))
  # sorting the df 
  sub_df <- sub_df[order(sub_df$days_before_event, decreasing = T), ]
  # add the anntotation by repeating each level 18 or less times (starting at the end because of the decending order imposed above)
  sub_df$automatic_label <- rep(c(length(reps):1), times = reps)
  return(sub_df)
}))

Divide R-dataframe-column into blocks of n-rows and provide an id starting at the last value of another group

Answers (1)

Related Questions