Michael Matta
Michael Matta

Reputation: 384

Identify the highest number of consecutive numbers in a dataframe and add rows with NA

Here is a reproducible example of the situation I need help for. I have a database (db1) in which weekly ratings of behavioral outcomes are recorded. The variable "Week" corresponds to the number of the week from the beginning of the year (e.g., Week = 1 indicates the week between January 1st and 7th, and so on...) and the variable "Score" to the value obtained by the subject on the criterion measure. In the real data set, I have several participants and a different number of ratings for each subject; however, in this example there is only one subject to make things easier.

library(magrittr)

x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))
db1
#  Week Score
#1   14    34
#2   18    21
#3   19    45
#4   20    32
#5   21    56
#6   23    45
#7   24    23
#8   25    48

What I need to do is to identify the highest number of ratings occurred in consecutive weeks in the database. In the example, the highest number is 4 because the ratings were consecutive from week 18 to 21. Here I added a column for demonstration, but it might not be necessary for the solution.

x2 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y2 <- c(34, 21, 45, 32, 56, 45, 23, 48)
z2 <- c(1, 1, 2, 3, 4, 1, 2, 3)
db2 <- cbind(x2, y2, z2) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db2
#  Week Score Consecutive
#1   14    34           1
#2   18    21           1
#3   19    45           2
#4   20    32           3
#5   21    56           4
#6   23    45           1
#7   24    23           2
#8   25    48           3

Finally, because every subject has to have a total of five consecutive ratings, I need to add a row with a missing datum where the highest number of consecutive weeks is below five (so that I can impute the missing data later on). However, there might be ratings before and after the sequence. If that is the case, I want to add the row based on the minimal distance between the first or last week of the longest series of consecutive weeks from the other existing rating. In the example, that means that the row with missing datum will be added after 21 because there are 4 missing weeks between week 14 and 18 whereas only 1 between week 21 and 23.

x3 <- c(14, 18, 19, 20, 21, 22, 23, 24, 25)
y3 <- c(34, 21, 45, 32, 56, NA, 45, 23, 48)
z3 <- c(1, 1, 2, 3, 4, 5, 1, 2, 3)
db3 <- cbind(x3, y3, z3) %>% as.data.frame() %>% setNames(c("Week", "Score", "Consecutive"))
db3
#  Week Score Consecutive
#1   14    34           1
#2   18    21           1
#3   19    45           2
#4   20    32           3
#5   21    56           4
#6   22    NA           5
#7   23    45           1
#8   24    23           2
#9   25    48           3

For your information, this is not going to be part of the main statistical analyses but rather one of several ways I want to use to test the sensitivity of my model. So do not worry about whether it makes sense from a methodological point of view. In addition, if possible, a tidyverse solution would be greatly appreciated.

Thanks so much to anyone who will take the time.

Upvotes: 1

Views: 99

Answers (3)

AnilGoyal
AnilGoyal

Reputation: 26218

The code is relatively easier, if you want to do it just for max group and if more than one, just for one.

db1 %>% mutate(consecutive = accumulate(diff(Week), .init = 1, ~if(.y == 1) { .x +1} else {1}),
               dummy = max(consecutive) == consecutive & max(consecutive) < 5) %>%
  group_by(grp = cumsum(consecutive == 1)) %>%
  filter(sum(dummy) > 0) %>% #filter out group(s) with max consecutive
  ungroup() %>% select(-dummy) %>%
  filter(grp == min(grp)) %>% # filter out first such group, if there are more than 1 
  complete(consecutive = 1:5) %>%
  select(-grp) %>%
  mutate(Week = first(Week) + consecutive -1)

# A tibble: 5 x 3
  consecutive  Week Score
        <dbl> <dbl> <dbl>
1           1    18    21
2           2    19    45
3           3    20    32
4           4    21    56
5           5    22    NA

OLD ANSWER Another tidyverse strategy (this can be modified to suit your additional column requirements which you have not given in sample)


library(tidyverse)

db1
#>   Week Score
#> 1   14    34
#> 2   18    21
#> 3   19    45
#> 4   20    32
#> 5   21    56
#> 6   23    45
#> 7   24    23
#> 8   25    48

library(data.table)

db1 %>% mutate(consecutive = accumulate(diff(Week), .init = 1, ~if(.y == 1) { .x +1} else {1}),
               dummy = max(consecutive) == consecutive & max(consecutive) < 5,
               dummy2 = rleid(dummy)) %>%
  group_split(dummy2, .keep = F) %>%
  map_if( ~.x$dummy[[1]], ~.x %>% complete(consecutive = seq(max(consecutive), 5, 1), fill = list(Week = 1)) %>%
            mutate(Week = cumsum(Week))) %>%
  map_dfr(~.x %>% select(-dummy))

#> # A tibble: 9 x 3
#>    Week Score consecutive
#>   <dbl> <dbl>       <dbl>
#> 1    14    34           1
#> 2    18    21           1
#> 3    19    45           2
#> 4    20    32           3
#> 5    21    56           4
#> 6    22    NA           5
#> 7    23    45           1
#> 8    24    23           2
#> 9    25    48           3

Created on 2021-06-10 by the reprex package (v2.0.0)

Upvotes: 2

Anoushiravan R
Anoushiravan R

Reputation: 21918

You can also use the following solution. Midway through this solution before we use add_row to add your additional rows, we can filter the whole data set for we use group_split I filtered the whole data set to keep only those groups with the maximum observations which means they have longer consecutive Weeks than others. So after we split by grouping variable we may end of with 2 or more groups of equal consecutive Weeks so then you can choose whichever your like based on your preference:

library(dplyr)
library(purrr)
library(tibble)

db1 %>%
  mutate(Consecutive = +(Week - lag(Week, default = first(Week)) == 1), 
         grp = cumsum(Consecutive == 0)) %>%
  group_by(grp) %>%
  mutate(Consecutive = row_number()) %>%
  group_by(grp, .drop = TRUE) %>%
  add_count() %>%
  ungroup() -> db2     # We create our grouping variable `grp` here


db2 %>%
  filter(n == max(n)) %>%
  group_split(grp) %>%
  map_dfr(~ add_row(.x, Week = .x$Week[.x$n[1]] + seq(1, 5 - .x$n[1], 1), 
                  Consecutive = .x$Consecutive[.x$n[1]] + seq(1, 5 - .x$n[1], 1), 
                  grp = .x$grp[1])) %>%
  bind_rows(db2 %>%
              filter(n != max(n))) %>%
  select(-c(grp, n)) %>%
  arrange(Week)


# A tibble: 9 x 3
   Week Score Consecutive
  <dbl> <dbl>       <dbl>
1    14    34           1
2    18    21           1
3    19    45           2
4    20    32           3
5    21    56           4
6    22    NA           5
7    23    45           1
8    24    23           2
9    25    48           3

Upvotes: 1

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

if I understand correctly

library(data.table)


library(tidyverse)
x1 <- c(14, 18, 19, 20, 21, 23, 24, 25)
y1 <- c(34, 21, 45, 32, 56, 45, 23, 48)
db1 <- cbind(x1, y1) %>% as.data.frame() %>% setNames(c("Week", "Score"))

db1 %>%
  mutate(grp = cumsum(c(0, diff(Week)) > 1)) %>%
  group_by(grp) %>%
  mutate(n_grp = n()) %>% 
  ungroup() %>% 
  filter(n_grp == max(n_grp, na.rm = TRUE)) %>% 
  complete(grp,
           n_grp,
           nesting(Week = seq(from = first(Week), length = 5))) %>% 
  select(-c(grp, n_grp)) %>% 
  rows_upsert(db1, by = c("Week", "Score"))
#> # A tibble: 9 x 2
#>    Week Score
#>   <dbl> <dbl>
#> 1    18    21
#> 2    19    45
#> 3    20    32
#> 4    21    56
#> 5    22    NA
#> 6    14    34
#> 7    23    45
#> 8    24    23
#> 9    25    48

Created on 2021-06-10 by the reprex package (v2.0.0)

Upvotes: 1

Related Questions