R: Keep first values in time series after NA and set 3 consecutive values to NA

Question

I have a data frame with many time series in its columns (col1 below shows one series). I would like to go through each time series, keep the first value that appears after an NA, and set only the 3 values that follow it to NA (col2 shows the desired result). Ideally, this should work for all columns or for a specified subset (e.g. [2:30]) of the data frame.

This answer does not take the condition of only the 3 consecutive values to NA into account.

Sample data

df <- data.frame(
  col1 = c(7.00, NA, NA, 1.00, 2.00, 5.00, NA, 5.00, 7.00, NA, NA, 1.00, NA, 2.00, NA, NA, 1.00, 3.00, 4.00, 5.00, 6.00, 7.00, NA, 7.00, NA),
  col2 = c(7.00, NA, NA, 1.00, NA, NA, NA, 5.00, NA, NA, NA, 1.00, NA, 2.00, NA, NA, 1.00, NA, NA, NA, 6.00, NA, NA, 7.00, NA),
  stringsAsFactors = FALSE)

Thanks for your help.

Jon Spring · Accepted Answer

Here's an approach using dplyr.

First I start a new group each time a valid value follows an NA. Then, within each group, I set the 2nd through 4th rows to NA and keep the value from col1 otherwise.

library(dplyr)

df %>%
  mutate(new_grp = if_else(!is.na(col1) & is.na(lag(col1)), 1, 0),
         grp = cumsum(new_grp)) %>%
  group_by(grp) %>%
  # Modified below per OP clarification: treat each group of 4 entries
  #   following an NA like the first four, with one value and then 3 NAs.
  #   Uses modulo 4, where any row with a remainder of 2, 3, or 0 (ie row 4, etc.) will get NA
  mutate(col2b = case_when(row_number() %% 4 %in% c(2:3, 0)  ~ NA_real_,
                          TRUE ~ col1)) %>%
  ungroup()

Output:

row  col1  col2  col2b
1    7     7     7
2    NA    NA    NA
3    NA    NA    NA
4    1     1     1
5    2     NA    NA
6    5     NA    NA
7    NA    NA    NA
8    5     5     5
9    7     NA    NA
10   NA    NA    NA
11   NA    NA    NA
12   1     1     1
13   NA    NA    NA
14   2     2     2
15   NA    NA    NA
16   NA    NA    NA
17   1     1     1
18   3     NA    NA
19   4     NA    NA
20   5     NA    NA
21   6     6     6
22   7     NA    NA     # Modified per OP clarification
23   NA    NA    NA
24   7     7     7
25   NA    NA    NA
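To see why the modulo step works, it can help to look at the intermediate values for a short series. The following is a minimal base R sketch of the same logic, not the code from the answer: the variable names `prev_is_na`, `pos`, and `trace` are made up for illustration, and `ave()` stands in for `row_number()` within `group_by()`.

```r
# First nine values of col1 from the question
x <- c(7, NA, NA, 1, 2, 5, NA, 5, 7)

# TRUE where the previous element is NA. The first element is treated as
# following an NA, mirroring dplyr::lag(), which pads the start with NA.
prev_is_na <- c(TRUE, is.na(x[-length(x)]))

# 1 where a valid value follows an NA, 0 elsewhere
new_grp <- as.integer(!is.na(x) & prev_is_na)

# Running group id: increases by one at every such value
grp <- cumsum(new_grp)

# Position of each element within its group
pos <- ave(seq_along(x), grp, FUN = seq_along)

# Positions 2, 3, 4 (and 6, 7, 8, ...) leave a remainder of 2, 3, or 0
# when divided by 4, so those are the ones set to NA
result <- ifelse(pos %% 4 %in% c(2, 3, 0), NA, x)

trace <- data.frame(x, new_grp, grp, pos, result)
print(trace)
```

Printing `trace` puts the group id and within-group position next to each value, which makes it easy to check which rows are blanked.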

Edit: Applying to multiple / all columns

If all your columns are of the same type, this should work to convert all the columns at once. It uses tidyr to gather the data from wide to "long" format, performs the same calculation as before, and then spreads the result back to wide format.

df %>%
  mutate(row = row_number()) %>%
  tidyr::gather(col, value, -row) %>%
  group_by(col) %>%
  mutate(new_grp = if_else(!is.na(value) & is.na(lag(value)), 1, 0),
         grp = cumsum(new_grp)) %>%
  group_by(col, grp) %>%
  mutate(value = case_when(row_number() %% 4 %in% c(2:3, 0)  ~ NA_real_,
                           TRUE ~ value)) %>%
  ungroup() %>%
  # Drop the helper columns before spreading; otherwise spread() treats them
  #   as identifiers and splits rows whose group ids differ between columns
  select(-new_grp, -grp) %>%
  tidyr::spread(col, value) %>%
  select(-row)

If your data frame uses different types, I think it gets more complicated, unless there's a simpler alternative I'm missing. I started with an approach that uses "tidyeval" to let you programmatically change one specified column using a function. The last step after this could use purrr to apply the function to all columns.

# Applies the "keep one value, then three NAs" rule to a single column,
#   passed unquoted. NA_real_ assumes that column is numeric.
NA_2to4 <- function(df_name, col_to_change) {
  col_quo <- enquo(col_to_change)
  df_name %>%
    mutate(new_grp = if_else(!is.na(!!col_quo) & is.na(lag(!!col_quo)), 1, 0),
           grp = cumsum(new_grp)) %>%
    group_by(grp) %>%
    mutate(!!col_quo := case_when(row_number() %% 4 %in% c(2:3, 0)  ~ NA_real_,
                                  TRUE ~ !!col_quo)) %>%
    ungroup() %>%
    select(-new_grp, -grp)
}

Here's how you could apply that to specific columns (colA and colB are placeholder names):

df %>% 
  NA_2to4(colA) %>%
  NA_2to4(colB)

I presume there's a way to use purrr::map to apply this to all columns, but I'm not sure of the syntax at the moment.
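One possible way to cover all or selected columns without tidyeval is to write the rule as a plain vector function and apply it column by column. This is a base R sketch rather than the purrr version mentioned above. The helper name `keep_first_after_na`, its `n_blank` argument, and the extra `label` column are hypothetical and only there for illustration. Because it assigns a plain `NA`, it should not depend on the column being numeric.

```r
# col1 from the question, plus a hypothetical character column
df <- data.frame(
  col1 = c(7, NA, NA, 1, 2, 5, NA, 5, 7, NA, NA, 1, NA, 2, NA, NA,
           1, 3, 4, 5, 6, 7, NA, 7, NA),
  stringsAsFactors = FALSE
)
df$label <- as.character(df$col1)

# Hypothetical helper: keep the first value after an NA, blank the next
# n_blank values, then repeat the pattern until the next NA-delimited run
keep_first_after_na <- function(x, n_blank = 3) {
  # The first element counts as following an NA, as with dplyr::lag()
  prev_is_na <- c(TRUE, is.na(x[-length(x)]))
  grp <- cumsum(!is.na(x) & prev_is_na)
  # Position within each run, counted from its first valid value
  pos <- ave(seq_along(x), grp, FUN = seq_along)
  # Keep positions 1, 5, 9, ... (for n_blank = 3); set the rest to NA
  x[pos %% (n_blank + 1) != 1] <- NA
  x
}

# Columns to process: a name vector, names(df)[2:30], or names(df) for all
cols <- c("col1", "label")
df[cols] <- lapply(df[cols], keep_first_after_na)
```

Here `lapply()` plays the role that `purrr::map()` would. With dplyr loaded, `mutate(across(all_of(cols), keep_first_after_na))` should be an equivalent way to write the last step.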
