Reputation: 131
I have a data frame with many time series in the columns (col1 shows an example of one series). I would like to go through each time series, keep the first value that appears after an NA, and set only the following 3 consecutive values to NA (col2 shows the desired result). Ideally, this should work for all columns or for a specified subset (e.g. [2:30]) of the data frame.
This answer does not account for the condition that only the 3 consecutive values are set to NA.
Sample data
df <- data.frame(
  col1 = c(7.00, NA, NA, 1.00, 2.00, 5.00, NA, 5.00, 7.00, NA, NA, 1.00, NA, 2.00, NA, NA, 1.00, 3.00, 4.00, 5.00, 6.00, 7.00, NA, 7.00, NA),
  col2 = c(7.00, NA, NA, 1.00, NA, NA, NA, 5.00, NA, NA, NA, 1.00, NA, 2.00, NA, NA, 1.00, NA, NA, NA, 6.00, NA, NA, 7.00, NA),
  stringsAsFactors = FALSE)
Thanks for your help.
Reputation: 66520
Here's an approach using dplyr. First I make a new group each time we have a valid value that follows an NA. Then, within each group, I set the 2nd through 4th rows to NA and otherwise keep the value from col1.
library(dplyr)
df %>%
  mutate(new_grp = if_else(!is.na(col1) & is.na(lag(col1)), 1, 0),
         grp = cumsum(new_grp)) %>%
  group_by(grp) %>%
  # Modified below per OP clarification: treat each group of 4 entries
  # following an NA like the first four, with one value and then 3 NAs.
  # Uses modulo 4, where any row with a remainder of 2, 3, or 0 (i.e. row 4, etc.) will get NA
  mutate(col2b = case_when(row_number() %% 4 %in% c(2:3, 0) ~ NA_real_,
                           TRUE ~ col1)) %>%
  ungroup()
Output:
row col1 col2 col2b
1 7 7 7
2 NA NA NA
3 NA NA NA
4 1 1 1
5 2 NA NA
6 5 NA NA
7 NA NA NA
8 5 5 5
9 7 NA NA
10 NA NA NA
11 NA NA NA
12 1 1 1
13 NA NA NA
14 2 2 2
15 NA NA NA
16 NA NA NA
17 1 1 1
18 3 NA NA
19 4 NA NA
20 5 NA NA
21 6 6 6
22 7 NA NA # Modified per OP clarification
23 NA NA NA
24 7 7 7
25 NA NA NA
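To make the modulo step concrete, here is a small illustration (not part of the original pipeline) of which positions within a group the condition marks for NA. The names pos and gets_na are only for this sketch.

```r
# Positions 1..8 within a single hypothetical group
pos <- 1:8

# Same condition as in the case_when() call: remainder 2, 3, or 0 means "set to NA"
gets_na <- pos %% 4 %in% c(2:3, 0)

# Positions 1 and 5 are kept; 2-4 and 6-8 are set to NA
print(data.frame(pos = pos, gets_na = gets_na))
```

This is why row 21 in the output above keeps its value: it is the 5th row of its group, and 5 %% 4 is 1.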
If all your columns are of the same type, this should work to convert all the columns at once. It uses tidyr to gather the data from wide to "long" format, performs the same calculation as before, and then spreads the result back to a wide format.
df %>%
  mutate(row = row_number()) %>%
  tidyr::gather(col, value, -row) %>%
  group_by(col) %>%
  mutate(new_grp = if_else(!is.na(value) & is.na(lag(value)), 1, 0),
         grp = cumsum(new_grp)) %>%
  group_by(col, grp) %>%
  mutate(value = case_when(row_number() %% 4 %in% c(2:3, 0) ~ NA_real_,
                           TRUE ~ value)) %>%
  ungroup() %>%
  # Drop the helper columns before spreading; otherwise spread() treats them
  # as identifiers and splits each original row across several output rows.
  select(-new_grp, -grp) %>%
  tidyr::spread(col, value) %>%
  select(-row)
If your data frame uses different types, I think it gets more complicated, unless there's a simpler alternative I'm missing. I started with an approach that uses "tidyeval" to let you programmatically change one specified column using a function. The last step after this could use purrr to apply the function to all columns.
NA_2to4 <- function(df_name, col_to_change) {
  col_quo <- enquo(col_to_change)
  df_name %>%
    mutate(new_grp = if_else(!is.na(!!col_quo) & is.na(lag(!!col_quo)), 1, 0),
           grp = cumsum(new_grp)) %>%
    group_by(grp) %>%
    mutate(!!col_quo := case_when(row_number() %% 4 %in% c(2:3, 0) ~ NA_real_,
                                  TRUE ~ !!col_quo)) %>%
    ungroup() %>%
    select(-new_grp, -grp)
}
Here's how you could apply that to specific columns:
df %>%
  NA_2to4(colA) %>%
  NA_2to4(colB)
I presume there's a way to use purrr::map to apply this to all columns, but I'm not sure of the syntax at the moment.
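One possible way to cover all (or selected) columns without tidyeval is to express the same logic as a plain vector function and apply it with base R's lapply. This is only a sketch under that assumption: the helper name na_after_first is hypothetical, and the example data reuses col1 from the question plus a made-up character column to show that the column type is preserved.

```r
# Hypothetical helper (not from the answer above): keep the first value after an NA,
# then blank out the next 3 positions, repeating every 4 positions within a run.
na_after_first <- function(x) {
  # TRUE where the previous element is NA (the first element counts as "after an NA")
  prev_na <- c(TRUE, is.na(x)[-length(x)])
  # A new group starts at each non-NA value that follows an NA
  grp <- cumsum(!is.na(x) & prev_na)
  # Position of each element within its group: 1, 2, 3, ...
  pos <- ave(seq_along(x), grp, FUN = seq_along)
  # Keep positions 1, 5, 9, ...; everything else becomes NA of x's own type
  x[pos %% 4 != 1] <- NA
  x
}

# Example data: col1 from the question, plus an illustrative character column
df_demo <- data.frame(
  col1 = c(7, NA, NA, 1, 2, 5, NA, 5, 7, NA, NA, 1, NA, 2, NA, NA, 1, 3, 4, 5, 6, 7, NA, 7, NA),
  txt  = c("a", "b", "c", "d", "e", NA, rep("z", 19)),
  stringsAsFactors = FALSE)

# Apply to every column; use e.g. df_demo[1] instead to restrict to chosen columns
result <- df_demo
result[] <- lapply(df_demo, na_after_first)
print(result)
```

Because the replacement uses a plain NA assignment rather than NA_real_, the same function should work for numeric and character columns alike. To limit it to a range of columns, the same lapply call can be applied to a subset such as df[2:30].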