Reputation: 43
I have a dataset that somehow is similar to the following example. I am trying to create a variable the same as the "expected" variable. That is if the date_time is "2020-01-02 12:00", create a variable including the lag, lead, and the level values of date_time.
Thank you in advance.
df <- tibble(
date_time = c("2020-01-01 10:00","2020-01-01 11:00","2020-01-01 12:00","2020-01-01 13:00","2020-01-01 14:00","2020-01-02 10:00","2020-01-02 11:00","2020-01-02 12:00","2020-01-03 10:00","2020-01-03 11:00"),
date = c("2020-01-01","2020-01-01","2020-01-01","2020-01-01","2020-01-01","2020-01-02","2020-01-02","2020-01-02","2020-01-03","2020-01-03"),
x= seq.default(1,5.5,.5),
date_time_NA = c(NA,"2020-01-01 11:00",NA,NA,NA,NA,NA,"2020-01-02 12:00",NA,NA),
expected = c("2020-01-01 10:00","2020-01-01 11:00","2020-01-01 12:00",NA,NA,NA,"2020-01-02 11:00","2020-01-02 12:00","2020-01-03 10:00",NA)
)
df
I changed the example above. So I am trying to tell R to find where date_time_NA is not missing and then create a variable including the lag, lead, and the level values of date_time (similar to expected).
I came up with a solution using the codes kindly shared by @akrun. It might not be a very smart solution; however, I found it intuitive. The main idea is to use ifelse to find the locations where the date_time_NA is not missing. Then we can use row_number() again and add 1 so it can find the lead and deduct 1, so it can find the lag location and replace them with the values of date_time.
df %>%
mutate(na_row = ifelse(!is.na(date_time_NA),row_number(),NA),
row_level_lag = ifelse(row_number() %in% c(na_row), date_time,NA),
row_level_now = ifelse(row_number() %in% c(na_row + 1), date_time,NA),
row_level_lead = ifelse(row_number() %in% c(na_row - 1), date_time,NA),
date_time_expected = glue("{row_level_lag} {row_level_now} {row_level_lead}"),
date_time_expected = stringr::str_replace_all(string = date_time_expected,pattern = "NA",replacement = ""))
Upvotes: 1
Views: 235
Reputation: 887901
We can use match
to get the index of the concerned 'date_time' value, then get the index of previous, and next value by adding a sequence from -1:1
. Use that to create a 'new' column with the subset value of 'date_time'
i1 <- match("2020-01-02 12:00", df$date_time)
i2 <- i1 + (-1:1)
df$new <- NA_character_
df$new[i2] <- df$date_time[i2]
-output
df
# A tibble: 10 x 5
# date_time date x expected new
# <chr> <chr> <dbl> <chr> <chr>
# 1 2020-01-01 10:00 2020-01-01 1 <NA> <NA>
# 2 2020-01-01 11:00 2020-01-01 1.5 <NA> <NA>
# 3 2020-01-01 12:00 2020-01-01 2 <NA> <NA>
# 4 2020-01-01 13:00 2020-01-01 2.5 <NA> <NA>
# 5 2020-01-01 14:00 2020-01-01 3 <NA> <NA>
# 6 2020-01-02 10:00 2020-01-02 3.5 <NA> <NA>
# 7 2020-01-02 11:00 2020-01-02 4 2020-01-02 11:00 2020-01-02 11:00
# 8 2020-01-02 12:00 2020-01-02 4.5 2020-01-02 12:00 2020-01-02 12:00
# 9 2020-01-03 10:00 2020-01-03 5 2020-01-03 10:00 2020-01-03 10:00
#10 2020-01-03 11:00 2020-01-03 5.5 <NA> <NA>
With dplyr
, we can use replace
to replace the values in 'date_time' that are not the match
ed index for the 'date_time' along with the previous and next index index to NA
library(dplyr)
df %>%
mutate(new = replace(date_time,
!row_number() %in% (match("2020-01-02 12:00",
df$date_time) + (-1:1)), NA_character_ ))
-output
# A tibble: 10 x 5
# date_time date x expected new
# <chr> <chr> <dbl> <chr> <chr>
# 1 2020-01-01 10:00 2020-01-01 1 <NA> <NA>
# 2 2020-01-01 11:00 2020-01-01 1.5 <NA> <NA>
# 3 2020-01-01 12:00 2020-01-01 2 <NA> <NA>
# 4 2020-01-01 13:00 2020-01-01 2.5 <NA> <NA>
# 5 2020-01-01 14:00 2020-01-01 3 <NA> <NA>
# 6 2020-01-02 10:00 2020-01-02 3.5 <NA> <NA>
# 7 2020-01-02 11:00 2020-01-02 4 2020-01-02 11:00 2020-01-02 11:00
# 8 2020-01-02 12:00 2020-01-02 4.5 2020-01-02 12:00 2020-01-02 12:00
# 9 2020-01-03 10:00 2020-01-03 5 2020-01-03 10:00 2020-01-03 10:00
#10 2020-01-03 11:00 2020-01-03 5.5 <NA> <NA>
If there are more than one element to match
, then we create a grouping column and apply the same code
library(tidyr)
df %>%
mutate(grp = lead(cumsum(!is.na(date_time_NA)))) %>%
fill(grp) %>%
group_by(grp) %>%
mutate(new = replace(date_time,
!row_number() %in% (match(date_time_NA[complete.cases(date_time_NA)][1],
date_time) + (-1:1)), NA_character_ )) %>%
ungroup %>%
select(-grp)
# A tibble: 10 x 6
# date_time date x date_time_NA expected new
# <chr> <chr> <dbl> <chr> <chr> <chr>
# 1 2020-01-01 10:00 2020-01-01 1 <NA> 2020-01-01 10:00 2020-01-01 10:00
# 2 2020-01-01 11:00 2020-01-01 1.5 2020-01-01 11:00 2020-01-01 11:00 2020-01-01 11:00
# 3 2020-01-01 12:00 2020-01-01 2 <NA> 2020-01-01 12:00 2020-01-01 12:00
# 4 2020-01-01 13:00 2020-01-01 2.5 <NA> <NA> <NA>
# 5 2020-01-01 14:00 2020-01-01 3 <NA> <NA> <NA>
# 6 2020-01-02 10:00 2020-01-02 3.5 <NA> <NA> <NA>
# 7 2020-01-02 11:00 2020-01-02 4 <NA> 2020-01-02 11:00 2020-01-02 11:00
# 8 2020-01-02 12:00 2020-01-02 4.5 2020-01-02 12:00 2020-01-02 12:00 2020-01-02 12:00
# 9 2020-01-03 10:00 2020-01-03 5 <NA> 2020-01-03 10:00 2020-01-03 10:00
#10 2020-01-03 11:00 2020-01-03 5.5 <NA> <NA> <NA>
Upvotes: 1