Reputation: 19
I have a dataset where I need to change any values that are 0, but have had a non-zero value in the past 48 hours to a specific string. I'm guessing I might need to convert all but the first column from dbl to chr before doing this?
Time colA colB colC colD
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2021-11-21 10:00:00 8 0 9 176
2 2021-11-11 11:00:00 21 0 22 416
3 2021-11-21 11:00:00 19 0 20 373
4 2021-11-11 12:00:00 40 13 28 566
5 2021-11-21 12:00:00 26 0 27 527
6 2021-11-11 13:00:00 50 20 32 651
7 2021-11-11 10:00:00 11 0 12 216
8 2021-11-21 13:00:00 30 0 31 617
9 2021-11-11 14:00:00 51 0 32 675
10 2021-11-21 14:00:00 31 0 32 644
Apologies that the data isn't sorted by time already, working on fixing that. For example, here's the output that I'd like to get to:
Time colA colB colC colD
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2021-11-21 10:00:00 8 0 9 176
2 2021-11-11 11:00:00 21 0 22 416
3 2021-11-21 11:00:00 19 0 20 373
4 2021-11-11 12:00:00 40 13 28 566
5 2021-11-21 12:00:00 26 0 27 527
6 2021-11-11 13:00:00 50 20 32 651
7 2021-11-11 10:00:00 11 0 12 216
8 2021-11-21 13:00:00 30 0 31 617
9 2021-11-11 14:00:00 51 STRING1 32 675
10 2021-11-21 14:00:00 31 0 32 644
Since colB has a value of 0 at 2021-11-11 14:00:00, but atleast 1 of the previous values in the 48 hours before that != 0, it gets changed to "STRING1"
Apologies if this is confusing, I'm trying to automate something I typically do manually in Excel. Thanks in advance
Upvotes: 1
Views: 35
Reputation: 1030
Here's a tidyverse solution. I'll first create some sample data (note that I do indeed set up the other columns as characters):
data = tribble(
~ time, ~ colA, ~ colB,
"2021-11-21 12:00:00", 1, 0,
"2021-11-22 00:00:00", 0, 1,
"2021-11-24 12:00:00", 0, 0,
"2021-11-25 12:00:00", 1, 1,
"2021-11-26 12:00:00", 0, 0,
) %>%
mutate(
time = ymd_hms(time),
across(-time, as.character)
)
# A tibble: 5 x 3
time colA colB
<dttm> <chr> <chr>
1 2021-11-21 12:00:00 1 0
2 2021-11-22 00:00:00 0 1
3 2021-11-24 12:00:00 0 0
4 2021-11-25 12:00:00 1 1
5 2021-11-26 12:00:00 0 0
The challenge of this problem is, for each time, we need to know which other rows to look for to determine the new value of each column. To do this, I will use purrr:pmap()
and the .data
object available in dplyr
. I'll start by demonstrating how you could "look back" at rows with a time within the last 48 hours:
data %>%
mutate(
across(
.cols = -time,
function(col) {
pmap_chr(list(time), function(t) {
eligible = .data$time >= t - hours(48) & .data$time < t
paste(col[eligible], collapse = ",")
})
},
.names = "{.col}_previous"
)
)
Output:
# A tibble: 5 x 5
time colA colB colA_previous colB_previous
<dttm> <chr> <chr> <chr> <chr>
1 2021-11-21 12:00:00 1 0 "" ""
2 2021-11-22 00:00:00 0 1 "1" "0"
3 2021-11-24 12:00:00 0 0 "" ""
4 2021-11-25 12:00:00 1 1 "0" "0"
5 2021-11-26 12:00:00 0 0 "0,1" "0,1"
As you can see, this looks promising. We are ready to develop the solution using this as the key idea.
data %>%
mutate(
across(
.cols = -time,
function(col) {
modify_ind = pmap_lgl(list(time), function(t) {
eligible = .data$time >= t - hours(48) & .data$time < t
any(col[eligible] != "0")
})
ifelse(modify_ind & col == "0", "STRING1", col)
}
)
)
Output:
# A tibble: 5 x 3
time colA colB
<dttm> <chr> <chr>
1 2021-11-21 12:00:00 1 0
2 2021-11-22 00:00:00 STRING1 1
3 2021-11-24 12:00:00 0 0
4 2021-11-25 12:00:00 1 1
5 2021-11-26 12:00:00 STRING1 STRING1
Some notes:
modify_ind
that builds on the idea shown earlier: access the column filtered on times within the last 48 hours, and then check if there were any nonzero values. I then modify the column whenever its modify_ind
is TRUE
and the value of the column is "0"
..names
parameter to across()
as shown earlier.across(.cols = -time)
, use something like across(.cols = c("colA", "colB"))
.Upvotes: 1