Reputation: 7928
I am looking for a way to, within id
groups, count unique occurrences of value shifts in TF
in the data datatbl
.
I want to count both forward and backwards from when TF
changes between 1
and 0
or o
and 1
. The counting is to be stored in a new variable PM##
, so that the PM##
s holds each unique shift in TF
, in both plus and minus. The MWE below leads to an outcome with 7 PM, but my production data can have 15 or more shifts. If a TF
values does not change between NA
's I want to mark it 0
.
This question is similar to a question I previously asked, but the last part about TF
standing alone is new. Both Uwe and Psidom provided elegant answers to the initial question using data.table
here and using tidyverse
here. after conferencing with Uwe, I am posting this slightly modified version of my question.
If this question violates any SO policies please let me know and I'll be happy to reopen my initial question or append this an bounty-issue.
To illustrate my question with a minimal working example. I have data like this,
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tibble)
tbl <- tibble(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)),
TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1, NA, 0L, NA, 0L,
0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
tbl %>% print(n=18)
#> # A tibble: 40 x 2
#> id TF
#> <int> <dbl>
#> 1 10 NA
#> 2 10 NA
#> 3 10 0
#> 4 10 NA
#> 5 10 0
#> 6 10 NA
#> 7 10 1
#> 8 10 1
#> 9 10 1
#> 10 10 1
#> 11 10 1
#> 12 10 NA
#> 13 10 1
#> 14 10 0
#> 15 10 1
#> 16 10 0
#> 17 10 1
#> 18 0 NA
#> # ... with 22 more rows
tblPM <- structure(list(id = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1,
NA, 1, 0, 1, 0, 1, NA, 0, NA, 0, 0, 1, 1, 1, 0, 0,
NA, NA, 0, NA, 0, 0, 0, 1, 1, 1, 0, NA, 1), PM01 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L,
-2L, -1L, 1L, 2L, 3L, NA, NA, NA), PM02 = c(NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -2L,
-1L, 1L, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L,
-1L, 1L, NA, NA), PM03 = c(NA, NA, NA, NA, NA, NA, 0L, 0L, 0L,
0L, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L,
-1L, 1L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
0L), PM04 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
-1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), PM05 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), PM06 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), PM07 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), .Names = c("id", "TF", "PM01", "PM02", "PM03", "PM04", "PM05",
"PM06", "PM07"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -40L
))
tblPM %>% print(n=18)
#> # A tibble: 40 x 9
#> id TF PM01 PM02 PM03 PM04 PM05 PM06 PM07
#> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
#> 1 10 NA NA NA NA NA NA NA NA
#> 2 10 NA NA NA NA NA NA NA NA
#> 3 10 0 0 NA NA NA NA NA NA
#> 4 10 NA NA NA NA NA NA NA NA
#> 5 10 0 NA 0 NA NA NA NA NA
#> 6 10 NA NA NA NA NA NA NA NA
#> 7 10 1 NA NA 0 NA NA NA NA
#> 8 10 1 NA NA 0 NA NA NA NA
#> 9 10 1 NA NA 0 NA NA NA NA
#> 10 10 1 NA NA 0 NA NA NA NA
#> 11 10 1 NA NA 0 NA NA NA NA
#> 12 10 NA NA NA NA NA NA NA NA
#> 13 10 1 NA NA NA -1 NA NA NA
#> 14 10 0 NA NA NA 1 -1 NA NA
#> 15 10 1 NA NA NA NA 1 -1 NA
#> 16 10 0 NA NA NA NA NA 1 -1
#> 17 10 1 NA NA NA NA NA NA 1
#> 18 0 NA NA NA NA NA NA NA NA
#> # ... with 22 more rows
identical([some solution], tblPM)
#> [1] TRUE
update w/ microbenchmark
2018-01-24 14:20:18Z,
Thanks to Fierr and Chris for taking the time to tease out the logic and submit an answer. Inspired my this setup I've computed a small microbenchmark comparison of thier functions. I put Fierrs answer into the function
tidyverse_Fierr()and Chris' answer into
dt_Chris()` (if someone want the exact functions please let me know and I'll add them here.
After some minor tweaks they both come out identical when match with tblPM
, i.e.
identical(tblPM, tidyverse_Fierr(tbl))
#> [1] TRUE
identical(tblPM, dt_Chris(tbl))
#> [1] TRUE
Now to the quick microbenchmark,
df_test <- bind_rows(rep(list(tbl), 111))
microbenchmark::microbenchmark(tidyverse_Fierr(df_test), dt_Chris(df_test), times = 3*1)
#> Unit: milliseconds
#> expr min mean median uq max neval cld
#> tidyverse_Fierr(df_test) 19503.366 20171.268 20080.99 20505.219 20929.4489 3 b
#> dt_Chris(df_test) 199.165 233.924 203.72 251.304 298.8887 3 a
Interestingly the tidy_method comes out way faster in this kinda similar comparison.
Upvotes: 3
Views: 492
Reputation: 6372
Here is a script approach - given the amount of custom treatment for each case (TF = NA, uniqueN(TF) = 1, uniqueN(TF) = 2, I think this is likely clearer to implement vs. a dplyr chain. Should be fairly quick as it is all data.table based. Open to suggestions on how to improve!
This will expand automatically as the number of PM columns required increases - as I commented below, I would recommend getting rid of the 0 prefix in the column, as there may be a case where you get to 10^2..n columns which would bump to PM001.
library(data.table)
tbl3 <- data.table(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)),
TF = c(NA, NA, 0L, NA, 0L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 0L, 1L, 0L, 1L, NA, 0L, NA, 0L,
0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
# create index to untimately join back to
tbl3[, row_idx := .I]
# all transformations on a replicated data.table
tbl3_tmp <- copy(tbl3)
# identify where the NA breaks occur - this splits each id into subgroups (id_group)
tbl3_tmp[, P_TF := shift(TF, 1, "lag", fill = NA), by = .(id)]
tbl3_tmp[, TF_break := is.na(TF) | is.na(P_TF)]
tbl3_tmp[, id_group := cumsum(TF_break), by = .(id)]
tbl3_tmp[, `:=`(TF_break = NULL, P_TF = NULL)] # above can be consolidated to one line which would make this line unneccesary - expanded for easier understanding
tbl3_tmp <- tbl3_tmp[!is.na(TF)] # NA rows can be safely ignored now - these will be all NA, and will be handled with the left join below
# find where subpatterns exist (runs of 0..1 or 1..0)
tbl3_tmp[, subpattern_break := TF != shift(TF, 1, "lag", fill = NA), by = .(id, id_group)]
tbl3_tmp[, subbreaks := sum(subpattern_break, na.rm = TRUE), by = .(id, id_group)] # if there are no breaks, we need to treat separately
# two cases: zero subbreaks and multiple subbreaks.
tbl3_zeros <- tbl3_tmp[subbreaks == 0]
tbl3_nonzeros <- tbl3_tmp[subbreaks > 0]
# for 1+ subbreaks, we need to double the rows - this allows us to easily create the PM_field both "forwards" and "backwards"
tbl3_nonzeros[is.na(subpattern_break), subpattern_break := TRUE]
tbl3_nonzeros[, subbreak_index := cumsum(subpattern_break), by = .(id, id_group)]
tbl3_nonzeros <- rbindlist(list(tbl3_nonzeros,tbl3_nonzeros), idcol = "base") # double the row
tbl3_nonzeros[base == 1 & subbreak_index %% 2 == 1, subbreak_index := subbreak_index + 1L] # round to nearest even
tbl3_nonzeros[base == 2 & subbreak_index %% 2 == 0, subbreak_index := subbreak_index + 1L] # round to nearest odd
# this creates an index when the subbreak starts - allows us to sequence PM properly
tbl3_nonzeros[,subbreak_start := min(row_idx), by = .(id, id_group, subbreak_index)]
# exclude the ends if there is only one unique TF value - might be able to get this to one line
tbl3_nonzeros[, TF_count := uniqueN(TF), by = .(id, id_group, subbreak_index)]
tbl3_nonzeros <- tbl3_nonzeros[TF_count > 1]
# create a 1..N column, subtract the index where the break occurs ,then add 1 to all 0+ values.
tbl3_nonzeros[,PM_field := 1:.N, by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[, PM_field := PM_field - PM_field[which(diff(TF)!=0)[1]+1], by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[PM_field >= 0, PM_field := PM_field + 1L] # base 1 after the break
# create subbreaks for zero groups
tbl3_zeros[,subbreak_start := min(row_idx), by = .(id, id_group)]
# bring zero and non zero case together
tbl3_zeros <- tbl3_zeros[, .(id, id_group, subbreak_start,row_idx = row_idx, PM_field = 0L)]
tbl3_nonzeros <- tbl3_nonzeros[,.(id, id_group, subbreak_start, row_idx, PM_field)]
tbl3_tmp <- rbindlist(list(tbl3_zeros, tbl3_nonzeros))
# Create header
tbl3_tmp <- tbl3_tmp[order(subbreak_start, PM_field)]
tbl3_tmp[, PM_header := paste0("PM0",cumsum(c(1,diff(subbreak_start)!=0)),sep = ""), by = .(id)] # I would remove 0 in PM0 here (kept for identical check)- inefficient to check if this will be 1, 2, 3 etc digits This could also be solved with; `paste0("PM", sprintf("%02d", cumsum(c(1, diff(subbreak_start) != 0))))`
# long to wide
tbl3_tmp <- dcast(tbl3_tmp, row_idx ~ PM_header, value.var = "PM_field", fun.aggregate = sum, fill = NA)
# merge back to initial dataframe
tblPM_frombase <- merge(tbl3, tbl3_tmp, by = "row_idx", all.x = TRUE)[, row_idx := NULL]
identical(tblPM, tblPM_frombase)
[1] TRUE
Upvotes: 2
Reputation: 195
Liked the challenge to uncover the logic of this one. The approach is based on tidyverse. Suggestions on tidying it even more are welcome!
library(data.table)
library(purrr)
library(dplyr)
library(tibble)
tbl <- tibble(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)),
TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1, NA, 0L, NA, 0L,
0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
tbl <- mutate(tbl, rn = 1:n())
lookup_table <- tbl %>%
group_by(id) %>%
mutate(rl = rleid(TF)) %>%
group_by(id, rl, TF) %>%
summarise(n=n()) %>%
group_by(id) %>%
mutate(lag = lag(TF, order_by=id),
lead = lead(TF, order_by=id),
test = ifelse(is.na(lag) & is.na(lead), 1, 0)) %>%
select(id, rl, test)
tmp <- tbl %>%
group_by(id) %>%
mutate(rl = rleid(TF),
rl_nona = ifelse(is.na(TF), NA, rleid(rl)),
rl_nona = match(rl_nona, unique(na.omit(rl_nona)))) %>% # Re-indexing
left_join(lookup_table, by = c("id" = "id", "rl" = "rl")) %>%
mutate(TF_new = ifelse(test == 1, NA, TF),
rl_gap = ifelse(is.na(TF_new), NA, rleid(TF_new)),
rl_gap = match(rl_gap, unique(na.omit(rl_gap))), # Re-indexing
up_pos = ifelse(min(rl_gap, na.rm=TRUE)==rl_gap, NA, rl_gap),
down_pos = ifelse(max(rl_gap, na.rm=TRUE)==rl_gap, NA, rl_gap)) %>%
group_by(id, rl_gap) %>%
mutate(up = ifelse(is.na(up_pos), 0, seq_len(n())),
down = ifelse(is.na(down_pos), 0, -rev(seq_len(n())))) %>%
group_by(id) %>%
mutate(zero_pos = ifelse(test == 1 & rl_nona > max(rl_gap, na.rm = TRUE), rl_nona - 1, rl_nona)) # Correct placement of zeroes
up <- dcast(tmp, rn ~ rl_nona, value.var = 'up' , fill = 0)
down <- dcast(tmp, rn ~ rl_nona, value.var = 'down', fill = 0)
res <- (down[, 2:max(tmp$rl_nona, na.rm=TRUE)] + up[, 3:(max(tmp$rl_nona, na.rm=TRUE)+1)]) %>%
mutate_all(funs(replace(., which(.==0), NA))) %>%
bind_cols(rn = tmp$rn, test = tmp$test, zero_pos = tmp$zero_pos) %>%
right_join(tbl, by = "rn") %>%
mutate(`PM01` = ifelse(test == 1 & zero_pos == 1, 0, `1`)) %>%
mutate(`PM02` = ifelse(test == 1 & zero_pos == 2, 0, `2`)) %>%
mutate(`PM03` = ifelse(test == 1 & zero_pos == 3, 0, `3`)) %>%
mutate(`PM04` = ifelse(test == 1 & zero_pos == 4, 0, `4`)) %>%
mutate(`PM05` = ifelse(test == 1 & zero_pos == 5, 0, `5`)) %>%
mutate(`PM06` = ifelse(test == 1 & zero_pos == 6, 0, `6`)) %>%
mutate(`PM07` = ifelse(test == 1 & zero_pos == 7, 0, `7`)) %>%
select(id, TF, everything(), -rn, -test, -zero_pos, -c(1:7)) %>%
mutate_if(is.numeric, as.integer) %>%
as.tibble()
identical(tblPM, res)
Upvotes: 1