Reputation: 198
I've tried to find a solution to this here, but nothing seems to address exactly my case. Sorry if I missed it.
I have a data frame with each row providing a position in one of various categories (e.g., row one corresponds position 2 within "category" 'A', but row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign the positions within the original data frame into their corresponding tile/interval. For instance, given the input data and category breaks that follow:
library(tidyverse)
test_df <- tribble(
~category, ~pos,
'A', 2,
'A', 5,
'B', 4,
'B', 8
)
breaks <- tribble(
~category, ~start, ~end,
'A', 0, 4,
'A', 4, 7,
'A', 7, 10,
'B', 0, 3,
'B', 3, 5,
'B', 5, 10
)
The result I would like to obtain would be something like:
category pos tile
<chr> <dbl> <chr>
1 A 2 (0, 4]
2 A 5 (4, 7]
3 B 4 (3, 5]
4 B 8 (5, 10]
I would normally use cut
for similar tasks, but, as far as I'm aware, there's no way of defining different break points per group. The only way I have found to leverage group_by
to create distinct intervals with cut
, is by fixing the number of cuts to perform (which is not applicable in this case).
The best way I can come up to address my problem is this:
bind_rows(
lapply(
X=unique(test_df$category),
FUN=function(x) {
test_df %>%
filter(category==x) %>%
mutate(tile=cut(
pos,
breaks=c(0, filter(breaks, category==x)$end, Inf)))
} ) )
which provides the expected output, but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows on the input).
Any suggestion on how to streamline it? Any way of keeping it "piped"?
Cheers,
Fran
Upvotes: 4
Views: 1319
Reputation:
one solution:
## rearrange "breaks"
breaks <-
breaks %>%
pivot_longer(cols = start:end) %>%
distinct(category, value) %>%
group_by(category) %>%
summarise(breaks = list(value))
## join and cut:
test_df %>%
left_join(breaks) %>%
rowwise %>%
mutate(tile = cut(pos, unlist(breaks))) %>%
ungroup ## reduce memory size of result object
output:
## # A tibble: 4 x 4
## # Rowwise:
## category pos breaks tile
## <chr> <dbl> <list> <fct>
## 1 A 2 <dbl [4]> (0,4]
## 2 A 5 <dbl [4]> (4,7]
## 3 B 4 <dbl [4]> (3,5]
## 4 B 8 <dbl [4]> (5,10]
edit
A faster (about 5 times) though less readable approach with a slightly smaller result object:
library(purrr)
## boil dataframe "breaks" down to a list of break vectors:
breaks_list <-
breaks %>%
## one tibble with columns 'start' and 'end' per 'category':
nest_by(category) %>%
## dataframe into list of tibbles, named with category
pull(data, category) %>%
## tibbles into vector of breaks
map(~ .x %>% as.matrix %>% c %>% unique %>% sort)
## get tile by indexing into "break_list" via mapping:
test_df %>%
mutate(
tile = map2(pos, category,
~ cut(.x, breaks_list[[.y]])
) %>% unlist
)
Upvotes: 4