Cut numeric variable in data frame with different breaks according to group (preferably with dplyr)

I've tried to find a solution to this here, but nothing seems to address exactly my case. Sorry if I missed it.

I have a data frame with each row providing a position in one of various categories (e.g., row one corresponds position 2 within "category" 'A', but row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign the positions within the original data frame into their corresponding tile/interval. For instance, given the input data and category breaks that follow:

library(tidyverse)

test_df <- tribble(
  ~category, ~pos,
  'A', 2,
  'A', 5,
  'B', 4,
  'B', 8
)

breaks <- tribble(
  ~category, ~start, ~end,
  'A', 0, 4,
  'A', 4, 7,
  'A', 7, 10,
  'B', 0, 3,
  'B', 3, 5,
  'B', 5, 10
)

The result I would like to obtain would be something like:

  category   pos tile   
  <chr>    <dbl> <chr>  
1 A            2 (0, 4] 
2 A            5 (4, 7] 
3 B            4 (3, 5] 
4 B            8 (5, 10]

I would normally use cut for similar tasks, but, as far as I'm aware, there's no way of defining different break points per group. The only way I have found to leverage group_by to create distinct intervals with cut, is by fixing the number of cuts to perform (which is not applicable in this case).

The best way I can come up to address my problem is this:

bind_rows(
  lapply(
    X=unique(test_df$category),
    FUN=function(x) {
      test_df %>%
        filter(category==x) %>%
        mutate(tile=cut(
          pos,
          breaks=c(0, filter(breaks, category==x)$end, Inf)))
    } ) )

which provides the expected output, but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows on the input).

Any suggestion on how to streamline it? Any way of keeping it "piped"?

Cheers,

Fran

Upvotes: 4

Views: 1319

Answers (1)

user18309711
user18309711

Reputation:

one solution:

## rearrange "breaks"
breaks <- 
  breaks %>%
  pivot_longer(cols = start:end) %>%
  distinct(category, value) %>% 
  group_by(category) %>%
  summarise(breaks =  list(value))

## join and cut:
test_df %>%
  left_join(breaks) %>%
  rowwise %>%
  mutate(tile = cut(pos, unlist(breaks))) %>%
  ungroup ## reduce memory size of result object

output:


## # A tibble: 4 x 4
## # Rowwise: 
##   category   pos breaks    tile  
##   <chr>    <dbl> <list>    <fct> 
## 1 A            2 <dbl [4]> (0,4] 
## 2 A            5 <dbl [4]> (4,7] 
## 3 B            4 <dbl [4]> (3,5] 
## 4 B            8 <dbl [4]> (5,10]

edit

A faster (about 5 times) though less readable approach with a slightly smaller result object:

library(purrr)

## boil dataframe "breaks" down to a list of break vectors:

breaks_list <- 
  breaks %>%
  ## one tibble with columns 'start' and 'end' per 'category':
  nest_by(category) %>%
  ## dataframe into list of tibbles, named with category
  pull(data, category) %>%
  ## tibbles into vector of breaks
  map(~ .x %>% as.matrix %>% c %>% unique %>% sort) 

## get tile by indexing into "break_list" via mapping: 
test_df %>% 
mutate(
  tile = map2(pos, category,
              ~ cut(.x, breaks_list[[.y]])
              ) %>% unlist
)

Upvotes: 4

Related Questions