Cut numeric variable in data frame with different breaks according to group (preferably with dplyr)

Question

I've tried to find a solution to this here, but nothing seems to address exactly my case. Sorry if I missed it.

I have a data frame with each row providing a position in one of various categories (e.g., row one corresponds position 2 within "category" 'A', but row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign the positions within the original data frame into their corresponding tile/interval. For instance, given the input data and category breaks that follow:

library(tidyverse)

test_df <- tribble(
  ~category, ~pos,
  'A', 2,
  'A', 5,
  'B', 4,
  'B', 8
)

breaks <- tribble(
  ~category, ~start, ~end,
  'A', 0, 4,
  'A', 4, 7,
  'A', 7, 10,
  'B', 0, 3,
  'B', 3, 5,
  'B', 5, 10
)

The result I would like to obtain would be something like:

  category   pos tile   
         
1 A            2 (0, 4] 
2 A            5 (4, 7] 
3 B            4 (3, 5] 
4 B            8 (5, 10]

I would normally use cut for similar tasks, but, as far as I'm aware, there's no way of defining different break points per group. The only way I have found to leverage group_by to create distinct intervals with cut, is by fixing the number of cuts to perform (which is not applicable in this case).

The best way I can come up to address my problem is this:

bind_rows(
  lapply(
    X=unique(test_df$category),
    FUN=function(x) {
      test_df %>%
        filter(category==x) %>%
        mutate(tile=cut(
          pos,
          breaks=c(0, filter(breaks, category==x)$end, Inf)))
    } ) )

which provides the expected output, but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows on the input).

Any suggestion on how to streamline it? Any way of keeping it "piped"?

Cheers,

Fran

Cut numeric variable in data frame with different breaks according to group (preferably with dplyr)

Answers (1)

Related Questions