datgoaltho
datgoaltho

Reputation: 125

Cut() with custom bins

I have the following two dataframes:

data <- data.frame(
  yyyymm = c(202401, 202401,202401,202401,202402,202402,202402,202402),
  id = c(1,2,3,4,1,2,3,4),
  value = c(1,200,400,99999,2,400,200,180),
  quintile = c(1,3,4,5,1,4,3,2) #this is the desired output
)

breakpoints <- data.frame(
  yyyymm = c(202401,202402),
  q1 = c(-Inf,-Inf),
  q2 = c(100,105),
  q3 = c(200,185),
  q4 = c(250,240),
  q5 = c(500,500)
)

where 'data' is panel format and contains one data point for each id in each month. 'breakpoints' contains monthly breakpoints in long format which i want to use as bin breakpoints. For example, quintile 1 in the first month is [-Inf, 100).

I would like to calculate the bin that each 'id' in each 'yyyymm' belongs to according to the breakpoints which are also set monthly. For example in '202401' and for id=2, quintile=3.

Ideally this would be done with dplyr to fit in with rest of the code and would not require case_when or similar since the real breakpoint data has 20 breakpoints.

I tried to left join the breakpoints to 'data' and pass off the breakpoints to the cut function as follows:

data %>%
  left_join(breakpoints, by='yyyymm') %>%
  mutate(
    quintile_calculation = cut(value, breaks=c(q1,q2,q3,q4,q5)))

but i get error:

Error in `mutate()`:
ℹ In argument: `quintile_calculation = cut(value, breaks = c(q1, q2, q3, q4, q5))`.
Caused by error in `cut.default()`:
! 'breaks' are not unique

Upvotes: 3

Views: 72

Answers (2)

jay.sf
jay.sf

Reputation: 73592

Your problem is, if you want quintiles, i.e. five bins, you'll need six breakpoints. Therefore, use a sequence from 0 to 1 of length 6 in quantile as probs and cut there.

> data |> 
+   transform(
+     quintile_1=cut(value, 
+                   quantile(value, seq.int(0, 1, length.out=6)), 
+                   labels=1:5, include.lowest=TRUE)
+     )
  yyyymm id value quintile quintile_1
1 202401  1     1        1          1
2 202401  2   200        3          3
3 202401  3   400        4          4
4 202401  4 99999        5          5
5 202402  1     2        1          1
6 202402  2   400        4          4
7 202402  3   200        3          3
8 202402  4   180        2          2

Not sure why want to merge or join something; you can do it in one step.

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 146090

Your problem is that after you join the data, it looks like this:

data %>%
   left_join(breakpoints, by='yyyymm')
#   yyyymm id value quintile   q1  q2  q3  q4  q5
# 1 202401  1     1        1 -Inf 100 200 250 500
# 2 202401  2   200        3 -Inf 100 200 250 500
# 3 202401  3   400        4 -Inf 100 200 250 500
# 4 202401  4 99999        5 -Inf 100 200 250 500
# 5 202402  1     2        1 -Inf 105 185 240 500
# 6 202402  2   400        4 -Inf 105 185 240 500
# 7 202402  3   200        3 -Inf 105 185 240 500
# 8 202402  4   180        2 -Inf 105 185 240 500

So the values in q1, q2, etc. are repeated on each row, but the breaks argument needs to be unique. A simple fix to make them unique is the function unique(). You also want to do the cut by group defined by yyyymm, so you'll need to use either group_by or (my preference) the .by argument.

data %>%
  left_join(breakpoints, by='yyyymm') %>%
  mutate(
    qtile = cut(value, breaks = unique(c(q1,q2,q3,q4,q5))),
    .by = yyyymm
  )
#   yyyymm id value quintile   q1  q2  q3  q4  q5      qtile
# 1 202401  1     1        1 -Inf 100 200 250 500 (-Inf,100]
# 2 202401  2   200        3 -Inf 100 200 250 500  (100,200]
# 3 202401  3   400        4 -Inf 100 200 250 500  (250,500]
# 4 202401  4 99999        5 -Inf 100 200 250 500       <NA>
# 5 202402  1     2        1 -Inf 105 185 240 500 (-Inf,105]
# 6 202402  2   400        4 -Inf 105 185 240 500  (240,500]
# 7 202402  3   200        3 -Inf 105 185 240 500  (185,240]
# 8 202402  4   180        2 -Inf 105 185 240 500  (105,185]

To get the integer output you want from cut, you'll also want to pass labels = FALSE to cut, and perhaps you want right = TRUE, and n breaks defines n - 1 quantiles, so if you want 5 groups you'll need 6 breaks.

You may find findInterval easier to use to get what you want, the defaults align with what you've already put together:

data %>%
  left_join(breakpoints, by='yyyymm') %>%
  mutate(
    qtile = findInterval(value, vec = unique(c(q1,q2,q3,q4,q5))),
    .by = yyyymm
  ) |>
  select(-matches("^q\\d+$"))
#   yyyymm id value quintile qtile
# 1 202401  1     1        1     1
# 2 202401  2   200        3     3
# 3 202401  3   400        4     4
# 4 202401  4 99999        5     5
# 5 202402  1     2        1     1
# 6 202402  2   400        4     4
# 7 202402  3   200        3     3
# 8 202402  4   180        2     2

Upvotes: 3

Related Questions