MLE
MLE

Reputation: 1043

Label quantile by group with varying group sizes

Within my group (the "name" variable), I want cut the value into quartile. And create a quartile label column for variable "value". Since the group size varies, for the quartile range for different group changes as well. But below code, only cut the quartile by the overall value, resulting the same quartile range for all groups.

dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt
dt.2<-dt%>% group_by(name)%>% mutate(newcol=
cut(value,breaks=quantile(value,probs=seq(0,1,0.25),na.rm=TRUE),include.lowest=TRUE))
dt.2
str(dt.2)

Data:

   name value
1     a     1
2     a     2
3     a     3
4     a     4
5     a     5
6     a     6
7     a     7
8     a     8
9     b     1
10    b     2
11    b     3
12    b     4
13    c     1
14    c     2
15    c     3
16    c     4
17    c     5

output from above code. Update: the problem is not that newcol is factor but the necol has the same quartile range across all the different group. For example name b, the value is 1-4 but the quartile range has 3-5, which is derived from min(value) to max(value) regardless of the group.

 name value newcol
   <fctr> <int> <fctr>
1       a     1  [1,2]
2       a     2  [1,2]
3       a     3  (2,3]
4       a     4  (3,5]
5       a     5  (3,5]
6       a     6  (5,8]
7       a     7  (5,8]
8       a     8  (5,8]
9       b     1  [1,2]
10      b     2  [1,2]
11      b     3  (2,3]
12      b     4  (3,5]
13      c     1  [1,2]
14      c     2  [1,2]
15      c     3  (2,3]
16      c     4  (3,5]
17      c     5  (3,5]

Desired output

   name value newcol/quartile label

1       a     1  1
2       a     2  1
3       a     3  2
4       a     4  2
5       a     5  3
6       a     6  3
7       a     7  4
8       a     8  4
9       b     1  1
10      b     2  2
11      b     3  3
12      b     4  4
13      c     1  1
14      c     2  2
15      c     3  3
16      c     4  4
17      c     5  4

Upvotes: 1

Views: 1680

Answers (1)

bouncyball
bouncyball

Reputation: 10761

Here's a way you can do it, following the split-apply-combine framework.

dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))

split_dt <- lapply(split(dt, dt$name), 
                   transform, 
                   quantlabel = as.numeric(
cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))

dt <- unsplit(split_dt, dt$name)

   name value quantlabel
1     a     1          1
2     a     2          1
3     a     3          2
4     a     4          2
5     a     5          3
6     a     6          3
7     a     7          4
8     a     8          4
9     b     1          1
10    b     2          2
11    b     3          3
12    b     4          4
13    c     1          1
14    c     2          1
15    c     3          2
16    c     4          3
17    c     5          4

edit: there's a data.table way

following this post, we can use the data.table package, if performance is a concern:

library(data.table)
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt.t <- as.data.table(dt)
dt.t[,quantlabels := as.numeric(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)), name]

    name value quantlabels
 1:    a     1           1
 2:    a     2           1
 3:    a     3           2
 4:    a     4           2
 5:    a     5           3
 6:    a     6           3
 7:    a     7           4
 8:    a     8           4
 9:    b     1           1
10:    b     2           2
11:    b     3           3
12:    b     4           4
13:    c     1           1
14:    c     2           1
15:    c     3           2
16:    c     4           3
17:    c     5           4

edit: and there's a dplyr way

We can follow @akrun's advice and use as.numeric (which is what we've done for the other solutions):

dt %>%
    group_by(name) %>%
    mutate(quantlabel = 
               as.numeric(
                   cut(value, 
                       breaks = quantile(value, probs = seq(0,1,.25)), 
                       include.lowest = T)))

Note that if you instead wanted the labels themselves, use as.character:

dt %>%
    group_by(name) %>%
    mutate(quantlabel = as.character(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))

Source: local data frame [17 x 3]
Groups: name [3]

     name value quantlabel
   <fctr> <int>      <chr>
1       a     1   [1,2.75]
2       a     2   [1,2.75]
3       a     3 (2.75,4.5]
4       a     4 (2.75,4.5]
5       a     5 (4.5,6.25]
6       a     6 (4.5,6.25]
7       a     7   (6.25,8]
8       a     8   (6.25,8]
9       b     1   [1,1.75]
10      b     2 (1.75,2.5]
11      b     3 (2.5,3.25]
12      b     4   (3.25,4]
13      c     1      [1,2]
14      c     2      [1,2]
15      c     3      (2,3]
16      c     4      (3,4]
17      c     5      (4,5]

Upvotes: 2

Related Questions