Reputation: 1043
Within my group (the "name" variable), I want cut the value into quartile. And create a quartile label column for variable "value". Since the group size varies, for the quartile range for different group changes as well. But below code, only cut the quartile by the overall value, resulting the same quartile range for all groups.
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt
dt.2<-dt%>% group_by(name)%>% mutate(newcol=
cut(value,breaks=quantile(value,probs=seq(0,1,0.25),na.rm=TRUE),include.lowest=TRUE))
dt.2
str(dt.2)
Data:
name value
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 b 1
10 b 2
11 b 3
12 b 4
13 c 1
14 c 2
15 c 3
16 c 4
17 c 5
output from above code. Update: the problem is not that newcol is factor but the necol has the same quartile range across all the different group. For example name b, the value is 1-4 but the quartile range has 3-5, which is derived from min(value) to max(value) regardless of the group.
name value newcol
<fctr> <int> <fctr>
1 a 1 [1,2]
2 a 2 [1,2]
3 a 3 (2,3]
4 a 4 (3,5]
5 a 5 (3,5]
6 a 6 (5,8]
7 a 7 (5,8]
8 a 8 (5,8]
9 b 1 [1,2]
10 b 2 [1,2]
11 b 3 (2,3]
12 b 4 (3,5]
13 c 1 [1,2]
14 c 2 [1,2]
15 c 3 (2,3]
16 c 4 (3,5]
17 c 5 (3,5]
Desired output
name value newcol/quartile label
1 a 1 1
2 a 2 1
3 a 3 2
4 a 4 2
5 a 5 3
6 a 6 3
7 a 7 4
8 a 8 4
9 b 1 1
10 b 2 2
11 b 3 3
12 b 4 4
13 c 1 1
14 c 2 2
15 c 3 3
16 c 4 4
17 c 5 4
Upvotes: 1
Views: 1680
Reputation: 10761
Here's a way you can do it, following the split-apply-combine framework.
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
split_dt <- lapply(split(dt, dt$name),
transform,
quantlabel = as.numeric(
cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))
dt <- unsplit(split_dt, dt$name)
name value quantlabel
1 a 1 1
2 a 2 1
3 a 3 2
4 a 4 2
5 a 5 3
6 a 6 3
7 a 7 4
8 a 8 4
9 b 1 1
10 b 2 2
11 b 3 3
12 b 4 4
13 c 1 1
14 c 2 1
15 c 3 2
16 c 4 3
17 c 5 4
edit: there's a data.table
way
following this post, we can use the data.table
package, if performance is a concern:
library(data.table)
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt.t <- as.data.table(dt)
dt.t[,quantlabels := as.numeric(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)), name]
name value quantlabels
1: a 1 1
2: a 2 1
3: a 3 2
4: a 4 2
5: a 5 3
6: a 6 3
7: a 7 4
8: a 8 4
9: b 1 1
10: b 2 2
11: b 3 3
12: b 4 4
13: c 1 1
14: c 2 1
15: c 3 2
16: c 4 3
17: c 5 4
edit: and there's a dplyr
way
We can follow @akrun's advice and use as.numeric
(which is what we've done for the other solutions):
dt %>%
group_by(name) %>%
mutate(quantlabel =
as.numeric(
cut(value,
breaks = quantile(value, probs = seq(0,1,.25)),
include.lowest = T)))
Note that if you instead wanted the labels themselves, use as.character
:
dt %>%
group_by(name) %>%
mutate(quantlabel = as.character(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))
Source: local data frame [17 x 3]
Groups: name [3]
name value quantlabel
<fctr> <int> <chr>
1 a 1 [1,2.75]
2 a 2 [1,2.75]
3 a 3 (2.75,4.5]
4 a 4 (2.75,4.5]
5 a 5 (4.5,6.25]
6 a 6 (4.5,6.25]
7 a 7 (6.25,8]
8 a 8 (6.25,8]
9 b 1 [1,1.75]
10 b 2 (1.75,2.5]
11 b 3 (2.5,3.25]
12 b 4 (3.25,4]
13 c 1 [1,2]
14 c 2 [1,2]
15 c 3 (2,3]
16 c 4 (3,4]
17 c 5 (4,5]
Upvotes: 2