Moysey Abramowitz
Moysey Abramowitz

Reputation: 380

Quartile sorter with externally specified quartile breakpoints in R data.table

I want to sort observations into quartiles on the variable "varbl". Since my data is pretty big (2Gb), I am trying to implement it via data.table. The problem is that I need to use external quartile breaks, which are group-specific. The group variable is "prd" or "prd1".

My data and breakpoints are as follows:

data <- data.table(id = c(1,2,3,4,5,1,2,3,4,5), prd1 = c(1,1,1,1,1,2,2,2,2,2), varbl = c(-1.6, -0.7, 0.1, 1.2, -0.5, -0.8, 0.4, 1.2, 1.9, 4))
bks <- data.table(prd=c(1,2), br0 = c(-5,-5), br1=c(-1,0), br2=c(0, 0.5), br3=c(1, 3), br4=c(5,5))
 > data
    id prd1 varbl
 1:  1    1  -1.6
 2:  2    1  -0.7
 3:  3    1   0.1
 4:  4    1   1.2
 5:  5    1  -0.5
 6:  1    2  -0.8
 7:  2    2   0.4
 8:  3    2   1.2
 9:  4    2   1.9
10:  5    2   4.0
> bks
   prd br0 br1 br2 br3 br4
1:   1  -5  -1 0.0   1   5
2:   2  -5   0 0.5   3   5

The desired output is:

> output
    id prd1 varbl ntile
 1:  1    1  -1.6     1
 2:  2    1  -0.7     2
 3:  3    1   0.1     3
 4:  4    1   1.2     4
 5:  5    1  -0.5     2
 6:  1    2  -0.8     1
 7:  2    2   0.4     2
 8:  3    2   1.2     3
 9:  4    2   1.9     3
10:  5    2   4.0     4

I tried the following code, but it fails, since I can not subset bks on the same prd as the current prd1 from data:

data[, ntile := cut(varbl, breaks = bks[prd==prd1], include.lowest=TRUE, labels = 1:4)]

As another attempt, I tried to join data and bks first (I would prefer not to as it will increase the size of data from 2Gb to 4Gb) and then sort observations into quantiles. It fails, since I can not understand how to use column names to construct a vector of breakpoints for every row. None of the attempts worked.

setnames(data, "prd1", "prd")
data <- data[bks, on="prd", nomatch=0]
data[, ntile := cut(varbl, breaks = .(br0, br1, br2, br3, br4), include.lowest=TRUE, labels=1:4)]
data[, ntile := cut(varbl, breaks = colnames(bks)[-1], include.lowest=TRUE, labels=1:4)]
data[, ntile := cut(varbl, breaks = c("br0", "br1", "br2", "br3", "br4"), include.lowest=TRUE, labels=1:4)]

Upvotes: 2

Views: 114

Answers (1)

thelatemail
thelatemail

Reputation: 93843

Rearranging bks a little means you can do this as a join:

bks <- bks[, data.frame(embed(unlist(.SD),2)[,2:1]), by=prd]
bks[, grp := seq_len(.N), by=prd]

#   prd   X1   X2 grp
#1:   1 -5.0 -1.0   1
#2:   1 -1.0  0.0   2
#3:   1  0.0  1.0   3
#4:   1  1.0  5.0   4
#5:   2 -5.0  0.0   1
#6:   2  0.0  0.5   2
#7:   2  0.5  3.0   3
#8:   2  3.0  5.0   4

data[bks, on=c("prd1"="prd","varbl>=X1","varbl<X2"), grp := i.grp]

#    id prd1 varbl  grp
# 1:  1    1  -1.6    1
# 2:  2    1  -0.7    2
# 3:  3    1   0.1    3
# 4:  4    1   1.2    4
# 5:  5    1  -0.5    2
# 6:  1    2  -0.8    1
# 7:  2    2   0.4    2
# 8:  3    2   1.2    3
# 9:  4    2   1.9    3
#10:  5    2   4.0    4

Upvotes: 3

Related Questions