Reputation: 523
I'm trying to group data with pre-determined breakpoints by year. I can easily do this with the cut()
function for a given year, but I'm struggling to get it to work within the data.table
function. Here is the test data:
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
# now add data
DT[,`:=` (ratio = rep(sample(10),each=2700)+rnorm(nrow(DT)))]
That gives three columns: yr, id, and ratio, with the latter being the data I want to group.
Now here are the yearly breakpoints:
DTy <- data.table(matrix(rep(1:10,each=nrow(YR)),nrow(YR),10)+rnorm(54)/10)
DTy[,yr := 1962:2015]
So for each year from 1962 to 2015 there is a set of cutoffs. To use 1962 as an example, here is what I'd like to do:
group <- cut(DT[yr==1962,ratio],DTy[yr==1962, -c("yr")], labels = FALSE)
And here is what it should look like.
ratio <- DT[yr==1962,ratio]
DTy[yr==1962,-c("yr")]
test <- data.table(ratio,group)
test[,yr:=1962]
> test
ratio group yr
1: 6.689275 6 1962
2: 4.718753 4 1962
3: 5.786855 5 1962
4: 7.896540 7 1962
5: 7.776863 7 1962
---
996: 6.176614 6 1962
997: 7.689046 7 1962
998: 4.652658 4 1962
999: 7.075622 7 1962
1000: 5.543791 5 1962
I tried this:
# merge two datasets together
newDT <- merge(x = DT, y = DTy, by = c("yr"))
# get names of columns with breakpoints
cnames <- names(newDT)[newDT[,grep("^V", names(newDT))]]
# apply the cut function by year.
newDT[,groupt := lapply(.SD, cut, as.vector(unique(.SD[,cnames,with=FALSE])), labels = FALSE, include.lowest = TRUE), by = .(year), .SDcols = c("ratio", cnames)]
But I get this error:
Error in `[.data.table`(newDT, , `:=`(groupt, lapply(.SD, cut, as.vector(unique(.SD[, :
column or expression 1 of 'by' or 'keyby' is type closure. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
I'm honestly not sure what that means. I'm trying to use .SD like a data.table but need different columns for different parts of the cut()
function and I'm not sure how to pass them via lapply
.
Upvotes: 2
Views: 1444
Reputation: 17289
Here is what I got by simply modifying your code:
newDT <- merge(x = DT, y = DTy, by = c("yr"))
# get names of columns with breakpoints
cnames <- names(newDT)[grep("^V", names(newDT))]
# apply the cut function by year.
res <- newDT[, group := cut(ratio, unlist(.SD[1]), labels = F),
by = .(yr), .SDcols = cnames][
, .(yr, id, ratio, group)]
# yr id ratio cutted
# 1: 1962 10001 6.689275 6
# 2: 1962 10002 4.718753 4
# 3: 1962 10003 5.786855 5
# 4: 1962 10004 7.896540 7
# 5: 1962 10005 7.776863 7
# ---
# 53996: 2015 10996 10.613272 NA
# 53997: 2015 10997 11.260932 NA
# 53998: 2015 10998 8.591909 8
# 53999: 2015 10999 9.143039 9
# 54000: 2015 11000 7.470945 7
Upvotes: 1
Reputation: 38500
Here is one method using a join and by=.EACHI
:
DT[DTy, on="yr",
cutted := cut(ratio, c(i.V1, i.V2, i.V3, i.V4, i.V5, i.V6, i.V7, i.V8, i.V9, i.V10),
labels=FALSE), by=.EACHI]
Here, the data.table with the breaks is joined to the main data.table by year, then a new variable is assigned with :=
using cut. This is applied to separately each group to which the values are joined using by=.EAHCHI
.
This returns
head(DT)
yr id ratio cutted
1: 1962 10001 6.689275 6
2: 1962 10002 4.718753 4
3: 1962 10003 5.786855 5
4: 1962 10004 7.896540 7
5: 1962 10005 7.776863 7
6: 1962 10006 6.566604 6
You can use mget
and ls
with a pattern search to eliminate the need to enumerate the variables used in the breaks argument in cut
.
DT[DTy, on="yr", cutted := cut(ratio, c(mget(ls(pattern="^i\\.V\\d+$"))), labels=FALSE),
by=.EACHI]
Upvotes: 1