Reputation: 328
I'm new with data.table but I've managed to reduce a computation in a dataset of 600K rows from thousands of seconds (using *ply loops) to 1.7sec. Basically I need the row with the lowest value in the column class in the groups of the same group and start. I'm using
DT[, list(class=min(class)), by=list(group, start)]
But to do that I created DT with only these 3 columns from a data.frame with more columns. So, to merge my results with the original data.frame I'm thinking of using the row.name, so I created DT with row.name=TRUE and this is an example of what I have:
group start class rn
1: A 4943 4 1
2: A 5030 0 2
3: A 5030 4 3
4: A 5030 2 4
5: A 5083 4 5
6: A 5083 3 6
7: B 5041 0 7
8: B 5041 1 8
9: B 5083 4 9
...
My desired result is only the rn corresponding to the minimium class value:
group start class rn
1: A 4943 4 1
2: A 5030 0 2
3: A 5083 3 6
4: B 5041 0 7
5: B 5083 4 9
...
But if I use:
DT[, list(class=min(class)), by=list(group, start, rn)]
or
DT[, list(class=min(class), rn), by=list(group, start)]
I get all the rows, not only the rows with class minimum.
Extra question
I'd be possible to get a count of the cases of each class type in the group using data.table sintax using my command?
group start class rn class0 class1 class2 class3 class4
1: A 4943 4 1 0 0 0 0 1
2: A 5030 0 2 1 0 1 0 1
3: A 5083 3 6 0 0 0 1 1
4: B 5041 0 7 1 1 0 0 0
5: B 5083 4 9 0 0 0 0 1
...
Upvotes: 1
Views: 382
Reputation: 118779
For your first question, you're basically calling min
on each group. This is not necessary. If you sort the column class
as well (by setting the key
), then you can use mult="first"
feature to just pick the smallest element directly. That is,
setkey(dt, group, start, class)
dt[CJ(unique(group), unique(start)), mult="first", nomatch=0]
group start class rn
1: A 4943 4 1
2: A 5030 0 2
3: A 5083 3 6
4: B 5041 0 7
5: B 5083 4 9
Alternatively if you don't want to use CJ
here, then you can do this:
setkey(dt, group, start, class)
dt[, list(class=class[1], rn=rn[1]), by=list(group, start)]
Edit 2:
Here's a complete answer:
dt.out <- dt[, c(list(class = class[1], rn=rn[1]),
{tt <- rep(0,5); tt[class+1] <- 1; as.list(tt)}), by=list(group, start)]
setnames(dt.out, 5:9, paste0("Class", 0:4))
group start class rn Class0 Class1 Class2 Class3 Class4
1: A 4943 4 1 0 0 0 0 1
2: A 5030 0 2 1 0 1 0 1
3: A 5083 3 6 0 0 0 1 1
4: B 5041 0 7 1 1 0 0 0
5: B 5083 4 9 0 0 0 0 1
Upvotes: 2