Reputation: 168
I have been using R for a long time now, and I have working code for a specific little task. But I was wondering if there is any faster solution.
The problem is simple: I have a data frame tbl with two columns, ID and NrBlocks. IDs are not unique and may appear multiple times, with the same or different NrBlocks values. The real table has more columns, but those details are irrelevant here. All I want is the sum of the NrBlocks values for each unique ID.
Working code (renamed for this post; I hope the simplification did not introduce typos):
uniqueIDs = unique(tbl$ID) # Precompute once
sapply(1:length(uniqueIDs),
       FUN = function(x) {
         sum(tbl[which(tbl$ID == uniqueIDs[x]), ]$NrBlocks)
       })
Any suggestions for a speed improvement?
Upvotes: 1
Views: 173
Reputation: 99331
If you have 500 milliseconds to spare, you can do this with aggregate in one easy line (the question asks for sums, so the function to apply is sum, not mean):
aggregate(NrBlocks ~ ID, dat, sum)
microbenchmark(aggregate(NrBlocks ~ ID, dat, sum))
# Unit: milliseconds
# expr min lq median uq max neval
# f() 655.218 655.6562 656.6428 661.6947 667.0888 100
where dat is a data frame created from nrussell's table.
dim(dat)
# [1] 10000 2
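For comparison, base R's tapply computes the same grouped sums without the formula-parsing overhead of aggregate, and is typically faster. A minimal sketch, using a small stand-in data frame since the original dat (built from nrussell's table) is not shown here:

```r
## Stand-in for dat: the real one has 10000 rows of (ID, NrBlocks)
dat <- data.frame(ID = c("a", "b", "a", "c", "b"),
                  NrBlocks = c(1, 2, 3, 4, 5))

## tapply splits NrBlocks by ID and applies sum to each group,
## returning a named vector of per-ID totals
tapply(dat$NrBlocks, dat$ID, sum)
## a b c
## 4 7 4
```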
Upvotes: 0
Reputation: 18602
Obligatory data.table solution:
options(stringsAsFactors=FALSE)
library(data.table)
##
set.seed(1234)
dTbl <- data.table(
ID = sample(c(letters,LETTERS),100000,replace=TRUE),
NrBlocks = rnorm(100000),
key = "ID")
##
gTbl <- dTbl[
,
list(sumNrBlocks = sum(NrBlocks)),
by = list(ID)]
##
> head(gTbl)
ID sumNrBlocks
1: A 56.50234
2: B -13.61380
3: C 24.66750
4: D 65.18829
5: E 26.14085
6: F 41.64376
Timings:
library(microbenchmark)
##
uniqueIDs <- unique(dTbl$ID)
f1 <- function(){
sapply(1:length(uniqueIDs),
FUN = function(x){
sum(dTbl[which(dTbl$ID == uniqueIDs[x]),]$NrBlocks)
}
)
}
##
f2 <- function(){
dTbl[
,
list(sumNrBlocks = sum(NrBlocks)),
by = list(ID)]
}
##
Res <- microbenchmark(
f1(),
f2(),
times=100L)
> Res
Unit: milliseconds
expr min lq median uq max neval
f1() 139.054620 141.534227 144.213253 156.747569 193.278071 100
f2() 1.813652 1.911069 1.980874 2.140971 3.522545 100
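As a sanity check (not part of the original answer), the two approaches should return identical per-ID sums, since keying the table by ID makes both unique() and by = ID produce groups in the same sorted order. A small self-contained version:

```r
library(data.table)

set.seed(1234)
## Keyed table, so rows are sorted by ID
dt <- data.table(ID = sample(letters[1:3], 100, replace = TRUE),
                 NrBlocks = rnorm(100),
                 key = "ID")

## sapply approach from the question
ids <- unique(dt$ID)
s1 <- sapply(seq_along(ids),
             function(x) sum(dt$NrBlocks[dt$ID == ids[x]]))

## data.table grouped-sum approach
s2 <- dt[, list(sumNrBlocks = sum(NrBlocks)), by = list(ID)]$sumNrBlocks

all.equal(s1, s2)  # TRUE: same groups, same element order, same sums
```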
Multiple columns:
dTbl2 <- copy(dTbl)
set.seed(1234)
dTbl2[,col3:=rexp(100000)]
dTbl2[,col4:=col3*2]
##
gTbl2 <- dTbl2[
,
lapply(.SD,sum),
by=list(ID)]
##
> head(gTbl2)
ID NrBlocks col3 col4
1: A 56.50234 1933.443 3866.886
2: B -13.61380 1904.282 3808.563
3: C 24.66750 1834.655 3669.310
4: D 65.18829 1884.364 3768.728
5: E 26.14085 1874.761 3749.523
6: F 41.64376 1977.219 3954.438
Multiple columns, restricting which ones to aggregate via .SDcols:
gTbl2.2 <- dTbl2[
,
lapply(.SD,sum),
by=list(ID),
.SDcols=c(2,4)]
##
> head(gTbl2.2)
ID NrBlocks col4
1: A 56.50234 3866.886
2: B -13.61380 3808.563
3: C 24.66750 3669.310
4: D 65.18829 3768.728
5: E 26.14085 3749.523
6: F 41.64376 3954.438
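Note that .SDcols also accepts column names, which is more robust than positional indices like c(2,4) if columns are later added or reordered. A self-contained sketch with a toy table standing in for dTbl2:

```r
library(data.table)

## Toy stand-in for dTbl2 (ID, NrBlocks, col3, col4)
dt2 <- data.table(ID = c("a", "a", "b"),
                  NrBlocks = c(1, 2, 3),
                  col3 = c(0.5, 0.5, 1),
                  col4 = c(1, 1, 2))

## Name-based .SDcols: only NrBlocks and col4 are summed per ID
dt2[, lapply(.SD, sum), by = list(ID), .SDcols = c("NrBlocks", "col4")]
## a: NrBlocks 3, col4 2; b: NrBlocks 3, col4 2
```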
Upvotes: 1