Reputation: 34703
It could very well be something silly, but I can't find a (nonexhaustive) workaround of the following problem:
Let's set:
data<-data.table(id=c("a","a","a","b","b"),
x1=1:5,
x2=6:10,
x3=11:15)
I want, say, the means of each column variable grouped according to "id", then append them as a variable to the dataset. With so few, of course, we can use the syntax:
means1<-data[,mean(x1),by=id]
And similarly for x2, x3, to get:
id x1 x2 x3 V1 V1.1 V1.2
1: a 1 6 11 2.0 7.0 12.0
2: a 2 7 12 2.0 7.0 12.0
3: a 3 8 13 2.0 7.0 12.0
4: b 4 9 14 4.5 9.5 14.5
5: b 5 10 15 4.5 9.5 14.5
However, for larger set, we're tempted to loop. Here's what I tried first:
for(nn in 1:3){
data<-data[data[,mean(paste("x",nn,sep="")),by=id]]
}
But this fails, I guess because even though strings are usually allowed to identify column names, the mean function tries to operate before the outer operator:
Warning messages:
1: In `[.data.table`(data, , mean(paste("x", nn, sep = "")), by = id) :
argument is not numeric or logical: returning NA
So, a next attempt:
for(nn in 1:3){
data<-data[data[,mean(data[[paste("x",nn,sep="")]]),by=id]]
}
However, this also fails, again because the mean operates first. (not to mention the syntax is cringe-worthy to anyone used to doing similar things in STATA) So we end up with the overall mean of each x being assigned for every id value:
id x1 x2 x3 V1 V1.1 V1.2
1: a 1 6 11 3 8 13
2: a 2 7 12 3 8 13
3: a 3 8 13 3 8 13
4: b 4 9 14 3 8 13
5: b 5 10 15 3 8 13
So, how might we perform this simple task in a loop?
Syntax like this has worked for me in similar calls, for example:
for(nn in 1:3){
data[,paste("x_greater_than_4_",nn,sep=""):=(data[[paste("x",nn,sep="")]]>4)]
}
produces what I'd expect:
id x1 x2 x3 x_greater_than_4_1 x_greater_than_4_2 x_greater_than_4_3
1: a 1 6 11 FALSE TRUE TRUE
2: a 2 7 12 FALSE TRUE TRUE
3: a 3 8 13 FALSE TRUE TRUE
4: b 4 9 14 FALSE TRUE TRUE
5: b 5 10 15 TRUE TRUE TRUE
Upvotes: 2
Views: 191
Reputation: 14604
How about:
> data[, x1Mean := mean(x1), by=id] # this command updates the data table
> data
id x1 x2 x3 x1Mean
1: a 1 6 11 2.0
2: a 2 7 12 2.0
3: a 3 8 13 2.0
4: b 4 9 14 4.5
5: b 5 10 15 4.5
Upvotes: 1
Reputation: 59970
I'd first set a key using setkey
and then use lapply
in the j
expression and self-join the result. You can use .SD
in lapply
and the associated .SDcols
to specify columns by numeric position. Like this:
setkey( data , id )
data[ data[ , lapply( .SD , mean ) , keyby = id , .SDcols = 2:4 ] ]
# id x1 x2 x3 x1.1 x2.1 x3.1
#1: a 1 6 11 2.0 7.0 12.0
#2: a 2 7 12 2.0 7.0 12.0
#3: a 3 8 13 2.0 7.0 12.0
#4: b 4 9 14 4.5 9.5 14.5
#5: b 5 10 15 4.5 9.5 14.5
# If you just want the group means use this:
data[ , lapply( .SD , mean ), by = id , .SDcols = 2:4 ]
Alternatively, you can use :=
along with by
as follows, which'll avoid the join as well:
sd_cols = c("x1", "x2", "x3")
data[, c(paste0("v", 1:3)) := lapply(.SD, mean), by=id, .SDcols=sd_cols]
# id x1 x2 x3 v1 v2 v3
# 1: a 1 6 11 2.0 7.0 12.0
# 2: a 2 7 12 2.0 7.0 12.0
# 3: a 3 8 13 2.0 7.0 12.0
# 4: b 4 9 14 4.5 9.5 14.5
# 5: b 5 10 15 4.5 9.5 14.5
Upvotes: 3
Reputation: 66834
Try using aggregate
with the formula interface to get the group means, then merge
into the original data:
merge(data,aggregate(.~id,data=data,mean),by="id",suffixes=c("",".mean"))
id x1 x2 x3 x1.mean x2.mean x3.mean
1: a 1 6 11 2.0 7.0 12.0
2: a 2 7 12 2.0 7.0 12.0
3: a 3 8 13 2.0 7.0 12.0
4: b 4 9 14 4.5 9.5 14.5
5: b 5 10 15 4.5 9.5 14.5
Upvotes: 1