Reputation: 499
I have a frame with many columns c1 c2 c3 c4 ... c30 d
I want to aggregate by and find all rows unique in c1..30 and then get the min(d) for that row. In sql this would be a group by c1, ..., c30.
d is of type date.
I have found some solutions here in stack but none seem to work for 1) so many columns 2) do a min instead of sum.
Any input would be great.
Upvotes: 3
Views: 2443
Reputation: 19454
Here's an answer using the data.table
package with some fake data:
library(data.table)
DT<-data.table(matrix(sample(1:2,3000,replace=TRUE),ncol=30))
DT2<-DT[sample(seq_len(nrow(DT)),9000,replace=TRUE)]
# EDIT: now "d" is a date.
DT2[,d:=as.POSIXct(origin = "1960-01-01",rnorm(nrow(DT2), sd = 1000))]
setnames(DT2,c(paste0("c",1:30),"d"))
## pick up herewith your own data, starting with the commented next line
# DT2 <- as.data.table(dataset)
setkeyv(DT2,paste0("c",1:30))
DT3<-DT2[,list(minD=min(d)),by=key(DT2)]
dim(DT2)
# [1] 9000 31
dim(DT3)
# [1] 100 31
Small addition from Matthew :
+10, and nice fake data. Setting a key first so you can do by=key(DT)
can get a bit onerous sometimes, so I'd usually just do an ad hoc by for something like this for simplicity. But, trying the most natural thing first :
DT2[,min(d),by=paste0("c",1:30)]
Error in `[.data.table`(DT2, , min(d), by = paste0("c", 1:30)) :
'by' appears to evaluate to column names but isn't c() or key(). Use by=
list(...) if you can. Otherwise, by=eval(paste0("c", 1:30)) should work.
This is for efficiency so data.table can detect which columns are needed.
The error message tells us what we need to do instead :
ans = DT2[,min(d),by=eval(paste0("c",1:30))]
dim(ans)
[1] 100 31
The natural next thought is of course: well, if data.table is clever enough to know by
is column names and put that in the error message, why can't it just do it? The answer is that it's only making a guess based on the data. In some edge cases it's not so clear. So currently extra intent is needed from user: wrapping with eval
. I'm not completely happy with that though, so perhaps we can improve that.
EDIT: renaming the new data.table
In my approach, I named the new column minD
when I created it by entering
DT3<-DT2[,list(minD=min(d)),by=key(DT2)]
Using Matthew Dowle's approach, you would achieve this pretty much the same way by entering
ans = DT2[,list(minD=min(d)),by=eval(paste0("c",1:30))]
If you've already created the column and want to rename it, use setnames
as follows:
setnames(DT3,old="minD",new="theNewMinD")
This avoids copying the whole data.table
and preserves the memory over allocation (both of these advantages are lost when using names(DT3)<-"something"
), as outlined in the documentation under ?setnames
Upvotes: 4
Reputation: 109844
Untested without a minimal reproducible example:
apply(unique(dataset), 1, min)
Break it down:
x <- unique(dataset) #finds only unique rows
apply(x, 1, min) # applies min across rows
# replace the 1 with a 2 and it applies across columns
Upvotes: 1