Jorge Guzman
Jorge Guzman

Reputation: 499

"group by" like command in R with min as aggregate function and multiple columns

I have a frame with many columns c1 c2 c3 c4 ... c30 d

I want to aggregate by and find all rows unique in c1..30 and then get the min(d) for that row. In sql this would be a group by c1, ..., c30.

d is of type date.

I have found some solutions here in stack but none seem to work for 1) so many columns 2) do a min instead of sum.

Any input would be great.

Upvotes: 3

Views: 2443

Answers (2)

BenBarnes
BenBarnes

Reputation: 19454

Here's an answer using the data.table package with some fake data:

library(data.table)

DT<-data.table(matrix(sample(1:2,3000,replace=TRUE),ncol=30))
DT2<-DT[sample(seq_len(nrow(DT)),9000,replace=TRUE)]
# EDIT: now "d" is a date.
DT2[,d:=as.POSIXct(origin = "1960-01-01",rnorm(nrow(DT2), sd = 1000))]
setnames(DT2,c(paste0("c",1:30),"d"))

## pick up herewith your own data, starting with the commented next line
# DT2 <- as.data.table(dataset)
setkeyv(DT2,paste0("c",1:30))

DT3<-DT2[,list(minD=min(d)),by=key(DT2)]

dim(DT2)
# [1] 9000   31
dim(DT3)
# [1] 100  31

Small addition from Matthew :

+10, and nice fake data. Setting a key first so you can do by=key(DT) can get a bit onerous sometimes, so I'd usually just do an ad hoc by for something like this for simplicity. But, trying the most natural thing first :

DT2[,min(d),by=paste0("c",1:30)]
Error in `[.data.table`(DT2, , min(d), by = paste0("c", 1:30)) : 
    'by' appears to evaluate to column names but isn't c() or key(). Use by=
    list(...) if you can. Otherwise, by=eval(paste0("c", 1:30)) should work.
    This is for efficiency so data.table can detect which columns are needed.

The error message tells us what we need to do instead :

ans = DT2[,min(d),by=eval(paste0("c",1:30))]
dim(ans)
[1] 100  31

The natural next thought is of course: well, if data.table is clever enough to know by is column names and put that in the error message, why can't it just do it? The answer is that it's only making a guess based on the data. In some edge cases it's not so clear. So currently extra intent is needed from user: wrapping with eval. I'm not completely happy with that though, so perhaps we can improve that.

EDIT: renaming the new data.table

In my approach, I named the new column minD when I created it by entering

DT3<-DT2[,list(minD=min(d)),by=key(DT2)]

Using Matthew Dowle's approach, you would achieve this pretty much the same way by entering

ans = DT2[,list(minD=min(d)),by=eval(paste0("c",1:30))]

If you've already created the column and want to rename it, use setnames as follows:

setnames(DT3,old="minD",new="theNewMinD")

This avoids copying the whole data.table and preserves the memory over allocation (both of these advantages are lost when using names(DT3)<-"something"), as outlined in the documentation under ?setnames

Upvotes: 4

Tyler Rinker
Tyler Rinker

Reputation: 109844

Untested without a minimal reproducible example:

apply(unique(dataset), 1, min)

Break it down:

x <- unique(dataset)  #finds only unique rows
apply(x, 1, min) # applies min across rows
                 # replace the 1 with a 2 and it applies across columns

Upvotes: 1

Related Questions