Reputation: 1294
I have managed to aggregate data successfully using the following pattern:
newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]
However, the moment I try to do anything more complicated, although the code runs, it no longer aggregates by Z: it seems to create a dataframe with the same number of observations as the original df so I know that no grouping is actually occurring.
The custom function I would like to apply is to find the n-quantile for the current list of values and then do some other stuff with it. I saw use of sdcols in another SO answer and tried something like:
customfunc <- function(dt){
q = unname(quantile(dt$column,0.25))
n = nrow(dt[dt$column <= q])
return(n/dt$someOtherColumn)
}
#fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c(column, someOtherColumn)]
Can someone please help me figure out what is wrong with the way I'm trying to use group by and custom functions? Thank you very much.
Literal example as requested:
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
>
As you can see, DF is not grouped. There should just be two rows, one for "abc", and another for "def" since I am trying to group by Z.
Upvotes: 5
Views: 5131
Reputation: 1294
As guided by eddi's point above, the basic problem is thinking that your custom function is being called inside a loop and that 'dt$column
' will mysteriously give you the 'current value at the current row'. Instead it gives you the entire column (a vector). The function is passed the entire data table, not row-wise bits of data.
So, replacing the value in the return statement with something that represents a single value works. Example:
customfunc <- function(dt){
q = unname(quantile(dt$column,0.25))
n = nrow(dt[dt$column <= q])
return(n/length(dt$someOtherColumn))
}
> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
Z column someOtherColumn
1 abc 1 5
2 abc 2 6
3 def 3 7
4 abc 4 8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
Z V1
1: abc 0.3333333
2: def 1.0000000
Now the data is aggregated correctly.
Upvotes: 5