Reputation: 63
I am having issues developing a data.table that gives me max/min based on multiple columns that share a name pattern.
This is a simplified table:
int <- seq(as.POSIXct("2016-04-08"), as.POSIXct("2016-04-10"), by="6 h")
df <- data.frame(date = int, x_01 = runif(9), x_02 = runif(9), x_10 = runif(9), b_31 = runif(9))
df$date <- format(as.POSIXct(df$date), format = "%Y-%m-%dM")
I know how to obtain summary statistics by applying the following code:
sum <- setDT(df)[, list(x_01min=min(x_01), x_01max=max(x_01)), by=list(date)]
My goal is to get summary statistics for all columns with the pattern "x_"
I have attempted nesting for
loops and using lapply
with grep
but cannot seems to get the desired results. The code below should show what I am trying to get at.
sum <- setDT(df)[, list(x_01min=min(x_01), x_01max=max(x_01),
x_02min=min(x_02), x_02max=max(x_02),
x_10min=min(x_10), x_10max=max(x_10)), by=list(date)]
Ideally column names for the summary table should incorporate names from the original table. My actual data set consists of multiple data frames with different number of columns matching the pattern. As I gather more data new variables will be added so it is important to be able to execute the function based on the colname
pattern.
Your help is appreciated!
Upvotes: 3
Views: 677
Reputation: 35314
library(data.table);
setDT(df); ## ensure df is a data.table
cns <- grep(value=T,'^x_',names(df));
df[,do.call(c,lapply(cns,function(cn) { x <- get(cn); setNames(nm=paste0(cn,c('min','max')),.(min(x),max(x))); })),.(date)];
## date x_01min x_01max x_02min x_02max x_10min x_10max
## 1: 2016-04-08M 0.2655087 0.9082078 0.06178627 0.6870228 0.21214252 0.93470523
## 2: 2016-04-09M 0.2016819 0.9446753 0.38410372 0.7698414 0.12555510 0.65167377
## 3: 2016-04-10M 0.6291140 0.6291140 0.99190609 0.9919061 0.01339033 0.01339033
First, the target column names are derived by calling grep()
with the value=T
argument. These names are stored in cns
in the global environment.
Then, the data.table is indexed, grouping on date
.
For each group, lapply()
is executed over the cns
vector, taking the current column name as parameter cn
.
Within the lambda, the column vector is retrieved and stored in a local variable x
by calling get()
on cn
, which works because a data.table's columns are always visible to the j
argument expression.
Finally, the summary statistics are computed in a list using .()
, and their names are set using setNames()
which allows us to dynamically compute them from cn
with paste0()
.
The result of the lapply()
call will be a list of lists, but because we need to produce a single non-nested list for the group aggregation result, we have to run it through do.call(c,...)
to unnest the nested lists. An alternative here would be unlist(recursive=F,...)
. Both alternatives preserve the names of the nested lists, which is what we want.
library(data.table);
library(microbenchmark);
bgoldst <- function(df) { cns <- grep(value=T,'^x_',names(df)); df[,do.call(c,lapply(cns,function(cn) { x <- get(cn); setNames(nm=paste0(cn,c('min','max')),.(min(x),max(x))); })),.(date)]; };
kunal <- function(df) { indices <- grep('x_',colnames(df)); col_names <- colnames(df)[indices]; query_min <- paste0(col_names,'min=min(',col_names,')'); query_max <- paste0(col_names,'max=max(',col_names,')'); query_1 <- paste(c(query_min,query_max),collapse=','); eval(parse(text=paste0('df[,.(',query_1,'),by=date]'))); };
psidom <- function(df) { cols <- names(df)[grepl('x_',names(df))]; newCols <- paste0(rep(cols,each=2),c('max','min')); sumFun <- function(col) list(max(col),min(col)); df[,c(newCols):=unlist(lapply(.SD,sumFun),recursive=F),.(date),.SDcols=cols]; unique(df[,.SD,.SDcols=c('date',newCols)]); };
set.seed(1L);
int <- seq(as.POSIXct('2016-04-08'),as.POSIXct('2016-04-10'),by='6 h');
df <- data.frame(date=int,x_01=runif(9L),x_02=runif(9L),x_10=runif(9L),b_31=runif(9L));
df$date <- format(as.POSIXct(df$date),format='%Y-%m-%dM');
setDT(df);
expected <- bgoldst(copy(df)); co <- names(expected);
identical(expected,kunal(copy(df))[,co,with=F]);
## [1] TRUE
identical(expected,psidom(copy(df))[,co,with=F]);
## [1] TRUE
microbenchmark(bgoldst(copy(df)),kunal(copy(df)),psidom(copy(df)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(df)) 1.397569 1.445893 1.522512 1.490369 1.538908 2.749805 100
## kunal(copy(df)) 1.318453 1.362287 1.483356 1.403555 1.443968 4.733684 100
## psidom(copy(df)) 1.451881 1.532920 1.625494 1.573120 1.624010 3.097487 100
set.seed(1L);
NR <- 500L; NC <- 100L;
df <- data.frame(
date=seq(as.POSIXct('2016-04-08'),by='6 h',len=NR),
setNames(nm=paste0('x_',seq_len(NC)),as.data.frame(replicate(NC,runif(NR)))),
b_31=runif(NR)
);
df$date <- format(as.POSIXct(df$date),format='%Y-%m-%dM');
setDT(df);
expected <- bgoldst(copy(df)); co <- names(expected);
identical(expected,kunal(copy(df))[,co,with=F]);
## [1] TRUE
identical(expected,psidom(copy(df))[,co,with=F]);
## [1] TRUE
microbenchmark(bgoldst(copy(df)),kunal(copy(df)),psidom(copy(df)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(df)) 94.75322 100.94627 106.61343 102.37655 105.89292 164.58885 100
## kunal(copy(df)) 21.38946 23.04383 24.60639 24.20192 25.18723 69.29593 100
## psidom(copy(df)) 45.32431 48.76798 50.63476 49.60532 51.03667 92.41567 100
Upvotes: 4
Reputation: 214937
cols <- names(df)[grepl("x_", names(df))]
newCols <- paste0(rep(cols, each = 2), c("max", "min"))
sumFun <- function(col) list(max(col), min(col))
setDT(df)[, c(newCols) := unlist(lapply(.SD, sumFun), recursive = F), .(date), .SDcols = cols]
sum <- unique(df[, .SD, .SDcols = c("date", newCols)])
> sum
date x_01max x_01min x_02max x_02min x_10max x_10min
1: 2016-04-08M 0.8770486 0.1828969 0.99869872 0.159936264 0.8983131 0.3767007
2: 2016-04-09M 0.6475017 0.1429131 0.57890510 0.007439883 0.9242098 0.1077233
3: 2016-04-10M 0.9176341 0.9176341 0.05900942 0.059009423 0.2717861 0.2717861
Upvotes: 1
Reputation: 3427
You may try this code:
## building the data.table
int <- seq(as.POSIXct("2016-04-08"), as.POSIXct("2016-04-10"), by="6 h")
df <- data.frame(date = int, x_01 = runif(9), x_02 = runif(9), x_10 = runif(9), b_31 = runif(9))
df$date <- format(as.POSIXct(df$date), format = "%Y-%m-%dM")
## actual work begins here
library(data.table)
setDT(df)
indices <- grep("x_",colnames(df))
col_names <- colnames(df)[indices]
query_min <- paste0(col_names,'min=min(',col_names,')')
query_max <- paste0(col_names,'max=max(',col_names,')')
query_1 <- paste(c(query_min,query_max),collapse=',')
eval(parse(text=paste0("df[,.(",query_1,"),by=date]")))
## date x_01min x_02min x_10min x_01max x_02max x_10max
##1: 2016-04-08M 0.07527176 0.026276086 0.3315467 0.9404001 0.906662120 0.7069425
##2: 2016-04-09M 0.34796983 0.065390319 0.2437374 0.8130796 0.739978420 0.6760062
##3: 2016-04-10M 0.45671821 0.003374905 0.7245515 0.4567182 0.003374905 0.7245515
Upvotes: 1