Wienand

Reputation: 11

How to speed up per-column iteration

I would like to loop over the columns of a data.table and apply a function that needs information from another column of the same table, sometimes from several columns.

Let's take mtcars as an example.

I have the feeling that I can stick to the .SD approach, pass the extra information along, and make this much more efficient.

require(data.table)
dt = data.table(mtcars)

#looping through columns of mtcars...
cols = c('mpg', 'hp', 'disp')
dt[,lapply(.SD, function(x) x/mean(x)), .SDcols=cols]

# But what I actually want is to divide x by the mean of x over the rows where am == 1

# This is what I am doing now...

specificMean= function(DT) {
  x = DT$feature
  xAM = DT[AM==1]$feature
  MEAN = mean(xAM, na.rm=TRUE)
  x = x/MEAN    
  return(x)
}

dt[,(cols):=lapply(cols, function(x) specificMean(data.table(feature=get(x), AM=am))), .SDcols=cols]
print(dt)

I have the feeling this is much slower because it calls data.table() in every iteration.

A vectorized solution would be nice.

Upvotes: 1

Views: 110

Answers (3)

Wienand

Reputation: 11

system.time(dt[,lapply(cols, function(x) specificMean(data.table(feature=get(x), AM=am))), .SDcols=cols])

   user  system elapsed
  0.010   0.000   0.005

A more efficient way using two loops, thanks to @chinsoon12:

system.time(dt[,mapply(`/`, .SD[,-"am"], lapply(.SD[am==1, -"am"], mean), SIMPLIFY=FALSE), .SDcols=c("am", cols)])

   user  system elapsed
  0.001   0.000   0.001

The most concise efficient way, using a single loop, thanks to @Cole:

system.time(dt[,.SD / lapply(.SD[am == 1], mean, na.rm = TRUE), .SDcols = cols])

   user  system elapsed
  0.001   0.000   0.001
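With only 32 rows in mtcars, timings this small are mostly noise, so it may be worth repeating the comparison on a larger table. The following is only a sketch: the table `big`, its row count and its random contents are made up for illustration and are not from the original post.

```r
library(data.table)

set.seed(1)   # arbitrary seed, only so the sketch is reproducible
n <- 1e6      # hypothetical row count, adjust to taste

# Synthetic stand-in for mtcars with the same column names (an assumption)
big <- data.table(
  am   = sample(0:1, n, replace = TRUE),
  mpg  = runif(n),
  hp   = runif(n),
  disp = runif(n)
)
cols <- c("mpg", "hp", "disp")

# Two-loop version (mapply over the columns and their am == 1 means)
t_mapply <- system.time(
  r1 <- big[, mapply(`/`, .SD[, -"am"], lapply(.SD[am == 1, -"am"], mean),
                     SIMPLIFY = FALSE),
            .SDcols = c("am", cols)]
)

# One-loop version (divide .SD by the list of am == 1 means)
t_sd <- system.time(
  r2 <- big[, .SD / lapply(.SD[am == 1], mean, na.rm = TRUE), .SDcols = cols]
)

print(rbind(mapply = t_mapply, sd = t_sd))

# Both versions should return the same values
stopifnot(isTRUE(all.equal(r1, r2)))
```

The actual numbers will depend on the machine and on the data.table version, so none are quoted here.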

Upvotes: 0

Cole

Reputation: 11255

Edit: This seems to produce the same result as yours.

library(data.table)

dt = data.table(mtcars)
cols = c('mpg', 'hp', 'disp')

dt[, (cols) := .SD / lapply(.SD[am == 1], mean, na.rm = TRUE), .SDcols = cols]
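To back up the "same result" claim, here is a small check against a plain base R computation. This is only a sketch: the name `expected` is introduced here for illustration and does not appear in the question.

```r
library(data.table)

cols <- c("mpg", "hp", "disp")

# Reference values in base R: each column divided by its mean over am == 1
expected <- lapply(mtcars[cols],
                   function(x) x / mean(x[mtcars$am == 1], na.rm = TRUE))

dt <- data.table(mtcars)
dt[, (cols) := .SD / lapply(.SD[am == 1], mean, na.rm = TRUE), .SDcols = cols]

# Stops with an error if any rescaled column differs from the reference
for (col in cols) {
  stopifnot(isTRUE(all.equal(dt[[col]], expected[[col]])))
}
```

Note that `am` is not listed in `.SDcols`; the filter `am == 1` still works because `am` is referenced in `j`, so data.table makes that column available there.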

Upvotes: 0

chinsoon12

Reputation: 25225

A possible approach:

dt[, (cols) := mapply(`/`, .SD[,-"am"], lapply(.SD[am==1, -"am"], mean), SIMPLIFY=FALSE), 
    .SDcols=c("am", cols)]
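Another option, not proposed anywhere in this thread, is an explicit loop with `set()`, which updates one column at a time by reference and computes the `am == 1` row index only once. Treat it as a sketch of an alternative rather than a measured improvement; the variable `idx` is a name made up for this example.

```r
library(data.table)

dt <- data.table(mtcars)
cols <- c("mpg", "hp", "disp")

# Row positions of the reference group, computed once outside the loop
idx <- which(dt$am == 1)

for (col in cols) {
  x <- dt[[col]]
  # Overwrite the column in place with x divided by its mean over am == 1
  set(dt, j = col, value = x / mean(x[idx], na.rm = TRUE))
}

print(dt)
```

This avoids building an intermediate data.table per column, which was the concern raised in the question.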

Upvotes: 1
