David

Reputation: 10152

Changing multiple columns in a data.table in R

I am looking for a way to manipulate multiple columns of a data.table in R. Because I have to address the columns dynamically and also supply a second input, I wasn't able to find an answer.

The idea is to index two or more series on a certain date by dividing all values of each series by its value on that date, e.g.:

library(data.table)

set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))

# set a date for the index
indexDate <- as.Date("2000-01-05")

# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]

Part 1: The Easy data.frame/apply approach

df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df))*(df$date==indexDate))

# use apply to iterate over all columns
df[, cols] <- apply(df[, cols], 
                    2, 
                    function(x, i){x / x[i]}, i = rownum)

Part 2: The (fast) data.table approach

So far my data.table approach looks like this:

for(nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[ , 
     nam := dt[,nam, with = FALSE] / div,
     with=FALSE]
}

In particular, all the with = FALSE arguments do not look very data.table-like.

Do you know of a faster or more elegant way to perform this operation?

Any idea is greatly appreciated!

Upvotes: 11

Views: 3304

Answers (3)

sgrubsmyon

Reputation: 1229

In the data.table (version 1.14.2) documentation for ?set, I find that there is a new and simpler way of accomplishing this:

The old syntax used to be:

DT[i, colvector := val, with = FALSE] # OLD syntax. The contents of "colvector" in calling scope determine the column(s).

The new syntax is:

DT[i, (colvector) := val] # same (NOW PREFERRED) shorthand syntax. The parens are enough to stop the LHS being a symbol; same as c(colvector).
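
To make the parenthesized form concrete, here is a minimal sketch on a small made-up table. The table DT, its columns and the contents of colvector are hypothetical and only serve as illustration; they are not the asker's data.

```r
library(data.table)

# Hypothetical toy data, not the table from the question
DT <- data.table(id = 1:3, a = c(2, 4, 6), b = c(10, 20, 30))
colvector <- c("a", "b")   # column names held in a character vector

# The parentheses make data.table evaluate colvector instead of
# creating a column literally named "colvector"; a and b are
# updated by reference, no with = FALSE required
DT[, (colvector) := lapply(.SD, function(x) x / x[1]), .SDcols = colvector]
```

After this call, a and b are each divided by their own first value, while id is left as it was.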

Upvotes: 1

Moein

Reputation: 121

Following up on your code and the answer given by akrun, I would recommend using .SDcols to select the numeric columns and lapply to loop through them. Here's how I would do it:

index <- as.Date("2000-01-05")

rownum <- max((dt$date == index) * (1:nrow(dt)))

dt[, lapply(.SD, function(i) i / i[rownum]), .SDcols = is.numeric]

Using .SDcols can be especially useful if you have a large number of numeric columns and would like to apply this division to all of them.
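
Note that a call of the form dt[, lapply(.SD, ...), .SDcols = ...] returns a new data.table containing only the selected columns, so the date column is not part of the result and dt itself is unchanged. If the goal is to update the table in place, one possible variation is to collect the numeric column names first and assign with (cols) :=. The sketch below uses made-up data and a hypothetical helper variable num_cols:

```r
library(data.table)

# Hypothetical toy data mirroring the layout used in the question
dt <- data.table(date = seq(as.Date("2000-01-01"), by = "days", length.out = 5),
                 X1 = c(1, 2, 4, 8, 16),
                 X2 = c(10, 20, 30, 40, 50))
rownum <- which(dt$date == as.Date("2000-01-03"))   # row of the index date

# Names of the numeric columns (the Date column is not numeric)
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]

# Update those columns by reference; the date column is left untouched
dt[, (num_cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = num_cols]
```

Here which() is used to find the index row, which is equivalent to the max(...) expression above when exactly one row matches the date.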

Upvotes: 2

akrun

Reputation: 886938

One option would be to use set, as this involves multiple columns. The advantage of set is that it avoids the overhead of [.data.table, which makes it faster.

library(data.table)
for(j in cols){
  set(dt, i=NULL, j=j, value= dt[[j]]/dt[[j]][rownum])
}

Or a slightly slower option would be

dt[, (cols) :=lapply(.SD, function(x) x/x[rownum]), .SDcols=cols]
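
Both variants modify the table by reference, so to compare them they have to run on separate copies. The following sketch rebuilds the question's data and checks that the two approaches agree; the names base, dt_set and dt_sd are hypothetical and introduced only for this comparison.

```r
library(data.table)

# Same simulated data as in the question
set.seed(132)
base <- data.table(date = seq(as.Date("2000-01-01"), by = "days", length.out = 10),
                   X1 = cumsum(rnorm(10)),
                   X2 = cumsum(rnorm(10)))
cols <- c("X1", "X2")
rownum <- which(base$date == as.Date("2000-01-05"))

# copy() is needed because both approaches change their table in place
dt_set <- copy(base)
for (j in cols) {
  set(dt_set, i = NULL, j = j, value = dt_set[[j]] / dt_set[[j]][rownum])
}

dt_sd <- copy(base)
dt_sd[, (cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = cols]

# Compare the two results
all.equal(dt_set, dt_sd)
```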

Upvotes: 9
