The use of the `by` parameter in data.table

Question

I am learning to work with R and more specifically the data.table package. Now I have found out that sometimes I try to intuitively solve things in ways that don't work with data.table. Specifically when I am trying to apply the same operation to each row in a data.table and often using that operation to fill a new column or modify an existing one. For example I created the next command to apply on the table I pasted below too. In the beginning the column C was non-existent, but I created that by applying the correct method.

DT[,C:=ifelse(nchar(unlist(strsplit(B,"-"))[1]) == 4,paste("0",unlist(strsplit(B,"-"))[1],sep=""),unlist(strsplit(B,"-"))[1])]

But applying this command only gives me the value 10:00 in every row of C.. table:

Id          A           B         C
 1         41 10:00-10:15     10:00
 2         38   9:15-9:30     09:15
 3         39   9:30-9:45     09:30
 4         40  9:45-10:00     09:45
 5         57 14:00-14:15     14:00
 6         59 14:30-14:45     14:30
 7         58 14:15-14:30     14:15
 8         56 13:45-14:00     13:45
 9         91 22:30-22:45     22:30
10         89 22:00-22:15     22:00

Then somewhere (I'm sorry I lost the link) on StackOverflow I found someone doing the same thing but using the by parameter. This would make my command like this:"

DT[,C:=ifelse(nchar(unlist(strsplit(B,"-"))[1]) == 4,paste("0",unlist(strsplit(B,"-"))[1],sep=""),unlist(strsplit(B,"-"))[1]),by=1:nrow(DT)]

This gives me the correct results. Basically the same as:

for (row in 1:nrow(DT))
{
  DT[row,C:=ifelse(nchar(unlist(strsplit(DT[row,B],"-"))[1]) == 4,paste("0",unlist(strsplit(DT[row,B],"-"))[1],sep=""),unlist(strsplit(DT[row,B],"-"))[1])]
}

This confused me because reading the help page for data.table about by it says:

A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.

So this made me think by was only to point to columns, but here I am using it for rows... Can someone explain what is going on?

Additional question after comments: When I use the following code

test <- data.table(matrix(1:10))
test[,V2:=V1+2]

# V1 V2
#  1  3
#  2  4
#  3  5
#  4  6
#  5  7
#  6  8
#  7  9
#  8 10
#  9 11
# 10 12

I get the expected result. V2 contains the value of V1 upped by 2. Why does the same method not work in the case of my initial code above, where I ask the data.table to split every value of column B and put the resulting string into the new column.

Roland · Accepted Answer

strsplit returns a list and you need the first element of each list element. This can be achieved using lapply or one of it's relatives. Here I use vapply, which returns a vector. Padding with zero is done in a seperate step, since ifelse is not the best option here for performance.

DT <- read.table(text="Id          A           B         C
 1         41 10:00-10:15     10:00
 2         38   9:15-9:30     09:15
 3         39   9:30-9:45     09:30
 4         40  9:45-10:00     09:45
 5         57 14:00-14:15     14:00
 6         59 14:30-14:45     14:30
 7         58 14:15-14:30     14:15
 8         56 13:45-14:00     13:45
 9         91 22:30-22:45     22:30
10         89 22:00-22:15     22:00", header=TRUE, stringsAsFactors=FALSE)



library(data.table)
setDT(DT)
#extract first entry in each list element
DT[, C := vapply(strsplit(B,"-"), `[`, "string", i = 1)]
#pad with zero 
DT[nchar(C) == 4, C := paste0(0, C)]
#    Id  A           B     C
# 1:  1 41 10:00-10:15 10:00
# 2:  2 38   9:15-9:30 09:15
# 3:  3 39   9:30-9:45 09:30
# 4:  4 40  9:45-10:00 09:45
# 5:  5 57 14:00-14:15 14:00
# 6:  6 59 14:30-14:45 14:30
# 7:  7 58 14:15-14:30 14:15
# 8:  8 56 13:45-14:00 13:45
# 9:  9 91 22:30-22:45 22:30
#10: 10 89 22:00-22:15 22:00

Using by as you show in the questiobn is an alternative to using *apply here. However, I wouldn't expect it to be faster in this example since data.table probably doesn't optimize the content of the loop.

The use of the `by` parameter in data.table

Answers (1)

Related Questions