Reputation: 355
I am learning to work with R and more specifically the data.table package. Now I have found out that sometimes I try to intuitively solve things in ways that don't work with data.table. Specifically when I am trying to apply the same operation to each row in a data.table and often using that operation to fill a new column or modify an existing one. For example I created the next command to apply on the table I pasted below too. In the beginning the column C was non-existent, but I created that by applying the correct method.
DT[,C:=ifelse(nchar(unlist(strsplit(B,"-"))[1]) == 4,paste("0",unlist(strsplit(B,"-"))[1],sep=""),unlist(strsplit(B,"-"))[1])]
But applying this command only gives me the value 10:00 in every row of C.. table:
Id A B C
1 41 10:00-10:15 10:00
2 38 9:15-9:30 09:15
3 39 9:30-9:45 09:30
4 40 9:45-10:00 09:45
5 57 14:00-14:15 14:00
6 59 14:30-14:45 14:30
7 58 14:15-14:30 14:15
8 56 13:45-14:00 13:45
9 91 22:30-22:45 22:30
10 89 22:00-22:15 22:00
Then somewhere (I'm sorry I lost the link) on StackOverflow I found someone doing the same thing but using the by
parameter. This would make my command like this:"
DT[,C:=ifelse(nchar(unlist(strsplit(B,"-"))[1]) == 4,paste("0",unlist(strsplit(B,"-"))[1],sep=""),unlist(strsplit(B,"-"))[1]),by=1:nrow(DT)]
This gives me the correct results. Basically the same as:
for (row in 1:nrow(DT))
{
DT[row,C:=ifelse(nchar(unlist(strsplit(DT[row,B],"-"))[1]) == 4,paste("0",unlist(strsplit(DT[row,B],"-"))[1],sep=""),unlist(strsplit(DT[row,B],"-"))[1])]
}
This confused me because reading the help page for data.table about by
it says:
A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.
So this made me think by
was only to point to columns, but here I am using it for rows... Can someone explain what is going on?
Additional question after comments: When I use the following code
test <- data.table(matrix(1:10))
test[,V2:=V1+2]
# V1 V2
# 1 3
# 2 4
# 3 5
# 4 6
# 5 7
# 6 8
# 7 9
# 8 10
# 9 11
# 10 12
I get the expected result. V2 contains the value of V1 upped by 2. Why does the same method not work in the case of my initial code above, where I ask the data.table to split every value of column B and put the resulting string into the new column.
Upvotes: 0
Views: 236
Reputation: 132706
strsplit
returns a list and you need the first element of each list element. This can be achieved using lapply
or one of it's relatives. Here I use vapply
, which returns a vector. Padding with zero is done in a seperate step, since ifelse
is not the best option here for performance.
DT <- read.table(text="Id A B C
1 41 10:00-10:15 10:00
2 38 9:15-9:30 09:15
3 39 9:30-9:45 09:30
4 40 9:45-10:00 09:45
5 57 14:00-14:15 14:00
6 59 14:30-14:45 14:30
7 58 14:15-14:30 14:15
8 56 13:45-14:00 13:45
9 91 22:30-22:45 22:30
10 89 22:00-22:15 22:00", header=TRUE, stringsAsFactors=FALSE)
library(data.table)
setDT(DT)
#extract first entry in each list element
DT[, C := vapply(strsplit(B,"-"), `[`, "string", i = 1)]
#pad with zero
DT[nchar(C) == 4, C := paste0(0, C)]
# Id A B C
# 1: 1 41 10:00-10:15 10:00
# 2: 2 38 9:15-9:30 09:15
# 3: 3 39 9:30-9:45 09:30
# 4: 4 40 9:45-10:00 09:45
# 5: 5 57 14:00-14:15 14:00
# 6: 6 59 14:30-14:45 14:30
# 7: 7 58 14:15-14:30 14:15
# 8: 8 56 13:45-14:00 13:45
# 9: 9 91 22:30-22:45 22:30
#10: 10 89 22:00-22:15 22:00
Using by
as you show in the questiobn is an alternative to using *apply
here. However, I wouldn't expect it to be faster in this example since data.table probably doesn't optimize the content of the loop.
Upvotes: 2