Reputation: 65
I am looking for a faster way to to the following, I need to split a column of a data.table object containing strings into separate columns. The strings are of the format "name1=value1;name2=value2;". The strings can be split into a variable number of columns in which case those values will need to be filled with NA. For example I have this:
library(data.table)
dt <- data.table("foo"=c("name=john;id=1234;last=smith", "name=greg;id=5678", "last=picard", "last=jones;number=1234567890"))
I would want this:
name id last number
john 1234 smith NA
greg 5678 NA NA
NA NA picard NA
NA NA jones 1234567890
This will work but it is slow given the amount of data to parse and I'm wondering if there is a better way:
x <- strsplit(as.character(dt$foo), ";|=")
a <- function(x){
name <- x[seq(1, length(x), 2)]
value <- x[seq(2, length(x), 2)]
tmp <- transpose(as.data.table(value))
names(tmp) <- name
return(tmp)
}
x <- lapply(x, a)
x <- rbindlist(x, fill=TRUE)
Upvotes: 4
Views: 1421
Reputation: 17299
We can try:
# split into different fields for each row
res <- lapply(strsplit(dt$foo, ';'), function(x){
# split the the fields into two vectors of field names and field values
res <- tstrsplit(x, '=')
# make a list of field values with the field names as names of the list
setNames(as.list(res[[2]]), res[[1]])
})
rbindlist(res, fill = T)
# name id last number
# 1: john 1234 smith NA
# 2: greg 5678 NA NA
# 3: NA NA picard NA
# 4: NA NA jones 1234567890
dplyr::bind_rows(res)
# # A tibble: 4 × 4
# name id last number
# <chr> <chr> <chr> <chr>
# 1 john 1234 smith <NA>
# 2 greg 5678 <NA> <NA>
# 3 <NA> <NA> picard <NA>
# 4 <NA> <NA> jones 1234567890
According to comment by David Arenburg, we can improve the speed by adding fixed = TRUE
to both strsplit
. I did a short benchmark with this data, adding fixed = TRUE
will increase the speed by about one fold.
library(microbenchmark)
dt <- dt[sample.int(nrow(dt), 100, replace = T)]
microbenchmark(
noFix = {
res <- lapply(strsplit(dt$foo, ';'), function(x){
res <- tstrsplit(x, '=')
setNames(as.list(res[[2]]), res[[1]])
})
},
Fixed = {
res <- lapply(strsplit(dt$foo, ';', fixed = TRUE), function(x){
res <- tstrsplit(x, '=', fixed = TRUE)
setNames(as.list(res[[2]]), res[[1]])
})
},
times = 1000
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# noFix 1.921947 1.999386 2.212511 2.064997 2.218706 11.290072 1000
# Fixed 1.026753 1.088712 1.226519 1.131899 1.219558 4.490796 1000
Upvotes: 3