Reputation: 697
I am very new to data.table
and would like to try it out to see if it makes my analysis faster. I mainly use knitr
to compile .rnw
files (which I tend to compile many times per hour, so I want it to be as fast as possible).
I have posted a sample below, and this is by no means a question of comparison between data.table
and data.frame
. I would just like to know if my code below is what it should be.
I am basically joining two data.tables
and then need to linearly approximate the missing NA
values using na.approx
. I used the "Introduction to data.table" vignette from CRAN and "JOINing data in R using data.table" from RPubs.
My best attempt at a data.table
method below takes a long time (I only added the data.frame code for reference).
Also, if anyone knows a way to pipe na.approx()
into a chain and still have the output as a data.frame
, that would be appreciated. Note the df_merged = as.data.frame(df_merged)
line that I would like to get rid of if possible!
Any input is greatly appreciated, thank you!
library(data.table)
library(zoo)
library(dplyr)
dt_function_test = function() {
  set.seed(123)
  # data.table
  dt_random = data.table(vals = runif(1E5, 0, 500))
  dt_na = data.table(vals = c(0, 250, 500),
                     ref1 = c(0.33, 0.45, 0.78),
                     ref2 = c(0.12, 0.79, 1))
  dt_merged = merge(dt_random[],
                    dt_na[],
                    all = TRUE)
  dt_merged = dt_merged[, lapply(.SD, na.approx), by = vals]
}
df_function_test = function() {
  set.seed(123)
  # data.frame
  df_random = data.frame(vals = runif(1E5, 0, 500))
  df_na = data.frame(vals = c(0, 250, 500),
                     ref1 = c(0.33, 0.45, 0.78),
                     ref2 = c(0.12, 0.79, 1))
  df_merged = full_join(df_random,
                        df_na) %>%
    na.approx
  df_merged = as.data.frame(df_merged)
}
print(system.time(dt_function_test()))
# user system elapsed
# 11.42 0.00 11.46
print(system.time(df_function_test()))
# Joining, by = "vals"
# user system elapsed
# 0.05 0.05 0.10
Upvotes: 0
Views: 794
Reputation: 25225
Here are a few possible implementations using data.table
that perform zoo::na.approx
on the ref*
columns (note that a larger dataset has also been used).
The key change is dropping by = vals
: grouping by vals splits the merged table into one group per unique value, so na.approx is called once per tiny group, which is most likely what makes the original version so slow.
library(data.table)
library(zoo)
dt_function_test_0 = function() {
  set.seed(123)
  # data.table
  dt_random = data.table(vals = runif(1e7, 0, 500))
  dt_na = data.table(vals = c(0, 250, 500),
                     ref1 = c(0.33, 0.45, 0.78),
                     ref2 = c(0.12, 0.79, 1))
  cols <- c("ref1", "ref2")

  ## Version 0: build a new data.table from lapply over all columns
  merge(dt_random, dt_na, all = TRUE)[, lapply(.SD, na.approx)]
}
dt_function_test_1 = function() {
  set.seed(123)
  # data.table
  dt_random = data.table(vals = runif(1e7, 0, 500))
  dt_na = data.table(vals = c(0, 250, 500),
                     ref1 = c(0.33, 0.45, 0.78),
                     ref2 = c(0.12, 0.79, 1))
  cols <- c("ref1", "ref2")

  ## Version 1: update by reference with :=
  merge(dt_random, dt_na, all = TRUE)[,
    (cols) := lapply(.SD, na.approx), .SDcols = cols]
}
dt_function_test_2 = function() {
  set.seed(123)
  # data.table
  dt_random = data.table(vals = runif(1e7, 0, 500))
  dt_na = data.table(vals = c(0, 250, 500),
                     ref1 = c(0.33, 0.45, 0.78),
                     ref2 = c(0.12, 0.79, 1))
  cols <- c("ref1", "ref2")

  ## Version 2: update each column in place with set()
  dt_merged <- merge(dt_random, dt_na, all = TRUE)
  for (x in cols)
    set(dt_merged, j = x, value = na.approx(dt_merged[[x]]))
  dt_merged
}
Timing output:
> system.time(dt_function_test_0())
user system elapsed
5.44 1.90 6.96
> system.time(dt_function_test_1())
user system elapsed
3.55 1.30 4.41
> system.time(dt_function_test_2())
user system elapsed
3.78 1.19 4.52
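On the side question about piping: since na.approx()
applied to a data.frame does not return a data.frame (hence the separate as.data.frame()
line in the question), the conversion can simply become the last step of the same chain. A minimal sketch with small made-up data (the df_na object here is just for illustration):

```r
library(zoo)
library(dplyr)

# Made-up example: ref1 has NAs to interpolate linearly
df_na <- data.frame(vals = c(0, 125, 250, 375, 500),
                    ref1 = c(0.33, NA, 0.45, NA, 0.78))

# na.approx() does not return a data.frame, so finish the
# chain with as.data.frame() instead of a separate assignment
df_filled <- df_na %>%
  na.approx() %>%
  as.data.frame()
```

This keeps everything in one pipeline and removes the extra conversion line.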
Upvotes: 3