I am trying dtplyr & data.table for the first time to speed up some of my existing dplyr code.
Issue: if I use a data.table / dtplyr object, I am unable to plot with ggplot. If I convert the data.table / dtplyr object to a tibble just before plotting in the pipe chain, ggplot works, but the whole pipeline then takes even longer than working with a data.frame / tibble throughout, as shown later in this post.
library(tidyverse)
library(dtplyr)
library(data.table)
library(scales)
library(lubridate)
library(bench)
library(tidytext)  # reorder_within(), scale_y_reordered()
library(ggthemes)  # scale_fill_tableau()
My code attempts & time benchmarks:
Data:
data.frame object:
df_ind_stacked_daily <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>%
  mutate(Date = ymd(Date))
data.table object:
df_ind_stacked_daily2 <- setDT(df_ind_stacked_daily)
Plot with data.table/dtplyr object:
df_ind_stacked_daily2 %>%
  filter(Daily_cases_type == "Daily_confirmed",
         Date >= max(Date) - 6 & Date <= max(Date),
         State.UnionTerritory != "India") %>%
  group_by(Date) %>%
  slice_max(order_by = Daily_cases_counts, n = 10) %>%
  ungroup() %>%
  # as_tibble() %>%
  ggplot(aes(x = Daily_cases_counts,
             y = reorder_within(State.UnionTerritory,
                                by = Daily_cases_counts, within = Date),
             fill = State.UnionTerritory)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Date, scales = "free_y") +
  geom_text(aes(label = Daily_cases_counts), size = 3, color = "white",
            # position = "dodge",
            hjust = 1.2) +
  # theme_minimal() +
  theme(legend.position = "none") +
  scale_x_continuous(labels = comma) + # or unit_format(scale = 1e-3, unit = "k")
  scale_fill_tableau(palette = "Tableau 20") +
  scale_y_reordered() +
  coord_cartesian(clip = "off")
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class dtplyr_step_group/dtplyr_step.
P.S. - If I uncomment as_tibble() in the code chunk above, then ggplot works.
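Note: for the same reason, collecting with as.data.table() instead of as_tibble() should also satisfy ggplot, since a data.table inherits from data.frame; it is only the uncollected lazy dtplyr_step object that fortify() rejects. A minimal sketch (aesthetics simplified from the plot above):
df_ind_stacked_daily2 %>%
  filter(Daily_cases_type == "Daily_confirmed",
         Date >= max(Date) - 6 & Date <= max(Date),
         State.UnionTerritory != "India") %>%
  group_by(Date) %>%
  slice_max(order_by = Daily_cases_counts, n = 10) %>%
  ungroup() %>%
  as.data.table() %>% # forces evaluation of the lazy chain; skips the tibble conversion
  ggplot(aes(x = Daily_cases_counts, y = State.UnionTerritory)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Date, scales = "free_y")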
Code Time Benchmarks:
bench::mark(
  df_ind_stacked_daily2 %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup()
    # as_tibble() %>%
)
expression          min     median  itr/sec
<S3: bench_expr>  2.45ms   2.75ms  320.3396
bench::mark(
  df_ind_stacked_daily2 %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble()
)
expression          min     median  itr/sec
<S3: bench_expr>  12.7ms     14ms  65.41098
bench::mark(
  df_ind_stacked_daily %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup()
)
expression          min     median  itr/sec
<S3: bench_expr>  6.71ms   7.97ms  120.3636
Question: How can I make ggplot work with a data.table / dtplyr object without converting it to a data.frame / tibble?
############################
(UPDATE: Response to Answer)
Thanks @teunbrand. I am mostly using your code below, added another function alongside it, and ran it in 3 scenarios.
I created two functions: (1) one that performs the processing but does not coerce to a tibble, and (2) one that coerces to a tibble after processing.
And I ran these in 3 scenarios overall: (1) data.table, (2) data.table converted to a tibble after processing, and (3) a tibble from the beginning.
# 1. function that does NOT convert to tibble
fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup()
    # no as_tibble() here
}
# 2. function that converts to tibble after all processing
fun_to_tbl <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble() # always coerce to tibble
}
# Make data larger
dt <- do.call(rbind, rep(list(as.data.table(df_ind_stacked_daily)), 20))
tbl_df <- do.call(rbind, rep(list(as_tibble(df_ind_stacked_daily)), 20))
# Run data.table on single thread
setDTthreads(1)
For unknown reasons the benchmarks wouldn't run together in a single bench::mark() call, so I had to run them one by one.
(bm <- bench::mark(
  dt_res = fun(dt), # benchmark the data.table
  min_iterations = 20
))
expression          min     median   itr/sec  mem_alloc
<S3: bench_expr>  4.35ms   6.05ms   148.1923     5.12KB
(bm <- bench::mark(
  dt_to_tbl_res = fun_to_tbl(dt), # benchmark the data.table converted to tibble at the end
  min_iterations = 20
))
expression          min     median   itr/sec  mem_alloc
<S3: bench_expr>  65.8ms   72.2ms   12.28566     47.6MB
(bm <- bench::mark(
  tbl_res = fun(tbl_df), # benchmark the tibble
  min_iterations = 20
))
expression          min     median   itr/sec  mem_alloc
<S3: bench_expr>    55ms   67.8ms   13.70603     47.4MB
Objective: My main objective was to incorporate this into a Shiny app with dynamic variable selection, so I wanted to optimize it with data.table. But I guess there is no way for ggplot to work directly with the lazy dtplyr S3 step objects.
And the only time difference I am getting is when I use data.table and keep it as a data.table throughout; otherwise there is no benefit.
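A minimal sketch of how this could look in Shiny (hypothetical input and output names; assumes the enlarged dt data.table created above): keep the heavy filtering in plain data.table syntax inside a reactive, then hand the collected data.table straight to ggplot, which accepts it because data.table inherits from data.frame.
library(shiny)
library(data.table)
library(ggplot2)

ui <- fluidPage(
  # hypothetical dynamic variable selection
  selectInput("case_type", "Case type", choices = unique(dt$Daily_cases_type)),
  plotOutput("top_states")
)

server <- function(input, output, session) {
  filtered <- reactive({
    # plain data.table filtering; re-runs only when the input changes
    dt[Daily_cases_type == input$case_type &
         Date >= max(Date) - 6 &
         State.UnionTerritory != "India"]
  })

  output$top_states <- renderPlot({
    # a data.table is a data.frame, so no tibble conversion is needed
    ggplot(filtered(), aes(x = Daily_cases_counts, y = State.UnionTerritory)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~Date, scales = "free_y")
  })
}

shinyApp(ui, server)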
############################
(ANSWER by @teunbrand)
There are a few observations to be made here:
1. As far as I understand dtplyr, along your piping chain it accumulates operations that aren't evaluated; they are just translated from dplyr to data.table syntax. Until you realise your pipe as a data.frame, data.table or tibble, your computer doesn't run the operations. This makes your first benchmark underestimate the true runtime (a short sketch of this and the next point follows the list).
2. Because you're using setDT to convert a data.frame to a data.table, what you are benchmarking as a data.frame is not a benchmark for a data.frame. If you read the documentation of ?setDT, you'll see that the object is converted in memory and without copying, meaning that your df_ind_stacked_daily is also a data.table.
3. The data.table package makes use of multiple threads by default. We should prevent this to make a fair comparison.
4. Your first filtering operation goes from medium-sized data (75,748 rows) to small (252 rows). For the majority of your pipe you are not working with a lot of data, whereas a lot of data is where data.table shines.
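A quick sketch of points 1 and 2 on a toy data set, using dtplyr's lazy_dt() and show_query() to make the laziness visible:
library(dplyr)
library(dtplyr)
library(data.table)

toy <- data.frame(g = rep(1:3, each = 4), x = rnorm(12))

# Point 1: dplyr verbs on a lazy_dt only accumulate a translation;
# nothing is computed until the result is collected.
lazy <- lazy_dt(toy) %>%
  filter(x > 0) %>%
  group_by(g) %>%
  summarise(m = mean(x))
show_query(lazy)           # prints the generated data.table code, still unevaluated
res <- as.data.table(lazy) # only now does the computation actually run

# Point 2: setDT() converts in place, without copying,
# so the original object is a data.table afterwards as well.
toy2 <- data.frame(a = 1:3)
setDT(toy2)
class(toy2) # "data.table" "data.frame" -- toy2 itself was converted, no copy made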
Adjusting for some of these things, I find that there is no difference in speed.
library(tidyverse)
library(dtplyr)
library(data.table)
library(lubridate)
library(bench)
df <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>%
  mutate(Date = ymd(Date))
fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble() # Always coerce to tibble
}
# Make data larger
dt <- do.call(rbind, rep(list(as.data.table(df)), 20))
tbl <- do.call(rbind, rep(list(as_tibble(df)), 20))
# Run data.table on single thread
setDTthreads(1)
# Benchmark simultaneously
(bm <- bench::mark(
  dt  = fun(dt),
  tbl = fun(tbl),
  min_iterations = 20
))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dt 41.1ms 42.5ms 23.4 72.2MB 35.2
#> 2 tbl 40.7ms 41.5ms 24.0 71MB 36.0
plot(bm)
Created on 2021-08-19 by the reprex package (v1.0.0)