I am trying dtplyr & data.table for the first time to speed up some of my existing dplyr code.
Issue: if I use a data.table / dtplyr object, I am unable to plot with ggplot. If I convert the data.table / dtplyr object to a tibble just before plotting in the pipe chain, ggplot works, but the whole pipeline then takes even longer than working with a data.frame / tibble throughout, as shown later in this post.
library(tidyverse)
library(dtplyr)
library(data.table)
library(scales)
library(lubridate)
library(bench)
library(tidytext)  # reorder_within(), scale_y_reordered()
library(ggthemes)  # scale_fill_tableau()
My code attempts & time benchmarks:
Data:
data.frame object:
df_ind_stacked_daily <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>%
  mutate(Date = ymd(Date))
data.table object:
df_ind_stacked_daily2 <- setDT(df_ind_stacked_daily)
Plot with data.table/dtplyr object:
df_ind_stacked_daily2 %>%
  filter(Daily_cases_type == "Daily_confirmed",
         Date >= max(Date) - 6 & Date <= max(Date),
         State.UnionTerritory != "India") %>%
  group_by(Date) %>%
  slice_max(order_by = Daily_cases_counts, n = 10) %>%
  ungroup() %>%
  # as_tibble() %>%
  ggplot(aes(x = Daily_cases_counts,
             y = reorder_within(State.UnionTerritory,
                                by = Daily_cases_counts, within = Date),
             fill = State.UnionTerritory)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Date, scales = "free_y") +
  geom_text(aes(label = Daily_cases_counts), size = 3, color = "white",
            # position = "dodge",
            hjust = 1.2) +
  # theme_minimal() +
  theme(legend.position = "none") +
  scale_x_continuous(labels = comma) + # or unit_format(scale = 1e-3, unit = "k")
  scale_fill_tableau(palette = "Tableau 20") +
  scale_y_reordered() +
  coord_cartesian(clip = "off")
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class dtplyr_step_group/dtplyr_step.
P.S. - If I uncomment as_tibble() in the code chunk above, then ggplot works.
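Note: for the same reason, collecting with as.data.table() instead of as_tibble() should also satisfy ggplot, since a data.table inherits from data.frame; it is only the uncollected lazy dtplyr_step object that fortify() rejects. A minimal sketch (aesthetics simplified from the plot above):
df_ind_stacked_daily2 %>%
  filter(Daily_cases_type == "Daily_confirmed",
         Date >= max(Date) - 6 & Date <= max(Date),
         State.UnionTerritory != "India") %>%
  group_by(Date) %>%
  slice_max(order_by = Daily_cases_counts, n = 10) %>%
  ungroup() %>%
  as.data.table() %>% # forces evaluation of the lazy chain; skips the tibble conversion
  ggplot(aes(x = Daily_cases_counts, y = State.UnionTerritory)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Date, scales = "free_y")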
Code Time Benchmarks:
bench::mark(
  df_ind_stacked_daily2 %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup()
    # as_tibble() %>%
)
expression          min     median  itr/sec
<S3: bench_expr>  2.45ms   2.75ms  320.3396
bench::mark(
  df_ind_stacked_daily2 %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble()
)
expression          min     median  itr/sec
<S3: bench_expr>  12.7ms     14ms  65.41098
bench::mark(
  df_ind_stacked_daily %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup()
)
expression          min     median  itr/sec
<S3: bench_expr>  6.71ms   7.97ms  120.3636
Question: How can I make ggplot work with a data.table / dtplyr object without converting it to a data.frame / tibble?
############################
(UPDATE: Response to Answer)
Thanks @teunbrand. I am mostly using your code below, added another function alongside it, and ran it in 3 scenarios.
I created two functions: (1) one that performs the processing but does not coerce to a tibble, and (2) one that coerces to a tibble after processing.
And I ran these in 3 scenarios overall: (1) data.table, (2) data.table converted to a tibble after processing, and (3) a tibble from the beginning.
# 1. function that does NOT convert to tibble
fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup()
    # no as_tibble() here
}
# 2. function that converts to tibble after all processing
fun_to_tbl <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble() # always coerce to tibble
}
# Make data larger
dt <- do.call(rbind, rep(list(as.data.table(df_ind_stacked_daily)), 20))
tbl_df <- do.call(rbind, rep(list(as_tibble(df_ind_stacked_daily)), 20))
# Run data.table on single thread
setDTthreads(1)
For unknown reasons the benchmarks wouldn't run together in a single bench::mark() call, so I had to run them one by one.
(bm <- bench::mark(
  dt_res = fun(dt), # benchmark the data.table
  min_iterations = 20
))
expression          min     median   itr/sec  mem_alloc
<S3: bench_expr>  4.35ms   6.05ms   148.1923     5.12KB
(bm <- bench::mark(
  dt_to_tbl_res = fun_to_tbl(dt), # benchmark the data.table converted to tibble at the end
  min_iterations = 20
))
expression          min     median   itr/sec  mem_alloc
<S3: bench_expr>  65.8ms   72.2ms   12.28566     47.6MB
(bm <- bench::mark(
  tbl_res = fun(tbl_df), # benchmark the tibble
  min_iterations = 20
))
expression          min     median   itr/sec  mem_alloc
<S3: bench_expr>    55ms   67.8ms   13.70603     47.4MB
Objective: My main objective was to incorporate this into a Shiny app with dynamic variable selection, so I wanted to optimize it with data.table. But I guess there is no way for ggplot to work directly with the lazy dtplyr S3 step objects.
And the only time difference I am getting is when I use data.table and keep it as a data.table throughout; otherwise there is no benefit.
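A minimal sketch of how this could look in Shiny (hypothetical input and output names; assumes the enlarged dt data.table created above): keep the heavy filtering in plain data.table syntax inside a reactive, then hand the collected data.table straight to ggplot, which accepts it because data.table inherits from data.frame.
library(shiny)
library(data.table)
library(ggplot2)

ui <- fluidPage(
  # hypothetical dynamic variable selection
  selectInput("case_type", "Case type", choices = unique(dt$Daily_cases_type)),
  plotOutput("top_states")
)

server <- function(input, output, session) {
  filtered <- reactive({
    # plain data.table filtering; re-runs only when the input changes
    dt[Daily_cases_type == input$case_type &
         Date >= max(Date) - 6 &
         State.UnionTerritory != "India"]
  })

  output$top_states <- renderPlot({
    # a data.table is a data.frame, so no tibble conversion is needed
    ggplot(filtered(), aes(x = Daily_cases_counts, y = State.UnionTerritory)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~Date, scales = "free_y")
  })
}

shinyApp(ui, server)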
############################
(ANSWER by @teunbrand)
There are a few observations to be made here:
1. As far as I understand dtplyr, along your piping chain it accumulates operations that aren't evaluated; they are just translated from dplyr to data.table syntax. Until you realise your pipe as a data.frame, data.table or tibble, your computer doesn't run the operations. This makes your first benchmark underestimate the true runtime (a short sketch of this and the next point follows the list).
2. Because you're using setDT to convert a data.frame to a data.table, what you are benchmarking as a data.frame is not a benchmark for a data.frame. If you read the documentation of ?setDT, you'll see that the object is converted in memory and without copying, meaning that your df_ind_stacked_daily is also a data.table.
3. The data.table package makes use of multiple threads by default. We should prevent this to make a fair comparison.
4. Your first filtering operation goes from medium-sized data (75,748 rows) to small (252 rows). For the majority of your pipe you are not working with a lot of data, whereas a lot of data is where data.table shines.
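A quick sketch of points 1 and 2 on a toy data set, using dtplyr's lazy_dt() and show_query() to make the laziness visible:
library(dplyr)
library(dtplyr)
library(data.table)

toy <- data.frame(g = rep(1:3, each = 4), x = rnorm(12))

# Point 1: dplyr verbs on a lazy_dt only accumulate a translation;
# nothing is computed until the result is collected.
lazy <- lazy_dt(toy) %>%
  filter(x > 0) %>%
  group_by(g) %>%
  summarise(m = mean(x))
show_query(lazy)           # prints the generated data.table code, still unevaluated
res <- as.data.table(lazy) # only now does the computation actually run

# Point 2: setDT() converts in place, without copying,
# so the original object is a data.table afterwards as well.
toy2 <- data.frame(a = 1:3)
setDT(toy2)
class(toy2) # "data.table" "data.frame" -- toy2 itself was converted, no copy made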
Adjusting for some of these things, I find that there is no difference in speed.
library(tidyverse)
library(dtplyr)
library(data.table)
library(lubridate)
library(bench)
df <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>%
  mutate(Date = ymd(Date))
fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble() # Always coerce to tibble
}
# Make data larger
dt <- do.call(rbind, rep(list(as.data.table(df)), 20))
tbl <- do.call(rbind, rep(list(as_tibble(df)), 20))
# Run data.table on single thread
setDTthreads(1)
# Benchmark simultaneously
(bm <- bench::mark(
  dt  = fun(dt),
  tbl = fun(tbl),
  min_iterations = 20
))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dt 41.1ms 42.5ms 23.4 72.2MB 35.2
#> 2 tbl 40.7ms 41.5ms 24.0 71MB 36.0
plot(bm)
Created on 2021-08-19 by the reprex package (v1.0.0)