Convert all columns to lower case in data.table

Question

I can easily convert all columns to lower case in a data.table with lapply(df, stringi::stri_trans_tolower). This is also faster than the tidyverse method:

microbenchmark::microbenchmark(dplyr::mutate(df, dplyr::across(dplyr::everything(), tolower)),
                               data.table::as.data.table(lapply(df, stringi::stri_trans_tolower)),
                               times = 5)

I use stringi instead of base tolower because it is noticeably faster (roughly 1.6 times by median in the benchmark below):

> microbenchmark::microbenchmark(tolower(rep("APPLE", 100000)),
+                                stringi::stri_trans_tolower(rep("APPLE", 100000)),
+                                times = 5)
Unit: milliseconds
                                             expr      min       lq     mean   median       uq      max neval
                     tolower(rep("APPLE", 1e+05)) 25.51155 25.55177 26.28368 25.59082 25.67324 29.09102     5
 stringi::stri_trans_tolower(rep("APPLE", 1e+05)) 15.21042 15.60595 15.71065 15.80013 15.81833 16.11840     5

However, this makes a copy: lapply coerces the data.table with as.list, which triggers copy-on-modify, as the tracemem output below shows:

invisible(lapply(x, stringi::stri_trans_tolower))
tracemem[0x7fd8b1b65800 -> 0x7fd8b65eac00]: as.list.data.frame as.list lapply

This seems small (only one copy), but my data tables are 3-4 GB, so I need to optimize where I can. Is there a way to use data.table's modify-by-reference semantics to convert everything to lower case without making a copy? I would also be open to an Rcpp solution if possible.
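
To make the goal concrete, here is a minimal sketch of the kind of by-reference update I have in mind, assuming data.table::set() called in a loop over columns. The table dt and its column names are hypothetical stand-ins for my real data, and I have not verified that this is the best approach. Note that each lower-cased column is still a newly allocated vector; the sketch only avoids copying the table's list structure and works one column at a time.

```r
library(data.table)

# Hypothetical example table; the real data is several GB.
dt <- data.table(a = c("APPLE", "Banana"), b = c("X", "y"), n = 1:2)

# Record the table's address to check that the table itself is not replaced.
addr_before <- data.table::address(dt)

# Loop over column names and replace each character column by reference.
# set() assigns the new column vector in place, so the data.table itself
# is never coerced to a plain list or copied as a whole.
for (col in names(dt)) {
  if (is.character(dt[[col]])) {
    data.table::set(dt, j = col, value = stringi::stri_trans_tolower(dt[[col]]))
  }
}

# TRUE if dt is still the same object as before the loop.
same_object <- identical(data.table::address(dt), addr_before)
```

Non-character columns (n here) are skipped, so they keep their type instead of being coerced to character as they would be with lapply over every column.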
