Dylan Russell
Dylan Russell

Reputation: 1098

Convert all columns to lower case in data.table

I can easily convert all columns to lower case in a data.table with lapply(df, stringi::stri_trans_tolower). This is also faster than the tidyverse method:

microbenchmark::microbenchmark(dplyr::mutate(df, dplyr::across(dplyr::everything(), tolower)),
                               lapply(df, stringi::stri_trans_tolower) %>% data.table::as.data.table(),
                               times = 5)

I use stringi instead of base tolower because it is twice as fast:

> microbenchmark::microbenchmark(tolower(rep("APPLE", 100000)),
+                                stringi::stri_trans_tolower(rep("APPLE", 100000)),
+                                times = 5)
Unit: milliseconds
                                             expr      min       lq     mean   median       uq      max neval
                     tolower(rep("APPLE", 1e+05)) 25.51155 25.55177 26.28368 25.59082 25.67324 29.09102     5
 stringi::stri_trans_tolower(rep("APPLE", 1e+05)) 15.21042 15.60595 15.71065 15.80013 15.81833 16.11840     5

However, this creates a copy on modify when it coerces data.table via as.list, as seen below:

invisible(lapply(x, stringi::stri_trans_tolower))
tracemem[0x7fd8b1b65800 -> 0x7fd8b65eac00]: as.list.data.frame as.list lapply

This seems small (only one copy) but when my data tables are 3-4 GB, I need to optimize where I can. Is there any way to utilized the data.table semantics to take advantage of modify-by-reference to make everything lower case and avoid making a copy? I would even be open to an rcpp option if possible.

Upvotes: 0

Views: 1321

Answers (1)

Andrew
Andrew

Reputation: 5138

Here are three possible solutions--depending on the structure of your data I would go with the base R solution or the data.table solution. The tidyvese and base R solutions use a dataframe, the other uses a data.table. Also, if you are concerned about memory allocation, the bench package is nice:

library(dplyr)
library(data.table)

df1 <- data.frame(col1 = rep(NA, 1e5))
df1[paste0("col", 1:10)] <- lapply(paste0("TEST_TEXT_", 1:10), rep, 1e5)
df2 <- df1
dt1 <- copy(df1)
setDT(dt1) 


bench::mark(mutate = {df1 <- mutate(df1, across(everything(), stringi::stri_trans_tolower))},
            dt_set = {for(col in seq_along(dt1)) set(dt1, j = col, value = stringi::stri_trans_tolower(dt1[[col]]))},
            base = {df2[] <- lapply(df2, stringi::stri_trans_tolower)},
            check = FALSE)

# A tibble: 3 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                     time         gc              
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                     <list>       <list>          
1 mutate       94.9ms     95ms      10.5   18.66MB    21.1      2     4      190ms <NULL> <Rprofmem[,3] [2,359 x 3]> <bch:tm [5]> <tibble [5 x 3]>
2 dt_set       89.9ms   89.9ms      11.1    7.67MB    11.1      3     3      270ms <NULL> <Rprofmem[,3] [65 x 3]>    <bch:tm [6]> <tibble [6 x 3]>
3 base         87.3ms   87.6ms      11.4    7.63MB     2.28     5     1      438ms <NULL> <Rprofmem[,3] [10 x 3]>    <bch:tm [6]> <tibble [6 x 3]>

All are approximately equal in time, but the base R and data.table solutions appear to avoid making additional copies of the data (as far as memory is concerned at least; the base R solution does not modify in-place the same way the data.table solution does). I.e.,

format(object.size(df1), units = "MB")
[1] "7.6 Mb"

Upvotes: 1

Related Questions