Reputation: 1098
I can easily convert all columns to lower case in a data.table with lapply(df, stringi::stri_trans_tolower). This is also faster than the tidyverse method:
library(magrittr)  # for %>%
microbenchmark::microbenchmark(
  dplyr::mutate(df, dplyr::across(dplyr::everything(), tolower)),
  lapply(df, stringi::stri_trans_tolower) %>% data.table::as.data.table(),
  times = 5
)
I use stringi instead of base tolower because it is twice as fast:
> microbenchmark::microbenchmark(tolower(rep("APPLE", 100000)),
+ stringi::stri_trans_tolower(rep("APPLE", 100000)),
+ times = 5)
Unit: milliseconds
expr min lq mean median uq max neval
tolower(rep("APPLE", 1e+05)) 25.51155 25.55177 26.28368 25.59082 25.67324 29.09102 5
stringi::stri_trans_tolower(rep("APPLE", 1e+05)) 15.21042 15.60595 15.71065 15.80013 15.81833 16.11840 5
However, this creates a copy on modify when it coerces the data.table via as.list, as seen below:
invisible(lapply(x, stringi::stri_trans_tolower))
tracemem[0x7fd8b1b65800 -> 0x7fd8b65eac00]: as.list.data.frame as.list lapply
This seems small (only one copy), but when my data tables are 3-4 GB, I need to optimize where I can. Is there any way to utilize data.table semantics to take advantage of modify-by-reference, making everything lower case without a copy? I would even be open to an Rcpp option if possible.
Upvotes: 0
Views: 1321
Reputation: 5138
Here are three possible solutions; depending on the structure of your data, I would go with the base R solution or the data.table solution. The tidyverse and base R solutions use a data.frame, the other uses a data.table.
Also, if you are concerned about memory allocation, the bench package is nice:
library(dplyr)
library(data.table)
df1 <- data.frame(col1 = rep(NA, 1e5))
df1[paste0("col", 1:10)] <- lapply(paste0("TEST_TEXT_", 1:10), rep, 1e5)
df2 <- df1
dt1 <- copy(df1)
setDT(dt1)
bench::mark(mutate = {df1 <- mutate(df1, across(everything(), stringi::stri_trans_tolower))},
dt_set = {for(col in seq_along(dt1)) set(dt1, j = col, value = stringi::stri_trans_tolower(dt1[[col]]))},
base = {df2[] <- lapply(df2, stringi::stri_trans_tolower)},
check = FALSE)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 mutate 94.9ms 95ms 10.5 18.66MB 21.1 2 4 190ms <NULL> <Rprofmem[,3] [2,359 x 3]> <bch:tm [5]> <tibble [5 x 3]>
2 dt_set 89.9ms 89.9ms 11.1 7.67MB 11.1 3 3 270ms <NULL> <Rprofmem[,3] [65 x 3]> <bch:tm [6]> <tibble [6 x 3]>
3 base 87.3ms 87.6ms 11.4 7.63MB 2.28 5 1 438ms <NULL> <Rprofmem[,3] [10 x 3]> <bch:tm [6]> <tibble [6 x 3]>
All are approximately equal in time, but the base R and data.table solutions appear to avoid making additional copies of the data (at least as far as allocated memory is concerned; the base R solution does not modify in place the way the data.table solution does). For comparison, the size of the data itself:
format(object.size(df1), units = "MB")
[1] "7.6 Mb"
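The set() loop from the benchmark can also be used on its own, and tracemem() should confirm that no copy of the table is made. A minimal sketch, with made-up column contents (tracemem() is available in the standard CRAN binaries; a source build needs --enable-memory-profiling):

```r
library(data.table)

dt <- data.table(a = rep("APPLE", 5L), b = rep("PEAR", 5L))

tracemem(dt)  # if a copy were made below, a "tracemem[...]" line would print

# modify each column by reference; set() avoids the as.list() coercion
# that lapply() over the whole table triggers
for (col in seq_along(dt)) {
  set(dt, j = col, value = stringi::stri_trans_tolower(dt[[col]]))
}

untracemem(dt)
dt$a[1]  # "apple"
```

Because set() updates the columns in place, dt keeps its original address and only the new lower-cased character vectors are allocated.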
Upvotes: 1