JohnL_10

Reputation: 569

An efficient way to apply a function over a list of dataframes

I have a list of data frames in R. I need to apply a function to each data frame (in this case, removing special characters) and get back a list of data frames.

Using lapply and as.data.frame, the following works fine and delivers exactly what I need:

my_df = data.frame(names = seq(1, 10), chars = c("abcabc!!", "abcabc234234!!"))
my_list = list(my_df, my_df, my_df)

#str(my_list)
List of 3
 $ :'data.frame':   10 obs. of  2 variables: ...

new_list <- lapply(my_list, function(y) as.data.frame(lapply(y, function(x) gsub("[^[:alnum:][:space:]']", "", x))))

# str(new_list)
List of 3
 $ :'data.frame':   10 obs. of  2 variables:
  ..$ names: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
  ..$ chars: Factor w/ 2 levels "abcabc","abcabc234234": 1 2 1 2 1 2 1 2 1 2
 $ :'data.frame':   10 obs. of  2 variables:
  ..$ names: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
  ..$ chars: Factor w/ 2 levels "abcabc","abcabc234234": 1 2 1 2 1 2 1 2 1 2
 $ :'data.frame':   10 obs. of  2 variables:
  ..$ names: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
  ..$ chars: Factor w/ 2 levels "abcabc","abcabc234234": 1 2 1 2 1 2 1 2 1 2

But I am wondering if there is a more efficient way that doesn't require a nested lapply. Perhaps a different apply-family function that returns the elements as data frames?

Upvotes: 4

Views: 244

Answers (2)

mpjdem

Reputation: 1544

While @akrun is right that your inner lapply call is unnecessary in this example, I think that approach does not solve the general case, where many columns might need cleaning and you do not know in advance which ones.

What is inefficient here is the conversion back with as.data.frame, not the inner lapply call. The lapply call itself is almost as fast as applying the function to a single vector or a matrix of the same size.
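As a minimal sketch of avoiding that conversion in base R, the columns can be assigned back into the existing data frame with `y[] <- ...`, which keeps `y` a data frame. The helper name `clean_in_place` and the small example data are only illustrative, and I have not timed this variant:

```r
# Same cleaning function as in the question.
f <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

# Small example data; stringsAsFactors = FALSE keeps the columns as character.
my_df <- data.frame(names = as.character(1:10),
                    chars = rep(c("abcabc!!", "abcabc234234!!"), 5),
                    stringsAsFactors = FALSE)
my_list <- list(my_df, my_df, my_df)

# Hypothetical helper: assigning into y[] replaces every column
# but leaves y a data frame, so no as.data.frame() call is needed.
clean_in_place <- function(y) {
  y[] <- lapply(y, f)
  y
}

new_list <- lapply(my_list, clean_in_place)
str(new_list[[1]])
```

Note that gsub returns character vectors, so every column passed through `f` ends up as character, including columns that were numeric before.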

If you really want to be more time-efficient here, I would suggest using data.table. I've made the example a bit larger so we can time it.

library(data.table)

f <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

my_df <- as.data.frame(matrix(paste0(sample(c(letters, '!'), size = 1000000, replace = TRUE),
                                     sample(c(letters, '!'), size = 1000000, replace = TRUE)),
                              ncol = 250), stringsAsFactors = FALSE)
my_list = list(my_df, my_df, my_df)

system.time(lapply(my_list, function(y) as.data.frame(lapply(y, f))))
# 2.256 seconds

my_dt <- as.data.table(my_df)
my_list2 = list(my_dt, my_dt, my_dt)

system.time(lapply(my_list2, function(y) y[,lapply(.SD,f)]))
# 1.180 seconds

Upvotes: 1

akrun

Reputation: 887851

We don't need a nested lapply; a single lapply with transform is enough:

lapply(my_list, transform, chars = gsub("[^[:alnum:][:space:]']", "", chars))

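The pattern can be made more compact as "[^[:alnum:] ']". Note that this keeps only the literal space character, not every whitespace character as [:space:] does.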
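To illustrate how a shortened pattern such as "[^[:alnum:] ']" differs from the original, here is a small sketch on a made-up vector (the strings and variable names are only examples). The two patterns agree unless the input contains whitespace other than a plain space:

```r
x <- c("abc abc!!", "it's\ttabbed?")

# Pattern from the question: keeps letters, digits, any whitespace, and apostrophes.
keep_space_class <- gsub("[^[:alnum:][:space:]']", "", x)

# Shortened pattern: keeps letters, digits, the space character, and apostrophes,
# so the tab in the second element is removed as well.
keep_literal_space <- gsub("[^[:alnum:] ']", "", x)

print(keep_space_class)
print(keep_literal_space)
```

If the data can contain tabs or newlines that should be preserved, the [:space:] version is probably the safer choice.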

Upvotes: 4
