Ahmet Atilla Colak

Reputation: 79

For loop over a tibble takes too much time

For the project I am working on, I am analyzing two datasets of 500,000 rows each. I need to filter these rows based on the value in one specific column. Here is the function I wrote to apply to the tibbles:


theme_analyser <- function(tibble_to_analyse) {
 
for (i in 1:nrow(tibble_to_analyse)) {
  theme <- unlist(strsplit((tibble_to_analyse$themes[i]), ";"))
  if (any(theme %in% themes_to_use)){
    next}
  else {
    tibble_to_analyse <- tibble_to_analyse[-i,]
  }
}  
}

In this function, themes_to_use is a character vector of theme names. A cell in the themes column can hold more than one value, with the values separated by ";". Therefore, I first split each cell on ";" and unlist the result.

The problem with this code is that it runs too slowly. It took 18 hours to get through only 250k rows. How can I speed this up so that it does not take as much time?

Assume I have a dataset like below :

A  B  
1  "bright" 
2  "shiny"
3  "bright" 

I want to keep only the rows where column B is equal to "bright". My code is meant to select the rows where the themes column contains at least one of the values in themes_to_use.

Thank you in advance.

Upvotes: 1

Views: 159

Answers (2)

Jon Spring

Reputation: 66835

Another approach using dplyr/stringr from the tidyverse:

library(tidyverse)
tibble_to_analyse %>% 
  filter(str_detect(themes, paste(themes_to_use, collapse = "|"))) # Edit, thank you @Jared_mamrot

Example data (this assumes themes_to_use is already defined, e.g. c("one", "three") as in the other answer):

set.seed(123)
n = 10000
tibble_to_analyse = tibble(
  val1 = sample(c(LETTERS, themes_to_use), n, replace = TRUE),
  val2 = sample(c(LETTERS, themes_to_use), n, replace = TRUE),
  themes = paste(val1, val2, sep = ";"),
  values = 1:n
)

This is a big speed improvement over the loop, but not as fast as @jared_mamrot's base R solution.
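
One caveat with the pattern above: paste(themes_to_use, collapse = "|") matches substrings, so a theme such as "one" would also match a cell containing "someone". Below is a minimal sketch of a stricter pattern, assuming the themes are separated by ";" with no surrounding spaces and contain no regex metacharacters. The helper name exact_theme_pattern is hypothetical, and base R grepl() is used here only to keep the sketch dependency-free.

```r
# Hypothetical helper: build a regex that only matches a whole
# ";"-separated token, not a substring of a longer theme name.
# Assumes theme names contain no regex metacharacters.
exact_theme_pattern <- function(themes) {
  paste0("(^|;)(", paste(themes, collapse = "|"), ")(;|$)")
}

themes_to_use <- c("one", "three")
themes <- c("one;theme", "someone;theme", "theme;three", "two;four")

pattern <- exact_theme_pattern(themes_to_use)

# Substring match: "someone;theme" is kept because it contains "one"
loose <- grepl(paste(themes_to_use, collapse = "|"), themes, perl = TRUE)

# Token match: "someone;theme" is dropped
keep <- grepl(pattern, themes, perl = TRUE)
```

The same pattern string should also work as the second argument of str_detect() inside filter(), if substring matches turn out to be a problem in the real data.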


Upvotes: 2

jared_mamrot

Reputation: 26675

Without a minimal reproducible example it's difficult to say whether this solution is appropriate, but one of the reasons your loop takes so long is that on every iteration you rewrite (copy) the whole tibble whenever a row is dropped, and you call strsplit() on a single element: theme <- unlist(strsplit((tibble_to_analyse$themes[i]), ";")).

If you bypass those issues by vectorising the function it should be significantly faster - here is an example:

library(tidyverse)

set.seed(123)

df <- data.frame(themes = sample(c("one;theme", "two;theme",
                               "three;theme", "four;theme"),
                             size = 10000, replace = TRUE),
              values = rnorm(10000))

themes_to_use <- c("one", "three")
                         
theme_analyser <- function(tibble_to_analyse) {
  
  for (i in 1:nrow(tibble_to_analyse)) {
    theme <- unlist(strsplit((tibble_to_analyse$themes[i]), ";"))
    if (any(theme %in% themes_to_use)){
      next}
    else {
      tibble_to_analyse <- tibble_to_analyse[-i,]
    }
  }  
}

vectorised_theme_analyser <- function(tibble_to_analyse) {
  # Keep only the text before the first ";" (the first theme) and match on that
  first_theme <- sub(";.*", "", tibble_to_analyse$themes)
  tibble_to_analyse[first_theme %in% themes_to_use, ]
}

res <- microbenchmark::microbenchmark(vectorised_theme_analyser(df),
                                      theme_analyser(df))
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Created on 2021-09-10 by the reprex package (v2.0.1)
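
Note that the vectorised function above only compares the first theme in each cell. If a match on any theme in the cell should count, as in the original loop, one possible base R sketch is shown below. The function name any_theme_analyser and the small df_small data frame are assumptions made for illustration, and the sketch assumes themes is a character column.

```r
# Hypothetical variant: keep a row if ANY of its ";"-separated themes is wanted.
any_theme_analyser <- function(tibble_to_analyse, themes_to_use) {
  # strsplit() is vectorised: one call splits every row at once
  split_themes <- strsplit(tibble_to_analyse$themes, ";", fixed = TRUE)
  # One logical per row: does the row contain at least one wanted theme?
  keep <- vapply(split_themes,
                 function(x) any(x %in% themes_to_use),
                 logical(1))
  # Subset once, instead of dropping rows one at a time inside a loop
  tibble_to_analyse[keep, , drop = FALSE]
}

# Illustrative data only
df_small <- data.frame(themes = c("one;theme", "theme;three", "two;four"),
                       values = 1:3,
                       stringsAsFactors = FALSE)

filtered <- any_theme_analyser(df_small, c("one", "three"))
```

Here the second row is kept even though its wanted theme comes second, which the first-theme version would miss. Passing themes_to_use as an argument rather than reading it from the global environment is a design choice of this sketch, not something the original code does.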

Edit

With the simplified example data provided in your question, here is a simplified subset method and comparison with @Jon Spring's tidy_detect method:

library(tidyverse)

set.seed(123)

df <- data.frame(themes = sample(c("one", "two",
                               "three", "four"),
                             size = 500000, replace = TRUE),
              values = rnorm(500000))

themes_to_use <- c("one", "three")

subset_in <- function(tibble_to_analyse) {
  tibble_to_analyse[tibble_to_analyse$themes %in% themes_to_use,]
}

tidy_detect <- function(tibble_to_analyse) {
  tibble_to_analyse %>% filter(str_detect(themes, paste(themes_to_use, collapse = "|")))
}

res <- microbenchmark::microbenchmark(subset_in(df),
                                      tidy_detect(df))
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Created on 2021-09-10 by the reprex package (v2.0.1)

Upvotes: 1
