safex

Reputation: 2514

Speed up regex replace

I have a large dataframe in which I need to clean a string column. A colleague wrote a function that contains dozens and dozens of statements like these:

  word <- gsub("#","",word)
  ...
  word <- gsub("&","",word)

Apart from those statements, the function contains hardly any other code besides:

  word <- str_replace_all(word, "[[:punct:]]", "")
  word <- str_replace_all(word, "[^[:alnum:]]", "") 

  idx <- rep(remove_numbers, length(word)) & grepl("\\D", word)
  word[idx] <- gsub("^\\d+|\\d+$", "", word[idx])

The function takes very long to complete, and I am looking for ways to speed it up. My ideas:

  1. Is there a way to group multiple such gsubs into one? E.g., a single regex like gsub("#|&", "", word)
  2. Parallelize the application of the function

Upvotes: 1

Views: 77

Answers (1)

Konrad Rudolph

Reputation: 545923

Yes, you can combine the regular expressions as in your example, and yes, this should speed up the function considerably.

Instead of writing #|…|& you can also use character classes, and write [#…&], if all the replacements concern single symbols.
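For instance (the exact symbol set here is hypothetical, since the question only shows # and &), dozens of single-character gsub calls collapse into one call with a character class:

```r
word <- c("a#b", "x&y", "pri%ce")

# One pass over the vector instead of one pass per symbol:
word <- gsub("[#&%]", "", word)
```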

Furthermore, it’s unclear why your colleague suddenly switched from gsub to str_replace_all. The two do the same thing, so you can merge those statements, too.

But more importantly, that last str_replace_all makes all the others redundant, because it removes all non-alphanumeric characters. Everything before it, which removed individual non-alphanumeric characters one at a time, is therefore unnecessary. Just remove it.
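Putting that together, the whole cleanup reduces to a single substitution plus the optional digit stripping from the question. A sketch (the function name and the remove_numbers flag are assumptions based on the code shown):

```r
clean_words <- function(word, remove_numbers = FALSE) {
  # One pass removes every non-alphanumeric character; this
  # subsumes all the individual gsub() calls and the
  # [[:punct:]] replacement from the original function.
  word <- gsub("[^[:alnum:]]", "", word)

  # Optionally strip leading/trailing digits, but only from
  # words that are not purely numeric (as in the question).
  if (remove_numbers) {
    idx <- grepl("\\D", word)
    word[idx] <- gsub("^\\d+|\\d+$", "", word[idx])
  }
  word
}
```

With the redundant statements gone, there is likely nothing left worth parallelising.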

Upvotes: 3

Related Questions