Reputation: 2514
I have a large dataframe in which I need to clean a string column. A colleague wrote a function that contains dozens and dozens of statements like these:
word <- gsub("#","",word)
...
word <- gsub("&","",word)
Apart from those, there is hardly any other code except:
word <- str_replace_all(word, "[[:punct:]]", "")   # remove punctuation
word <- str_replace_all(word, "[^[:alnum:]]", "")  # remove everything that is not alphanumeric
idx <- rep(remove_numbers, length(word)) & grepl("\\D", word)  # words that are not purely numeric, if number removal is on
word[idx] <- gsub("^\\d+|\\d+$", "", word[idx])    # strip leading/trailing digits from those words
The function takes very long to complete. I am looking for ways to speed this up. My idea: can I combine all of those gsubs into one? I.e., use a regexp like gsub("#|&", "", word)?
Upvotes: 1
Views: 77
Reputation: 545923
Yes, you can combine the regular expressions as in your example, and yes, this should speed up the function considerably.
Instead of writing #|…|&, you can also use character classes and write [#…&], if all the replacements concern single symbols.
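For instance, a minimal sketch of the two equivalent forms, using made-up sample words:

word <- c("a#b", "c&d", "e,f")
gsub("#|&|,", "", word)   # alternation of single symbols
gsub("[#&,]", "", word)   # same result with a character class
# both return: "ab" "cd" "ef"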
Furthermore, it’s unclear why your colleague suddenly switched from gsub to str_replace_all. The two do the same thing, so you can merge those statements, too.
But, more importantly, that last str_replace_all replacement makes all the others redundant, because it removes all non-alphanumeric characters. So everything before it, which removed individual non-alphanumeric characters, is unnecessary. Just remove it.
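To illustrate, here is a minimal sketch of what the whole cleaning could collapse to (clean_words is an illustrative name; the digit-stripping logic is taken from your code, and remove_numbers is assumed to be a single logical flag):

clean_words <- function(word, remove_numbers = TRUE) {
  # one pass removes every non-alphanumeric character, which subsumes
  # all the individual gsub() calls and the [[:punct:]] replacement
  word <- gsub("[^[:alnum:]]", "", word)
  # optionally strip leading/trailing digits from words that are not purely numeric
  idx <- rep(remove_numbers, length(word)) & grepl("\\D", word)
  word[idx] <- gsub("^\\d+|\\d+$", "", word[idx])
  word
}

clean_words(c("foo#1", "12bar&34", "2021"))
# returns: "foo" "bar" "2021"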
Upvotes: 3