Reputation: 49
I'm trying to clean messy free-form-captured strings in some dataframes by looping through regular expressions defined in other dataframes. For example:
id <- c(1, 2, 3, 4)
text <- c(NA, "messy1 messy2", "MESSY2,,, messy1", "ignore")
df <- data.frame(id, text)
mapin <- c("messy1", "messy2")
mapout <- c("Clean 1", "Clean 2")
map <- data.frame(mapin, mapout)
cout <- c(NA, "Clean 1, Clean 2", "Clean 1, Clean 2", NA)
I want to create the new column df$cout
by regex'ing the matching tokens in map$mapin
(ignoring case), and outputting corresponding cleaned tokens from map$mapout
in a concatenated string, where the output tokens are sorted by alphabetical order. It seems to me that copy/regex-substitute on the messy dataframe column would be more pain that it's worth, because all the other non-matching stuff will have to be discarded somehow.
Can anyone see a good R / vectorised / not-hammy way of doing this?
Upvotes: 1
Views: 204
Reputation: 49
Based on @Jaap's great answer, with a few tweaks that came up in my problem domain, but which could be common to others' challenges:
Dynamically determine the size of the map when creating matrix m2
.
Sort the output tokens consistently.
Use stri_extract_first_fixed
to be insensitive to duplicated input tokens.
Full example:
# Problem
id <- c(1, 2, 3, 4)
text <- c(NA, "messy1 messy2", "MESSY2,,, messy1,messy1", "ignore")
df <- data.frame(id, text)
mapin <- c("messy2", "messy1")
mapout <- c("Clean 2", "Clean 1")
map <- data.frame(mapin, mapout)
# Solution
m1 <- sapply(map$mapin, function(x) stri_extract_first_fixed(tolower(df$text), x))
m2 <- matrix(map$mapout[match(m1, map$mapin)], ncol = nrow(map))
vec <- apply(m2, 1, function(x)
paste0(c(unique(sort(x, na.last = NA)), use.names = FALSE), collapse = ", "))
vec[vec==''] <- NA
df$clean <- vec
Per @r2evans' comment above, if you have a regular expression need here, just replace stri_extract_first_fixed
with stri_extract_first_regex
.
Upvotes: 0
Reputation: 83215
Using stri_extract_all
from the stringi
-package:
library(stringi)
m1 <- sapply(map$mapin, function(x) stri_extract_all_fixed(tolower(df$text), x))
m2 <- matrix(map$mapout[match(m1, map$mapin)], ncol = nrow(map))
vec <- apply(m2, 1, function(x) paste(na.omit(x), collapse = ", ") )
vec[vec == ''] <- NA
df$cout <- vec
you get:
> df
id text cout
1 1 <NA> <NA>
2 2 messy1 messy2 Clean 1, Clean 2
3 3 MESSY2,,, messy1 Clean 1, Clean 2
4 4 ignore <NA>
Upvotes: 1