richarddb
richarddb

Reputation: 49

Regex-based string re-mapping in R dataframes

I'm trying to clean messy free-form-captured strings in some dataframes by looping through regular expressions defined in other dataframes. For example:

id <- c(1, 2, 3, 4) 
text <- c(NA, "messy1 messy2", "MESSY2,,, messy1", "ignore")
df <- data.frame(id, text)

mapin <- c("messy1", "messy2")
mapout <- c("Clean 1", "Clean 2")
map <- data.frame(mapin, mapout)

cout <- c(NA, "Clean 1, Clean 2", "Clean 1, Clean 2", NA)

I want to create the new column df$cout by regex'ing the matching tokens in map$mapin (ignoring case), and outputting corresponding cleaned tokens from map$mapout in a concatenated string, where the output tokens are sorted by alphabetical order. It seems to me that copy/regex-substitute on the messy dataframe column would be more pain that it's worth, because all the other non-matching stuff will have to be discarded somehow.

Can anyone see a good R / vectorised / not-hammy way of doing this?

Upvotes: 1

Views: 204

Answers (2)

richarddb
richarddb

Reputation: 49

Based on @Jaap's great answer, with a few tweaks that came up in my problem domain, but which could be common to others' challenges:

  1. Dynamically determine the size of the map when creating matrix m2.

  2. Sort the output tokens consistently.

  3. Use stri_extract_first_fixed to be insensitive to duplicated input tokens.

Full example:

# Problem
id <- c(1, 2, 3, 4) 
text <- c(NA, "messy1 messy2", "MESSY2,,, messy1,messy1", "ignore")
df <- data.frame(id, text)
mapin <- c("messy2", "messy1")
mapout <- c("Clean 2", "Clean 1")
map <- data.frame(mapin, mapout)

# Solution
m1 <- sapply(map$mapin, function(x) stri_extract_first_fixed(tolower(df$text), x))
m2 <- matrix(map$mapout[match(m1, map$mapin)], ncol = nrow(map))
vec <- apply(m2, 1, function(x)
    paste0(c(unique(sort(x, na.last = NA)), use.names = FALSE), collapse = ", "))
vec[vec==''] <- NA
df$clean <- vec

Per @r2evans' comment above, if you have a regular expression need here, just replace stri_extract_first_fixed with stri_extract_first_regex.

Upvotes: 0

Jaap
Jaap

Reputation: 83215

Using stri_extract_all from the stringi-package:

library(stringi)
m1 <- sapply(map$mapin, function(x) stri_extract_all_fixed(tolower(df$text), x))
m2 <- matrix(map$mapout[match(m1, map$mapin)], ncol = nrow(map))

vec <- apply(m2, 1, function(x) paste(na.omit(x), collapse = ", ") )
vec[vec == ''] <- NA

df$cout <- vec

you get:

> df
  id             text             cout
1  1             <NA>             <NA>
2  2    messy1 messy2 Clean 1, Clean 2
3  3 MESSY2,,, messy1 Clean 1, Clean 2
4  4           ignore             <NA>

Upvotes: 1

Related Questions