Dictionary regex

Question

I have a function as follows:

library(stringr)  # provides str_replace()

HistolMacDescrip <- function(dataframe, MacroColumn) {
  dataframe <- data.frame(dataframe)

  # Column specific cleanup
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Dd]ictated by.*", "")
  # Conversion of text numbers to allow number of biopsies to be extracted
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Oo]ne", "1")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Ss]ingle", "1")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Tt]wo", "2")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Tt]hree", "3")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Ff]our", "4")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Ff]ive", "5")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Ss]ix", "6")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Ss]even", "7")
  dataframe[, MacroColumn] <- str_replace(dataframe[, MacroColumn],
                                          "[Ee]ight", "8")
  return(dataframe)
}

This strikes me as a bit inefficient. I have other functions that do a similar thing, and I'd like to replace them with a single function that performs this kind of dictionary lookup, perhaps driven by an external file that lists key-value pairs. Some of the keys will be regexes, e.g.

key              value
bus|car|.*toy    vehicle
\d+\s+mg         dose

Is there a function that can do this kind of dictionary lookup, so that all I have to do is define the dictionary, e.g. in a CSV file?

niko · Accepted Answer

Here is a possible approach:

# Function
my_transform <- function (string, lookup) {
  new_string <- string
  vapply(1:nrow(lookup),
         function (k) {
           new_string <<- gsub(lookup$key[k], lookup$value[k], new_string)
           0L
         }, integer(1))
  new_string
}

# Results
# lookup table
lookup <- structure(list(key = c("bus|car|.*toy", "\\d+\\s+mg"),
                         value = c("vehicle","dose")), 
                    row.names = 1:2, class = "data.frame")

# string 1
string1 <- c('This car', '256 mg', '6536 \n mg')
my_transform(string1, lookup)
# [1] "This vehicle" "dose"         "dose" 

# string 2
string2 <- c('This car is no toy', '256 mg', '6536 \n mg')
my_transform(string2, lookup)
# [1] "vehicle" "dose"    "dose"

# data frame
df <- data.frame(string1, string2, stringsAsFactors = FALSE)
matrix(my_transform(unlist(df), lookup), nrow(df), ncol(df))  
#      [,1]           [,2]     
# [1,] "This vehicle" "vehicle"
# [2,] "dose"         "dose"   
# [3,] "dose"         "dose"  
# or
vapply(1:ncol(df), 
       function (k) my_transform(.subset2(df, k), lookup),
       character(nrow(df)))
#      [,1]           [,2]     
# [1,] "This vehicle" "vehicle"
# [2,] "dose"         "dose"   
# [3,] "dose"         "dose"

So the idea is to store the substitutions in a table and then apply them. Using the above, it should be possible to obtain the desired output.
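Since the question asks about keeping the dictionary in an external file, here is a minimal sketch of that variant. The name apply_dictionary and the inline CSV text are made up for illustration; with a real file you would pass its path to read.csv() instead of text =, and write the regexes in the file with single backslashes. A plain for loop stands in for the vapply/<<- construct above.

```r
# Hypothetical dictionary. The doubled backslashes are needed only because
# the CSV lines are written here as R string literals.
csv_lines <- c("key,value",
               "bus|car|.*toy,vehicle",
               "\\d+\\s+mg,dose")
lookup_csv <- read.csv(text = csv_lines, stringsAsFactors = FALSE)

# Apply every key/value pair in table order to a character vector
apply_dictionary <- function(string, lookup) {
  for (k in seq_len(nrow(lookup))) {
    string <- gsub(lookup$key[k], lookup$value[k], string)
  }
  string
}

res_csv <- apply_dictionary(c("This car", "256 mg"), lookup_csv)
res_csv
# [1] "This vehicle" "dose"
```

Rows are applied in order, so a later key sees the output of earlier substitutions; the order of the dictionary can therefore change the result.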

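Note, however, that issues can arise; see string2[1]. There the greedy .*toy alternative matches everything from the start of the string up to "toy", so the whole sentence collapses to "vehicle". Make sure you know exactly what the desired output is in such a case.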
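To make the string2[1] case concrete, the sketch below compares the original key with a narrower one. The \\w*toy variant is only an assumption about what might have been intended (words ending in "toy"); it is not taken from the question.

```r
x <- "This car is no toy"

# Original key: ".*toy" can start at the beginning of the string and runs
# through "toy", so the whole sentence is replaced.
res_greedy <- gsub("bus|car|.*toy", "vehicle", x)
res_greedy
# [1] "vehicle"

# Narrower key (assumed intent): only the word ending in "toy" is replaced.
res_word <- gsub("bus|car|\\w*toy", "vehicle", x)
res_word
# [1] "This vehicle is no vehicle"
```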

Two final points:

gsub has other useful arguments, such as perl (TRUE or FALSE) and fixed (TRUE or FALSE). These could be incorporated into the lookup table, for example by adding columns labeled perl, fixed, etc. This gives you more control.
There are a lot of useful regex functions (see ?sub); depending on your needs, you can use or combine other functions.
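A rough sketch of the per-row perl/fixed idea follows. The table lookup_ext, its example rows, and the function name my_transform_ext are hypothetical; the only assumption about gsub is that it accepts perl and fixed as logical arguments, which it does.

```r
# Hypothetical lookup table with per-row matching options
lookup_ext <- data.frame(
  key   = c("bus|car", "(?i)\\bONE\\b", "a.b"),
  value = c("vehicle", "1", "X"),
  perl  = c(FALSE, TRUE, FALSE),   # row 2 uses PCRE (inline case-insensitive flag)
  fixed = c(FALSE, FALSE, TRUE),   # row 3 is matched literally, "." is not a wildcard
  stringsAsFactors = FALSE
)

# Apply each row in order, passing that row's options through to gsub
my_transform_ext <- function(string, lookup) {
  for (k in seq_len(nrow(lookup))) {
    string <- gsub(lookup$key[k], lookup$value[k], string,
                   perl = lookup$perl[k], fixed = lookup$fixed[k])
  }
  string
}

res_ext <- my_transform_ext(c("One car", "a.b and acb"), lookup_ext)
res_ext
# [1] "1 vehicle" "X and acb"
```

Avoid setting perl and fixed to TRUE on the same row: gsub warns in that case and ignores perl.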
