mattpolicastro

Reputation: 377

Lookup table with subset/grepl in R

I'm analyzing a set of URLs and values extracted using a crawler. While I could extract substrings from the URL, I'd really rather not bother with the regex to do so. Is there a simple way to do a lookup-table-style replacement using subset/grepl without resorting to dplyr (i.e., a conditional mutate on the variables)?

My current process:

test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA
)

test[grepl('testing', test$url), ]$subdir <- 'testing'
test[grepl('evaluating', test$url), ]$subdir <- 'evaluating'
test[grepl('analyzing', test$url), ]$subdir <- 'analyzing'

Obviously, this is a little clumsy and doesn't scale well. With dplyr, I'd be able to do something with conditionals like:

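library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe

# Match against `url`; `subdir` is still all NA at this point
test %<>% as_tibble() %>% 
  mutate(subdir = ifelse(
    grepl('testing', url), 
    'test r', 
    ifelse(
      grepl('evaluating', url), 
      'eval r', 
      ifelse(
        grepl('analyzing', url), 
        'anal r', 
        NA
      ))))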

But, again, really goofy and I don't want to incur a package dependency if at all possible. Is there any way to do regex-based subsetting with some sort of lookup table?

Edit: Just a few clarifications:

  1. For extracting subdirectories, yes, regex would be most efficient; however, I was hoping for a more general pattern that could match a dictionary-like struct of strings with other, arbitrary values.
  2. Of course, nested ifelse is ugly and prone to error—just wanted to get a quick-and-dirty example with dplyr up.

Edit 2: Thought I'd loop back and post what I ended up with based upon BondedDust's approach. Decided to practice some mapping and non-standard eval while at it:

test <- data.frame(
  url = c(
    'google.com/testing/duck',
    'google.com/testing/dog',
    'google.com/testing/cat',
    'google.com/evaluating/duck', 
    'google.com/evaluating/dog', 
    'google.com/evaluating/cat', 
    'google.com/analyzing/duck',
    'google.com/analyzing/dog',
    'google.com/analyzing/cat',
    'banana'
  ),
  content = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  subdir = NA
)

# List used for key/value lookup, names can be regex
lookup <- c(
  "testing" = "Testing is important",
  "Eval.*" = 'eval in R',
  "analy(z|s)ing" = 'R is fun'
)

# Dumb test for error handling:
# lookup <- c('test', 'hey')

# Defining new lookup function
regexLookup <- function(data, dict, searchColumn, targetColumn, ignore.case = TRUE){
  # Basic check—need to separate errors/handling
  if(is.null(names(dict)) || is.null(dict[[1]])) {
    stop("Not a valid replacement value; use a key/value store for `dict`.")
  }

  # Non-standard eval for the column names; not sure if I should
  # add safety/type checks for these
  searchColumn <- eval(substitute(searchColumn), data)
  targetColumn <- deparse(substitute(targetColumn))

  # Define find-and-replace utility
  findAndReplace <- function (key, val){
    data[grepl(key, searchColumn, ignore.case = ignore.case), targetColumn] <- val
    data <<- data
  }

  # Map over the key/value store
  mapply(findAndReplace, names(dict), dict)

  # Return result, with non-matching rows preserved
  return(data)
}

regexLookup(test, lookup, url, subdir, ignore.case = FALSE)

Upvotes: 1

Views: 852

Answers (2)

IRTFM

Reputation: 263332

for (target in c('testing', 'evaluating', 'analyzing')) {
  test[grepl(target, test$url), 'subdir'] <- target
}

 test
                        url content     subdir
1   google.com/testing/duck       1    testing
2 google.com/evaluating/dog       2 evaluating
3  google.com/analyzing/cat       3  analyzing

The vector of targets could instead be stored under a name in the workspace and passed to the loop by that name:

targets <-   c('testing','evaluating','analyzing') 
for( target in targets ) { ...}
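
For the more general case raised in the question's edit (patterns mapped to arbitrary values), the same loop can run over the names of a named vector. This is only a sketch: the `lookup` vector and its replacement values are made up for illustration, and `stringsAsFactors = FALSE` is added so the example behaves the same on older R versions.

# Example data, copied from the question
test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA,
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: names are regex patterns, values are the replacements
lookup <- c(
  'testing' = 'test r',
  'evaluating' = 'eval r',
  'analy(z|s)ing' = 'analysis r'
)

# Apply each pattern in turn; if several patterns match a row, the last one wins
for (pattern in names(lookup)) {
  test[grepl(pattern, test$url), 'subdir'] <- lookup[[pattern]]
}

Rows that match no pattern keep their original NA in subdir.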

Upvotes: 3

Shenglin Chen

Reputation: 4554

Try this:

test$subdir<-gsub('.*\\/(.*)\\/.*','\\1',test$url)
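
One thing to watch for: when the pattern does not match (for example the 'banana' row in the question's second data set), gsub returns the input string unchanged. If NA is preferred for those rows, a guard with grepl may help. A minimal sketch, where the `urls` and `pattern` names are only for illustration:

# Sample inputs: two URLs with a subdirectory and one string without slashes
urls <- c('google.com/testing/duck', 'google.com/evaluating/dog', 'banana')

# Same idea as the pattern above; the slashes do not need escaping
pattern <- '.*/(.*)/.*'

# Extract the middle path segment where the pattern matches, NA otherwise
subdir <- ifelse(grepl(pattern, urls), sub(pattern, '\\1', urls), NA)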

Upvotes: 2
