user101089
user101089

Reputation: 3992

R: How to classify strings based on a set of regular expressions

I have a vector of text strings that describe the sources for images in a book, but not in an entirely consistent format--- they are captured from the LaTeX source & written to a text file.

I need to process these and collect information about the source attribution and permission status. A sample of strings looks like this:

strings <- '
GBD 2016 Alcohol Collaborators (2018)
\citet {DeBoldFriedman:2015}, permission required
Author graphic, using various public domain images
\url {http://www.histogames.com/HTML/inventaire/periodes-historiques/prehistoire.php}
\url {https://commons.wikimedia.org/wiki/File:Egyptmotionseries.jpg}
\url {http://commons.wikimedia.org}, public domain.
\citet {Plot:1685}, author image collection
Author image collection
 From \citet {Priestley:1765}, author image collection
 Author image collection
 Courtesy Library of Congress
\citet {Langren:1644}, image courtesy of the Koninklijke Bibliotheek van Belgi\"e 
Public domain
Author graphic
Author graphic
Author graphic
Author image collection
Wikipedia, https://commons.wikimedia.org/wiki/File:Tablas\_alfonsies.jpg, public domain
'

One main task is to create a variable status indicating permission status by finding matches to simple regular expressions in the strings. The categories can be defined by the following patterns, all of which should be matched to the text of the strings w/o regard to case.

AUpat <- "author (image|graphic|collection)"
PDpat <- "public domain"
REQpat <- "permission required"
LIBpat <- "courtesy|by permission"

# or as a list:
pats <- list(
  'AU' = "author (image|graphic|collection)",
  'PD' = "public domain",
  'REQ'= "permission required",
  'LIB'= "courtesy|by permission"
)

Pseudo code to do what I want (doesn't work):

status <- rep("", length(strings))
for(i in seq_along(strings)) {
  if      (grep(AUpat, strings[i],  ignore.case=TRUE)) status[i] <- "AU"
  else if (grep(PDpat, strings[i],  ignore.case=TRUE)) status[i] <- "PD"
  else if (grep(REQpat, strings[i], ignore.case=TRUE)) status[i] <- "REQ"
  else if (grep(LIBpat, strings[i], ignore.case=TRUE)) status[i] <- "LIB"
}

This is ugly, aside from the fact that it generates an error:

Error in if (grep(AUpat, strings[i])) status[i] <- "AU" else if (grep(PDpat,  : 
  argument is of length zero

Is there some better way to try to do this task, maybe with stringr or other tidyverse tools?

Upvotes: 1

Views: 622

Answers (1)

r2evans
r2evans

Reputation: 160687

It looks from your patterns as if strings can actually belong to zero or more, not just one. If that's true, then a first step would be to determine which categories (0 or more) a string possesses.

I don't have enough of your strings to really test, so I'll generate some simple data and patterns:

strings <- c("something", "something else", "nothing", "nothing here either",
             "something or nothing", "interesting",
             "something interesting", "nothing interesting")

pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")

(The patterns are named as a convenience when the patterns themselves get cumbersome as labels.) Start with generating a logical matrix:

m <- sapply(pats, grepl, strings, ignore.case = TRUE)
m
#         p1    p2    p3
# [1,]  TRUE FALSE FALSE
# [2,]  TRUE FALSE FALSE
# [3,] FALSE  TRUE FALSE
# [4,] FALSE  TRUE FALSE
# [5,]  TRUE  TRUE FALSE
# [6,] FALSE FALSE  TRUE
# [7,]  TRUE FALSE  TRUE
# [8,] FALSE  TRUE  TRUE

If you must apply just one category to a string, then assuming the patterns are in order of priority, you can do:

apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 2

names(pats)[ apply(m, 1, function(a) head(c(which(a), 0), n = 1)) ]
# [1] "p1" "p1" "p2" "p2" "p1" "p3" "p1" "p2"

Note: this will create NAs when strings contain none of the patterns, demonstrating:

m[8,] <- FALSE
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1]  1  1  2  2  1  3  1 NA

You should guard against this in your assumptions. (I'll keep m with this change for now.)

If you need to preserve the categories per-string, then how you do it depends on how you intend to store said classifications. The start for this might be:

str(m2 <- apply(m, 1, function(r) names(which(r))))
# List of 8
#  $ : chr "p1"
#  $ : chr "p1"
#  $ : chr "p2"
#  $ : chr "p2"
#  $ : chr [1:2] "p1" "p2"
#  $ : chr "p3"
#  $ : chr [1:2] "p1" "p3"
#  $ : chr(0) 

This is one direct storage: each element of the list corresponds to strings, and it contains zero or more pattern names. If this is for human-eyes consumption, you might want to convert this into a comma-delimited set:

sapply(m2, paste, collapse = ",")
# [1] "p1"    "p1"    "p2"    "p2"    "p1,p2" "p3"    "p1,p3" ""     

or for more formal database storage, how about

stringids <- seq_len(length(strings)) # perhaps you have something better?
d <- data.frame(
  stringid = rep(stringids, times = lengths(m2)),
  ptnmatch = unlist(m2),
  stringsAsFactors = FALSE
)
d
#   stringid ptnmatch
# 1        1       p1
# 2        2       p1
# 3        3       p2
# 4        4       p2
# 5        5       p1
# 6        5       p2
# 7        6       p3
# 8        7       p1
# 9        7       p3

Notice how string 8 (that I modified to have no categories) is not included, this is by-design here. Once could always force it back in with:

misses <- setdiff(stringids, unique(d$stringid))
misses
# [1] 8

d <- rbind(d, data.frame(stringid = misses, ptnmatch = rep(NA, length(misses))),
           stringsAsFactors = FALSE)
d
#    stringid ptnmatch
# 1         1       p1
# 2         2       p1
# 3         3       p2
# 4         4       p2
# 5         5       p1
# 6         5       p2
# 7         6       p3
# 8         7       p1
# 9         7       p3
# 10        8     <NA>

Upvotes: 2

Related Questions