Reputation: 3992
I have a vector of text strings that describe the sources for images in a book, but not in an entirely consistent format--- they are captured from the LaTeX source & written to a text file.
I need to process these and collect information about the source attribution and permission status. A sample of strings looks like this:
strings <- '
GBD 2016 Alcohol Collaborators (2018)
\citet {DeBoldFriedman:2015}, permission required
Author graphic, using various public domain images
\url {http://www.histogames.com/HTML/inventaire/periodes-historiques/prehistoire.php}
\url {https://commons.wikimedia.org/wiki/File:Egyptmotionseries.jpg}
\url {http://commons.wikimedia.org}, public domain.
\citet {Plot:1685}, author image collection
Author image collection
From \citet {Priestley:1765}, author image collection
Author image collection
Courtesy Library of Congress
\citet {Langren:1644}, image courtesy of the Koninklijke Bibliotheek van Belgi\"e
Public domain
Author graphic
Author graphic
Author graphic
Author image collection
Wikipedia, https://commons.wikimedia.org/wiki/File:Tablas\_alfonsies.jpg, public domain
'
One main task is to create a variable status
indicating permission status by finding
matches to simple regular expressions in the strings. The categories can be defined by
the following patterns, all of which should be matched to the text of the strings
w/o regard to case.
AUpat <- "author (image|graphic|collection)"
PDpat <- "public domain"
REQpat <- "permission required"
LIBpat <- "courtesy|by permission"
# or as a list:
pats <- list(
'AU' = "author (image|graphic|collection)",
'PD' = "public domain",
'REQ'= "permission required",
'LIB'= "courtesy|by permission"
)
Pseudo code to do what I want (doesn't work):
status <- rep("", length(strings))
for(i in seq_along(strings)) {
if (grep(AUpat, strings[i], ignore.case=TRUE)) status[i] <- "AU"
else if (grep(PDpat, strings[i], ignore.case=TRUE)) status[i] <- "PD"
else if (grep(REQpat, strings[i], ignore.case=TRUE)) status[i] <- "REQ"
else if (grep(LIBpat, strings[i], ignore.case=TRUE)) status[i] <- "LIB"
}
This is ugly, aside from the fact that it generates an error:
Error in if (grep(AUpat, strings[i])) status[i] <- "AU" else if (grep(PDpat, :
argument is of length zero
Is there some better way to try to do this task, maybe with stringr
or other tidyverse
tools?
Upvotes: 1
Views: 622
Reputation: 160687
It looks from your patterns as if strings can actually belong to zero or more, not just one. If that's true, then a first step would be to determine which categories (0 or more) a string possesses.
I don't have enough of your strings
to really test, so I'll generate some simple data and patterns:
strings <- c("something", "something else", "nothing", "nothing here either",
"something or nothing", "interesting",
"something interesting", "nothing interesting")
pats <- c(p1 = "something", p2 = "nothing", p3 = "interesting")
(The patterns are named as a convenience when the patterns themselves get cumbersome as labels.) Start with generating a logical
matrix:
m <- sapply(pats, grepl, strings, ignore.case = TRUE)
m
# p1 p2 p3
# [1,] TRUE FALSE FALSE
# [2,] TRUE FALSE FALSE
# [3,] FALSE TRUE FALSE
# [4,] FALSE TRUE FALSE
# [5,] TRUE TRUE FALSE
# [6,] FALSE FALSE TRUE
# [7,] TRUE FALSE TRUE
# [8,] FALSE TRUE TRUE
If you must apply just one category to a string, then assuming the patterns are in order of priority, you can do:
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 2
names(pats)[ apply(m, 1, function(a) head(c(which(a), 0), n = 1)) ]
# [1] "p1" "p1" "p2" "p2" "p1" "p3" "p1" "p2"
Note: this will create NA
s when strings contain none of the patterns, demonstrating:
m[8,] <- FALSE
apply(m, 1, function(a) head(c(which(a), NA), n = 1))
# [1] 1 1 2 2 1 3 1 NA
You should guard against this in your assumptions. (I'll keep m
with this change for now.)
If you need to preserve the categories per-string, then how you do it depends on how you intend to store said classifications. The start for this might be:
str(m2 <- apply(m, 1, function(r) names(which(r))))
# List of 8
# $ : chr "p1"
# $ : chr "p1"
# $ : chr "p2"
# $ : chr "p2"
# $ : chr [1:2] "p1" "p2"
# $ : chr "p3"
# $ : chr [1:2] "p1" "p3"
# $ : chr(0)
This is one direct storage: each element of the list
corresponds to strings
, and it contains zero or more pattern names. If this is for human-eyes consumption, you might want to convert this into a comma-delimited set:
sapply(m2, paste, collapse = ",")
# [1] "p1" "p1" "p2" "p2" "p1,p2" "p3" "p1,p3" ""
or for more formal database storage, how about
stringids <- seq_len(length(strings)) # perhaps you have something better?
d <- data.frame(
stringid = rep(stringids, times = lengths(m2)),
ptnmatch = unlist(m2),
stringsAsFactors = FALSE
)
d
# stringid ptnmatch
# 1 1 p1
# 2 2 p1
# 3 3 p2
# 4 4 p2
# 5 5 p1
# 6 5 p2
# 7 6 p3
# 8 7 p1
# 9 7 p3
Notice how string 8 (that I modified to have no categories) is not included, this is by-design here. Once could always force it back in with:
misses <- setdiff(stringids, unique(d$stringid))
misses
# [1] 8
d <- rbind(d, data.frame(stringid = misses, ptnmatch = rep(NA, length(misses))),
stringsAsFactors = FALSE)
d
# stringid ptnmatch
# 1 1 p1
# 2 2 p1
# 3 3 p2
# 4 4 p2
# 5 5 p1
# 6 5 p2
# 7 6 p3
# 8 7 p1
# 9 7 p3
# 10 8 <NA>
Upvotes: 2