JBDonges

Reputation: 316

R: Separate string based on set of regular expressions

I have a data frame foo.df that contains one variable that is just a very long string consisting of several substrings. Additionally, I have vectors of characters that match parts of the string. Example for the variable in the data frame:

foo.df$var[1]
[1] "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"

Now an example for the vectors of characters:

head(candidates)
[1] "Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"
[5] "Charles Butt" "Mitch Esterhazy"

I want to create a variable foo.df$candidate1 that contains the name of the first candidate appearing in the string (i.e. foo.df$candidate1[1] would be Peter Paul Smith). I tried to approach this with grepl, but it doesn't work because grepl only uses the first entry from candidates. Any idea how this could be done efficiently?

Upvotes: 1

Views: 28

Answers (1)

Rui Barradas

Reputation: 76402

You can join the candidate names with the regex OR character, |, using paste, and then extract the first match in each string with regmatches/regexpr.

candidates <- scan(what = character(), text = '
"Peter Paul Smith"  "Hans Nichols" "Denny Gross" "Walter Mittens"')

var1 <- "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"

foo.df <- data.frame(var1)

pat <- paste(candidates, collapse = "|")
regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
#[1] "Peter Paul Smith"

foo.df$candidate1 <- regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
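One caveat with the assignment above: regmatches drops elements that have no match, so if any row of foo.df contains none of the candidates, the result is shorter than the data frame and the assignment fails. A minimal sketch of one way around this, assuming a second, hypothetical row with no candidate in it (the example data here is made up for illustration):

```r
# Hypothetical example data: the second row contains none of the candidates.
candidates <- c("Peter Paul Smith", "Hans Nichols")
foo.df <- data.frame(
  var1 = c("Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%",
           "Turnout294834.3%"),
  stringsAsFactors = FALSE
)

pat <- paste(candidates, collapse = "|")
m <- regexpr(pat, foo.df$var1)   # start position of first match, -1 if none

# Start from NA and fill in only the rows that actually matched.
foo.df$candidate1 <- NA_character_
foo.df$candidate1[m != -1] <- regmatches(foo.df$var1, m)
```

If later candidates in the same string are needed as well, swapping regexpr for gregexpr makes regmatches return a list with all matches per row.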

Upvotes: 1
