Michael

Reputation: 85

R - regular expression, match after second or third occurrence

I have a vector of character strings:
x <- c( "\nFolsom Field, University of Colorado, Boulder, CO (9/3/72)", "\nHollywood Palladium, Hollywood, CA (9/9/72)" )

I want to extract the event location, city, state, and date. I have figured out the event location, city, and date, but cannot correctly match the state. The issue I am having is that I need to match after the second or third comma and before the first parenthesis.

I tried:

stateLoc <- regexpr(",{2,}.+?\\(", x)
state <- regmatches(x, stateLoc)

but that returned an empty character vector.

Any input is appreciated, thank you.

Upvotes: 1

Views: 1616

Answers (2)

Wiktor Stribiżew

Reputation: 626738

You may extract these details using a single str_match call:

library(stringr)
x <- c("\nFolsom Field, University of Colorado, Boulder, CO (9/3/72)","\nHollywood Palladium, Hollywood, CA (9/9/72)")
res <- str_match(x, "\\s*([^,]*),\\s*([A-Z]+)\\s*\\(([0-9/]+)\\)")
res[,2]
# [1] "Boulder"   "Hollywood"
res[,3]
# [1] "CO" "CA"
res[,4]
# [1] "9/3/72" "9/9/72"

See the regex demo online.

Details

  • \\s* - 0+ whitespaces
  • ([^,]*) - Capturing group 1: any 0 or more chars other than a comma
  • , - a comma
  • \\s* - 0+ whitespaces
  • ([A-Z]+) - Capturing group 2: 1 or more uppercase letters
  • \\s* - 0+ whitespaces
  • \\( - a ( char
  • ([0-9/]+) - Capturing group 3: 1 or more digits or slashes
  • \\) - a ) char.
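
If loading stringr is not an option, the same pattern can be applied in base R. This is only a sketch: the use of regexec with perl = TRUE, the rbind step, and the column names are my own choices, not part of the answer above, and it assumes every string matches (a non-matching string yields character(0), which would break the rbind).

```r
x <- c("\nFolsom Field, University of Colorado, Boulder, CO (9/3/72)",
       "\nHollywood Palladium, Hollywood, CA (9/9/72)")

# Same pattern as in the str_match call
pattern <- "\\s*([^,]*),\\s*([A-Z]+)\\s*\\(([0-9/]+)\\)"

# regexec() returns the full match plus capture groups for each string;
# regmatches() turns them into one character vector per string
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Stack into a matrix: one row per string, one column per group
res <- do.call(rbind, m)
colnames(res) <- c("match", "city", "state", "date")  # illustrative names

res[, "city"]
res[, "state"]
res[, "date"]
```

The resulting matrix has the same layout as the str_match result: the first column is the full match and the remaining columns are the capture groups.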

Upvotes: 1

Henry Cyranka

Reputation: 3060

This regex worked for me:

library(stringr)
x <- c(
  "\nFolsom Field, University of Colorado, Boulder, CO (9/3/72)",
  "\nHollywood Palladium, Hollywood, CA (9/9/72)",
  "\nThe Spectrum, Philadelphia, PA (5/1/2010) "
)

## str_trim() just strips the surrounding whitespace that the pattern matches
states <- str_trim(str_extract(x, "\\s[A-Z]{1,2}\\s"))
states
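
Note that the pattern above matches the first one- or two-letter uppercase token surrounded by whitespace, so a venue name containing such a token (say, a hypothetical "The A Club") would be picked up instead of the state. A possible alternative, closer to what the question asks for (the text after the last comma and before the first parenthesis), is sketched below in base R. It assumes the state is always followed directly by "(" and that no earlier parenthesis appears in the string.

```r
x <- c(
  "\nFolsom Field, University of Colorado, Boulder, CO (9/3/72)",
  "\nHollywood Palladium, Hollywood, CA (9/9/72)",
  "\nThe Spectrum, Philadelphia, PA (5/1/2010) "
)

# First run of characters that are neither a comma nor "(" and that is
# immediately followed by "(" (lookahead, so "(" itself is not captured).
# Lookaheads need perl = TRUE.
loc <- regexpr("[^,(]+(?=\\()", x, perl = TRUE)

# The match still carries the spaces around the state; trimws() removes them
states <- trimws(regmatches(x, loc))
states
```

Because the match is tied to the opening parenthesis rather than to the number of commas, it does not matter whether the state comes after the second or the third comma.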

Upvotes: 1
