Problems in a regular expression to extract names using stringr

I cannot work out why my regular expression does not extract the information I want. I have a character vector that looks like this:

   text <- c("Senator, 1.4balbal", "rule 46.1, declares",
             "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

I would like to create a regular expression that extracts only the name of the "Town", even if the town has a long name like the one in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:

   reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
   towns <- unlist(str_extract_all(text, reg.town))

but this extracts everything around the ",".

Thanks in advance,

Upvotes: 1

Views: 392

Answers (3)

Wiktor Stribiżew

Reputation: 626929

You may use the following regex:

> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator"                 "Town"                   
[3] "A Town with a Long Name"

The regex matches:

  • \\b - a leading word boundary
  • \\p{Lu} - an uppercase letter
  • [^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
  • (?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
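
To see what the lookahead contributes, here is a minimal sketch that compares it with a variant that consumes the comma and digit instead. The two-element vector and the variable names `consumed` and `looked_ahead` are only illustrative and are not part of the original answer.

```r
library(stringr)

# Illustrative subset of the question's data
text <- c("Town, 24", "A Town with a Long Name, 23")

# Without a lookahead, the ", <digit>" part becomes part of the match
consumed <- str_extract(text, "\\b\\p{Lu}[^,]++, \\d")

# With the lookahead, ", <digit>" is only checked, not returned
looked_ahead <- str_extract(text, "\\b\\p{Lu}[^,]++(?=, \\d)")

print(consumed)
print(looked_ahead)
```

The first call should return the names with a trailing ", 2", while the second should return the names alone.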

Note you may get the same results with base R using the same regex with a PCRE option enabled:

> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator"                 "Town"                   
[3] "A Town with a Long Name"

Upvotes: 1

I have solved the problem thanks to the demo that @Dmitry Egorov posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
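
As written, that pattern also matches the comma, the space and the first digit. A possible way to keep only the name is to move the closing parenthesis so the capture group covers just the name and then read the group with `str_match`. This is a sketch under that assumption, not necessarily what was used in the end; the variable names `m` and `towns` are illustrative.

```r
library(stringr)

text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# Column 1 of the result is the full match, column 2 is the capture group.
# Rows without a match are NA.
m <- str_match(text, "([[:upper:]].+?), [[:digit:]]")

# Keep only the captured names from the rows that matched
towns <- m[!is.na(m[, 2]), 2]
print(towns)
```

Note that "Senator" is still returned, because "Senator, 1.4balbal" has the same shape as the town entries.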

Thanks for your quick replies!!

Upvotes: 1

Dmitry Egorov

Reputation: 9650

It looks like a town name starts with a capital letter ([[:upper:]]), runs up to the first comma (or to the end of the text if there is no comma) ([^,]+), and should be at the start of the input text (^). The corresponding regex in this case would be:

^[[:upper:]][^,]+

Demo: https://regex101.com/r/QXYtyv/1
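
For reference, a short sketch of how this pattern could be applied to the vector from the question with `str_extract`. The variable name `starts` is illustrative.

```r
library(stringr)

text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# One result per element; NA where the text does not start
# with an uppercase letter
starts <- str_extract(text, "^[[:upper:]][^,]+")
print(starts)
```

Because the pattern does not look at what follows the comma, it also returns "Senator" and "THIS IS A DOCUMENT"; an extra condition on the trailing number would be needed to exclude them.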

Upvotes: 2
