Reputation: 97
I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression that extracts only the name of the "Town", even if the town has a long name like the one in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
library(stringr)
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,
Upvotes: 1
Views: 392
Reputation: 626929
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to the ++ quantifier, with no backtracking into this pattern, for more efficient matching)
(?=, \\d) - a positive lookahead that requires a , then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"
>
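To see what the lookahead contributes, it may help to compare the pattern with a variant that consumes the ", digit" part instead of only checking for it. This is a minimal sketch in base R; the helper name extract_all and the second pattern are assumptions added for illustration, not part of the answer above.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# Hypothetical helper: return every match of a PCRE pattern in x.
extract_all <- function(pattern, x) {
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Lookahead: ", <digit>" must follow, but is not part of the match.
with_lookahead <- extract_all("\\b\\p{Lu}[^,]++(?=, \\d)", text)
# "Senator" "Town" "A Town with a Long Name"

# Illustrative variant without the lookahead: ", <digit>" is consumed.
without_lookahead <- extract_all("\\b\\p{Lu}[^,]++, \\d", text)
# "Senator, 1" "Town, 2" "A Town with a Long Name, 2"
```

"THIS IS A DOCUMENT,23" is skipped by both patterns because there is no space after its comma.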
Upvotes: 1
Reputation: 97
I have solved the problem thanks to the demo that @Dmitry Egorov posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
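Note that this pattern matches the comma, the space and the first digit along with the name, so a second step is needed if only the name is wanted. A rough sketch in base R, assuming the same text vector as in the question; perl = TRUE and the trailing sub() call are assumptions added here, not something stated in the answer:

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# Step 1: extract the raw matches, which still end in ", <digit>".
raw <- unlist(regmatches(
  text,
  gregexpr("([[:upper:]].+?, [[:digit:]])", text, perl = TRUE)
))
# "Senator, 1" "Town, 2" "A Town with a Long Name, 2"

# Step 2: strip the trailing ", <digit>" to keep only the name.
towns <- sub(", [[:digit:]]$", "", raw)
# "Senator" "Town" "A Town with a Long Name"
```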
Upvotes: 1
Reputation: 9650
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
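As a quick check, here is how this regex behaves on the vector from the question. This is a sketch in base R; the variable name towns_anchored is chosen here for illustration only.

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24",
          "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# regexpr() finds the first match per element; regmatches() drops
# the elements that have no match at all.
towns_anchored <- regmatches(text, regexpr("^[[:upper:]][^,]+", text))
# "Senator" "Town" "A Town with a Long Name" "THIS IS A DOCUMENT"
```

Because the pattern does not look at what follows the comma, "THIS IS A DOCUMENT" is returned too. If that entry should be excluded, a check for ", " followed by a digit would still be needed.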
Upvotes: 2