Reputation: 3842
How do I match the year in a way that works for both of the following examples?
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I tried the following, without much success:
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
My understanding was that it skips ahead until it finds a (, then captures a group of digits followed by any characters until it meets a ). If there are several matches, I want to extract the first one.
Any suggestions as to where I went wrong? I am doing this in R.
Upvotes: 3
Views: 97
Reputation: 626738
Your pattern contains .+ parts that match one or more characters, as many as possible, so at best it could grab the last parenthesized digit chunk in each input string rather than the first.
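To illustrate the greedy behaviour described above, here is a minimal sketch. It uses simplified patterns (not the asker's exact one) and perl = TRUE, so the only difference between the two calls is the greedy .* versus the lazy .*? at the start.

```r
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'

# Greedy .* runs to the end of the string and backtracks, so the capture
# lands on the LAST "(" that is followed by digits.
greedy <- sub("^.*\\((\\d+).*", "\\1", a, perl = TRUE)

# Lazy .*? expands one character at a time, so the capture lands on the
# FIRST "(" that is followed by digits.
lazy <- sub("^.*?\\((\\d+).*", "\\1", a, perl = TRUE)

print(greedy)  # "399"
print(lazy)    # "1953"
```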
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \1 to keep only the 4-digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars, as few as possible
\( - a (
(\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\) - a ) (OPTIONAL, MAY BE OMITTED)
.* - the rest of the string.
See the R demo:
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
The \(\K\d{4} pattern matches ( and then drops it from the match with the \K match reset operator. The (?=(?:/[^)]*)?\)) lookahead then ensures that the digits are followed by an optional / plus 0+ chars other than ), and then a ). Note that regexpr extracts the first match only.
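One thing to keep in mind with the regmatches/regexpr approach: regmatches returns only the matched pieces, so elements without a year are dropped and the result can be shorter than the input. A minimal sketch of a wrapper that keeps the result aligned with the input is shown here; the name extract_year and the sample strings are hypothetical and not part of the original answer.

```r
# Hypothetical helper: returns one value per input element,
# NA where no "(YYYY)" or "(YYYY/...)" is found.
extract_year <- function(x) {
  m <- regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1      # regexpr gives -1 for "no match"
  out[hit] <- regmatches(x, m)    # fill in only the matching positions
  out
}

x <- c("A Title (1953)", "no year here", "Another Title (1998/I) (TV)")
years <- extract_year(x)
print(years)  # "1953" NA "1998"
```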
Upvotes: 2
Reputation: 43169
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\(               # (
(\d+             # capture 1+ digits
  (?: B\.C\.)?   # optionally followed by " B.C."
)                # end of the capture group
Note that backslashes need to be escaped in R.
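If stringr is not available, roughly the same extraction can be sketched in base R with regexec and regmatches. The sample strings below are shortened, hypothetical variants of the ones in the question, chosen so that the optional B.C. part of the pattern is actually exercised.

```r
strings <- c("You Are There (1953)", "The Death of Socrates (399 B.C.)")

# regexec returns the full match plus capture groups for each element;
# perl = TRUE is used so the (?: ... ) group is handled by PCRE.
m <- regmatches(strings,
                regexec("\\((\\d+(?: B\\.C\\.)?)", strings, perl = TRUE))

# Element 1 of each result is the full match, element 2 is capture group 1.
years <- vapply(m, function(g) g[2], character(1))
print(years)  # "1953" "399 B.C."
```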
Upvotes: 3