Reputation: 459
I want to extract the text that lies between two delimiters. One delimiter is a carriage return, whereas the other is one of several similar variants:
dput(head(decisions$Title))
c("Zinaida Shumilina et al. v. Belarus \r\n
CCPR/C/120/D/2142/2012",
"K.E.R. vs. Canada \r\n
CCPR/C/120/D/2196/2012",
"Lounis Khelifati v Algeria \r\n
CCPR/C/120/D/2267/2013",
"Hibaq Said Hash v. Denmark \r\n
CCPR/C/120/D/2470/2014",
"Anton Batanov v. Russian Federation \r\n
CCPR/C/120/D/2532/2015",
"S. Z. v. Denmark \r\n
CCPR/C/120/D/2625/2015"
)
I essentially want to extract the country names between "v." and the carriage return \r. However, "v." is sometimes "v", "vs.", "vs" and "v:".
Based on the answer from a related SO question, I tried the following:
res <- str_match(decisions$Title, "(v\\.|vs\\.|v)(.*?)\\r")
res[,3]
Unfortunately, this doesn't get all variations, or in some cases it returns data such as "ruz Tahirovich Nasyrlayev v. Turkmenistan" when trying to extract the country name from "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012".
Is there another way to achieve this?
Upvotes: 2
Views: 94
Reputation: 627488
You may use
trimws(str_match(decisions$Title, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
See the regex demo. Note that trimws will remove redundant leading and trailing whitespace chars.
Pattern details
\b - a word boundary
v - a v char
(?:s?\\.|:)? - an optional group that matches either an optional s followed with a . char, or a : char
\\s* - 0+ whitespace chars
(.*) - Group 1: any 0+ chars other than line break chars (note that you do not have to worry about whether . matches a CR symbol or not (in the TRE regex flavor used in sub, . also matches LF symbols) because trimws will cut the leading/trailing whitespace anyway).
Tested in R:
> df<-c("Zinaida Shumilina et al. v. Belarus \r\n
+ CCPR/C/120/D/2142/2012",
+ "K.E.R. vs. Canada \r\n
+ CCPR/C/120/D/2196/2012",
+ "Lounis Khelifati v Algeria \r\n
+ CCPR/C/120/D/2267/2013",
+ "Hibaq Said Hash v. Denmark \r\n
+ CCPR/C/120/D/2470/2014",
+ "Anton Batanov v. Russian Federation \r\n
+ CCPR/C/120/D/2532/2015",
+ "S. Z. v. Denmark \r\n
+ CCPR/C/120/D/2625/2015"
+ )
> trimws(str_match(df, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
[1] "Belarus" "Canada" "Algeria"
[4] "Denmark" "Russian Federation" "Denmark"
>
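Note that the pattern above does not cover a bare "vs": the group is skipped and the s ends up in the captured text. A minimal sketch of one way to cover all five variants from the question (v, v., vs, vs., v:) in base R is shown here. The function name extract_country and the last two sample titles are made up for illustration, and stopping at "CCPR/" is only an assumption about what titles without a carriage return look like.

```r
# Sample titles: the first five come from the question, the last two are
# hypothetical and only exercise the "vs" and "v:" variants.
titles <- c(
  "Zinaida Shumilina et al. v. Belarus \r\nCCPR/C/120/D/2142/2012",
  "K.E.R. vs. Canada \r\nCCPR/C/120/D/2196/2012",
  "Lounis Khelifati v Algeria \r\nCCPR/C/120/D/2267/2013",
  "Anton Batanov v. Russian Federation \r\nCCPR/C/120/D/2532/2015",
  "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012",
  "X vs Canada \r\n",
  "Y v: Denmark \r\n"
)

# extract_country is a hypothetical helper, not part of any package.
extract_country <- function(x) {
  # \\b        word boundary, so a "v" inside a name is skipped
  # vs?[.:]?   "v" or "vs", optionally followed by "." or ":"
  # \\s+       required whitespace, so words such as "van" are skipped
  # (.+?)      the country name, as short as possible
  # \\s*(?:[\\r\\n]|CCPR/|$)  stop at a line break, a document symbol
  #                           (assumed to start with "CCPR/"), or the end
  pattern <- "\\bvs?[.:]?\\s+(.+?)\\s*(?:[\\r\\n]|CCPR/|$)"
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 of each match is the full match, element 2 is group 1;
  # titles without a match yield NA.
  vapply(
    matches,
    function(g) if (length(g) >= 2) g[2] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

countries <- extract_country(titles)
print(countries)
```

Requiring at least one whitespace character after the "v" token is a design choice: it is stricter than \\s*, but it keeps lowercase words that merely start with "v" from being treated as the separator.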
Upvotes: 6
Reputation: 887901
We can use sub to match characters (.*) until a word boundary (\\b) followed by 'v', then optional 's' and '.' characters, one or more spaces (\\s+), and capture the characters that are not a \r ([^\r]+), followed by the \r and any other characters after it. In the replacement, use the backreference of the second captured group (\\2) and remove the trailing spaces with trimws (here v1 holds the titles, i.e. decisions$Title):
trimws(sub(".*\\bv(s*\\.*)\\s+([^\r]+)\\s*\r.*", "\\2", v1))
#[1] "Belarus" "Canada" "Algeria"
#[4] "Denmark" "Russian Federation" "Denmark"
Upvotes: 4
Reputation: 999
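You can also include a word boundary before "v", so that a "v" inside a name no longer starts the match. Because (\\b) is itself a capture group, the text after "v" is then in the fourth column of the result: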
str_match(decisions$Title, "(\\b)(v\\.|vs\\.|v)(.*?)\\r")
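To see what the word boundary changes, here is a small base R sketch run on the title quoted in the question. The "\r\n" before the document symbol is assumed (the question shows the title without it), first_group is a hypothetical helper, and \\b is written without parentheses here, so the text after "v" is group 2.

```r
# Title from the question; the "\r\n" is an assumption about the raw data.
title <- "Navruz Tahirovich Nasyrlayev v. Turkmenistan \r\nCCPR/C/117/D/2219/2012"

# Hypothetical helper: capture group `group` of the first match in x.
first_group <- function(pattern, x, group) {
  regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][group + 1]
}

# Without \\b the first "v" found is the one inside "Navruz".
without_boundary <- trimws(first_group("(v\\.|vs\\.|v)(.*?)\\r", title, 2))

# With \\b only a "v" that starts a word can open the match.
with_boundary <- trimws(first_group("\\b(v\\.|vs\\.|v)(.*?)\\r", title, 2))

print(without_boundary)  # "ruz Tahirovich Nasyrlayev v. Turkmenistan"
print(with_boundary)     # "Turkmenistan"
```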
Upvotes: 0