mundos

Reputation: 459

Extract strings between only one known string in R

I want to extract the text between two delimiters. One delimiter is a carriage return; the other is one of several near-identical variants of "v.":

dput(head(decisions$Title))
c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
CCPR/C/120/D/2142/2012", 
"K.E.R. vs. Canada                    \r\n                    
CCPR/C/120/D/2196/2012", 
"Lounis Khelifati v Algeria                    \r\n                    
CCPR/C/120/D/2267/2013", 
"Hibaq Said Hash v. Denmark                    \r\n                    
CCPR/C/120/D/2470/2014", 
"Anton Batanov v. Russian Federation                    \r\n                    
CCPR/C/120/D/2532/2015", 
"S. Z. v. Denmark                    \r\n                    
CCPR/C/120/D/2625/2015"
)

I essentially want to extract the country names between "v." and the carriage return \r. However, "v." is sometimes written as "v", "vs.", "vs" or "v:".

Based on the answer from a related SO question, I tried the following:

res <- str_match(decisions$Title, "(v\\.|vs\\.|v)(.*?)\\r")
res[,3]

Unfortunately, this doesn't get all variations, or in some cases it returns data such as "ruz Tahirovich Nasyrlayev v. Turkmenistan" when trying to extract the country name from "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012".

Is there another way to achieve this?

Upvotes: 2

Views: 94

Answers (3)

Wiktor Stribiżew

Reputation: 627488

You may use

trimws(str_match(decisions$Title, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])

See the regex demo. Note that trimws will remove redundant leading and trailing whitespace chars.

Pattern details

  • \b - a word boundary
  • v - a v char
  • (?:s?\\.|:)? - an optional non-capturing group matching either an optional s followed by a literal ., or a : char
  • \\s* - 0+ whitespace chars
  • (.*) - Group 1: any 0+ chars other than line break chars. You do not have to worry about whether . matches a CR symbol (in the TRE regex flavor used in sub, . also matches LF symbols), because trimws will cut the leading/trailing whitespace anyway.

Tested in R:

> df<-c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
+ CCPR/C/120/D/2142/2012", 
+ "K.E.R. vs. Canada                    \r\n                    
+ CCPR/C/120/D/2196/2012", 
+ "Lounis Khelifati v Algeria                    \r\n                    
+ CCPR/C/120/D/2267/2013", 
+ "Hibaq Said Hash v. Denmark                    \r\n                    
+ CCPR/C/120/D/2470/2014", 
+ "Anton Batanov v. Russian Federation                    \r\n                    
+ CCPR/C/120/D/2532/2015", 
+ "S. Z. v. Denmark                    \r\n                    
+ CCPR/C/120/D/2625/2015"
+ )

> trimws(str_match(df, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
[1] "Belarus"            "Canada"             "Algeria"           
[4] "Denmark"            "Russian Federation" "Denmark"           
> 
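
In the pattern above, the optional group requires a "." after the "s", so a bare "vs" (no dot) is only partly consumed. Below is a minimal sketch of one possible variant, written in base R so it runs without stringr. The helper name extract_country and the sample titles are made up for illustration, and the sketch assumes the separator is always followed by whitespace:

```r
# Hypothetical helper: country after "v", "v.", "v:", "vs" or "vs."
extract_country <- function(x) {
  # \bvs?       a "v" or "vs" at the start of a word
  # [.:]?       optional "." or ":"
  # \s+         at least one whitespace char (so "van" is not a separator)
  # ([^\r\n]*)  capture everything up to the first CR or LF
  pattern <- "\\bvs?[.:]?\\s+([^\\r\\n]*)"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 2 of each match is the capture group; NA when nothing matched
  trimws(vapply(m, function(g) g[2], character(1)))
}

# Made-up titles covering the separator variants without a dot
titles <- c(
  "K.E.R. vs Canada                    \r\n CCPR/C/120/D/2196/2012",
  "S. Z. v: Denmark                    \r\n CCPR/C/120/D/2625/2015",
  "Lounis Khelifati v Algeria          \r\n CCPR/C/120/D/2267/2013"
)
extract_country(titles)
```

Requiring \s+ instead of \s* is a design choice: it keeps words that merely begin with "v" from being treated as the separator, at the cost of missing titles where the country follows the separator with no space.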

Upvotes: 6

akrun

Reputation: 887901

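We can use sub: match any characters (.*) up to a word boundary (\\b) followed by 'v', then zero or more 's' and zero or more '.' characters (s*\\.*), then one or more whitespace characters (\\s+), and capture the characters that are not a \r ([^\r]+); the rest of the pattern consumes the \r and everything after it. In the replacement, use the backreference to the second captured group (\\2), and remove the trailing spaces with trimws (v1 here stands for the character vector of titles, e.g. decisions$Title).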

trimws(sub(".*\\bv(s*\\.*)\\s+([^\r]+)\\s*\r.*", "\\2", v1))
#[1] "Belarus"            "Canada"             "Algeria"   
#[4] "Denmark"            "Russian Federation" "Denmark"           
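
As a quick check against the problematic title from the question, the same call can be run on a small vector. This is only a sketch: the carriage return, line feed and padding in the first string are assumed, since the question quotes that title without them.

```r
# Titles modelled on the question; the CR/LF and padding are assumed
v1 <- c(
  "Navruz Tahirovich Nasyrlayev v. Turkmenistan          \r\n CCPR/C/117/D/2219/2012",
  "K.E.R. vs. Canada                    \r\n CCPR/C/120/D/2196/2012"
)

# "\\b" requires a word boundary before "v", so the "v" inside
# "Navruz" is never used as the separator
res <- trimws(sub(".*\\bv(s*\\.*)\\s+([^\r]+)\\s*\r.*", "\\2", v1))
res
```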

Upvotes: 4

Esteban PS

Reputation: 999

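You can also include a word boundary before "v". Because (\\b) is itself a capturing group here, the country ends up in the fourth column of the result: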

str_match(decisions$Title, "(\\b)(v\\.|vs\\.|v)(.*?)\\r")
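
To illustrate what the word boundary changes, here is a small base R sketch comparing the pattern with and without it. The \\b is written without capturing parentheses, so the country stays in the second group; the input string is modelled on the title from the question, with the carriage return and padding assumed.

```r
# Title modelled on the question; the CR/LF and padding are assumed
x <- "Navruz Tahirovich Nasyrlayev v. Turkmenistan          \r\n CCPR/C/117/D/2219/2012"

# Without \b the first "v" (inside "Navruz") starts the match
no_boundary <- regmatches(
  x, regexec("(v\\.|vs\\.|v)(.*?)\\r", x, perl = TRUE)
)[[1]]

# With \b only a "v" at the start of a word qualifies
with_boundary <- regmatches(
  x, regexec("\\b(v\\.|vs\\.|v)(.*?)\\r", x, perl = TRUE)
)[[1]]

# Element 1 is the full match, element 3 is the second capture group
trimws(no_boundary[3])    # "ruz Tahirovich Nasyrlayev v. Turkmenistan"
trimws(with_boundary[3])  # "Turkmenistan"
```

Note that the alternation still does not list "vs" or "v:", so those variants would leave a stray "s" or ":" at the start of the captured text.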

Upvotes: 0
