eyei

Reputation: 402

How to make lookbehind and lookahead optional in R

I would like to extract the text between de and en as well as the text in the strings that don't have de or en. I am not very good with regex but after reading about lookaheads and lookbehinds I managed to get partly what I want. Now I have to make them optional but whatever I've tried, I can't get it right. Any help would be highly appreciated!

library(stringr)
(sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}',     'extract this one',     '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one"))
#> [1] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [2] "extract this one"                                  
#> [3] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [4] "p (340) extract this one"

str_extract_all(string = sstring, pattern = "(?<=.de\":\").*(?=.,\"en\":)")
#> [[1]]
#> [1] "extract this one"
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> [1] "extract this one"
#> 
#> [[4]]
#> character(0)

desired output:

#> [1] "extract this one"         "extract this one"        
#> [3] "extract this one"         "p (340) extract this one"

Created on 2020-05-08 by the reprex package (v0.3.0)

Upvotes: 0

Views: 338

Answers (2)

Daniel O

Reputation: 4358

In base R:

gsub('.*de\":\"(.*)\",\"en.*',"\\1",sstring)


[1] "extract this one"        
[2] "extract this one"        
[3] "extract this one"        
[4] "p (340) extract this one"

Where:

  • .* matches any character, any number of times
  • (...) parentheses capture what's inside so it can later be recalled with "\\1". Essentially, we're replacing the entire matching string with only the text we want; strings that don't match the pattern are returned unchanged.
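To illustrate the two points above, here is a minimal sketch in base R. The names x, pattern and result are only illustrative, and sub() is used in place of gsub() because the pattern consumes the whole string, so at most one replacement is possible either way.

```r
# Two sample inputs: one with the de/en wrapper, one without it.
x <- c('{"de":"extract this one","en":"some text"}',
       "p (340) extract this one")

# Same pattern as in the gsub() call above, written without the redundant \" escapes.
pattern <- '.*de":"(.*)","en.*'

# Only the first string matches the pattern.
print(grepl(pattern, x))
#> [1]  TRUE FALSE

# The matching string is replaced by capture group 1;
# the non-matching string is returned unchanged.
result <- sub(pattern, "\\1", x)
print(result)
#> [1] "extract this one"         "p (340) extract this one"
```

Note that both .* parts are greedy, so this relies on de":" and ","en each appearing only once per string, as in the sample data.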

Upvotes: 1

Wiktor Stribiżew

Reputation: 626932

I suggest a pattern that matches either the 1+ chars other than " that immediately follow {"de":", or, if the string contains no {"de":" substring, the whole string:

(?<=\{"de":")[^"]+|^(?!.*\{"de":").+

See the regex demo.

Details

  • (?<=\{"de":") - a positive lookbehind that looks for a position immediately preceded with {"de":"
  • [^"]+ - then extracts 1+ chars other than "
  • | - or
  • ^ - at the start of string
  • (?!.*\{"de":") - a negative lookahead that makes sure there is no {"de":" anywhere in the string, and
  • .+ - extract 1+ chars other than line break chars as many as possible.

See an R demo online:

library(stringr)
sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}',     'extract this one',     '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one")
str_extract( sstring, '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+')
# => [1] "extract this one"         "extract this one"        
#    [3] "extract this one"         "p (340) extract this one"
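If stringr is not available, the same pattern should also work in base R, since both lookarounds are supported by the PCRE engine when perl = TRUE is set. This is a sketch rather than part of the original demo; the names pattern, m and extracted are illustrative.

```r
sstring <- c('{"de":"extract this one","en":"some text"}',
             'extract this one',
             '{"de":"extract this one","en":"some text"}',
             'p (340) extract this one')

# Same regex as above; lookbehind and lookahead need perl = TRUE in base R.
pattern <- '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+'

# regexpr() finds the first match in each string,
# regmatches() pulls out the matched text.
m <- regexpr(pattern, sstring, perl = TRUE)
extracted <- regmatches(sstring, m)
print(extracted)
#> [1] "extract this one"         "extract this one"
#> [3] "extract this one"         "p (340) extract this one"
```

One difference to keep in mind: regmatches() silently drops elements with no match, whereas str_extract() returns NA for them, so the result can be shorter than the input when some strings fail to match.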

Upvotes: 3
