Reputation: 402
I would like to extract the text between "de" and "en", as well as the whole text of the strings that don't contain "de" or "en". I am not very good with regex, but after reading about lookaheads and lookbehinds I managed to get part of what I want. Now I have to make the lookarounds optional, but whatever I've tried, I can't get it right. Any help would be highly appreciated!
library(stringr)
(sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}', 'extract this one', '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one"))
#> [1] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [2] "extract this one"
#> [3] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [4] "p (340) extract this one"
str_extract_all(string = sstring, pattern = "(?<=.de\":\").*(?=.,\"en\":)")
#> [[1]]
#> [1] "extract this one"
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] "extract this one"
#>
#> [[4]]
#> character(0)
desired output:
#> [1] "extract this one" "extract this one"
#> [3] "extract this one" "p (340) extract this one"
Created on 2020-05-08 by the reprex package (v0.3.0)
Upvotes: 0
Views: 338
Reputation: 4358
In base R:
gsub('.*de\":\"(.*)\",\"en.*',"\\1",sstring)
[1] "extract this one"
[2] "extract this one"
[3] "extract this one"
[4] "p (340) extract this one"
Where:
.* indicates any length of any character
(...) brackets store what's inside, to be recalled later by "\\1"
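As a rough sketch of the same capture-group idea, the variant below swaps the greedy (.*) for [^"]* so the capture stops at the closing quote of the "de" value and does not rely on an "en" key following it. The third input string (a "de" entry with no "en" entry) is a hypothetical case added for illustration and is not part of the question's data.

```r
# Sample input: the first two strings are from the question; the third is a
# hypothetical string with a "de" entry but no "en" entry.
sstring <- c('{"de":"extract this one","en":"some text"}',
             "p (340) extract this one",
             '{"de":"extract this one"}')

# [^"]* matches only up to the closing quote of the "de" value.
# Strings that do not match the pattern are returned unchanged by sub().
result <- sub('^.*"de":"([^"]*)".*$', "\\1", sstring)
print(result)
```

Because sub() leaves non-matching strings untouched, the plain-text element passes through as-is, which is the behaviour the question asks for.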
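Essentially, we're substituting the entire string with only the text we want, as captured by the matching pattern. Strings that do not match the pattern are returned unchanged, which is why the elements without "de" come through as they are.
Upvotes: 1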
Reputation: 626932
I suggest a pattern that will match any string not containing the {"de":" substring, or a substring after {"de":" that contains 1+ chars other than ":
(?<=\{"de":")[^"]+|^(?!.*\{"de":").+
See the regex demo.
Details
(?<=\{"de":") - a positive lookbehind that looks for a position immediately preceded with {"de":"
[^"]+ - then extracts 1+ chars other than "
| - or
^ - at the start of string
(?!.*\{"de":") - make sure there is no {"de":" in the string and
.+ - extract 1+ chars other than line break chars, as many as possible.
See an R demo online:
library(stringr)
sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}', 'extract this one', '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one")
str_extract( sstring, '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+')
# => [1] "extract this one" "extract this one"
# [3] "extract this one" "p (340) extract this one"
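For readers who prefer to avoid the stringr dependency, the same pattern can be tried in base R. This is only a sketch: it assumes perl = TRUE (needed for the lookarounds) and fills in NA for elements with no match, which is a choice made here and not something the answer specifies. The names pattern, matched and result are illustrative.

```r
sstring <- c('{"de":"extract this one","en":"some text"}',
             "extract this one",
             '{"de":"extract this one","en":"some text"}',
             "p (340) extract this one")

pattern <- '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+'

# regexpr() locates the first match in each element; perl = TRUE enables
# lookbehind and lookahead support.
m <- regexpr(pattern, sstring, perl = TRUE)
matched <- as.vector(m) != -1

# regmatches() returns only the elements that matched, so the results are
# written into a full-length vector to keep positions aligned with the input.
result <- rep(NA_character_, length(sstring))
result[matched] <- regmatches(sstring, m)
print(result)
```

Keeping the result the same length as the input mirrors what str_extract() does, which returns NA for elements where nothing matches.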
Upvotes: 3