Reputation: 477

How to extract substrings dynamically

From the string

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

I want to extract the text after the letters within the |-symbols.

My approach:

words <- list("tree","house","street","car")

for(word in words){
   expression <- paste0("^.*\\|",word,"\\|\\s*(.+?)\\s*\\|.*$")
   print(sub(expression, "\\1", s))
}

This works fine for all but the last wortd car. It instead returns the entire string s. How can I modify the regex such that for the last element of words-list in prints out dolore magna aliqua..

\Edit: Previously the list with expressions was a,b,c,d. Solutions to this specific problem cannot be generalized very well.

Upvotes: 3

Answers (3)

Mike V

Reputation: 1364

You can try this pattern

library(stringr)
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

str_extract_all(s, regex("(?<=\\|)\\w+(?=\\|)"))
#[1] "tree"   "house"  "street" "car"

(?<=\\|): Look behind, position following by |; \\|: is an escape for |
\\w: word characters
(?=\\|): Lookahead, position followed by |

Upvotes: 2

Wiktor Stribiżew

Reputation: 627410

I suggest extracting all the words with corresponding values using stringr::str_match_all:

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words1 <- list("tree","house","street","car")
library(stringr)
expression <- paste0("\\|(", paste(words1, collapse="|"),")\\|\\s*([^|]*)")
result <- str_match_all(s, expression)
lapply(result, function(x) x[,-1])

See the R demo

Output:

[[1]]
     [,1]     [,2]                                            
[1,] "tree"   "Lorem ipsum dolor sit amet, "                  
[2,] "house"  "consectetur adipiscing elit, \n"               
[3,] "street" "sed do eiusmod tempor incididunt ut labore et "
[4,] "car"    "dolore magna aliqua."

The regex is

\|(tree|house|street|car)\|\s*([^|]*)

See the regex demo, details:

\| - a | char
(tree|house|street|car) - Group 1: one of the words
\| - a | char
\s* - 0 or more whitespace chars
([^|]*) - Group 2: any 0 or more chars other than |.

Upvotes: 1

daniellga

Reputation: 1224

Try this:

library(stringi)

s <- '|a| Lorem ipsum dolor sit amet, |b| consectetur adipiscing elit, 
|c| sed do eiusmod tempor incididunt ut labore et |d| dolore magna aliqua.'

stri_split_regex(s, '\\|[:alpha:]\\|')

[[1]]
[1] ""                                                " Lorem ipsum dolor sit amet, "                  
[3] " consectetur adipiscing elit, \n"                " sed do eiusmod tempor incididunt ut labore et "
[5] " dolore magna aliqua."

Upvotes: 2

How to extract substrings dynamically

Answers (3)

Related Questions