volfi

Reputation: 477

How to extract substrings dynamically

From the string

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

I want to extract the text that follows each word enclosed in | symbols.

My approach:

words <- list("tree","house","street","car")

for(word in words){
   expression <- paste0("^.*\\|",word,"\\|\\s*(.+?)\\s*\\|.*$")
   print(sub(expression, "\\1", s))
}

This works fine for every word except the last one, car, for which it returns the entire string s instead. How can I modify the regex so that, for the last element of the words list, it prints dolore magna aliqua.?

Edit: Previously the list of expressions was a, b, c, d. Solutions specific to that case do not generalize well.

Upvotes: 3

Views: 100

Answers (3)

Mike V

Reputation: 1354

You can try this pattern

library(stringr)
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

str_extract_all(s, regex("(?<=\\|)\\w+(?=\\|)"))
#[1] "tree"   "house"  "street" "car" 
  • (?<=\\|): lookbehind; the match must be preceded by |. \\| is an escaped |.
  • \\w+: one or more word characters
  • (?=\\|): lookahead; the match must be followed by |
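
The pattern above returns the labels themselves. If the goal is the text that follows a given label, the same lookbehind idea could be adapted roughly as follows. This is only a sketch: it assumes the words contain no regex metacharacters and that the text segments never contain a | character, and the names patterns and texts are made up for illustration.

```r
library(stringr)

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, \n|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words <- c("tree", "house", "street", "car")

# One pattern per word: a lookbehind for the literal |word|,
# followed by everything up to the next | (or the end of the string).
# Assumption: the words contain no regex metacharacters.
patterns <- paste0("(?<=\\|", words, "\\|)[^|]*")

# The single string s is recycled across the four patterns;
# str_trim() removes the surrounding whitespace and the line break.
texts <- str_trim(str_extract(s, patterns))
names(texts) <- words

print(texts[["car"]])
```

Because [^|]* simply stops at the next | or at the end of the string, the last label needs no special handling.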

Upvotes: 2

Wiktor Stribiżew

Reputation: 626699

I suggest extracting all the words with corresponding values using stringr::str_match_all:

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, 
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words1 <- list("tree","house","street","car")
library(stringr)
expression <- paste0("\\|(", paste(words1, collapse="|"),")\\|\\s*([^|]*)")
result <- str_match_all(s, expression)
lapply(result, function(x) x[,-1])


Output:

[[1]]
     [,1]     [,2]                                            
[1,] "tree"   "Lorem ipsum dolor sit amet, "                  
[2,] "house"  "consectetur adipiscing elit, \n"               
[3,] "street" "sed do eiusmod tempor incididunt ut labore et "
[4,] "car"    "dolore magna aliqua."    

The regex is

\|(tree|house|street|car)\|\s*([^|]*)

Details:

  • \| - a | char
  • (tree|house|street|car) - Group 1: one of the words
  • \| - a | char
  • \s* - 0 or more whitespace chars
  • ([^|]*) - Group 2: any 0 or more chars other than |.
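
For comparison, the same [^|]* idea can be sketched in base R while keeping the per-word lookup from the question. The loop in the question fails for car because its pattern requires a closing | after the captured text; nothing follows the last segment, so sub() finds no match and returns s unchanged. The helper name extract_after is hypothetical, and the sketch assumes the words contain no regex metacharacters.

```r
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, \n|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words <- c("tree", "house", "street", "car")

# Hypothetical helper: return the text following |word|, up to the next |
# or the end of the string, or NA if the label does not occur.
extract_after <- function(word, text) {
  pattern <- paste0("\\|", word, "\\|\\s*([^|]*)")
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) == 0) return(NA_character_)
  trimws(m[2])  # m[1] is the full match, m[2] is the capture group
}

result <- vapply(words, extract_after, character(1), text = s)
print(result[["car"]])
```

Using regexec() instead of sub() also makes a missing label explicit: the helper returns NA rather than silently echoing the whole input.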

Upvotes: 1

daniellga

Reputation: 1224

Try this:

library(stringi)

s <- '|a| Lorem ipsum dolor sit amet, |b| consectetur adipiscing elit, 
|c| sed do eiusmod tempor incididunt ut labore et |d| dolore magna aliqua.'

stri_split_regex(s, '\\|[:alpha:]\\|')

[[1]]
[1] ""                                                " Lorem ipsum dolor sit amet, "                  
[3] " consectetur adipiscing elit, \n"                " sed do eiusmod tempor incididunt ut labore et "
[5] " dolore magna aliqua."     
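
The split above matches single-letter labels only. For the multi-letter labels in the edited question, one possible adaptation is to allow one or more letters between the | symbols and to name the pieces after their labels. This is a sketch under the assumption that every label consists of letters only and that no text segment is a single unbroken word sitting directly between two | symbols; the names parts, labels and values are illustrative.

```r
library(stringi)

s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit, \n|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

# Split on |label|, where a label is one or more letters.
parts <- stri_split_regex(s, "\\|[[:alpha:]]+\\|")[[1]]

# Collect the labels in order of appearance.
labels <- stri_extract_all_regex(s, "(?<=\\|)[[:alpha:]]+(?=\\|)")[[1]]

# The first piece is the empty string before the first label, so drop it,
# then trim whitespace and attach the labels as names.
values <- stri_trim_both(parts[-1])
names(values) <- labels

print(values[["car"]])
```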

Upvotes: 2
