Reputation: 477
From the string
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit,
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
I want to extract the text that follows each word enclosed in | symbols.
My approach:
words <- list("tree","house","street","car")
for (word in words) {
  expression <- paste0("^.*\\|", word, "\\|\\s*(.+?)\\s*\\|.*$")
  print(sub(expression, "\\1", s))
}
This works for every word except the last one, car. For car it returns the entire string s instead.
How can I modify the regex so that, for the last element of the words list, it prints dolore magna aliqua.?
Edit: Previously the list of words was a, b, c, d. Solutions to that specific problem do not generalize well.
Upvotes: 3
Views: 100
Reputation: 1354
You can try this pattern
library(stringr)
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit,
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
str_extract_all(s, regex("(?<=\\|)\\w+(?=\\|)"))
#[1] "tree" "house" "street" "car"
(?<=\\|): lookbehind, a position preceded by |; \\| is the escaped |
\\w+: one or more word characters
(?=\\|): lookahead, a position followed by |
Upvotes: 2
Reputation: 626699
I suggest extracting all the words with their corresponding values using stringr::str_match_all:
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit,
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words1 <- list("tree","house","street","car")
library(stringr)
expression <- paste0("\\|(", paste(words1, collapse="|"),")\\|\\s*([^|]*)")
result <- str_match_all(s, expression)
lapply(result, function(x) x[,-1])
See the R demo
Output:
[[1]]
     [,1]     [,2]
[1,] "tree"   "Lorem ipsum dolor sit amet, "
[2,] "house"  "consectetur adipiscing elit, \n"
[3,] "street" "sed do eiusmod tempor incididunt ut labore et "
[4,] "car"    "dolore magna aliqua."
The regex is
\|(tree|house|street|car)\|\s*([^|]*)
See the regex demo, details:
\| - a | char
(tree|house|street|car) - Group 1: one of the words
\| - a | char
\s* - 0 or more whitespace chars
([^|]*) - Group 2: any 0 or more chars other than |
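Unlike the pattern in the question, this one does not require a closing | after the captured text, which is why the last word matches too. As a minimal sketch of the same idea in base R, kept close to the per-word loop from the question: extract_after is a hypothetical helper name, and the words are assumed to contain no regex metacharacters.

```r
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit,
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."
words <- c("tree", "house", "street", "car")

# Hypothetical helper: return the text that follows |word| in `text`
extract_after <- function(word, text) {
  # [^|]* stops at the next | or at the end of the string,
  # so no closing | is needed after the last word
  pattern <- paste0("\\|", word, "\\|\\s*([^|]*)")
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) == 0) return(NA_character_)  # word not found
  trimws(m[2])                               # m[1] is the full match, m[2] is group 1
}

# Named character vector: one entry per word
res <- vapply(words, extract_after, character(1), text = s)
res[["car"]]
```

Here res[["car"]] should give "dolore magna aliqua.", and a word that does not occur in s yields NA.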
Upvotes: 1
Reputation: 1224
Try this:
library(stringi)
s <- '|a| Lorem ipsum dolor sit amet, |b| consectetur adipiscing elit,
|c| sed do eiusmod tempor incididunt ut labore et |d| dolore magna aliqua.'
stri_split_regex(s, '\\|[:alpha:]\\|')
[[1]]
[1] "" " Lorem ipsum dolor sit amet, "
[3] " consectetur adipiscing elit, \n" " sed do eiusmod tempor incididunt ut labore et "
[5] " dolore magna aliqua."
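The split above targets the single-letter labels a to d. A rough sketch of the same splitting idea for multi-letter labels such as tree, written in base R and assuming the labels consist of letters only; the variable names sep, labels, pieces and values are illustrative.

```r
s <- "|tree| Lorem ipsum dolor sit amet, |house| consectetur adipiscing elit,
|street| sed do eiusmod tempor incididunt ut labore et |car| dolore magna aliqua."

# A label is one or more letters between two | characters
sep <- "\\|[[:alpha:]]+\\|"

# Collect the labels themselves and strip the surrounding pipes
labels <- gsub("|", "", regmatches(s, gregexpr(sep, s))[[1]], fixed = TRUE)

# Split on the labels; the first piece is the empty string before the first label
pieces <- strsplit(s, sep)[[1]][-1]

# Pair each label with its trimmed text
values <- setNames(trimws(pieces), labels)
values[["car"]]
```

With the string above, values[["car"]] should be "dolore magna aliqua.", and trimws also removes the newline left at the end of the house entry.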
Upvotes: 2