Reputation: 141
I know a lot of people have already posted issues related to mine, but I couldn't find the correct solution.
I have a lot of sentences like: "Therapie: I like the elephants so much Indication"
I want to extract all the words between "Therapie:" and "Indication". In the example above, that would be "I like the elephants so much".
When I use my code, I always get only the next 3 words back. What am I doing wrong?
my_df <- c("Therapie: I like the elephants so much Indication")
These are sentences from the documents, and I need just the words between "Therapie:" and "Indikation:".
Examples:
____________________________________________________________________________ _____ Diagnose: Blepharochalasis Therapie: Oberlidplastik und Fettresektion mediales und nasales Pocket Indikation:
____________________________________________________________________________ _____ Diagnose: Mammahypoplasie Therapie: Dual Plane Augmentation bds. über IMF Schnitt Indikation:
exc <- sub(".*?\\bTherapie\\W+(\\w+(?:\\W+\\w+){0,2}).*", "\\1", my_df, perl = TRUE)
Upvotes: 1
Views: 94
Reputation: 163207
Another option with a match only:
str <- "Therapie: I like the elephants so much Indication"
regmatches(str, regexpr("\\bTherapie:\\h*\\K.*?(?=\\h*\\bIndication\\b)", str, perl=TRUE))
Output
[1] "I like the elephants so much"
The pattern matches:
\bTherapie:
A word boundary to prevent a partial word match, then the word Therapie and a colon
\h*\K
Match optional horizontal whitespace, then forget what has been matched so far
.*?
Match as few characters as possible
(?=\h*\bIndication\b)
Positive lookahead, assert optional horizontal whitespace and the word Indication to the right
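The same idea can be applied to a whole vector of sentences. The following is only a sketch: the docs vector is made up from the question's examples, and Indi[ck]ation is my assumption to cover both the English and the German spelling. Note that regmatches drops elements without a match, so the result is filled into a vector of NA first.

```r
docs <- c(
  "Diagnose: Blepharochalasis Therapie: Oberlidplastik und Fettresektion mediales und nasales Pocket Indikation:",
  "Therapie: I like the elephants so much Indication",
  "Diagnose: no therapy section here"
)

# Same pattern as above, but accepting Indication or Indikation (assumption)
pattern <- "\\bTherapie:\\h*\\K.*?(?=\\h*\\bIndi[ck]ation\\b)"

m <- regexpr(pattern, docs, perl = TRUE)

# regexpr() returns -1 where nothing matched; keep NA for those elements
therapie <- rep(NA_character_, length(docs))
therapie[m != -1] <- regmatches(docs, m)
therapie
# [1] "Oberlidplastik und Fettresektion mediales und nasales Pocket"
# [2] "I like the elephants so much"
# [3] NA
```

This keeps the result the same length as the input, which is usually what you want when adding it as a column.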
Upvotes: 0
Reputation: 173793
You can do
my_df <- c("Therapie: I like the elephants so much Indication")
sub("^Therapie: (.*) Indication$", "\\1", my_df)
#> [1] "I like the elephants so much"
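Keep in mind that the ^ and $ anchors require the string to start with "Therapie:" and end with "Indication", and that sub returns the input unchanged when the pattern does not match. For the longer lines in the question (text before "Therapie:" and "Indikation:" at the end), a looser variant might look like the sketch below. The doc string is copied from the question; the second pattern is my assumption about what the real data needs.

```r
doc <- "Diagnose: Blepharochalasis Therapie: Oberlidplastik und Fettresektion mediales und nasales Pocket Indikation:"

# The anchored pattern does not match here, so sub() returns the input as-is
unchanged <- identical(sub("^Therapie: (.*) Indication$", "\\1", doc), doc)
unchanged
#> [1] TRUE

# Allow any text before "Therapie:" and after "Indikation" (assumed spelling)
res <- sub(".*Therapie:\\s*(.*?)\\s*Indikation.*", "\\1", doc, perl = TRUE)
res
#> [1] "Oberlidplastik und Fettresektion mediales und nasales Pocket"
```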
Upvotes: 2
Reputation: 51894
With str_match. The \\s* on both sides of the capture group trims the surrounding whitespace.
str <- "Therapie: I like the elephants so much Indication"
library(stringr)
str_match(str, "Therapie:\\s*(.*?)\\s*Indication")[, 2]
# [1] "I like the elephants so much"
What about a custom function?
str_between <- function(str, w1, w2) {
  stringr::str_match(str, paste0(w1, "\\s*(.*?)\\s*", w2))[, 2]
}
str_between(str, "Therapie:", "Indication")
# [1] "I like the elephants so much"
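If stringr is not available, something similar can be put together in base R. This is just a sketch with a hypothetical name, str_between_base, using regexec instead of str_match. As in str_between, w1 and w2 are pasted into the regex as they are, so any regex metacharacters in them would need escaping.

```r
# Hypothetical base R counterpart of str_between(); w1 and w2 are treated as regex
str_between_base <- function(str, w1, w2) {
  pattern <- paste0(w1, "\\s*(.*?)\\s*", w2)
  m <- regmatches(str, regexec(pattern, str, perl = TRUE))
  # Each list element holds the full match and the capture group,
  # or is empty when there was no match
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))
}

out <- str_between_base(
  c("Therapie: I like the elephants so much Indication", "no markers here"),
  "Therapie:", "Indication"
)
out
# [1] "I like the elephants so much" NA
```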
Upvotes: 3
Reputation: 3269
Another way using strsplit:
str <- "Therapie: I like the elephants so much Indication"
!strsplit(str, " ")[[1]] %in% c("Therapie:", "Indication") -> x
paste0(strsplit(str, " ")[[1]][x], collapse = ' ')
#"I like the elephants so much"
Upvotes: 1
Reputation: 886938
An option with trimws from base R:
str <- "Therapie: I like the elephants so much Indication"
trimws(str, whitespace = ".*:\\s+|\\s+Indication.*")
[1] "I like the elephants so much"
Upvotes: 0