Selina1

Reputation: 141

gsub: How to extract words between two words

I know a lot of people have already posted questions related to mine, but I couldn't find the correct solution.

I have a lot of sentences like: "Therapie: I like the elephants so much Indication"

I want to extract all the words between "Therapie:" and "Indication". For the example above, the result would be "I like the elephants so much".

When I use my code, I always get only the next three words back. What am I doing wrong?

my_df <- c("Therapie: I like the elephants so much Indication")

These are sentences taken from the documents, and I need all the words between "Therapie:" and "Indikation:".

Examples: 
 ____________________________________________________________________________ _____    Diagnose:   Blepharochalasis    Therapie:   Oberlidplastik und Fettresektion mediales und nasales Pocket   Indikation: 

  ____________________________________________________________________________ _____    Diagnose:   Mammahypoplasie    Therapie:   Dual Plane Augmentation bds. über IMF Schnitt  Indikation: 



exc <- sub(".*?\\bTherapie\\W+(\\w+(?:\\W+\\w+){0,2}).*", "\\1", my_df, perl = TRUE)

Upvotes: 1

Views: 94

Answers (5)

The fourth bird

Reputation: 163207

Another option with a match only:

str <- "Therapie: I like the elephants so much Indication"
regmatches(str, regexpr("\\bTherapie:\\h*\\K.*?(?=\\h*\\bIndication\\b)", str, perl=TRUE))

Output

[1] "I like the elephants so much"

The pattern matches:

  • \bTherapie: A word boundary to prevent matching a partial word, match the word Therapie and :
  • \h*\K Match optional horizontal whitespace, then clear what is matched so far
  • .*? Match as few characters as possible
  • (?=\h*\bIndication\b) Positive lookahead, assert optional spaces and the word Indication to the right

See an R demo.
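The same pattern can be sketched for input shaped like the question's real documents, which end the section with "Indikation:" rather than "Indication". This is only a minimal sketch under stated assumptions: `docs`, `pattern`, and `therapie` are names made up here, the second element is a hypothetical document without a "Therapie:" section, each document is assumed to sit on a single line (`.` does not cross newlines by default), and `\s` is used in place of `\h`.

```r
# Hypothetical input: the first element follows the question's example,
# the second is an assumed document that has no "Therapie:" section.
docs <- c(
  "Diagnose:   Blepharochalasis    Therapie:   Oberlidplastik und Fettresektion mediales und nasales Pocket   Indikation: ",
  "Diagnose:   Mammahypoplasie"
)

# Same structure as the pattern above, with the German end marker "Indikation:"
pattern <- "\\bTherapie:\\s*\\K.*?(?=\\s*\\bIndikation:)"

m <- regexpr(pattern, docs, perl = TRUE)

# regmatches() drops elements without a match, so fill a vector of the
# same length as the input and leave NA where nothing matched.
therapie <- rep(NA_character_, length(docs))
therapie[m != -1] <- regmatches(docs, m)
therapie
```

Keeping the result the same length as the input makes it easy to store it as a new column next to the original documents.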

Upvotes: 0

Allan Cameron

Reputation: 173793

You can do

my_df <- c("Therapie: I like the elephants so much Indication")
sub("^Therapie: (.*) Indication$", "\\1", my_df)
#> [1] "I like the elephants so much"
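One caveat: the ^ and $ anchors assume the string starts with "Therapie:" and ends with "Indication", and sub() returns its input unchanged when the pattern does not match. A possible variant, offered only as a sketch, relaxes the anchors and returns NA for non-matching strings. The names `x`, `pattern`, and `out`, and the second input element, are assumptions for illustration.

```r
# Hypothetical input: the second element has no "Therapie:" section
x <- c("Therapie: I like the elephants so much Indication",
       "Diagnose: none")

# Allow arbitrary text before "Therapie:" and after "Indication"
pattern <- "^.*?Therapie:\\s*(.*?)\\s*Indication.*$"

# sub() leaves non-matching strings untouched, so check with grepl() first
out <- ifelse(grepl(pattern, x, perl = TRUE),
              sub(pattern, "\\1", x, perl = TRUE),
              NA_character_)
out
```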

Upvotes: 2

Maël

Reputation: 51894

With str_match from stringr. The \\s* on either side of the capture group trims the surrounding whitespace.

str <- "Therapie: I like the elephants so much Indication"

library(stringr)
str_match(str, "Therapie:\\s*(.*?)\\s*Indication")[, 2]
# [1] "I like the elephants so much"

What about a custom function?

str_between <- function(str, w1, w2){
  stringr::str_match(str, paste0(w1, "\\s*(.*?)\\s*", w2))[, 2]
}

str_between(str, "Therapie:", "Indication")
# [1] "I like the elephants so much"
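Note that str_between pastes w1 and w2 straight into the regular expression, so markers containing characters such as ( or [ would be interpreted as regex syntax. A base R sketch that treats both markers as literal text could look like the following. `between_literal` is a hypothetical helper, the second input string is made up for illustration, and the sketch assumes the markers do not themselves contain \E.

```r
# Hypothetical helper: like str_between() above, but w1 and w2 are quoted
# with \Q...\E (PCRE literal quoting) and only base R is used.
between_literal <- function(x, w1, w2) {
  pattern <- paste0("\\Q", w1, "\\E\\s*(.*?)\\s*\\Q", w2, "\\E")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each list element is c(full match, group 1), or character(0) if no match
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

res1 <- between_literal("Therapie: I like the elephants so much Indication",
                        "Therapie:", "Indication")
# Made-up string whose markers contain regex metacharacters
res2 <- between_literal("Dose (mg): 20 twice daily [end]", "(mg):", "[end]")
res1
res2
```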

Upvotes: 3

AlexB

Reputation: 3269

Another way using strsplit:

str <- "Therapie: I like the elephants so much Indication"

!strsplit(str, " ")[[1]] %in% c("Therapie:", "Indication") -> x
paste0(strsplit(str, " ")[[1]][x], collapse = ' ')
#"I like the elephants so much"

Upvotes: 1

akrun

Reputation: 886938

An option with trimws from base R

trimws(str, whitespace = ".*:\\s+|\\s+Indication.*")
[1] "I like the elephants so much"

data

str <- "Therapie: I like the elephants so much Indication"

Upvotes: 0
