Cybernetic

Reputation: 13334

Extract the text that follows a specific word or phrase in R

Say I have a string that reads:

"database service crashed due to monkeys in the circuit board and this is a serious problem."

How can I extract, say, the five words that follow the phrase 'due to'?

So I would get this:

monkeys in the circuit board

Upvotes: 0

Views: 270

Answers (3)

G. Grothendieck

Reputation: 269471

It's not clear whether you want a single string as output or one string per word. Assuming you want a single string, and that x is the input string, this call to sub will do it:

s <- sub(".*due to ((\\w+ ){4}\\w+).*", "\\1", x)

giving:

> s
[1] "monkeys in the circuit board"

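Here is the regular expression as the regex engine sees it, with single backslashes:

.*due to ((\w+ ){4}\w+).*

The inner group (\w+ ){4} matches four words, each followed by a space, and the trailing \w+ matches the fifth word. The outer parentheses capture all five words as group 1, which the replacement "\\1" returns.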

If you want separate words then

strsplit(s, " ")[[1]]

giving:

[1] "monkeys" "in"      "the"     "circuit" "board" 
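One thing to watch with sub is that it returns the input unchanged when the pattern does not match. Below is a minimal sketch of a more general version, assuming you want NA in that case. The helper name words_after and its arguments are hypothetical and not part of the answer above. It assumes the phrase contains no regex metacharacters. Unlike the greedy ".*due to" above, regexec picks the first occurrence of the phrase.

```r
# Hypothetical helper, not from the original answer.
# Assumes `phrase` has no regex metacharacters and n >= 2.
words_after <- function(x, phrase, n = 5) {
  # For phrase = "due to", n = 5 this builds: due to ((\w+ ){4}\w+)
  pattern <- sprintf("%s ((\\w+ ){%d}\\w+)", phrase, n - 1L)
  # regexec() returns the full match plus capture groups, one list item per string
  hits <- regmatches(x, regexec(pattern, x))
  # Capture group 1 is the second element; non-matching strings give character(0)
  vapply(hits, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

x <- "database service crashed due to monkeys in the circuit board and this is a serious problem."
words_after(x, "due to")                      # "monkeys in the circuit board"
words_after("no such phrase here", "due to")  # NA
```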

Upvotes: 2

lawyeR

Reputation: 7654

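Here is another approach, using the stringr package. Compared with RStudent's answer, it picks out the five longer words that follow "due to", but it truncates each one to five characters, which looks like an odd stemming result. That happens because the pattern "(\\w){5}" matches five word characters rather than whole words. I suspect that can be solved too. The two lines could of course be combined.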

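library(stringr)  # provides str_split() and str_extract_all()
text <- "database service crashed due to monkeys in the circuit board and this is a serious problem."
text.short <- unlist(str_split(text, "due to"))  # text.short[2] is everything after "due to"
five <- str_extract_all(text.short[2], "(\\w){5}")[[1]]  # every run of five word characters
five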

[1] "monke" "circu" "board" "serio" "probl"
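The odd stemming result can probably be avoided by matching whole words with \\w+ and then keeping the first five. The following is a minimal sketch in base R rather than stringr, so it is a swapped-in technique and not the method used above. The variable names tail_part and words are illustrative only.

```r
text <- "database service crashed due to monkeys in the circuit board and this is a serious problem."

# Everything after the first "due to" (fixed = TRUE treats the phrase literally)
tail_part <- strsplit(text, "due to", fixed = TRUE)[[1]][2]

# Pull out whole words instead of fixed five-character chunks
words <- regmatches(tail_part, gregexpr("\\w+", tail_part))[[1]]

five <- head(words, 5)
five  # "monkeys" "in" "the" "circuit" "board"
```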

Upvotes: 0

DatamineR

Reputation: 9618

What about this tinkered way?

v <- "database service crashed due to monkeys in the circuit board and this is a serious problem."
unlist(strsplit(unlist(strsplit(v, "due to"))[2], " "))[2:6]
[1] "monkeys" "in"      "the"     "circuit" "board"  
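The index 2:6 rather than 1:5 is needed because the piece after "due to" starts with a space, so the first element of the split is an empty string. As a small sketch of an alternative, assuming R 3.2.0 or later for trimws, you can trim first and then index from 1. The names after, pieces and words are illustrative only.

```r
v <- "database service crashed due to monkeys in the circuit board and this is a serious problem."

after <- strsplit(v, "due to", fixed = TRUE)[[1]][2]

# The leading space produces an empty first element
pieces <- strsplit(after, " ", fixed = TRUE)[[1]]
pieces[1]  # ""

# Trimming the leading whitespace lets the words start at index 1
words <- strsplit(trimws(after), " ", fixed = TRUE)[[1]][1:5]
words  # "monkeys" "in" "the" "circuit" "board"
```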

Upvotes: 2
