Reputation: 68
I have strings from which I need to extract every match for a vector of terms, using the rebus package.
s <- "Heart disease Heart diseases include Blood vessel disease, such as coronary artery disease Heart rhythm problems arrhythmias Heart defects you're born with congenital heart defects Heart valve disease Disease of the heart muscle Heart infection"
library(rebus)
library(stringi)
symp.rx <- or1(whole_word(c("Heart", "Heart rythm", "Heart valve")))
stri_extract_all_regex(s, symp.rx)
Running the above code gives me: "Heart" "Heart" "Heart" "Heart" "Heart"
What am I missing? I also need "Heart rhythm", "Heart valve", etc.
Note: the text is actually a column of a large data frame, and the vector of terms behind symp.rx has over 5,000 entries. I need the output as a simplified vector for each row of the data frame, stored in a second column.
Upvotes: 0
Views: 35
Reputation: 389155
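Pass the longer patterns first and then include the shorter ones. Regex alternation tries the alternatives from left to right and stops at the first one that matches, so with "Heart" listed first, "Heart valve" is never tried.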
library(rebus)
symp.rx <- or1(whole_word(c("Heart rythm", "Heart valve", "Heart")))
stringi::stri_extract_all_regex(s, symp.rx)
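#[1] "Heart" "Heart" "Heart" "Heart" "Heart valve" "Heart"
Note that "Heart rhythm" in the text still comes back as just "Heart", because the pattern is spelled "Heart rythm" and so never matches. Correct the spelling in the vector to pick it up.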
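With 5000+ terms, ordering them by hand is not practical, so here is a minimal sketch of how the same idea could be applied to a data frame column. The names `terms`, `df`, `text` and `matches` are made up for illustration, and it assumes the terms contain no regex metacharacters (they would need escaping first).

```r
library(rebus)
library(stringi)

# Hypothetical term vector and data frame standing in for the real data.
terms <- c("Heart", "Heart rhythm", "Heart valve")
df <- data.frame(
  text = c("Heart rhythm problems and Heart valve disease",
           "Heart infection"),
  stringsAsFactors = FALSE
)

# Sort the terms longest first, so the alternation tries
# "Heart rhythm" and "Heart valve" before the shorter "Heart".
terms_sorted <- terms[order(nchar(terms), decreasing = TRUE)]
symp.rx <- or1(whole_word(terms_sorted))

# stri_extract_all_regex is vectorised over the input strings and returns
# a list with one character vector of matches per row; store it as a list column.
# omit_no_match = TRUE gives character(0) instead of NA for rows with no match.
df$matches <- stri_extract_all_regex(df$text, symp.rx, omit_no_match = TRUE)
```

Each element of `df$matches` is then a plain character vector for that row. If a single string per row is preferred, something like `sapply(df$matches, paste, collapse = ", ")` should flatten it.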
Upvotes: 2