Reputation: 13792
I'm editing some text and wondering whether I can programmatically search for certain words.
The modifiers almost, nearly, quite, close to, and very do not work next to the words certain, complete, dead, entire, essential, and extinct.
Let's say I have this character vector:
text <- c("R is a very essential tool for data analysis. While it is regarded as domain specific, it is a very complete programming language. Almost certainly, many people who would benefit from using R, do not use it")
Can I get R to return a numeric vector, giving line numbers (or sentence numbers) where these words are placed next to each other?
Note that I've used "certainly", so ideally R should match any word that contains "certain" (or one of the other words), rather than only the whole word.
Upvotes: 1
Views: 2487
Reputation: 109864
Andrie's solution is much better for your needs; however, I'm providing a second solution for future searchers looking to parse transcripts.
library(qdap)
stext <- c("R is a very essential tool for data analysis. While it is regarded
as domain specific, it is a very complete programming language. Almost
certainly, many people who would benefit from using R, do not use it.")
dat <- sentSplit(data.frame(dialogue=stext), "dialogue")
with(dat, termco(dialogue, tot, "certain"))
## tot word.count certain
## 1 1.1 9 0
## 2 2.2 14 0
## 3 3.3 14 1(7.14%)
Note that punctuation matters here: I had to add the missing period at the end of the last sentence.
To get a vector of which sentences contain "certain":
which(with(dat, termco(dialogue, tot, "certain"))$raw$certain > 0)
## [1] 3
Upvotes: 2
Reputation: 179418
Use grep for this, after splitting your text at sentence boundaries using strsplit:
stext <- strsplit(text, split="\\.")[[1]]
grep("certain", stext)
[1] 3
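The grep call above only looks for "certain". To flag a modifier placed directly before one of the absolute words, as the question asks, one option is to build a single pattern from both word lists. This is only a sketch: the names modifiers, absolutes, pattern and hits are made up for illustration, and it assumes that "next to each other" means the modifier comes immediately before the other word, separated only by whitespace.

```r
text <- c("R is a very essential tool for data analysis. While it is regarded as domain specific, it is a very complete programming language. Almost certainly, many people who would benefit from using R, do not use it")

# Split at periods, as in the answer above
stext <- strsplit(text, split = "\\.")[[1]]

# Hypothetical helper vectors holding the two word lists from the question
modifiers <- c("almost", "nearly", "quite", "close to", "very")
absolutes <- c("certain", "complete", "dead", "entire", "essential", "extinct")

# \\b makes the modifier start at a word boundary (so "every" is not read as "very");
# the second group has no closing boundary, so "certainly" still matches "certain"
pattern <- paste0("\\b(", paste(modifiers, collapse = "|"), ")\\s+(",
                  paste(absolutes, collapse = "|"), ")")

# Sentence numbers containing a modifier followed by an absolute word
hits <- grep(pattern, stext, ignore.case = TRUE, perl = TRUE)
hits
## [1] 1 2 3

# The first offending phrase found in each matching sentence
regmatches(stext, regexpr(pattern, stext, ignore.case = TRUE, perl = TRUE))
## [1] "very essential" "very complete"  "Almost certain"
```

With this sample text every sentence is flagged, because each one contains one of the pairs. ignore.case = TRUE is what lets the capitalised "Almost" match.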
Upvotes: 2