luciano

Reputation: 13792

Find combinations of words using R

I'm editing some text and wondering whether I can programmatically search for certain words.

The words almost, nearly, quite, close to, and very do not work next to the words certain, complete, dead, entire, essential, and extinct.

Let's say I have this character vector:

text <- c("R is a very essential tool for data analysis. While it is regarded as domain specific, it is a very complete programming language. Almost certainly, many people who would benefit from using R, do not use it")

Can I get R to return a numeric vector, giving line numbers (or sentence numbers) where these words are placed next to each other?

Note that I've used "certainly", so ideally R would match any word that contains "certain" (or one of the other words), rather than only the whole word "certain".

Upvotes: 1

Views: 2487

Answers (2)

Tyler Rinker

Reputation: 109864

Andrie's solution is much better for your needs; however, I'm providing a second solution for future searchers looking to parse transcripts.

library(qdap)
stext <- c("R is a very essential tool for data analysis. While it is regarded 
    as domain specific, it is a very complete programming language. Almost 
    certainly, many people who would benefit from using R, do not use it.")

dat <- sentSplit(data.frame(dialogue=stext), "dialogue")
with(dat, termco(dialogue, tot, "certain"))

##   tot word.count  certain
## 1 1.1          9        0
## 2 2.2         14        0
## 3 3.3         14 1(7.14%)

Note that punctuation is important here: I needed to add the missing period at the end of the last sentence.

To get a vector of which sentences contain "certain":

which(with(dat, termco(dialogue, tot, "certain"))$raw$certain > 0)
## [1] 3

Upvotes: 2

Andrie

Reputation: 179418

Use grep for this, after splitting your text at sentence boundaries using strsplit:

stext <- strsplit(text, split="\\.")[[1]]
grep("certain", stext)
## [1] 3
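
The same split-then-grep approach can be stretched to cover the word pairs from the question, not just "certain". The following is only a sketch under a few assumptions: the variable names `modifiers`, `absolutes`, `pattern`, and `hits` are made up for illustration, "next to" is taken to mean the first word immediately followed by whitespace and then the second, and matching is case-insensitive so that "Almost" is caught. Because the second group has no trailing word boundary, "certainly" also counts as a match for "certain".

```r
text <- "R is a very essential tool for data analysis. While it is regarded as domain specific, it is a very complete programming language. Almost certainly, many people who would benefit from using R, do not use it"

# Split into sentences at each period, as in the answer above
stext <- strsplit(text, split = "\\.")[[1]]

# Word lists taken from the question (illustrative variable names)
modifiers <- c("almost", "nearly", "quite", "close to", "very")
absolutes <- c("certain", "complete", "dead", "entire", "essential", "extinct")

# Build: \b(almost|nearly|...)\s+(certain|complete|...)
# The leading \b keeps e.g. "every" from matching "very";
# no trailing \b, so "certainly" still matches "certain".
pattern <- paste0("\\b(", paste(modifiers, collapse = "|"), ")\\s+(",
                  paste(absolutes, collapse = "|"), ")")

# Sentence numbers that contain one of the pairs
hits <- grep(pattern, stext, ignore.case = TRUE, perl = TRUE)
hits

# The offending pair found in each matching sentence (first match only)
pairs <- regmatches(stext, regexpr(pattern, stext, ignore.case = TRUE, perl = TRUE))
pairs
```

For the example text all three sentences are flagged. Splitting on "\\." is a rough notion of a sentence, so abbreviations or decimal numbers in real text would need a more careful splitter.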

Upvotes: 2
