Premal

Reputation: 143

How to extract sentences containing a particular word from text in RStudio?

I want to extract the sentences that contain a particular word from text files that contain multiple paragraphs.

For example: Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.

From this paragraph, I need to extract all sentences that contain the word "India".

I tried the substr and substring functions in R, but they did not help. Could someone please help me with this issue?

Thanks in advance.

Upvotes: 1

Views: 6475

Answers (2)

fdetsch

Reputation: 5308

Regular expressions combined with grep (or, for that matter, most pattern-matching functions in R) give you even finer control over which features to extract from a given input string. Base R's regmatches (in combination with gregexpr, which finds every match rather than only the first) or str_extract_all from stringr can accomplish your particular task without requiring you to split the input vector into sentences beforehand.

For example, the extraction of any sentence containing the word 'India' can easily be achieved using the following expression. Note that I added another sentence containing 'India' in a derivative form for illustration purposes.

text = "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi."
# Append a sentence that contains 'India' only as part of 'Indian'
text = paste(text, "Indian summer is a periodically recurring weather phenomenon in Central Europe.")

library(stringr)
# Optional leading words, then 'India', then letters, digits and whitespace up to the next period
str_extract_all(text, "([[:alnum:]]+\\s)*India[[:alnum:]\\s]*\\.")[[1]]

[1] "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity."
[2] "Indian summer is a periodically recurring weather phenomenon in Central Europe."

There are plenty of excellent tutorials about regular expressions on the web, so I'll spare you the details here. To decipher the expression above, Regular Expressions in R might be a good starting point.
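For completeness, here is a minimal sketch of the base-R route mentioned above (regmatches with gregexpr), assuming you would rather not load stringr. The example text is shortened, and the pattern is rewritten with POSIX classes such as [:space:] so that it works with R's default regex engine; treat it as an illustration rather than a general sentence parser, since it only handles sentences made of letters, digits and whitespace.

```r
text <- paste(
  "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically.",
  "It was launched on 1 July 2015 by Prime Minister Narendra Modi.",
  "Indian summer is a periodically recurring weather phenomenon in Central Europe."
)

# Optional leading words, then 'India', then letters, digits and
# whitespace up to the next period.
pattern <- "([[:alnum:]]+[[:space:]])*India[[:alnum:][:space:]]*\\."

# gregexpr() locates every match (regexpr() would stop at the first one);
# regmatches() then pulls the matched substrings out of the input.
sentences <- regmatches(text, gregexpr(pattern, text))[[1]]
print(sentences)
```

As with the stringr version, the sentence starting with 'Indian' is matched too, because the pattern does not require 'India' to be a whole word.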

Upvotes: 1

Hardik Gupta

Reputation: 4790

You can split the text into sentences and then use grep like this:

text <- c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.")
# Split on literal periods; note that the periods themselves are dropped
text <- unlist(strsplit(text, "\\."))

# Keep only the sentences that contain "India" (case-insensitive)
text[grep(pattern = "India", text, ignore.case = TRUE)]

[1] "Digital India is an initiative by the Government of India ...
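A variation on the same idea, as a rough sketch: splitting on the whitespace that follows sentence-ending punctuation keeps the period attached to each sentence, and adding word boundaries restricts the match to the whole word "India". The extra "Indian summer" sentence is added here only to illustrate the difference. This assumes sentences end in ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr." would be split incorrectly.

```r
text <- paste(
  "Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity.",
  "It was launched on 1 July 2015 by Prime Minister Narendra Modi.",
  "Indian summer is a periodically recurring weather phenomenon in Central Europe."
)

# Split on whitespace that follows sentence-ending punctuation;
# the lookbehind keeps the punctuation attached to its sentence.
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# \\b marks a word boundary, so "Indian" does not count as a match.
hits <- sentences[grepl("\\bIndia\\b", sentences, perl = TRUE)]
print(hits)
```

If you do want derived forms such as "Indian" as well, drop the \\b anchors and match on "India" alone.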

Upvotes: 4
