Reputation: 143
I want to extract the sentences that contain a particular word from text files that contain multiple paragraphs.
For example: Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.
From this paragraph, I need to extract all sentences that contain the word "India".
I tried the substr and substring functions in R, but they were not helpful. Could someone please help me with this issue?
Thanks in advance.
Upvotes: 1
Views: 6475
Reputation: 5308
Using regular expressions along with grep (or, for that matter, most likely any pattern matching function in R) provides an even finer control over features to extract from a given input string. That said, base-R regmatches (in combination with regexpr) or str_extract_all from stringr can help accomplish your particular task without being explicitly required to split your input vector beforehand.
For example, the extraction of any sentence containing the word 'India' can easily be achieved using the following expression. Note that I added another sentence containing 'India' in a derivative form for illustration purposes.
text = "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi."
text = paste(text, "Indian summer is a periodically recurring weather phenomenon in Central Europe.")
library(stringr)
str_extract_all(text, "([[:alnum:]]+\\s)*India[[:alnum:]\\s]*\\.")[[1]]
[1] "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity."
[2] "Indian summer is a periodically recurring weather phenomenon in Central Europe."
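For comparison, here is a minimal base-R sketch of the same idea using regmatches. It assumes gregexpr (rather than regexpr) so that every match is returned, not only the first, and it uses a shortened sample text for brevity. The names sample_text, pattern and matches are illustrative only.

```r
# Shortened sample text (an assumption for illustration, not the full paragraph)
sample_text <- paste(
  "Digital India is an initiative by the Government of India.",
  "It was launched on 1 July 2015 by Prime Minister Narendra Modi.",
  "Indian summer is a periodically recurring weather phenomenon in Central Europe."
)

# Optional leading words, then "India", then letters, digits or spaces up to the next full stop
pattern <- "([[:alnum:]]+[[:space:]])*India[[:alnum:][:space:]]*\\."

# gregexpr() locates all matches; regmatches() extracts the matched substrings
matches <- regmatches(sample_text, gregexpr(pattern, sample_text, perl = TRUE))[[1]]
print(matches)
```

As with the stringr version, this pattern also picks up derived forms such as "Indian", and it only handles sentences that end in a full stop and contain no other punctuation.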
There are plenty of excellent tutorials about regular expressions on the web, so I'll spare you the details here. To decipher the above expression, Regular Expressions in R might be a good starting point.
Upvotes: 1
Reputation: 4790
You can use grep like this:
text <- c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.")
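# Split the paragraph into sentences at each full stop (the full stops themselves are dropped)
text <- unlist(strsplit(text, "\\."))
# Keep only the sentences that contain "India", ignoring case
text[grep(pattern = "India", text, ignore.case = TRUE)]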
[1] "Digital India is an initiative by the Government of India ...
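A possible variation, offered as a sketch rather than a definitive approach: splitting on the whitespace that follows sentence-ending punctuation keeps the full stops and avoids leading spaces, and a word-boundary pattern restricts the match to the exact word "India". The names paragraph, sentences and hits are illustrative, and the splitting rule assumes sentences end in ".", "!" or "?" followed by whitespace.

```r
paragraph <- "Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi."

# Split at whitespace preceded by ., ! or ? (a lookbehind, so perl = TRUE is needed)
sentences <- unlist(strsplit(paragraph, "(?<=[.!?])\\s+", perl = TRUE))

# \\b marks word boundaries, so "Indian" would not count as a match
hits <- sentences[grepl("\\bIndia\\b", sentences, perl = TRUE)]
print(hits)
```

This simple rule will mis-split text containing abbreviations such as "Dr." or "e.g.", so a dedicated sentence tokenizer may be preferable for real documents.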
Upvotes: 4