Reputation: 1
I have a spreadsheet that looks like this. I would like to keep the file column but extract only the sentences that contain the word "India". Is there a way to do that? I would prefer KNIME or R, but I am happy with any solution.
Only the sentences with "India" are extracted, but the file column is kept.
Upvotes: -3
Views: 68
Reputation: 887911
We can use base R with grepl:
subset(df, grepl("India", Text, ignore.case = TRUE))
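To see how this behaves, here is a minimal sketch. The sample rows are made up for illustration, and the column names File and Text are assumed from the question:

```r
# Hypothetical sample data; column names File and Text are assumptions
df <- data.frame(
  File = c(1356, 1600, 1601),
  Text = c("Digital India is an initiative.",
           "Some other text",
           "This string has india without a capital I."),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE for each row whose Text matches the pattern;
# ignore.case = TRUE means "india" is matched as well as "India"
result <- subset(df, grepl("India", Text, ignore.case = TRUE))
result$File
```

Note that this keeps whole rows and also matches longer words such as "Indian". If only the whole word should match, a pattern like "\\bindia\\b" is one option.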
Upvotes: 1
Reputation: 7205
This can be achieved using dplyr and str_detect() from the stringr package. Note that the pattern in the following code captures both "India" and the lowercase "india", in case it occurs:
library(dplyr)
library(stringr)
# Some example data
df <- data.frame(File = c(1356, 1548, 1600, 1601),
                 Text = c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i",
                          "The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti",
                          "Some other text",
                          "This string has india without a capital I."))
# "|" means "or"; avoid spaces around it, since they would become part of the pattern
df <- df %>%
  filter(str_detect(Text, "India|india"))
df
# File Text
# 1 1356 Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i
# 2 1548 The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti
# 3 1601 This string has india without a capital I.
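The filter above keeps whole rows. If the goal is to keep only the individual sentences that mention India, one possible sketch in base R is to split each text into sentences first. This assumes that a sentence ends with ".", "!" or "?" followed by whitespace, which is a rough heuristic (abbreviations will break it). The helper name keep_india_sentences and the sample rows are made up for illustration:

```r
# Hypothetical helper: keep only the sentences of each text that match `pattern`
keep_india_sentences <- function(text, pattern = "india") {
  # Split on whitespace that follows ., ! or ? (rough sentence boundary)
  sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)
  vapply(sentences, function(s) {
    # Keep matching sentences and glue them back into one string
    paste(s[grepl(pattern, s, ignore.case = TRUE)], collapse = " ")
  }, character(1))
}

# Made-up example data with the assumed columns File and Text
df2 <- data.frame(
  File = c(1, 2),
  Text = c("India is large. It rains a lot. I visited India.",
           "Nothing relevant here."),
  stringsAsFactors = FALSE
)

df2$Text <- keep_india_sentences(df2$Text)
# Drop rows where no sentence matched
df2 <- df2[nzchar(df2$Text), ]
df2
```

The File column is untouched, so each remaining row still shows which file its sentences came from.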
Upvotes: 0