Reputation: 1695
Continuing my exploration of text analysis, I have hit another roadblock. I understand the logic but don't know how to do it in R. Here's what I want to do: I have two CSVs. The first contains 10,000 comments and the second contains a list of words. I want to select all the comments that contain any of the words in the second CSV. How can I go about it?
Example:
**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment
**CSV 2**
sample
set
comment
**Expected output:**
this is a sample set
the comments are not real
this is a random set of words
feel free to comment
Please note: different forms of a word should also be matched, e.g. both comment and comments count as matches for comment.
Upvotes: 0
Views: 81
Reputation: 887108
We can use grep after pasting the elements of the second dataset into a single regular expression.
v1 <- scan("file2.csv", what ="")
lines1 <- readLines("file1.csv")
grep(paste(v1, collapse="|"), lines1, value=TRUE)
#[1] "this is a sample set" "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"
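One caveat: the alternation pattern matches the words anywhere in a line, including inside longer words, so set would also select a comment containing reset. If that is not wanted, a minimal sketch is to anchor each alternative at the start of a word with \\b, which still lets comment match comments. The sample vectors below are made up for illustration, and the sketch assumes the words contain no regex metacharacters.

```r
# Hypothetical sample data; in practice these come from the two files.
lines1 <- c("this is a sample set",
            "please reset the counter",
            "the comments are not real")
v1 <- c("sample", "set", "comment")

# Plain alternation: matches the words anywhere, even inside other words.
loose <- grep(paste(v1, collapse = "|"), lines1, value = TRUE)

# Require a word boundary before each word, so "reset" no longer matches
# "set" while "comments" still matches "comment".
pattern <- paste0("\\b(", paste(v1, collapse = "|"), ")")
strict <- grep(pattern, lines1, value = TRUE, perl = TRUE)

print(loose)
print(strict)
```

Here loose keeps all three lines, while strict drops the reset line.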
Upvotes: 1
Reputation: 88
First create two objects called lines and words.to.match from your files. You could do it like this:
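# header=FALSE: the example files have no header row, so the first line is data
lines <- read.csv('csv1.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]
words.to.match <- read.csv('csv2.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]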
Let's say they look like this:
lines <- c(
'this is a sample set',
'the comments are not real',
'this is a random set of words',
'hope this helps the problem case',
'thankyou for helping out',
'i have learned a lot here',
'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')
You can then compute the matches with two nested *apply functions:
matches <- mapply(
  function(words, line)
    any(sapply(words, grepl, line, fixed=TRUE)),
  list(words.to.match),
  lines
)
matched.lines <- lines[which(matches)]
What's going on here? I use mapply to compute a function over each line in lines, taking words.to.match as the other argument. Note that the length of list(words.to.match) is 1, so this argument is simply recycled across each application. Then, inside the mapply function, I call sapply to check whether any of the words match the line (I check for the match via grepl).
This is not necessarily the most efficient solution, but it's a bit more intelligible to me. Another way you could compute matches is:
matches <- lapply(words.to.match, grepl, lines, fixed=TRUE)
matches <- do.call("rbind", matches)
matches <- apply(matches, 2, any)
I dislike this solution because you need to do a do.call("rbind", ...), which is a bit hacky.
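If the do.call step is the sticking point, one possible variation is to combine the per-word results with Reduce and the elementwise | operator instead of building a matrix. This is only a sketch of the same idea, using the sample data from above rather than the real files:

```r
lines <- c(
  'this is a sample set',
  'the comments are not real',
  'this is a random set of words',
  'hope this helps the problem case',
  'thankyou for helping out',
  'i have learned a lot here',
  'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')

# One logical vector per word: does each line contain that word?
per.word <- lapply(words.to.match, grepl, lines, fixed = TRUE)

# Elementwise OR across the per-word vectors: TRUE where any word matched.
matches <- Reduce(`|`, per.word)

matched.lines <- lines[matches]
print(matched.lines)
```

Because grepl is vectorised over lines, this loops only over the words, which should help when there are many more comments than words.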
Upvotes: 0