eclairs

Reputation: 1695

extracting only relevant comments from a list of comments

Continuing my exploration of text analysis, I have hit another roadblock. I understand the logic but don't know how to do it in R. I have two CSVs: the first contains 10,000 comments, and the second contains a list of words. I want to select all the comments that contain any of the words in the second CSV. How can I go about it?

example:

**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment

**CSV 2**
sample
set
comment

**Expected output:**
 this is a sample set
 the comments are not real
 this is a random set of words
 feel free to comment

Please note: different forms of a word also count as matches, e.g. comment and comments should both be matched.

Upvotes: 0

Views: 81

Answers (2)

akrun

Reputation: 887108

We can use grep after pasting the elements of the second dataset together with "|", so that the resulting pattern matches any of the words.

v1 <- scan("file2.csv", what ="")
lines1 <- readLines("file1.csv")
grep(paste(v1, collapse="|"), lines1, value=TRUE)
#[1] "this is a sample set"          "the comments are not real" 
#[3] "this is a random set of words" "feel free to comment"   
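A self-contained sketch of the same idea, with inline vectors standing in for the two files (the names `comments` and `keywords` and the extra "reset" line are illustrative, not from the question). Because the pattern is a regular expression matched anywhere in the line, `set` would also match words such as `reset`. The second call shows one possible way to restrict matches to whole words with an optional plural `s`; that rule is an assumption about what "different forms" should cover.

```r
comments <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment",
  "please reset the counter"   # illustrative extra line, not in the question
)
keywords <- c("sample", "set", "comment")

# Loose match: any keyword appearing anywhere in the line, even inside a word
pattern <- paste(keywords, collapse = "|")
loose <- grep(pattern, comments, value = TRUE, ignore.case = TRUE)

# Stricter match (assumed rule): whole word, optionally followed by "s"
strict_pattern <- paste0("\\b(", pattern, ")s?\\b")
strict <- grep(strict_pattern, comments, value = TRUE,
               ignore.case = TRUE, perl = TRUE)
```

Here `loose` also picks up the "reset" line, while `strict` does not. If the keywords can contain regex metacharacters (such as `.` or `+`), they would need escaping before being pasted into a pattern.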

Upvotes: 1

bogdata

Reputation: 88

First create two objects called lines and words.to.match from your files. You could do it like this:

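# header=FALSE so that the first row is read as data, not as column names
lines <- read.csv('csv1.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]
words.to.match <- read.csv('csv2.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]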

Let's say they look like this:

lines <- c(
  'this is a sample set',
  'the comments are not real',
  'this is a random set of words',
  'hope this helps the problem case',
  'thankyou for helping out',
  'i have learned a lot here',
  'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')

You can then compute the matches with two nested *apply-functions:

matches <- mapply(
    function(words, line)
        any(sapply(words, grepl, line, fixed=TRUE)),
    list(words.to.match),
    lines
)
matched.lines <- lines[which(matches)]

What's going on here? mapply applies the function to each element of lines, with words.to.match as the other argument. Note that list(words.to.match) has length 1, so it is recycled across every call. Inside the function, sapply checks each word against the line with grepl (the fixed argument makes grepl treat the word as a literal string rather than a regular expression), and any reports whether at least one word matched.

This is not necessarily the most efficient solution, but it's a bit more intelligible to me. Another way you could compute matches is:

matches <- lapply(words.to.match, grepl, lines, fixed=TRUE)
matches <- do.call("rbind", matches)
matches <- apply(matches, c(2), any)

I dislike this solution because you need to do a do.call("rbind",...), which is a bit hacky.
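One way to avoid the do.call("rbind", ...) step, sketched here as an alternative rather than as part of the original answer, is to combine the per-word logical vectors with an element-wise OR via Reduce. The data is the same sample used above.

```r
lines <- c(
  'this is a sample set',
  'the comments are not real',
  'this is a random set of words',
  'hope this helps the problem case',
  'thankyou for helping out',
  'i have learned a lot here',
  'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')

# One logical vector per word: TRUE where that word occurs in a line
per.word <- lapply(words.to.match, grepl, lines, fixed = TRUE)

# Element-wise OR across the words: TRUE where any word occurs in a line
matches <- Reduce(`|`, per.word)
matched.lines <- lines[matches]
```

This keeps everything as plain vectors, so it also behaves sensibly when lines has only one element.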

Upvotes: 0
