Reputation: 6778
I'm wondering if there is a more efficient way to accomplish my goal. I'm currently writing a spider algorithm to get news stories every morning and I want to filter the initial links from the front page to ignore stuff I don't care about.
You can generate a reproducible example with the following code:
library(RCurl)
library(XML)
opts = list(
  proxy = "***.***.***.***", #insert your proxy
  proxyusername = "domain\\username",
  proxypassword = "password",
  proxyport = ****) #insert your port number
links <- 'http://www.cnn.com'
xpaths <- c('//ul[@id="us-menu"]//a', '//div[@id="cnn_maint1lftf"]//a', '//div[@id="cnn_maintt2bul"]//a', '//div[@id="cnn_maintoplive"]//a')
response <- getURL('www.cnn.com', .opts=opts)
doc <- htmlParse(response)
for (xpath in xpaths) {
  li <- getNodeSet(doc, xpath)
  links <- c(links, sapply(li, xmlGetAttr, 'href'))
}
links <- links[!duplicated(links)]
links <- links[-1]
Here is the code where I'm looking to improve efficiency:
bad.words <- c('video', 'travel', 'living', 'health', 'ireport', 'bleacher', 'showbiz', 'mcafee')
t.1 <- sapply(links, function(x) sapply(bad.words, function(z) any(length(grep(z, x, ignore.case=T)) > 0)))
t.1 <- unname(t.1)
t.1 <- colSums(t.1)
links <- links[!t.1]
I have to assume there is a cleaner, more efficient way to achieve my goal than this. Any thoughts?
Upvotes: 1
Views: 101
Reputation: 206308
You could use a single regular expression in this case. This only works as-is if your list of bad words contains no "special" regex characters such as periods or other punctuation. If that's the case, you can paste the words together with the "or" operator (|) and do everything in one grepl call.
bad.words <- c('video', 'travel', 'living', 'health',
'ireport', 'bleacher', 'showbiz', 'mcafee')
re <- paste0("\\b(",paste(bad.words, collapse="|"),")\\b")
# ignore.case=TRUE keeps the case-insensitive matching from the question's code
links <- links[ !grepl(re, links, ignore.case=TRUE) ]
We also add the word-boundary match \b on each side to make sure we match the full word. This means it will not match things like "videos", so make sure that's what you want.
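To see how the boundaries and the special-character caveat play out, here is a small sketch. The sample links and the 'mcafee.com' entry are made up for illustration. It also assumes you are fine switching to perl=TRUE, so that each word can be wrapped in \Q...\E and treated as literal text. That is one possible way to handle punctuation, not the only one.

```r
# Hypothetical sample links, invented for this example
links <- c("/video/us/2014/clip.html",
           "/videos/index.html",
           "/2014/05/01/politics/story.html",
           "/2014/05/01/Travel/story.html",
           "/2014/05/01/tech/mcafee.com-outage.html",
           "/2014/05/01/tech/mcafeeXcom.html")

# 'mcafee.com' contains a regex metacharacter (the dot)
bad.words <- c("video", "travel", "mcafee.com")

# Wrap each word in \Q...\E so PCRE reads it literally (requires perl=TRUE)
quoted <- paste0("\\Q", bad.words, "\\E")
re <- paste0("\\b(", paste(quoted, collapse = "|"), ")\\b")

# TRUE for links that contain one of the bad words as a whole word
is.bad <- grepl(re, links, ignore.case = TRUE, perl = TRUE)
kept <- links[!is.bad]
print(kept)
```

With these inputs, '/videos/index.html' is kept because of the trailing \b, and 'mcafeeXcom' is kept because the dot is matched literally. Note that \b treats an underscore as a word character, so something like '/video_clip' would not be filtered either.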
Upvotes: 2