Reputation:
I am scraping newspaper articles and am struggling to figure out how to exclude more than one node. The R help says that :not() accepts a sequence of simple selectors. I tried the following:
library(rvest)
zeit_url <- read_html("http://www.zeit.de/wissen/gesundheit/2017-09/aids-hiv-neuinfektionen-europa-virus-gesundheit")
article <- zeit_url %>%
html_nodes('.article-page>:not(.ad-container, .cardstack)') %>%
html_text()
Separating the two selectors with a comma does not work. Any suggestions on how to correctly specify the sequence of selectors in :not()?
I have spent a lot of time searching for an answer, but I am new to R (and HTML), so I appreciate your patience if this is something obvious.
Upvotes: 1
Views: 686
Reputation: 321
library(rvest)
zeit_url <- read_html("http://www.zeit.de/wissen/gesundheit/2017-09/aids-hiv-neuinfektionen-europa-virus-gesundheit")
article <- zeit_url %>%
html_nodes(".article-page>:not(.ad-container):not(.cardstack)") %>%
html_text()
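The idea is to chain one :not() per class instead of putting a comma-separated list inside a single :not(). As far as I know, the CSS engine behind rvest follows the older selector syntax, where each :not() takes only one simple selector, which is probably why the comma version fails. Here is a minimal offline sketch of the same pattern. The HTML snippet is made up for illustration and is not the real markup of the article page; only the class names are taken from the question.

```r
library(rvest)

# Hypothetical stand-in for the article page (not the real zeit.de markup)
html <- '<div class="article-page">
  <p>First paragraph.</p>
  <div class="ad-container">Advertisement</div>
  <p>Second paragraph.</p>
  <div class="cardstack">Card stack</div>
</div>'

page <- read_html(html)

# Keep direct children of .article-page that carry neither class:
# each :not() excludes one class, and chaining them combines the exclusions
kept <- page %>%
  html_nodes(".article-page > :not(.ad-container):not(.cardstack)") %>%
  html_text()

print(kept)
```

In this sketch only the two paragraph nodes should remain, since the ad and card stack containers are each filtered out by their own :not().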
Upvotes: 1