Reputation: 33
read_html_live() returns an object that seems to contain all the relevant "bits", but I can't then use html_elements() with an XPath expression on it, even though the same website and the same XPath work perfectly with the more traditional read_html().
I have experience using various other libraries for web scraping, but I'm a relatively new convert to rvest, so it's entirely possible I'm missing something obvious.
Minimum working example below:
library(rvest)
x <- read_html("https://www.ngaarawhetu.org/news/")
y <- read_html_live("https://www.ngaarawhetu.org/news/")
x_ele <- html_elements(x, xpath = "//link[@rel = 'alternate']") # Example XPath only - none seem to work with read_html_live()
y_ele <- html_elements(y, xpath = "//link[@rel = 'alternate']")
print(x_ele)
print(y_ele)
The 'x' version, using read_html(), returns the expected values:
{xml_nodeset (5)}
[1] <link rel="alternate" type="application/rss+xml" title="Ngā Ara Whetū » Feed" href="https://www.ngaarawhetu.org/feed/ ...
[2] <link rel="alternate" type="application/rss+xml" title="Ngā Ara Whetū » Comments Feed" href="https://www.ngaarawhetu. ...
[3] <link rel="alternate" type="application/json" href="https://www.ngaarawhetu.org/wp-json/wp/v2/pages/113">\n
[4] <link rel="alternate" type="application/json+oembed" href="https://www.ngaarawhetu.org/wp-json/oembed/1.0/embed?url=h ...
[5] <link rel="alternate" type="text/xml+oembed" href="https://www.ngaarawhetu.org/wp-json/oembed/1.0/embed?url=https%3A% ...
The 'y' version, using read_html_live(), returns no results:
{xml_nodeset (0)}
For this particular website, I would expect html_elements() to return the same results for both object classes (xml_document/xml_node vs. LiveHTML/R6).
Upvotes: 1
Views: 301
Reputation: 33
As per margusl's comment above, the answer was to swap the quotation marks, so that the XPath string is wrapped in single quotes and uses double quotes inside. That is, change
html_elements(y, xpath = "//link[@rel = 'alternate']")
to
html_elements(y, xpath = '//link[@rel = "alternate"]')
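For completeness, here is the live half of the example from the question with the swapped quotes applied. It is only a minimal sketch: it assumes the chromote package and a local Chrome/Chromium install are available (read_html_live() needs them) and that the site is still reachable. Why the quoting matters is not established in this thread.

```r
library(rvest)

# read_html_live() drives a headless browser via chromote (assumed to be installed).
y <- read_html_live("https://www.ngaarawhetu.org/news/")

# Single quotes around the XPath string, double quotes inside the expression.
y_ele <- html_elements(y, xpath = '//link[@rel = "alternate"]')
print(y_ele)
```

The read_html() version from the question should not need any change, since it worked with the original quoting.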
Upvotes: 1