Reputation: 1058
I am trying to scrape data from www.speedtest.net/awards/ca/ontario. For some XPath expressions the standard rvest functions work, but for others they return nothing, and I'm not sure why.
For example, if I go into the head and look for a script element, it works:
library(rvest)
URL<-read_html("http://www.speedtest.net/awards/ca/ontario")
test1<-html_nodes(URL,xpath='/html/head/script[1]')
test1
This will return {xml_nodeset (1)} as expected.
But if I go into the body and try something similar:
test2<-html_nodes(URL,xpath='/html/body/script[1]')
test2
I get {xml_nodeset (0)}.
Why can I not get to the nodes that are under body?
I'm trying to use the code below but I've traced my issue back to the problem described above.
real<-html_nodes(URL,xpath='/html/body/div[1]/div[3]/div/div[2]/div/div[3]/div[2]/table')
real
Any ideas?
Upvotes: 4
Views: 11017
Reputation: 1058
Thanks. Using a CSS selector search, I came up with the following, which works well for getting the table I wanted (the one in the bottom right).
library(rvest)
URL<-read_html("http://www.speedtest.net/awards/ca/ontario")
table<-html_nodes(URL, "table")
table<-html_table(table)[[2]]
Upvotes: 3
Reputation: 24079
Try this. It may not be complete, but it should give you a head start in answering your question:
library(rvest)
URL<-read_html("http://www.speedtest.net/awards/ca/ontario")
#find the table rows in the page
table<-html_nodes(URL, "tbody tr")
#pull info from the table rows
num<-html_text(html_nodes(table, "td.u-align-right"))
provider<-html_text(html_nodes(table, "td.cell-provider-name"))
#final data.frame with a table of the results
df<-data.frame(provider, matrix(num, ncol=3, byrow=TRUE))
With rvest, I find it easier to search with CSS selectors than with XPath.
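To illustrate why a short selector tends to be more robust than a long absolute path, here is a minimal sketch on a made-up inline page (the HTML below is hypothetical and only stands in for the real site; I have not checked what the actual page returns). An absolute XPath has to match the nesting exactly as the parser sees it, while a relative XPath or a CSS selector finds the element wherever it sits.

```r
library(rvest)

# Hypothetical inline page, NOT the real speedtest.net markup
page <- read_html('<html><head><script>var a = 1;</script></head>
<body><div><div><table>
<tr><td class="cell-provider-name">Provider A</td><td class="u-align-right">10</td></tr>
</table></div></div></body></html>')

# Absolute XPath: assumes the table is a direct child of the first div,
# which is wrong here because it is nested one level deeper
abs_path <- html_nodes(page, xpath = '/html/body/div[1]/table')

# Relative XPath: matches a table at any depth
rel_path <- html_nodes(page, xpath = '//table')

# CSS selector: same idea, shorter to write
css_sel <- html_nodes(page, 'table')

length(abs_path)
length(rel_path)
length(css_sel)
```

In this sketch the absolute path should come back as an empty nodeset, while the other two find the table. If a path copied from the browser's inspector returns nothing, it may be worth checking whether the HTML that read_html() receives has the same nesting as what the browser displays.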
Upvotes: 1