Reputation: 4784
I want to scrape data from a public web page of a Google sheet. This is the link.
I am specifically interested in the data in the 4th tab, "US daily 4 pm ET", however the url for that tab is the same as for all the other tabs (at least according the address bar of the browsers I've tried - both Chrome and Firefox). When I try to scrape the data using the rvest package in R, I end up with the data from the 2nd tab, "States current".
I did a right-click to inspect the 1st tab, "README", to see if I could figure something out about the tab names. It looks like the name of the 4th tab is sheet-button-916628299. But entering URLS in my browser that ended with /pubhtml#gid=sheet-button-916628299 or /pubhtml#gid=916628299 didn't take me to the 4th tab.
How can I find a URL that takes me (and, more importantly, the rvest package in R) to the data in the 4th tab?
Upvotes: 0
Views: 388
Reputation: 174278
This is fairly straightforward: the data for all the tabs is loaded on the page already rather than being loaded by xhr requests. The contents of each tab are just hidden or unhidden by css.
If you use the developer pane in your browser, you can see that each tab is in a div with a numerical id which is given by the number in the id of each tab.
We can get the page and make a dataframe of the correct css selectors to get each tab's contents like this:
library(rvest)
url <- paste0("https://docs.google.com/spreadsheets/u/2/d/e/",
"2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZ",
"HXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml#")
page <- read_html(url)
tabs <- html_nodes(page, xpath = "//li")
tab_df <- data.frame(name = tabs %>% html_text,
css = paste0("#", gsub("\\D", "", html_attr(tabs, "id"))),
stringsAsFactors = FALSE)
tab_df
#> name css
#> 1 README #1600800428
#> 2 States current #1189059067
#> 3 US current #294274214
#> 4 States daily 4 pm ET #916628299
#> 5 US daily 4 pm ET #964640830
#> 6 States #1983833656
So now we can get the contents of, say, the fourth tab like this:
html_node(page, tab_df$css[4]) %>% html_nodes("table") %>% html_table()
#> [[1]]
#>
#> 1 1 Date State Positive Negative Pending Death Total
#> 2 NA
#> 3 2 20200314 AK 1 143 144
#> 4 3 20200314 AL 6 22 46 74
#> 5 4 20200314 AR 12 65 26 103
#> 6 5 20200314 AZ 12 121 50 0 183
#> 7 6 20200314 CA 252 916 5 1,168
#> 8 7 20200314 CO 101 712 1 814
#> 9 8 20200314 CT 11 125 136
#> 10 9 20200314 DC 10 49 10 69
#> 11 10 20200314 DE 6 36 32 74
#> 12 11 20200314 FL 77 478 221 3 776
#> 13 12 20200314 GA 66 1 66
#> 14 13 20200314 HI 2 2
#> 15 14 20200314 IA 17 83 100
#> .... (535 rows in total)
Upvotes: 1