Jean V. Adams
Jean V. Adams

Reputation: 4784

Scraping data from public Google sheet - same url for different tabs

I want to scrape data from a public web page of a Google sheet. This is the link.

I am specifically interested in the data in the 4th tab, "US daily 4 pm ET", however the url for that tab is the same as for all the other tabs (at least according the address bar of the browsers I've tried - both Chrome and Firefox). When I try to scrape the data using the rvest package in R, I end up with the data from the 2nd tab, "States current".

I did a right-click to inspect the 1st tab, "README", to see if I could figure something out about the tab names. It looks like the name of the 4th tab is sheet-button-916628299. But entering URLS in my browser that ended with /pubhtml#gid=sheet-button-916628299 or /pubhtml#gid=916628299 didn't take me to the 4th tab.

How can I find a URL that takes me (and, more importantly, the rvest package in R) to the data in the 4th tab?

Upvotes: 0

Views: 388

Answers (1)

Allan Cameron
Allan Cameron

Reputation: 174278

This is fairly straightforward: the data for all the tabs is loaded on the page already rather than being loaded by xhr requests. The contents of each tab are just hidden or unhidden by css.

If you use the developer pane in your browser, you can see that each tab is in a div with a numerical id which is given by the number in the id of each tab.

We can get the page and make a dataframe of the correct css selectors to get each tab's contents like this:

library(rvest)

url <- paste0("https://docs.google.com/spreadsheets/u/2/d/e/",
              "2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZ",
              "HXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml#")

page <- read_html(url)
tabs <- html_nodes(page, xpath = "//li")
tab_df <- data.frame(name = tabs %>% html_text, 
                     css = paste0("#", gsub("\\D", "", html_attr(tabs, "id"))),
                     stringsAsFactors = FALSE)
tab_df
#>                   name         css
#> 1               README #1600800428
#> 2       States current #1189059067
#> 3           US current  #294274214
#> 4 States daily 4 pm ET  #916628299
#> 5     US daily 4 pm ET  #964640830
#> 6               States #1983833656

So now we can get the contents of, say, the fourth tab like this:

html_node(page, tab_df$css[4]) %>% html_nodes("table") %>% html_table()
#> [[1]]
#>                                                             
#> 1     1     Date State Positive Negative Pending Death Total
#> 2    NA                                                     
#> 3     2 20200314    AK        1      143                 144
#> 4     3 20200314    AL        6       22      46          74
#> 5     4 20200314    AR       12       65      26         103
#> 6     5 20200314    AZ       12      121      50     0   183
#> 7     6 20200314    CA      252      916             5 1,168
#> 8     7 20200314    CO      101      712             1   814
#> 9     8 20200314    CT       11      125                 136
#> 10    9 20200314    DC       10       49      10          69
#> 11   10 20200314    DE        6       36      32          74
#> 12   11 20200314    FL       77      478     221     3   776
#> 13   12 20200314    GA       66                      1    66
#> 14   13 20200314    HI        2                            2
#> 15   14 20200314    IA       17       83                 100
#> .... (535 rows in total)

Upvotes: 1

Related Questions