scraping wikipedia data which looks like a table but is not actually a table

Question

I am trying to scrape some data from Wikipedia. The data I want is the number of cases and the number of deaths from the first "table" on the Wikipedia page. Usually I would get the XPath of the table and use rvest, but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic. If I look at one of the collapsible rows, I get (for the date 2020-04-04):


2020-04-04
307,876(+12%)
8,359(+19%)

The data is here: 8359, 14825, 284692, along with the number of cases (307,876) and the number of deaths (8,359). I am trying to extract these numbers for each day.

Code:

library(rvest)  # also provides the %>% pipe

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"

url %>% 
  read_html() %>% 
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>% 
  html_table(fill = TRUE)

QHarr · Accepted Answer

You could use nth-child to target the individual columns. To get the same number of rows in each column, use a CSS attribute selector with the starts-with operator (^=) so that only rows whose id attribute begins with a common substring are matched.
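
To see what such a selector matches without fetching the live page, here is a minimal offline sketch. The HTML fragment is made up for illustration: only the 2020-04-04 values come from the question, and the id suffix "apr" and the "other-row" id are hypothetical. It assumes, as the selector in this answer does, that each data row carries an id beginning with mw-customcollapsible-.

```r
library(rvest)

# Made-up fragment: one row whose id starts with "mw-customcollapsible-"
# (values taken from the question) and one unrelated row that should be skipped.
fragment <- read_html('
  <table>
    <tr id="other-row">
      <td>ignored</td><td></td><td>ignored</td><td>ignored</td>
    </tr>
    <tr id="mw-customcollapsible-apr">
      <td>2020-04-04</td><td></td><td>307,876(+12%)</td><td>8,359(+19%)</td>
    </tr>
  </table>
')

# [id^="..."] keeps elements whose id starts with the given prefix;
# td:nth-child(n) then picks the n-th cell of each matching row.
row_prefix <- '[id^="mw-customcollapsible-"]'

dates  <- fragment %>% html_nodes(paste(row_prefix, 'td:nth-child(1)')) %>% html_text()
cases  <- fragment %>% html_nodes(paste(row_prefix, 'td:nth-child(3)')) %>% html_text()
deaths <- fragment %>% html_nodes(paste(row_prefix, 'td:nth-child(4)')) %>% html_text()
```

Because every column is selected with the same row prefix, the three vectors should end up the same length, which is what lets them be combined into one tibble.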

library(rvest)
library(tidyverse)
library(stringr)

p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')

covid_info <- tibble(
  dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
  cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
  deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
) %>%
  mutate(
    # Strip thousands separators, then keep everything before the opening parenthesis
    case_numbers = str_extract(gsub(',', '', cases), '^.*(?=\\()') %>% as.integer(),
    death_numbers = str_extract(gsub(',', '', deaths), '^.*(?=\\()') %>% as.integer()
  )

print(covid_info)
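
The number-cleaning step can be checked on its own against the sample cell text from the question. The sketch below uses base R only; parse_count is a hypothetical helper name, and it uses sub() to drop the parenthesised part rather than the lookahead above, so it is an alternative illustration and not the code in this answer.

```r
# Sample cell text as shown in the question for 2020-04-04
cases_raw  <- "307,876(+12%)"
deaths_raw <- "8,359(+19%)"

# Hypothetical helper: remove thousands separators, drop everything from the
# first "(" onwards, then convert what is left to an integer.
parse_count <- function(x) {
  no_commas <- gsub(",", "", x, fixed = TRUE)
  as.integer(sub("\\(.*$", "", no_commas))
}

parse_count(cases_raw)   # 307876
parse_count(deaths_raw)  # 8359
```

One difference worth noting: with the lookahead version a cell that has no "(" yields NA, whereas this helper still parses the bare number.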
