user8959427

Reputation: 2067

Scraping Wikipedia data that looks like a table but is not actually a table

I am trying to scrape some data from Wikipedia. The data I want to collect is the number of cases and the number of deaths from the first "table" on the Wikipedia page. Usually I would get the XPath of the table and use rvest, but I cannot seem to collect this piece of data. I would actually prefer to collect the numbers from the graphic. If I look at one of the collapsible rows, I get (for the date 2020-04-04):

<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl">​</div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl">​</div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl">​</div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>

The values I want are the three bar titles (8359, 14825, 284692) along with the number of cases (307,876) and the number of deaths (8,359). I am trying to extract these numbers for each day.

Code:

library(rvest)  # provides read_html(), html_node(), html_table() and %>%

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"

url %>% 
  read_html() %>% 
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>% 
  html_table(fill = TRUE)

Upvotes: 0

Views: 46

Answers (1)

QHarr

Reputation: 84465

You could use the :nth-child pseudo-class to target the individual columns. To get the same number of rows for each column, combine it with a CSS attribute selector using the starts-with operator (^=), which matches every row whose id attribute begins with mw-customcollapsible-.

library(rvest)
library(tidyverse)  # attaches dplyr, tidyr and stringr

p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')

# Each daily row has an id starting with "mw-customcollapsible-".
# The date cell spans two columns, so the cells are: 1 = date, 2 = bars, 3 = cases, 4 = deaths.
covid_info <- tibble(
  dates = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
  cases = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
  deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
) %>% 
  mutate(
    # Remove thousands separators, keep the text before "(", then convert to integer.
    # Cells without a "(" do not match and yield NA.
    case_numbers = str_extract(gsub(',', '', cases), '^.*(?=\\()') %>% as.integer(),
    death_numbers = str_extract(gsub(',', '', deaths), '^.*(?=\\()') %>% as.integer()
  )

print(covid_info)
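
The question also asks for the numbers behind the graphic, which sit in the title attribute of each bar div. A minimal sketch of one way to read them, run here against an inline, trimmed copy of the row from the question rather than the live page (the names row_html, rows, dates and bar_values are illustrative, and the td.bb-lr selector assumes the class shown in that snippet is still in use):

```r
library(rvest)

# Trimmed copy of the row posted in the question, wrapped in <table>
# so that it parses on its own. On the live page, use read_html(url) instead.
row_html <- '<table>
<tr class="mw-collapsible" id="mw-customcollapsible-apr">
<td colspan="2" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" class="bb-fl"></div>
<div title="14825" class="bb-fl"></div>
<div title="284692" class="bb-fl"></div>
</td>
<td class="bb-04em"><span class="cbs-ibr">307,876</span><span class="cbs-ibl">(+12%)</span></td>
<td class="bb-04em"><span class="cbs-ibr">8,359</span><span class="cbs-ibl">(+19%)</span></td>
</tr>
</table>'

page <- read_html(row_html)

# One node per daily row, selected by the id prefix.
rows <- html_nodes(page, "[id^=mw-customcollapsible-]")

# The date is the text of the first cell in each row.
dates <- vapply(rows, function(row) html_text(html_node(row, "td")), character(1))

# For each row, read the title attribute of every bar div as an integer.
bar_values <- lapply(rows, function(row) {
  as.integer(html_attr(html_nodes(row, "td.bb-lr div"), "title"))
})
names(bar_values) <- dates

print(bar_values)
```

Working row by row keeps the bar values grouped per date, which matters if some days have fewer than three bars. In the posted row, the first title (8359) matches the deaths figure in the last cell.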

Upvotes: 1
