Reputation: 2067
I am trying to scrape some data from Wikipedia. The data I want to collect is the "# of cases" and "# of deaths" from the first "table" on the Wikipedia page. Usually I would get the XPath of the table and use rvest, but I cannot seem to collect this piece of data that way. I would actually prefer to collect the numbers from the graphic. If I inspect one of the collapsible rows, I get the following (for the date 2020-04-04):
<tr class="mw-collapsible mw-collapsed mw-made-collapsible" id="mw-customcollapsible-apr" style="display: none;">
<td colspan="2" style="text-align:center" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl"></div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl"></div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl"></div>
</td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:5.6em">307,876</span><span class="cbs-ibl" style="width:3.5em">(+12%)</span></td>
<td style="text-align:center" class="bb-04em"><span class="cbs-ibr" style="padding:0 0.3em 0 0; width:4.55em">8,359</span><span class="cbs-ibl" style="width:3.5em">(+19%)</span></td>
</tr>
The data is here: 8359, 14825, 284692, along with the "# of cases" (307,876) and "# of deaths" (8,359). I am trying to extract these numbers for each day.
Code:
library(rvest)

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States"
url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div[1]/div[4]/div/table/tbody') %>%
  html_table(fill = TRUE)
Upvotes: 0
Views: 46
Reputation: 84465
You could use nth-child to target the various columns. To get the same number of rows in each column, use a CSS attribute selector with the starts-with operator (^=) on the id attribute, so that only rows whose id begins with the substring mw-customcollapsible- are matched.
library(rvest)
library(tidyverse)  # also attaches stringr, which provides str_extract()

p <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')

# [id^=mw-customcollapsible-] matches the per-day rows; nth-child picks the column.
covid_info <- tibble(
  dates  = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(1)') %>% html_text() %>% as.Date(),
  cases  = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(3)') %>% html_text(),
  deaths = p %>% html_nodes('[id^=mw-customcollapsible-] td:nth-child(4)') %>% html_text()
) %>%
  mutate(
    # Drop the thousands separators, then keep the text before the "(+n%)" suffix.
    # Cells without a "(" yield NA.
    case_numbers  = str_extract(gsub(',', '', cases), '^.*(?=\\()') %>% as.integer(),
    death_numbers = str_extract(gsub(',', '', deaths), '^.*(?=\\()') %>% as.integer()
  )

print(covid_info)
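The question also asks for the numbers held in the title attributes of the bar divs. Here is a minimal sketch of one way to read them. It runs on a trimmed inline copy of the row from the question rather than the live page, so it assumes the page still uses the same td.bb-lr / div markup. The column names bar_1, bar_2 and bar_3 are hypothetical: the markup does not label the segments, and the positional mapping assumes every row has its divs in the same order.

```r
library(rvest)

# Trimmed copy of the row shown in the question, wrapped in a <table> so the
# HTML parser keeps the <tr>/<td> elements.
html <- '<table><tbody>
<tr class="mw-collapsible mw-collapsed" id="mw-customcollapsible-apr">
<td colspan="2" class="bb-04em">2020-04-04</td>
<td class="bb-lr">
<div title="8359" style="background:#A50026;width:0.6px" class="bb-fl"></div>
<div title="14825" style="background:SkyBlue;width:1.06px" class="bb-fl"></div>
<div title="284692" style="background:Tomato;width:20.36px" class="bb-fl"></div>
</td>
<td class="bb-04em"><span class="cbs-ibr">307,876</span><span class="cbs-ibl">(+12%)</span></td>
<td class="bb-04em"><span class="cbs-ibr">8,359</span><span class="cbs-ibl">(+19%)</span></td>
</tr>
</tbody></table>'

page <- read_html(html)

# One node per day, selected with the same starts-with id selector as above.
rows <- html_nodes(page, '[id^="mw-customcollapsible-"]')

# Work row by row so the bar values stay attached to the right date.
bars <- do.call(rbind, lapply(rows, function(row) {
  titles <- row %>%
    html_nodes("td.bb-lr div") %>%
    html_attr("title") %>%
    as.integer()
  data.frame(
    date  = row %>% html_node("td:nth-child(1)") %>% html_text() %>% as.Date(),
    bar_1 = titles[1],  # hypothetical names: segments are unlabelled in the markup
    bar_2 = titles[2],
    bar_3 = titles[3]
  )
}))

print(bars)
```

For this row the three values are 8359, 14825 and 284692. The first equals the "# of deaths" figure and the three add up to 307,876, the "# of cases" figure. To try it on the live page, replace page with the p object read above. Rows with fewer than three divs would then give NA in the trailing columns.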
Upvotes: 1