Reputation: 57
I'm trying to build a Shiny app to track vaccine progress, since the CDC page doesn't retain historical information. Looking at the page code with Chrome DevTools, I can see that I want to pull information from every <div> tag with class "card-content", which I tried doing with the following code, using the rvest package:
library(rvest)  # provides read_html(), html_nodes() and the %>% pipe
data <- read_html('https://covid.cdc.gov/covid-data-tracker/#vaccinations')
current_numbers <- data %>% html_nodes('div.card-content')
This returns an empty object with structure "List of 0."
I also used readr::read_file to generate a .txt file to see if something weird was happening. It returned a file with
<main id="maincontent">
</main>
and no intervening content, though the header and footer code seems to be all there.
Is there a better way to pull the data from the <main> content on the page? Is rvest the right package for this? I alternatively could try bs4 in Python, but don't know how to make a Shiny app from that.
Upvotes: 1
Views: 95
Reputation: 25241
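The website fills in its content dynamically with JavaScript after the initial page load, so a plain HTTP request only returns the empty page skeleton and you won't get any of the numbers that way.
I am not that deep into R, but as you mentioned Python and bs4, I can give you a small working example that uses Selenium to render the page in a real browser first.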
Example
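from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string so the backslashes in the Windows path are not read as escapes.
# Selenium 4 takes the driver path through Service; Selenium 3 used executable_path=.
driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
url = "https://covid.cdc.gov/covid-data-tracker/#vaccinations"
driver.get(url)
sleep(2)  # crude fixed wait so the JavaScript has time to render the cards
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()  # end the browser session

# One dict per card: the <h4> holds the title, the first nested <div> the number.
cards = [{'title': item.find('h4').get_text(), 'value': item.find('div').get_text()} for item in soup.select('div.card-content')]
print(cards)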
Output
[{'title': 'Total Doses Distributed', 'value': '37.960.000'},
{'title': 'Total Doses Administered', 'value': '17.546.374'},
{'title': 'Number of People Receiving 1 or More Doses',
'value': '15.053.257'},
{'title': 'Number of People Receiving 2 Doses', 'value': '2.394.961'},
{'title': 'Doses Administered in Long-Term Care Facilities ',
'value': '2.089.181'}]
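Since the goal is to keep a history that the CDC page itself doesn't retain, here is a rough sketch of how the list of title/value dictionaries above could be appended to a CSV file on every run. The helper names parse_count and append_snapshot, the file layout, and the column names are my own assumptions, not anything the page or a library prescribes. It only uses the standard library, and it assumes the values are plain counts whose dots or commas are thousands separators, as in the output above.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def parse_count(text):
    """Turn a string such as '37.960.000' or '37,960,000' into an int.

    Assumes the separators are thousands separators, not decimal points.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)


def append_snapshot(cards, csv_path):
    """Append one timestamped row per card to csv_path (hypothetical layout)."""
    path = Path(csv_path)
    write_header = not path.exists()  # write the header only for a new file
    timestamp = datetime.now(timezone.utc).isoformat()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["scraped_at", "title", "value"])
        for card in cards:
            writer.writerow(
                [timestamp, card["title"].strip(), parse_count(card["value"])]
            )
```

Calling append_snapshot(cards, "vaccine_history.csv") once per scheduled run would give a long-format file with one row per metric per scrape, which a Shiny app could then read with read.csv without needing to run Python itself.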
Upvotes: 1