Erik Rehnberg Steeb

Reputation: 57

Scrape full contents of web page

I'm trying to build a Shiny app to track vaccine progress, since the CDC page doesn't retain historical information. Looking at the page code with Chrome DevTools, I can see that I want to pull information from every <div> tag with class "card-content", which I tried doing with the following code, using the rvest package:

library(rvest)  # provides read_html(), html_nodes(), and the %>% pipe
data <- read_html('https://covid.cdc.gov/covid-data-tracker/#vaccinations')
current_numbers <- data %>% html_nodes('div.card-content')

This returns an empty object with structure "List of 0."

I also used readr::read_file to generate a .txt file to see if something weird was happening. It returned a file with

    <main id="maincontent">

    </main>

and no intervening content, though the header and footer code seems to be all there.

Is there a better way to pull the data from the <main> content on the page? Is rvest the right package for this? Alternatively, I could try bs4 in Python, but I don't know how to build a Shiny app from that.

Upvotes: 1

Views: 95

Answers (1)

HedgeHog

Reputation: 25241

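The website builds its content dynamically with JavaScript, so a plain HTTP request only returns the empty page skeleton and you won't get any of the numbers that way. You need something that renders the page first, such as a browser driven by Selenium.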
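To make the "List of 0" result concrete, here is a minimal sketch using only the Python standard library. It counts `div.card-content` elements in the static snippet from the question and in a made-up rendered snippet. The `CardCounter` class, the `count_cards` helper, and the `rendered_html` markup are assumptions for illustration, not the real CDC markup.

```python
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Count <div> elements whose class list includes 'card-content'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "card-content" in classes:
            self.count += 1


def count_cards(html):
    """Return how many div.card-content elements the HTML string contains."""
    parser = CardCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# What a plain HTTP fetch sees: the <main> element is still empty.
static_html = '<main id="maincontent">\n\n</main>'

# Hypothetical markup after the page's JavaScript has run (invented for this sketch).
rendered_html = (
    '<main id="maincontent">'
    '<div class="card-content"><h4>Example title</h4><div>123</div></div>'
    '</main>'
)

print(count_cards(static_html))    # 0
print(count_cards(rendered_html))  # 1
```

The selector itself is fine; the elements simply do not exist until a browser has executed the page's scripts.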

I am not very experienced with R, but since you mentioned Python and bs4, I can give you a small working example.

Example

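from pprint import pprint
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

# Raw string so the backslashes in the Windows path are not treated as escapes.
# Newer Selenium 4 releases take the path via a Service object instead of executable_path.
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://covid.cdc.gov/covid-data-tracker/#vaccinations"

driver.get(url)
sleep(2)  # give the page's JavaScript time to render the cards

# Parse the rendered DOM rather than the raw HTTP response.
soup = BeautifulSoup(driver.page_source, "lxml")

driver.quit()  # end the browser session

# One dict per card: heading text as title, first nested <div> as value.
cards = [{'title': item.find('h4').get_text(), 'value': item.find('div').get_text()} for item in soup.select('div.card-content')]
pprint(cards)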

Output

[{'title': 'Total Doses Distributed', 'value': '37.960.000'},
 {'title': 'Total Doses Administered', 'value': '17.546.374'},
 {'title': 'Number of People Receiving 1 or More Doses',
  'value': '15.053.257'},
 {'title': 'Number of People Receiving 2 Doses', 'value': '2.394.961'},
 {'title': 'Doses Administered in Long-Term Care Facilities ',
  'value': '2.089.181'}]
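Since the goal is to keep a history the CDC page does not retain, the scraped values could be converted to integers and appended to a CSV file on each run. The following is only a sketch under stated assumptions: the dots in the values are taken to be thousands separators, `parse_count` and `append_snapshot` are hypothetical helper names, and the date is an arbitrary placeholder.

```python
import csv
import io
from datetime import date


def parse_count(text):
    """Turn a string such as '37.960.000' into the integer 37960000.

    Assumes every non-digit character is a thousands separator.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)


def append_snapshot(cards, out, day):
    """Write one CSV row per card: ISO date, title, numeric value."""
    writer = csv.writer(out)
    for card in cards:
        writer.writerow(
            [day.isoformat(), card["title"].strip(), parse_count(card["value"])]
        )


# Two entries copied from the output above.
cards = [
    {"title": "Total Doses Distributed", "value": "37.960.000"},
    {"title": "Total Doses Administered", "value": "17.546.374"},
]

# In-memory stand-in for a real file opened in append mode with newline="".
buffer = io.StringIO()
append_snapshot(cards, buffer, date(2021, 1, 1))  # arbitrary placeholder date
print(buffer.getvalue())
```

In a real run you would pass `date.today()` and an open file handle; a Shiny app could then read the accumulated CSV without needing to call Python itself.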

Upvotes: 1
