Reputation: 3
I am attempting my first web scraping problem for a work project, but I am running into an issue extracting the information I need. I suspect that it has something to do with the page being rendered with JavaScript. I have done a little research on the issue, and I believe my next step may involve using the RSelenium package, but before diving into that solution, I would rather see if there is a simpler one.
I am trying to scrape information from this page: "https://le.utah.gov/DynaBill/BillList?session=2024GS".
I need the same information for all the bills listed on the page. I've provided examples from the first bill for reference. I need the bill number (H.B. 1 Substitute), the short title (Public Education Base Budget Amendments), the sponsor (Rep. Pulsipher, S.), the last update (Thu, Mar 21, 2024 9:35 AM), and the link (https://le.utah.gov/~2024/bills/static/HB0001.html).
I've tried using rvest to select individual HTML elements with the needed information, but after countless trial and error attempts at finding the right element, I ran the code below and discovered the resulting text lacks the information I need.
library("rvest")
url <- "https://le.utah.gov/DynaBill/BillList?session=2024GS"
page <- read_html(url)
html_text(page)
I am a bit confused, though, because when I inspect the page, I can identify the following element with all the information I need: <a href="/~2024/bills/static/HB0001.html" class="nlink" target="_blank">H.B. 1 Substitute</a>.
I have limited experience using R, and virtually none for this type of application, so I would greatly appreciate any feedback.
Upvotes: 0
Views: 60
Reputation: 7540
Saving the expanded HTML page to the working directory (after clicking the section toggles) is one option, as shown below. read_html_live(), as suggested in the comment (or RSelenium), may be the way to go, though, if you need to automate clicking on the section toggles.
library(rvest)
# url <- "https://le.utah.gov/DynaBill/BillList"
read_html("Bills.html") |>
html_elements("#em , dt") |>
html_text() |>
head()
#> [1] "H.B. 50 Substitute -- State Highway Designation Amendments (Rep. Peterson, K.)"
#> [2] "H.B. 51 Substitute -- Health and Human Services Funding Amendments (Rep. Spendlove, R.)"
#> [3] "H.B. 52 -- Industrial Hemp Amendments (Rep. Dailey-Provost, J.)"
#> [4] "H.B. 53 Substitute -- Property Valuation Amendments (Rep. Thurston, N.)"
#> [5] "H.B. 54 -- Coal Miner Certification Panel Amendments (Rep. Albrecht, C.)"
#> [6] "H.B. 55 Second Substitute -- Employment Confidentiality Amendments (Rep. Birkeland, K.)"
read_html("Bills.html") |>
html_elements(".nlink") |>
html_attr("href") |>
head()
#> [1] "https://le.utah.gov/~2024/bills/static/HB0050.html"
#> [2] "https://le.utah.gov/~2024/bills/static/HB0051.html"
#> [3] "https://le.utah.gov/~2024/bills/static/HB0052.html"
#> [4] "https://le.utah.gov/~2024/bills/static/HB0053.html"
#> [5] "https://le.utah.gov/~2024/bills/static/HB0054.html"
#> [6] "https://le.utah.gov/~2024/bills/static/HB0055.html"
Created on 2024-04-03 with reprex v2.1.0
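If you want the bill number, short title, and sponsor as separate columns, one possible follow-up is to split the scraped text with a regular expression. This is only a rough sketch: it assumes every string has the shape "bill number -- short title (sponsor)" seen in the output above, and that the text and link vectors line up one to one, which you should check against the real page. The names bills, links, pattern, and parsed are made up for illustration, and the last update is not covered here.

```r
# Example strings copied from the html_text() output above
bills <- c(
  "H.B. 50 Substitute -- State Highway Designation Amendments (Rep. Peterson, K.)",
  "H.B. 52 -- Industrial Hemp Amendments (Rep. Dailey-Provost, J.)",
  "H.B. 55 Second Substitute -- Employment Confidentiality Amendments (Rep. Birkeland, K.)"
)

# Matching links copied from the html_attr("href") output above
links <- c(
  "https://le.utah.gov/~2024/bills/static/HB0050.html",
  "https://le.utah.gov/~2024/bills/static/HB0052.html",
  "https://le.utah.gov/~2024/bills/static/HB0055.html"
)

# Assumed shape: "<bill number> -- <short title> (<sponsor>)"
# Group 1 is lazy so it stops at the first " -- ";
# group 2 is greedy so the sponsor is taken from the last " (...)" in the string
pattern <- "^(.*?) -- (.*) \\((.*)\\)$"

parsed <- data.frame(
  bill    = sub(pattern, "\\1", bills, perl = TRUE),
  title   = sub(pattern, "\\2", bills, perl = TRUE),
  sponsor = sub(pattern, "\\3", bills, perl = TRUE),
  link    = links
)

parsed
```

Strings that do not match the pattern are returned unchanged by sub(), so any row where bill, title, and sponsor are identical is worth inspecting by hand.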
Upvotes: 0