Ncliften

Reputation: 3

web-scraping issue

I am attempting my first web scraping problem for a work project, but I am having trouble extracting the information I need. I suspect the cause is that the page builds its content with JavaScript. From a little research, I believe my next step may be the RSelenium package, but before diving into that I would like to know whether there is a simpler solution.

I am trying to scrape information from this page: https://le.utah.gov/DynaBill/BillList?session=2024GS

I need the same information for all the bills listed on the page. I've provided examples from the first bill for reference. I need the bill number (H.B. 1 Substitute), the short title (Public Education Base Budget Amendments), the sponsor (Rep. Pulsipher, S.), the last update (Thu, Mar 21, 2024 9:35 AM), and the link (https://le.utah.gov/~2024/bills/static/HB0001.html).

I've tried using rvest to select the individual HTML elements that hold this information. After many rounds of trial and error at finding the right element, I ran the code below and discovered that the returned text does not contain the information I need.

library("rvest")
url <- "https://le.utah.gov/DynaBill/BillList?session=2024GS"
page <- read_html(url)
html_text(page)

I am a bit confused, though, because when I inspect the page in the browser I can find elements that hold this information, for example: <a href="/~2024/bills/static/HB0001.html" class="nlink" target="_blank">H.B. 1 Substitute</a>.

I have limited experience using R, and virtually none for this type of application, so I would greatly appreciate any feedback.

Upvotes: 0

Views: 60

Answers (1)

Carl

Reputation: 7540

One option is to expand the section toggles in the browser, save the fully expanded HTML page to your working directory (here as Bills.html), and then parse that local file, as shown below.

If you need to automate clicking the section toggles, though, read_html_live() as suggested in the comment (or RSelenium) may be the way to go.

library(rvest)

# Source page (expanded and saved manually as "Bills.html" in the working directory):
# url <- "https://le.utah.gov/DynaBill/BillList"

read_html("Bills.html") |> 
  html_elements("#em , dt") |> 
  html_text() |> 
  head()
#> [1] "H.B. 50 Substitute -- State Highway Designation Amendments (Rep. Peterson, K.)"         
#> [2] "H.B. 51 Substitute -- Health and Human Services Funding Amendments (Rep. Spendlove, R.)"
#> [3] "H.B. 52 -- Industrial Hemp Amendments (Rep. Dailey-Provost, J.)"                        
#> [4] "H.B. 53 Substitute -- Property Valuation Amendments (Rep. Thurston, N.)"                
#> [5] "H.B. 54 -- Coal Miner Certification Panel Amendments (Rep. Albrecht, C.)"               
#> [6] "H.B. 55 Second Substitute -- Employment Confidentiality Amendments (Rep. Birkeland, K.)"

read_html("Bills.html") |> 
  html_elements(".nlink") |> 
  html_attr("href") |> 
  head()
#> [1] "https://le.utah.gov/~2024/bills/static/HB0050.html"
#> [2] "https://le.utah.gov/~2024/bills/static/HB0051.html"
#> [3] "https://le.utah.gov/~2024/bills/static/HB0052.html"
#> [4] "https://le.utah.gov/~2024/bills/static/HB0053.html"
#> [5] "https://le.utah.gov/~2024/bills/static/HB0054.html"
#> [6] "https://le.utah.gov/~2024/bills/static/HB0055.html"

Created on 2024-04-03 with reprex v2.1.0
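
To get from those combined strings to separate columns, something along these lines might work. It is only a sketch in base R: it assumes every entry follows the "bill number -- short title (sponsor)" layout seen in the output above, and the names `pattern` and `parse_bills` are made up for illustration. The last-update timestamp does not appear in the text shown above, so it is not handled here.

```r
# Two sample strings copied from the output above; in practice this vector
# would be the result of html_text() on the saved page.
bills <- c(
  "H.B. 50 Substitute -- State Highway Designation Amendments (Rep. Peterson, K.)",
  "H.B. 52 -- Industrial Hemp Amendments (Rep. Dailey-Provost, J.)"
)

# Assumed layout: <bill number> -- <short title> (<sponsor>)
pattern <- "^(.*?) -- (.*) \\((.*)\\)$"

# Hypothetical helper: split each string into three columns.
# Entries that do not match the assumed layout become NA rather than
# being silently copied into every column.
parse_bills <- function(x) {
  x <- trimws(x)
  ok <- grepl(pattern, x, perl = TRUE)
  field <- function(i) {
    ifelse(ok, sub(pattern, paste0("\\", i), x, perl = TRUE), NA_character_)
  }
  data.frame(
    bill    = field(1),
    title   = field(2),
    sponsor = field(3),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_bills(bills)
print(parsed)
```

If the vector of links from the second snippet has the same length and order as the bill strings, it could be attached as another column (for example `parsed$link <- links`), but check that the lengths agree first, since the two selectors are not guaranteed to return the same number of elements.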

Upvotes: 0
