Reputation: 23
I'm trying to scrape Department of Labor data using rvest. I have a list of EINs and PNs (parameters in the web search form) I want to search by. Here's what I have so far:
library(rvest)
library(magrittr)
## URL to page with search form to be populated
site <- "http://www.efast.dol.gov/portal/app/disseminate?execution=e1s1"
session <- html_session(site)
form <- session %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(`ein` = "060646973", # example EIN
             `pn` = "001")        # example PN
result <- submit_form(session, form)
This leads to a page where there is a list of plans. However, I'm not familiar enough with rvest to know how to navigate the result page and download the attachments. It's easily accomplished in the browser, but I want to write a script to automate the task.
Any help on navigating the resulting webpage and downloading attachments using rvest or any other package in R would be much appreciated. Thank you so much!
Upvotes: 2
Views: 1394
Reputation: 78832
This doesn't solve your problem, but it explains why you need RSelenium for this site (the reason is ugly) and gives a pointer to where you have to start, URL-wise, for the RSelenium approach to work. There are plenty of RSelenium SO answers and blog posts to help you use RSelenium itself.
The site uses JavaServer Faces (JSF) on the server side, along with JavaScript, to maintain state and augment navigation. You'll actually have to start at https://www.efast.dol.gov/portal/app/disseminate so the back end can start your session correctly.
Once you fill in the two fields, it makes a POST request that looks like this (in "Copy as cURL" format):
curl -i -s -k \
  -X 'POST' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0' \
  -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Faces-Request: partial/ajax' \
  -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1' \
  -b 'JSESSIONID=0000UG27GxfJ4sVgFVXnUi3Ix9C:18fl2akcj' \
  --data-binary $'javax.faces.partial.ajax=true&javax.faces.source=form%3Anextbtn&javax.faces.partial.execute=%40all&javax.faces.partial.render=form&form%3Anextbtn=form%3Anextbtn&form=form&planName=&sponsorName=&administratorName=&filingId=&ackId=&ein=060646973&pn=001&form%3Aj_idt939%3Apybcalendar_input=&form%3Aj_idt942%3Apyecalendar_input=&formYear=&form%3AnumResults_input=100&form%3AnumResults_editableInput=100&javax.faces.ViewState=e1s1' \
  'https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1'
I post that to let you see some of the additional fields it submits that aren't directly in the <form> initially.
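For illustration only, here is a rough sketch of how those fields might be assembled in R. The field names and values are copied from the captured request; `build_search_body()` and `submit_search()` are hypothetical helper names, and the generated ids (such as `j_idt939`) and the `ViewState` value are assumptions that may well differ in your own session. It uses httr rather than rvest, and I have not verified that the server accepts it.

```r
# Sketch only: field names/values come from the captured cURL request.
# Generated ids (j_idt939, j_idt942) and the ViewState are assumptions and
# may change between sessions or site releases.
build_search_body <- function(ein, pn, view_state = "e1s1", num_results = "100") {
  list(
    `javax.faces.partial.ajax`          = "true",
    `javax.faces.source`                = "form:nextbtn",
    `javax.faces.partial.execute`       = "@all",
    `javax.faces.partial.render`        = "form",
    `form:nextbtn`                      = "form:nextbtn",
    form                                = "form",
    planName                            = "",
    sponsorName                         = "",
    administratorName                   = "",
    filingId                            = "",
    ackId                               = "",
    ein                                 = ein,
    pn                                  = pn,
    `form:j_idt939:pybcalendar_input`   = "",
    `form:j_idt942:pyecalendar_input`   = "",
    formYear                            = "",
    `form:numResults_input`             = num_results,
    `form:numResults_editableInput`     = num_results,
    `javax.faces.ViewState`             = view_state
  )
}

# Hypothetical helper (defined but not run here): start at the entry URL so
# the back end can create the session, then send the partial/ajax POST.
submit_search <- function(ein, pn) {
  start <- httr::GET("https://www.efast.dol.gov/portal/app/disseminate")
  post_url <- start$url  # final URL after any redirects
  httr::POST(
    post_url,
    body   = build_search_body(ein, pn),
    encode = "form",     # URL-encodes the field names and values
    httr::add_headers(
      `Faces-Request`    = "partial/ajax",
      `X-Requested-With` = "XMLHttpRequest",
      Referer            = post_url
    )
  )
}

body <- build_search_body("060646973", "001")
```

httr reuses one handle per host, so the `JSESSIONID` cookie set by the initial GET should be sent along with the POST without extra work.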
The response to that POST is something like:
HTTP/1.1 200 OK
X-Powered-By: Servlet/3.0
Pragma: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Cache-Control: no-store
X-Powered-By: JSF/2.0
X-Powered-By: JSF/2.0
X-UA-Compatible: IE=EmulateIE7
Content-Type: application/xml; charset=UTF-8
Content-Language: en-US
Date: Fri, 23 Dec 2016 13:10:26 GMT
Content-Length: 142
Connection: keep-alive
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><redirect url="/portal/app/disseminate?execution=e1s2"></redirect></partial-response>
That's a JavaServer Faces AJAX redirect response, which ultimately causes you to be redirected to the results page with the actual results in a <table role="treegrid"> (provided to help you target the table in the horrible HTML it returns).
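As a minimal sketch (assuming the xml2 package, which rvest builds on), this is one way to pull the redirect target out of a partial response like the one above and to target the treegrid table. The `results_html` snippet is a made-up stand-in for the real results page, whose markup is far messier.

```r
library(xml2)

# Body of the captured partial response shown above
partial <- "<?xml version='1.0' encoding='UTF-8'?>
<partial-response><redirect url=\"/portal/app/disseminate?execution=e1s2\"></redirect></partial-response>"

doc <- read_xml(partial)

# The redirect target is relative, so resolve it against the page you posted to
redirect_path <- xml_attr(xml_find_first(doc, "//redirect"), "url")
results_url <- url_absolute(
  redirect_path,
  "https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1"
)

# Hypothetical stand-in for the results page (NOT the site's real HTML)
results_html <- read_html(
  '<html><body><table role="treegrid"><tr><td>Example plan</td></tr></table></body></html>'
)

# Target the results table by its role attribute
grid  <- xml_find_first(results_html, "//table[@role='treegrid']")
cells <- xml_text(xml_find_all(grid, ".//td"))
```

In a live session you would fetch `results_url` with the same cookies as the POST before looking for the table.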
You'll then need to figure out how to ensure you can click the checkboxes and download the info.
Any misstep in automated navigation will break the session, so you may be in for some tedious trial and error to make sure your target-selection actions are correct.
Upvotes: 2