wolf7687

Reputation: 155

Finding the URL to send a POST request to

I'm trying to grab data from this site: https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?def=false

I wrote the code below to enter the desired start date and then POST the form. I expected to get the page that comes up after you enter a start date in the browser; that page lists all the businesses, and you can click each one to get additional data. I'm not getting that page. I think either site security is blocking me or I'm doing something wrong. Ultimately, I need to extract data from all of the WARN notices on the site.

I suspect I'm not sending the POST request to the correct URL. How do I find out the correct URL to send the POST request to?

I got date4 as the name of the date field from Inspect Element; I suppose that could be wrong too.

import requests

# Form data: only the start date field (name taken from Inspect Element)
params = {'date4': '01/01/2020'}

with requests.Session() as s:
    r = s.post("https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?securitysys=off&FormID=0", data=params)
    site_text = r.text

Upvotes: 0

Views: 84

Answers (1)

Tyler

Reputation: 438

Looking at this page, your request is missing several of the form fields the site expects, most importantly the authenticity token. To get the token, you first have to fetch the search page and parse it out of the HTML. Take a look at this quick example:

# Imports
from bs4 import BeautifulSoup
from requests import Session

# Session Object
session = Session()

# Add a user agent so the request looks like it comes from a browser.
session.headers.update({
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
})

# Initial request: you need to fetch the URL first so the authenticity
# token can be parsed out of the HTML
init_session = session.get(url="https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?def=false")

# Beautiful soup object, used for HTML parsing
soup = BeautifulSoup(init_session.content, "html.parser")

# Get all of the input tags
inputs = soup.find_all('input')


# Upon running, we see that the authenticity token is the first element in the list.
authenticity_token = inputs[0]['value']

# Now we can make our request!

# Request data
data = {
    "authenticity_token": authenticity_token,
    "coname": "", 
    "coName_ADAdefault": "", 
    "coName_verify_char[0|50]": "The value you have supplied for Company Name is too long.",
    "city": "", 
    "city_ADAdefault": "", 
    "city_verify_char[0|45]": "The value you have supplied for City is too long.",
    "zip": "", 
    "zip_ADAdefault": "", 
    "zip_verify_char[0|10]": "The value you have supplied for Zip/Postal Code is too long.",
    "sda": "", 
    "startdate": "01/01/2020",
    "startDate_ADAdefault": "mm/dd/yyyy",
    "startDate_verify_date4": "",
    "startDate_verify_char[0|45]": "The value you have supplied for Start Date is too long.",
    "enddate": "mm/dd/yyyy",
    "endDate_ADAdefault": "mm/dd/yyyy",
    "endDate_verify_date4": "", 
    "endDate_verify_char[0|45]": "The value you have supplied for End Date is too long.",
    "layoffType": "y",
    "search": "Search",
    "old_choice": 1,
    "ZIP_prev": "",
    "def_prev": "false",
    "CITY_prev": "",
    "SDA_prev": "",
    "STARTDATE_prev": "", 
    "CONAME_prev": "",
    "ENDDATE_prev": "",
    "FormName": "Form0",
}

# Get the data
get_warn_data = session.post("https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?securitysys=on&FormID=0", data=data)

# Printing the raw response looks messy, so let's prettify it with bs4 instead.
# print(get_warn_data.content)


soup = BeautifulSoup(get_warn_data.content, "html.parser")

print(soup.prettify())
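
On the original question of where the POST should go: the target URL and the field names normally come from the form element itself, so they can be read from the page rather than typed by hand. Below is a minimal sketch of that idea. It uses only the standard library's html.parser (swapped in for BeautifulSoup so it runs without extra installs). The names FormFieldParser, read_form and SAMPLE_HTML are hypothetical, and the sample markup is an assumption modeled on the field names above; the real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFieldParser(HTMLParser):
    """Collect the action and the <input> name/value pairs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Note: this keeps every named input, including unchecked boxes.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def read_form(page_url, html_text):
    """Return (absolute POST URL, default form fields) for the first form."""
    parser = FormFieldParser()
    parser.feed(html_text)
    # A relative action is resolved against the URL of the page it came from.
    post_url = urljoin(page_url, parser.action or "")
    return post_url, parser.fields


# Hypothetical markup, NOT copied from the real site.
SAMPLE_HTML = """
<form name="Form0" method="post"
      action="mn_warn_dsp.cfm?securitysys=on&amp;FormID=0">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="startdate" value="mm/dd/yyyy">
</form>
"""

post_url, fields = read_form(
    "https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?def=false", SAMPLE_HTML
)
fields["startdate"] = "01/01/2020"  # override the default before posting
print(post_url)
print(fields)
```

With the real page you would pass init_session.text as html_text and then send the resulting fields with session.post(post_url, data=fields). Whether the site accepts that reduced set of fields is something to verify by comparing against the request your browser sends.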

The POST request above returns the HTML you are looking for. Within that HTML, you will need to parse the <a href> tags to grab the links you need. For example, one table row looks like this:

<tr class="cfOutputTableRow cfAlternate">
             <td align="left" class="cfPadLeft cfAlternate" colspan="1" valign="top">
              <span class="blTransparent">
               <a href="mn_warn_dsp.cfm?id=399&amp;callingfile=mn_warn_dsp.cfm&amp;hash=0C2428869560C6832A1D929070C0278F">
                Aecom
               </a>
              </span>
             </td>
             <td align="left" class="cfAlternate" colspan="1" valign="top">
              <span class="blTransparent">
               Glendale
              </span>
             </td>
             <td align="left" class="cfAlternate" colspan="1" valign="top">
              <span class="blTransparent">
               85310
              </span>
             </td>
             <td align="left" class="cfAlternate" colspan="1" valign="top">
              <span class="blTransparent">
               7
              </span>
             </td>
             <td align="left" class="cfAlternate cfPadRight" colspan="1" valign="top">
              <span class="blTransparent">
               01/17/2020
              </span>
             </td>
            </tr>

Specifically:

<a href="mn_warn_dsp.cfm?id=399&amp;callingfile=mn_warn_dsp.cfm&amp;hash=0C2428869560C6832A1D929070C0278F">

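Once you have grabbed this link, be sure to prepend https://www.azjobconnection.gov/ada/ to it. Note that &amp; is just the HTML-escaped form of &; BeautifulSoup decodes it for you when you read the href attribute, so the final URL is:

https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?id=399&callingfile=mn_warn_dsp.cfm&hash=0C2428869560C6832A1D929070C0278F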
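
The link-building step can be sketched roughly as follows. This is an illustration, not a tested scraper for the live site: it uses the standard library's html.parser in place of BeautifulSoup so it runs on its own, and it assumes that every notice link starts with mn_warn_dsp.cfm?id= as in the sample row. NoticeLinkParser and extract_notice_urls are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.azjobconnection.gov/ada/"


class NoticeLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags that look like WARN notice links."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # HTMLParser has already turned &amp; into & in attribute values.
        href = dict(attrs).get("href") or ""
        # Assumption: notice links all share this prefix, as in the sample row.
        if href.startswith("mn_warn_dsp.cfm?id="):
            self.urls.append(urljoin(BASE_URL, href))


def extract_notice_urls(html_text):
    """Return the absolute notice URLs found in html_text, in page order."""
    parser = NoticeLinkParser()
    parser.feed(html_text)
    return parser.urls


# The anchor tag from the sample row, plus an unrelated link that is skipped.
SAMPLE_ROW = (
    '<a href="mn_warn_dsp.cfm?id=399&amp;callingfile=mn_warn_dsp.cfm'
    '&amp;hash=0C2428869560C6832A1D929070C0278F">Aecom</a>'
    '<a href="help.cfm">Help</a>'
)

for url in extract_notice_urls(SAMPLE_ROW):
    print(url)
```

If you would rather stay with BeautifulSoup, the same filter can be applied to the href values from soup.find_all("a", href=True), passing each one through urljoin in the same way.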

Upvotes: 1
