topcat

Reputation: 584

Scrape .aspx form with Python

I'm trying to scrape https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx, which on paper seems like an easy task, with plenty of resources in other SO questions. Nonetheless, I get the same error no matter how I change my request.

I've tried the following:

import requests
from bs4 import BeautifulSoup

url = "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx"

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

    response = s.get(url)
    soup = BeautifulSoup(response.content)

    data = {
        "ctl00$MainContent$rdoCommoditySystem": "ELEC",
        "ctl00$MainContent$lbReportName": "171",
        "ctl00$MainContent$ddlFrom": "01/11/2018 12:00:00 AM",
        "ctl00$MainContent$rdoReportFormat": "Excel",
        "ctl00$MainContent$btnView": "View",
        "__EVENTVALIDATION": soup.find('input', {'name':'__EVENTVALIDATION'}).get('value',''),
        "__VIEWSTATE": soup.find('input', {'name': '__VIEWSTATE'}).get('value', ''),
        "__VIEWSTATEGENERATOR": soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value', '')
    }

    response = requests.post(url, data=data)

When I print response.content, I get this message (tl;dr: it says "System error occurred. The system will alert technical support of the problem"):

b'\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml" >\r\n<head><title>\r\n\r\n</title></head>\r\n<body>\r\n   <form name="form1" method="post" action="Error.aspx?ErrorID=86e0c980-7832-4fc5-b5a8-a8254dd8ad69" id="form1">\r\n<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTg3NjI4NzkzNmRkaCA5IA9393/t2iMAptLYU1QiPc8=" />\r\n\r\n<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9D6BDE45" />\r\n    <div>\r\n        <h4>\r\n            <span id="lblError">Error</span>\r\n        </h4>\r\n        <span id="lblMessage" class="Validator"><font color="Black">System error occurred. The system will alert technical support of the problem.</font></span>\r\n    </div>\r\n    </form>\r\n</body>\r\n</html>\r\n'

I have tried other options, like changing the __EVENTTARGET argument, as suggested here, and also passing the cookie from the first request to the POST request. Checking the source of the page, I noticed that the form has a "query" function that needs __EVENTTARGET and __EVENTARGUMENT to work:

//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>

But both arguments are empty in the body of the POST request (as can be checked in the Chrome developer tools). Another problem is that I need either to download the file in one of the formats (PDF or Excel) or to get the HTML version, but the .aspx form does not render the information on the same page; it opens a new URL instead: https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx.

I am kind of lost here. What am I missing?

Upvotes: 2

Views: 2256

Answers (1)

topcat

Reputation: 584

I was able to solve this problem by handling the __VIEWSTATE values with more care. In an ASPX form, the page uses __VIEWSTATE to encode the state of the page (i.e. which form options the user has already selected, or in our case requested), and the server needs the current value in order to accept the next request.
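
To make the idea concrete, here is a minimal sketch of pulling the hidden state fields (every input whose name starts with "__") out of a page. It uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages or network access. The class name HiddenFieldParser, the function extract_state, and the sample markup are all made up for illustration; the real page's markup will differ.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input> tags whose name starts with '__'."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("__"):
            # A missing or empty value attribute is stored as an empty string.
            self.fields[name] = attr.get("value") or ""


def extract_state(html):
    """Return a dict of the ASP.NET state fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up sample page, not the real site's markup.
sample = """
<form id="aspnetForm">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTIz" />
  <input type="text" name="ctl00$MainContent$ddlFrom" value="x" />
</form>
"""
state = extract_state(sample)
print(state)
```

The point is that these values come from the server with every response, so they have to be re-read after each request rather than copied once.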

In this case:

  1. Make a first request to get the form's input fields (including the hidden state fields), store them in the payload, and add my first selection by updating the dictionary.
  2. Make a second request with the updated __VIEWSTATE value, and add more options to the request.
  3. Same as 2, but adding the final option.
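
The payload bookkeeping behind these three steps can be sketched as a small pure function. This is only an illustration of the pattern, not code the site requires: next_payload and its parameters are hypothetical names, the state strings are placeholders, and defaulting __EVENTTARGET and __EVENTARGUMENT to empty strings is an assumption about how a plain button submit behaves.

```python
def next_payload(payload, fresh_state, event_target="", changes=None, drop=()):
    """Build the payload for the next postback without mutating the old one.

    payload      -- dict sent with the previous request
    fresh_state  -- state fields ("__" inputs) parsed from the latest response
    event_target -- control that triggers the postback ("" for a button submit)
    changes      -- form selections to add or overwrite
    drop         -- field names that should no longer be sent
    """
    result = {key: value for key, value in payload.items() if key not in drop}
    result.update(fresh_state)  # the newest __VIEWSTATE etc. must win
    result["__EVENTTARGET"] = event_target
    result.setdefault("__EVENTARGUMENT", "")
    result.update(changes or {})
    return result


# Placeholder state values stand in for what the server would return.
step1 = next_payload(
    {"__VIEWSTATE": "state-A", "__EVENTVALIDATION": "valid-A"},
    fresh_state={},
    event_target="ctl00$MainContent$rdoCommoditySystem$2",
    changes={"ctl00$MainContent$rdoCommoditySystem": "ELEC"},
)
step2 = next_payload(
    step1,
    fresh_state={"__VIEWSTATE": "state-B", "__EVENTVALIDATION": "valid-B"},
    event_target="ctl00$MainContent$lbReportName",
    changes={"ctl00$MainContent$lbReportName": "171"},
)
print(step2["__VIEWSTATE"], step2["__EVENTTARGET"])
```

Each call keeps the earlier selections, swaps in the state from the most recent response, and names the control that fired the postback, which is what the browser's __doPostBack does before submitting the form.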

This gives me the same HTML body I get when I make the request in the browser, but it still does not show the data or let me download the files in the body of the last response. This problem can probably be handled with Selenium, but I haven't been successful. This question on SO describes my problem.
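
One thing that may be worth trying, and which I have not verified against this site, is to request the report URL with the same session once the final POST has gone through, since the browser opens ViewReport.aspx at that point. The helper below is a hypothetical sketch: save_report is a made-up name, and it assumes the session object behaves like requests.Session and that the server would answer with a non-HTML content type when it actually returns a file.

```python
def save_report(session, report_url, out_path):
    """GET report_url with an existing session and save the body if it is not HTML.

    Returns True when a file was written, and False when the server sent HTML
    (for example an error page or the report viewer page itself).
    """
    response = session.get(report_url)
    response.raise_for_status()
    content_type = response.headers.get("Content-Type", "")
    if content_type.lower().startswith("text/html"):
        return False
    with open(out_path, "wb") as handle:
        handle.write(response.content)
    return True
```

It would be called inside the same `with requests.Session() as s:` block, after the last POST, for example as save_report(s, "https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx", "report.pdf"). If it returns False, the HTML that came back is the next thing to inspect.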

import requests
from bs4 import BeautifulSoup

url = 'https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx'

with requests.Session() as s:
        s.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36",
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Referer": "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9"
        }

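        # Step 1: load the form and collect its current field values.
        response = s.get(url)
        soup = BeautifulSoup(response.content, 'html5lib')

        # Form controls that already carry a value.
        data = {
            tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')
        }
        # Hidden ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...).
        state = {
            tag['name']: tag.get('value', '')
            for tag in soup.select('input[name^=__]')
        }

        payload = data.copy()
        payload.update(state)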

        payload.update({
            "ctl00$MainContent$rdoCommoditySystem": "ELEC",
            "ctl00$MainContent$lbReportName": '76',
            "ctl00$MainContent$rdoReportFormat": 'PDF',
            "ctl00$MainContent$ddlStartYear": "2008",
            "__EVENTTARGET": "ctl00$MainContent$rdoCommoditySystem$2"
        })

        print(payload['__EVENTTARGET'])
        print(payload['__VIEWSTATE'][-20:])

        response = s.post(url, data=payload, allow_redirects=True)
        soup = BeautifulSoup(response.content, 'html5lib')

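        # Step 2: refresh the state fields from the response before posting again.
        state = {
            tag['name']: tag.get('value', '')
            for tag in soup.select('input[name^=__]')
        }

        # The start-year field is dropped before the second postback.
        payload.pop("ctl00$MainContent$ddlStartYear")
        payload.update(state)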
        payload.update({
            "__EVENTTARGET": "ctl00$MainContent$lbReportName",
            "ctl00$MainContent$lbReportName": "171",
            "ctl00$MainContent$ddlFrom": "01/12/2018 12:00:00 AM"
        })

        print(payload['__EVENTTARGET'])
        print(payload['__VIEWSTATE'][-20:])

        response = s.post(url, data=payload, allow_redirects=True)
        soup = BeautifulSoup(response.content, 'html5lib')

        # Step 3: refresh the state once more, then submit the form with "View".
        state = {
            tag['name']: tag.get('value', '')
            for tag in soup.select('input[name^=__]')
        }

        payload.update(state)
        payload.update({
            "ctl00$MainContent$ddlFrom": "01/10/1990 12:00:00 AM",
            "ctl00$MainContent$rdoReportFormat": "HTML",
            "ctl00$MainContent$btnView": "View"
        })

        print(payload['__VIEWSTATE'])

        response = s.post(url, data=payload, allow_redirects=True)
        print(response.text)

Upvotes: 2
