Reputation: 584
I'm trying to scrape https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx, which on paper seems like an easy task, with plenty of resources in other SO questions. Nonetheless, I get the same error no matter how I change my request.
I've tried the following:
import requests
from bs4 import BeautifulSoup

url = "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx"

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = s.get(url)
    soup = BeautifulSoup(response.content)

    data = {
        "ctl00$MainContent$rdoCommoditySystem": "ELEC",
        "ctl00$MainContent$lbReportName": "171",
        "ctl00$MainContent$ddlFrom": "01/11/2018 12:00:00 AM",
        "ctl00$MainContent$rdoReportFormat": "Excel",
        "ctl00$MainContent$btnView": "View",
        "__EVENTVALIDATION": soup.find('input', {'name': '__EVENTVALIDATION'}).get('value', ''),
        "__VIEWSTATE": soup.find('input', {'name': '__VIEWSTATE'}).get('value', ''),
        "__VIEWSTATEGENERATOR": soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value', '')
    }

    response = requests.post(url, data=data)
When I print the response.content object, I get this message (tl;dr: it says "System error occurred. The system will alert technical support of the problem"):
b'\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml" >\r\n<head><title>\r\n\r\n</title></head>\r\n<body>\r\n <form name="form1" method="post" action="Error.aspx?ErrorID=86e0c980-7832-4fc5-b5a8-a8254dd8ad69" id="form1">\r\n<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTg3NjI4NzkzNmRkaCA5IA9393/t2iMAptLYU1QiPc8=" />\r\n\r\n<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9D6BDE45" />\r\n <div>\r\n <h4>\r\n <span id="lblError">Error</span>\r\n </h4>\r\n <span id="lblMessage" class="Validator"><font color="Black">System error occurred. The system will alert technical support of the problem.</font></span>\r\n </div>\r\n </form>\r\n</body>\r\n</html>\r\n'
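Rather than reading the raw bytes, the error page can be recognised in code. The following is a minimal sketch, assuming (from the output above) that the error page's form always posts to Error.aspx. The names FormActionParser and is_error_page are invented here, and the second sample page is made up for contrast.

```python
from html.parser import HTMLParser


class FormActionParser(HTMLParser):
    """Record the action attribute of every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action") or "")


def is_error_page(html):
    """Return True if any form posts to Error.aspx (assumption based on the output above)."""
    parser = FormActionParser()
    parser.feed(html)
    return any(action.startswith("Error.aspx") for action in parser.actions)


# Trimmed copy of the error page shown above.
error_html = (
    '<form name="form1" method="post" '
    'action="Error.aspx?ErrorID=86e0c980-7832-4fc5-b5a8-a8254dd8ad69" id="form1">'
    '<span id="lblError">Error</span></form>'
)
# Made-up sample of a non-error page, for contrast only.
normal_html = '<form name="aspnetForm" method="post" action="Statistics.aspx"></form>'

print(is_error_page(error_html), is_error_page(normal_html))
```

With requests, this would be called as is_error_page(response.text) after each POST.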
I have tried other options, such as changing the __EVENTTARGET argument, as suggested here, and passing the cookie from the first request to the POST request. Checking the page source, I noticed that the form has a "query" function that needs __EVENTTARGET and __EVENTARGUMENT to work:
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
But both arguments are empty (as can be checked in the Chrome developer inspector) in the body of the POST request. Another problem is that I need to either download the file in one of the formats (PDF or Excel) or get the HTML version, but the ASPX form does not render the information on the same page; it opens a new URL instead: https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx.
I am kind of lost here. What am I missing?
Upvotes: 2
Views: 2256
Reputation: 584
I was able to solve this problem by handling the __VIEWSTATE values with more care. In an ASPX form, the page uses __VIEWSTATE to encode the state of the web page (i.e. which form options the user has already selected, or in our case requested) and to allow the next request.
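To make this concrete, here is a minimal sketch of pulling those hidden state fields out of a page. It uses only the standard library's html.parser instead of BeautifulSoup, so it runs without extra dependencies. The names HiddenFieldParser and hidden_fields are made up for illustration, and the sample HTML is an abbreviated copy of the error page from the question.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing value attribute is stored as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden inputs found in the given HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Abbreviated copy of the error page quoted in the question.
sample = (
    '<form name="form1" method="post" action="Error.aspx">'
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" '
    'value="/wEPDwUKMTg3NjI4NzkzNmRkaCA5IA9393/t2iMAptLYU1QiPc8=" />'
    '<input type="hidden" name="__VIEWSTATEGENERATOR" '
    'id="__VIEWSTATEGENERATOR" value="9D6BDE45" />'
    '</form>'
)

state = hidden_fields(sample)
print(sorted(state))
```

Every response carries a fresh set of these values, so they have to be read again after each request and sent back with the next one.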
In this case, I make a first request to fetch the form's fields, store them in payload, and add my first selection by updating the dictionary. I then make the following requests with the updated __VIEWSTATE value, and add more options to my request. This gives me the same HTML body I get when I make the request in the browser, but it still does not show me the data or let me download the files as part of the body of the last request. This problem can be handled with selenium, but I haven't been successful. This question on SO describes my problem.
import requests
from bs4 import BeautifulSoup  # the 'html5lib' parser below needs the html5lib package

url = 'https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx'

with requests.Session() as s:
    s.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9"
    }

    # Request 1 (GET): load the form and collect its current fields.
    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')

    # Visible form controls that already carry a value.
    data = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=ctl00]') if tag.get('value')
    }
    # Hidden ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION, ...).
    state = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=__]')
    }

    payload = data.copy()
    payload.update(state)

    # Request 2 (POST): first selection, the commodity system.
    payload.update({
        "ctl00$MainContent$rdoCommoditySystem": "ELEC",
        "ctl00$MainContent$lbReportName": '76',
        "ctl00$MainContent$rdoReportFormat": 'PDF',
        "ctl00$MainContent$ddlStartYear": "2008",
        "__EVENTTARGET": "ctl00$MainContent$rdoCommoditySystem$2"
    })
    print(payload['__EVENTTARGET'])
    print(payload['__VIEWSTATE'][-20:])

    response = s.post(url, data=payload, allow_redirects=True)
    soup = BeautifulSoup(response.content, 'html5lib')

    # Refresh the hidden fields from the response before the next POST.
    state = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=__]')
    }

    # Request 3 (POST): choose the report.
    payload.pop("ctl00$MainContent$ddlStartYear")
    payload.update(state)
    payload.update({
        "__EVENTTARGET": "ctl00$MainContent$lbReportName",
        "ctl00$MainContent$lbReportName": "171",
        "ctl00$MainContent$ddlFrom": "01/12/2018 12:00:00 AM"
    })
    print(payload['__EVENTTARGET'])
    print(payload['__VIEWSTATE'][-20:])

    response = s.post(url, data=payload, allow_redirects=True)
    soup = BeautifulSoup(response.content, 'html5lib')

    state = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=__]')
    }

    # Request 4 (POST): set the last options and press "View".
    payload.update(state)
    payload.update({
        "ctl00$MainContent$ddlFrom": "01/10/1990 12:00:00 AM",
        "ctl00$MainContent$rdoReportFormat": "HTML",
        "ctl00$MainContent$btnView": "View"
    })
    print(payload['__VIEWSTATE'])

    response = s.post(url, data=payload, allow_redirects=True)
    print(response.text)
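The POSTs above all follow the same pattern: merge the fresh hidden fields from the last response into the payload, apply the new selections, and set __EVENTTARGET the way __doPostBack would. Below is a minimal sketch of that step as a pure function. The name next_payload and its parameters are hypothetical, the view state strings are placeholders, and the function only builds the dictionary; it does not send anything.

```python
def next_payload(previous, fresh_state, selections, event_target="", drop=()):
    """Build the form data for the next postback without mutating the inputs.

    previous     -- payload sent with the last request
    fresh_state  -- hidden fields (e.g. __VIEWSTATE) parsed from the last response
    selections   -- form controls to set or change for this step
    event_target -- control name to put in __EVENTTARGET; left empty for a
                    button submit, matching the empty fields seen in the browser
    drop         -- field names to remove before sending
    """
    payload = dict(previous)
    for name in drop:
        payload.pop(name, None)
    payload.update(fresh_state)  # newest server state replaces the old one
    payload.update(selections)
    payload["__EVENTTARGET"] = event_target
    payload.setdefault("__EVENTARGUMENT", "")
    return payload


step1 = {
    "__VIEWSTATE": "old-state",  # placeholder, not a real view state
    "ctl00$MainContent$rdoCommoditySystem": "ELEC",
    "ctl00$MainContent$ddlStartYear": "2008",
}
step2 = next_payload(
    step1,
    fresh_state={"__VIEWSTATE": "new-state"},  # placeholder
    selections={"ctl00$MainContent$lbReportName": "171"},
    event_target="ctl00$MainContent$lbReportName",
    drop=["ctl00$MainContent$ddlStartYear"],
)
print(step2["__VIEWSTATE"], step2["__EVENTTARGET"])
```

Returning a new dictionary each time keeps every step's payload available for comparison with what the browser sends, which helps when a step starts returning the error page.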
Upvotes: 2