Aerirprown

Reputation: 11

Python requests text only returning ï»¿ï»¿ (byte-order marks) instead of HTML

I'm trying to scrape the link to a file to download later from a website.

My code:

import requests

outage_page = 'https://www.oasis.oati.com/cgi-bin/webplus.dll?script=/woa/woa-planned-outages-report.html&Provider=MISO'

s = requests.Session()

req = s.get(outage_page, stream=True, verify='my cert path is here')

print(req, '\n', req.headers, '\n', req.raw, '\n', req.encoding, '\n', req.content, '\n', req.text)

This is the output I get:

{'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Server': 'Microsoft-IIS/7.5', 'X-Powered-By': 'ASP.NET', 'X-Content-Type-Options': 'nosniff', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Date': 'Mon, 26 Aug 2019 15:48:39 GMT', 'Content-Length': '136'}

ISO-8859-1

b'\xef\xbb\xbf\xef\xbb\xbf\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'



Process finished with exit code 0

I expected req.text to return HTML I could scrape, but it only returns two byte-order marks (shown as ï»¿ï»¿) followed by blank lines. The other print statements are just for reference. What am I doing wrong?

Upvotes: 0

Views: 141

Answers (1)

Aerirprown

Reputation: 11

I'm posting the solution that worked for me. I converted my certificate file from .cer to .pem, set the certificate on the session (s.cert) instead of passing it to get(), and added a User-Agent header to the request. I also set verify=False: verify controls validation of the server's certificate, not the client certificate, so my client certificate path did not belong there.
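The conversion step itself isn't shown here. A minimal sketch of one way to do it in Python, assuming the .cer file is either DER-encoded or already PEM text: the helper name cer_bytes_to_pem is hypothetical, and the approach (ssl.DER_cert_to_PEM_cert from the standard library) is a stand-in, not necessarily the tool that was actually used.

```python
import ssl

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def cer_bytes_to_pem(cer_bytes: bytes) -> str:
    """Return PEM text for a certificate given as DER or already-PEM bytes.

    Hypothetical helper: a .cer file can hold either encoding, so check
    for the PEM header before converting.
    """
    if cer_bytes.lstrip().startswith(PEM_HEADER.encode("ascii")):
        # Already PEM (base64 text); nothing to convert.
        return cer_bytes.decode("ascii")
    # Otherwise assume binary DER and wrap it in PEM armor.
    return ssl.DER_cert_to_PEM_cert(cer_bytes)
```

To use it, read the .cer file as bytes, pass them to cer_bytes_to_pem, and write the returned text to the .pem path. This only covers the certificate; requests also needs the matching private key, either in the same .pem file or supplied as a (cert, key) tuple.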

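import requests

# outage_page is the URL defined in the question

# create the connection and attach the client certificate to the session
s = requests.Session()
s.cert = 'path/to/cert.pem'
# send a browser-like User-Agent with the request
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
}

# verify=False turns off validation of the server's certificate
req = s.get(outage_page, headers=head, verify=False)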
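For reference, the body from the question can be checked offline to see that it carries no HTML at all. The sketch below uses a shortened copy of those bytes; the function name is_effectively_empty is hypothetical. It also shows why req.text displayed odd characters: the response was decoded as ISO-8859-1, which turns each UTF-8 byte-order mark into three visible characters.

```python
import codecs

# Shortened copy of the raw body printed in the question.
body = b'\xef\xbb\xbf\xef\xbb\xbf\r\n\r\n\r\n \r\n\r\n'


def is_effectively_empty(raw: bytes) -> bool:
    """Return True if the body holds only UTF-8 BOMs and whitespace."""
    return raw.replace(codecs.BOM_UTF8, b'').strip() == b''


# One UTF-8 BOM decoded as ISO-8859-1 becomes three separate characters.
junk = codecs.BOM_UTF8.decode('iso-8859-1')

print(is_effectively_empty(body))
print(is_effectively_empty(b'\xef\xbb\xbf<html></html>'))
print(repr(junk))
```

A check like this could be run on req.content before parsing, to tell a placeholder response from a real page.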

Upvotes: 1
