Can't get Python to download webpage source code: "browser version not supported"

Question

I'm trying to write a Python 2.7 program that downloads the source code of a webpage.

The code looks like this:

import urllib2

url = "https://scrap.tf/stranges/47"
# Send a browser-like User-Agent header with the request
req = urllib2.Request(url, headers={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
con = urllib2.urlopen(req)
data = con.read()
print data

# Save the downloaded source to a text file
filename = raw_input("Enter filename here: ") + ".txt"
with open(filename, "w") as out_file:
    out_file.write(data)

However, when I open the output file, major chunks of the source code are missing; instead there's a message saying that this browser version is unsupported and that I should get another one.

Is there a way I can avoid this problem?

Ed King · Accepted Answer

Looking at the URL you listed, I did the following:

Downloaded the page using wget
Downloaded the page using urllib from IPython
Opened the URL in Chrome and saved the page (HTML only)

All three gave me the same file (same size, same contents).
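If you want to repeat that comparison yourself, one way is to hash each saved copy and check that the digests match. This is only a rough sketch: the helper names `file_digest` and `all_identical` are made up for illustration, and the file names in the usage note are placeholders for wherever you saved each download.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large downloads are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def all_identical(paths):
    """True if every file in `paths` has the same contents (by digest)."""
    digests = set(file_digest(path) for path in paths)
    return len(digests) <= 1
```

For example, `all_identical(["wget.html", "urllib.html", "chrome.html"])` (hypothetical file names) returns True only when the three downloads are byte-for-byte the same.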

This could be because I'm not logged in, but I do see that the site contains a lot of JavaScript, which is used to render the pages.
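A quick way to get a feel for how script-heavy a downloaded page is would be to count its script tags. The sketch below uses only the standard library; `ScriptCounter` and `count_scripts` are names invented for this example, and a high count is only a hint that the content is rendered client-side, not proof.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ScriptCounter(HTMLParser):
    """Counts <script> start tags seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser
        if tag == "script":
            self.script_count += 1


def count_scripts(html_text):
    """Return the number of <script> tags in `html_text`."""
    counter = ScriptCounter()
    counter.feed(html_text)
    counter.close()
    return counter.script_count
```

You would pass it the decoded text of the page you downloaded, for instance the contents of the file saved by urllib.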

I understand that you are trying to use urllib, but given the above I would use Selenium, so I'll detail how to get started with it. This example needs Selenium and PhantomJS, but you could do the same with Selenium and Firefox.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

browser_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

url = 'https://scrap.tf/stranges/47'

# Copy the default PhantomJS capabilities and override the user agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = browser_agent
mydriver = webdriver.PhantomJS(desired_capabilities=dcap)
mydriver.implicitly_wait(30)  # wait up to 30 seconds when looking up elements
mydriver.set_window_size(1366, 768)

mydriver.get(url)
title = mydriver.title
print(title)
page = mydriver.page_source  # page source as seen by the browser
# debugging -- save a screenshot to see how the page looks
mydriver.get_screenshot_as_file('/data/screen/test.png')

This downloads the page with all the JavaScript rendered correctly, but you'll need to log in to Steam, which will require some interaction.
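Since the original goal was to get the source into a text file, here is a minimal sketch of writing the page source to disk. It assumes the source may contain non-ASCII characters, so it writes UTF-8 explicitly; `save_page_source` is a hypothetical helper name, not part of Selenium.

```python
import io


def save_page_source(page_source, filename):
    """Write `page_source` to `filename` as UTF-8 and return the filename."""
    # Accept raw bytes as well as text; this assumes the bytes are UTF-8
    if isinstance(page_source, bytes):
        page_source = page_source.decode("utf-8")
    with io.open(filename, "w", encoding="utf-8") as out_file:
        out_file.write(page_source)
    return filename
```

With the driver from the example above still open, something like `save_page_source(mydriver.page_source, "output.txt")` should do it, where "output.txt" is just a placeholder name.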

To work out the login steps, inspect the page in Chrome or Firefox, find the CSS selector or XPath of each element you need, and look it up with the webdriver find_element function.

The webdriver can also send keypresses and clicks to those elements.
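As a rough illustration of that kind of interaction, the helper below looks up a text field and a button by CSS selector, types into the field, and clicks the button. It is a sketch under assumptions: `fill_and_submit` is a made-up name, and the selectors in the usage note are hypothetical, so you would need to replace them with the ones you find when inspecting the real login page.

```python
# Same string value as selenium's By.CSS_SELECTOR constant
CSS_SELECTOR = "css selector"


def fill_and_submit(driver, field_selector, text, button_selector):
    """Type `text` into one element and click another, both found by CSS selector."""
    field = driver.find_element(CSS_SELECTOR, field_selector)
    field.clear()          # remove any pre-filled value
    field.send_keys(text)  # simulate keypresses
    button = driver.find_element(CSS_SELECTOR, button_selector)
    button.click()         # simulate a mouse click
```

A call might look like `fill_and_submit(mydriver, "#username", "my_name", "#login-button")`, where both selectors are placeholders rather than the site's actual element IDs.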
