Reputation: 83
I'm trying to write a Python 2.7 program that downloads the source code of a web page.
The code looks like this:
import urllib2
url = "https://scrap.tf/stranges/47"
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
con = urllib2.urlopen(req)
data = con.read()
print data
filename = raw_input("Enter filename here: ") + ".txt"
with open(filename, "w") as out_file:
    out_file.write(data)
However, when I open the output file, major chunks of the source code are missing. Instead there's a message saying that this version of the browser is unsupported and that I should get another one.
Is there a way I can avoid this problem?
Upvotes: 0
Views: 892
Reputation: 454
Looking at the url you listed, I did the following:
All 3 gave me the same resulting file (same size, same contents).
That could be because I'm not logged in, but I can also see that the site contains a lot of JavaScript that renders the pages in the browser.
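One rough way to check how script-heavy a downloaded page is: count its script tags. This is only a sketch using the standard library. The names ScriptCounter and count_scripts are my own, and a high count only suggests, it does not prove, that the content is built in the browser. It expects text, so on Python 3 decode the bytes from read() first.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class ScriptCounter(HTMLParser):
    """Counts <script> tags, split into external (src=...) and inline."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.external = 0
        self.inline = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            # attrs is a list of (name, value) pairs
            if dict(attrs).get("src"):
                self.external += 1
            else:
                self.inline += 1


def count_scripts(html):
    """Return (external_count, inline_count) for the given HTML text."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.external, counter.inline
```

Calling count_scripts(data) on the downloaded text gives a quick feel for how much of the page depends on scripts.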
I understand that you are trying to use urllib, but given the above I would use Selenium instead, so here is how to get started with it. This example needs Selenium and PhantomJS, but you could do the same with Selenium and Firefox.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
browser_agent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
url = 'https://scrap.tf/stranges/47'
# Start from PhantomJS's default capabilities and override the user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = browser_agent
mydriver = webdriver.PhantomJS(desired_capabilities=dcap)
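# Wait up to 30 seconds when looking up elements that are not present yet.
mydriver.implicitly_wait(30)
mydriver.set_window_size(1366, 768)
# Load the page; PhantomJS runs its JavaScript.
mydriver.get(url)
title = mydriver.title
print(title)
# page_source holds the HTML after rendering.
page = mydriver.page_source
# Debugging: take a screenshot to see how the page looks.
mydriver.get_screenshot_as_file('/data/screen/test.png')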
This downloads the page with all the JavaScript rendered, but you'll need to log in to Steam, which requires some interaction.
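Since the original goal was to write the page to a file, here is a minimal sketch of saving the rendered HTML. The helper name save_page_source is my own and not part of selenium, and I'm assuming UTF-8 is an acceptable encoding for the output file. It uses io.open so that the same code works on Python 2.7 and Python 3.

```python
import io


def save_page_source(page, filename):
    """Write the rendered HTML text to filename, encoded as UTF-8."""
    # io.open takes an explicit encoding on both Python 2.7 and 3,
    # which avoids encode errors on non-ASCII characters in the page.
    with io.open(filename, "w", encoding="utf-8") as out_file:
        out_file.write(page)
```

With the driver from the example, that would be something like save_page_source(mydriver.page_source, "stranges.txt"), where the filename is just a placeholder.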
You can work out what the login needs by inspecting the page in Chrome or Firefox, finding the CSS selector or XPath of each element, and passing it to the webdriver's find_element function.
The elements it returns also accept keypresses and clicks.
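As a rough sketch of what that interaction could look like: the function below finds two text fields and a button, types into the fields, and clicks the button. The function name fill_and_submit and all three locators are hypothetical placeholders. I have not inspected the real login form, so replace them with whatever you find in the browser's inspector. The strategy strings are the values behind selenium's By.CSS_SELECTOR and By.XPATH, written out so the sketch has no imports.

```python
# Locator strategy strings; these are the values of selenium's
# By.CSS_SELECTOR and By.XPATH constants.
CSS_SELECTOR = "css selector"
XPATH = "xpath"


def fill_and_submit(driver, username, password):
    """Type credentials into a form on the current page and submit it."""
    # Hypothetical locators: replace with the real ones from the page.
    user_box = driver.find_element(CSS_SELECTOR, "input[name='username']")
    pass_box = driver.find_element(CSS_SELECTOR, "input[name='password']")
    submit = driver.find_element(XPATH, "//button[@type='submit']")

    # send_keys simulates keypresses; click simulates a mouse click.
    user_box.clear()
    user_box.send_keys(username)
    pass_box.clear()
    pass_box.send_keys(password)
    submit.click()
```

You would call it with the driver created earlier, for example fill_and_submit(mydriver, "my_user", "my_password"), and then read mydriver.page_source again once the next page has loaded.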
Upvotes: 2