user6374210

Using Python 2.7 to get a download link from a web page

So I'm making this program to make a repetitive task less annoying. It's supposed to take a link, filter for a "Download STV Demo" button, grab the URL from that button, and use it to download. The downloading-files-from-a-URL part works fine; I just can't get the URL to open. It will download from stackoverflow, just not from the site I want. I get a 403 Forbidden error. Anyone have ideas on how to get this to work on http://sizzlingstats.com/stats/479453, and also how to filter for that Download STV Demo button?

import random, sys, urllib2, httplib2, win32clipboard, requests, urlparse
from copy import deepcopy
from bs4 import SoupStrainer
from bs4 import BeautifulSoup
from urllib2 import Request
from urllib2 import urlopen
#When I wrote this, only God and I knew what I was writing
#Now only God knows

page = raw_input("Please copy the .ss link and hit enter... ")
win32clipboard.OpenClipboard()
page = win32clipboard.GetClipboardData()
win32clipboard.CloseClipboard()
s = page
try:
    page = s.replace("http://","http://www.")
    print page + " Found..."
except:
    page = s.replace("www.","http://www.")
    print page

req = urllib2.Request(page, '', headers = { 'User-Agent' : 'Mozilla/5.0' })
req.headers['User-agent'] = 'Mozilla/5.0'
req.add_header('User-agent', 'Mozilla/5.0')
print req
soup = BeautifulSoup(page, 'html.parser')
print soup.prettify()
links = soup.find_all("Download STV Demo")
for tag in links:
    link = links.get('href',None)
    if "Download STV Demo" in link:
        print link

file_name = page.split('/')[-1]
u = urllib2.urlopen(page)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()

Upvotes: 0

Views: 147

Answers (2)

ands

Reputation: 2036

Let's look at your code. First, you are importing many modules that you don't use (maybe this isn't the whole code), and some others that you use but don't actually need. In fact, you only need:

from urllib2 import urlopen

(you'll see why later), and maybe win32clipboard for your input. Your input is OK, so I'll leave this part of the code:

import win32clipboard
page = raw_input("Please copy the .ss link and hit enter... ")
win32clipboard.OpenClipboard()
page = win32clipboard.GetClipboardData()
win32clipboard.CloseClipboard()

But I really don't see the purpose of this kind of input; isn't it easier to just use something like:

page = raw_input("Please enter the .ss link: ")

Then this part of the code is really unnecessary:

s = page
try:                                            
    page = s.replace("http://","http://www.")   
    print page + " Found..."                   
except:                                             
    page = s.replace("www.","http://www.")      
    print page   

so I'll just delete it (str.replace never raises an exception, so the except branch can never run anyway). The next part should look like this:

from urllib2 import Request, urlopen
from bs4 import BeautifulSoup
req = Request(page, headers = { 'User-Agent' : 'Mozilla/5.0' })
#req.headers['User-agent'] = 'Mozilla/5.0'      # you don't need this
#req.add_header('User-agent', 'Mozilla/5.0')    # you don't need this
print req
html = urlopen(req)        # you need to open the page with urlopen before passing it to BeautifulSoup
# this fixes the following error:
##      UserWarning: "b'http://www.sizzlingstats.com/stats/479453'" looks like a URL.
##      Beautiful Soup is not an HTTP client. You should probably use an HTTP client
##      to get the document behind the URL, and feed that document to Beautiful Soup.
soup = BeautifulSoup(html, 'html.parser')   # variable page changed to html
# print soup.prettify()        # commented out because you don't need to print the html,
                               # but if you want to see that it works, just uncomment it

I won't be using this code, and I am going to explain why; but if you need to scrape some other page with BeautifulSoup, you can use it.

You don't need it because of this part:

links = soup.find_all("Download STV Demo")

The problem is that there is no "Download STV Demo" in the HTML code, at least not in the HTML that soup sees, because the page is generated by JavaScript, so you won't find any links. You can use print(links) to see that links == []. (Note also that find_all("Download STV Demo") searches for a tag with that name, not for link text.) Because of this you don't need this part either:

for tag in links:
    link = tag.get('href', None)       # like I said, there is no use of this,
    if "Download STV Demo" in link:    # because the variable links is an empty list
        print link
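
As a side note, even on a static page that search would not have worked as written. If the button were plain HTML, the lookup would look something like this (just a sketch, assuming the button is an <a> tag containing exactly that text):

# hypothetical: only works if the button were in the static HTML, which it isn't here
for tag in soup.find_all('a', text='Download STV Demo'):
    print tag.get('href')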

So, like I said, the part of the page with the link we need is generated with JavaScript. You could scrape the scripts to find it, but that would be a lot harder. However, if you look at the URL we are trying to find, it looks like this:

http://sizzlingstv.s3.amazonaws.com/stv/479453.zip

so now look at url you have, it looks like this:

http://sizzlingstats.com/stats/479453

To get the link http://sizzlingstv.s3.amazonaws.com/stv/479453.zip, you only need its last part, in this case 479453, and your link (http://sizzlingstats.com/stats/479453) already ends with the same number. You even use that number as file_name. Here is code that does exactly that:

file_name = page.split('/')[-1]
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name  + '.zip'
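
If you want this to be a bit more robust against trailing slashes or query strings, you could pull the ID out with urlparse instead (just a sketch, assuming the stats ID is always the last path segment):

from urlparse import urlparse   # Python 2 module

path = urlparse(page).path                   # e.g. '/stats/479453'
file_name = path.rstrip('/').split('/')[-1]  # last path segment, e.g. '479453'
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name + '.zip'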

After that I'll copy some of your code:

u = urlopen(download_link)
meta = u.info()    
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

The following part works:

f = open(file_name + '.zip', 'wb')    # I added '.zip'
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()

And maybe you want to see the download progress messages, but I think it is easier to use the following (note that it reads the whole file into memory at once, so the chunked loop above is safer for large files):

f = open(file_name + '.zip', 'wb') 
f.write(u.read())
print "Downloaded" 
f.close()

And here is just the code:

from urllib2 import urlopen

import win32clipboard
page = raw_input("Please copy the .ss link and hit enter... ")
win32clipboard.OpenClipboard()
page = win32clipboard.GetClipboardData()
win32clipboard.CloseClipboard()

# or use:
# page = raw_input("Please enter the .ss link: ")

file_name = page.split('/')[-1]
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name  + '.zip'
u = urlopen(download_link)
meta = u.info()    
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

f = open(file_name + '.zip', 'wb')    # I added '.zip'
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()

# or use:
##f = open(file_name + '.zip', 'wb') 
##f.write(u.read())
##print "Downloaded" 
##f.close()

Upvotes: 1

J. Doe

Reputation: 1577

The content of that page is generated dynamically via JavaScript from their API.

>>> import requests
>>>
>>> requests.get('http://sizzlingstats.com/api/stats/479453').json()['stats']['stvUrl']
u'http://sizzlingstv.s3.amazonaws.com/stv/479453.zip'

You're getting a 403 because they're blocking the user-agent.

You've created a req object with a user-agent, but you never use it; you call urllib2.urlopen(page) instead.

You're also passing page to BeautifulSoup, which is an error:

soup = BeautifulSoup(page, 'html.parser')
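
Putting it together, here is a minimal sketch using requests (assuming the API shape shown above; the browser-like User-Agent is sent in case the file host filters on it too):

import requests

page = 'http://sizzlingstats.com/stats/479453'
match_id = page.rstrip('/').split('/')[-1]

# ask their API for the STV download url
api_url = 'http://sizzlingstats.com/api/stats/' + match_id
stv_url = requests.get(api_url).json()['stats']['stvUrl']

# stream the zip to disk in chunks
r = requests.get(stv_url, headers={'User-Agent': 'Mozilla/5.0'}, stream=True)
with open(stv_url.split('/')[-1], 'wb') as f:
    for chunk in r.iter_content(8192):
        f.write(chunk)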

Upvotes: 0
