JasonG

Reputation: 45

How to get redirected URL without downloading file

I'm writing a web scraper with requests and bs4. The site serves all of its content through URLs of the form https://downlaod.domain.com/xid_39428423_1, which then redirect to the actual file. I want a call that fetches the redirect target before the file is downloaded, so I can check whether I've already downloaded that file. My current code is:

def download_file(file_url,s,thepath):
    if not os.path.isdir(thepath):
        os.makedirs(thepath)
    print 'getting header'
    i = s.head(file_url)
    urlpath = i.url
    name = urlsplit(urlpath)[2].split('/')
    name = name[len(name)-1]
    if not os.path.exists(thepath + name):
        print urlpath
        i = s.get(urlpath)
        if i.status_code == requests.codes.ok:
            with iopen(thepath + name, 'wb') as file:
                file.write(i.content)
        else:
            return False

If I change s.head to s.get it works, but then the file is downloaded twice. Is there any way to get the redirected URL without downloading the file?

SOLVED: the final code looks like this, thanks!

def download_file(file_url,s,thepath):
    if not os.path.isdir(thepath):
        os.makedirs(thepath)
    print 'getting header'
    i = s.get(file_url, allow_redirects=False)
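    # Treat every common redirect status as a redirect, not only 302.
    if i.status_code in (301, 302, 303, 307, 308):
        urlpath = i.headers['location']
    else:
        urlpath = file_url
    # The file name is the last segment of the URL path.
    name = urlsplit(urlpath)[2].split('/')[-1]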
    if not os.path.exists(thepath + name):
        print urlpath
        i = s.get(urlpath)
        if i.status_code == requests.codes.ok:
            with iopen(thepath + name, 'wb') as file:
                file.write(i.content)
        else:
            return False

Upvotes: 2

Views: 1944

Answers (1)

Mark van Lent

Reputation: 13011

You could set the allow_redirects flag to False (see the documentation). The .get() call then does not follow the redirect, so you can inspect the response, including its Location header, before retrieving the file itself.

In other words, instead of this:

i = s.head(file_url)
urlpath = i.url

You could write:

i = s.get(file_url, allow_redirects=False)
urlpath = i.headers['location']
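
A slightly more defensive version of the same idea could look like the sketch below. It is only an illustration, written for Python 3, and the names resolve_target, filename_from_url and REDIRECT_CODES are made up for this example. It assumes s is a requests.Session (or anything with a compatible .get()), accepts the other common redirect status codes in addition to 302, and resolves a relative Location header against the original URL. Passing stream=True is meant to keep requests from reading the body when the server answers directly with the file instead of a redirect.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects in this sketch (an assumption;
# adjust to whatever the site actually sends).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_target(session, file_url):
    """Return the URL that file_url redirects to, or file_url itself.

    `session` is assumed to behave like requests.Session.
    """
    # allow_redirects=False stops at the first response;
    # stream=True defers reading the response body.
    resp = session.get(file_url, allow_redirects=False, stream=True)
    try:
        location = resp.headers.get('location')
        if resp.status_code in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the request URL.
            return urljoin(file_url, location)
        return file_url
    finally:
        # Release the connection without downloading the body.
        resp.close()


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(urlsplit(url).path)
```

In download_file you would then call urlpath = resolve_target(s, file_url) and name = filename_from_url(urlpath) before the os.path.exists check. As a side note, the original s.head(file_url) returned the unredirected URL because requests does not follow redirects for HEAD requests unless you pass allow_redirects=True.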

Upvotes: 2
