Reputation: 45
I'm writing a web scraper with requests and bs4. The site I'm working with serves all content through links of the form https://downlaod.domain.com/xid_39428423_1, which then redirect you to the actual file. What I want is a way to fetch the redirect target before downloading the file, so I can check whether I've already downloaded it. The current code snippet I have is this:
# Python 2; imports the snippet relies on
import os
import requests
from io import open as iopen
from urlparse import urlsplit

def download_file(file_url, s, thepath):
    if not os.path.isdir(thepath):
        os.makedirs(thepath)
    print 'getting header'
    i = s.head(file_url)
    urlpath = i.url
    name = urlsplit(urlpath)[2].split('/')
    name = name[len(name) - 1]
    if not os.path.exists(thepath + name):
        print urlpath
        i = s.get(urlpath)
        if i.status_code == requests.codes.ok:
            with iopen(thepath + name, 'wb') as file:
                file.write(i.content)
        else:
            return False
If I change the s.head to s.get it works, but then it downloads the file twice. Is there any way to get the redirected URL without downloading?
SOLVED: The final code looks like this, thanks!
# Python 2; same imports as above
import os
import requests
from io import open as iopen
from urlparse import urlsplit

def download_file(file_url, s, thepath):
    if not os.path.isdir(thepath):
        os.makedirs(thepath)
    print 'getting header'
    # Request the page without following the redirect
    i = s.get(file_url, allow_redirects=False)
    if i.status_code == 302:
        # Redirect: the real file URL is in the Location header
        urlpath = i.headers['location']
    else:
        urlpath = file_url
    # Use the last path segment of the resolved URL as the file name
    name = urlsplit(urlpath)[2].split('/')
    name = name[len(name) - 1]
    if not os.path.exists(thepath + name):
        print urlpath
        i = s.get(urlpath)
        if i.status_code == requests.codes.ok:
            with iopen(thepath + name, 'wb') as file:
                file.write(i.content)
        else:
            return False
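For anyone copying this, here is a minimal usage sketch; the URL and download directory are just placeholders, and download_file is the function defined above:

# Usage sketch; the URL and target directory are placeholders
import requests

s = requests.Session()
download_file('https://downlaod.domain.com/xid_39428423_1', s, '/tmp/downloads/')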
Upvotes: 2
Views: 1944
Reputation: 13011
You could use the allow_redirects flag and set it to False (see the documentation). That way the .get() will not follow the redirect, which allows you to inspect the response before retrieving the file itself. (Your .head() attempt didn't work because requests does not follow redirects for HEAD requests by default, so i.url was still the original URL.)
In other words, instead of this:
i = s.head(file_url)
urlpath = i.url
You could write:
i = s.get(file_url, allow_redirects=False)
urlpath = i.headers['location']
Note that the Location header is only present when the response actually is a redirect, which is why the status-code check in your final code is a good idea.
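If the server might sometimes respond without redirecting, a defensive variant (a sketch, not part of the original answer; the URL is a placeholder) falls back to the request URL and does not depend on the exact redirect status code:

import requests

s = requests.Session()
file_url = 'https://downlaod.domain.com/xid_39428423_1'  # placeholder
i = s.get(file_url, allow_redirects=False)
# headers.get() falls back to the original URL when there is no Location header
urlpath = i.headers.get('location', file_url)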
Upvotes: 2