Reputation: 155
I have a URL, http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip, which "redirects" me to http://images.vbb.de/assets/ftp/file/286316.zip. I put "redirects" in quotes because Python says there is no redirect:
In [51]: response = requests.get('http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
    ...: if response.history:
    ...:     print "Request was redirected"
    ...:     for resp in response.history:
    ...:         print resp.status_code, resp.url
    ...:     print "Final destination:"
    ...:     print response.status_code, response.url
    ...: else:
    ...:     print "Request was not redirected"
    ...:
Request was not redirected
The status code is also 200. response.history is empty, and response.url gives the first URL, not the real one. The real URL is visible in Firefox, though, under Developer Tools -> Network. How do I get it in Python 2.7? Thanks in advance!
Upvotes: 0
Views: 457
Reputation: 46789
You first need to carry out the redirect manually by parsing the new window.location.href value out of the HTML returned by the first request. Requesting that URL then produces a 301 reply, and the name of the target file is contained in its Location header:
import re
import requests

base_url = 'http://www.vbb.de'
response = requests.get(base_url + '/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')

# The first response is an HTML page that redirects via JavaScript,
# so pull the target path out of the page source.
manual_redirect = base_url + re.findall(r'window\.location\.href\s+=\s+"(.*?)"', response.text)[0]

# This request is answered with a real 301, which requests follows automatically.
response = requests.get(manual_redirect, stream=True)
target_filename = response.history[0].headers['Location'].split('/')[-1]
print "Downloading: '{}'".format(target_filename)

# Stream the body to disk in 1 KB chunks.
with open(target_filename, 'wb') as f_zip:
    for chunk in response.iter_content(chunk_size=1024):
        f_zip.write(chunk)
This would display:
Downloading: '286316.zip'
and result in a 29,464,299 byte zip file being created.
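The extraction step can also be pulled out into a small helper that resolves the path with urljoin instead of string concatenation. This is only a sketch: the name find_js_redirect is hypothetical, the pattern is a slightly loosened variant of the one above (it also accepts single quotes and no spaces around the equals sign), and the HTML in the usage example is made up for illustration rather than copied from the site.

```python
import re

try:  # Python 3
    from urllib.parse import urljoin
except ImportError:  # Python 2.7
    from urlparse import urljoin

# Matches: window.location.href = "..."  (single or double quotes)
JS_REDIRECT = re.compile(r'window\.location\.href\s*=\s*["\'](.*?)["\']')


def find_js_redirect(html, page_url):
    """Return the absolute URL assigned to window.location.href, or None.

    Hypothetical helper: html is the page source, page_url is the URL
    the page was fetched from (used to resolve relative targets).
    """
    match = JS_REDIRECT.search(html)
    if match is None:
        return None
    return urljoin(page_url, match.group(1))


# Usage with made-up HTML (an assumption, not the site's real markup):
sample = '<script>window.location.href = "/de/download/file.zip";</script>'
target = find_js_redirect(sample, 'http://www.vbb.de/de/datei/file.zip')
```

Returning None when nothing matches avoids the IndexError that indexing re.findall(...)[0] raises on a page without a JavaScript redirect.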
Upvotes: 1
Reputation: 2991
You can use BeautifulSoup to read the meta refresh tag in the head of the HTML page and get the redirect URL, e.g.:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> a = requests.get("http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip")
>>> soup = BeautifulSoup(a.text, 'html.parser')
>>> soup.find_all('meta', attrs={'http-equiv': lambda x: x and x.lower() == 'refresh'})[0]['content'].split('URL=')[1]
'/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
This URL is relative to the original URL's domain, making the new URL http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip. Requesting it appears to download the ZIP file for me:
>>> import shutil
>>> a = requests.get("http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip", stream=True)
>>> with open('test.zip', 'wb') as f:
...     a.raw.decode_content = True
...     shutil.copyfileobj(a.raw, f)
...
$ unzip -l test.zip
Archive:  test.zip
   Length      Date    Time    Name
---------  ---------- -----   ----
     5554  2015-11-20 15:17   agency.txt
  2151517  2015-11-20 15:17   calendar_dates.txt
    71731  2015-11-20 15:17   calendar.txt
    65424  2015-11-20 15:17   routes.txt
   816498  2015-11-20 15:17   stops.txt
196020096  2015-11-20 15:17   stop_times.txt
   365499  2015-11-20 15:17   transfers.txt
 11765292  2015-11-20 15:17   trips.txt
      113  2015-11-20 15:17   logging
---------                     -------
211261724                     9 files
It is this second request that is answered with a 301 status:
>>> a.history
[<Response [301]>]
>>> a
<Response [200]>
>>> a.history[0]
<Response [301]>
>>> a.history[0].url
'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
>>> a.url
'http://images.vbb.de/assets/ftp/file/286316.zip'
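Rather than splitting on 'URL=' and prepending the domain by hand, the content value of the meta tag could be resolved with urljoin. The following is a rough sketch only: refresh_target is a hypothetical helper, and the "0; URL=..." form of the content value in the usage example is an assumption about what the page sends, since only the part after 'URL=' is shown above.

```python
import re

try:  # Python 3
    from urllib.parse import urljoin
except ImportError:  # Python 2.7
    from urlparse import urljoin


def refresh_target(content, page_url):
    """Resolve the URL inside a meta refresh 'content' value.

    Hypothetical helper: content is the tag's content attribute,
    page_url is the URL of the page that contained the tag.
    Returns None if the value holds no URL part.
    """
    # 'url=' is matched case-insensitively; spaces around '=' are tolerated.
    match = re.search(r'url\s*=\s*(.+)', content, flags=re.IGNORECASE)
    if match is None:
        return None
    target = match.group(1).strip().strip('\'"')
    return urljoin(page_url, target)


# Usage; the "0; " delay prefix is assumed, not taken from the real page.
resolved = refresh_target(
    '0; URL=/de/download/GTFS_VBB_Nov2015_Dez2016.zip',
    'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip',
)
```

The resolved value can then be passed to requests.get(..., stream=True) as in the session above.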
Upvotes: 0