Dmitri

Reputation: 155

How do I get the real file URL in Python 2.7?

I have a URL, http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip, which "redirects" me to http://images.vbb.de/assets/ftp/file/286316.zip. I put "redirects" in quotes because Python reports that there is no redirect:

    In [51]: response = requests.get('http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
        ...: if response.history:
        ...:     print "Request was redirected"
        ...:     for resp in response.history:
        ...:         print resp.status_code, resp.url
        ...:     print "Final destination:"
        ...:     print response.status_code, response.url
        ...: else:
        ...:     print "Request was not redirected"
        ...:     
    Request was not redirected

The status code is also 200, response.history is empty, and response.url returns the first URL rather than the real one. Firefox, however, shows the real URL under Developer Tools -> Network. How can I get it in Python 2.7? Thanks in advance!

Upvotes: 0

Views: 457

Answers (2)

Martin Evans

Reputation: 46789

You first need to carry out the redirect manually by parsing the new window.location.href value out of the HTML returned by the first request. Requesting that URL then yields a 301 response whose Location header contains the name of the target file:

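import requests
import re

base_url = 'http://www.vbb.de'

# The first request returns an HTML page (status 200), not the file itself.
response = requests.get(base_url + '/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')

# Pull the JavaScript redirect target out of the HTML (raw string, dots escaped).
manual_redirect = base_url + re.findall(r'window\.location\.href\s+=\s+"(.*?)"', response.text)[0]

# Requesting that URL triggers a real HTTP 301, which requests follows.
response = requests.get(manual_redirect, stream=True)
target_filename = response.history[0].headers['Location'].split('/')[-1]

print "Downloading: '{}'".format(target_filename)
with open(target_filename, 'wb') as f_zip:
    # Stream the body in 1 KB chunks instead of loading it all into memory.
    for chunk in response.iter_content(chunk_size=1024):
        f_zip.write(chunk)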

This would display:

Downloading: '286316.zip'

and create a 29,464,299-byte zip file.
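The parsing step can also be pulled into a small helper that is easy to try without touching the network. This is only a sketch: the name find_js_redirect and the sample markup are made up for illustration (the real page's HTML is not shown here), and it uses urljoin rather than string concatenation so that both relative and absolute targets resolve. The import fallback lets it run on Python 2.7 as well as Python 3.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# Matches: window.location.href = "<target>"
JS_REDIRECT = re.compile(r'window\.location\.href\s*=\s*"(.*?)"')


def find_js_redirect(html, page_url):
    """Return the absolute URL assigned to window.location.href, or None."""
    match = JS_REDIRECT.search(html)
    if match is None:
        return None
    # Resolve the target against the URL of the page it was found on.
    return urljoin(page_url, match.group(1))


# Hypothetical markup, standing in for the HTML the first request returns.
sample_html = '<script>window.location.href = "/de/download/GTFS_VBB_Nov2015_Dez2016.zip";</script>'
page_url = 'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip'
print(find_js_redirect(sample_html, page_url))
```

Returning None when nothing matches avoids the IndexError that indexing re.findall(...)[0] raises if the page layout ever changes.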

Upvotes: 1

Jonathon McMurray

Reputation: 2991

You can use BeautifulSoup to read the meta refresh tag in the head of the HTML page and extract the redirect URL, e.g.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> a = requests.get("http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip")
>>> soup = BeautifulSoup(a.text, 'html.parser')
>>> soup.find_all('meta', attrs={'http-equiv': lambda x: x and x.lower() == 'refresh'})[0]['content'].split('URL=')[1]
'/de/download/GTFS_VBB_Nov2015_Dez2016.zip'

This URL is relative to the original URL's domain, making the new URL http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip. Requesting that URL downloads the ZIP file for me:

>>> import shutil
>>> a = requests.get("http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip", stream=True)
>>> with open('test.zip', 'wb') as f:
...     a.raw.decode_content = True
...     shutil.copyfileobj(a.raw, f)
...

$ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     5554  2015-11-20 15:17   agency.txt
  2151517  2015-11-20 15:17   calendar_dates.txt
    71731  2015-11-20 15:17   calendar.txt
    65424  2015-11-20 15:17   routes.txt
   816498  2015-11-20 15:17   stops.txt
196020096  2015-11-20 15:17   stop_times.txt
   365499  2015-11-20 15:17   transfers.txt
 11765292  2015-11-20 15:17   trips.txt
      113  2015-11-20 15:17   logging
---------                     -------
211261724                     9 files

The 301 status is returned on this second request, and the final URL of the response is the real file location:

>>> a.history
[<Response [301]>]
>>> a
<Response [200]>
>>> a.history[0]
<Response [301]>
>>> a.history[0].url
'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
>>> a.url
'http://images.vbb.de/assets/ftp/file/286316.zip'
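If you would rather not depend on the exact spelling 'URL=' in the content attribute, the extraction and the relative-to-absolute step can be sketched as one helper. The name resolve_meta_refresh is made up for this example, and the "0; URL=..." form of the attribute value is an assumption based on how meta refresh tags usually look (the output above only shows the part after 'URL='). It runs on Python 2.7 and Python 3.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# Case-insensitive match for the "url=<target>" part of a refresh value.
META_REFRESH_URL = re.compile(r'url\s*=\s*(.+)', re.IGNORECASE)


def resolve_meta_refresh(content, page_url):
    """Return the absolute redirect URL from a meta refresh content value, or None."""
    match = META_REFRESH_URL.search(content)
    if match is None:
        return None
    # Drop surrounding whitespace and optional quotes around the target.
    target = match.group(1).strip().strip('\'"')
    return urljoin(page_url, target)


page_url = 'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip'
# Assumed attribute value; only the path after 'URL=' comes from the output above.
content = '0; URL=/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
print(resolve_meta_refresh(content, page_url))
```

The result can be passed straight to requests.get(..., stream=True) as in the download step above.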

Upvotes: 0
