Reputation: 3515
I'm trying to retrieve websites and save them to the local disk using Python mechanize. The problem is that many websites redirect to URLs that are not html/asp/php. Is there an accurate method to detect what extension a URL has and what type of file it will retrieve?
For instance: http://www.yahoo.com should be saved as an html file.
http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as a .exe file, as it redirects and then downloads an exe file. The Content-Type, however, is declared as text/html, so I guess that is not the most reliable method.
How can I accurately detect a file's extension the way browsers do when saving a file?
Thanks heaps
Upvotes: 0
Views: 361
Reputation: 3009
http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as a .exe file, as it redirects and then downloads an exe file. The Content-Type, however, is declared as text/html, so I guess that is not the most reliable method.
That's not quite correct: there is no HTTP redirect involved. The problem is that Microsoft uses JavaScript to make the browser download the file. The actual file is:
http://download.microsoft.com/download/4/4/9/449b0038-ac27-4b24-bf11-dd8ebdf5cca6/sonar_setup.exe
Since mechanize can't run JavaScript for you, you'll have to resort to parsing the HTML and JavaScript for links (a rough sketch of that follows below). That might be reasonable if you're only scraping one site that serves its downloads the same way. If you're looking for a general method, you'll have to find another approach entirely.
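A minimal sketch of that site-specific approach, assuming the direct .exe URL appears somewhere in the confirmation page's source (which it does for this Microsoft page, but is not guaranteed elsewhere):

```python
import re
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)

# Fetch the confirmation page; mechanize will not run its JavaScript.
page = br.open("http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745")
html = page.read()

# Scrape any absolute URL ending in .exe out of the raw HTML/JavaScript.
exe_urls = re.findall(r'https?://[^\s"\'<>]+\.exe', html)

if exe_urls:
    # Download the first match and save it under its own filename.
    data = br.open(exe_urls[0]).read()
    with open(exe_urls[0].rsplit("/", 1)[-1], "wb") as f:
        f.write(data)
```

This obviously breaks the moment the site changes how it embeds the link, which is why it's only reasonable for a single known site.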
The only way a browser can know what a downloaded file is, is by looking at the response itself: the filename in the Content-Disposition header, the Content-Type header, and, failing those, the extension in the URL path.
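A minimal sketch of that logic, assuming a mechanize/urllib2-style response object (the function name and fallbacks here are illustrative, not a fixed API):

```python
import cgi
import mimetypes
import os
import urlparse

def guess_filename(resp):
    headers = resp.info()
    # 1. Content-Disposition, e.g. 'attachment; filename="setup.exe"'.
    cd = headers.get("Content-Disposition", "")
    if cd:
        _, params = cgi.parse_header(cd)
        if params.get("filename"):
            return params["filename"]
    # 2. Fall back to the path component of the final (post-redirect) URL.
    path = urlparse.urlsplit(resp.geturl()).path
    name = os.path.basename(path)
    if os.path.splitext(name)[1]:
        return name
    # 3. Last resort: derive an extension from the Content-Type header.
    ctype = headers.get("Content-Type", "").split(";")[0].strip()
    ext = mimetypes.guess_extension(ctype) or ".html"
    return (name or "index") + ext
```

None of that helps when the download is triggered by JavaScript, because then the headers you see belong to the HTML confirmation page, not to the file itself.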
Upvotes: 1