Eternity

Reputation: 3515

Python mechanize detect downloaded file extension

I'm trying to retrieve websites and save them to the local disk using Python Mechanize. The problem is that many URLs redirect to resources other than HTML/ASP/PHP pages. Is there an accurate way to detect what type of file a URL will return and what extension it should be saved with?

For instance, http://www.yahoo.com should be saved as an HTML file.

http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as an .exe file, since it redirects and then downloads an exe file. However, the Content-Type is declared as text/html, so I guess that is not the most reliable method.

How can I accurately detect a file's extension the way browsers do when saving a file?

Thanks heaps

Upvotes: 0

Views: 361

Answers (1)

korylprince

Reputation: 3009

http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745 should be saved as .exe file as it redirects and then downloads an exe file. the content-type is however declared as text/html so that is not the most reliable method i guess.

That's not quite correct. The page doesn't use an HTTP redirect. The problem is that Microsoft uses JavaScript to make the browser download the file. The actual file is:

http://download.microsoft.com/download/4/4/9/449b0038-ac27-4b24-bf11-dd8ebdf5cca6/sonar_setup.exe

Since mechanize can't run JavaScript for you, you'll have to resort to parsing the HTML and JavaScript files for links. That might be reasonable if you're only scraping one site that always serves its downloads the same way. If you're looking for a general method, you'll have to find another way entirely.
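As a rough illustration of the "parse the page for links" approach, here is a minimal sketch that scans raw page source (HTML plus any inline JavaScript) for absolute URLs ending in a download-like extension. The helper name `find_download_links`, the default extension list, and the sample markup are all assumptions made for the example; the real Microsoft page may embed its link differently, so treat this as a starting point rather than a working scraper.

```python
import re

# Matches absolute http(s) URLs up to the first quote, bracket,
# backslash, or whitespace character.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\\]+""")


def find_download_links(page_source, extensions=(".exe", ".msi", ".zip")):
    """Return unique URLs in page_source whose path ends in one of extensions.

    Hypothetical helper: it does a plain text scan and does not execute
    or understand JavaScript.
    """
    wanted = tuple(ext.lower() for ext in extensions)
    links = []
    for url in URL_PATTERN.findall(page_source):
        # Ignore the query string and fragment when checking the extension.
        path = url.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(wanted) and url not in links:
            links.append(url)
    return links


# Made-up page source, only to show the call.
sample = (
    '<script>window.location = "http://download.example.com/files/setup.exe";</script>'
    '<a href="http://example.com/page.html">home</a>'
)
print(find_download_links(sample))
```

With mechanize, the page source would presumably come from reading the response returned by `br.open(url)`. URLs that are assembled at runtime by JavaScript will not be found this way.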

There are only two ways a browser can know what a downloaded file is:

  1. Check the Content-Type
  2. Check the path extension (I'm not sure if browsers even do 2.)
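The two checks above can be sketched with the standard library alone. This is an assumption-laden heuristic, not what any particular browser does: the function name `guess_extension` is made up, generic types such as application/octet-stream are treated as "unknown" so the URL path gets a say, and a Content-Disposition filename is consulted first, which is an extra step not mentioned in the answer.

```python
import mimetypes
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit

# Content types that say nothing useful about the real file format.
GENERIC_TYPES = {"application/octet-stream", "binary/octet-stream"}


def guess_extension(final_url, headers):
    """Guess a file extension (with leading dot) or return None.

    final_url: the URL after any HTTP redirects have been followed.
    headers:   a dict of response header names to values.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    # 0. (Addition, not in the answer) filename from Content-Disposition.
    disposition = lowered.get("content-disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        filename = msg.get_filename()
        if filename:
            ext = posixpath.splitext(filename)[1]
            if ext:
                return ext.lower()

    # 1. Content-Type, unless it is a generic "just bytes" type.
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type and content_type not in GENERIC_TYPES:
        ext = mimetypes.guess_extension(content_type)
        if ext:
            return ext

    # 2. Extension of the URL path (query string ignored).
    path = unquote(urlsplit(final_url).path)
    ext = posixpath.splitext(path)[1]
    return ext.lower() or None


print(guess_extension(
    "http://www.microsoft.com/en-us/download/confirmation.aspx?id=3745",
    {"Content-Type": "text/html; charset=utf-8"},
))
```

With mechanize, the inputs would presumably be `response.geturl()` and the headers from `response.info()`. For the confirmation page itself this still yields an HTML extension, which is correct: that response really is an HTML page, and the .exe only shows up once the direct download URL is requested.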

Upvotes: 1
