Reputation: 1821
Consider the following URLs:
http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:8024
What's the proper way to determine the file extensions (.m3u/.asx/.pls)? Obviously the last one doesn't have a file extension.
EDIT: I forgot to mention that m3u/asx/pls are playlists (text files) for audio streams and must be parsed differently. The goal is to determine the extension and then send the URL to the proper parsing function. E.g.
url = argv[1]
ext = GetExtension(url)
if ext == "pls":
    realurl = ParsePLS(url)
elif ext == "asx":
    realurl = ParseASX(url)
(etc.)
else:
    realurl = url
Play(realurl)
GetExtension() should return the file extension (if any), preferably without connecting to the URL.
Upvotes: 49
Views: 54384
Reputation: 14177
Use urlparse to parse the path out of the URL, then os.path.splitext to get the extension.
import os
try:
    # This should work for Python 2
    from urlparse import urlparse
except ImportError:
    # If that failed, you are on Python 3
    from urllib.parse import urlparse
url = 'http://www.plssomeotherurl.com/station.pls?id=111'
path = urlparse(url).path
ext = os.path.splitext(path)[1]
Note that the extension may not be a reliable indicator of the type of the file. The HTTP Content-Type header may be better.
Upvotes: 58
Reputation: 1
This is quite an old topic, but this one-liner is what I used:
file_ext = "." + url.split("/")[-1].split(".")[-1]
This assumes the URL ends in a file extension, with no query string or port after it.
Upvotes: 0
Reputation: 993163
The real proper way is to not use file extensions at all. Do a GET (or HEAD) request to the URL in question, and use the returned "Content-type" HTTP header to get the content type. File extensions are unreliable.
See MIME types (IANA media types) for more information and a list of useful MIME types.
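A minimal sketch of that idea in Python 3, using only the standard library. The names `playlist_kind`, `fetch_content_type` and `PLAYLIST_TYPES` are made up for illustration, and the MIME-to-playlist mapping is an assumption based on values that are commonly seen; real servers vary, so extend it for the stations you care about.

```python
from urllib.request import Request, urlopen

# Assumed mapping from commonly seen media types to playlist kinds.
# Servers are inconsistent, so treat this table as a starting point.
PLAYLIST_TYPES = {
    "audio/x-mpegurl": "m3u",
    "audio/x-scpls": "pls",
    "video/x-ms-asf": "asx",
}


def playlist_kind(content_type):
    """Map a Content-Type header value to 'm3u', 'pls', 'asx', or None."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return PLAYLIST_TYPES.get(media_type)


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

Usage would be something like `playlist_kind(fetch_content_type(url))`, falling back to playing the URL directly when the result is `None`. Some stream servers may not answer HEAD requests properly, in which case a GET that reads only the headers is the fallback.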
Upvotes: 25
Reputation: 21
A different approach that takes nothing into account except the actual file extension from a URL:
import re
def fileExt( url ):
    # compile regular expressions
    reQuery = re.compile( r'\?.*$', re.IGNORECASE )
    rePort = re.compile( r':[0-9]+', re.IGNORECASE )
    reExt = re.compile( r'(\.[A-Za-z0-9]+$)', re.IGNORECASE )
    # remove query string
    url = reQuery.sub( "", url )
    # remove port
    url = rePort.sub( "", url )
    # extract extension (note: a bare host such as http://m3u.com yields '.com')
    matches = reExt.search( url )
    if matches is not None:
        return matches.group( 1 )
    return None
edit: added handling of explicit ports from :1234
Upvotes: 1
Reputation: 1
You can try the rfc6266 module, like this:
import requests
import rfc6266
req = requests.head(downloadLink)
headersContent = req.headers['Content-Disposition']
rfcFilename = rfc6266.parse_headers(headersContent, relaxed=True).filename_unsafe
filename = requests.utils.unquote(rfcFilename)
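If you would rather not add a dependency, here is a rough standard-library sketch of the same idea. It does not use rfc6266; it leans on `email.message.Message.get_filename()` to read the `filename` parameter instead. The function names are hypothetical, and the header value is assumed to have been fetched already (for example with a HEAD request as above).

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    message = Message()
    message["Content-Disposition"] = header_value
    # get_filename() looks up the "filename" parameter of Content-Disposition.
    return message.get_filename()


def extension_from_content_disposition(header_value):
    """Return the extension (with leading dot) of the advertised filename, or ''."""
    filename = filename_from_content_disposition(header_value)
    if not filename:
        return ""
    return os.path.splitext(filename)[1]
```

Keep in mind that many servers do not send Content-Disposition at all, so this is only useful as one signal among several.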
Upvotes: 0
Reputation: 6832
This is easiest with requests and mimetypes:
import requests
import mimetypes
response = requests.get(url)
content_type = response.headers['content-type']
extension = mimetypes.guess_extension(content_type)
The extension includes a dot prefix. For example, extension is '.png' for content type 'image/png'.
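One caveat: the header often carries parameters (for example `image/png; charset=binary`), and passing the whole value to `mimetypes.guess_extension` may not find a match. A small sketch that strips the parameters first; `extension_from_content_type` is a made-up helper name, and it takes the header value as a string so it works no matter how you fetched it.

```python
import mimetypes


def extension_from_content_type(content_type):
    """Guess a dotted extension from a Content-Type header value, or None."""
    if not content_type:
        return None
    # Keep only the media type, dropping parameters like "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type)
```

For the audio streams in the question, a plain `requests.get(url)` may try to download a stream that never ends, so fetching only the headers (for example with `requests.head(url)` or `requests.get(url, stream=True)`) is probably the safer way to obtain the value passed in here.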
Upvotes: 50
Reputation: 862
To get the content type, you can write a function like this one, which uses urllib2. If you need the page content anyway, you will likely be using urllib2 already, so there is no need to import os.
import urllib2
def getContentType(pageUrl):
    page = urllib2.urlopen(pageUrl)
    pageHeaders = page.headers
    contentType = pageHeaders.getheader('content-type')
    return contentType
Upvotes: 2
Reputation: 60604
$ python3
Python 3.1.2 (release31-maint, Sep 17 2010, 20:27:33)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import splitext
>>> from urllib.parse import urlparse
>>>
>>> urls = [
...     'http://m3u.com/tunein.m3u',
...     'http://asxsomeurl.com/listen.asx:8024',
...     'http://www.plssomeotherurl.com/station.pls?id=111',
...     'http://22.198.133.16:8024',
... ]
>>>
>>> for url in urls:
...     path = urlparse(url).path
...     ext = splitext(path)[1]
...     print(ext)
...
.m3u
.asx:8024
.pls
>>>
Upvotes: 4
Reputation: 143154
File extensions are basically meaningless in URLs. For example, if you go to http://code.google.com/p/unladen-swallow/source/browse/branches/release-2009Q1-maint/Lib/psyco/support.py?r=292 do you want the extension to be ".py" despite the fact that the page is HTML, not Python?
Use the Content-Type header to determine the "type" of a URL.
Upvotes: 6
Reputation: 94202
Use urlparse; that will get most of the above sorted:
http://docs.python.org/library/urlparse.html
Then split the "path" up. You might be able to do that with os.path.splitext, but your example 2, with the :8024 on the end, needs manual handling. Are your file extensions always three letters? Or always letters and numbers? Use a regular expression.
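Here is one way that could look in Python 3, as a sketch rather than a definitive answer. `get_extension` is a hypothetical name standing in for the `GetExtension()` from the question, and the regular expression assumes that a colon followed only by digits at the very end of the path is a misplaced port, as in example 2.

```python
import os
import re
from urllib.parse import urlparse

# Assumption: ":<digits>" at the very end of the path is a stray port number.
_TRAILING_PORT = re.compile(r":\d+$")


def get_extension(url):
    """Return the lowercase extension without the dot, or '' if there is none."""
    path = urlparse(url).path              # drops scheme, host, port and query
    path = _TRAILING_PORT.sub("", path)    # '/listen.asx:8024' -> '/listen.asx'
    ext = os.path.splitext(path)[1]        # '.asx', or '' when nothing is found
    return ext[1:].lower()
```

It returns the extension without the dot so the result can be compared directly against "pls", "asx" and so on, and it never connects to the URL.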
Upvotes: 1