pynovice

Reputation: 7752

How can I extract the file extension from a file name in Python?

The file names are dynamic and I need to extract the file extension. The file names look like this: parallels-workstation-parallels-en_US-6.0.13976.769982.run.sh

20090209.02s1.1_sequence.txt
SRR002321.fastq.bz2
hello.tar.gz
ok.txt

For the first one I want to extract txt, for the second one I want to extract fastq.bz2, for the third one I want to extract tar.gz.

I am using the os module to get the file extension:

import os.path
extension = os.path.splitext('hello.tar.gz')[1][1:]

This gives me only gz. That is fine for a file name like ok.txt, but for hello.tar.gz I want the extension to be tar.gz.

Upvotes: 4

Views: 8658

Answers (6)

Xavi

Reputation: 189

splitext is usually not a good option if you expect your file names to contain dots. Instead, I prefer:

>>> import re
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmp").groupdict()
{'extension': 'tmp', 'name': 'blabla.blublu'}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla").groupdict()
{'extension': None, 'name': 'blabla.blublu.tmpmoreblabla'}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla.ext").groupdict()
{'extension': 'ext', 'name': 'blabla.blublu.tmpmoreblabla'}

Look at the second case, "blabla.blublu.tmpmoreblabla": if that is a file name without an extension, splitext still returns .tmpmoreblabla as the extension. The only assumptions this code makes are:

  1. The input is always a non-empty string
  2. The file name and extension may contain any character
  3. The file extension is between 1 and 4 characters long (anything longer is treated as part of the name, not as an extension)
  4. The string ends with the file extension

Of course, you can use unnamed groups by removing ?P<...>, but I prefer named groups in this case.
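
A minimal sketch of the same idea, compiling the pattern once and wrapping it in a helper. The name split_name_ext is hypothetical and not part of the answer above; the pattern is the one shown, written as a raw string.

```python
import re

# Same pattern as in the answer, compiled once.
# The raw string keeps "\." from being read as a string escape.
_PATTERN = re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$")

def split_name_ext(filename):
    """Hypothetical helper: return (name, extension), extension may be None."""
    match = _PATTERN.search(filename)
    if match is None:  # e.g. an empty string, which violates assumption 1
        return filename, None
    return match.group("name"), match.group("extension")

print(split_name_ext("blabla.blublu.tmp"))
print(split_name_ext("blabla.blublu.tmpmoreblabla"))
print(split_name_ext("hello.tar.gz"))
```

Note that because of the 4-character limit, hello.tar.gz still splits into hello.tar and gz, so this pattern alone does not produce tar.gz.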

Upvotes: 0

Dolf Andringa

Reputation: 2170

I know this is a very old topic, but for others coming across this topic I want to share my solution (I agree it depends on your program logic).

I only needed the base name without the extension. You can call splitext as often as you want: it returns (base, ext), where ext is empty if no extension was found. So for files with one or two extensions (.txt and .tar.gz, for instance) the following always returns the base name:

base = os.path.splitext(os.path.splitext(filename)[0])[0]
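
As a rough illustration of the one-liner above on the file names from the question, here is a small sketch. The helper name strip_two_extensions is hypothetical.

```python
import os

def strip_two_extensions(filename):
    """Hypothetical helper: apply os.path.splitext twice and keep the base."""
    return os.path.splitext(os.path.splitext(filename)[0])[0]

for name in ["hello.tar.gz", "ok.txt", "20090209.02s1.1_sequence.txt"]:
    print(name, "->", strip_two_extensions(name))
```

The last name comes back as 20090209.02s1, because the second splitext call treats .1_sequence as an extension. The approach is only safe when dots in the base name are not expected.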

Upvotes: 0

falsetru

Reputation: 368944

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)

assert splitext('20090209.02s1.1_sequence.txt')[1] == '.txt'
assert splitext('SRR002321.fastq.bz2')[1] == '.bz2'
assert splitext('hello.tar.gz')[1] == '.tar.gz'
assert splitext('ok.txt')[1] == '.txt'

To remove the leading dot:

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            path, ext = path[:-len(ext)], path[-len(ext):]
            break
    else:
        path, ext = os.path.splitext(path)
    return path, ext[1:]

assert splitext('20090209.02s1.1_sequence.txt')[1] == 'txt'
assert splitext('SRR002321.fastq.bz2')[1] == 'bz2'
assert splitext('hello.tar.gz')[1] == 'tar.gz'
assert splitext('ok.txt')[1] == 'txt'

Upvotes: 4

John La Rooy

Reputation: 304137

Your rules are arbitrary. How is the computer supposed to guess when it's OK for the extension to have a . in it?

At best, you'll need a set of exceptional extensions, e.g. {'.bz2', '.gz'}, and some extra logic of your own:

>>> paths = """20090209.02s1.1_sequence.txt
... SRR002321.fastq.bz2
... hello.tar.gz
... ok.txt""".splitlines()
>>> import os
>>> def my_split_ext(path):
...     name, ext = os.path.splitext(path)
...     if ext in {'.bz2', '.gz'}:
...         name, ext2 = os.path.splitext(name)
...         ext = ext2 + ext
...     return name, ext
... 
>>> list(map(my_split_ext, paths))
[('20090209.02s1.1_sequence', '.txt'), ('SRR002321', '.fastq.bz2'), ('hello', '.tar.gz'), ('ok', '.txt')]

Upvotes: 2

Jeff Tratner

Reputation: 17076

Well, you could keep iterating on root until ext is empty. In other words:

import os

filename = "hello.tar.gz"
extensions = []
root, ext = os.path.splitext(filename)
while ext:
    extensions.append(ext)
    root, ext = os.path.splitext(root)

# extensions is now ['.gz', '.tar'] (rightmost extension first)
# do something if extensions length is greater than 1
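
One possible way to finish the loop above is to put the collected pieces back in their original order and join them. This is only a sketch; the helper name all_extensions is hypothetical.

```python
import os

def all_extensions(filename):
    """Hypothetical helper: return (root, every dot-suffix joined in order)."""
    extensions = []
    root, ext = os.path.splitext(filename)
    while ext:
        extensions.append(ext)
        root, ext = os.path.splitext(root)
    extensions.reverse()  # the loop collects them right to left
    return root, "".join(extensions)

print(all_extensions("hello.tar.gz"))
print(all_extensions("20090209.02s1.1_sequence.txt"))
```

This gives .tar.gz for hello.tar.gz, but it also treats every dot in 20090209.02s1.1_sequence.txt as an extension boundary and returns .02s1.1_sequence.txt, so some filtering on the collected list is still needed.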

Upvotes: 0

U2EF1

Reputation: 13261

>>> import re
>>> re.search(r'\.(.*)', 'hello.tar.gz').groups()[0]
'tar.gz'

Obviously the above assumes there's a ., but it doesn't look like os.path will do what you want here.
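
If a name without any . is possible, a sketch along these lines avoids the failed match. It swaps the regex for str.partition, and the helper name ext_after_first_dot is hypothetical.

```python
def ext_after_first_dot(filename):
    """Hypothetical helper: everything after the first '.', or '' if none."""
    # partition returns (head, sep, tail); tail is '' when '.' is absent
    _, _, ext = filename.partition(".")
    return ext

print(ext_after_first_dot("hello.tar.gz"))
print(ext_after_first_dot("README"))
print(ext_after_first_dot("20090209.02s1.1_sequence.txt"))
```

Like the regex, this splits at the first dot, so 20090209.02s1.1_sequence.txt yields 02s1.1_sequence.txt rather than txt.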

Upvotes: 1
