Reputation: 7752
The file names are dynamic and I need to extract the file extension. The file names look like this: parallels-workstation-parallels-en_US-6.0.13976.769982.run.sh
20090209.02s1.1_sequence.txt
SRR002321.fastq.bz2
hello.tar.gz
ok.txt
For the first one I want to extract txt, for the second one fastq.bz2, and for the third one tar.gz.
I am using the os module to get the file extension:
import os.path
extension = os.path.splitext('hello.tar.gz')[1][1:]
This gives me only gz, which is fine if the file name is ok.txt, but for this one I want the extension to be tar.gz.
Upvotes: 4
Views: 8658
Reputation: 189
splitext is usually not a good option if you expect your filenames to contain dots; instead I prefer:
>>> import re
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmp").groupdict()
{'extension': 'tmp', 'name': 'blabla.blublu'}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla").groupdict()
{'extension': None, 'name': 'blabla.blublu.tmpmoreblabla'}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla.ext").groupdict()
{'extension': 'ext', 'name': 'blabla.blublu.tmpmoreblabla'}
Check the second case, "blabla.blublu.tmpmoreblabla": if that is a filename without an extension, splitext still returns .tmpmoreblabla as the extension. The main assumption this code makes, visible in the pattern, is that an extension is 1 to 4 characters long.
Of course you can use unnamed groups by removing the ?P<> parts, but I prefer named groups in this case.
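As a minimal sketch, the pattern above can be compiled once and wrapped in a helper. The function name `split_name_ext` is hypothetical; the pattern is the one shown above, written as a raw string so that `\.` does not trigger an invalid-escape warning on newer Python versions.

```python
import re

# Same pattern as above, compiled once and written as a raw string.
_PATTERN = re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$")


def split_name_ext(filename):
    """Return (name, extension); extension is None when no short suffix is found."""
    match = _PATTERN.search(filename)
    if match is None:  # e.g. an empty string
        return filename, None
    return match.group("name"), match.group("extension")


print(split_name_ext("blabla.blublu.tmp"))
print(split_name_ext("hello.tar.gz"))
```

Note that for hello.tar.gz this returns gz as the extension, so on its own it does not produce the tar.gz the question asks for.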
Upvotes: 0
Reputation: 2170
I know this is a very old topic, but for others coming across it I want to share my solution (I agree it depends on your program logic).
I only needed the base name without the extension. You can apply splitext as often as you want: it returns (base, ext), where ext is empty if no extension was found. So for files with a single or double extension (.txt and .tar.gz, for instance), the following always returns the base name:
base = os.path.splitext(os.path.splitext(filename)[0])[0]
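A small sketch of how the double splitext behaves on the file names from the question; the helper name `strip_two_extensions` is made up for illustration. It also shows a limitation: when the base name itself contains dots, the second call removes part of it.

```python
import os.path


def strip_two_extensions(filename):
    """Apply os.path.splitext twice and return what is left of the name."""
    return os.path.splitext(os.path.splitext(filename)[0])[0]


for name in ["hello.tar.gz", "ok.txt", "20090209.02s1.1_sequence.txt"]:
    print(name, "->", strip_two_extensions(name))
```

The first two names reduce to hello and ok, but the third is cut down to 20090209.02s1, because .1_sequence is treated as a second extension.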
Upvotes: 0
Reputation: 368944
import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)

assert splitext('20090209.02s1.1_sequence.txt')[1] == '.txt'
assert splitext('SRR002321.fastq.bz2')[1] == '.bz2'
assert splitext('hello.tar.gz')[1] == '.tar.gz'
assert splitext('ok.txt')[1] == '.txt'
Removing dot:
import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            path, ext = path[:-len(ext)], path[-len(ext):]
            break
    else:
        path, ext = os.path.splitext(path)
    return path, ext[1:]

assert splitext('20090209.02s1.1_sequence.txt')[1] == 'txt'
assert splitext('SRR002321.fastq.bz2')[1] == 'bz2'
assert splitext('hello.tar.gz')[1] == 'tar.gz'
assert splitext('ok.txt')[1] == 'txt'
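The question asks for fastq.bz2 on the second file, which the list above does not cover. A possible variant, sketched under the assumption that you know your compound extensions in advance, takes them as a parameter and compares case-insensitively. The names `COMPOUND_EXTENSIONS` and `split_known_ext` are hypothetical.

```python
import os.path

# Hypothetical list of compound extensions to treat as a single unit.
# Entries must be lowercase because the comparison lowercases the path.
COMPOUND_EXTENSIONS = (".tar.gz", ".tar.bz2", ".fastq.bz2")


def split_known_ext(path, compound=COMPOUND_EXTENSIONS):
    """Split off a known compound extension, else fall back to os.path.splitext."""
    lowered = path.lower()
    # Try longer suffixes first so the most specific entry wins.
    for ext in sorted(compound, key=len, reverse=True):
        if lowered.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)


print(split_known_ext("SRR002321.fastq.bz2"))
print(split_known_ext("20090209.02s1.1_sequence.txt"))
```

Anything not in the list falls through to the standard single-suffix behaviour, so dots inside the base name are left alone.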
Upvotes: 4
Reputation: 304137
Your rules are arbitrary: how is the computer supposed to guess when it's OK for the extension to have a . in it?
At best you'll have to keep a set of exceptional extensions, e.g. {'.bz2', '.gz'}, and add some extra logic yourself:
>>> paths = """20090209.02s1.1_sequence.txt
... SRR002321.fastq.bz2
... hello.tar.gz
... ok.txt""".splitlines()
>>> import os
>>> def my_split_ext(path):
...     name, ext = os.path.splitext(path)
...     if ext in {'.bz2', '.gz'}:
...         name, ext2 = os.path.splitext(name)
...         ext = ext2 + ext
...     return name, ext
...
>>> # list() is needed on Python 3, where map() returns an iterator
>>> list(map(my_split_ext, paths))
[('20090209.02s1.1_sequence', '.txt'), ('SRR002321', '.fastq.bz2'), ('hello', '.tar.gz'), ('ok', '.txt')]
Upvotes: 2
Reputation: 17076
Well, you could keep iterating on root until ext is empty. In other words:
import os

filename = "hello.tar.gz"
extensions = []
root, ext = os.path.splitext(filename)
while ext:
    extensions.append(ext)
    root, ext = os.path.splitext(root)
# do something if extensions length is greater than 1
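One way to use the loop, sketched here with a hypothetical helper `split_all_extensions`, is to join the collected suffixes back together. Because splitext peels suffixes off from the right, the list has to be reversed first.

```python
import os.path


def split_all_extensions(filename):
    """Return (root, joined suffixes) by applying os.path.splitext until it finds nothing."""
    extensions = []
    root, ext = os.path.splitext(filename)
    while ext:
        extensions.append(ext)
        root, ext = os.path.splitext(root)
    # Suffixes were collected right to left, so reverse before joining.
    return root, "".join(reversed(extensions))


print(split_all_extensions("hello.tar.gz"))
print(split_all_extensions("20090209.02s1.1_sequence.txt"))
```

This gives .tar.gz for hello.tar.gz, but it also treats every dotted part of 20090209.02s1.1_sequence.txt as an extension, so some extra check on the collected list is still needed for names like that.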
Upvotes: 0
Reputation: 13261
>>> import re
>>> re.search(r'\.(.*)', 'hello.tar.gz').groups()[0]
'tar.gz'
Obviously the above assumes there's a . in the file name, but it doesn't look like os.path will do what you want here.
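A sketch of the same "everything after the first dot" idea that also copes with names without a dot and with dots in directory names. It swaps the regex for `str.partition` on the base name; the helper name `ext_after_first_dot` is hypothetical.

```python
import os.path


def ext_after_first_dot(path):
    """Return everything after the first dot of the file name, or '' if there is none."""
    name = os.path.basename(path)  # ignore dots in directory names
    _, _, extension = name.partition(".")
    return extension


print(ext_after_first_dot("hello.tar.gz"))
print(ext_after_first_dot("README"))
print(ext_after_first_dot("20090209.02s1.1_sequence.txt"))
```

As with the regex, a base name that contains dots is swallowed into the result: the last call returns 02s1.1_sequence.txt rather than txt.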
Upvotes: 1