Reputation: 108

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames. The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf. They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf

My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results in just: lyrics

I have the following Python code that gets me to my goal, but it seems like I'm doing it the hard way. Is there a more efficient way to do this?

testString = 'file1_with_extra_underscores_lead.pdf'

#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lead.pdf')

#Step 2: Separate the suffix and .pdf extension. 
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')

#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead

I think what I'm trying to do is possible with far fewer lines of code, and without having to reassign HTMLtext multiple times as I'm doing now.

Upvotes: 3

Answers (4)

kederrac

Reputation: 17322

You can use Path from pathlib to get the final path component without its suffix (the stem), then split it on underscores and take the last piece:

from pathlib import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]

Output:

'lead'

Upvotes: 1

Tryph

Reputation: 6209

As @wwii said in their comment, you should use os.path.splitext, which is specifically designed to separate filenames from their extensions, and str.split/str.rsplit, which are specifically designed to cut strings at a character. Using those functions, there are several ways to achieve what you want.

Unlike @wwii, I would start by discarding the extension:

import os

test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename)  # 'file1_with_extra_underscores_lead'

Then I would use split or rsplit, either with the maxsplit argument or by selecting the last (or the second) item of the resulting list, depending on which method was used. All of the following lines are equivalent (in terms of functionality at least):

filename.split('_')[-1]  # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1]  # same as previous line except it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1]  # splits only once from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1]  # same as previous line except it selects the second chunk (which is the last since only one split occurred)

The best is probably one of the last two solutions, since they do not perform useless splits.
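Putting the two steps together, a minimal sketch might look like the following. The function name last_chunk is hypothetical (it is not part of the answer above), and returning the whole stem when there is no underscore is an assumption about the desired behavior.

```python
import os


def last_chunk(filename):
    """Return the text after the last underscore, without the extension."""
    # Drop the extension first, whatever it is ('.pdf', '.txt', ...).
    stem = os.path.splitext(filename)[0]
    # Split once from the right; [-1] is the whole stem if there is no underscore.
    return stem.rsplit('_', maxsplit=1)[-1]


print(last_chunk('file1_with_extra_underscores_lead.pdf'))  # lead
print(last_chunk('song1_lyrics.txt'))  # lyrics
```

Indexing with [-1] rather than [1] is a deliberate choice here: [1] would raise an IndexError for a name that contains no underscore.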

Why is this answer better than others? (in my opinion at least)

Using pathlib is fine but a bit overkill for separating a filename from its extension; os.path.splitext could be more efficient.
Using a slice with rfind works, but it does not clearly express the code's intention and is not very readable.
Using endswith('.pdf') is OK if you are sure you will never use anything other than PDF. If one day you use a .txt file, you will have to rework your code.
I love regex, but in this case it suffers from the same caveats as the two previously discussed solutions: no clear intention, not very readable, and you will have to rework it if one day you use another extension.

Using splitext clearly indicates that you are doing something with the extension, and selecting the first item is quite explicit. This will still work with any other extension. Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far clearer than an arbitrary-looking slice.

Upvotes: 1

Booboo

Reputation: 44043

This will work with "..._lead.pdf" or "..._lead.pDf":

import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search(r'_([^_]+)\.pdf$', testString, flags=re.I)  # raw string so \. is not treated as a string escape
print(m.group(1) if m else "No match")
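If other extensions ever need to be handled, the pattern could be loosened. The following is only a sketch under that assumption: the names SUFFIX_RE and suffix_of are hypothetical, and the pattern accepts any single extension instead of only .pdf.

```python
import re

# Hypothetical variant: an underscore, the suffix (no '_' or '.'),
# then one extension of any kind at the end of the string.
SUFFIX_RE = re.compile(r'_([^_.]+)\.[^.]+$')


def suffix_of(filename):
    """Return the text between the last underscore and the extension, or None."""
    m = SUFFIX_RE.search(filename)
    return m.group(1) if m else None


print(suffix_of('file1_with_extra_underscores_lead.pdf'))  # lead
print(suffix_of('no-underscore.pdf'))  # None
```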

Upvotes: 0

Sheradil

Reputation: 467

This should do fine:

testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]

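But there is no error checking in here. If there is no "_" in the string, rfind returns -1 and the slice silently returns the whole name minus its last four characters instead of raising an error. It also assumes a four-character extension such as .pdf. You could use a regex as well. That shouldn't be difficult.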

Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
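A version of the slice with some checking could look like this. It is a sketch only: the function name suffix_by_slicing, the extension parameter, and the choice of ValueError are all assumptions made for illustration.

```python
def suffix_by_slicing(filename, extension='.pdf'):
    """Slice out the text between the last underscore and the extension."""
    underscore = filename.rfind('_')
    if underscore == -1:
        # rfind signals "not found" with -1 rather than raising.
        raise ValueError(f'no underscore in {filename!r}')
    # Case-insensitive check; assumes `extension` is given in lowercase.
    if not filename.lower().endswith(extension):
        raise ValueError(f'{filename!r} does not end with {extension!r}')
    return filename[underscore + 1:-len(extension)]


print(suffix_by_slicing('file1_with_extra_underscores_lead.pdf'))  # lead
```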

Upvotes: 0
