Reputation: 108
I'm using Python to create HTML links from a listing of filenames. The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf. They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code getting to my goal, but seems like I'm doing it the hard way. Is there is a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lyrics.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.
Upvotes: 3
Views: 2075
Reputation: 17322
you can use Path from pathlib to extract the final path component, without its suffix:
from path import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
outout:
'lead'
Upvotes: 1
Reputation: 6209
As @wwii said in its comment, you should use os.path.splitext
which is especially designed to separate filenames from their extension and str.split
/str.rsplit
which are especially designed to cut strings at a character. Using thoses functions there is several ways to achieve what you want.
Unlike @wwii, I would start by discarding the extension:
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split
or rsplit
, with the maxsplit
argument or selecting the last (or the second index) of the resulting list (according to what method have been used). Every following line are equivalent (in term of functionality at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as previous line except it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # split only one time from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as previous line except it select the second chunks (which is the last since only one split occured)
The best is probably one of the two last solutions since it will not do useless splits.
Why is this answer better than others? (in my opinion at least)
pathlib
is fine but a bit overkill for separating a filename from its extension, os.path.splitext
could be more efficient.rfind
works but is does not clearly express the code intention and it is not so readable.endswith('.pdf')
is OK if you are sure you will never use anything else than PDF. If one day you use a .txt
, you will have to rework your code.Using splitext
clearly indicates that you do something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1)
and selecting the last index is also quite expressive and far more clear than a arbitrary looking slice.
Upvotes: 1
Reputation: 44043
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search('_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")
Upvotes: 0
Reputation: 467
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
But, no error checking in here. Will fail if there is no "_" in the string. You could use a regex as well. That shouldn't be difficult.
Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
Upvotes: 0