Reputation: 8904
I would like to get the alphabetic parts of a file from some file paths.
files = ['data/Conversion/201406/MM_CLD_Conversion_Advertiser_96337_Daily_140606.zip',
         'data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip',
         'data/AQlog/201406/orx140605.csv.zip',
         'data/AQlog/201406/orx140605.csv.zip/']
Currently I call os.path.splitext() twice to remove up to two possible file extensions. Code:
import os
import re

for f in files:
    a = os.path.splitext(os.path.splitext(os.path.split(f.rstrip('/\\'))[1])[0])[0]
    b = re.sub(r'\d+', '', a).replace('_', '')
Result:
'MMCLDConversionAdvertiserDaily'
'MMCLDMatchAdvertiserDaily'
'orx'
'orx'
Is there a faster or more Pythonic way, for example using a compiled regex? Or is relying on the os.path functions reasonable? I don't have to do this more than about 100 times, so it's not a speed problem; this is just for clarity.
Upvotes: 0
Views: 4159
Reputation: 2041
Without using regular expressions:
import os
import string
trans = string.maketrans('_', ' ')
def get_filename(path):
    # If you need to keep the directory, use os.path.split
    filename = os.path.basename(path.rstrip('/'))
    try:
        # If the extension starts at the last period, use
        # os.path.splitext.
        # If the extension starts at the 2nd-to-last period,
        # use os.path.splitext twice.
        # Continuing this pattern: since it sounds like you
        # don't know how many extensions a filename may have,
        # it may be safer to assume the file extension starts
        # at the first period, in which case use
        # filename.split('.', 1).
        filename_without_ext, extension = filename.split('.', 1)
    except ValueError:
        filename_without_ext = filename
        extension = ''
    filename_cleaned = filename_without_ext.translate(trans, string.digits)
    return filename_cleaned
>>> path = 'data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip/'
>>> get_filename(path)
'MM CLD Match Advertiser Daily '
Use whichever approach is more readable. I usually avoid regular expressions if the problem doesn't require them; in this case, plain string operations can do everything you want to do.
If you want to remove the spaces entirely (as shown in your Result), use filename.replace(' ', ''). If you are likely to have other kinds of whitespace, they can be removed with ''.join(filename.split()).
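For example, a quick sketch reusing the get_filename above on one of the paths from the question, which reproduces the Result you wanted:
>>> name = get_filename('data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip')
>>> ''.join(name.split())
'MMCLDMatchAdvertiserDaily'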
Note: If you are using Python 3, replace trans = string.maketrans('_', ' ') with trans = str.maketrans('_', ' ', string.digits), and filename_without_ext.translate(trans, string.digits) with filename_without_ext.translate(trans). This change was made as part of improving Unicode support. See more: How come string.maketrans does not work in Python 3.1?
Here's the Python 3 code:
import os
import string

trans = str.maketrans('_', ' ', string.digits)

def get_filename(path):
    filename = os.path.basename(path.rstrip('/'))
    filename_without_ext = filename.split('.', 1)[0]
    filename_cleaned = filename_without_ext.translate(trans)
    return filename_cleaned
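As a quick sanity check, using one of the trailing-slash paths from the question:
>>> get_filename('data/AQlog/201406/orx140605.csv.zip/')
'orx'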
Upvotes: 2
Reputation: 366103
You can simplify this by using the appropriate functions from os.path.
First, if you call normpath, you no longer have to worry about both kinds of path separators, just os.sep (note that this is a bad thing if you're trying to, e.g., process Windows paths on Linux… but if you're trying to process native paths on any given platform, it's exactly what you want). It also removes any trailing slashes.
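For example, on a POSIX system (on Windows the separators would also be normalized to backslashes):
>>> import os
>>> os.path.normpath('data/AQlog/201406/orx140605.csv.zip/')
'data/AQlog/201406/orx140605.csv.zip'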
Next, if you call basename instead of split, you no longer have to throw in those trailing [1]s.
Unfortunately, there's no equivalent of basename vs. split for splitext… but you can write one easily, which will make your code more readable in the exact same way as using basename.
As for the rest of it… a regexp is the obvious way to strip out any digits (although you really don't need the + there). And, since you've already got a regexp, it might be simpler to toss the _ in there instead of doing it separately.
So:
def stripext(p):
    return os.path.splitext(p)[0]

for f in files:
    a = stripext(stripext(os.path.basename(os.path.normpath(f))))
    b = re.sub(r'[\d_]', '', a)
Of course the whole thing is probably more readable if you wrap it up as a function:
def process_path(p):
    a = stripext(stripext(os.path.basename(os.path.normpath(p))))
    return re.sub(r'[\d_]', '', a)
for f in files:
    b = process_path(f)
Especially since you can now turn your loop into a list comprehension or generator expression or map call:
processed_files = map(process_path, files)
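Or, equivalently, as a list comprehension (note that in Python 3 map returns a lazy iterator, so you'd wrap it in list() there if you need an actual list):
processed_files = [process_path(f) for f in files]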
I'm simply curious about the speed, as I'm under the impression compiled regex functions are very fast.
Well, yes, in general. However, uncompiled string patterns are also very fast.
When you use a string pattern instead of a compiled regexp object, what happens is this: the re module looks up the pattern in a cache of compiled regular expressions, compiling and caching it if it isn't there yet. So, assuming you don't use many dozens of regular expressions in your app, either way your pattern gets compiled exactly once and run as a compiled expression repeatedly. The only additional cost of using uncompiled expressions is looking the pattern up in that cache dictionary, which is incredibly cheap, especially when it's a string literal: it's guaranteed to be the exact same string object every time, so its hash will be cached as well, and after the first time the dict lookup turns into just a mod and an array lookup.
For most apps, you can just assume the re cache is good enough, so the main reason for deciding whether to pre-compile regular expressions or not is readability. For example, when you've got a function that runs a slew of complicated regular expressions whose purpose is hard to understand, it can definitely help to give each one of them a name, so you can write for r in (r_phone_numbers, r_addresses, r_names): …, in which case it would be almost silly not to compile them.
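A minimal sketch of that pattern (the names and expressions below are purely illustrative placeholders, not from the question):
import re

# Purely illustrative patterns; real ones would depend on your data.
r_phone_numbers = re.compile(r'\b\d{3}[-.]\d{4}\b')
r_addresses = re.compile(r'\b\d+\s+\w+\s+(?:Street|Ave|Road)\b')
r_names = re.compile(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b')

def scrub(text):
    # Apply each named, pre-compiled regex in turn; compiling once up
    # front keeps the loop readable and avoids repeated cache lookups.
    for r in (r_phone_numbers, r_addresses, r_names):
        text = r.sub('[REDACTED]', text)
    return text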
Upvotes: 2