Reputation: 8904
I would like to get the alphabetic parts of a file from some file paths.
files = ['data/Conversion/201406/MM_CLD_Conversion_Advertiser_96337_Daily_140606.zip',
         'data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip',
         'data/AQlog/201406/orx140605.csv.zip',
         'data/AQlog/201406/orx140605.csv.zip/']
Currently I call os.path.splitext() twice to remove up to two possible file extensions. Code:
import os
import re

for f in files:
    a = os.path.splitext(os.path.splitext(os.path.split(f.rstrip('/\\'))[1])[0])[0]
    b = re.sub(r'\d+', '', a).replace('_', '')
Result:
'MMCLDConversionAdvertiserDaily'
'MMCLDMatchAdvertiserDaily'
'orx'
'orx'
Is there a faster or more Pythonic way, for example using a compiled regex? Or is relying on the os.path functions reasonable? I don't have to do this more than about 100 times, so it's not a speed problem; this is just for clarity.
Upvotes: 0
Views: 4159
Reputation: 2041
Without using regular expressions:
import os
import string
trans = string.maketrans('_', ' ')
def get_filename(path):
    # If you need to keep the directory, use os.path.split
    filename = os.path.basename(path.rstrip('/'))
    try:
        # If the extension starts at the last period, use
        # os.path.splitext.
        # If the extension starts at the 2nd-to-last period,
        # use os.path.splitext twice.
        # Continuing this pattern: since it sounds like you
        # don't know how many extensions a filename may have,
        # it may be safer to assume the file extension starts
        # at the first period, in which case use
        # filename.split('.', 1).
        filename_without_ext, extension = filename.split('.', 1)
    except ValueError:
        filename_without_ext = filename
        extension = ''
    filename_cleaned = filename_without_ext.translate(trans, string.digits)
    return filename_cleaned
>>> path = 'data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip/'
>>> get_filename(path)
'MM CLD Match Advertiser Daily '
Use whichever approach is more readable. I usually avoid regular expressions if the problem doesn't require them; in this case, plain string operations can do everything you want to do.
If you want to remove the spaces entirely (as shown in your Result), use filename.replace(' ', ''). If you are likely to have other kinds of whitespace, they can be removed with ''.join(filename.split()).
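For example, a quick sketch reusing the get_filename above on one of the paths from the question, which reproduces the Result you wanted:
>>> name = get_filename('data/Match/201406/MM_CLD_Match_Advertiser_111423_Daily_140608.csv.zip')
>>> ''.join(name.split())
'MMCLDMatchAdvertiserDaily'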
Note: If you are using Python 3, replace trans = string.maketrans('_', ' ') with trans = str.maketrans('_', ' ', string.digits), and filename_without_ext.translate(trans, string.digits) with filename_without_ext.translate(trans). This change was made as part of improving Unicode support. See more: How come string.maketrans does not work in Python 3.1?
Here's the Python 3 code:
import os
import string

trans = str.maketrans('_', ' ', string.digits)

def get_filename(path):
    filename = os.path.basename(path.rstrip('/'))
    filename_without_ext = filename.split('.', 1)[0]
    filename_cleaned = filename_without_ext.translate(trans)
    return filename_cleaned
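As a quick sanity check, using one of the trailing-slash paths from the question:
>>> get_filename('data/AQlog/201406/orx140605.csv.zip/')
'orx'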
Upvotes: 2
Reputation: 366103
You can simplify this by using the appropriate functions from os.path.
First, if you call normpath, you no longer have to worry about both kinds of path separators, just os.sep (note that this is a bad thing if you're trying to, e.g., process Windows paths on Linux… but if you're trying to process native paths on any given platform, it's exactly what you want). It also removes any trailing slashes.
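For example, on a POSIX system (on Windows the separators would also be normalized to backslashes):
>>> import os
>>> os.path.normpath('data/AQlog/201406/orx140605.csv.zip/')
'data/AQlog/201406/orx140605.csv.zip'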
Next, if you call basename instead of split, you no longer have to throw in those trailing [1]s.
Unfortunately, there's no equivalent of basename vs. split for splitext… but you can write one easily, which will make your code more readable in the exact same way as using basename.
As for the rest of it… a regexp is the obvious way to strip out any digits (although you really don't need the + there). And, since you've already got a regexp, it might be simpler to toss the _ in there instead of doing it separately.
So:
def stripext(p):
    return os.path.splitext(p)[0]

for f in files:
    a = stripext(stripext(os.path.basename(os.path.normpath(f))))
    b = re.sub(r'[\d_]', '', a)
Of course the whole thing is probably more readable if you wrap it up as a function:
def process_path(p):
    a = stripext(stripext(os.path.basename(os.path.normpath(p))))
    return re.sub(r'[\d_]', '', a)
for f in files:
    b = process_path(f)
Especially since you can now turn your loop into a list comprehension or generator expression or map call:
processed_files = map(process_path, files)
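Or, equivalently, as a list comprehension (note that in Python 3 map returns a lazy iterator, so you'd wrap it in list() there if you need an actual list):
processed_files = [process_path(f) for f in files]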
I'm simply curious about the speed, as I'm under the impression compiled regex functions are very fast.
Well, yes, in general. However, uncompiled string patterns are also very fast.
When you use a string pattern instead of a compiled regexp object, what happens is this: the re module looks up the pattern in a cache of compiled regular expressions, compiling and caching it if it isn't there yet. So, assuming you don't use many dozens of regular expressions in your app, either way your pattern gets compiled exactly once and run as a compiled expression repeatedly. The only additional cost of using uncompiled expressions is looking the pattern up in that cache dictionary, which is incredibly cheap, especially when it's a string literal: it's guaranteed to be the exact same string object every time, so its hash will be cached as well, and after the first time the dict lookup turns into just a mod and an array lookup.
For most apps, you can just assume the re cache is good enough, so the main reason for deciding whether to pre-compile regular expressions or not is readability. For example, when you've got a function that runs a slew of complicated regular expressions whose purpose is hard to understand, it can definitely help to give each one of them a name, so you can write for r in (r_phone_numbers, r_addresses, r_names): …, in which case it would be almost silly not to compile them.
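A minimal sketch of that pattern (the names and expressions below are purely illustrative placeholders, not from the question):
import re

# Purely illustrative patterns; real ones would depend on your data.
r_phone_numbers = re.compile(r'\b\d{3}[-.]\d{4}\b')
r_addresses = re.compile(r'\b\d+\s+\w+\s+(?:Street|Ave|Road)\b')
r_names = re.compile(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b')

def scrub(text):
    # Apply each named, pre-compiled regex in turn; compiling once up
    # front keeps the loop readable and avoids repeated cache lookups.
    for r in (r_phone_numbers, r_addresses, r_names):
        text = r.sub('[REDACTED]', text)
    return text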
Upvotes: 2