Jonathan E. Landrum

Reputation: 3182

Extract substring from filename in Python?

I have a directory full of files that have date strings as part of the filenames:

file_type_1_20140722_foo.txt
file_type_two_20140723_bar.txt
filetypethree20140724qux.txt

I need to get these date strings from the filenames and save them in an array:

['20140722', '20140723', '20140724']

But they can appear at various places in the filename, so I can't just use substring notation and extract it directly. In the past, the way I've done something similar to this in Bash is like so:

date=$(echo $file | egrep -o '[[:digit:]]{8}' | head -n1)

But I can't use Bash for this because it sucks at math (I need to be able to add and subtract floating point numbers). I've tried glob.glob() and re.match(), but both return empty sets:

>>> dates = [file for file in sorted(os.listdir('.')) if re.match("[0-9]{8}", file)]
>>> print(dates)
[]

I know the problem is it's looking for complete file names that are eight digits long, but I have no idea how to make it look for substrings instead. Any ideas?

Upvotes: 1

Views: 14536

Answers (3)

unutbu

Reputation: 881037

>>> import re
>>> import os
>>> [date for file in os.listdir('.') for date in re.findall(r"(\d{8})", file)]
['20140722', '20140723']

Note that if a filename has a 9-digit substring, then only the first 8 digits will be matched. If a filename contains a 16-digit substring, there will be 2 non-overlapping matches.
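If longer digit runs should be skipped rather than truncated or split, one option is to wrap the pattern in negative lookarounds. This is only a sketch, assuming the dates are always exactly eight digits; the name `exact_dates` and the two `log_...` filenames are made up for illustration:

```python
import re

# (?<!\d) and (?!\d) reject a match that touches another digit on either
# side, so only runs of exactly eight digits are returned.
EXACT_EIGHT = re.compile(r"(?<!\d)\d{8}(?!\d)")

def exact_dates(filenames):
    """Return every standalone 8-digit run found in the given names."""
    return [date for name in filenames for date in EXACT_EIGHT.findall(name)]

names = [
    "file_type_1_20140722_foo.txt",
    "filetypethree20140724qux.txt",
    "log_201407221.txt",         # hypothetical: 9 digits, skipped
    "log_2014072220140723.txt",  # hypothetical: 16 digits, skipped
]
print(exact_dates(names))  # ['20140722', '20140724']
```

Whether skipping or truncating is the right behavior depends on how the files are actually named.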

Upvotes: 6

Daniel

Reputation: 42788

re.match only matches at the beginning of the string, while re.search finds the pattern anywhere in it. Or you can use findall:

import os
import re
extract_dates = re.compile("[0-9]{8}").findall
# Keep the first 8-digit run from each filename that has one, in sorted order.
dates = sorted(
    found[0] for found in map(extract_dates, os.listdir('.')) if found)
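To see the difference between re.match and re.search on one of the filenames from the question, here is a small sketch; the variable names `name` and `pattern` are just for illustration:

```python
import re

name = "file_type_1_20140722_foo.txt"
pattern = re.compile(r"[0-9]{8}")

# match() only tries position 0, where this name has "f", so it fails.
print(pattern.match(name))            # None

# search() scans forward until the pattern fits somewhere.
print(pattern.search(name).group(0))  # 20140722

# match() does succeed when the digits lead the name.
print(pattern.match("20140722_foo.txt").group(0))  # 20140722
```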

Upvotes: 1

Andrew Johnson

Reputation: 3196

Your regular expression looks good, but you should be using re.search instead of re.match so that it will search for that expression anywhere in the string:

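import re
filename = "file_type_1_20140722_foo.txt"  # example name from the question
r = re.compile("[0-9]{8}")
m = r.search(filename)  # search() scans the whole string, unlike match()
if m:
    print(m.group(0))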
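Applied to a whole list of names, the same idea might look like the sketch below. The helper name `first_dates` and the `README.txt` entry are hypothetical, and it assumes only the first 8-digit run in each name matters:

```python
import re

DATE_RE = re.compile(r"[0-9]{8}")

def first_dates(filenames):
    """Return the first 8-digit run from each name that contains one."""
    dates = []
    for name in filenames:
        m = DATE_RE.search(name)
        if m:  # names without eight consecutive digits are skipped
            dates.append(m.group(0))
    return dates

sample = [
    "file_type_1_20140722_foo.txt",
    "file_type_two_20140723_bar.txt",
    "filetypethree20140724qux.txt",
    "README.txt",  # hypothetical file with no date in its name
]
print(first_dates(sample))  # ['20140722', '20140723', '20140724']
```

On a real directory the call would be something like `first_dates(sorted(os.listdir('.')))`, after `import os`.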

Upvotes: 2
