yangyang
yangyang

Reputation: 531

Extract date from file name with import re in python

My file name looks like as follow:

show_data_paris_12112019.xlsx

I want to extract the date only and I have tried this script:

date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'

re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')

but It gives me errors. How can I fix this?

Thank you.

Upvotes: 2

Views: 363

Answers (2)

Javad
Javad

Reputation: 92

It looks to me like the regex you're using is also at fault, and so it fails when trying to group(0) from the empty return.

Assuming all your dates are stored as digits the following regex i've made seems to work quite well.

(?!.+_)\d+(?=\.xlsx)

The next issue is when formatting the date it experiences an issue with the way you're formatting the date, to me it looks like 12112019 would be the 12/11/2019 obviously this could also be the 11/12/2019 but the basic is that we change the way strftime formats the date.

So for the date / month / year format we would use

# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')

And we would simply swap %d and %m for the month / date / year format. So your complete code would look something like this:

date = os.path.basename(xls)
pattern = "(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%d%m%Y')

For further information on how to use strftime see https://strftime.org/.

Upvotes: 1

Barmar
Barmar

Reputation: 781068

_* matches a sequence of zero or more underscore characters.

(?<=_) means that it has to be preceded by an underscore.

(?=\.xlsx) means that it has to be followed by .xlsx.

So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the data is between the underscore and .xlsx.

You should match \d+ rather than _* between the lookbehind and lookahead.

pattern = r'(?<=_)\d+(?=\.xlsx)'

And if the data is always 8 digits, use \d{8} to be more specific.

Upvotes: 1

Related Questions