Reputation: 531
My file name looks like as follow:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')
but It gives me errors. How can I fix this?
Thank you.
Upvotes: 2
Views: 363
Reputation: 92
It looks to me like the regex you're using is also at fault, and so it fails when trying to group(0)
from the empty return.
Assuming all your dates are stored as digits the following regex i've made seems to work quite well.
(?!.+_)\d+(?=\.xlsx)
The next issue is when formatting the date it experiences an issue with the way you're formatting the date, to me it looks like 12112019 would be the 12/11/2019 obviously this could also be the 11/12/2019 but the basic is that we change the way strftime formats the date.
So for the date / month / year format we would use
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / date / year format. So your complete code would look something like this:
date = os.path.basename(xls)
pattern = "(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%d%m%Y')
For further information on how to use strftime see https://strftime.org/.
Upvotes: 1
Reputation: 781068
_*
matches a sequence of zero or more underscore characters.
(?<=_)
means that it has to be preceded by an underscore.
(?=\.xlsx)
means that it has to be followed by .xlsx
.
So this will match the ___
in foo____.xlsx
. But it doesn't match your filename, because the data is between the underscore and .xlsx
.
You should match \d+
rather than _*
between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the data is always 8 digits, use \d{8}
to be more specific.
Upvotes: 1