Reputation: 1081
I'm working on a block of Python code that is meant to test inputs to determine whether they're numeric, timestamps, free text, etc. To detect dates, it uses the dateutil parser, then checks if the parse succeeded or an exception was thrown.
However, the dateutil parser is too forgiving and will turn all manner of values into date objects. For example, a time range such as "12:00-16:00" is converted into a timestamp on the current day, e.g. "2023-08-22T12:00-16:00" (which isn't even a valid timezone offset).
We'd like to only treat inputs as dates if they actually have a day-month-year component, not if they're just hours and minutes - but we still want to accept various date formats, yyyy-MM-ddThh:mm:ss or dd/MM/yyyy or whatever the input uses. Is there another library better suited to this, or some way to make dateutil stricter?
Upvotes: 1
Views: 247
Reputation: 96257
Looking at the source code, it doesn't look like there is any way to make the parser stricter.
But it is open source, so we can understand what is going on. The main magic happens in the parser._parse method. Basically, a bunch of logic is used to resolve a date, and it always ends by returning a tuple:
except (IndexError, ValueError):
    return None, None
if not info.validate(res):
    return None, None
if fuzzy_with_tokens:
    skipped_tokens = self._recombine_skipped(l, skipped_idxs)
    return res, tuple(skipped_tokens)
else:
    return res, None
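To see what that tuple holds before any defaults are applied, one can call the private method directly. This is only an exploratory sketch: `_parse` is an internal detail (the behaviour described here is what the 2.8.2 source suggests) and it may change between releases.

```python
import dateutil.parser

# _parse is private API; calling it directly is for inspection only.
res, skipped = dateutil.parser.parser()._parse("12:00-16:00")

# Fields the input did not supply stay unset (None) on the result object,
# while the fields that were recognised carry their parsed values.
print("date fields:", res.year, res.month, res.day)
print("time fields:", res.hour, res.minute)
```

The unset date fields are what a stricter check can key on.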
Then in parser.parse we see that the default values are filled in to this result (and appropriate errors are raised if the result is None):
if default is None:
    default = datetime.datetime.now().replace(hour=0, minute=0,
                                              second=0, microsecond=0)
res, skipped_tokens = self._parse(timestr, **kwargs)
if res is None:
    raise ParserError("Unknown string format: %s", timestr)
if len(res) == 0:
    raise ParserError("String does not contain a date: %s", timestr)
try:
    ret = self._build_naive(res, default)
except ValueError as e:
    six.raise_from(ParserError(str(e) + ": %s", timestr), e)
The filling in happens in _build_naive.
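The effect of that default filling can be observed through the public API by passing an explicit `default`. A small sketch (the variable names are just for illustration):

```python
from datetime import datetime
from dateutil import parser

# A fixed default makes it obvious which fields came from the input.
fixed_default = datetime(2000, 1, 1)

# Time-only input: year, month and day are taken from the default.
time_only = parser.parse("12:00", default=fixed_default)

# Full date input: the date comes from the string, the time from the default.
full_date = parser.parse("10/08/1988", default=fixed_default)

print(time_only)
print(full_date)
```

With no `default` argument, the current day at midnight plays the role of `fixed_default`, which is why a bare time ends up as a timestamp on today's date.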
So, I am not suggesting this is the best route to go, but you can monkey-patch parser._parse to raise an error if we don't find all of the year, month and day attributes on the result. To make this slightly safer, we can wrap the patching in a context manager:
import contextlib
import dateutil.parser

@contextlib.contextmanager
def strict_parser():
    original_parse = dateutil.parser._parser.parser._parse
    def _parse_patch(self, *args, **kwargs):
        return_value = original_parse(self, *args, **kwargs)
        parsed_result = return_value[0]
        for attr in "year", "month", "day":
            if not getattr(parsed_result, attr, None):
                raise dateutil.parser.ParserError(
                    f"Require a full year, month, and day, did not find a {attr}"
                )
        return return_value
    dateutil.parser._parser.parser._parse = _parse_patch  # do the monkey patch
    try:
        yield
    finally:
        dateutil.parser._parser.parser._parse = original_parse
So, here is how this would work:
In [1]: import contextlib
   ...: import dateutil.parser
   ...:
   ...: @contextlib.contextmanager
   ...: def strict_parser():
   ...:     original_parse = dateutil.parser._parser.parser._parse
   ...:     def _parse_patch(self, *args, **kwargs):
   ...:         return_value = original_parse(self, *args, **kwargs)
   ...:         parsed_result = return_value[0]
   ...:         for attr in "year", "month", "day":
   ...:             if not getattr(parsed_result, attr, None):
   ...:                 raise dateutil.parser.ParserError(
   ...:                     f"Require a full year, month, and day, did not find a {attr}"
   ...:                 )
   ...:         return return_value
   ...:     dateutil.parser._parser.parser._parse = _parse_patch  # do the monkey patch
   ...:     try:
   ...:         yield
   ...:     finally:
   ...:         dateutil.parser._parser.parser._parse = original_parse
   ...:
In [2]: dateutil.__version__
Out[2]: '2.8.2'
In [3]: dateutil.parser.parse("12:00-16:00")
Out[3]: datetime.datetime(2023, 8, 22, 12, 0, tzinfo=tzoffset(None, -57600))
In [4]: with strict_parser():
   ...:     print(dateutil.parser.parse("12:00-16:00"))
   ...:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
Cell In[4], line 2
1 with strict_parser():
----> 2 print(dateutil.parser.parse("12:00-16:00"))
File ~/miniconda3/envs/py311/lib/python3.11/site-packages/dateutil/parser/_parser.py:1368, in parse(timestr, parserinfo, **kwargs)
1366 return parser(parserinfo).parse(timestr, **kwargs)
1367 else:
-> 1368 return DEFAULTPARSER.parse(timestr, **kwargs)
File ~/miniconda3/envs/py311/lib/python3.11/site-packages/dateutil/parser/_parser.py:640, in parser.parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
636 if default is None:
637 default = datetime.datetime.now().replace(hour=0, minute=0,
638 second=0, microsecond=0)
--> 640 res, skipped_tokens = self._parse(timestr, **kwargs)
642 if res is None:
643 raise ParserError("Unknown string format: %s", timestr)
Cell In[1], line 12, in strict_parser.<locals>._parse_patch(self, *args, **kwargs)
10 for attr in "year", "month", "day":
11 if not getattr(parsed_result, attr, None):
---> 12 raise dateutil.parser.ParserError(
13 f"Require a full year, month, and day, did not find a {attr}"
14 )
15 return return_value
ParserError: Require a full year, month, and day, did not find a year
In [5]: with strict_parser():
   ...:     print(dateutil.parser.parse("10/08/1988"))
   ...:
1988-10-08 00:00:00
In [6]: dateutil.parser.parse("12:00-16:00")
Out[6]: datetime.datetime(2023, 8, 22, 12, 0, tzinfo=tzoffset(None, -57600))
Again, monkey-patching is always a hack, but it is relatively easy to do. Of course, you now take on the responsibility of maintaining this patch, because it relies on internal implementation details.
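If touching internals is a concern, a possible alternative that stays on the public API is to parse the string twice with two different `default` values and compare the date parts: any field the input did not supply is copied from the default, so the two results will disagree. This is a different technique from the patch above, and the names `parse_full_date`, `_DEFAULT_A` and `_DEFAULT_B` are hypothetical helpers, not part of dateutil.

```python
from datetime import datetime
from dateutil import parser

# Two defaults that differ in year, month and day.
_DEFAULT_A = datetime(2000, 1, 1)
_DEFAULT_B = datetime(2001, 2, 2)

def parse_full_date(text):
    """Parse text, but only accept it if it supplies its own year, month and day."""
    first = parser.parse(text, default=_DEFAULT_A)
    second = parser.parse(text, default=_DEFAULT_B)
    # A field missing from the input is filled from the default,
    # so the two parses differ whenever the date is incomplete.
    if first.date() != second.date():
        raise ValueError(f"No complete year-month-day date in {text!r}")
    return first

print(parse_full_date("10/08/1988"))

try:
    parse_full_date("12:00-16:00")
except ValueError as exc:
    print("rejected:", exc)
```

The cost is a second parse per input, and time fields missing from the input are still taken from `_DEFAULT_A` (midnight here).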
Upvotes: 1
Reputation: 71
How about Python's re module? You can check the string against a regular expression to determine whether it looks like valid date/datetime data, and only then pass it to dateutil.
For example, the following snippet determines whether the input string contains a date pattern:
import re

# Compiled once; matches the "yyyy-mm-dd" pattern.
DATE_REGEX = re.compile(r"\d{4}-\d{2}-\d{2}")

def check_date(text):
    # True if the text contains a yyyy-mm-dd date anywhere.
    return DATE_REGEX.search(text) is not None
Now, depending on the function's return value, you can use the dateutil or datetime module to parse the string into a date/datetime object.
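Putting the two steps together might look like the sketch below. The pattern is an assumption chosen for illustration: it covers only "yyyy-mm-dd" and "dd/mm/yyyy"-style dates and would need extending for other formats. `parse_if_dated` is a hypothetical helper name.

```python
import re
from dateutil import parser

# Assumed set of accepted date shapes: 2023-08-22 or 22/08/2023.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4}")

def parse_if_dated(text):
    """Return a datetime if text contains a recognised date pattern, else None."""
    if DATE_PATTERN.search(text) is None:
        return None  # no explicit date component, so do not call dateutil at all
    try:
        return parser.parse(text)
    except (ValueError, OverflowError):
        return None  # matched the pattern but dateutil still could not parse it

print(parse_if_dated("2023-08-22T12:00:00"))
print(parse_if_dated("12:00-16:00"))
```

The regex only gates the input; how an ambiguous form such as dd/MM versus MM/dd is read is still decided by dateutil.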
Upvotes: 1