Reputation: 21
I am trying to determine whether a given date string includes all three components: day, month, and year.
Example Inputs and Expected Outputs:
Approach I Am Using:
I am currently using the dateutil.parser.parse function to parse the date string and then check whether the year, month, and day attributes of the resulting datetime object are valid. However, parse fills in missing components with default values, which makes it hard to tell whether those components were explicitly present in the input string. So I wrote the logic below to work around that problem.
from datetime import datetime

import dateutil.parser


def parse_date(date_string):
    try:
        # Two defaults that differ in year, month, and day
        default_dt1 = datetime(1, 1, 1)
        default_dt2 = datetime(2, 2, 2)
        parsed_date1 = dateutil.parser.parse(date_string, default=default_dt1)
        parsed_date2 = dateutil.parser.parse(date_string, default=default_dt2)
        # Identical results mean no component was taken from a default
        if parsed_date1 == parsed_date2:
            return True
        return False
    except (ValueError, TypeError):
        return False
How it works:
Default Dates:
The function uses two default dates, datetime(1, 1, 1) and datetime(2, 2, 2), to fill in missing components when parsing the input date string.
Parsing:
It parses the date_string twice using the parser.parse function (from the dateutil.parser module) with the two different default dates.
If the date_string is missing components (like a day or month), parser.parse will use the defaults to fill in the gaps.
Comparison:
If both parsed results are the same, it indicates that the date_string has all components, as missing components would lead to differences in the two parsed dates due to the different defaults.
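The comparison step can be illustrated without dateutil. The sketch below is only a stand-in: fill_missing and looks_complete are hypothetical names, and the fill-in step is mimicked with datetime.replace on the assumption that the parser copies every unparsed field from the default. It does not call dateutil.

```python
from datetime import datetime


def fill_missing(explicit: dict, default: datetime) -> datetime:
    """Hypothetical stand-in for the parser's fill-in step: fields given in
    `explicit` win, everything else is copied from `default`."""
    return default.replace(**explicit)


def looks_complete(explicit: dict) -> bool:
    # The two defaults differ in year, month, and day.
    first = fill_missing(explicit, datetime(1, 1, 1))
    second = fill_missing(explicit, datetime(2, 2, 2))
    # A missing field takes a different value in each result,
    # so equality means nothing came from a default.
    return first == second


print(looks_complete({"year": 2025, "month": 1, "day": 1}))  # True
print(looks_complete({"year": 2026, "month": 1}))            # False (no day)
print(looks_complete({"month": 1, "day": 31}))               # False (no year)
```

The trick depends on the defaults disagreeing in all three date fields; two defaults that shared, say, the same month could not reveal a missing month.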
Question:
How can I reliably check whether a date string explicitly contains all three components, without relying on the default values that dateutil fills in? Is there a better way to achieve this? I tried passing None as the default, but that does not work. Surprisingly, there seems to be no built-in option for this.
Note: the input format is not consistent.
Upvotes: 2
Views: 115
Reputation: 8676
Chuckle, my sense is that a test framework is needed here.
The other aspect that is not apparent is the data's input form. Are you reading from a text stream, database, or file? Why? There is an ambiguity about whether multiple dates can occur in the input data, for example, paragraphs containing multiple dates. This is why I ask whether we should look at streaming data.
Can the input form be infinite in size, etc.?
From a logical perspective, I'd recommend treating the problem as a fuzzy search. First, scan for the three components; if any of them is missing, reject the string. If all three are present, then check whether they form an actual date.
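As a rough illustration of that two-stage idea, here is a standard-library sketch. It assumes only numeric, year-first candidates such as 2025-01-01 or 2025/01/01; the names CANDIDATE and find_real_dates are made up for this example and are not part of the code further down.

```python
import re
from datetime import date

# Assumption: only year-first numeric candidates are considered.
CANDIDATE = re.compile(r"\b(\d{4})[-/](\d{1,2})[-/](\d{1,2})\b")


def find_real_dates(text: str) -> list:
    """Stage 1: scan for year/month/day triples.
    Stage 2: keep only the triples that form a real calendar date."""
    found = []
    for match in CANDIDATE.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            # All three components are present, but they do not form
            # a real date (for example month 13).
            continue
    return found


print(find_real_dates("Kickoff 2025-01-01, review 2025-13-01, nothing in 2026."))
```

Only the first candidate survives: the second has all three components but an impossible month, and the bare year never matches the scan.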
Here is an example with a test framework. I updated the answer to handle cases where we want to return only one date or fail. How? We wrap extract_all_valid_dates() with extract_single_date().
If extract_single_date() finds exactly one valid date, it returns that date string; if it finds none or more than one, it returns an empty string.
Mac OS Version: 15.2
Python Version: Python 3.8.12
Shell Version: zsh 5.9 (arm64-apple-darwin24.0)
import re
from datetime import datetime
from dateutil import parser
import unittest


def extract_date_parts(date_string: str) -> tuple:
    """
    Extract possible matches for year, month, and day from the string.
    Returns tuple of (years, months, days).
    """
    text = date_string.lower()
    # Find all 4-digit years
    years = [m.group() for m in re.finditer(r'\b\d{4}\b', text)]
    # Build month pattern and find all months (both names and numbers)
    months_pattern = (
        r'january|february|march|april|may|june|july|august|september|'
        r'october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec'
    )
    month_pattern = rf'\b(?:{months_pattern}|\b(?:0?[1-9]|1[0-2])\b)'
    months_found = [m.group() for m in re.finditer(month_pattern, text, re.IGNORECASE)]
    # Find all potential days (with optional ordinal indicators: 1st, 2nd, 3rd, etc.)
    days = [
        m.group()
        for m in re.finditer(r'\b(?:[0-2]?[1-9]|[12]\d|3[01])(?:st|nd|rd|th)?\b', text)
    ]
    return years, months_found, days


def has_all_components(date_string: str) -> bool:
    """
    Determines whether a date string explicitly contains year, month, and day
    and can be parsed as a complete date.
    """
    if not date_string:
        return False
    # 1. Extract date parts via regex
    years, months, days = extract_date_parts(date_string)
    # Check if at least one valid year, month, and day substring is present
    if not (years and months and days):
        return False
    # 2. If no month name is found, ensure numeric date format is valid
    text_lower = date_string.lower()
    month_names = (
        r'january|february|march|april|may|june|july|august|september|'
        r'october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec'
    )
    if not re.search(month_names, text_lower, re.IGNORECASE):
        # Check if there's a valid numeric date format
        has_numeric_format = any(
            re.search(pattern, text_lower)
            for pattern in [
                r'\b\d{4}[-/]\d{1,2}[-/]\d{1,2}\b',  # YYYY-MM-DD or YYYY/MM/DD
                r'\b\d{1,2}[-/]\d{1,2}[-/]\d{4}\b',  # DD-MM-YYYY or MM-DD-YYYY
            ]
        )
        if not has_numeric_format:
            return False
    # 3. Attempt full parse with fuzzy=True to allow extra text
    try:
        parser.parse(date_string, fuzzy=True)
    except (ValueError, TypeError):
        return False
    return True


def extract_all_valid_dates(text: str) -> list:
    """
    Scans the entire text for potential date-like substrings and returns
    a list of those that pass the `has_all_components` check.
    """
    # Extended pattern to catch:
    # 1) Numeric formats (YYYY-MM-DD, DD/MM/YYYY, etc.)
    # 2) Day-first + month name + year (e.g., "31 January 2026" or "1st of January 2026")
    # 3) Month name + day + year (e.g., "January 31, 2026")
    # 4) Mixed/abbreviated formats (e.g., "31-Jan-2027" or "Jan-31-2027")
    date_candidate_pattern = re.compile(
        r"""
        # Pattern 1: Purely numeric, e.g. 2025-01-01 or 01/01/2025
        (?:\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b)
        | # Pattern 2: Day (w/ possible ordinal) + 'of'? + Month name + Year, e.g. "31 January 2026", "1st of January 2026"
        (?:\b\d{1,2}(?:st|nd|rd|th)?\s+(?:of\s+)?(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|
        january|february|march|april|may|june|july|august|
        september|october|november|december),?\s*\d{4}\b)
        | # Pattern 3: Month name + Day (w/ optional ordinal), + Year, e.g. "January 31, 2026"
        (?:\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|
        january|february|march|april|may|june|july|august|
        september|october|november|december)\s+\d{1,2}
        (?:st|nd|rd|th)?,?\s*\d{4}\b)
        | # Pattern 4: Day-month-year with abbreviated month, e.g. "31-Jan-2027" or "Jan-31-2027"
        (?:\b\d{1,2}[-/](?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[-/]\d{4}\b
        | \b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[-/]\d{1,2}[-/]\d{4}\b)
        """,
        re.IGNORECASE | re.VERBOSE
    )
    candidates = date_candidate_pattern.findall(text)
    valid_dates = [c.strip() for c in candidates if has_all_components(c.strip())]
    return valid_dates


def extract_single_date(text: str) -> str:
    """
    Extracts a single valid date from the input text.
    Returns the date string if exactly one valid date is found, otherwise returns empty string.

    Args:
        text (str): Input text to search for dates

    Returns:
        str: The valid date string if exactly one is found, empty string otherwise
    """
    if not text:
        return ""
    valid_dates = extract_all_valid_dates(text)
    # Return the date only if exactly one valid date is found
    return valid_dates[0] if len(valid_dates) == 1 else ""


# ------------------------- TESTS -------------------------
class TestDateParser(unittest.TestCase):
    def test_valid_dates(self):
        valid_dates = [
            "2025-01-01",
            "January 31, 2026",
            "31 January 2026",
            "2025/01/01",
            "1st January 2026",
            "31-Jan-2026",
            "Jan-31-2026",
            "The 1st of January, 2026",
            "Today is 2025-01-01",
        ]
        for date in valid_dates:
            with self.subTest(date=date):
                result = has_all_components(date)
                self.assertTrue(result, f"Failed to validate: {date}")

    def test_invalid_dates(self):
        invalid_dates = [
            "2025",
            "January 2026",
            "01/2026",
            "2025-01",
            "Yesterday",
            "Next Monday",
            "31/12",
            "January",
            "31st",
            "In January 2026",
            "Only 2026 matters",
        ]
        for date in invalid_dates:
            with self.subTest(date=date):
                self.assertFalse(has_all_components(date), f"Should have failed: {date}")

    def test_edge_cases(self):
        edge_cases = {
            "2025-13-01": False,
            "2025-01-32": False,
            "": False,
            "Not a date": False,
            "2025-01-01 12:00": True,
            "01-01-2025": True,
            "20250101": False
        }
        for date, expected in edge_cases.items():
            with self.subTest(date=date):
                self.assertEqual(
                    has_all_components(date),
                    expected,
                    f"Failed for edge case: {date}"
                )

    # ------------------------- NEW TEST: MULTIPLE DATES -------------------------
    def test_extract_multiple_dates(self):
        """
        Checks if we can extract multiple valid dates from a two-paragraph string.
        """
        text = (
            "Paragraph 1 says the project begins on 2025-01-01. Then we will deliver "
            "phase two by January 31, 2026, and possibly an update on 31-Jan-2027.\n\n"
            "Paragraph 2 mentions that a final review is slated for the 1st of February, 2028. "
            "Some random text follows without a valid date."
        )
        dates_found = extract_all_valid_dates(text)
        # We expect 4 valid dates:
        # 1) 2025-01-01
        # 2) January 31, 2026
        # 3) 31-Jan-2027
        # 4) 1st of February, 2028
        self.assertEqual(len(dates_found), 4, f"Should find exactly 4 valid dates, found {dates_found}")
        for d in dates_found:
            self.assertTrue(has_all_components(d), f"Extracted invalid date: {d}")

    def test_additional_scenarios(self):
        """
        Test two specific date strings:
        1) A date with extra words and ordinal day.
        2) A date with a time component.
        """
        test_cases = {
            "The 1st of January, 2026": True,  # Contains extra words but still a valid date
            "2025-01-01 12:00": True,  # Date plus time
        }
        for date_str, expected in test_cases.items():
            with self.subTest(date_str=date_str):
                self.assertEqual(
                    has_all_components(date_str),
                    expected,
                    f"Should have returned {expected} for {date_str!r}"
                )


class TestSingleDateExtraction(unittest.TestCase):
    def test_single_date_extraction(self):
        # Test cases with exactly one date (should return the date)
        single_date_cases = {
            "The meeting is on January 15th, 2024": "January 15th, 2024",
            "2025-01-01 is the deadline": "2025-01-01",
            "Submit by the 1st of February, 2026": "1st of February, 2026",
            "Due date: 31-Jan-2026": "31-Jan-2026"
        }
        for text, expected in single_date_cases.items():
            with self.subTest(text=text):
                result = extract_single_date(text)
                self.assertEqual(result, expected,
                                 f"Failed to extract single date from: {text}")

    def test_multiple_dates_return_empty(self):
        # Test cases with multiple dates (should return empty string)
        multiple_dates_texts = [
            "Start date is 2025-01-01 and end date is 2025-12-31",
            "Meeting on January 15th, 2024 and follow-up on February 1st, 2024",
            "From 31-Jan-2026 to 15-Feb-2026",
            "First deadline is the 1st of January, 2026, second is the 1st of February, 2026"
        ]
        for text in multiple_dates_texts:
            with self.subTest(text=text):
                result = extract_single_date(text)
                self.assertEqual(result, "",
                                 f"Should return empty string for multiple dates: {text}")

    def test_no_dates_return_empty(self):
        # Test cases with no valid dates (should return empty string)
        no_date_texts = [
            "No dates in this text",
            "Meeting is next week",
            "Only a year 2024 mentioned",
            "Just a month January",
            ""  # Empty string
        ]
        for text in no_date_texts:
            with self.subTest(text=text):
                result = extract_single_date(text)
                self.assertEqual(result, "",
                                 f"Should return empty string for no dates: {text}")

    def test_edge_cases(self):
        edge_cases = {
            "Same date twice: 2025-01-01 and 2025-01-01": "",  # Duplicate dates
            "Today is 2025-01-01 12:00": "2025-01-01",  # Date with time (time is stripped)
            "The date is : 2025-13-01": "",  # Invalid date
            "Meeting on 2025-01-01.": "2025-01-01",  # Date with punctuation
            " 2025-01-01 ": "2025-01-01"  # Extra whitespace
        }
        for text, expected in edge_cases.items():
            with self.subTest(text=text):
                result = extract_single_date(text)
                self.assertEqual(result, expected,
                                 f"Failed for edge case: {text}")

    def test_variations_with_time(self):
        # Additional test cases specifically for dates with time
        time_cases = {
            "Meeting at 2025-01-01 15:00": "2025-01-01",
            "January 15th, 2024 10:30 AM": "January 15th, 2024",
            "31-Jan-2026 23:59": "31-Jan-2026"
        }
        for text, expected in time_cases.items():
            with self.subTest(text=text):
                result = extract_single_date(text)
                self.assertEqual(result, expected,
                                 f"Failed to handle time correctly in: {text}")


if __name__ == "__main__":
    unittest.main()
Upvotes: 1
Reputation: 12425
For the examples you have shown, and for most others, there is no need for specialized date parsing. All you need is a simple re.split to find out whether the date string can be split into exactly 3 "word" components:
import re
date_strs = ["2025-01-01", "January 31, 2026", "January 2026", "2026", "2025-01",]


def has_3_components(date_str):
    # Use date_str.strip() instead of date_str if you want
    # to strip leading and trailing whitespace:
    date_lst = re.split(r'\W+', date_str)
    return len(date_lst) == 3


for date_str in date_strs:
    print(f"{date_str}: {has_3_components(date_str)}")
Prints:
2025-01-01: True
January 31, 2026: True
January 2026: False
2026: False
2025-01: False
Of course, this trivial method does not try to determine whether the string is a proper, valid date, so the following nonsense strings (and many others) will return True as well: "Foo 1, 2025", "Feb-31-2024", or just "foo bar baz".
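If those false positives matter, one possible follow-up is to validate the three tokens with datetime.strptime. This is only a sketch under assumptions: KNOWN_LAYOUTS is a hypothetical, non-exhaustive list of layouts, is_complete_valid_date is a made-up name, and month names are matched in English (the default C locale).

```python
import re
from datetime import datetime

# Assumption: a hypothetical, non-exhaustive list of accepted layouts.
# They use single spaces because the tokens are re-joined with spaces below.
KNOWN_LAYOUTS = (
    "%Y %m %d", "%d %m %Y", "%m %d %Y",
    "%B %d %Y", "%b %d %Y", "%d %B %Y", "%d %b %Y",
)


def is_complete_valid_date(date_str: str) -> bool:
    # Same split as has_3_components, minus empty tokens left by
    # leading or trailing punctuation.
    tokens = [t for t in re.split(r"\W+", date_str) if t]
    if len(tokens) != 3:
        return False
    normalized = " ".join(tokens)
    for layout in KNOWN_LAYOUTS:
        try:
            datetime.strptime(normalized, layout)
            return True
        except ValueError:
            # Wrong layout or impossible date: try the next layout.
            continue
    return False


for s in ["2025-01-01", "January 31, 2026", "Foo 1, 2025", "Feb-31-2024", "foo bar baz"]:
    print(f"{s}: {is_complete_valid_date(s)}")
```

The first two strings are accepted and the three nonsense strings are rejected. The trade-off is that anything outside the listed layouts is rejected too, which may be too strict when the input format is not consistent.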
Upvotes: 2