Reputation: 1607
How can I extract the date from a string like "monkey 2010-07-10 love banana"? Thanks!
Upvotes: 125
Views: 202588
Reputation: 4689
HARD MODE:
If your dates are not separated by whitespace from surrounding text, combining datefinder
with wordninja
will solve this problem:
>>>import datefinder
>>>import wordninja
>>>example = '04.02.22ILeftMyHeartInSF ---> I Left My Heart In Sf - blah blah blah'
>>>list(datefinder.find_dates(' '.join(wordninja.split(example))))
[datetime.datetime(2022, 4, 22, 0, 0)]
Well sorta. That date was actually February 2004 not April 2022, but any tool would have to guess.
Just to be clear, this is what wordninja
does to squishedtogethertext:
>>>wordninja.split(example)
['04', '02', '22', 'I', 'Left', 'My', 'Heart', 'In', 'SF', 'I', 'Left', 'My', 'Heart', 'In', 'Sf', 'blah', 'blah', 'blah']
Upvotes: 0
Reputation: 670
There are two good modules on PyPI and GitHub, that make this task easier for us. Those are
Installation
pip install datefinder
import datefinder
input_string = "monkey 2010-07-10 love banana"
# a generator will be returned by the datefinder module. I'm typecasting it to a list. Please read the note of caution provided at the bottom.
matches = list(datefinder.find_dates(input_string))
if len(matches) > 0:
# date returned will be a datetime.datetime object. here we are only using the first match.
date = matches[0]
print date
else:
print 'No dates found'
SOURCE: Finny Abraham
Generic parsing of dates in over 200 language locales plus numerous formats in a language agnostic
fashion.
Generic parsing of relative dates like: '1 min ago'
, '2 weeks ago'
, '3 months
, 1 week and 1 day ago'
, 'in 2 days'
, 'tomorrow'.
Generic parsing of dates with time zones abbreviations or UTC offsets like: 'August 14, 2015 EST', 'July 4, 2013 PST', '21 July 2013 10:15 pm +0500'.
Date lookup in longer texts.
Support for non-Gregorian calendar systems. See Supported Calendars.
Extensive test coverage.
>>> parse('1 hour ago')
datetime.datetime(2015, 5, 31, 23, 0)
>>> parse('Il ya 2 heures') # French (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
>>> parse('1 anno 2 mesi') # Italian (1 year 2 months)
datetime.datetime(2014, 4, 1, 0, 0)
>>> parse('yaklaşık 23 saat önce') # Turkish (23 hours ago)
datetime.datetime(2015, 5, 31, 1, 0)
>>> parse('Hace una semana') # Spanish (a week ago)
datetime.datetime(2015, 5, 25, 0, 0)
>>> parse('2小时前') # Chinese (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
Upvotes: 3
Reputation: 879201
Using python-dateutil:
In [1]: import dateutil.parser as dparser
In [18]: dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)
Out[18]: datetime.datetime(2010, 7, 10, 0, 0)
Invalid dates raise a ValueError
:
In [19]: dparser.parse("monkey 2010-07-32 love banana",fuzzy=True)
# ValueError: day is out of range for month
It can recognize dates in many formats:
In [20]: dparser.parse("monkey 20/01/1980 love banana",fuzzy=True)
Out[20]: datetime.datetime(1980, 1, 20, 0, 0)
Note that it makes a guess if the date is ambiguous:
In [23]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True)
Out[23]: datetime.datetime(1980, 10, 1, 0, 0)
But the way it parses ambiguous dates is customizable:
In [21]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True, dayfirst=True)
Out[21]: datetime.datetime(1980, 1, 10, 0, 0)
Upvotes: 199
Reputation:
If the date is given in a fixed form, you can simply use a regular expression to extract the date and "datetime.datetime.strptime" to parse the date:
import re
from datetime import datetime
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
date = datetime.strptime(match.group(), '%Y-%m-%d').date()
Otherwise, if the date is given in an arbitrary form, you can't extract it easily.
Upvotes: 121
Reputation: 93
You could also try the dateparser module, which may be slower than datefinder on free text but which should cover more potential cases and date formats, as well as a significant number of languages.
Upvotes: 1
Reputation: 29
If you know the position of the date object in the string (for example in a log file), you can use .split()[index] to extract the date without fully knowing the format.
For example:
>>> string = 'monkey 2010-07-10 love banana'
>>> date = string.split()[1]
>>> date
'2010-07-10'
Upvotes: -9
Reputation: 652
Using Pygrok, you can define abstracted extensions to the Regular Expression syntax.
The custom patterns can be included in your regex in the format %{PATTERN_NAME}
.
You can also create a label for that pattern, by separating with a colon: %s{PATTERN_NAME:matched_string}
. If the pattern matches, the value will be returned as part of the resulting dictionary (e.g. result.get('matched_string')
)
For example:
from pygrok import Grok
input_string = 'monkey 2010-07-10 love banana'
date_pattern = '%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}'
grok = Grok(date_pattern)
print(grok.match(input_string))
The resulting value will be a dictionary:
{'month': '07', 'day': '10', 'year': '2010'}
If the date_pattern does not exist in the input_string, the return value will be None
. By contrast, if your pattern does not have any labels, it will return an empty dictionary {}
References:
Upvotes: 3
Reputation: 880
For extracting the date from a string in Python; the best module available is the datefinder module.
You can use it in your Python project by following the easy steps given below.
pip install datefinder
import datefinder
input_string = "monkey 2010-07-10 love banana"
# a generator will be returned by the datefinder module. I'm typecasting it to a list. Please read the note of caution provided at the bottom.
matches = list(datefinder.find_dates(input_string))
if len(matches) > 0:
# date returned will be a datetime.datetime object. here we are only using the first match.
date = matches[0]
print date
else:
print 'No dates found'
note: if you are expecting a large number of matches; then typecasting to list won't be a recommended way as it will be having a big performance overhead.
Upvotes: 41