Reputation: 7898
Given a string with a date in an unknown format and other text, how can I separate the two?
>>> import dateutil.parser as dparser
>>> dparser.parse("monkey 2010-07-10 love banana", fuzzy=True)
datetime.datetime(2010, 7, 10, 0, 0)
This, from "Extracting date from a string in Python", is a step in the right direction, but what I want is the non-date text, for example:
date = 2010-07-10
str_a = 'monkey', str_b = 'love banana'
If the date string didn't have spaces in it, I could split the string and test each substring, but how about 'monkey Feb 20, 2015 loves 2014 bananas'? 2014 and 2015 would both "pass" parse(), but only one of them is part of a date.
EDIT: there doesn't seem to be any reasonable way to deal with 'monkey Feb 20, 2015 loves 2014 bananas'. That leaves 'monkey Feb 20, 2015 loves bananas', 'monkey 2/20/2015 loves bananas', 'monkey 20 Feb 2015 loves 2014 bananas', or other variants as things parse() can deal with.
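For dates that contain spaces, one possible approach is to test windows of adjacent tokens rather than single tokens, longest window first. The sketch below uses only the standard library and assumes a fixed list of candidate formats (FORMATS and split_on_date are names made up here, and the format list is an assumption, not something parse() provides), so it only covers formats you list explicitly:

```python
from datetime import datetime

# Assumed candidate formats; extend this list for the inputs you expect.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y", "%d %b %Y", "%B %d, %Y", "%d %B %Y"]

def split_on_date(text, max_tokens=3):
    """Return (date, text_before, text_after) for the first window that parses."""
    tokens = text.split()
    # Longest windows first, so "Feb 20, 2015" is tried before any single token.
    for size in range(max_tokens, 0, -1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            for fmt in FORMATS:
                try:
                    parsed = datetime.strptime(candidate, fmt)
                except ValueError:
                    continue  # this format does not match, try the next one
                before = " ".join(tokens[:start])
                after = " ".join(tokens[start + size:])
                return parsed.date(), before, after
    return None, text, ""

print(split_on_date("monkey 2010-07-10 love banana"))
print(split_on_date("monkey Feb 20, 2015 loves 2014 bananas"))
```

Because none of the listed formats accepts a bare year, the stray 2014 stays in the trailing text. The month-name formats rely on the default (English) locale.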
Upvotes: 1
Views: 518
Reputation: 107287
You can use a regex to extract the words, and to get rid of month names you can check that each word is not in calendar.month_abbr or calendar.month_name:
>>> import calendar
>>> import re
>>> def word_find(s):
...     # Keep alphabetic words that are not month names or abbreviations.
...     return [i for i in re.findall(r'[a-zA-Z]+', s)
...             if i.capitalize() not in calendar.month_name
...             and i.capitalize() not in calendar.month_abbr]
Demo:
>>> s1='monkey Feb 20, 2015 loves 2014 bananas'
>>> s2='monkey Feb 20, 2015 loves bananas'
>>> s3='monkey 2/20/2015 loves bananas'
>>> s4='monkey 20 Feb 2015 loves 2014 bananas'
>>> print word_find(s1)
['monkey', 'loves', 'bananas']
>>> print word_find(s2)
['monkey', 'loves', 'bananas']
>>> print word_find(s3)
['monkey', 'loves', 'bananas']
>>> print word_find(s4)
['monkey', 'loves', 'bananas']
And with a full month name:
>>> s5='monkey 20 January 2015 loves 2014 bananas'
>>> print word_find(s5)
['monkey', 'loves', 'bananas']
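One caveat with this kind of filter: it looks at words in isolation, so ordinary words that happen to be month names ("may", "march") are dropped as well, and all digits and punctuation are discarded. A Python 3 sketch of the same idea that makes this visible (MONTH_WORDS is a helper name introduced here for illustration):

```python
import calendar
import re

# Month names and abbreviations, lowercased once for case-insensitive lookup.
# The empty string at index 0 of both sequences is skipped.
MONTH_WORDS = {
    name.lower()
    for name in list(calendar.month_name) + list(calendar.month_abbr)
    if name
}

def word_find(s):
    """Return alphabetic words in s that are not month names or abbreviations."""
    return [w for w in re.findall(r"[a-zA-Z]+", s) if w.lower() not in MONTH_WORDS]

print(word_find("monkey Feb 20, 2015 loves 2014 bananas"))  # ['monkey', 'loves', 'bananas']
print(word_find("we may march in May"))                     # ['we', 'in']
```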
Upvotes: 1
Reputation: 414129
To find dates/times in natural-language text and get their positions in the input, which in turn lets you extract the non-date text:
#!/usr/bin/env python
import parsedatetime  # $ pip install parsedatetime

cal = parsedatetime.Calendar()
for text in ['monkey 2010-07-10 love banana',
             'monkey Feb 20, 2015 loves 2014 bananas']:
    indices = [0]
    # cal.nlp() returns None when nothing is found, hence the `or []`
    for parsed_datetime, type, start, end, matched_text in cal.nlp(text) or []:
        indices.extend((start, end))
        print([parsed_datetime, matched_text])
    indices.append(len(text))
    # pair up the boundaries: (0, start1), (end1, start2), ..., (endN, len)
    print([text[i:j] for i, j in zip(indices[::2], indices[1::2])])

Output:
[datetime.datetime(2015, 2, 21, 20, 10), '2010']
['monkey ', '-07-10 love banana']
[datetime.datetime(2015, 2, 20, 0, 0), ' Feb 20, 2015']
[datetime.datetime(2015, 2, 21, 20, 14), '2014']
['monkey', ' loves ', ' bananas']
Note: parsedatetime failed to recognize 2010-07-10 as a date in the first string. In both strings, a bare four-digit number (2010 and 2014) is recognized as a time (20:10 and 20:14).
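Since the interesting part is the (start, end) positions, the slicing step can be separated from the parser, which makes it easy to discard matches you don't trust (such as the bare year) before cutting the text. A minimal sketch, with split_around_spans being a hypothetical helper and the spans written out by hand to match the second example string:

```python
def split_around_spans(text, spans):
    """Return the pieces of text that lie outside the given (start, end) spans."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            pieces.append(text[cursor:start])
        # max() guards against overlapping spans moving the cursor backwards
        cursor = max(cursor, end)
    if cursor < len(text):
        pieces.append(text[cursor:])
    return pieces

text = 'monkey Feb 20, 2015 loves 2014 bananas'
spans = [(6, 19), (26, 30)]  # ' Feb 20, 2015' and '2014'

print(split_around_spans(text, spans))      # ['monkey', ' loves ', ' bananas']
# Dropping the second span keeps "2014" as ordinary text.
print(split_around_spans(text, spans[:1]))  # ['monkey', ' loves 2014 bananas']
```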
Upvotes: 0