foosion

Reputation: 7898

Separating a date from a string in Python

Given a string with a date in an unknown format and other text, how can I separate the two?

>>> import dateutil.parser as dparser
>>> dparser.parse("monkey 2010-07-10 love banana", fuzzy=True)
datetime.datetime(2010, 7, 10, 0, 0)

from "Extracting date from a string in Python" is a step in the right direction, but what I want is the non-date text, for example:

date = 2010-07-10
str_a = 'monkey', str_b = 'love banana'

If the date string didn't have spaces in it, I could split the string and test each substring, but how about 'monkey Feb 20, 2015 loves 2014 bananas'? 2014 and 2015 would both "pass" parse(), but only one of them is part of a date.
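For the simple case where the date contains no spaces, the split-and-test idea can be sketched with the standard library alone. This is only an illustration: the function name and the list of candidate formats are my own assumptions, and strptime stands in for parse() so that a bare year such as 2014 is not accepted as a date.

```python
from datetime import datetime

# Assumed formats; extend the tuple to cover whatever the input may contain.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def split_on_date_token(text, formats=CANDIDATE_FORMATS):
    """Return (date, text_before, text_after) for the first token that parses.

    Returns (None, text, "") when no whitespace-separated token matches.
    """
    tokens = text.split()
    for pos, token in enumerate(tokens):
        for fmt in formats:
            try:
                found = datetime.strptime(token, fmt).date()
            except ValueError:
                continue  # this token does not fit this format
            before = " ".join(tokens[:pos])
            after = " ".join(tokens[pos + 1:])
            return found, before, after
    return None, text, ""


print(split_on_date_token("monkey 2010-07-10 love banana"))
# (datetime.date(2010, 7, 10), 'monkey', 'love banana')
```

As expected, this finds nothing in 'monkey Feb 20, 2015 loves 2014 bananas', because no single token is a complete date.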

EDIT: there doesn't seem to be any reasonable way to deal with 'monkey Feb 20, 2015 loves 2014 bananas'. That leaves 'monkey Feb 20, 2015 loves bananas' or 'monkey 2/20/2015 loves bananas' or 'monkey 20 Feb 2015 loves 2014 bananas' or other variants as things parse() can deal with.

Upvotes: 1

Views: 518

Answers (2)

Kasravnd

Reputation: 107287

You can use a regex to extract the words, and to get rid of month names you can check that each word is not in calendar.month_abbr or calendar.month_name:

>>> import calendar
>>> import re
>>> def word_find(s):
...     return [i for i in re.findall(r'[a-zA-Z]+', s)
...             if i.capitalize() not in calendar.month_name
...             and i.capitalize() not in calendar.month_abbr]
...

Demo:

>>> s1='monkey Feb 20, 2015 loves 2014 bananas'
>>> s2='monkey Feb 20, 2015 loves bananas'
>>> s3='monkey 2/20/2015 loves bananas'
>>> s4='monkey 20 Feb 2015 loves 2014 bananas'
>>> print(word_find(s1))
['monkey', 'loves', 'bananas']
>>> print(word_find(s2))
['monkey', 'loves', 'bananas']
>>> print(word_find(s3))
['monkey', 'loves', 'bananas']
>>> print(word_find(s4))
['monkey', 'loves', 'bananas']

And with a full month name:

>>> s5='monkey 20 January 2015 loves 2014 bananas'
>>> print(word_find(s5))
['monkey', 'loves', 'bananas']
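The word filter drops the date entirely. If you also need the date text and the pieces on either side of it, one possible variation is a single regex for a handful of date shapes. This is a rough sketch, not a general solution: DATE_RE and split_date are names I made up, the four patterns are assumptions about the input, and the month names come from calendar, so they follow the current locale.

```python
import calendar
import re

# Month names and abbreviations (index 0 of each calendar array is '').
_names = {m for m in list(calendar.month_name) + list(calendar.month_abbr) if m}
_MONTH = "(?:" + "|".join(sorted(_names, key=len, reverse=True)) + r")\.?"

# Assumed date shapes: 2010-07-10, 2/20/2015, Feb 20, 2015, 20 Feb 2015.
DATE_RE = re.compile(
    r"\b(?:"
    r"\d{4}-\d{1,2}-\d{1,2}"
    r"|\d{1,2}/\d{1,2}/\d{4}"
    r"|" + _MONTH + r"\s+\d{1,2},?\s+\d{4}"
    r"|\d{1,2}\s+" + _MONTH + r",?\s+\d{4}"
    r")\b",
    re.IGNORECASE,
)


def split_date(text):
    """Return (date_text, before, after) for the first match, else (None, text, '')."""
    match = DATE_RE.search(text)
    if match is None:
        return None, text, ""
    before = text[:match.start()].strip()
    after = text[match.end():].strip()
    return match.group(0), before, after


print(split_date('monkey Feb 20, 2015 loves 2014 bananas'))
# ('Feb 20, 2015', 'monkey', 'loves 2014 bananas')
```

Because a lone year does not fit any of the patterns, the stray 2014 stays in the non-date text. The matched date text can then be handed to a real parser if a datetime object is needed.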

Upvotes: 1

jfs

Reputation: 414129

To find dates/times in natural-language text together with their positions in the input, which in turn lets you extract the non-date text:

 #!/usr/bin/env python
 import parsedatetime # $ pip install parsedatetime

 cal = parsedatetime.Calendar()
 for text in ['monkey 2010-07-10 love banana',
              'monkey Feb 20, 2015 loves 2014 bananas']:
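     indices = [0]  # boundaries of the non-date pieces, starting at 0
     # cal.nlp() may return None when nothing matches, hence "or []"
     for parsed_datetime, flag, start, end, matched_text in cal.nlp(text) or []:
         indices.extend((start, end))
         print([parsed_datetime, matched_text])
     indices.append(len(text))
     # pair the boundaries up: (0, start1), (end1, start2), ..., (endN, len)
     print([text[i:j] for i, j in zip(indices[::2], indices[1::2])])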

Output

[datetime.datetime(2015, 2, 21, 20, 10), '2010']
['monkey ', '-07-10 love banana']
[datetime.datetime(2015, 2, 20, 0, 0), ' Feb 20, 2015']
[datetime.datetime(2015, 2, 21, 20, 14), '2014']
['monkey', ' loves ', ' bananas']

Note: parsedatetime failed to recognize 2010-07-10 as a date in the first string. Instead, 2010 (first string) and 2014 (second string) are each recognized as a time (20:10 and 20:14).
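The slicing step does not depend on parsedatetime itself. Here is a stand-alone sketch of it, with a hypothetical helper name and the spans typed in by hand to mirror the second string's output:

```python
def text_outside_spans(text, spans):
    """Return the pieces of text not covered by the (start, end) spans.

    The spans are assumed to be sorted and non-overlapping.
    """
    indices = [0]
    for start, end in spans:
        indices.extend((start, end))
    indices.append(len(text))
    # Even positions open a kept piece, odd positions close it.
    return [text[i:j] for i, j in zip(indices[::2], indices[1::2])]


text = 'monkey Feb 20, 2015 loves 2014 bananas'
# Spans of ' Feb 20, 2015' and '2014', entered manually for illustration.
print(text_outside_spans(text, [(6, 19), (26, 30)]))
# ['monkey', ' loves ', ' bananas']
```

If you filter the spans before slicing, for example by dropping matches that were parsed as a time only, the false positive stays in the text: text_outside_spans(text, [(6, 19)]) gives ['monkey', ' loves 2014 bananas'].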

Upvotes: 0
