orange

Reputation: 8068

parsing dates and times

I'm trying to tokenize strings like month/year/day T hour:minute into ['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute'], but I'm having no luck with the regex I'm using. Could anyone shed some light on what's wrong?

>>> _tokenize_regex = 'year|month|day|hour|minute|second|.+'
>>> re.findall(_tokenize_regex, 'month/year/day T hour:minute')
['month', '/year/day T hour:minute']

The last alternative, .+, produces the second item in the findall result, but I would have thought the alternatives are ranked, so that .+ only matches if none of the others do...

More examples:

'month.year somestring' -> ['month', '.', 'year', ' somestring']
'year-month-day hour:minute.second' -> ['year', '-', 'month', '-', 'day', ' ', 'hour', ':', 'minute', '.', 'second']

Upvotes: 2

Views: 105

Answers (4)

falsetru

Reputation: 368944

How about using \w+ to match words, and [^\w\s]+ to match non-word, non-space characters?

>>> re.findall(r'\w+|[^\w\s]+', 'month/year/day T hour:minute')
['month', '/', 'year', '/', 'day', 'T', 'hour', ':', 'minute']

/ matches none of year, month, ..., second, but it does match ., so .+ takes over at that point and, being greedy, consumes everything up to the end of the string.
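
To see this step by step, here is a small sketch that applies the original pattern at two positions by hand. It only uses re.compile and Pattern.match with a start position; the variable names are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'year|month|day|hour|minute|second|.+')
text = 'month/year/day T hour:minute'

# At position 0 the alternatives are tried left to right; 'month' wins.
first = pattern.match(text)
print(first.group())   # month

# At position 5 the next character is '/', so no keyword matches.
# Only '.+' can match there, and it greedily takes the rest of the string.
second = pattern.match(text, first.end())
print(second.group())  # /year/day T hour:minute
```

So the alternatives are ranked, but only at the position where the match starts; once .+ has started, nothing stops it before the end of the line.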

UPDATE

An alternative approach using re.split with a capturing group to preserve the separators:

list(filter(None,
    re.split(r'(month|year|day|hour|minute|second|[^\w\s]+)', text)
))

Example:

>>> import re 
>>> def tokenize(text):
...     tokens = re.split(r'(month|year|day|hour|minute|second|[^\w\s]+)', text)
...     return list(filter(None, tokens))
... 
>>> tokenize('month/year/day T hour:minute') 
['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute']
>>> tokenize('month.year somestring') 
['month', '.', 'year', ' somestring']
>>> tokenize('year-month-day hour:minute.second') 
['year', '-', 'month', '-', 'day', ' ', 'hour', ':', 'minute', '.', 'second']

UPDATE 2

re.findall with negative look-ahead assertion:

re.findall(
    r'[^\w\s]+|\s+(?!(?:month|year|day|hour|minute|second))\w*\s*|\s+|\w+',
    text
)
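
For completeness, a runnable sketch of the look-ahead version, wrapped in a function (the name tokenize_lookahead is just a placeholder) and checked against the three examples from the question:

```python
import re

# Same pattern as above. The second alternative keeps leading whitespace
# attached to a following word unless that word starts with a keyword.
_LOOKAHEAD_RE = re.compile(
    r'[^\w\s]+|\s+(?!(?:month|year|day|hour|minute|second))\w*\s*|\s+|\w+'
)

def tokenize_lookahead(text):
    """Split text into keyword, separator and free-text tokens."""
    return _LOOKAHEAD_RE.findall(text)

print(tokenize_lookahead('month/year/day T hour:minute'))
# ['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute']
print(tokenize_lookahead('month.year somestring'))
# ['month', '.', 'year', ' somestring']
```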

Upvotes: 2

Eugene Lisitsky

Reputation: 12845

If you are working with real dates, you may need to check whether the input is a real date or just a combination of digits. I recommend using the standard datetime module, which can parse dates and validate them. Like this:

    >>> import datetime
    >>> s='16/2016/03 T 23:52'
    >>> d = datetime.datetime.strptime(s, '%d/%Y/%m T %H:%M')
    >>> type(d)
    <class 'datetime.datetime'>
    >>> print(d)
    2016-03-16 23:52:00

Here you get a datetime object, which is very convenient for operations on dates. More info and examples are here: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
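
As a rough sketch of the "is it a real date" check: strptime raises ValueError when the string does not fit the format or names an impossible date, so a small wrapper can turn that into a boolean. The helper name is_valid_timestamp and its default format are assumptions made for this example.

```python
from datetime import datetime

def is_valid_timestamp(s, fmt='%d/%Y/%m T %H:%M'):
    """Return True if s parses as a real date and time under fmt."""
    try:
        datetime.strptime(s, fmt)
    except ValueError:
        # Raised for malformed input and for impossible dates alike.
        return False
    return True

print(is_valid_timestamp('16/2016/03 T 23:52'))  # True
print(is_valid_timestamp('31/2016/02 T 23:52'))  # False (no 31 February)
```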

Upvotes: 0

Tristan Bodding-Long

Reputation: 280

Not everything is best done in one line with a messy regex in Python. You could try doing this in steps:

>>> s = 'month/year/day T hour:minute'
>>> date,t,time = s.partition(' T ')
>>> month, year, day = date.split('/')
>>> hours, minutes = time.split(':')
>>> month, year, day, hours, minutes
('month', 'year', 'day', 'hour', 'minute')

For consistency with your expected output you can define separators and use those instead of strings in the partition and split functions.

dateSeparator = '/'
timeSeparator = ':'
tSeparator = ' T '

Variable names are nicer to work with than list indices and self-documenting for the next person who looks at your code. You can always form the list yourself.
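
Putting the pieces together might look like the sketch below. The function name and parameter names are invented for illustration, and it assumes the input always has the month/year/day T hour:minute shape (anything else raises ValueError on unpacking).

```python
def tokenize_in_steps(s, date_sep='/', time_sep=':', t_sep=' T '):
    """Split s step by step and rebuild the token list, separators included."""
    # partition returns (before, separator, after).
    date, t, time = s.partition(t_sep)
    month, year, day = date.split(date_sep)
    hours, minutes = time.split(time_sep)
    return [month, date_sep, year, date_sep, day, t, hours, time_sep, minutes]

print(tokenize_in_steps('month/year/day T hour:minute'))
# ['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute']
```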

Upvotes: 1

sasmith

Reputation: 445

The problem in your regular expression is the .+. In particular, after month is matched, the remaining string is matched against year|month|day|hour|minute|second|.+. The only expression that matches the remaining string is .+. But since this is greedy, it matches the rest of the string.
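
If you want to keep a catch-all alternative, one option (a variation of mine, not the pattern from the question) is to make it lazy and stop it with a look-ahead at the next keyword or the end of the string. A minimal sketch, with made-up names:

```python
import re

KEYWORDS = 'year|month|day|hour|minute|second'

# '.+?' grows one character at a time and stops as soon as a keyword
# (or the end of the string) follows.
LAZY_TOKENIZE = re.compile(rf'{KEYWORDS}|.+?(?={KEYWORDS}|$)')

def tokenize_lazy(text):
    return LAZY_TOKENIZE.findall(text)

print(tokenize_lazy('month/year/day T hour:minute'))
# ['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute']
print(tokenize_lazy('month.year somestring'))
# ['month', '.', 'year', ' somestring']
```

Note that free text containing a keyword as a substring (for example 'today') would be split at that keyword.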

Based on what I think you're trying to do, you should swap the . out for [/ T:].
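
A quick sketch of that change, assuming a + is kept after the character class so that a run like ' T ' comes back as one token (the variable names are placeholders):

```python
import re

# '.' replaced by an explicit set of separator characters.
tokenize_regex = r'year|month|day|hour|minute|second|[/ T:]+'

tokens = re.findall(tokenize_regex, 'month/year/day T hour:minute')
print(tokens)
# ['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute']
```

Keep in mind that findall silently skips characters that are neither keywords nor in the class, so separators such as . or - would have to be added to the class for your other examples.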

Also, if you're actually trying to match timestamp strings, you should consider using strptime.

Upvotes: 2
