Reputation: 8068
I'm trying to tokenize strings like month/year/day T hour:minute into ['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute'], but I'm having no luck with the regex I'm trying. Could anyone shed some light on what's wrong?
>>> _tokenize_regex = 'year|month|day|hour|minute|second|.+'
>>> re.findall(_tokenize_regex, 'month/year/day T hour:minute')
['month', '/year/day T hour:minute']
The last option, .+, produces the second item in the findall result, but I would have thought these options are ranked, so that .+ only matches if none of the others do...
More examples:
'month.year somestring' -> ['month', '.', 'year', ' somestring']
'year-month-day hour:minute.second' -> ['year', '-', 'month', '-', 'day', ' ', 'hour', ':', 'minute', '.', 'second']
Upvotes: 2
Views: 105
Reputation: 368944
How about using \w+ to match words, and [^\w\s]+ to match non-word, non-space characters?
>>> re.findall(r'\w+|[^\w\s]+', 'month/year/day T hour:minute')
['month', '/', 'year', '/', 'day', 'T', 'hour', ':', 'minute']
In your pattern, / matches none of year, month, ..., second, but it does match the . in .+, and .+ then matches everything up to the end of the string.
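To make that behaviour visible, here is a small sketch that prints the span of each match for the original pattern. It only uses re.finditer from the standard library; the variable names are illustrative.

```python
import re

# The pattern from the question: keywords first, then a catch-all.
pattern = re.compile(r'year|month|day|hour|minute|second|.+')
text = 'month/year/day T hour:minute'

# Record each match together with the span it covers.
matches = [(m.group(), m.span()) for m in pattern.finditer(text)]

for token, span in matches:
    print(span, repr(token))
```

The alternatives are tried left to right at each starting position, but once .+ is chosen at the first / it keeps going to the end of the string, so the scan never gets a chance to try year at a later position.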
UPDATE
An alternative approach uses re.split with a capturing group to preserve the separators:
list(filter(None,
    re.split(r'(month|year|day|hour|minute|second|[^\w\s]+)', text)
))
Example:
>>> import re
>>> def tokenize(text):
...     tokens = re.split(r'(month|year|day|hour|minute|second|[^\w\s]+)', text)
...     return list(filter(None, tokens))
...
>>> tokenize('month/year/day T hour:minute')
['month', '/', 'year', '/', 'day', ' T ', 'hour', ':', 'minute']
>>> tokenize('month.year somestring')
['month', '.', 'year', ' somestring']
>>> tokenize('year-month-day hour:minute.second')
['year', '-', 'month', '-', 'day', ' ', 'hour', ':', 'minute', '.', 'second']
UPDATE 2
Another option is re.findall with a negative look-ahead assertion:
re.findall(
    r'[^\w\s]+|\s+(?!(?:month|year|day|hour|minute|second))\w*\s*|\s+|\w+',
    text
)
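As a quick check, the look-ahead pattern can be wrapped in a function and run against the three examples from the question. This is only a sketch: the names KEYWORDS, TOKEN_RE and tokenize_findall are made up for illustration, and only those three inputs have been considered.

```python
import re

# Keywords that must stay separate tokens (taken from the question).
KEYWORDS = 'month|year|day|hour|minute|second'

# Same pattern as above, assembled from KEYWORDS:
#   1. runs of punctuation
#   2. whitespace + a non-keyword word + trailing whitespace (e.g. ' T ')
#   3. plain whitespace
#   4. plain words (including the keywords themselves)
TOKEN_RE = re.compile(
    r'[^\w\s]+|\s+(?!(?:' + KEYWORDS + r'))\w*\s*|\s+|\w+'
)

def tokenize_findall(text):
    """Split text into keyword, separator and free-text tokens."""
    return TOKEN_RE.findall(text)

print(tokenize_findall('month/year/day T hour:minute'))
print(tokenize_findall('month.year somestring'))
print(tokenize_findall('year-month-day hour:minute.second'))
```

The look-ahead is what keeps the space before hour as its own token: the second alternative refuses to start when the whitespace is followed by a keyword, so the plain \s+ alternative takes over.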
Upvotes: 2
Reputation: 12845
If you are working with real dates, you may need to check whether the input is a valid date or just a combination of digits. I recommend the datetime module, which can both parse and validate dates. For example:
>>> import datetime
>>> s='16/2016/03 T 23:52'
>>> d = datetime.datetime.strptime(s, '%d/%Y/%m T %H:%M')
>>> type(d)
<class 'datetime.datetime'>
>>> print(d)
2016-03-16 23:52:00
This gives you a datetime object, which is convenient for date operations. More information and examples: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
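For the validation part, strptime raises ValueError when the string does not fit the format or is not a real calendar date, so a check could look roughly like this. The helper name parse_timestamp and its default format are assumptions for this sketch, not part of the datetime module.

```python
from datetime import datetime

def parse_timestamp(s, fmt='%d/%Y/%m T %H:%M'):
    """Return a datetime if s is a valid date in the given format, else None."""
    try:
        return datetime.strptime(s, fmt)
    except ValueError:
        # Raised for a format mismatch and for impossible dates alike.
        return None

valid = parse_timestamp('16/2016/03 T 23:52')
invalid = parse_timestamp('31/2016/02 T 23:52')  # 31 February does not exist
print(valid, invalid)
```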
Upvotes: 0
Reputation: 280
Not everything is best done in one line with a messy regex in Python. You could try doing this in steps:
>>> s = 'month/year/day T hour:minute'
>>> date,t,time = s.partition(' T ')
>>> month, year, day = date.split('/')
>>> hours, minutes = time.split(':')
>>> month, year, day, hours, minutes
('month', 'year', 'day', 'hour', 'minute')
For consistency with your expected output, you can define the separators as variables and use them in place of the string literals in the partition and split calls.
dateSeparator = '/'
timeSeparator = ':'
tSeparator = ' T '
Variable names are nicer to work with than list indices and self-documenting for the next person who looks at your code. You can always form the list yourself.
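Putting the pieces together, a sketch of the step-by-step approach that rebuilds the expected token list might look like this. It assumes the input always has the fixed shape date T time, and the function and constant names are illustrative only.

```python
# Separators for the one fixed layout this sketch supports.
DATE_SEPARATOR = '/'
TIME_SEPARATOR = ':'
T_SEPARATOR = ' T '

def tokenize_fixed_format(s):
    """Tokenize 'a/b/c T d:e' into fields and separators, in order."""
    date, t_sep, time = s.partition(T_SEPARATOR)
    month, year, day = date.split(DATE_SEPARATOR)
    hour, minute = time.split(TIME_SEPARATOR)
    # Form the list by hand, interleaving the named fields and separators.
    return [month, DATE_SEPARATOR, year, DATE_SEPARATOR, day,
            t_sep, hour, TIME_SEPARATOR, minute]

print(tokenize_fixed_format('month/year/day T hour:minute'))
```

Unlike the regex answers, this raises ValueError on input with a different number of fields, which may or may not be what you want.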
Upvotes: 1
Reputation: 445
The problem in your regular expression is the .+. After month is matched, the remaining string is matched against year|month|day|hour|minute|second|.+. At that position (the /), the only alternative that matches is .+, and because + is greedy, it consumes the rest of the string.
Based on what I think you're trying to do, you should swap the . out for [/ T:], so the last alternative becomes [/ T:]+.
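A minimal sketch of that change, assuming the separators are limited to /, space, T and : as in the first example:

```python
import re

# Keywords first, then a run of the known separator characters.
pattern = re.compile(r'year|month|day|hour|minute|second|[/ T:]+')

tokens = pattern.findall('month/year/day T hour:minute')
print(tokens)

# Caveat: findall silently skips anything that matches no alternative,
# so a separator outside the class (here '.') is dropped from the result.
dropped = pattern.findall('month.year')
print(dropped)
```

This handles the first example, but the other two examples in the question use separators and free text that the character class does not cover.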
Also, if you're actually trying to match timestamp strings, you should consider using strptime.
Upvotes: 2