Reputation: 649
I'm a RegEx newbie and this has been driving me nuts for the past 48 hours. I tried everything I could while reading hundreds of examples and documents. I want to learn.
I need to extract the month name from strings like these, with the month being the word in the middle (multilingual):
10 july 2014
9 dicembre2014
1januar2011
18août2002 (note: non-[A-z] character in the month if it matters)
The closest I got was [\D]{3,}(?=.{4,}), yielding:
' july '
' dicembre'
'januar'
'août'
But it still matches the spaces around the name. I tried adding [^\s], but obviously it's not that simple.
What is the simplest RegEx way to find the right match?
Upvotes: 2
Views: 117
Reputation: 6511
If you set the re.UNICODE flag (already the default for str patterns in Python 3), \w becomes Unicode-aware and matches letters from all scripts (including û, ñ, á, etc.). Then [^\W\d_] matches only letters, but from any script:
\w matches word characters (letters, digits, or the underscore "_")
\W is the negated shorthand; it matches non-word characters (same as [^\w])
\d matches digits
[^\W\d_] matches anything EXCEPT non-word characters, digits, or "_", which means it matches only letters
# Python 3.4.3
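import re
# Sample dates, separated by newlines
text = "10 july 2014 \n 9 dicembre2014 \n 1januar2011\n 18août2002"
# Groups: day, month name (3+ letters from any script), 2- or 4-digit year
pattern = r'([0-3]?\d)\s*([^\W\d_]{3,})\s*((?:\d{2}){1,2})'
result = re.findall(pattern, text, re.UNICODE)
for date in result:
    print(date)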
('10', 'july', '2014')
('9', 'dicembre', '2014')
('1', 'januar', '2011')
('18', 'août', '2002')
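If you only need the month name, as the question asks, the letters-only class is enough on its own. The following is a minimal sketch, assuming Python 3 and one date per input string; the helper name extract_month is made up for illustration. It also normalizes the input to NFC first, on the assumption that some inputs may contain decomposed accents (for example "u" followed by a combining circumflex), which \w would not treat as part of a letter.

```python
import re
import unicodedata

# Letters only, from any script: a word character that is neither a digit nor "_".
MONTH = re.compile(r'[^\W\d_]{3,}')

def extract_month(s):
    """Return the first run of 3+ letters in s, or None if there is none.

    extract_month is a hypothetical helper name, not part of the re module.
    """
    # Compose decomposed accents (e.g. "u" + U+0302) into single letters.
    s = unicodedata.normalize("NFC", s)
    m = MONTH.search(s)
    return m.group(0) if m else None

samples = ["10 july 2014", "9 dicembre2014", "1januar2011", "18août2002"]
months = [extract_month(s) for s in samples]
print(months)
```

The original attempt, [\D]{3,}, picked up the surrounding spaces because \D means "anything that is not a digit", and a space is not a digit. [^\W\d_] excludes spaces as well, so no trimming is needed afterwards.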
Upvotes: 2