Dealing with spaces in regex

Question

I'm a RegEx newbie and this has been driving me nuts for the past 48 hours. I tried everything I could while reading hundreds of examples and documents. I want to learn.

I need to extract the month name from strings like these, with the month being the word in the middle (multilingual):

10 july  2014
9 dicembre2014
1januar2011
18août2002 (note: non-[A-z] character in the month if it matters)

The closest I got was [\D]{3,}(?=.{4,}) yielding:

' july '
' dicembre'
'januar'
'août'

But it still matches the spaces around the name. I tried adding [^\s] but obviously it's not that simple.

What is the simplest RegEx way to find the right match?

Mariano · Accepted Answer

If you set the re.UNICODE flag (already the default for str patterns in Python 3), \w is Unicode-aware and also matches letters from all scripts (including û, ñ, á, etc.). Then [^\W\d_] matches only letters, but from any script:

\w matches word characters (letters, digits or underscore "_")
\W is the negated shorthand, it matches non-word characters (same as [^\w])
\d matches digits
So [^\W\d_] will match anything EXCEPT non-word characters, digits or "_"... which means it will only match letters
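
As a quick check of the character class on its own, here is a minimal sketch. The sample string and the variable names are illustrative and not part of the question:

```python
import re

# Letters only: "not a non-word character, not a digit, not an underscore"
letters = re.compile(r'[^\W\d_]+')

# \u00fb is the precomposed "û"; digits, spaces and "_" act as separators
sample = "18ao\u00fbt2002 foo_bar 9 dicembre"
found = letters.findall(sample)
print(found)
```

Note that the underscore splits foo_bar into two matches, which is exactly what excluding "_" from the class is for.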

Code:

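# Python 3.4.3
import re

# Triple-quoted so the sample can span several lines
# (also avoids shadowing the built-in name "str")
text = u"""10 july  2014
 9 dicembre2014
 1januar2011
 18août2002"""

# Group 1: day, group 2: month name (letters only), group 3: 2- or 4-digit year
pattern = r'([0-3]?\d)\s*([^\W\d_]{3,})\s*((?:\d{2}){1,2})'
result = re.findall(pattern, text, re.UNICODE)

for date in result:
    print(date)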

Output:

('10', 'july', '2014')
('9', 'dicembre', '2014')
('1', 'januar', '2011')
('18', 'août', '2002')
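
Since the question asks for the month name alone, one option is to keep only the second capture group. A minimal sketch, assuming each date arrives as a separate string; the helper name month_name is hypothetical:

```python
import re

# Same pattern as in the answer: day, month (letters only), year
DATE_RE = re.compile(r'([0-3]?\d)\s*([^\W\d_]{3,})\s*((?:\d{2}){1,2})')

def month_name(date_string):
    """Return the month word from a date string, or None if nothing matches."""
    match = DATE_RE.search(date_string)
    return match.group(2) if match else None

# \u00fb is the precomposed "û"
for s in ["10 july  2014", "9 dicembre2014", "1januar2011", "18ao\u00fbt2002"]:
    print(month_name(s))
```

Because the \s* parts sit outside the group, the surrounding spaces are consumed by the match but never end up in the captured month.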

