chris

Reputation: 649

Dealing with spaces in regex

I'm a RegEx newbie and this has been driving me nuts for the past 48 hours. I tried everything I could while reading hundreds of examples and documents. I want to learn.

I need to extract the month name from strings like these, with the month being the word in the middle (multilingual):

10 july  2014
9 dicembre2014
1januar2011
18août2002 (note: non-[A-z] character in the month if it matters)

The closest I got was [\D]{3,}(?=.{4,}) yielding:

' july '
' dicembre'
'januar'
'août'

But it still matches the spaces around the name. I tried adding [^\s] but obviously it's not that simple.

What is the simplest RegEx way to find the right match?

Upvotes: 2

Views: 117

Answers (1)

Mariano

Reputation: 6511

With the re.UNICODE flag set, \w matches word characters from every script, including letters such as û, ñ and á. (In Python 3 this is already the default for str patterns, so the flag is redundant there; it is only needed in Python 2.) The class [^\W\d_] then matches only letters, but from any script:

  • \w matches word characters (letters, digits or underscore "_")
  • \W is the negated shorthand, it matches non-word characters (same as [^\w])
  • \d matches digits
  • So [^\W\d_] matches anything EXCEPT non-word characters, digits or "_", which leaves only letters
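
As a quick check of the reasoning in the list above, here is a minimal sketch that tests the class against a few single characters. The sample characters are my own picks for illustration and do not come from the question.

```python
import re

# Character class under test: word characters minus digits and underscore.
LETTER = r'[^\W\d_]'

# Illustrative samples (an assumption, not from the question):
# four letters, then a digit, an underscore, a space and a hyphen.
samples = ['a', 'û', 'ñ', 'Z', '7', '_', ' ', '-']

# Keep only the characters that the class matches in full.
letters_only = [ch for ch in samples if re.fullmatch(LETTER, ch)]
print(letters_only)  # ['a', 'û', 'ñ', 'Z']
```

The digit, underscore, space and hyphen are all rejected, which is why the class can sit between \s* on both sides without swallowing the surrounding spaces.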

Code:

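# Python 3.4.3
import re

# Sample dates from the question, separated by newlines
text = "10 july  2014 \n 9 dicembre2014 \n 1januar2011\n 18août2002"
# Groups: day, month name (3+ letters from any script), 2- or 4-digit year
pattern = r'([0-3]?\d)\s*([^\W\d_]{3,})\s*((?:\d{2}){1,2})'
result = re.findall(pattern, text, re.UNICODE)

for date in result:
    print(date)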

Output:

('10', 'july', '2014')
('9', 'dicembre', '2014')
('1', 'januar', '2011')
('18', 'août', '2002')
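
If only the month name is needed, as the question asks, the same letter class can be used on its own. This is a rough sketch rather than a definitive solution: month_name is a hypothetical helper name, and it assumes each string contains exactly one run of three or more letters.

```python
import re

def month_name(text):
    """Return the first run of 3+ letters in text, or None if there is none.

    Hypothetical helper for illustration; assumes the month is the only
    such run of letters in the string.
    """
    match = re.search(r'[^\W\d_]{3,}', text)
    return match.group(0) if match else None

# The four sample strings from the question.
samples = ['10 july  2014', '9 dicembre2014', '1januar2011', '18août2002']
months = [month_name(s) for s in samples]
print(months)  # ['july', 'dicembre', 'januar', 'août']
```

Because the class excludes whitespace and digits, the match stops at the month's boundaries and no trimming is needed afterwards.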

Upvotes: 2
