user3246693

Reputation: 783

Python Regular expression not returning as expected

I am having trouble understanding the output of this regular expression. I am using the following regex to find dates in text:

^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$

It appears to be matching the pattern within text correctly, but I'm confused by the return values.

For this test string:

TestString = "10-20-2015"

It's returning this:

[('10', '20', '', '')]

If I put () around the entire regex, I get this returned:

[('10-20-2015', '10', '20', '', '')]

I would expect it to simply return the full date string, but it appears to be breaking the results up and I don't understand why. Wrapping my regex in () returns the full date string, but it also returns 4 extra values.

How do I make this ONLY match the full date string and not small parts of the string?

From my console:

Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> TestString = "10-20-2015"
>>> re.findall(pattern, TestString, re.I)
[('10', '20', '', '')]
>>> pattern = "(^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$)"
>>> re.findall(pattern, TestString, re.I)
[('10-20-2015', '10', '20', '', '')]
>>> 
>>> TestString = "10--2015"
>>> re.findall(pattern, TestString, re.I)
[]
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> re.findall(pattern, TestString, re.I)
[]

Based on the response, this is the pattern I ended up with: ((?:(?:1[0-2]|0[1-9])-(?:3[01]|[12][0-9]|0[1-9])|(?:3[01]|[12][0-9]|0[1-9])-(?:1[0-2]|0[1-9]))-(?:[0-9]{2})?[0-9]{2})

Upvotes: 2

Views: 249

Answers (2)

Saily_Shah

Reputation: 85

You can do this with re.search(), which scans through a string looking for the first location where the pattern matches and returns a match object (or None if there is no match).

import re

text = "10-20-2015"

# Raw string, so the backslashes reach the regex engine unchanged.
date_regex = r'(\d{1,2})-(\d{1,2})-(\d{4})'

# \d in the pattern above matches a single digit.
# The numbers in curly brackets {} give how many repetitions are allowed.
# Parentheses create capturing groups, so each part of the match
# can be retrieved separately.

search_date = re.search(date_regex, text)

# for entire match
print(search_date.group())
# also print(search_date.group(0)) can be used
 
# for the first parenthesized subgroup
print(search_date.group(1))
 
# for the second parenthesized subgroup
print(search_date.group(2))
 
# for the third parenthesized subgroup
print(search_date.group(3))
 
# for a tuple of all matched subgroups
print(search_date.group(1, 2, 3))

Output of the print statements above:

10-20-2015
10
20
2015
('10', '20', '2015')
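
The same group(0) idea also works with the stricter pattern from the question when the dates sit inside longer text. The following is only a sketch: it assumes the ^ and $ anchors are dropped so the pattern can match mid-string, and sample_text and full_dates are made-up names for illustration.

```python
import re

# The question's pattern without the ^ and $ anchors (an assumption,
# so that it can match dates embedded in a longer string).
pattern = re.compile(
    r"(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])"
    r"|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))"
    r"-(?:[0-9]{2})?[0-9]{2}"
)

# Hypothetical sample text, not from the question.
sample_text = "Shipped 10-20-2015, returned 25-12-2015."

# group(0) is always the whole match, no matter how many
# capturing groups the pattern contains.
full_dates = [m.group(0) for m in pattern.finditer(sample_text)]
print(full_dates)  # ['10-20-2015', '25-12-2015']
```

Because finditer yields match objects, the capturing groups can stay in the pattern and still be read with group(1), group(2), and so on when they are needed.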

Hope this answer clears your doubt :-)

Upvotes: 0

Fabricator

Reputation: 12772

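Every () is a capturing group: (1[0-2]|0?[1-9]) captures 10, (3[01]|[12][0-9]|0?[1-9]) captures 20, and so on. The two groups in the other alternative do not take part in the match, so they come back as empty strings. When a pattern contains groups, re.findall returns the groups rather than the whole match. When you surround everything in (), that group opens before the others, so it becomes group 1 and captures the entire match. To group without capturing, use (?:...) instead of (...); this is called a non-capturing group.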
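
As a rough illustration of the difference, here is the question's pattern next to a variant where every inner group is made non-capturing. The variable names are made up for this sketch; the first pattern and the test string are taken from the question.

```python
import re

# Original pattern from the question: four capturing groups.
capturing = (
    r"^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])"
    r"|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))"
    r"-(?:[0-9]{2})?[0-9]{2}$"
)

# Same pattern with every inner (...) changed to (?:...),
# so nothing is captured.
non_capturing = (
    r"^(?:(?:1[0-2]|0?[1-9])-(?:3[01]|[12][0-9]|0?[1-9])"
    r"|(?:3[01]|[12][0-9]|0?[1-9])-(?:1[0-2]|0?[1-9]))"
    r"-(?:[0-9]{2})?[0-9]{2}$"
)

test_string = "10-20-2015"

# With groups, findall returns one tuple of groups per match.
with_groups = re.findall(capturing, test_string)
# Without groups, findall returns the full matched strings.
without_groups = re.findall(non_capturing, test_string)

print(with_groups)     # [('10', '20', '', '')]
print(without_groups)  # ['10-20-2015']
```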

Upvotes: 2
