SpicyClubSauce

Reputation: 4276

regex noob questions

so this is my string:

"""$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>"""

and i know that this is the proper regex formula to give me what I want (output follows):

import re
age = re.match(r'\$([\d.]+)\. (.+), ([\d-]+)', example)
print(age.groups())

output ====> ('10', '2109 W. Chicago Ave.', '773-772-0406')

but i have some questions about the regex formula even after reading the doc:

  1. When grouped with () parentheses, those are the separate tuple values the regex is ultimately returning, right?
  2. If I delete the $ sign, why does the whole thing completely break down with error: unbalanced parenthesis? Shouldn't the regular expression be able to grab the price after the $ regardless of whether I specified $ beforehand? And building off that, if I want the output to be $10, not 10, why can't I move the $ inside and simply run r'\($[\d.]+)'? It throws me another unbalanced parenthesis error.
  3. After the (.+), in the middle, is the comma the only way Python knows we are done with the value to be slotted into the second tuple value slot? So (.+) doesn't really mean 'any character', does it? A comma would move it on to the next character if it happened to be followed by a digit, right?
  4. Could someone explain the placement of the + signs inside the parentheses rather than outside and how that makes a difference?

sorry for the terribly noob questions. ill get good one day. thanks in advance.

Upvotes: 0

Views: 101

Answers (1)

Martin Konecny

Reputation: 59691

When grouped with the ()parenthesis, those are the separate tuple values the regex is ultimately returning, right?

Correct

If I delete the $ sign, why does the whole thing completely break down with error:unbalanced parenthesis? shouldn't the regular expression be able to grab the price after the $ regardless of if I specified $ beforehand?

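If you delete the dollar sign, your escape character \ now escapes the opening parenthesis ( instead, telling the regex engine to treat it as a literal character it needs to search for in your string rather than as the start of a group. The closing ) then has no opening parenthesis to pair with, hence the unbalanced parenthesis error. Delete the \ along with the $ to avoid it. Note also that re.match only matches at the start of the string, so without \$ in the pattern you would need re.search to skip over the leading $.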
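A minimal sketch of that failure, using the string from the question. The variable names (broken, fixed, compile_error) are only for illustration.

```python
import re

example = '$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>'

# Only the $ was deleted, so the backslash now escapes the "(".
# The later ")" has nothing to close, and compiling the pattern fails.
broken = r'\([\d.]+)\. (.+), ([\d-]+)'
try:
    re.compile(broken)
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

# Deleting the backslash together with the $ gives a valid pattern.
# re.search is used because re.match would be stuck on the leading "$".
fixed = r'([\d.]+)\. (.+), ([\d-]+)'
groups = re.search(fixed, example).groups()

print(compile_error)
print(groups)
```

The first print shows the error message from the regex compiler; the second shows the same three groups as in the question.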

after the (.+), in the middle, is the comma the only way python knows we are done with the value to be slotted into the second tuple value slot?

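Yes, it tells Python to capture 1 or more of almost any character, up until the last comma from which the rest of the pattern (a space followed by digits or hyphens) can still match. . matches almost any single character (anything except a newline by default). .+ matches 1 or more of them.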

Note that .+ is greedy, meaning it will keep consuming characters, commas included, and only give back as much as is needed for the rest of the pattern to match. If you want it to stop at the first comma that works, you can make it lazy using .+?
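To see the difference between greedy and lazy, here is a small sketch. The input is a shortened version of the question's string with a second, made-up phone number (312-555-0100) appended so that two commas are followed by digits.

```python
import re

# Hypothetical input: the second phone number is invented for the demo.
text = '2109 W. Chicago Ave., 773-772-0406, 312-555-0100'

# Greedy: .+ grabs as much as it can and backs off to the LAST ", "
# that is followed by digits or hyphens.
greedy = re.match(r'(.+), ([\d-]+)', text).groups()

# Lazy: .+? grabs as little as it can and stops at the FIRST ", "
# that is followed by digits or hyphens.
lazy = re.match(r'(.+?), ([\d-]+)', text).groups()

print(greedy)  # ('2109 W. Chicago Ave., 773-772-0406', '312-555-0100')
print(lazy)    # ('2109 W. Chicago Ave.', '773-772-0406')
```

In the original string from the question both versions give the same result, because the text after the final comma starts with < and so cannot match [\d-]+.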

could someone explain the placement of the + signs inside the parenthesis rather than outside and how that makes a difference?

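It doesn't change the behaviour of the + itself, whether it's on the inside or the outside. It just changes what gets captured into the group: with (.+) the group holds the whole repeated run, whereas with (.)+ the group itself is repeated and only keeps the last character it matched.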
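A quick sketch of the inside/outside difference, using \d and the first part of the phone number as a stand-in (the variable names are just for illustration):

```python
import re

# + inside the group: the group captures the whole run of digits.
inside = re.match(r'(\d+)', '773').group(1)

# + outside the group: the group is repeated once per digit,
# and only the last repetition is kept.
outside_match = re.match(r'(\d)+', '773')
outside = outside_match.group(1)

# The overall match is the same length either way.
whole = outside_match.group(0)

print(inside)   # 773
print(outside)  # 3
print(whole)    # 773
```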

EDIT:

Why can't I move the $ inside and simply run r'\($[\d.]+)'? It throws me another unbalanced parenthesis error.

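This is because $ also has a special meaning in regex (it matches at the end of the string or line), just like ( and ), so you need to escape it if you want to match the literal character: \$. In r'\($[\d.]+)' the backslash was left in front of the (, so it escapes the parenthesis instead of the $, and you get the same unbalanced parenthesis error as before. Keep the backslash attached to the $ and write r'(\$[\d.]+)'.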
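As a rough illustration with the string from the question, keeping the backslash on the $ and moving both inside the group puts the dollar sign in the captured price. The second pattern is included only to show what an unescaped $ does; it compiles but cannot match here.

```python
import re

example = '$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a>'

# \$ inside the group: the literal dollar sign becomes part of group 1.
groups = re.match(r'(\$[\d.]+)\. (.+), ([\d-]+)', example).groups()

# Unescaped $ is an end-of-string anchor, so it cannot match
# at the start of a non-empty string and re.match returns None.
unescaped = re.match(r'($[\d.]+)\. (.+), ([\d-]+)', example)

print(groups)     # ('$10', '2109 W. Chicago Ave.', '773-772-0406')
print(unescaped)  # None
```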

Upvotes: 3
