The Wizard of Dos

Reputation: 63

Regex giving tuple and not full match

I'm trying to use a regex to find proxy addresses on a website. Currently I'm using this pattern: (\d{1,3}\.){3}\d{1,3}:(\d+). It works on regexr.com and in Sublime Text, but when I use it in Python it doesn't behave as expected.

This is the piece of code I'm using:

p = re.compile("(\d{1,3}\.){3}\d{1,3}:(\d+)")
ipCandidates = p.findall(soupString)

It should return proxies like 120.206.182.172:8123, but instead it returns tuples like ('44.', '3128'). What can I do to fix this?

Thank you.

Upvotes: 3

Views: 1066

Answers (1)

Tim Pietzcker

Reputation: 336128

If the regex contains capturing groups, re.findall() returns the contents of those groups instead of the whole match.

On top of that, you're repeating a capturing group three times, which means that only the last (third) repetition is preserved; the first two are overwritten.
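To make both effects visible, here is a minimal sketch that runs the pattern from the question against a made-up string. The `sample` text and the variable names are assumptions for illustration only, not the asker's data.

```python
import re

# Hypothetical input standing in for the scraped page text.
sample = "proxy 120.206.182.172:8123 here"

# The pattern from the question, written as a raw string.
capturing = re.compile(r"(\d{1,3}\.){3}\d{1,3}:(\d+)")

# findall() returns one tuple per match, holding the groups only.
# Group 1 was repeated three times, so it keeps just the last repetition.
groups = capturing.findall(sample)
print(groups)  # [('182.', '8123')]
```

The first two octets and the last one never show up: "120." and "206." were overwritten by "182.", and "172" sits outside any group.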

Change your regex to

p = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")

and you'll get whole matches.
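As a quick check of the non-capturing version, the sketch below runs it on a made-up string (`sample` and the variable names are assumptions). It also shows one alternative not mentioned above: keeping the capturing groups and reading `group(0)` from `re.finditer()`, which is always the whole match.

```python
import re

# Hypothetical sample text standing in for the scraped page.
sample = "first 120.206.182.172:8123, second 10.0.0.1:3128"

# (?:...) groups without capturing, so findall() returns whole matches.
non_capturing = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")
whole_matches = non_capturing.findall(sample)
print(whole_matches)  # ['120.206.182.172:8123', '10.0.0.1:3128']

# Alternative: keep the capturing groups and take group(0) from each match.
capturing = re.compile(r"(\d{1,3}\.){3}\d{1,3}:(\d+)")
via_finditer = [m.group(0) for m in capturing.finditer(sample)]
print(via_finditer)  # same list as whole_matches
```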

If you do want tuples of the separate submatches (without the dots and colon), you can do that, too, but you can't use repetition then:

p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")
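A short sketch of what the fully spelled-out pattern returns, assuming a made-up input string. Unpacking the tuple into `octets` and `port` is just one possible way to consume it.

```python
import re

# One capturing group per octet plus one for the port; no repetition.
p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")

# Hypothetical input for illustration.
parts = p.findall("120.206.182.172:8123")
print(parts)  # [('120', '206', '182', '172', '8123')]

# The submatches are strings; convert them if numbers are needed.
*octets, port = parts[0]
address = ".".join(octets)
port_number = int(port)
print(address, port_number)  # 120.206.182.172 8123
```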

Also, always use raw strings for regexes, so regex escape sequences and string escape sequences can't be confused.
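A small illustration of why this matters, using `\b` as the example since it means "backspace" in a normal string literal but "word boundary" in a regex. The test string and variable names are made up for the sketch.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"    # the backslashes reach the regex engine untouched

print(len(plain), len(raw))  # 5 7

text = "a cat sat"
plain_hits = re.findall(plain, text)  # looks for backspace + "cat" + backspace
raw_hits = re.findall(raw, text)      # looks for the whole word "cat"
print(plain_hits)  # []
print(raw_hits)    # ['cat']
```

`\d` happens to survive in a normal string because Python does not recognise it as a string escape, but recent Python versions warn about such invalid escapes, so the raw string is the safer habit.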

Upvotes: 4
