Reputation: 63
I'm trying to use regex to find proxy address on a website. Currently I'm using this piece of regex (\d{1,3}\.){3}\d{1,3}:(\d+)
. It works on regexr.com and in sublime text, but when I try to use it in Python it doesn't work as expected.
This is the piece of code I'm using:
p = re.compile("(\d{1,3}\.){3}\d{1,3}:(\d+)")
ipCandidates = p.findall(soupString)
It should return proxies like this 120.206.182.172:8123
but it returns tuples like this ('44.', '3128')
. What can I do to fix this?
Thank you.
Upvotes: 3
Views: 1066
Reputation: 336128
re.findall()
only returns the contents of capturing groups instead of the whole match (if you have such groups in your regex).
Then, you're repeating a capturing group three times, which means that only the third repetition is preserved (the other two are overwritten).
Change your regex to
p = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")
and you'll get whole matches.
If you do want tuples of the separate submatches (without the dots and colon), you can do that, too, but you can't use repetition then:
p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")
Also, always use raw strings for regexes, so regex escape sequences and string escape sequences can't be confused.
Upvotes: 4