Cephon

Reputation: 41

Python regex issues

I'm trying to grab proxies from a site using Python by scanning through the page with urllib and finding the proxies with a regex.

A proxy on the page looks something like this:

<a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">190.207.169.184</a></td><td>8080</td><td>

My code looks like this:

for site in sites:
    content = urllib.urlopen(site).read()
    e = re.findall("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+", content)
    #\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+

    for proxy in e:
        s.append(proxy)
        amount += 1

Regex:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+

I know that the code itself works, but the regex seems to be wrong.

Any idea on how I could fix this?

EDIT: http://www.regexr.com/ seems to think my regex is fine?

Upvotes: 0

Views: 69

Answers (1)

alecxe

Reputation: 474161

One option would be to use an HTML parser to find IP addresses and ports.

Example (using BeautifulSoup HTML parser):

import re
import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers')

IP_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
PORT_RE = re.compile(r'\d+')

soup = BeautifulSoup(data)
for ip in soup.find_all('a', text=IP_RE):
    port = ip.parent.find_next_sibling('td', text=PORT_RE)
    print ip.text, port.text

Prints:

80.193.214.231 3128
186.88.37.204 8080
180.254.72.33 80
201.209.27.119 8080
...

The idea here is to find all a tags with the text matching \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} regular expression. For each link, find the parent's next td sibling with the text matching \d+.
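If installing BeautifulSoup is not an option, the same navigate-from-the-IP-link idea can be sketched with the standard library's html.parser instead (a Python 3 sketch run against the sample row from the question; ProxyExtractor is just an illustrative name):

```python
import re
from html.parser import HTMLParser

IP_RE = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}$')


class ProxyExtractor(HTMLParser):
    """Collect (ip, port) pairs: an <a> whose text is an IP address,
    followed by the text of the next <td> if it is all digits."""

    def __init__(self):
        super().__init__()
        self.in_a = False           # currently inside an <a> tag
        self.pending_ip = None      # IP seen, waiting for its port cell
        self.in_port_td = False     # inside the <td> that should hold the port
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
        elif tag == 'td' and self.pending_ip:
            self.in_port_td = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        data = data.strip()
        if self.in_a and IP_RE.match(data):
            self.pending_ip = data
        elif self.in_port_td and data.isdigit():
            self.pairs.append((self.pending_ip, data))
            self.pending_ip = None
            self.in_port_td = False


html = ('<td><a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">'
        '190.207.169.184</a></td><td>8080</td><td>HTTP</td>')

parser = ProxyExtractor()
parser.feed(html)
print(parser.pairs)  # [('190.207.169.184', '8080')]
```

This is more code than the BeautifulSoup version, which is exactly why an HTML parser library is the more comfortable choice here.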


Alternatively, since you know the table structure and the columns where there are IPs and ports, you can just get the cell values from each row by index, no need to dive into regular expressions here:

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers')

soup = BeautifulSoup(data)
for row in soup.find_all('tr', id='data'):
    print [cell.text for cell in row('td')[1:3]]

Prints:

[u'80.193.214.231', u'3128']
[u'186.88.37.204', u'8080']
[u'180.254.72.33', u'80']
[u'201.209.27.119', u'8080']
[u'190.204.96.72', u'8080']
[u'190.207.169.184', u'8080']
[u'79.172.242.188', u'8080']
[u'1.168.171.100', u'8088']
[u'27.105.26.162', u'9064']
[u'190.199.92.174', u'8080']
...
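As for the EDIT in the question: regexr.com tests JavaScript-flavored patterns, where the stray `\a` escape falls back to a literal `a`. In Python, `\a` means the ASCII bell character, so the original pattern can never match `</a>`. A quick check against the sample row from the question (Python 3):

```python
import re

row = ('<a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">'
       '190.207.169.184</a></td><td>8080</td><td>')

# The original pattern: in Python, \a is the bell character, not a literal "a"
broken = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+")
print(broken.findall(row))  # [] -- "</a>" is never matched

# Without the stray escapes, and with groups for the IP and the port
fixed = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})</a></td><td>(\d+)")
print(fixed.findall(row))   # [('190.207.169.184', '8080')]
```

Even so, the parser-based approaches above are more robust than matching raw HTML with a regex.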

Upvotes: 3
