wayne
wayne

Reputation: 13

python 2.7 grab IPs from Source code

I've already googled my question but i havent any solution at the moment. I want to grab the IPs and ports from this html content: (I've got this content as a String)

Ive read about beautiful soup and regexp - ive tried both but i cant get a solution - and beautiful soup is very slow. sry for my bad english.

<tr class="proxyListOdd">
<td><a href="http://whois.sc/81.196.122.86" target="_blank">81.196.122.86</a></td>
<td>8080</td>
<td>Nein</td>
<td>3</td>
<td class="proxyList_Ping" >0.44 Sek.</td>
<td><img height="24px" width="24px" alt="Rumänien" title="Rumänien" src="http://static2.proxy-listen.de/0_proxy/images/flags/ro.png"></td>
<td class="proxyList_Online arrowUp">97% </td>
<td>22:06</td>
<td><input style="align: center" title="Proxyserver übernehmen" type="image" src="/0_proxy/images/ProxyswitcherButtonOn.png" onclick="de.proxy_listen.setProxy({'U2a66iQA': '70ODEuMTk2LjEyMi44Ng==', 'uhSRlFfS': '96ODA4MA==', 'h0zMxtxH':'21MQ=='}, 'https://addons.mozilla.org/addon/proxy-listen-de_proxyswitcher/');"></td>
<td><a href='proxy:name=Proxy-listen.de&host=81.196.122.86&port=8080&foxyProxyMode=this&confirmation=popup' title="Proxyserver in FoxyProxy übernehmen."><img height="24px" width="22px" alt="FoxyProxy" src="http://static.proxy-listen.de/0_proxy/images/foxyproxy.png"></a></td>
</tr>
<tr class="proxyListEven">
<td><a href="http://whois.sc/94.126.17.68" target="_blank">94.126.17.68</a></td>
<td>3128</td>
<td>Nein</td>
<td>3</td>
<td class="proxyList_Ping" >0.95 Sek.</td>
<td><img height="24px" width="24px" alt="Schweiz" title="Schweiz" src="http://static2.proxy-listen.de/0_proxy/images/flags/ch.png"></td>
<td class="proxyList_Online arrowUp">86% </td>
<td>22:06</td>
<td><input style="align: center" title="Proxyserver übernehmen" type="image" src="/0_proxy/images/ProxyswitcherButtonOn.png" onclick="de.proxy_listen.setProxy({'U2a66iQA': '65OTQuMTI2LjE3LjY4', 'uhSRlFfS': '78MzEyOA==', 'h0zMxtxH':'52MQ=='}, 'https://addons.mozilla.org/addon/proxy-listen-de_proxyswitcher/');"></td>
<td><a href='proxy:name=Proxy-listen.de&host=94.126.17.68&port=3128&foxyProxyMode=this&confirmation=popup' title="Proxyserver in FoxyProxy übernehmen."><img height="24px" width="22px" alt="FoxyProxy" src="http://static.proxy-listen.de/0_proxy/images/foxyproxy.png"></a></td>
</tr>
<tr class="proxyListOdd">
<td><a href="http://whois.sc/89.105.247.13" target="_blank">89.105.247.13</a></td>
<td>3128</td>
<td>Nein</td>

hope you can help me ;) mfg henry

Upvotes: 0

Views: 202

Answers (4)

anonymous
anonymous

Reputation: 1532

Have you considered using something like minidom? From the documentation:

xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller.

Upvotes: 0

phihag
phihag

Reputation: 288180

Use a regular expression:

>>> import re
>>> set(m.group(0) for m in re.finditer(r'([0-9]{1,3}\.){3}[0-9]{1,3}', s))
{'81.196.122.86', '94.126.17.68', '89.105.247.13'}

Note that this regular expression is simplified, and doesn't actually capture all IP addresses (and captures some values that are not). If you want a more precise match, according to inet_addr(3) and RFC 4291, the whole regular expression looks like:

# IPv4, common format
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])|
# IPv4, dotted hexadecimal
(?:0x[0-9a-fA-F]{2}\.){3}0x[0-9a-fA-F]{2}|
# IPv4, dotted octal
0[0-7]{3}\.){3}0[0-7]{3}|
# IPv4, one number, hexadecimal
0x[0-9a-fA-F]{1,8})|
# IPv4, one number, octal
0[0-7]{1,11})|
# IPv4, one number, hexadecimal
[1-4][0-9]{9}|0|[1-9][0-9]{0,7}|
# IPv6, preferred form (RFC 4291 2.2.1)
(?:[0-9a-fA-F]{1,4}){7}[0-9a-fA-F]{1,4}|
# IPv6, compressed syntax (RFC 4291 2.2.2)
(?:
  [0-9a-fA-F]{0,4}::(?:[0-9a-fA-F]{1,4}:){,6}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){1}::(?:[0-9a-fA-F]{1,4}:){,4}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){2}::(?:[0-9a-fA-F]{1,4}:){,3}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){3}::(?:[0-9a-fA-F]{1,4}:){,2}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){4}::(?:[0-9a-fA-F]{1,4}:){,1}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){5}::[0-9a-fA-F]{0,4}
)|
# IPv6, alternative form (RFC 4291 2.2.3, uncompressed)
(?:[0-9a-fA-F]{1,4}){6}|(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]))|
# IPv6, alternative form (RFC 4291 2.2.3, compressed)
(?:
  [0-9a-fA-F]{0,4}::(?:[0-9a-fA-F]{1,4}:){,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){1}::(?:[0-9a-fA-F]{1,4}:){,3}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){2}::(?:[0-9a-fA-F]{1,4}:){,2}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){3}::(?:[0-9a-fA-F]{1,4}:){,1}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){4}::
)
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]))

As you can see, if you really want to match all IP addresses, you should search for the approximate format, and then (if necessary) validate the addresses, for example with ipaddress. Note that the above regular expression is incomplete for your case as it does not cover possible HTML character encodings such as &#x31; for 1.

Upvotes: 3

Trickfire
Trickfire

Reputation: 443

See Similar question

EDIT: For this particular case, you would have to do something different and regex out the data from this particular set of HTML data (as IP's appear multiple times):

print [ ":".join((y,z)) for x,y,z in re.findall('proxyList((?=Even)|(?=Odd)).*?_blank">(.*?)</a></td>.*?<td>([0-9]+)</td>',data,flags=re.DOTALL | re.MULTILINE)]

You could have also regex'ed on the 'proxy:name=Proxy-listen' Part, which Marco de Wit does.

Otherwise:

re.findall('(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)',data)

which finds all IPv4 addresses, to add ports onto that, modify it to be:

re.findall('((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)):([0-9]{1,5})*',data)

Which should find all IP's and ports in this format: XXX.XXX.XXX.XXX:YYYYY (That stated, it doesn't check if the ports are valid.

Upvotes: 0

Marco de Wit
Marco de Wit

Reputation: 2804

this works only with IPv4:

re.findall('(\d+\.\d+\.\d+\.\d+)&port=(\d+)',s)

Upvotes: 1

Related Questions