Pablo

Reputation: 165

Trying to extract 'text' from a tag using Python

I'm trying to extract the proxy IP address from the first column of this page (https://www.proxynova.com/proxy-server-list/country-fr/), just the address itself, for example "178.33.62.155". But when I extract all the text content of the relevant tag, the IP text is missing.

The HTML tag on the website is:

<td align="left"><script>document.write('23178.3'.substr(2) + '3.62.155');</script>178.33.62.155</td>

I expected the IP address above (after the script tag, inside the td tag) to appear when I print the text content, but it doesn't. With the code below, which is what I have so far, the IP address is the only information that does not appear.

Any idea how to extract this specific IP information, and why it doesn't appear when I extract all the text content of this tag?

from lxml import html
import requests
import re

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
tree = html.fromstring(page.content.decode('utf-8'))

for elem in tree.xpath('//table[@class="table"]//tbody//td[@align="left"]'):
    print(elem.text_content())

Upvotes: 1

Views: 577

Answers (2)

Bill Bell

Reputation: 21643

I admit that I wouldn't have got this without tell k's answer, because I missed how the IP addresses were encoded in the scripts.

import re
import requests
from lxml import etree

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/').text
parser = etree.HTMLParser()
tree = etree.fromstring(page, parser=parser)
table = tree.xpath('.//table[@id="tbl_proxy_list"]//script/text()')

for item in table:
    # '..' skips the two padding characters that substr(2) drops in the page's script
    m = re.match(r"document\.write\('..([0-9.]+)'[^']+'([0-9.]+)'", item)
    if m:
        print (''.join(m.groups()))
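
As for why text_content() misses the address: requests only downloads the HTML, it does not run JavaScript, so the text that document.write would add in a browser is most likely never present in the response. The sketch below illustrates this with the standard library only. RAW_CELL is an assumed reconstruction of the cell as the server sends it (the tag from the question without the trailing IP text), and TextCollector and cell_text are hypothetical helper names.

```python
from html.parser import HTMLParser

# Assumption: the cell as served contains only the script, not the rendered IP text.
RAW_CELL = (
    "<td align=\"left\"><script>"
    "document.write('23178.3'.substr(2) + '3.62.155');"
    "</script></td>"
)


class TextCollector(HTMLParser):
    """Collects every text node, including the source inside <script>."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def cell_text(fragment):
    """Return the concatenated text of an HTML fragment without running any script."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.chunks)


# Prints the JavaScript source; the assembled address never appears as text.
print(cell_text(RAW_CELL))
```

If that assumption holds, the only way to recover the address without a browser is to decode the script source itself, which is what the regex above does.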

Upvotes: 1

tell k

Reputation: 615

I recommend using BeautifulSoup, like this:

import requests
import re
from bs4 import BeautifulSoup

res = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
soup = BeautifulSoup(res.content, "lxml")

REGEX_JS = re.compile(r"^document\.write\('([^']+)'\.substr\(2\) \+ '([^']+)'\);$")

proxy_ip_list = []
for table in soup.find_all("table", id="tbl_proxy_list"):
    for script in table.find_all("script"):
        m = REGEX_JS.search(script.text)
        if m:
            proxy_ip_list.append(m.group(1)[2:] + m.group(2))

for ip in proxy_ip_list:
    print(ip)
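
The decoding step can be checked offline against the snippet from the question. This is a minimal sketch that assumes every script has the form document.write('<padded>'.substr(N) + '<rest>'); and reads N from the script instead of hard-coding 2. DOCUMENT_WRITE and decode_ip are hypothetical names, not part of the code above.

```python
import re

# Assumed script shape: document.write('<padded>'.substr(<n>) + '<rest>');
DOCUMENT_WRITE = re.compile(
    r"document\.write\('([^']+)'\.substr\((\d+)\)\s*\+\s*'([^']+)'\)"
)


def decode_ip(script_text):
    """Rebuild the string the script would write, or return None if it does not match."""
    m = DOCUMENT_WRITE.search(script_text)
    if m is None:
        return None
    padded, offset, rest = m.group(1), int(m.group(2)), m.group(3)
    # JavaScript's substr(n) with one argument drops the first n characters.
    return padded[offset:] + rest


sample = "document.write('23178.3'.substr(2) + '3.62.155');"
print(decode_ip(sample))  # 178.33.62.155
```

If the site changes how it obfuscates the address, the pattern stops matching and decode_ip returns None, so the caller can detect the change instead of collecting wrong values.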

Upvotes: 1
