Reputation: 8574
I am trying to scrape a webpage and extract prefixes and their names from it. However, I cannot extract some of the tags, and my guess is that there are invisible tags. Here is my Python code:
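# Assumed setup (not shown in the original snippet): Python 2 with urllib2 and bs4
import urllib2
from bs4 import BeautifulSoup

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://bgp.he.net/AS23028#_prefixes')
html = response.read()
soup = BeautifulSoup(html)
soup_1 = soup.find("table", id="table_prefixes4")
soup_2 = soup_1.findAll("td")
print soup_2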
Does anybody know how to get the name that follows the tags? Here is part of the HTML content of the page:
<div class="flag alignright floatright"><img alt="United States" src="/images/flags/us.gif?1282328089" title="United States"/></div>
</td>, <td class="nowrap">
<a href="/net/209.176.111.0/24">209.176.111.0/24</a>
</td>, <td>Savvis
I want to extract the prefix "209.176.111.0/24" and the name "Savvis" from the HTML.
Upvotes: 1
Views: 3183
Reputation: 1125398
The data is right there; nothing is missing in the page. The HTML doesn't appear to be broken (enough) for tags to be lost, nor is there any JavaScript altering the page in the browser:
for row in soup.select('table#table_prefixes4 tr'):
    print row.get_text(' - ', strip=True)
prints the whole table including the headers.
To get just the cells:
for row in soup.select('table#table_prefixes4 tr'):
    cells = row.find_all('td')
    if not cells:
        continue
    print [cell.get_text(strip=True) for cell in cells]
The latter produces:
>>> for row in soup.select('table#table_prefixes4 tr'):
...     cells = row.find_all('td')
...     if not cells:
...         continue
...     print [cell.get_text(strip=True) for cell in cells]
...
[u'38.229.0.0/16', u'PSINet, Inc.']
[u'38.229.0.0/19', u'PSINet, Inc.']
[u'38.229.32.0/19', u'PSINet, Inc.']
[u'38.229.64.0/19', u'PSINet, Inc.']
[u'38.229.128.0/17', u'PSINet, Inc.']
[u'38.229.252.0/22', u'PSINet, Inc.']
[u'68.22.187.0/24', u'AS23028.NET']
[u'192.138.226.0/24', u'Computer Systems Consulting Services']
[u'203.28.18.0/24', u'Information Technology Services']
[u'204.74.64.0/24', u'SAUNET']
[u'209.176.111.0/24', u'Savvis']
[u'216.90.108.0/24', u'Savvis']
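If you want the prefix and name as separate values rather than a printed list, the same loop can collect them into tuples. This is only a minimal sketch: it uses Python 3 syntax, the function name extract_prefixes is made up for illustration, and SAMPLE_HTML is a hand-written fragment that assumes the table has the same shape as the one on the page, so no network request is needed to try it.

```python
from bs4 import BeautifulSoup

# Hypothetical inline fragment imitating the structure of the prefixes table;
# with the real page you would pass the downloaded HTML instead.
SAMPLE_HTML = """
<table id="table_prefixes4">
  <thead><tr><th>Prefix</th><th>Description</th></tr></thead>
  <tbody>
    <tr>
      <td class="nowrap"><a href="/net/209.176.111.0/24">209.176.111.0/24</a></td>
      <td>Savvis
        <div class="flag alignright floatright"><img alt="United States" title="United States"/></div>
      </td>
    </tr>
    <tr>
      <td class="nowrap"><a href="/net/204.74.64.0/24">204.74.64.0/24</a></td>
      <td>SAUNET</td>
    </tr>
  </tbody>
</table>
"""


def extract_prefixes(html):
    """Return a list of (prefix, name) tuples from the prefixes table."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.select("table#table_prefixes4 tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header rows contain <th> cells, not <td>
        prefix = cells[0].get_text(strip=True)
        name = cells[1].get_text(strip=True)
        results.append((prefix, name))
    return results


prefixes = extract_prefixes(SAMPLE_HTML)
for prefix, name in prefixes:
    print(prefix, "-", name)
```

Because get_text(strip=True) drops whitespace-only strings and the flag image carries no text, the name cell comes out as plain "Savvis" even though it also contains the flag div.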
Upvotes: 1