Reputation: 260
I am trying to get a list of servers from this sample.
from bs4 import BeautifulSoup as bs
with open('html.txt', 'r') as html:
    soup = bs(html, 'html.parser')
    div = soup.find('div', class_='grid_8')
    for tag in div:
        tag = div.find_all('td', class_='StatTDLabel')[2].text
        print(tag)
I can get the first server in the list, but I can't iterate over the rest. When I use a for loop, I get the same result on every iteration.
Upvotes: 0
Views: 96
Reputation: 20042
Is this what you want?
from bs4 import BeautifulSoup
from tabulate import tabulate
sample_html = """The contents of your pastebin"""
soup = BeautifulSoup(sample_html, "html.parser").find_all("tr")
servers = [
    [i.getText(strip=True) for i in row.find_all("td")] for row in soup[1:]
]
print(tabulate(servers, headers=["Country", "Location", "Address", "Status"]))
Output:
Country    Location      Address               Status
---------  ------------  --------------------  -------------
ZA         Johannesburg  jnb-c17.ipvanish.com  15 % capacity
ZA         Johannesburg  jnb-c18.ipvanish.com  15 % capacity
ZA         Johannesburg  jnb-c19.ipvanish.com  31 % capacity
ZA         Johannesburg  jnb-c20.ipvanish.com  12 % capacity
ZA         Johannesburg  jnb-c21.ipvanish.com  9 % capacity
ZA         Johannesburg  jnb-c22.ipvanish.com  10 % capacity
AL         Tirana        tia-c02.ipvanish.com  17 % capacity
AL         Tirana        tia-c03.ipvanish.com  23 % capacity
AL         Tirana        tia-c04.ipvanish.com  19 % capacity
AL         Tirana        tia-c05.ipvanish.com  15 % capacity
AE         Dubai         dxb-c01.ipvanish.com  30 % capacity
AE         Dubai         dxb-c02.ipvanish.com  26 % capacity
To get only the server addresses, pick the third column, which has an index of 2.
For example:
servers = [
    [i.getText(strip=True) for i in row.find_all("td")][2] for row in soup[1:]
]
print("\n".join(servers))
Output:
jnb-c17.ipvanish.com
jnb-c18.ipvanish.com
jnb-c19.ipvanish.com
jnb-c20.ipvanish.com
jnb-c21.ipvanish.com
jnb-c22.ipvanish.com
tia-c02.ipvanish.com
tia-c03.ipvanish.com
tia-c04.ipvanish.com
tia-c05.ipvanish.com
dxb-c01.ipvanish.com
dxb-c02.ipvanish.com
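As for why the loop in the question prints the same value each time: `div.find_all('td', class_='StatTDLabel')[2]` is evaluated against the whole div on every pass, so it always returns the same third cell. Here is a minimal sketch of iterating row by row instead. The `SAMPLE_HTML` string and the `server_addresses` helper are hypothetical stand-ins, since the real file is not shown here; the sketch assumes each data row holds four `td` cells with the address in the third.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for html.txt; the real markup may differ.
SAMPLE_HTML = """
<div class="grid_8">
  <table>
    <tr><th>Country</th><th>Location</th><th>Address</th><th>Status</th></tr>
    <tr>
      <td class="StatTDLabel">ZA</td>
      <td class="StatTDLabel">Johannesburg</td>
      <td class="StatTDLabel">jnb-c17.ipvanish.com</td>
      <td class="StatTDLabel">15 % capacity</td>
    </tr>
    <tr>
      <td class="StatTDLabel">AL</td>
      <td class="StatTDLabel">Tirana</td>
      <td class="StatTDLabel">tia-c02.ipvanish.com</td>
      <td class="StatTDLabel">17 % capacity</td>
    </tr>
  </table>
</div>
"""


def server_addresses(html):
    """Return the text of the third StatTDLabel cell of every table row."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="grid_8")
    addresses = []
    for row in div.find_all("tr"):
        # Search inside the current row, not the whole div.
        cells = row.find_all("td", class_="StatTDLabel")
        if len(cells) > 2:  # skips header rows and short rows
            addresses.append(cells[2].get_text(strip=True))
    return addresses


print("\n".join(server_addresses(SAMPLE_HTML)))
```

The length check makes the header row drop out without slicing, which may be handy if the page contains more than one table.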
Upvotes: 2
Reputation: 719
Try this:
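from bs4 import BeautifulSoup as bs

with open('html.txt', 'r') as html:
    soup = bs(html, 'html.parser')

# Locate the container div from the question before searching its cells.
div = soup.find('div', class_='grid_8')
tags = div.find_all('td', class_='StatTDLabel')
for tag in tags:
    # Take only the element's immediate text and ignore text in child elements.
    # (string= is the current spelling of the older text= argument.)
    tagtext = tag.find(string=True, recursive=False)
    if tagtext:
        print(tagtext)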
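To illustrate what the non-recursive lookup does compared with `get_text()`, here is a small sketch. The two cells and the nested `<span>` are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one with direct text plus a child element,
# one whose only text lives inside a child element.
html = (
    '<table><tr>'
    '<td class="StatTDLabel">jnb-c17.ipvanish.com<span>online</span></td>'
    '<td class="StatTDLabel"><span>nested only</span></td>'
    '</tr></table>'
)
cells = BeautifulSoup(html, "html.parser").find_all("td", class_="StatTDLabel")

# First string that is a direct child of each cell (None if there is none).
direct_texts = [cell.find(string=True, recursive=False) for cell in cells]

# All text in each cell, including text inside child elements.
full_texts = [cell.get_text() for cell in cells]

print(direct_texts)
print(full_texts)
```

For the first cell the direct lookup yields only the hostname, while `get_text()` also picks up the span's text. The second cell has no direct text, so the lookup returns `None`, which is why the `if tagtext:` check is needed.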
Upvotes: 0