Mujtaba

Reputation: 260

Unable to extract data using BeautifulSoup

I am trying to get a list of servers from this sample.

https://pastebin.com/eHGwhVmz

from bs4 import BeautifulSoup as bs

with open('html.txt', 'r') as html:
    soup = bs(html, 'html.parser')
    div = soup.find('div', class_='grid_8')
    for tag in div:
        tag = div.find_all('td', class_='StatTDLabel')[2].text
        print(tag)

I can get the first server in the list, but I'm unable to iterate over the rest. When I wrap the lookup in a for loop, I get the same result on every iteration.

Upvotes: 0

Views: 96

Answers (2)

baduker

Reputation: 20042

Is this what you want?

from bs4 import BeautifulSoup
from tabulate import tabulate

sample_html = """The contents of your pastebin"""

soup = BeautifulSoup(sample_html, "html.parser").find_all("tr")
servers = [
    [i.getText(strip=True) for i in row.find_all("td")] for row in soup[1:]
]
print(tabulate(servers, headers=["Country", "Location", "Address", "Status"]))

Output:

Country    Location      Address               Status
---------  ------------  --------------------  -------------
ZA         Johannesburg  jnb-c17.ipvanish.com  15 % capacity
ZA         Johannesburg  jnb-c18.ipvanish.com  15 % capacity
ZA         Johannesburg  jnb-c19.ipvanish.com  31 % capacity
ZA         Johannesburg  jnb-c20.ipvanish.com  12 % capacity
ZA         Johannesburg  jnb-c21.ipvanish.com  9 % capacity
ZA         Johannesburg  jnb-c22.ipvanish.com  10 % capacity
AL         Tirana        tia-c02.ipvanish.com  17 % capacity
AL         Tirana        tia-c03.ipvanish.com  23 % capacity
AL         Tirana        tia-c04.ipvanish.com  19 % capacity
AL         Tirana        tia-c05.ipvanish.com  15 % capacity
AE         Dubai         dxb-c01.ipvanish.com  30 % capacity
AE         Dubai         dxb-c02.ipvanish.com  26 % capacity

To get only the server addresses, pick the third column of each row, which has index 2.

For example:

servers = [
    [i.getText(strip=True) for i in row.find_all("td")][2] for row in soup[1:]
]
print("\n".join(servers))

Output:

jnb-c17.ipvanish.com
jnb-c18.ipvanish.com
jnb-c19.ipvanish.com
jnb-c20.ipvanish.com
jnb-c21.ipvanish.com
jnb-c22.ipvanish.com
tia-c02.ipvanish.com
tia-c03.ipvanish.com
tia-c04.ipvanish.com
tia-c05.ipvanish.com
dxb-c01.ipvanish.com
dxb-c02.ipvanish.com
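
If you would rather not build the full row list first, a CSS selector can pull the third cell of every row directly. This is only a sketch of an alternative, not what the code above does. The inline `sample_html` is a made-up stand-in for the pastebin (two hostnames copied from the output above; the table layout and the `th` header cells are assumptions), so adjust it to the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the pastebin HTML: one header row, two data rows.
sample_html = """
<table>
  <tr><th>Country</th><th>Location</th><th>Address</th><th>Status</th></tr>
  <tr><td>ZA</td><td>Johannesburg</td><td>jnb-c17.ipvanish.com</td><td>15 % capacity</td></tr>
  <tr><td>AL</td><td>Tirana</td><td>tia-c02.ipvanish.com</td><td>17 % capacity</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select the third <td> of each row; <th> header cells are not matched.
cells = soup.select("tr > td:nth-of-type(3)")
addresses = [cell.get_text(strip=True) for cell in cells]

print("\n".join(addresses))
```

If the header row on the real page also uses `td` cells, the first result will be the header text, so drop it with `addresses[1:]`.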

Upvotes: 2

Shreyesh Desai

Reputation: 719

Try this:

from bs4 import BeautifulSoup as bs

with open('html.txt', 'r') as html:
    soup = bs(html, 'html.parser')

# Collect every labelled cell in the document.
tags = soup.find_all('td', class_='StatTDLabel')
for tag in tags:
    # Take only the element's immediate text and ignore text inside child elements.
    # `string=` is the current name for the older `text=` argument.
    tagtext = tag.find(string=True, recursive=False)
    if tagtext:
        print(tagtext)
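
As for why the loop in the question prints the same server each time: the loop variable is never used for the lookup, so `div.find_all(...)[2]` returns the same cell on every pass. Here is a small sketch of that behaviour and of a per-row lookup instead. The markup below is hypothetical (it assumes every cell carries the `StatTDLabel` class and sits inside the `grid_8` div); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; not the actual pastebin content.
sample_html = """
<div class="grid_8">
  <table>
    <tr><td class="StatTDLabel">ZA</td><td class="StatTDLabel">Johannesburg</td><td class="StatTDLabel">jnb-c17.ipvanish.com</td><td class="StatTDLabel">15 % capacity</td></tr>
    <tr><td class="StatTDLabel">AL</td><td class="StatTDLabel">Tirana</td><td class="StatTDLabel">tia-c02.ipvanish.com</td><td class="StatTDLabel">17 % capacity</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="grid_8")

# The question's pattern: `tag` is ignored, so each pass fetches the same cell.
same = []
for tag in div:
    same.append(div.find_all("td", class_="StatTDLabel")[2].text)

# Per-row lookup: search inside each row, then take that row's third cell.
addresses = []
for row in div.find_all("tr"):
    cells = row.find_all("td", class_="StatTDLabel")
    if len(cells) > 2:  # skip rows without a third labelled cell
        addresses.append(cells[2].get_text(strip=True))

print(same)
print(addresses)
```

The key change is calling `find_all` on `row` rather than on `div`, so the index `2` is relative to each row instead of the whole table.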

Upvotes: 0
