Gert Lõhmus

Reputation: 79

Python BeautifulSoup to scrape tables from a webpage

I am trying to gather information from a website that has a database for ships.

I tried to get the information with BeautifulSoup, but it is not working. I searched the web and tried different solutions, but could not get the code to work.

I was wondering whether I have to change the line table = soup.find_all("table", { "class" : "table1" }), as there are 5 tables with class='table1' but my code only finds 1.

Do I have to create a loop over the tables? I tried this but could not get it working. Also, the next line, table_body = table.find('tbody'), gives an error:

AttributeError: 'ResultSet' object has no attribute 'find'

I think this happens because find_all returns a ResultSet, which according to BeautifulSoup's source code subclasses list and therefore has no find method. Do I have to iterate over that list?

from urllib import urlopen

shipUrl = 'http://www.veristar.com/portal/veristarinfo/generalinfo/registers/seaGoingShips?portal:componentId=p_efff31ac-af4c-4e89-83bc-55e6d477d131&interactionstate=JBPNS_rO0ABXdRAAZudW1iZXIAAAABAAYwODkxME0AFGphdmF4LnBvcnRsZXQuYWN0aW9uAAAAAQAYc2hpcFNlYXJjaFJlc3VsdHNTZXRTaGlwAAdfX0VPRl9f&portal:type=action&portal:isSecure=false'
shipPage = urlopen(shipUrl)

from bs4 import BeautifulSoup
soup = BeautifulSoup(shipPage)
table = soup.find_all("table", { "class" : "table1" })
print table
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        print td
    print

Upvotes: 2

Views: 8889

Answers (1)

dstudeba

Reputation: 9038

A couple of things:

As Kevin mentioned, you need to use a for loop to iterate through the list returned by find_all.

Not all of the tables have a tbody. For those, find returns None, and calling find_all on None raises an AttributeError, so you have to wrap that call in a try block.

When you print, use the .text attribute so that you print the text value and not the tag itself.
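The three points above can be checked on a small inline document without hitting the real site. This is only a sketch: the HTML string is a made-up stand-in for the page, it is written for Python 3, and it assumes the built-in "html.parser" backend (which does not insert a missing tbody).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only one has a tbody.
html = """
<table class="table1"><tbody><tr><td>Ship Name:</td><td>SUPERSTAR</td></tr></tbody></table>
<table class="table1"><tr><td>Call Sign:</td><td>ESIY</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, which is a list of Tag objects.
tables = soup.find_all("table", {"class": "table1"})
is_list = isinstance(tables, list)
table_count = len(tables)

# find works on a single Tag and returns None when nothing matches.
has_tbody = [t.find("tbody") is not None for t in tables]

# str() of a tag gives the markup, .text gives only the text inside it.
first_cell = tables[0].find("td")
as_tag = str(first_cell)
as_text = first_cell.text

print(is_list, table_count, has_tbody)
print(as_tag)
print(as_text)
```

Calling tables.find('tbody') here would fail with the same AttributeError as in the question, because the list itself has no find method; only its elements do.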

Here is the revised code:

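try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

from bs4 import BeautifulSoup

shipUrl = 'http://www.veristar.com/portal/veristarinfo/generalinfo/registers/seaGoingShips?portal:componentId=p_efff31ac-af4c-4e89-83bc-55e6d477d131&interactionstate=JBPNS_rO0ABXdRAAZudW1iZXIAAAABAAYwODkxME0AFGphdmF4LnBvcnRsZXQuYWN0aW9uAAAAAQAYc2hpcFNlYXJjaFJlc3VsdHNTZXRTaGlwAAdfX0VPRl9f&portal:type=action&portal:isSecure=false'
shipPage = urlopen(shipUrl)

soup = BeautifulSoup(shipPage)
# find_all returns a ResultSet (a list of tags), so loop over each table
table = soup.find_all("table", { "class" : "table1" })
for mytable in table:
    # find returns None when this table has no tbody
    table_body = mytable.find('tbody')
    try:
        rows = table_body.find_all('tr')
        for tr in rows:
            cols = tr.find_all('td')
            for td in cols:
                # .text gives the cell's text instead of the tag markup
                print(td.text)
    except AttributeError:
        # table_body was None, so there is no tbody to read rows from
        print("no tbody")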

This produces the following output:

Register Number:
08910M
IMO Number:
9365398
Ship Name:
SUPERSTAR
Call Sign:
ESIY
.....
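If you would rather collect the values than print them, one possible variation replaces the try block with an explicit None check and pairs each label cell with the value cell that follows it. This is a sketch, not a tested scraper for the real site: table_to_dict is a hypothetical helper, the inline HTML is a made-up stand-in (which field sits in which table is a guess), and it assumes that the cells alternate label, value as the output above suggests. It targets Python 3 with the "html.parser" backend.

```python
from bs4 import BeautifulSoup


def table_to_dict(table):
    """Pair alternating label/value cells of one table (assumed layout)."""
    body = table.find("tbody")
    if body is None:
        # No tbody in this table: read the rows from the table itself.
        body = table
    cells = [td.get_text(strip=True) for td in body.find_all("td")]
    # Assumption: cells come as label, value, label, value, ...
    labels = cells[0::2]
    values = cells[1::2]
    return {label.rstrip(":"): value for label, value in zip(labels, values)}


# Hypothetical stand-in for the page: one table with a tbody, one without.
html = """
<table class="table1"><tbody>
<tr><td>Register Number:</td><td>08910M</td></tr>
<tr><td>IMO Number:</td><td>9365398</td></tr>
</tbody></table>
<table class="table1">
<tr><td>Ship Name:</td><td>SUPERSTAR</td></tr>
<tr><td>Call Sign:</td><td>ESIY</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

ship = {}
for mytable in soup.find_all("table", {"class": "table1"}):
    ship.update(table_to_dict(mytable))

print(ship)
```

Falling back to the table itself means tables without a tbody are still read instead of being skipped with "no tbody". If the real page lays its cells out differently, the label/value pairing would need to be adjusted.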

Upvotes: 3
