Fusilli Jerry

Reputation: 745

Scrape a series of tables with BeautifulSoup

I am trying to learn about web scraping and Python (and programming in general) and have found the BeautifulSoup library, which seems to offer a lot of possibilities.

I am trying to find out how to best pull the pertinent information from this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

I can go into more detail on this, but basically I want the company name, the description of it, contact details, the various company details/statistics, etc.

At this stage I am looking at how to cleanly isolate and scrape this data, with a view to putting it all in a CSV or something similar later.

I am confused about how to use BS to grab the different table data. There are lots of tr and td tags, and I am not sure how to anchor onto anything unique.
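For reference, anchoring on a class attribute rather than on bare tr/td tags looks something like the sketch below. It uses a toy HTML fragment rather than the live page, though the class names (paraheading, bodytext) are the ones that appear in the scraped output later in this question; the real page's markup may differ in other ways:

```python
from bs4 import BeautifulSoup

# Toy fragment standing in for the real page.
html = """
<table>
  <tr><td class="paraheading">ABN:</td><td class="bodytext">71 007 350 807</td></tr>
  <tr><td class="paraheading">Phone:</td><td class="bodytext">+61.3. 9464 4455</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchor on the class attribute, then walk to the sibling cell that
# holds the corresponding value.
for heading in soup.find_all("td", {"class": "paraheading"}):
    value = heading.find_next_sibling("td", {"class": "bodytext"})
    print(heading.text.strip(), value.text.strip())
```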

The best I have come up with is the following code as a start:

from bs4 import BeautifulSoup
import urllib2

# Fetch the page and pretty-print the parsed HTML
html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html)
soupie = soup.prettify()
print soupie

and then from there use regex etc. to pull data from the cleaned-up text.

But there must be a better way to do this using the BS tree? Or is this site formatted in a way that BS won't provide much more help?

Not looking for a full solution as that is a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.

Update

Thanks to @ZeroPiraeus below, I am starting to understand how to parse through the tables. Here is the output from his code:

=== Personnel ===
bodytext    Ms Gail Morgan CEO
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext    Lisa Mayoh Sales Manager
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: [email protected]

=== Company Details ===
bodytext    ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: [email protected] Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext    007 350 807
paraheading ABN:
bodytext    71 007 350 807
paraheading 
bodytext    Australian Owned
paraheading Annual Turnover:
bodytext    $5M - $10M
paraheading Number of Employees:
bodytext    6-10
paraheading QA:
bodytext    ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext    5 %
paraheading Industry Categories:
bodytext    AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext    [email protected]
paraheading Company Website:
bodytext    http://www.aerospacematerials.com.au
paraheading Office:
bodytext    2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext    PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext    +61.3. 9464 4455
paraheading Fax:
bodytext    +61.3. 9464 4422

My next question is: what is the best way to put this data into a CSV suitable for importing into a spreadsheet? For example, with things like 'ABN', 'ACN', 'Company Website', etc. as column headings, and the corresponding data as row information.

Thanks for any help.

Upvotes: 3

Views: 1695

Answers (2)

schesis

Reputation: 59118

Your code will depend on exactly what you want and how you want to store it, but this snippet should give you an idea of how to get the relevant information out of the page:

import requests

from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html)

for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        print("\t".join([item["class"][0], " ".join(item.text.split())]))

I find requests a more pleasant library to work with than urllib2, but of course that's up to you.

EDIT:

In response to your followup question, here's something you could use to write a CSV file from the scraped data:

import csv
import requests

from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ] # ... etc.

with open("data.csv", "w") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text)
        row = {}
        for heading in soup.find_all("td", {"class": "paraheading"}):
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                value = " ".join(next_td.text.split())
                row[key] = value
        writer.writerow(row)
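One detail worth knowing about csv.DictWriter in this context: rows that are missing a column are quietly filled in with restval (an empty string by default), which matters here because not every company page will necessarily have every heading. A quick self-contained illustration, writing to an in-memory buffer instead of a file:

```python
import csv
import io

# DictWriter fills missing columns with restval (default ""), so a row
# scraped without one of the headings still lines up correctly.
buf = io.StringIO()
writer = csv.DictWriter(buf, ["ACN", "ABN"], restval="")
writer.writeheader()
writer.writerow({"ACN": "007 350 807"})  # no ABN scraped for this row
print(buf.getvalue())
```

Note that a key not listed in the fieldnames would raise a ValueError by default; pass extrasaction="ignore" if you want stray headings silently dropped instead.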

Upvotes: 3

Dave_750

Reputation: 1245

I was down this road once before. The HTML page I was using was always in the same format, with a single table, and was internal to the company. We made sure the customer knew that if they changed the page, it would most likely break the program. With that stipulation, I was able to be certain of where everything would be, by index value, from a list of tr and td tags. It was far from the ideal situation of having the XML data they couldn't or wouldn't provide, but it has worked perfectly for almost a year now. If someone out there knows a better answer, I would like to know it too. That was the first and only time I used Beautiful Soup; I've never needed it since, but it works very well.
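For what it's worth, the position-based approach described above looks something like this. The fragment and indices are invented for illustration, and the approach is brittle by design, as noted; it only holds up while the page layout is guaranteed not to change:

```python
from bs4 import BeautifulSoup

# Invented fragment: a fixed-layout table where each value is always
# at the same position in the flat list of td tags.
html = ("<table>"
        "<tr><td>Name</td><td>Acme</td></tr>"
        "<tr><td>Phone</td><td>555-1234</td></tr>"
        "</table>")

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# Pick values out purely by index, relying on the layout never changing.
print(cells[1].text)  # Acme
print(cells[3].text)  # 555-1234
```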

Upvotes: 0
