Will

Reputation: 287

Python beautifulsoup iterate over table

I am trying to scrape table data into a CSV file. Unfortunately, I've hit a roadblock: the following code simply repeats the TDs from the first TR for every subsequent TR.

import urllib.request
from bs4 import BeautifulSoup

f = open('out.txt','w')

url = "http://www.international.gc.ca/about-a_propos/atip-aiprp/reports-rapports/2012/02-atip_aiprp.aspx"
page = urllib.request.urlopen(url)

soup = BeautifulSoup(page)

soup.unicode

table1 = soup.find("table", border=1)
table2 = soup.find('tbody')
table3 = soup.find_all('tr')

for td in table3:
    rn = soup.find_all("td")[0].get_text()
    sr = soup.find_all("td")[1].get_text()
    d = soup.find_all("td")[2].get_text()
    n = soup.find_all("td")[3].get_text()

    print(rn + "," + sr + "," + d + ",", file=f)

This is my first ever Python script, so any help would be appreciated! I have looked over the answers to other questions but cannot figure out what I am doing wrong here.

Upvotes: 26

Views: 55296

Answers (2)

Andrew Gorcester

Reputation: 19973

The problem is that every time you try to narrow down your search (get the first td in this tr, and so on), you call back to soup instead. soup is the top-level object; it represents the entire document. You only need to search soup once, and then use the result of that search in place of soup for the next step.

For instance (with variable names changed to be clearer):

table = soup.find('table', border=1)
rows = table.find_all('tr')

for row in rows:
    data = row.find_all("td")
    rn = data[0].get_text()
    sr = data[1].get_text()
    d = data[2].get_text()
    n = data[3].get_text()

    print(rn + "," + sr + "," + d + ",", file=f)

I'm not sure that print statement is the best way to do what you're trying to do here (at the very least, you should use string formatting instead of addition), but I'm leaving it as is because it's not the core issue.
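
One alternative to building the comma-separated line by hand is the standard library's csv module, which also quotes cells that contain commas. Below is a minimal sketch, not a drop-in fix: the inline HTML sample stands in for the real page, and the helper name table_to_csv and the sample values are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page.
SAMPLE_HTML = """
<table border="1">
  <tbody>
    <tr><th>Request Number</th><th>Summary</th><th>Disposition</th><th>Pages</th></tr>
    <tr><td>A-2011-001</td><td>Briefing notes, March</td><td>All disclosed</td><td>12</td></tr>
    <tr><td>A-2011-002</td><td>Travel claims</td><td>Disclosed in part</td><td>40</td></tr>
  </tbody>
</table>
"""


def table_to_csv(html, out):
    """Write each data row of the first border="1" table to `out` as CSV."""
    soup = BeautifulSoup(html, "html.parser")
    # Attribute values are stored as strings in the parsed tree.
    table = soup.find("table", border="1")
    writer = csv.writer(out)
    for row in table.find_all("tr"):
        # Search from the row, not from soup, so only this row's cells match.
        cells = row.find_all("td")
        if not cells:
            continue  # header row: it has th cells, no td cells
        writer.writerow([cell.get_text(strip=True) for cell in cells])


buffer = io.StringIO()
table_to_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

To write to a real file instead of the in-memory buffer, pass a file opened with newline="" (as the csv documentation recommends) in place of buffer.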

Also, for completeness: soup.unicode won't do anything useful. You're not calling a method there, and there's no assignment. I don't remember BeautifulSoup having a method named unicode in the first place, but I'm used to BS 3.0, so it may be new in 4.
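
For what it's worth, BeautifulSoup 4 treats an attribute access like soup.unicode as shorthand for soup.find("unicode"), that is, a search for a tag with that name. A small sketch of that behaviour, using a made-up one-row table:

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table, just enough markup to search.
soup = BeautifulSoup("<table><tr><td>1</td></tr></table>", "html.parser")

# Attribute access looks up the first descendant tag with that name.
first_table = soup.table   # same as soup.find("table")
missing = soup.unicode     # same as soup.find("unicode"); no such tag here

print(first_table.name)
print(missing)
```

So the bare soup.unicode line runs a search for a nonexistent tag and throws the result away.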

Upvotes: 10

kindall

Reputation: 184191

You're starting at the top level of your document each time you use find() or find_all(), so when you ask for, say, all the "td" tags, you get all the "td" tags in the document, not just those in the table and row you have searched for. You might as well not run those earlier searches, because the way your code is written, their results are never used to narrow the later ones.

I think you want to do something like this:

table1 = soup.find("table", border=1)
table2 = table1.find('tbody')
table3 = table2.find_all('tr')

Or, you know, something more like this, with more descriptive variable names to boot:

rows = soup.find("table", border=1).find("tbody").find_all("tr")

for row in rows:
    cells = row.find_all("td")
    rn = cells[0].get_text()
    # and so on
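
To make the difference concrete, here is a small sketch comparing a document-wide search with a row-scoped one. The two-table HTML sample is invented for illustration, and the header row is skipped on the assumption that it uses th cells rather than td cells.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a layout table plus the data table we actually want.
SAMPLE_HTML = """
<table><tr><td>navigation</td></tr></table>
<table border="1">
  <tbody>
    <tr><th>Number</th><th>Summary</th></tr>
    <tr><td>A-1</td><td>First request</td></tr>
    <tr><td>A-2</td><td>Second request</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching from the top returns every td in the document, on every call.
all_cells = soup.find_all("td")

# Searching from each row returns only that row's cells.
rows = soup.find("table", border="1").find("tbody").find_all("tr")
records = []
for row in rows:
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # header row: th cells only, so no td matches
    records.append((cells[0].get_text(), cells[1].get_text()))

print(len(all_cells))
print(records)
```

Here soup.find_all("td")[0] is always the "navigation" cell, no matter which row the loop is on, which is the same repetition the question describes.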

Upvotes: 65
