Jake Bourne

Reputation: 753

Beautiful Soup Parsing table within a div

I'm using bs4 to pull product details from eBay listings, working from this listing as an example. The code I believe is closest to working is below:

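from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# my_url holds the URL of the eBay listing (defined elsewhere)
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')
# findAll returns a list of every matching div
attributes = page_soup.findAll("div", {'class': 'itemAttr'})
attribute = attributes[0]
# contents of the first row inside the div
row = attribute.tr.contents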

The idea is to pull the webpage, find the appropriate div (itemAttr), and then pull content from it using the tr/td tags or some combination thereof. Not included above are my numerous variations on this, but I keep hitting the same roadblock: findAll produces a list (with one item), and navigating through that list is difficult. I did look at parsing the table directly, but unfortunately it hasn't been given a class. Are there any ideas on how to pull a table out of a div tag, or perhaps create a new subset of HTML from the parse (as opposed to a list)? Or tell me if I've gone insane and should go to bed.

Upvotes: 1

Views: 1998

Answers (1)

Harald Nordgren

Reputation: 12381

I think your current work makes a lot of sense, good job!

To move ahead, we can leverage the structure of the td elements on the eBay page: they come in pairs, with an attrLabels class on the label cell, which lets us extract the specific data.

This gives you the data in the same order as it appears on the page:

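tds = attribute.findAll("td")
ordered_data = []
# step through the cells two at a time: label cell, then value cell
# (stopping at len(tds) - 1 avoids an IndexError on an unpaired last cell)
for i in range(0, len(tds) - 1, 2):
    if tds[i].get('class') == ['attrLabels']:
        key = tds[i].text.strip().strip(":")
        value = tds[i+1].span.text
        ordered_data.append({ key: value })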

And this gives you the same thing but in a single dict, so that you can easily look up each attribute by its label:

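tds = attribute.findAll("td")
searchable_data = {}
# step through the cells two at a time: label cell, then value cell
# (stopping at len(tds) - 1 avoids an IndexError on an unpaired last cell)
for i in range(0, len(tds) - 1, 2):
    if tds[i].get('class') == ['attrLabels']:
        key = tds[i].text.strip().strip(":")
        value = tds[i+1].span.text
        searchable_data[key] = value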
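
If you want to try the idea without hitting the network, here is a minimal self-contained sketch. It is a variation, not the exact loop above: instead of stepping through the cells by index, it finds each label cell and takes the next td sibling as its value. The SAMPLE_HTML markup, the function name extract_attributes, and the sample values are all made up for illustration; the real eBay page may be structured differently, so treat this as an assumption to check against the actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the assumed layout: label cells carry
# the attrLabels class and each is followed by a td holding the value.
SAMPLE_HTML = """
<div class="itemAttr">
  <table>
    <tr>
      <td class="attrLabels">Condition:</td>
      <td><span>New</span></td>
      <td class="attrLabels">Brand:</td>
      <td><span>Acme</span></td>
    </tr>
    <tr>
      <td class="attrLabels">Model:</td>
      <td><span>X-100</span></td>
    </tr>
  </table>
</div>
"""


def extract_attributes(html):
    """Return a dict mapping each label to its value (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns a single Tag (or None) rather than a list
    container = soup.find("div", class_="itemAttr")
    if container is None:
        return {}
    data = {}
    for label in container.find_all("td", class_="attrLabels"):
        # the value is assumed to live in the td right after the label
        value_cell = label.find_next_sibling("td")
        if value_cell is None:
            continue
        key = label.get_text(strip=True).rstrip(":")
        data[key] = value_cell.get_text(strip=True)
    return data


attributes = extract_attributes(SAMPLE_HTML)
print(attributes)
```

Using find() instead of findAll() also addresses the "list with one item" problem from the question, since it hands back the div itself. Pairing by sibling rather than by index means a stray unlabeled cell won't shift every later pair out of alignment.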

Upvotes: 3
