Greg

Reputation: 3

Beautiful Soup sorting output

I have code to scrape a webpage and it returns multiple instances of this:

<div class="post"><a title="Brass-plated door knob" href="http:URL-LINK">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT" />
<span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob</strong></a>
<div class="desc"><p>Brass-plated door knob</p></div></div>

I would like to get the href links and corresponding prices from each one and sort them, with the ideal output being

HIGHEST PRICE, URL-LINK

...

LOWEST PRICE, URL-LINK

I can pull the prices (though they come with the word "dollars" which I could do without) with

price = soup.find_all("em", class_="fl")

but I am not sure how to get the corresponding href link for each price, then sort and list all of them.

Right now I iterate through the output as follows:

if len(price) < 100:
    for x in range(1, len(price)):
        print(price[x])
else:
    print(len(price))

Ideas?

Upvotes: 0

Views: 2347

Answers (2)

salmanwahed

Reputation: 9657

Based on your HTML, you can get the link that corresponds to each price like this:

prices = soup.find_all("em", class_="fl")
for price in prices:
    print(price.find_parent('a').get('href'), price.text.split()[0])

Sorting while scraping is not convenient. Instead, store each price and link in a dictionary, convert the price to a float as in alecxe's answer, and sort the entries after scraping.
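
A minimal sketch of that suggestion, assuming the loop above has already filled a dictionary that maps each link to its raw price text. The `scraped` dictionary and the `parse_price` helper are hypothetical names, and the values are the sample ones from the question:

```python
# Hypothetical data: what the scraping loop might have collected (link -> raw price text).
scraped = {
    "http:URL-LINK": "3.87 dollars",
    "http:URL-LINK2": "410.25 dollars",
}


def parse_price(text):
    """Return the leading number of a string such as '3.87 dollars' as a float."""
    return float(text.split()[0])


# Convert after scraping, then sort (link, price) pairs from highest to lowest price.
prices_by_link = {link: parse_price(text) for link, text in scraped.items()}
ranked = sorted(prices_by_link.items(), key=lambda item: item[1], reverse=True)
```

Note that a dictionary keyed by link keeps only one price per link, so a list of pairs may be the safer container if the same link can appear more than once.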

Upvotes: 0

alecxe

Reputation: 474151

The idea is to iterate over all posts and get the link and price for each one.

Working example based on your input:

from bs4 import BeautifulSoup

data = """
<div>
    <div class="post">
        <a title="Brass-plated door knob" href="http:URL-LINK">
            <img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT"/>
            <span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span>
            <strong class="vtitle">Brass-plated door knob</strong>
        </a>

        <div class="desc"><p>Brass-plated door knob</p></div>
    </div>
    <div class="post">
        <a title="Brass-plated door knob2" href="http:URL-LINK2">
            <img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT"/>
            <span class="det"><em class="fl">410.25 dollars</em><em class="fr">Housewares</em></span>
            <strong class="vtitle">Brass-plated door knob2</strong>
        </a>

        <div class="desc"><p>Brass-plated door knob2</p></div>
    </div>
</div>
"""

soup = BeautifulSoup(data, 'html.parser')
result = []
for post in soup.select('div.post'):
    link = post.a.get('href')
    # "3.87 dollars" -> 3.87
    price = float(post.find('em', class_='fl').text.split(' ')[0])
    result.append({'link': link, 'price': price})

print(result)

Prints:

[
    {'link': 'http:URL-LINK', 'price': 3.87},
    {'link': 'http:URL-LINK2', 'price': 410.25}
]
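
To get the highest-to-lowest listing the question asks for, the collected list can be sorted by price afterwards. A short sketch, assuming `result` has the shape printed above; the `format_listing` helper is a hypothetical name, not part of Beautiful Soup:

```python
# Same shape as the list built by the loop above (sample values from the question).
result = [
    {'link': 'http:URL-LINK', 'price': 3.87},
    {'link': 'http:URL-LINK2', 'price': 410.25},
]


def format_listing(items):
    """Return 'PRICE, LINK' lines ordered from highest to lowest price."""
    ordered = sorted(items, key=lambda item: item['price'], reverse=True)
    return ['{:.2f}, {}'.format(item['price'], item['link']) for item in ordered]


for line in format_listing(result):
    print(line)
```

Because the prices are stored as floats rather than strings, the sort is numeric, so 410.25 ranks above 3.87 instead of being compared character by character.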

Upvotes: 1
