Reputation: 3
I have code to scrape a webpage and it returns multiple instances of this:
<div class="post"><a title="Brass-plated door knob" href="http:URL-LINK">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT" />
<span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob</strong></a>
<div class="desc"><p>Brass-plated door knob</p></div></div>
I would like to get the href links and corresponding prices from each one and sort them, with the ideal output being
HIGHEST PRICE, URL-LINK
...
LOWEST PRICE, URL-LINK
I can pull the prices (though they come with the word "dollars" which I could do without) with
price = soup.find_all("em", class_="fl")
but I am not sure how to get the corresponding href link for each price, then sort and list all of them.
Right now I iterate through the output as follows:
if len(price) < 100:
    for x in range(1, len(price)):
        print(price[x])
else:
    print(len(price))
Ideas?
Upvotes: 0
Views: 2347
Reputation: 9657
From your HTML, you can get the link that corresponds to each price like this:
prices = soup.find_all("em", class_="fl")
for price in prices:
    # The enclosing <a> holds the href; split()[0] drops the word "dollars"
    print(price.find_parent('a').get('href'), price.text.split()[0])
Sorting while you scrape is awkward. Instead, store each price together with its link (in a dictionary, for example), convert the price to a float as in alecxe's answer, and sort once scraping is finished.
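As a rough sketch of that idea, the snippet below collects the prices into a dictionary keyed by link and sorts afterwards. The trimmed-down `html` sample and the names `prices_by_link` and `ranked` are only illustrative assumptions; it also assumes every link is unique, since a dictionary keeps one price per key.

```python
from bs4 import BeautifulSoup

# Trimmed sample mirroring the question's markup; the URLs are the
# question's own placeholders, not real links.
html = """
<div class="post"><a title="Brass-plated door knob" href="http:URL-LINK">
<span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span></a></div>
<div class="post"><a title="Brass-plated door knob2" href="http:URL-LINK2">
<span class="det"><em class="fl">410.25 dollars</em><em class="fr">Housewares</em></span></a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each link to its numeric price (assumes links are unique).
prices_by_link = {}
for price in soup.find_all("em", class_="fl"):
    link = price.find_parent("a").get("href")
    # split()[0] keeps "3.87" and drops the trailing "dollars"
    prices_by_link[link] = float(price.text.split()[0])

# Sort after scraping, highest price first.
ranked = sorted(prices_by_link.items(), key=lambda item: item[1], reverse=True)
for link, amount in ranked:
    print("%.2f, %s" % (amount, link))
```

If the same link can appear more than once, a list of (price, link) tuples would be a safer container than a dictionary.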
Upvotes: 0
Reputation: 474151
The idea is to iterate over all posts and get the link and price for each one.
Working example based on your input:
from bs4 import BeautifulSoup
data = """
<div>
<div class="post">
<a title="Brass-plated door knob" href="http:URL-LINK">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT"/>
<span class="det"><em class="fl">3.87 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob</strong>
</a>
<div class="desc"><p>Brass-plated door knob</p></div>
</div>
<div class="post">
<a title="Brass-plated door knob2" href="http:URL-LINK2">
<img src="IMAGE SOURCE LINK" alt="IMAGE ALTERNATE TEXT"/>
<span class="det"><em class="fl">410.25 dollars</em><em class="fr">Housewares</em></span>
<strong class="vtitle">Brass-plated door knob2</strong>
</a>
<div class="desc"><p>Brass-plated door knob2</p></div>
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
result = []
for post in soup.select('div.post'):
    link = post.a.get('href')
    # Drop the trailing "dollars" and convert the price to a number
    price = float(post.find('em', class_='fl').text.split(' ')[0])
    result.append({'link': link, 'price': price})
print(result)
Prints:
[
{'link': 'http:URL-LINK', 'price': 3.87},
{'link': 'http:URL-LINK2', 'price': 410.25}
]
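The loop above collects the pairs but does not sort them yet. One possible way to get the highest-to-lowest listing the question asks for is sketched here; the `result` list is copied from the printed output, and the names `ranked` and `lines` are just illustrative.

```python
from operator import itemgetter

# The list built by the scraping loop, copied from its printed output.
result = [
    {'price': 3.87, 'link': 'http:URL-LINK'},
    {'price': 410.25, 'link': 'http:URL-LINK2'},
]

# Sort by the numeric price, highest first.
ranked = sorted(result, key=itemgetter('price'), reverse=True)

# Format each entry as "PRICE, URL-LINK".
lines = ["%.2f, %s" % (item['price'], item['link']) for item in ranked]
print("\n".join(lines))
```

Because the prices were converted to floats, the sort is numeric; sorting the raw strings would put "410.25 dollars" before "3.87 dollars" only by accident of their first characters.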
Upvotes: 1