whatf

Reputation: 6458

Use Beautiful Soup to parse hrefs from a given HTML structure

I have the following HTML structure:

<li class="g">
 <div class="vsc">    
  <div class="alpha"></div>
  <div class="beta"></div>
  <h3 class="r">
   <a href="http://www.stackoverflow.com"></a>
  </h3>
 </div>
</li> 

This structure repeats throughout the page. What is the easiest way to extract all the links (e.g. http://www.stackoverflow.com) from it using BeautifulSoup and Python?

Upvotes: 1

Views: 374

Answers (2)

root

Reputation: 80346

Using CSS selectors, as proposed by Petri, is probably the best way to do it with BeautifulSoup. However, I can't resist recommending lxml.html with XPath, which is well suited to this job.

Test HTML:

html="""
<html>
<li class="g">
<div class="vsc"></div>    
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.correct.com"></a>
</h3>
</li>
<li class="g">
<div class="vsc"></div>    
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.correct.com"></a>
</h3>
</li>
<li class="g">
<div class="vsc"></div>    
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
<a href="http://www.incorrect.com"></a>
</h3>
</li>
</html>"""

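With that, the query is essentially a one-liner:

    import lxml.html as lh

    doc = lh.fromstring(html)
    # Keep only <li class="g"> items that have div children with classes
    # vsc, alpha and beta plus an <h3 class="r">, then take the link's href.
    doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')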

    Out[264]:
    ['http://www.correct.com', 'http://www.correct.com']
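
To make the effect of the predicates easier to see, here is a minimal sketch that runs a loose query and a constrained one against the same markup. The `sample` string is a trimmed, hypothetical stand-in for the test HTML above (wrapped in a `<ul>`), and the variable names are my own; treat it as an illustration, not the only way to write the query.

```python
import lxml.html as lh

# Hypothetical sample: a trimmed version of the test HTML, two items only.
sample = """
<ul>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r"><a href="http://www.correct.com"></a></h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r"><a href="http://www.incorrect.com"></a></h3>
</li>
</ul>"""

doc = lh.fromstring(sample)

# Loose query: every link directly inside an <h3 class="r">.
all_links = doc.xpath('.//h3[@class="r"]/a/@href')

# Constrained query: the enclosing <li class="g"> must also have
# a div child whose class is exactly "alpha".
strict_links = doc.xpath(
    './/li[@class="g"][div/@class = "alpha"]/h3[@class="r"]/a/@href'
)

print(all_links)
print(strict_links)
```

Note that `@class = "alpha"` is an exact string comparison, so an element with several classes (for example `class="alpha extra"`) would not match this predicate.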

Upvotes: 1

Petri

Reputation: 5006

BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:

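from bs4 import BeautifulSoup

# Name the parser explicitly so results do not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
# Collect the href of every <a> found inside an <h3 class="r">.
print([a["href"] for a in soup.select("h3.r a")])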

This also has the advantage of constraining the selection by context: it selects only those anchor elements that are descendants of an h3 element with class r.

Loosening the constraint, or choosing one better suited to your needs, only requires tweaking the selector; see the CSS selector section of the Beautiful Soup documentation for the supported syntax.
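
As a rough illustration of how the selector can be tweaked, the sketch below compares three variants on a small sample. The `sample` markup (including the extra `example.com` link) and the variable names are assumptions made up for this example; only `BeautifulSoup` and `select` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one block shaped like the question's HTML,
# plus a stray link that is not inside an <h3 class="r">.
sample = """
<ul>
<li class="g">
 <div class="vsc">
  <h3 class="r"><a href="http://www.stackoverflow.com"></a></h3>
  <p><a href="http://www.example.com/other"></a></p>
 </div>
</li>
</ul>"""

soup = BeautifulSoup(sample, "html.parser")

# Descendant combinator (space): any <a> anywhere under <h3 class="r">.
loose = [a["href"] for a in soup.select("h3.r a")]

# Tighter: the <h3> must sit under <li class="g">, and the <a> must be
# a direct child of the <h3> and actually carry an href attribute.
strict = [a["href"] for a in soup.select("li.g h3.r > a[href]")]

# No context at all: every <a> that has an href.
every = [a["href"] for a in soup.select("a[href]")]

print(loose)
print(strict)
print(every)
```

The `a[href]` form also guards against anchors without an href, which would otherwise raise a `KeyError` on `a["href"]`.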

Upvotes: 2
