XVirtusX

Reputation: 709

Parsing HTML with Beautiful Soup

I have a tuple of keywords for the links I'm interested in on a given page:

categories = ('car', 'planes', ...)

I'm trying to collect into a list all links inside a given class whose href matches any value in my categories tuple. The document looks like this:

<div class='content'>
    <ul class='side-panel'>
        <li><a href='page1.html'>page 1</a></li>
        <li><a href='page2.html'>page 2</a></li>
        <li><a href='best_car_2013.html'>Best Cars</a></li>
        ...
    </ul>
</div>

For now I'm doing:

found = []

for link in soup.find_all(class_='side-panel'):
    for category in categories:
        if re.search(category, link.get('href')):
            found.append(link)

I get a TypeError: "expected string or buffer". Debugging the script, I can see that I'm getting all the 'li' elements with their respective anchor tags, but I'm having trouble iterating over this result set to collect into a list the 'href' of each link that matches my tuple.

Upvotes: 0

Views: 617

Answers (1)

roippi

Reputation: 25954

Whenever you find yourself manually iterating over tags to do some additional filtering, it's usually better to just use the bs4 API instead. In this case, you can pass a regex to find_all.

soup.find(class_='side-panel').find_all(href=re.compile('|'.join(categories)))
Out[86]: [<a href="best_car_2013.html">Best Cars</a>]

If it's unclear: joining categories with pipes into a single expression lets the re engine decide whether any of the categories match the href attribute. This replaces explicitly looping over each category and calling re.search for each one.
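To see the joined pattern in isolation, here is a minimal sketch that uses only the re module on plain href strings. The two-item categories tuple stands in for the question's elided one, and the re.escape call is an addition that is not in the answer above; it only matters if a keyword could contain regex metacharacters such as '.' or '+'.

```python
import re

# Hypothetical values standing in for the question's elided tuple.
categories = ('car', 'planes')

# Join the keywords into one alternation: 'car|planes'.
# re.escape is an extra safeguard (an assumption, not part of the answer)
# for keywords that might contain regex metacharacters.
pattern = re.compile('|'.join(re.escape(c) for c in categories))

# href values taken from the sample markup in the question.
hrefs = ['page1.html', 'page2.html', 'best_car_2013.html']

# One search per href replaces the inner loop over categories.
matches = [h for h in hrefs if pattern.search(h)]
print(matches)
```

The same compiled pattern can be passed directly as the href argument to find_all, as shown above.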

edit: (referring to link in comments) it looks like the page you're scraping has two class='side-panel categories' tags (???) so a loop over the initial find_all operation doing more find_all operations is probably appropriate:

[t for tags in soup.find_all(class_='side-panel categories') 
    for t in tags.find_all(href=re.compile('|'.join(selected_links)))]
Out[24]: 
[<a href="/animals__birds-desktop-wallpapers.html">Animals &amp; Birds</a>,
 <a href="/beach-desktop-wallpapers.html">Beach</a>,
 <a href="/bikes__motorcycles-desktop-wallpapers.html">Bikes</a>,
 <a href="/cars-desktop-wallpapers.html">Cars</a>,
 <a href="/digital_universe-desktop-wallpapers.html">Digital Universe</a>,
 <a href="/flowers-desktop-wallpapers.html">Flowers</a>,
 <a href="/nature__landscape-desktop-wallpapers.html">Nature</a>,
 <a href="/planes-desktop-wallpapers.html">Planes</a>,
 <a href="/travel__world-desktop-wallpapers.html">Travel &amp; World</a>,
 <a href="/vector__designs-desktop-wallpapers.html">Vector &amp; Designs</a>]
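Since the question asks for the 'href' of each matching link rather than the tag itself, the comprehension can be taken one step further. The following is a rough sketch, not the actual page: the inline HTML, the two-item selected_links tuple, and the '/other.html' link are made up for illustration, and the built-in 'html.parser' backend is assumed.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup with two panels, mimicking the structure described above.
html = """
<ul class="side-panel categories">
  <li><a href="/cars-desktop-wallpapers.html">Cars</a></li>
  <li><a href="/other.html">Other</a></li>
</ul>
<ul class="side-panel categories">
  <li><a href="/planes-desktop-wallpapers.html">Planes</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# Hypothetical keywords; the real selected_links tuple is not shown in the thread.
selected_links = ('cars', 'planes')
pattern = re.compile('|'.join(selected_links))

# Outer loop: every panel. Inner find_all: only tags whose href matches.
found = [a for panel in soup.find_all(class_='side-panel categories')
         for a in panel.find_all(href=pattern)]

# Pull out the attribute values instead of the Tag objects.
hrefs = [a['href'] for a in found]
print(hrefs)
```

Filtering on href inside find_all also sidesteps the original TypeError: tags without an href (such as the ul itself, for which link.get('href') returns None) are never handed to re.search.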

Upvotes: 2
