Reputation: 709
I have a tuple of keywords for the links I'm interested in on a given page:
categories = ('car', 'planes', ...)
I'm trying to collect into a list all links inside a given class whose href matches any value in my categories tuple. The document looks like this:
<div class='content'>
    <ul class='side-panel'>
        <li><a href='page1.html'>page 1</a></li>
        <li><a href='page2.html'>page 2</a></li>
        <li><a href='best_car_2013.html'>Best Cars</a></li>
        ...
    </ul>
</div>
For now I'm doing:
found = []
for link in soup.find_all(class_='side-panel'):
    for category in categories:
        if re.search(category, link.get('href')):
            found.append(link)
I get a TypeError: "expected string or buffer". Debugging the script, I can see that I'm getting all the 'li' elements with their respective anchor tags, but I'm having trouble iterating over this result set to collect into a list the 'href' of each link that matches my tuple.
Upvotes: 0
Views: 617
Reputation: 25954
Whenever you find yourself manually iterating over tags to do some additional filtering, it's usually better to just use the bs4 API instead. In this case, you can pass a regex to find_all.
soup.find(class_='side-panel').find_all(href=re.compile('|'.join(categories)))
Out[86]: [<a href="best_car_2013.html">Best Cars</a>]
If it's unclear, joining categories with pipes into one expression lets the re engine decide if any of the categories match the href attribute. This replaces explicitly looping over each category and individually doing a re search.
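To see what the joined pattern does on its own, here is a minimal sketch that uses only the re module on the hrefs from the question's sample markup. The re.escape call is an addition that is not in the snippet above; it is only there on the assumption that a keyword might contain regex metacharacters.

```python
import re

categories = ('car', 'planes')

# Join the keywords into a single alternation: 'car|planes'.
# re.escape is an extra precaution (not in the original snippet) in case
# a keyword contains characters such as '.' or '+'.
pattern = re.compile('|'.join(re.escape(c) for c in categories))

# hrefs copied from the sample markup in the question
hrefs = ['page1.html', 'page2.html', 'best_car_2013.html']

# search() matches anywhere in the string, which is also how bs4 applies
# a compiled regex passed as an attribute filter.
matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

Only 'best_car_2013.html' contains one of the keywords, so it is the only href kept.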
edit: (referring to link in comments) it looks like the page you're scraping has two class='side-panel categories' tags (???) so a loop over the initial find_all operation doing more find_all operations is probably appropriate:
[t for tags in soup.find_all(class_='side-panel categories')
 for t in tags.find_all(href=re.compile('|'.join(selected_links)))]
Out[24]:
[<a href="/animals__birds-desktop-wallpapers.html">Animals & Birds</a>,
<a href="/beach-desktop-wallpapers.html">Beach</a>,
<a href="/bikes__motorcycles-desktop-wallpapers.html">Bikes</a>,
<a href="/cars-desktop-wallpapers.html">Cars</a>,
<a href="/digital_universe-desktop-wallpapers.html">Digital Universe</a>,
<a href="/flowers-desktop-wallpapers.html">Flowers</a>,
<a href="/nature__landscape-desktop-wallpapers.html">Nature</a>,
<a href="/planes-desktop-wallpapers.html">Planes</a>,
<a href="/travel__world-desktop-wallpapers.html">Travel & World</a>,
<a href="/vector__designs-desktop-wallpapers.html">Vector & Designs</a>]
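Since the question asks for the href values rather than the tags, the nested lookup can be sketched end to end as below. The markup and the selected_links values are made up for illustration (they are not the real page), and matching on class_='side-panel' alone is an assumption that either container class is enough to identify the panels.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with two containers sharing the same classes.
html = """
<ul class="side-panel categories">
  <li><a href="/cars-desktop-wallpapers.html">Cars</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
<ul class="side-panel categories">
  <li><a href="/planes-desktop-wallpapers.html">Planes</a></li>
</ul>
"""
selected_links = ('cars', 'planes')  # assumed keywords, for illustration only

soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('|'.join(selected_links))

# Outer loop: every container carrying the 'side-panel' class.
# Inner loop: only the <a> tags whose href matches one of the keywords.
hrefs = [a['href']
         for panel in soup.find_all(class_='side-panel')
         for a in panel.find_all('a', href=pattern)]
print(hrefs)
```

Restricting the inner find_all to 'a' tags means every result is guaranteed to have an href, so a['href'] is safe to index directly.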
Upvotes: 2