Reputation: 799
I am trying to scrape some hrefs from an HTML list. Part of the source code is as follows:
<ul class="sub-menu">
<li id="menu-item-4019" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4019"><a href="http://www.universalstudentliving.com/properties/belfast/">Belfast</a></li>
<li id="menu-item-186" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-186"><a href="http://www.universalstudentliving.com/properties/birmingham/">Birmingham</a></li>
<li id="menu-item-184" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-184"><a href="http://www.universalstudentliving.com/properties/canterbury/">Canterbury</a></li>
<li id="menu-item-4544" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4544"><a href="http://www.universalstudentliving.com/properties/the-clink-durham/">Durham</a></li>
</ul>
I have tried using the following code to get the href:
for ul in soup.find_all(class_="sub-menu"):
    for the_href in ul.find_all(class_="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4542"):
        print(the_href.a.get('href'))
But I then realised that the last part of the class attribute, class_="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx",
i.e. the number in place of xxxx, is different for each list item.
So I have 2 questions really:
1) Given the source code, is this the most efficient way to obtain the hrefs?
2) If yes, or just for general knowledge, how would I go about getting them, given that the digits at the end of the class attribute change for each item?
Sorry if this is a duplicate, I can't seem to find it on SO.
Upvotes: 0
Views: 882
Reputation: 2568
I don't know whether your real HTML is more complicated than the HTML you provided in the question, but why mess with classes at all rather than using just the tag names to get the results you want?
Generally, you should use class names or, even better, ids (which are unique) to narrow the HTML down to the fields you are actually interested in.
For the HTML in the question, though, the code that does the job is:
from bs4 import BeautifulSoup as Soup
html_str = """
<ul class="sub-menu">
<li id="menu-item-4019" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4019">
<a href="http://www.universalstudentliving.com/properties/belfast/">Belfast</a>
</li>
<li id="menu-item-186" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-186">
<a href="http://www.universalstudentliving.com/properties/birmingham/">Birmingham</a>
</li>
<li id="menu-item-184" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-184">
<a href="http://www.universalstudentliving.com/properties/canterbury/">Canterbury</a>
</li>
<li id="menu-item-4544" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4544">
<a href="http://www.universalstudentliving.com/properties/the-clink-durham/">Durham</a>
</li>
</ul>"""
soup = Soup(html_str, 'html.parser')
for ul in soup.find_all('ul'):
    for the_href in ul.find_all('li'):
        print(the_href.a.get('href'))
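If the real page contains other lists, matching on bare tag names may pick up too much. A minimal sketch of narrowing by class, assuming the list you want is the only ul with class "sub-menu" and that every relevant li carries the shared class "menu-item-object-properties" (both taken from the sample HTML, the real page may differ). BeautifulSoup treats class as a multi-valued attribute, so class_ with a single class name matches any tag that has that class among its classes, and the changing number no longer matters:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (two items only).
html_str = """
<ul class="sub-menu">
<li id="menu-item-4019" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4019"><a href="http://www.universalstudentliving.com/properties/belfast/">Belfast</a></li>
<li id="menu-item-186" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-186"><a href="http://www.universalstudentliving.com/properties/birmingham/">Birmingham</a></li>
</ul>"""

soup = BeautifulSoup(html_str, "html.parser")

hrefs = []
# Restrict the search to the sub-menu list, then to items sharing one stable class.
for ul in soup.find_all("ul", class_="sub-menu"):
    for li in ul.find_all("li", class_="menu-item-object-properties"):
        # Skip items without a link instead of raising on li.a being None.
        if li.a is not None and li.a.has_attr("href"):
            hrefs.append(li.a["href"])

print(hrefs)
```

Collecting the links in a list rather than printing them inside the loop makes it easier to reuse them afterwards.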
Upvotes: 1
Reputation: 8382
In this particular case you can use regex when using find_all.
Example:
import re
from bs4 import BeautifulSoup
example = """<ul class="sub-menu">
<li id="menu-item-4019" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4019"><a href="http://www.universalstudentliving.com/properties/belfast/">Belfast</a></li>
<li id="menu-item-186" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-186"><a href="http://www.universalstudentliving.com/properties/birmingham/">Birmingham</a></li>
<li id="menu-item-184" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-184"><a href="http://www.universalstudentliving.com/properties/canterbury/">Canterbury</a></li>
<li id="menu-item-4544" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4544"><a href="http://www.universalstudentliving.com/properties/the-clink-durham/">Durham</a></li>
</ul>"""
soup = BeautifulSoup(example, "html.parser")
for o in soup.find_all('li', class_=re.compile(r'menu-item menu-item-type-post_type menu-item-object-properties menu-item-')):
    print(o.a["href"])
Output:
http://www.universalstudentliving.com/properties/belfast/
http://www.universalstudentliving.com/properties/birmingham/
http://www.universalstudentliving.com/properties/canterbury/
http://www.universalstudentliving.com/properties/the-clink-durham/
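The pattern does not have to spell out the whole class string. As far as I know, BeautifulSoup tests a regex passed as class_ against each class of the tag individually, so a shorter pattern that only describes the numbered class should be enough. A rough sketch under that assumption, using a shortened copy of the sample HTML (the names sample, numbered_class and links are only for illustration):

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (two items only).
sample = """<ul class="sub-menu">
<li id="menu-item-184" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-184"><a href="http://www.universalstudentliving.com/properties/canterbury/">Canterbury</a></li>
<li id="menu-item-4544" class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4544"><a href="http://www.universalstudentliving.com/properties/the-clink-durham/">Durham</a></li>
</ul>"""

soup = BeautifulSoup(sample, "html.parser")

# Matches a single class of the form menu-item-<digits>, whatever the digits are.
numbered_class = re.compile(r"^menu-item-\d+$")

links = [
    li.a["href"]
    for li in soup.find_all("li", class_=numbered_class)
    if li.a is not None and li.a.has_attr("href")
]

print(links)
```

Anchoring the pattern with ^ and $ keeps it from also matching classes such as menu-item-type-post_type, which merely start with the same prefix.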
Upvotes: 3