Python scraping from variable class attribute

Question

I am trying to scrape some hrefs from an HTML list. Some of the source code is as follows:


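    <ul class="sub-menu">
        <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx"><a href="...">Belfast</a></li>
        <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx"><a href="...">Birmingham</a></li>
        <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx"><a href="...">Canterbury</a></li>
        <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx"><a href="...">Durham</a></li>
    </ul>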

I have tried using the following code to get the href:

for ul in soup.find_all(class_="sub-menu"):
    for the_href in ul.find_all(class_="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4542"):
        print(the_href.a.get('href'))

But I then realised that the last part of class_="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx", i.e. the number in place of xxxx, is different for each list item.

So I have 2 questions really:

1) Given the source code, is this the most efficient way to obtain the hrefs?

2) If yes, or just for general knowledge, how would I go about getting them when the last few digits at the end of the class attribute change?

Sorry if this is a duplicate, I can't seem to find it on SO.

Christos Papoulas · Accepted Answer

I don't know if your real HTML is more complicated than the HTML you provided in the question, but why mess with classes at all rather than just use the tag names to get the results you want?

Generally, you should use some class names, or even better some ids (which are unique), to narrow the HTML down to the fields that you are actually interested in.

But the actual code that does the magic is:

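from bs4 import BeautifulSoup as Soup

# Sample markup matching the structure in the question; "..." and "xxxx"
# stand in for the real href values and menu-item numbers.
html_str = """
<ul class="sub-menu">
    <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx">
        <a href="...">Belfast</a>
    </li>
    <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx">
        <a href="...">Birmingham</a>
    </li>
    <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx">
        <a href="...">Canterbury</a>
    </li>
    <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-xxxx">
        <a href="...">Durham</a>
    </li>
</ul>
"""
soup = Soup(html_str, 'html.parser')
# Walk every <ul>, then every <li> inside it, and print the href of its <a> child.
for ul in soup.find_all('ul'):
    for the_href in ul.find_all('li'):
        print(the_href.a.get('href'))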
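
As for the second question (the number at the end of the class changing), here is a minimal sketch of three ways to select the items without hard-coding that number. The sample markup is hypothetical: the href values and the menu-item-4543 number are invented for illustration. It relies on Beautiful Soup treating class as a multi-valued attribute, so class_="menu-item" matches any element that has that token among its classes, and a compiled regular expression passed to class_ is tested against each class token.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: the href values and the second menu-item
# number are made up for illustration only.
html_str = """
<ul class="sub-menu">
    <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4542">
        <a href="/example/belfast">Belfast</a>
    </li>
    <li class="menu-item menu-item-type-post_type menu-item-object-properties menu-item-4543">
        <a href="/example/birmingham">Birmingham</a>
    </li>
</ul>
"""

soup = BeautifulSoup(html_str, "html.parser")

# 1) Match on a single stable class token; the varying menu-item-NNNN
#    token is simply ignored.
by_token = [
    li.a.get("href")
    for ul in soup.find_all("ul", class_="sub-menu")
    for li in ul.find_all("li", class_="menu-item")
]

# 2) Match the varying token itself with a regular expression.
by_regex = [
    li.a.get("href")
    for li in soup.find_all("li", class_=re.compile(r"^menu-item-\d+$"))
]

# 3) Express the same path as a CSS selector.
by_css = [a["href"] for a in soup.select("ul.sub-menu > li.menu-item > a[href]")]

print(by_token)
print(by_regex)
print(by_css)
```

All three lists should contain the same hrefs here. The first form is probably the simplest when one class token is shared by every item; the regex form is only needed if the numbered token is the only thing that identifies the items.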
