Exclude Item from Web-Scraped Loop

Question

Suppose I have the following html:


<h4><a href="...">Google</a></h4>
<h4>Random Text</h4>

I am able to identify all h4 headings via a loop such as:

for url in soup.findAll("h4"):
    print(url.get_text())

That works well, except that it also prints the "Random Text" h4 heading. Is it possible to programmatically exclude h4 headings that do not meet certain criteria, for example those that don't contain a link?

alecxe · Accepted Answer

Sure, you can go with a straightforward approach, simply filtering the headings:

for url in soup.find_all("h4"):
    if not url.a:  # "url.a" is a shortcut to "url.find('a')"
        continue
    print(url.get_text())

Or, a better way would be to filter them with a function:

for url in soup.find_all(lambda tag: tag.name == "h4" and tag.a):
    print(url.get_text())

Or, even better, go straight to the a elements:

for url in soup.select("h4 > a"):
    print(url.get_text())

Here, "h4 > a" is a CSS selector that matches "a" elements that are direct children of "h4" elements.
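
For comparison, here is a minimal self-contained sketch that runs all three approaches against sample markup. The markup and the https://example.com href are assumptions made up for illustration, modeled on the question's description rather than taken from it; the parser choice ("html.parser") is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question; the href is a placeholder.
html = """
<h4><a href="https://example.com">Google</a></h4>
<h4>Random Text</h4>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Loop over every h4 and keep only those that contain a link.
skipped = [h.get_text(strip=True) for h in soup.find_all("h4") if h.a]

# 2. Let find_all() do the filtering with a function.
filtered = [
    h.get_text(strip=True)
    for h in soup.find_all(lambda tag: tag.name == "h4" and tag.a is not None)
]

# 3. Go straight to the links with a CSS selector.
selected = [a.get_text(strip=True) for a in soup.select("h4 > a")]

print(skipped, filtered, selected)
```

With this input all three lists hold only the text of the linked heading. Note that the first two approaches return the text of the whole h4, while the selector returns the text of the link alone, so the results can differ if a heading has text outside its link.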
