Daniel

Reputation: 691

Exclude Item from Web-Scraped Loop

Suppose I have the following html:

<h4>
     <a href="http://www.google.com">Google</a>
</h4>
<h4>Random Text</h4>

I am able to identify all h4 headings via a loop such as:

for url in soup.findAll("h4"):
    print(url.get_text())

That works well, except that it also includes the "Random Text" h4 heading. Is it possible to programmatically exclude h4 headings that do not meet certain criteria, for example those that don't contain a link?

Upvotes: 1

Views: 682

Answers (2)

A.Kot

Reputation: 7903

Use a list comprehension, which is the most Pythonic approach:

[i.get_text() for i in soup.findAll("h4") if #Insert criteria here#]
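For the criterion in the question (keep only headings that contain a link), the placeholder could be filled in roughly as below. This is a minimal sketch, not the answerer's own code: the `html` string is copied from the question, while the `"html.parser"` choice and the name `linked_headings` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<h4>
     <a href="http://www.google.com">Google</a>
</h4>
<h4>Random Text</h4>
"""

# "html.parser" is the standard-library parser; any installed parser should do.
soup = BeautifulSoup(html, "html.parser")

# Keep only the h4 headings that contain at least one <a> descendant.
# h.find("a") returns None (falsy) when the heading has no link.
linked_headings = [
    h.get_text(strip=True) for h in soup.find_all("h4") if h.find("a")
]

print(linked_headings)  # ['Google']
```

`get_text(strip=True)` is used here only to drop the whitespace around the link text; plain `get_text()` would keep the surrounding newlines and indentation.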

Upvotes: 0

alecxe

Reputation: 473873

Sure, you can go with a straightforward approach, simply filtering the headings:

for url in soup.find_all("h4"):
    if not url.a:  # "url.a" is a shortcut to "url.find('a')"
        continue
    print(url.get_text())

Or, a better way would be to filter them with a function:

for url in soup.find_all(lambda tag: tag.name == "h4" and tag.a):
    print(url.get_text())

Or, even better, go straight to the a elements:

for url in soup.select("h4 > a"):
    print(url.get_text())

Here, "h4 > a" is a CSS selector that matches "a" elements that are direct children of "h4" elements.
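To see what "direct children" means in practice, here is a small sketch comparing the child combinator with the descendant combinator. The second heading (a link wrapped in a span) and the example.com URL are hypothetical additions to the question's markup, included only to show the difference.

```python
from bs4 import BeautifulSoup

# The first and last headings mirror the question; the middle one is a
# hypothetical case where the link is nested one level deeper.
html = """
<h4><a href="http://www.google.com">Google</a></h4>
<h4><span><a href="http://www.example.com">Nested</a></span></h4>
<h4>Random Text</h4>
"""

soup = BeautifulSoup(html, "html.parser")

# "h4 > a": only <a> elements whose parent is an <h4>.
direct = [a.get_text() for a in soup.select("h4 > a")]

# "h4 a": <a> elements anywhere inside an <h4>, at any depth.
any_depth = [a.get_text() for a in soup.select("h4 a")]

print(direct)     # ['Google']
print(any_depth)  # ['Google', 'Nested']
```

If the links on the real page might be wrapped in other tags, the descendant form "h4 a" is probably the safer choice.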

Upvotes: 3
