Reputation: 691
Suppose I have the following HTML:
<h4>
<a href="http://www.google.com">Google</a>
</h4>
<h4>Random Text</h4>
I am able to identify all h4 headings via a loop such as:
for url in soup.findAll("h4"):
    print(url.get_text())
That works well, except that it also includes the "Random Text" h4 heading. Is it possible to programmatically exclude h4 headings that do not meet certain criteria, for example those that don't contain a link?
Upvotes: 1
Views: 682
Reputation: 7903
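Use a list comprehension, which is the most Pythonic approach. Replace the condition after "if" with your own criteria; here it keeps only the headings that contain a link:
[i.get_text() for i in soup.find_all("h4") if i.find("a")]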
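As a minimal, self-contained sketch of this idea, the criterion can be pulled out into a small helper so it is easy to swap. The helper name has_link is hypothetical (not part of BeautifulSoup), and the markup is the sample from the question, parsed with the standard library's html.parser backend:

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<h4>
<a href="http://www.google.com">Google</a>
</h4>
<h4>Random Text</h4>
"""

soup = BeautifulSoup(html, "html.parser")


def has_link(tag):
    """Hypothetical criterion: the heading contains at least one <a> element."""
    return tag.find("a") is not None


# Keep only the headings that satisfy the criterion.
headings = [h.get_text(strip=True) for h in soup.find_all("h4") if has_link(h)]
print(headings)
```

Passing strip=True to get_text() drops the newlines that surround the link inside the first heading, so only the link text remains.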
Upvotes: 0
Reputation: 473873
Sure, you can go with a straightforward approach, simply filtering the headings:
for url in soup.find_all("h4"):
    if not url.a:  # "url.a" is a shortcut to "url.find('a')"
        continue
    print(url.get_text())
Or, a better way would be to filter them with a function:
for url in soup.find_all(lambda tag: tag.name == "h4" and tag.a):
    print(url.get_text())
Or, even better, go straight to the a elements:
for url in soup.select("h4 > a"):
    print(url.get_text())
Here, h4 > a is a CSS selector that matches a elements that are direct children of h4 tags.
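One difference between these approaches is worth checking against your own markup: the "url.a" test finds a link at any depth inside the heading, while the h4 > a selector only matches links that are direct children. The sketch below illustrates this; the second heading with a link nested in a span and the example.com URL are assumptions added for illustration and are not part of the question's HTML:

```python
from bs4 import BeautifulSoup

# The nested <span> heading is a hypothetical addition to the question's sample.
html = """
<h4><a href="http://www.google.com">Google</a></h4>
<h4><span><a href="http://www.example.com">Nested</a></span></h4>
<h4>Random Text</h4>
"""

soup = BeautifulSoup(html, "html.parser")

# "h.a" searches all descendants, so both linked headings are kept.
with_link = [h.get_text(strip=True) for h in soup.find_all("h4") if h.a]

# "h4 > a" matches only links that are direct children of an h4.
direct = [a["href"] for a in soup.select("h4 > a")]

# "h4 a" (descendant combinator) matches links at any depth inside an h4.
any_depth = [a["href"] for a in soup.select("h4 a")]

print(with_link)
print(direct)
print(any_depth)
```

Selecting the a elements directly also gives access to their attributes, such as href, without a second lookup.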
Upvotes: 3