From BeautifulSoup I get back a list of specific tags. Some of these tags contain only a link and no further text. When I call the get_text() method on them, I get the description of the link.
But when a tag contains nothing except an <a href> element, I want to ignore it.
Tag: <p class="abc">text1 <a href=...>desc</a> text2</p> -> result: text1 desc text2 (OKAY)
Tag: <p class="abc"><a href=...>desc</a></p> -> result: desc (NOT OKAY)
When a tag contains only a link, I want to filter it out. How can I do that?
Upvotes: 1
Views: 875
Reputation: 473803
The idea is to iterate over the p tags and check whether each one has exactly one child and that child is an a tag:
from bs4 import BeautifulSoup
data = """
<div>
<p class="abc">text1 <a href='http://mysite1.com'>desc1</a> text2</p>
<p class="abc"><a href='http://mysite2.com'>desc2</a></p>
<p class="abc"><a href='http://mysite3.com'>desc3</a>text3</p>
<p class="abc">text4<a href='http://mysite4.com'>des4</a></p>
<p class="abc">text5</p>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for p in soup('p', class_='abc'):
    # A paragraph whose single child is an <a> tag is a link-only paragraph.
    if len(p.contents) == 1 and p.contents[0].name == 'a':
        print(p)
prints:
<p class="abc"><a href="http://mysite2.com">desc2</a></p>
FYI, .contents holds the list of a tag's children.
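Since the question asks to drop the link-only tags rather than print them, the same check can be inverted. Below is a minimal sketch, not a definitive solution: the helper names is_link_only and texts_without_link_only are made up for illustration, and ignoring whitespace-only strings around the link is an added assumption that the question does not state.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def is_link_only(tag):
    """Return True if the tag's only meaningful child is a single <a> element.

    Hypothetical helper: whitespace-only strings are ignored (an assumption),
    so '<p> <a href="...">x</a> </p>' also counts as link-only.
    """
    children = [c for c in tag.contents
                if isinstance(c, Tag) or str(c).strip()]
    return (len(children) == 1
            and isinstance(children[0], Tag)
            and children[0].name == 'a')


def texts_without_link_only(html, name='p', class_='abc'):
    """Collect get_text() for matching tags, skipping the link-only ones."""
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get_text()
            for tag in soup.find_all(name, class_=class_)
            if not is_link_only(tag)]


sample = """
<div>
<p class="abc">text1 <a href='http://mysite1.com'>desc1</a> text2</p>
<p class="abc"><a href='http://mysite2.com'>desc2</a></p>
<p class="abc">text5</p>
</div>
"""

# The second paragraph contains only a link, so it is left out.
print(texts_without_link_only(sample))
```

With the sample above, only the text of the first and third paragraphs is kept.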
Upvotes: 1