user3563297


Beautiful Soup: Remove Tags that only contain href

From BeautifulSoup I get back a list of specific tags. Some of these tags contain only a link and no other text. When I call the get_text() method on them, I get the link's description.

But when a tag contains nothing except an <a href> element, I want to ignore it.

Tag: <p class="abc">text1 <a href=...>desc</a> text2</p> -> result: text1 desc text2 (OKAY)

Tag: <p class="abc"><a href=...>desc</a></p> -> result: desc (NOT OKAY)

When a tag contains only a link, I want to filter it out. How can I do that?

Upvotes: 1

Views: 875

Answers (1)

alecxe

Reputation: 473803

The idea is to iterate over the p tags and check whether each one has exactly one child and that child is an a tag:

from bs4 import BeautifulSoup


data = """
<div>
    <p class="abc">text1 <a href='http://mysite1.com'>desc1</a> text2</p>
    <p class="abc"><a href='http://mysite2.com'>desc2</a></p>
    <p class="abc"><a href='http://mysite3.com'>desc3</a>text3</p>
    <p class="abc">text4<a href='http://mysite4.com'>des4</a></p>
    <p class="abc">text5</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
for p in soup('p', class_='abc'):
    # exactly one child, and that child is an <a> tag
    if len(p.contents) == 1 and p.contents[0].name == 'a':
        print(p)

prints:

<p class="abc"><a href="http://mysite2.com">desc2</a></p>

FYI, .contents is the list of the tag's direct children, including text nodes.
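
To get what the question asks for (the text of every paragraph except the link-only ones), the same check can be inverted. The following is a minimal sketch, not part of the original answer: `is_link_only` is a hypothetical helper name, not a Beautiful Soup API, and it additionally assumes that whitespace-only text around the link should be ignored.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def is_link_only(tag):
    """Return True if `tag` holds exactly one <a> and no other visible text."""
    # Assumption: whitespace-only text nodes do not count as content,
    # so "<p> <a>...</a> </p>" is also treated as link-only.
    children = [
        child for child in tag.contents
        if not (isinstance(child, NavigableString) and not child.strip())
    ]
    return (
        len(children) == 1
        and isinstance(children[0], Tag)
        and children[0].name == "a"
    )


html = """
<div>
    <p class="abc">text1 <a href="http://mysite1.com">desc1</a> text2</p>
    <p class="abc"><a href="http://mysite2.com">desc2</a></p>
    <p class="abc"> <a href="http://mysite3.com">desc3</a> </p>
    <p class="abc">text5</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep the text of every paragraph that is not just a link.
texts = [
    p.get_text()
    for p in soup.find_all("p", class_="abc")
    if not is_link_only(p)
]
print(texts)  # ['text1 desc1 text2', 'text5']
```

Whether padding whitespace should count as content depends on the markup being scraped; dropping the whitespace filter gives the stricter check from the answer above.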

Upvotes: 1
