Chan

Reputation: 4291

How to find tag name given a text in BeautifulSoup

I have the following HTML, parsed with BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", 'lxml')

Now, I have a text '456'.

I want to find the text of all tags that look the same as the tag containing '456', i.e. same tag name and same attributes.

That is, in the HTML above, <p>456</p> contains '456', so we should also find 'abc' because of <p>abc</p>, but not '123' or '789', because <p style='xyz'>123</p> and <p style='xyz'>789</p> carry a style attribute.

Note that <p> above could be any other tag, such as <div>.

So hard-coding the tag name, as in soup.find('p'), should be avoided.

The final result should be ['456', 'abc'].

It is a bit complicated.

How can we solve this problem?

Thanks.

Upvotes: 1

Views: 1404

Answers (3)

Andrej Kesely

Reputation: 195428

This script prints all tags that share both the tag name and the tag attributes with the tag that contains the string "456":

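from bs4 import BeautifulSoup

txt = '''
    <div class='mydiv'>
        <p style='xyz'>123</p>
        <p>456</p>
        <p style='xyz'>789</p>
        <p>abc</p>
    </div>'''

text_to_find = '456'
soup = BeautifulSoup(txt, 'html.parser')

# Find the first tag whose first child is exactly the text we are looking for
tmp = soup.find(lambda t: t.contents and t.contents[0] == text_to_find)
if tmp:
    # Then find every tag with the same name and the same attributes
    for tag in soup.find_all(lambda t: t.name == tmp.name and t.attrs == tmp.attrs):
        print(tag)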

Prints:

<p>456</p>
<p>abc</p>

With text_to_find = "123", it prints:

<p style="xyz">123</p>
<p style="xyz">789</p>
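If you need the texts as a list (as in the question) rather than printed tags, the same idea can be wrapped in a small helper. This is only a sketch: the function name matching_texts is made up for illustration, and it uses 'html.parser' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup


def matching_texts(html, text_to_find):
    """Return the text of every tag that shares its name and attributes
    with the first tag whose first child equals text_to_find.
    (Hypothetical helper, not part of BeautifulSoup.)"""
    soup = BeautifulSoup(html, "html.parser")
    # Reference tag: first tag whose first child is exactly the given text
    ref = soup.find(lambda t: t.contents and t.contents[0] == text_to_find)
    if ref is None:
        return []
    # All tags with the same name and the same attribute dict
    same = soup.find_all(lambda t: t.name == ref.name and t.attrs == ref.attrs)
    return [t.get_text() for t in same]


html = "<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>"
print(matching_texts(html, "456"))  # ['456', 'abc']
print(matching_texts(html, "123"))  # ['123', '789']
```

The tag name is never hard-coded, so this should work the same way if the matching tag is a <div> or anything else.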

Upvotes: 1

UWTD TV

Reputation: 910

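Try removing every tag that has a style attribute and then printing the text that is left:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", 'html5lib')

# Remove every tag that carries a style attribute (this modifies the soup)
tags = soup.find_all()
for tag in tags:
    if tag.get('style'):
        tag.extract()

# Print the remaining text, one string per line
for tag in soup.select('html body'):
    print(tag.get_text('\n'))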

prints:

456
abc

Upvotes: 0

DJSchaffner

Reputation: 582

There are actually multiple ways to do this. Here are two examples of how you could find what you are looking for:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", 'lxml')

# Find all tags first and then keep the ones whose text matches your string
found = [x for x in soup.find_all() if x.text == "456"]

for p in found:
  print(p)

# Using find_all's string filter directly
# (string= replaces the older text= argument; findAll is the legacy name of find_all)
found = soup.find_all(string="456")

for p in found:
  print(p)

Prints:

<p>456</p>
456

Note, however, that the second method returns NavigableString objects, not Tag objects!
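If you do need the Tag, you can step from a NavigableString to its enclosing tag through .parent. A minimal sketch, using 'html.parser' here only to avoid the lxml dependency; the variable names are just for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", "html.parser")

# find(string=...) returns the matching text node (a NavigableString), or None
node = soup.find(string="456")

# .parent is the Tag that directly contains that text node
tag = node.parent if node is not None else None

print(type(node).__name__)  # NavigableString
print(tag)                  # <p>456</p>
print(tag.name, tag.attrs)  # p {}
```

From there, tag.name and tag.attrs give you what you need to look up the other tags of the same kind.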

Upvotes: 0
