hamid

Reputation: 714

How to extract the text inside a tag along with its inner tags?

I want to parse an HTML page using BeautifulSoup. I want to extract the contents of a tag without removing the inner HTML tags. For example, sample input:

<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>

Sample output:

'Angular2 <b>Router link not working</b>'

I have tried this:

from bs4 import BeautifulSoup

# Triple quotes are needed for a string literal that spans several lines.
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
         Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
print(soup.text)

But it gives:

'Angular2 Router link not working'

How can I extract the text without removing the inner tags?

Upvotes: 2

Views: 1708

Answers (3)

hamid

Reputation: 714

The first answer from here works fine. For this example:

from bs4 import BeautifulSoup

string = '''<a class="fl" href="https://stackoverflow.com/questio...">
             Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
# encode_contents() returns the inner HTML as bytes; decode it and trim the surrounding whitespace.
soup.find('a').encode_contents().decode('utf-8').strip()

It gives:

'Angular2 <b>Router link not working</b>'
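As a possible shortcut, the same inner HTML can be obtained as a str in one call with Tag.decode_contents(), which skips the bytes round trip. This is a minimal sketch on the same input; the variable names html, link and inner_html are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = '''<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# decode_contents() renders the children of <a> as a str,
# without the enclosing <a> tag itself.
inner_html = link.decode_contents().strip()
print(inner_html)
```

The strip() call only removes the newline and indentation that surround the contents in the source markup.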

Upvotes: 2

chitown88

Reputation: 28640

Yes, as Den stated, you need to grab the inner tag and convert it to str so that the tag itself is included. Den's solution grabs only <b> tags: it does not return the parent tag's text, and it misses any other styling tags inside. To be more general, you can find the child elements of the <a> tag instead of looking for the <b> tag specifically.

Essentially, this finds the <a> tag and grabs its whole text. It then goes through the children of that <a> tag, converts each one to a string, and replaces the child's text in the parent text with that string (which includes the tags).

from bs4 import BeautifulSoup

string = '''<a class="fl" href="https://stackoverflow.com/questio...">
     Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
     </a>'''

soup = BeautifulSoup(string, 'html.parser')
parent = ''

for item in soup.find_all('a'):
    # Start from the plain text of the <a> tag.
    parent = item.text.strip()
    # Swap each child's text for its full markup, tags included.
    # Note: str.replace() swaps every occurrence of that text.
    for child_ele in item.find_all():
        parent = parent.replace(child_ele.text, str(child_ele))

print(parent)

Output:

Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
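A variation on the same idea, sketched here as one possible alternative rather than as part of the answer above: instead of replacing text inside the parent string, join the string form of each direct child of the <a> tag. The helper name inner_html is hypothetical.

```python
from bs4 import BeautifulSoup

html = '''<a class="fl" href="https://stackoverflow.com/questio...">
     Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
     </a>'''

soup = BeautifulSoup(html, 'html.parser')


def inner_html(tag):
    """Hypothetical helper: concatenate the direct children of a tag.

    Text nodes are emitted as text and child tags keep their markup.
    """
    return ''.join(str(child) for child in tag.contents).strip()


results = [inner_html(a) for a in soup.find_all('a')]
print(results[0])
```

Because each child is emitted once and in document order, repeated text or nested tags do not get wrapped twice. Text nodes are written out as parsed, so entities in the text are not re-escaped.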

Upvotes: 0

Den Lakusta

Reputation: 11

When you write print(soup.text), you extract all the text from the 'a' tag, including the text of every tag inside it, but with the tags themselves stripped. If you want to get only the 'b' tag object, try the following:

soup = BeautifulSoup(string, 'html.parser')
b = soup.find('b')
print(b)
print(type(b))

or

soup = BeautifulSoup(string, 'html.parser')
b = soup.find('a', class_="fl").find('b')
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'bs4.element.Tag'>

As you can see, it returns your 'b' tag as a BeautifulSoup Tag object.

If you need the data as a string, you can just write:

b = soup.find('a', class_="fl").find('b')
b = str(b)
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'str'>
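Note that chaining find() calls raises AttributeError when the outer tag is missing, because find() returns None in that case. A minimal sketch of one way to guard against this, assuming a CSS selector is acceptable; the variable names and the 'a.fl em' selector are illustrative only:

```python
from bs4 import BeautifulSoup

html = '''<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>'''

soup = BeautifulSoup(html, 'html.parser')

# select_one() returns the first match for a CSS selector, or None.
bold = soup.select_one('a.fl b')
bold_html = str(bold) if bold is not None else ''

# No <em> exists inside the link, so this lookup yields None.
missing = soup.select_one('a.fl em')
missing_html = str(missing) if missing is not None else ''

print(bold_html)
print(repr(missing_html))
```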

Upvotes: 1
