Reputation: 714
I want to parse an HTML page using BeautifulSoup. I want to extract the text inside a tag without removing the inner HTML tags. For example, sample input:
<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b>
</a>
Sample output:
'Angular2 <b>Router link not working</b>'
I have tried this:
from bs4 import BeautifulSoup
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b>
</a>'''
soup = BeautifulSoup(string, 'html.parser')
print(soup.text)
But it gives:
'Angular2 Router link not working'
How can I extract the text without removing the inner tags?
Upvotes: 2
Views: 1708
Reputation: 714
The first answer from the linked question works fine for this example:
from bs4 import BeautifulSoup
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b>
</a>'''
soup = BeautifulSoup(string, 'html.parser')
# encode_contents() returns the tag's inner HTML as bytes; decode it and trim the surrounding newlines
soup.find('a').encode_contents().decode('utf-8').strip()
It gives:
'Angular2 <b>Router link not working</b>'
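As a side note, Tag also has a decode_contents() method that returns the inner HTML as a str directly, so the encode/decode round trip can be skipped. A minimal sketch, assuming the same sample markup as in the question (the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

string = '''<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b>
</a>'''

soup = BeautifulSoup(string, 'html.parser')
link = soup.find('a')

# decode_contents() renders the children of the tag (text and nested tags)
# as a str, without the enclosing <a>...</a> itself.
inner_html = link.decode_contents().strip()
print(inner_html)
```

The strip() call only removes the newlines that surround the content in the sample input.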
Upvotes: 2
Reputation: 28640
Yes, as Den stated, you need to grab the inner tag and convert it to str so that the tag itself is included. Den's solution grabs only the <b> tag: it does not include the parent tag's text, and it misses any other styling tags inside the link. If there are other tags, you can be more general and go through the child elements of the <a> tag instead of looking for the <b> tag specifically.
Essentially, this finds the <a> tag and grabs its whole text. It then goes through the children of that <a> tag, converts each one to a string, and replaces the child's text in the parent text with that string (which includes the tags).
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
</a>'''
from bs4 import BeautifulSoup, Tag
soup = BeautifulSoup(string, 'html.parser')
parsed_soup = ''
for item in soup.find_all('a'):
    if type(item) is Tag and 'a' != item.name:
        continue
    else:
        try:
            # Start from the plain text of the <a> tag
            parent = item.text.strip()
            child_elements = item.findChildren()
            for child_ele in child_elements:
                child_text = child_ele.text
                child_str = str(child_ele)
                # Put the child's markup back in place of its plain text
                parent = parent.replace(child_text, child_str)
        except Exception:
            parent = item.text
    print(parent)
Output:
Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
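One caveat with the text-replacement approach: str.replace() substitutes every occurrence, so it can misfire when a child's text also appears elsewhere in the parent text. A possible alternative, sketched here as an assumption rather than as part of the original answer, is to join the string form of each direct child from .contents. The helper name inner_html is hypothetical.

```python
from bs4 import BeautifulSoup


def inner_html(tag):
    """Join the string form of each direct child (text nodes and tags)."""
    return ''.join(str(child) for child in tag.contents).strip()


sample = '''<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
</a>'''
soup = BeautifulSoup(sample, 'html.parser')
result = inner_html(soup.find('a'))
print(result)

# Repeated text: only the second "bold" is wrapped in <b> in the source.
tricky = BeautifulSoup('<a>bold <b>bold</b></a>', 'html.parser')
tricky_result = inner_html(tricky.find('a'))
print(tricky_result)
```

Because each child is serialized exactly once and in order, the repeated word in the second example keeps its original markup instead of being wrapped twice.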
Upvotes: 0
Reputation: 11
When you write print(soup.text), you extract all the text from tag 'a', including the text of every tag inside it, but not the tags themselves.
If you want to get only the tag 'b' object, try the following:
soup = BeautifulSoup(string, 'html.parser')
b = soup.find('b')
print(b)
print(type(b))
or
soup = BeautifulSoup(string, 'html.parser')
b = soup.find('a', class_="fl").find('b')
print(b)
print(type(b))
Output:
<b>Router link not working</b>
<class 'bs4.element.Tag'>
As you can see, it returns your tag 'b' as a BeautifulSoup Tag object.
If you need the data in string format, you can write:
b = soup.find('a', class_="fl").find('b')
b = str(b)
print(b)
print(type(b))
Output:
<b>Router link not working</b>
<class 'str'>
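To relate this back to the question: calling str() on a tag gives its outer HTML, so applying it to tag 'a' would include the <a ...> wrapper as well, while get_text() drops all markup. A small sketch of the difference, assuming the sample markup from the question:

```python
from bs4 import BeautifulSoup

string = '''<a class="fl" href="https://stackoverflow.com/questio...">
Angular2 <b>Router link not working</b>
</a>'''

soup = BeautifulSoup(string, 'html.parser')
a = soup.find('a', class_='fl')

# str() serializes the tag itself plus everything inside it.
outer = str(a)
# get_text() keeps only the text nodes and drops every tag.
text_only = a.get_text().strip()

print(outer)
print(text_only)
```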
Upvotes: 1