Reputation: 826
The html for what I'm attempting to grab:
<div id="unitType">
<h2>BB100 <br>v1.4.3</h2>
</div>
I have the contents of an h2 tag below:
import urllib
from bs4 import BeautifulSoup

initialPage = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
deviceInfo = initialPage.find('div', {'id': 'unitType'}).h2.contents
print('Device Info: ', deviceInfo)
for i in deviceInfo:
    print i
Which outputs:
('Device Info: ', [u'BB100 ', <br>v1.4.3</br>])
BB100
<br>v1.4.3</br>
How do I remove the <h2>, </h2>, <br> and </br> html tags, using BeautifulSoup rather than regex? I've tried i.decompose() and i.strip() but neither has worked. It would throw 'NoneType' object is not callable.
Upvotes: 0
Views: 11442
Reputation: 180391
Just use find and extract the br tag:
In [15]: from bs4 import BeautifulSoup
...:
...: h = """<div id='unitType'><h2>BB10<br>v1.4.3</h2></div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: h2 = soup.find(id="unitType").h2
...: h2.find("br").extract()
...: print(h2)
...:
<h2>BB10</h2>
Or to replace the tag with just the text using replace-with:
In [16]: from bs4 import BeautifulSoup
...:
...: h = """<div id='unitType'><h2><br>v1.4.3 BB10</h2></div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: h2 = soup.find(id="unitType").h2
...:
...: br = h2.find("br")
...: br.replace_with(br.text)
...: print(h2)
...:
<h2>v1.4.3 BB10</h2>
To remove the h2 and keep the text:
In [37]: h = """<div id='unitType'><h2><br>v1.4.3 BB10</h2></div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: unit = soup.find(id="unitType")
...:
...: h2 = unit.find("h2")
...: h2.replace_with(h2.text)
...: print(unit)
...:
<div id="unitType">v1.4.3 BB10</div>
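As an alternative to replace_with, a tag can also be dropped while keeping its contents with unwrap(). The following is only a sketch, assuming Python 3 and a recent bs4; the html string and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<div id="unitType"><h2>BB100 <br>v1.4.3</h2></div>'
soup = BeautifulSoup(html, "html.parser")
unit = soup.find(id="unitType")

# unwrap() removes a tag from the tree but leaves whatever was inside it.
for tag in unit.find_all(["h2", "br"]):
    tag.unwrap()

unit_text = unit.get_text()
print(unit_text)
```

After the loop the div holds only text nodes, so get_text() returns the text with no h2 or br left in the tree.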
If you just want "v1.4.3" and "BB10", there are many ways to get them:
In [60]: h = """<div id="unitType">
...: <h2>BB100 <br>v1.4.3</h2>
...: </div>"""
...:
...: soup = BeautifulSoup(h, "html.parser")
...:
...: h2 = soup.find(id="unitType").h2
...: # just find all strings
...: a, b = h2.find_all(text=True)
...: print(a, b)
...: # get the br
...: br = h2.find("br")
...: # get br text and just the h2 text ignoring any text from children
...: a, b = h2.find(text=True, recursive=False), br.text
...: print(a, b)
...:
BB100 v1.4.3
BB100 v1.4.3
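If only the text matters, stripped_strings and get_text may be the shortest route, since neither depends on whether the parser puts the version text inside the br or next to it. A minimal sketch, assuming Python 3 and a recent bs4; the names model, version and joined are illustrative only:

```python
from bs4 import BeautifulSoup

html = """<div id="unitType">
<h2>BB100 <br>v1.4.3</h2>
</div>"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find(id="unitType").h2

# stripped_strings yields each text node with surrounding whitespace removed.
model, version = h2.stripped_strings

# get_text joins the same strings, here with a single space between them.
joined = h2.get_text(separator=" ", strip=True)

print(model, version)
print(joined)
```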
Why you end up with text ins
Upvotes: 4
Reputation: 40791
You can check if the element is a <br> tag with if i.name == 'br', and then just change the list to have the contents instead.
for i in deviceInfo:
    if i.name == 'br':
        i = i.contents
If you need to iterate over it many times, modify the list.
for n, i in enumerate(deviceInfo):
    if i.name == 'br':
        # replace the <br> tag in the list with its contents
        i = i.contents
        deviceInfo[n] = i
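The 'NoneType' object is not callable error from the question comes from calling string methods on a tag: on a Tag, an unknown attribute such as .strip is looked up as a child tag named "strip", which returns None, and calling None fails. A rough sketch of a type-aware loop follows, assuming Python 3 and a recent bs4; flatten_text is a hypothetical helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<div id="unitType"><h2>BB100 <br>v1.4.3</h2></div>'
soup = BeautifulSoup(html, "html.parser")
device_info = soup.find(id="unitType").h2.contents


def flatten_text(nodes):
    """Collect stripped text from a mix of strings and tags (hypothetical helper)."""
    parts = []
    for node in nodes:
        if isinstance(node, Tag):
            # Recurse into the tag's children instead of calling str methods on it.
            parts.extend(flatten_text(node.contents))
        elif isinstance(node, NavigableString):
            text = node.strip()
            if text:
                parts.append(text)
    return parts


parts = flatten_text(device_info)
print(parts)
```

Checking the node type first means strip() is only ever called on strings, and nested tags are handled by the recursion rather than by a nested list.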
Upvotes: 0