How to remove HTML tags in BeautifulSoup when I have contents

Question

The HTML I'm trying to grab:

<div id="unitType">
     <h2>BB100 <br>v1.4.3</br></h2>
</div>

I have the contents of an h2 tag below:

initialPage = BeautifulSoup(urllib.urlopen(url).read(), 'html.parser')
deviceInfo = initialPage.find('div', {'id': 'unitType'}).h2.contents
print('Device Info: ', deviceInfo)
for i in deviceInfo:
    print i

Which outputs:

('Device Info: ', [u'BB100 ', <br>v1.4.3</br>])
BB100 
<br>v1.4.3</br>

How do I remove the <h2>, <br> and </br> HTML tags, using BeautifulSoup rather than regex? I've tried i.decompose() and i.strip(), but neither worked: I get 'NoneType' object is not callable.

Padraic Cunningham · Accepted Answer

Just use find and extract the br tag:

In [15]: from bs4 import BeautifulSoup
    ...: 
    ...: h = """<div id="unitType"><h2>BB10<br>v1.4.3</br></h2></div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: h2 = soup.find(id="unitType").h2
    ...: h2.find("br").extract()
    ...: print(h2)
    ...: 
<h2>BB10</h2>

Or to replace the tag with just its text using replace_with:

In [16]: from bs4 import BeautifulSoup
    ...: 
    ...: h = """<div id="unitType"><h2><br>v1.4.3</br> BB10</h2></div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: h2 = soup.find(id="unitType").h2
    ...: 
    ...: br = h2.find("br")
    ...: br.replace_with(br.text)
    ...: print(h2)
    ...: 
<h2>v1.4.3 BB10</h2>

To remove the h2 and keep the text:

In [37]: h = """<div id="unitType"><h2><br>v1.4.3</br> BB10</h2></div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: unit = soup.find(id="unitType")
    ...: 
    ...: h2 = unit.find("h2")
    ...: h2.replace_with(h2.text)
    ...: print(unit)
    ...: 
<div id="unitType">v1.4.3 BB10</div>
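
As a rough sketch of another way to get the same effect, assuming the markup from the question: Tag.unwrap() removes a tag but leaves its contents where they were, so the text does not have to be copied over by hand. The variable names below are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup assumed from the question: a div with id "unitType" holding an h2.
html = '<div id="unitType"><h2>BB100 <br>v1.4.3</br></h2></div>'

soup = BeautifulSoup(html, "html.parser")
unit = soup.find(id="unitType")

# find_all returns its matches up front, so unwrapping inside the loop is safe.
# unwrap() drops the tag itself and keeps whatever it contained in place.
for tag in unit.find_all(["h2", "br"]):
    tag.unwrap()

text = unit.get_text()
print(text)  # BB100 v1.4.3
```

The div itself is left alone here; only the h2 and br tags inside it are removed.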

If you just want "v1.4.3" and "BB10", there are many ways to get them:

In [60]: h = """<div id="unitType">
    ...:      <h2>BB100 <br>v1.4.3</br></h2>
    ...: </div>"""
    ...: 
    ...: soup = BeautifulSoup(h, "html.parser")
    ...: 
    ...: h2 = soup.find(id="unitType").h2
    ...: # just find all strings
    ...: a, b = h2.find_all(text=True)
    ...: print(a, b)
    ...: # get the br
    ...: br = h2.find("br")
    ...: # get br text and just the h2 text ignoring any text from children
    ...: a, b = h2.find(text=True, recursive=False), br.text
    ...: print(a, b)
    ...: 
BB100  v1.4.3
BB100  v1.4.3
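
If only the text matters, one more option (a minimal sketch, again assuming the question's markup) is to ask for the strings directly with stripped_strings or get_text. Neither depends on whether the parser nests "v1.4.3" inside the br or places it next to it, which can differ between parsers.

```python
from bs4 import BeautifulSoup

# Same markup as in the question; the id and tag names come from there.
html = """<div id="unitType">
     <h2>BB100 <br>v1.4.3</br></h2>
</div>"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find(id="unitType").h2

# stripped_strings yields every text node under the tag, at any depth,
# with leading and trailing whitespace removed.
parts = list(h2.stripped_strings)
print(parts)  # ['BB100', 'v1.4.3']

# get_text joins the same text nodes with the separator you pass in.
joined = h2.get_text(" ", strip=True)
print(joined)  # BB100 v1.4.3
```

Unlike the tuple unpacking above, this does not assume there are exactly two strings inside the h2.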

Why you end up with text ins
