Sruthipriyanga

Reputation: 468

How to clean tags from HTML using BeautifulSoup

I'm trying to train data using the NLTK library, following a step-by-step process. The first step worked, but the second step raises the following error:

TypeError: a bytes-like object is required, not 'list'

I tried my best to rectify it but I'm getting the same error again.

Here is my code:

from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)

This is the error output:

C:\python\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 8 of the file E:/secure secure/chatbot-master/nltk.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html5lib")

  markup_type=markup_type))
Traceback (most recent call last):
  File "E:/secure secure/chatbot-master/nltk.py", line 8, in <module>
    soup = BeautifulSoup(html)
  File "C:\python\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\python\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\python\lib\site-packages\bs4\builder\_html5lib.py", line 72, in feed
    doc = parser.parse(markup, **extra_kwargs)
  File "C:\python\lib\site-packages\html5lib\html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "C:\python\lib\site-packages\html5lib\html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "C:\python\lib\site-packages\html5lib\tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 148, in HTMLInputStream
    return HTMLBinaryInputStream(source, encoding, parseMeta, chardet)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 416, in __init__
    self.rawStream = self.openStream(source)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 453, in openStream
    stream = BytesIO(source)
TypeError: a bytes-like object is required, not 'list'

Upvotes: 4

Views: 3628

Answers (3)

Dom DaFonte

Reputation: 1779

For anyone looking for an answer that works in Python 3:

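from bs4 import BeautifulSoup, NavigableString

invalidTags = ['br', 'b', 'font']

def stripTags(html, invalid_tags):
    # html.parser keeps fragments as-is; lxml would wrap each recursive
    # result in <html><body>, which would then leak into the output as text.
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            s = "::"
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Recursively strip tags nested inside this one.
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replace_with(s)
    return soup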

Upvotes: 1

Yuseferi

Reputation: 8670

You can do this by implementing a simple tag stripper:

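# Python 2 code: it uses unicode() and the print statement.
from bs4 import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    # html.parser leaves fragments unwrapped, which the recursion relies on.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Recursively strip tags nested inside this one.
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return soup

html = "<p>Love, <b>Hate</b>, and <i>Hap<b>piness</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)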

The result is:

<p>Love, Hate, and Happinessy</p>
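For Python 3 with bs4, a shorter route to the same kind of result is Tag.unwrap(), which replaces a tag with its own contents. This is a different technique from the recursive stripper above, shown only as a minimal sketch; the function name unwrap_tags and the sample string are made up for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, tag_names):
    """Remove the named tags but keep their text and child nodes."""
    # Assumption: html.parser, so fragments are not wrapped in <html><body>.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tag_names):
        tag.unwrap()  # replace the tag with its own contents
    return str(soup)


# Hypothetical sample markup with nested inline tags.
sample = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
cleaned = unwrap_tags(sample, ["b", "i", "u"])
print(cleaned)  # <p>Good, bad, and ugly</p>
```

Because unwrap() moves child nodes instead of re-serializing them as text, nested tags that are not in the list stay intact as real tags.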

Upvotes: 2

neehari

Reputation: 2612

Your code works as posted.

The UserWarning: No parser was explicitly specified appeared when your statement was soup = BeautifulSoup(html), as the traceback shows.

The TypeError: a bytes-like object is required, not 'list' error might be due to an issue with dependencies.
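Another possibility, going by the last traceback frame (stream = BytesIO(source)), is that the markup passed to BeautifulSoup was a list rather than bytes. The posted code calls response.read(), which returns bytes, so this is only a guess about the code that actually ran (for example, if it used response.readlines()). A minimal sketch of that assumption, using only the standard library:

```python
from io import BytesIO

# Assumption: the markup was a list of byte lines, e.g. from readlines().
lines = [b"<html><body>", b"<p>Hello</p>", b"</body></html>"]

try:
    BytesIO(lines)  # same kind of call as the last traceback frame
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Joining the lines gives a single bytes object, which BytesIO accepts.
markup = b"".join(lines)
stream = BytesIO(markup)
print(stream.read() == markup)  # True
```

If that is what happened, passing the joined bytes (or the result of response.read()) to BeautifulSoup should avoid the TypeError.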

The bs4 documentation says that if you do not specify a parser, as in BeautifulSoup(markup), it uses the best HTML parser installed on your system:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

On my system, using BeautifulSoup(html, "html.parser") worked just fine, with decent speed, without any warnings. html.parser comes with Python’s standard library.

The documentation also summarizes the advantages and disadvantages of each parser library in a table under "Installing a parser".

Try BeautifulSoup(html, "html.parser"). It should work.

If you want speed, you could try BeautifulSoup(html, "lxml"). If you do not have lxml’s HTML parser, on Windows you may want to install it with pip install lxml.
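As a rough sketch of the explicit-parser call, here is the same flow with an inline byte string standing in for the downloaded page, so it runs without network access. The sample markup is made up; the separator argument is optional and only added to keep words from adjacent tags apart.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by response.read().
html = (
    b"<html><head><title>Demo</title></head>"
    b"<body><h1>Hello</h1><p>World</p></body></html>"
)

# Naming the parser avoids the "No parser was explicitly specified" warning.
soup = BeautifulSoup(html, "html.parser")

# separator=" " puts a space between text pieces; strip=True trims each piece.
text = soup.get_text(separator=" ", strip=True)
print(text)  # Demo Hello World
```

Swapping "html.parser" for "lxml" or "html5lib" should only require that the corresponding package is installed.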

Upvotes: 1
