Reputation: 468
I'm trying to train on some data using the NLTK library, following a step-by-step process. The first step worked, but the second step raises the following error:
TypeError: a bytes-like object is required, not 'list'
I tried my best to rectify it but I'm getting the same error again.
Here is my code:
from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
This is the error output:
C:\python\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 8 of the file E:/secure secure/chatbot-master/nltk.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
Traceback (most recent call last):
  File "E:/secure secure/chatbot-master/nltk.py", line 8, in <module>
    soup = BeautifulSoup(html)
  File "C:\python\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\python\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\python\lib\site-packages\bs4\builder\_html5lib.py", line 72, in feed
    doc = parser.parse(markup, **extra_kwargs)
  File "C:\python\lib\site-packages\html5lib\html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "C:\python\lib\site-packages\html5lib\html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "C:\python\lib\site-packages\html5lib\tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 148, in HTMLInputStream
    return HTMLBinaryInputStream(source, encoding, parseMeta, chardet)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 416, in __init__
    self.rawStream = self.openStream(source)
  File "C:\python\lib\site-packages\html5lib\inputstream.py", line 453, in openStream
    stream = BytesIO(source)
TypeError: a bytes-like object is required, not 'list'
Upvotes: 4
Views: 3628
Reputation: 1779
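For anyone looking for an answer that works in Python 3:
from bs4 import BeautifulSoup, NavigableString

invalidTags = ['br', 'b', 'font']

def stripTags(html, invalid_tags):
    # html.parser is used here because lxml wraps fragments in <html><body>,
    # which would leak into the string built by the recursive call below.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = "::"  # marker prepended to the text that replaces each removed tag
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Recursively strip tags nested inside this one
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replaceWith(s)
    return soup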
Upvotes: 1
Reputation: 8670
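You can achieve it by implementing a simple tag stripper. This version is written for Python 2 (it uses unicode and the print statement):
# Assumes the bs4 package; the parser is named explicitly to avoid the UserWarning
from bs4 import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Recursively strip tags nested inside this one
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            # Replace the tag with the plain text collected from its contents
            tag.replaceWith(s)
    return soup

html = "<p>Love, <b>Hate</b>, and <i>Hap<b>pines</b><u>s</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)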
The result is:
<p>Love, Hate, and Happiness</p>
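As a side note, bs4 has a built-in Tag.unwrap() method that removes a tag but keeps its contents, so a hand-written stripper may not be needed. Below is a minimal Python 3 sketch of that alternative, not the function above. The name unwrap_tags and the sample markup are only for illustration, and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, invalid_tags):
    """Remove the given tags but keep their contents (hypothetical helper)."""
    # html.parser is chosen so the fragment is not wrapped in <html><body>.
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and matches any of them.
    for tag in soup.find_all(invalid_tags):
        # unwrap() replaces the tag with its children, nested tags included.
        tag.unwrap()
    return soup


html = "<p>Love, <b>Hate</b>, and <i>Hap<b>pines</b><u>s</u></i></p>"
result = str(unwrap_tags(html, ['b', 'i', 'u']))
print(result)  # <p>Love, Hate, and Happiness</p>
```

Unlike the recursive version, this keeps any nested tags that are not in the list as real tags instead of turning them into text.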
Upvotes: 2
Reputation: 2612
Your code works as posted.
The UserWarning: No parser was explicitly specified appeared when your statement was soup = BeautifulSoup(html), with no parser argument.
The TypeError: a bytes-like object is required, not 'list' may be due to a dependency problem, since the traceback ends inside html5lib rather than in your own code.
The bs4 documentation says that if you do not specify a parser, as in BeautifulSoup(markup), it uses the best HTML parser installed on your system:
If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
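To see which parser that ranking would probably select on a given machine, a rough check like the one below may help. It is only a sketch: pick_parser is a made-up helper, not part of bs4, and it simply tests whether the optional packages can be imported, in the order the documentation describes.

```python
from importlib.util import find_spec


def pick_parser():
    """Return a parser name in the order lxml, html5lib, html.parser."""
    # Hypothetical helper: checks whether each optional package is importable.
    for package in ("lxml", "html5lib"):
        if find_spec(package) is not None:
            return package
    # html.parser ships with the standard library, so it is always available.
    return "html.parser"


parser = pick_parser()
print(parser)
```

The returned name can then be passed explicitly, for example BeautifulSoup(html, pick_parser()), so the choice is visible in the code and no warning is raised.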
On my system, BeautifulSoup(html, "html.parser") worked just fine, with decent speed and without any warnings. html.parser comes with Python’s standard library.
The documentation also summarizes the advantages and disadvantages of each parser library.
Try BeautifulSoup(html, "html.parser"). It should work.
If you want speed, you could try BeautifulSoup(html, "lxml"). If you do not have lxml’s HTML parser, on Windows you may want to install it with pip install lxml.
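For reference, here is a small offline sketch of the suggestion above. It parses a bytes value (the same type response.read() returns) with the parser named explicitly. The sample markup is made up so the snippet runs without a network call, and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the bytes returned by response.read().
html = b"<html><body><h1>PHP</h1><p>A popular scripting language.</p></body></html>"

# Naming the parser avoids the UserWarning and bypasses html5lib entirely.
soup = BeautifulSoup(html, "html.parser")

# Join the stripped text fragments with a single space.
text = soup.get_text(" ", strip=True)
print(text)  # PHP A popular scripting language.
```

If this runs cleanly but the html5lib variant still fails, that would point to the html5lib installation rather than to the script itself.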
Upvotes: 1