Reputation: 67
I have a bunch of HTML files that I am trying to read with BeautifulSoup. For some of them, I get an error. I tried decoding and encoding, but cannot find the problem. Thank you very much in advance.
Here is an example.
import requests
from bs4 import BeautifulSoup
new_text = requests.get('https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt')
soup = BeautifulSoup(new_text.content.decode('utf-8','ignore').encode("utf-8"),'lxml')
print(soup)
On Jupyter Notebook, I get a dead kernel error. On PyCharm, I get the following error (it repeats itself, so I deleted some of the repeated frames, but it was quite long):
Traceback (most recent call last):
File "C:/Users/oe/.PyCharmCE2019.1/config/scratches/scratch_5.py", line 5, in <module>
print(soup)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1099, in __unicode__
return self.decode()
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\__init__.py", line 566, in decode
indent_level, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
formatter))
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
formatter))
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
formatter))
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
indent_contents, eventual_encoding, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1254, in decode_contents
text = c.output_ready(formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 745, in output_ready
output = self.format_string(self, formatter)
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 220, in format_string
if isinstance(formatter, Callable):
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\abc.py", line 190, in __instancecheck__
subclass in cls._abc_negative_cache):
File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\_weakrefset.py", line 75, in __contains__
return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
Upvotes: 3
Views: 822
Reputation: 24930
Frankly, I'm not sure what the underlying problem with your code is (although I don't get a dead kernel in a Jupyter notebook), but this seems to work:
url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
import requests
from bs4 import BeautifulSoup
new_text = requests.get(url)
soup = BeautifulSoup(new_text.text,'lxml')
print(soup.text)
Note that in soup, new_text.content is replaced with new_text.text; I had to drop the decode/encode calls, and the print command had to be changed from print(soup) (which raised an error) to print(soup.text), which works fine. Maybe someone smarter can explain...
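A likely explanation for the RecursionError, sketched here with a synthetic document rather than the actual SEC filing: rendering a soup (which is what print(soup) triggers) descends the tag tree with one recursive call per nesting level, as the traceback's repeating decode/decode_contents frames show, and a large EDGAR filing is nested deeply enough to blow past Python's default recursion limit. The depth value below is arbitrary, and the stdlib html.parser builder is used so the sketch has no lxml dependency:

```python
import sys
from bs4 import BeautifulSoup

# Synthetic document with very deep tag nesting, standing in for the
# heavily nested markup inside a large EDGAR filing (depth is arbitrary).
depth = 2000
html = "<div>" * depth + "deep text" + "</div>" * depth
soup = BeautifulSoup(html, "html.parser")

try:
    # str(soup) walks the tree recursively, so a deep enough document
    # can exceed the interpreter's default limit (usually 1000 frames).
    rendered = str(soup)
    print("rendered at the default recursion limit")
except RecursionError:
    # Raising the interpreter's limit lets the same render finish.
    sys.setrecursionlimit(10 * depth)
    rendered = str(soup)
    print("rendered after raising the recursion limit")
```

This also suggests why print(soup.text) succeeded where print(soup) failed: extracting only the text involves far less per-tag rendering work than serializing every nested tag.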
Another option that works is urllib:
import urllib.request
response = urllib.request.urlopen(url)
new_text2 = response.read()
soup = BeautifulSoup(new_text2,'lxml')
print(soup.text)
Upvotes: 2