Ozioh

Reputation: 67

How to deal with BeautifulSoup Recursion Error (or parse error)

I have a bunch of HTML files that I am trying to read with BeautifulSoup. For some of them I get an error. I tried decoding and encoding, but cannot find the problem. Thank you very much in advance.

Here is an example.

import requests
from bs4 import BeautifulSoup
new_text = requests.get('https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt')
soup = BeautifulSoup(new_text.content.decode('utf-8','ignore').encode("utf-8"),'lxml')
print(soup)

In a Jupyter notebook I get a dead-kernel error. In PyCharm I get the following error (it repeats itself, so I deleted some of it, but it was quite long):

Traceback (most recent call last):
  File "C:/Users/oe/.PyCharmCE2019.1/config/scratches/scratch_5.py", line 5, in <module>
    print(soup)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1099, in __unicode__
    return self.decode()
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\__init__.py", line 566, in decode
    indent_level, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
    formatter))
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
    formatter))
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1257, in decode_contents
    formatter))
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1188, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 1254, in decode_contents
    text = c.output_ready(formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 745, in output_ready
    output = self.format_string(self, formatter)
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\site-packages\bs4\element.py", line 220, in format_string
    if isinstance(formatter, Callable):
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\abc.py", line 190, in __instancecheck__
    subclass in cls._abc_negative_cache):
  File "C:\Users\oe\Anaconda3\envs\TextAnalysis\lib\_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison

Upvotes: 3

Views: 822

Answers (1)

Jack Fleeting

Reputation: 24930

Frankly, I'm not sure what the underlying problem with your code is (although I don't get a dead kernel in a Jupyter notebook), but this seems to work:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
new_text = requests.get(url)

soup = BeautifulSoup(new_text.text, 'lxml')
print(soup.text)

Note that in the soup call, new_text.content is replaced with new_text.text, I dropped the decode/encode round-trip, and the print statement had to change from print(soup) (which raised the error) to print(soup.text), which works fine. Maybe someone smarter can explain why...
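For what it's worth, the traceback is a plain Python RecursionError raised while str(soup) walks a very deeply nested tree, so another thing you could try (just a sketch, not tested against this particular file) is raising the interpreter's recursion limit and keeping print(soup) as-is:

import sys
import requests
from bs4 import BeautifulSoup

# Sketch only: assumes the tree is simply deeper than Python's default
# recursion limit (about 1000). Raising the limit trades safety for depth;
# set it too high and you risk a hard interpreter crash instead of a
# RecursionError (possibly what killed the Jupyter kernel).
sys.setrecursionlimit(10000)

url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup)  # may now render the full document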

Another option that works is urllib.request:

import urllib.request

response = urllib.request.urlopen(url)
new_text2 = response.read()
soup = BeautifulSoup(new_text2,'lxml')
print(soup.text)
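Both variants end up doing the same thing: requests.get(url).text hands BeautifulSoup already-decoded text, while response.read() hands it raw bytes and lets bs4/lxml work out the encoding. If you go the urllib route, the same sketch tidied up with a context manager (so the connection is closed cleanly) would look like this:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1723069/000121390018016357/0001213900-18-016357.txt'

# response.read() returns bytes; BeautifulSoup/lxml detect the encoding themselves.
with urllib.request.urlopen(url) as response:
    soup = BeautifulSoup(response.read(), 'lxml')

print(soup.text)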

Upvotes: 2
