Zeno

Reputation: 1829

BeautifulSoup: is not JSON serializable

I have this code that was written by someone else for Python 2, and I converted it to Python 3:

    url = self.lodestone_url + '/topics/'
    r = self.make_request(url)

    news = []
    soup = bs4.BeautifulSoup(r.content)
    for tag in soup.select('.news__content__list__topics li'):
        entry = {}
        title_tag = tag.select('.ic_topics a')[0]
        script = str(tag.select('script')[0])
        entry['timestamp'] = int(re.findall(r"1[0-9]{9},", script)[0].rstrip(','))
        entry['link'] = '//' + self.lodestone_domain + title_tag['href']
        entry['id'] = entry['link'].split('/')[-1]
        entry['title'] = title_tag.string.strip()
        body = tag.select('.news__content__list__topics--body')[0]
        for a in body.findAll('a'):
            if a['href'].startswith('/'):
                a['href'] = '//' + self.lodestone_domain + a['href']
        print(type(body))
        entry['body'] = body.encode('utf-8').strip()
        #entry['body'] = ""
        entry['lang'] = 'en'
        news.append(entry)

The last piece I cannot figure out is this line from above:

        entry['body'] = body.encode('utf-8').strip()

Because it's giving this error:

Traceback (most recent call last):
  File "lodestoner", line 48, in <module>
    print(json.dumps(ret, indent=4))
  File "/usr/local/lib/python3.5/json/__init__.py", line 237, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.5/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/local/lib/python3.5/json/encoder.py", line 427, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/usr/local/lib/python3.5/json/encoder.py", line 324, in _iterencode_list
    yield from chunks
  File "/usr/local/lib/python3.5/json/encoder.py", line 403, in _iterencode_dict
    yield from chunks
  File "/usr/local/lib/python3.5/json/encoder.py", line 436, in _iterencode
    o = _default(o)
  File "/usr/local/lib/python3.5/json/encoder.py", line 180, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'<div class="news__content__list__topics--body"><a class="news__content__list__topics__link_banner" href="//na.finalfantasyxiv.com/lodestone/topics/detail/f05649918007c827f44000ef5462461cec1e8b38"><img alt="" height="149" src="http://img.finalfantasyxiv.com/t/f05649918007c827f44000ef5462461cec1e8b38.png?1473152734" width="570"/></a>FINAL FANTASY XIV will be attending Tokyo Game Show 2016 at Makuhari Messe in Chiba in full force, and we\xe2\x80\x99ll be a larger than Hydaelyn presence as we\xe2\x80\x99ll be occupying space at our own Square Enix booth as well as the Intel booth! Additionally, we\xe2\x80\x99ll be broadcasting the next Letter from the Producer LIVE straight from the show floor, so be sure to mark your calendars as this is the second part of the Patch 3.4 special which you won\xe2\x80\x99t want to miss!<br><br><a href="//na.finalfantasyxiv.com/lodestone/topics/detail/f05649918007c827f44000ef5462461cec1e8b38" rel="f05649918007c827f44000ef5462461cec1e8b38">Read on</a> for more details.</br></br></div>' 
is not JSON serializable

Above, the body variable is of type <class 'bs4.element.Tag'>.

So when I remove the encode part so that the line looks like this:

        entry['body'] = body.strip()

I then get this error:

TypeError: 'NoneType' object is not callable

What am I missing? In most cases like this, removing encode has worked.

Upvotes: 0

Views: 5136

Answers (1)

Padraic Cunningham

Reputation: 180471

The original author is not extracting the text; they are dumping the tag's HTML content. On Python 3, encode returns bytes, which json cannot serialize, so you need to pass a str to get the same result:

In [10]: soup = BeautifulSoup("<div>foo</div>","html.parser")

In [11]: print(json.dumps(soup.div.encode("utf-8")))
.....................................

/usr/lib/python3.5/json/encoder.py in default(self, o)
    177 
    178         """
--> 179         raise TypeError(repr(o) + " is not JSON serializable")
    180 
    181     def encode(self, o):

TypeError: b'<div>foo</div>' is not JSON serializable

In [12]: print(json.dumps(str(soup.div.encode("utf-8"),"utf-8")))
"<div>foo</div>"

That is exactly what the original call produces on Python 2:

In [4]: soup = BeautifulSoup("<div>foo</div>","html.parser")

In [5]: print(json.dumps(soup.div.encode("utf-8")))
"<div>foo</div>"
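
Applied to the code in the question, a minimal sketch might look like the following. It assumes that calling str() on the tag is an acceptable substitute for the encode/decode round trip shown above (both yield the tag's markup as text), and the sample markup and variable names are made up for illustration. It also shows why body.strip() failed: attribute access on a Tag is treated as a search for a child tag with that name, so body.strip looks for a <strip> element, finds none, and returns None.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the news body in the question.
soup = BeautifulSoup('<div class="body"><a href="/x">foo</a></div>', "html.parser")
body = soup.div

# Tag.encode() returns bytes, which json.dumps cannot serialize on Python 3.
encoded = body.encode("utf-8")

# str(tag) returns the same markup as a str, which json.dumps accepts.
html = str(body).strip()
dumped = json.dumps({"body": html})

# body.strip is not str.strip: bs4 searches for a child <strip> tag and
# returns None when there is none, so body.strip() tries to call None.
no_such_tag = body.strip

print(dumped)
```

In the question's loop this would amount to writing entry['body'] = str(body).strip() in place of the encode call.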

Upvotes: 1
