Reputation: 1829
I have this code that was written by someone else for Python 2, and I converted it to Python 3:
url = self.lodestone_url + '/topics/'
r = self.make_request(url)
news = []
soup = bs4.BeautifulSoup(r.content)
for tag in soup.select('.news__content__list__topics li'):
    entry = {}
    title_tag = tag.select('.ic_topics a')[0]
    script = str(tag.select('script')[0])
    entry['timestamp'] = int(re.findall(r"1[0-9]{9},", script)[0].rstrip(','))
    entry['link'] = '//' + self.lodestone_domain + title_tag['href']
    entry['id'] = entry['link'].split('/')[-1]
    entry['title'] = title_tag.string.strip()
    body = tag.select('.news__content__list__topics--body')[0]
    for a in body.findAll('a'):
        if a['href'].startswith('/'):
            a['href'] = '//' + self.lodestone_domain + a['href']
    print(type(body))
    entry['body'] = body.encode('utf-8').strip()
    #entry['body'] = ""
    entry['lang'] = 'en'
    news.append(entry)
The last piece I cannot figure out is this line from above:
entry['body'] = body.encode('utf-8').strip()
Because it's giving this error:
Traceback (most recent call last):
  File "lodestoner", line 48, in <module>
    print(json.dumps(ret, indent=4))
  File "/usr/local/lib/python3.5/json/__init__.py", line 237, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.5/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/local/lib/python3.5/json/encoder.py", line 427, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/usr/local/lib/python3.5/json/encoder.py", line 324, in _iterencode_list
    yield from chunks
  File "/usr/local/lib/python3.5/json/encoder.py", line 403, in _iterencode_dict
    yield from chunks
  File "/usr/local/lib/python3.5/json/encoder.py", line 436, in _iterencode
    o = _default(o)
  File "/usr/local/lib/python3.5/json/encoder.py", line 180, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'<div class="news__content__list__topics--body"><a class="news__content__list__topics__link_banner" href="//na.finalfantasyxiv.com/lodestone/topics/detail/f05649918007c827f44000ef5462461cec1e8b38"><img alt="" height="149" src="http://img.finalfantasyxiv.com/t/f05649918007c827f44000ef5462461cec1e8b38.png?1473152734" width="570"/></a>FINAL FANTASY XIV will be attending Tokyo Game Show 2016 at Makuhari Messe in Chiba in full force, and we\xe2\x80\x99ll be a larger than Hydaelyn presence as we\xe2\x80\x99ll be occupying space at our own Square Enix booth as well as the Intel booth! Additionally, we\xe2\x80\x99ll be broadcasting the next Letter from the Producer LIVE straight from the show floor, so be sure to mark your calendars as this is the second part of the Patch 3.4 special which you won\xe2\x80\x99t want to miss!<br><br><a href="//na.finalfantasyxiv.com/lodestone/topics/detail/f05649918007c827f44000ef5462461cec1e8b38" rel="f05649918007c827f44000ef5462461cec1e8b38">Read on</a> for more details.</br></br></div>' is not JSON serializable
Above, the body variable is of type <class 'bs4.element.Tag'>.
So I removed the encode part, so that the line looks like this:
entry['body'] = body.strip()
I then get this error:
TypeError: 'NoneType' object is not callable
What am I missing? In most scenarios like this, removing encode has worked.
Upvotes: 0
Views: 5136
Reputation: 180471
The original author is not extracting the text; they are dumping the HTML content. To do the same in Python 3, you need to pass a str to json.dumps, because encode returns bytes there:
In [10]: soup = BeautifulSoup("<div>foo</div>","html.parser")
In [11]: print(json.dumps(soup.div.encode("utf-8")))
.....................................
/usr/lib/python3.5/json/encoder.py in default(self, o)
    177
    178         """
--> 179         raise TypeError(repr(o) + " is not JSON serializable")
    180
    181     def encode(self, o):
TypeError: b'<div>foo</div>' is not JSON serializable
In [12]: print(json.dumps(str(soup.div.encode("utf-8"),"utf-8")))
"<div>foo</div>"
This is exactly what you get with the original code on Python 2:
In [4]: soup = BeautifulSoup("<div>foo</div>","html.parser")
In [5]: print(json.dumps(soup.div.encode("utf-8")))
"<div>foo</div>"
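The same point can be shown without BeautifulSoup at all. The sketch below is only an illustration: `raw` is a made-up stand-in for what body.encode('utf-8') returns on Python 3, not output from the real page. It shows json.dumps rejecting bytes and accepting the decoded str.

```python
import json

# Hypothetical stand-in for body.encode('utf-8') on Python 3: a bytes object.
raw = b'<div>we\xe2\x80\x99ll be there</div>  '

rejected = False
try:
    json.dumps(raw)  # bytes are not JSON serializable on Python 3
except TypeError as exc:
    rejected = True
    print("bytes rejected:", exc)

# Decode back to str first; strip() then operates on text, as in Python 2.
text = raw.decode('utf-8').strip()
encoded = json.dumps(text)
print(encoded)
```

In the question's code this would presumably amount to entry['body'] = str(body).strip(), since calling str() on a bs4 Tag returns its markup as text. Calling body.strip() directly fails because Tag has no strip method: bs4 treats an unknown attribute as a search for a child tag of that name, finds none, and returns None, which is then called.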
Upvotes: 1