Reputation: 347
I am scraping some data from Google Images and have found that letters such as 'î' are being decoded incorrectly. In this case 'î' becomes 'Ã®'. I have stored the data from a Google query in an object, in the format:
{"key":"value"}
However, the values of the dictionary can contain non-ASCII characters, such as:
{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-Cloître, Brussels ( 32781868883).jpg"}
which, when I receive the data, is in the form
{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-Clo\xc3\xaetre, Brussels ( 32781868883).jpg"}
So when I try to convert it to bytes and decode it using:
decoded_obj = bytes(raw_obj, 'utf-8').decode('unicode_escape')
I get the output
{"key":"File:Blue tit (Cyanistes caeruleus), Parc du Rouge-CloÃ®tre, Brussels ( 32781868883).jpg"}
The scraper's code is as follows:
import urllib.request
import json
url = 'https://www.google.com/search?q=Blue+tit+(Cyanistes+caeruleus),+Parc+du+Rouge-Clo%C3%AEtre,+Brussels+(32781868883).jpg&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiE8866stfjAhWBolwKHQ1YCdQQ_AUIESgB&biw=1920&bih=937'
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
data = str(response.read())
start_line = data.find('class="rg_meta notranslate">')
start_obj = data.find('{', start_line + 1)
end_obj = data.find('</div>', start_obj + 1)
raw_obj = str(data[start_obj:end_obj])
decoded_obj = bytes(raw_obj, 'utf-8').decode('unicode_escape')
final_obj = json.loads(decoded_obj)
print(final_obj)
Upvotes: 0
Views: 136
Reputation: 55629
The response data consists of UTF-8 encoded bytes:
>>> response = urllib.request.urlopen(request)
>>> res = response.read()
>>> type(res)
<class 'bytes'>
>>> response.headers
<http.client.HTTPMessage object at 0x7ff6ea74ba90>
>>> response.headers['Content-type']
'text/html; charset=UTF-8'
The correct way to handle this is to decode the response data:
>>> data = response.read().decode('utf-8')
Once this is done, data is a str and there is no need for any further decoding or encoding (or str() or bytes() calls).
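As a rough sketch of how the scraper's parsing step might look with this change, the snippet below decodes the bytes once and passes the extracted text straight to json.loads. The `body` value is a hypothetical, shortened stand-in for what response.read() returns (so the example runs without a network call); the find() logic is copied from the question.

```python
import json

# Hypothetical stand-in for response.read(): raw UTF-8 bytes from the server.
body = b'<div class="rg_meta notranslate">{"key":"Parc du Rouge-Clo\xc3\xaetre"}</div>'

# Decode exactly once, using the charset announced in the Content-Type header.
data = body.decode('utf-8')

# Same extraction logic as in the question, now operating on a real str.
start_line = data.find('class="rg_meta notranslate">')
start_obj = data.find('{', start_line + 1)
end_obj = data.find('</div>', start_obj + 1)

# No str(), bytes() or unicode_escape round trip is needed before parsing.
final_obj = json.loads(data[start_obj:end_obj])
print(final_obj)
```

If the charset should not be hard-coded, response.headers.get_content_charset() can be used to read it from the headers, with 'utf-8' as a fallback when it returns None.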
In general, calling str on a bytes instance is the wrong thing to do, unless you provide the appropriate encoding:
>>> s = 'spam'
>>> bs = s.encode('utf-8')
>>> str(bs)
"b'spam'" # Now 'b' is inside the string
>>>
>>> str(bs, encoding='utf-8')
'spam'
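To connect this to the question, here is a small sketch of why the str() plus 'unicode_escape' round trip produces 'Ã®'. The variable names are illustrative only, and the final repair line is a workaround for text that is already mangled, not a substitute for decoding correctly in the first place.

```python
# The two UTF-8 bytes for 'î' are 0xC3 0xAE.
raw = b'Clo\xc3\xaetre'

# str() on bytes yields the repr, with literal backslash escapes inside it.
as_text = str(raw)            # the text b'Clo\xc3\xaetre', backslashes included
inner = as_text[2:-1]         # strip the leading b' and the trailing '

# unicode_escape turns each \xNN escape into ONE code point (U+00C3, U+00AE),
# instead of treating the pair as a single UTF-8 sequence.
mangled = bytes(inner, 'utf-8').decode('unicode_escape')

# Workaround for already-mangled text: map the code points back to bytes
# via Latin-1, then decode those bytes as UTF-8.
repaired = mangled.encode('latin-1').decode('utf-8')

print(mangled)
print(repaired)
print(raw.decode('utf-8'))
```

The first print shows the mangled form from the question, while the last two both show the intended text.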
Upvotes: 1