ic10503

Reputation: 130

Scrapy response.replace encoding error

I am trying to replace the response body of a Google search results page with a single search result block, using response.replace(), and I am running into encoding issues.

scrapy shell "http://www.google.de/search?q=Zuckerccc"

>>> srb = hxs.select("//li[@class='g']").extract()
>>> body = '<html><body>' + srb[0] + '</body></html>'    # get only 1st search result block
>>> b = response.replace(body = body)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 54, in replace
    return Response.replace(self, *args, **kwargs)
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 77, in replace
    return cls(*args, **kwargs)
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 31, in __init__
    super(TextResponse, self).__init__(*args, **kwargs)
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 19, in __init__
    self._set_body(body)
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 48, in _set_body
    self._body = body.encode(self._encoding)
  File "../local_1/Linux-2.6c2.5-x86_64/Python/Python-147.0-0/lib/python2.6/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0131' in position 529: character maps to <undefined>

I also tried creating my own response:

>>> x = HtmlResponse("http://www.google.de/search?q=Zuckerccc", body = body, encoding = response.encoding)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 31, in __init__
    super(TextResponse, self).__init__(*args, **kwargs)
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 19, in __init__
    self._set_body(body)
  File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 48, in _set_body
    self._body = body.encode(self._encoding)
  File "../local_1/Linux-2.6c2.5-x86_64/Python/Python-147.0-0/lib/python2.6/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0131' in position 529: character maps to <undefined>

However, it works when I pass the result of response._body_declared_encoding() as the encoding argument to replace():

>>> b = response.replace(body = body, encoding = response._body_declared_encoding())

I don't understand why response._body_declared_encoding() and response.encoding are different. Can anybody shed some light on this?

And what would be a good way to fix it?

Upvotes: 4

Views: 3714

Answers (2)

Gykbot

Reputation: 11

I checked the source code of scrapy.http.response.text: when we use TextResponse, self._encoding has to be set before the body is set. So we can do this:

>>> response._encoding = 'utf8'
>>> response._set_body("aaaaaa")
>>> response.body
'aaaaaa'
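
Why the choice of encoding matters here can be shown without Scrapy at all. The following is a minimal sketch using only the standard library: it takes the character named in the traceback (u'\u0131') and tries to encode it with cp1252 and with UTF-8. The helper name try_encode and the shortened snippet string are made up for illustration.

```python
# -*- coding: utf-8 -*-
# U+0131 (LATIN SMALL LETTER DOTLESS I) is the character named in the traceback.
snippet = u"<html><body>\u0131</body></html>"


def try_encode(text, encoding):
    """Hypothetical helper: return the encoded bytes, or None if the codec
    cannot represent every character in the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# cp1252 has no mapping for U+0131, so this attempt fails.
cp1252_result = try_encode(snippet, "cp1252")

# UTF-8 can represent every Unicode character, so this attempt succeeds.
utf8_result = try_encode(snippet, "utf-8")

print(cp1252_result)
print(utf8_result)
```

This is the same body.encode(self._encoding) step that fails in the traceback, which suggests that any encoding able to represent the whole body would avoid the error.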

Upvotes: 1

conscho

Reputation: 193

I successfully replaced the response body with these lines of code:

scrapy shell "http://www.google.de/search?q=Zuckerccc"
>>> google_result = response.xpath('//li[@class="g"]').extract()[0]
>>> body = '<html><body>' + google_result + '</body></html>'
>>> b = response.replace(body = body)
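
As for why response.encoding and response._body_declared_encoding() can disagree: Scrapy's documentation describes the response encoding as being resolved in order from the encoding argument passed to the constructor, then the charset in the Content-Type header, then the encoding declared in the body, and finally an encoding inferred from the body. If the header and the body declare different charsets, the header takes precedence unless an encoding is passed explicitly. Below is a simplified model of that order, not Scrapy's actual code; the function name resolve_encoding and the example charset values are assumptions, since the question does not show the real header or meta tag.

```python
def resolve_encoding(explicit=None, header=None, body_declared=None,
                     inferred="utf-8"):
    """Hypothetical helper modelling the documented resolution order:
    explicit argument, then HTTP header, then body declaration, then a
    fallback standing in for the encoding inferred from the body."""
    for candidate in (explicit, header, body_declared):
        if candidate:
            return candidate
    return inferred


# Assumed values for illustration only.
header_charset = "cp1252"  # e.g. taken from a Content-Type header
meta_charset = "utf-8"     # e.g. taken from a <meta> tag in the body

# Without an explicit encoding, the header wins over the body declaration.
default_choice = resolve_encoding(header=header_charset,
                                  body_declared=meta_charset)

# Passing an encoding explicitly overrides both.
explicit_choice = resolve_encoding(explicit=meta_charset,
                                   header=header_charset,
                                   body_declared=meta_charset)

print(default_choice)
print(explicit_choice)
```

Under this model, passing encoding = response._body_declared_encoding() to replace() corresponds to the explicit case, which would explain why that call works in the question.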

Upvotes: 3
