How to get rid of special characters while extracting data from the web?

I am extracting data from a website, and one entry contains a special character, i.e. Comfort Inn And Suites�? Blazing Stump. When I try to extract it, the following error is thrown:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
        yield it.next()
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
        for x in result:
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
        print repr(business.select('a[@class="name"]/text()').extract()[0])
      File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
        result = self.xpathev(xpath)
      File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
      File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
      File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
      File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
      File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
      File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
      File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
    exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte

I have tried a lot of different things after searching the web, such as `decode('utf-8')` and `unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0])`, but the problem persists.

The source URL is http://www.truelocal.com.au/find/hotels/97/ and the entry I am talking about is the fourth one on that page.

Upvotes: 1

Views: 1317

Answers (2)

Rick James

Reputation: 142346

Don't use "replace" to fix Mojibake, fix the database and the code that caused the Mojibake.

But first you need to determine whether it is simply Mojibake or "double encoding". With `SELECT col, HEX(col) ...`, determine whether a single character turned into 2-4 bytes (Mojibake) or 4-6 bytes (double encoding). Examples:

  • `é` (as utf8) should come back `C3A9`, but instead shows `C383C2A9`.
  • The Emoji `👽` should come back `F09F91BD`, but comes back `C3B0C5B8E28098C2BD`.
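
Not part of the database workflow above, but as a quick illustration, a minimal Python 2 sketch (matching the question's environment) that reproduces both byte expansions:

    # -*- coding: utf-8 -*-
    # Reproduce the hex expansions above: one wrong latin1/cp1252
    # round trip turns correct UTF-8 into the longer byte sequences.
    import binascii

    def hexbytes(raw):
        return binascii.hexlify(raw).upper()

    eacute = u'\u00e9'                                      # é
    print hexbytes(eacute.encode('utf-8'))                  # C3A9
    mojibake = eacute.encode('utf-8').decode('latin-1').encode('utf-8')
    print hexbytes(mojibake)                                # C383C2A9

    alien = u'\U0001F47D'                                   # 👽
    print hexbytes(alien.encode('utf-8'))                   # F09F91BD
    double = alien.encode('utf-8').decode('cp1252').encode('utf-8')
    print hexbytes(double)                                  # C3B0C5B8E28098C2BD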

Review "Mojibake" and "double encoding" here

Then the database fixes are discussed here:

  • CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:

First, let's assume you have this declaration for `tbl.col`:

    col VARCHAR(111) CHARACTER SET latin1 NOT NULL

Then convert the column without changing the bytes via this 2-step ALTER:

    ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
    ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) Be sure to keep the other specifications the same: VARCHAR, NOT NULL, etc.

  • CHARACTER SET utf8mb4 with double encoding: `UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);` (see the Python sketch after this list for the byte-level effect)

  • CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.
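
For intuition, here is a byte-level model of that `CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4)` round trip, sketched in Python 2 on a made-up double-encoded value; this models the conversion, it is not MySQL itself:

    # -*- coding: utf-8 -*-
    # A utf8mb4 column that should hold 'é' but holds the Mojibake 'Ã©'.
    stored = u'\u00e9'.encode('utf-8').decode('latin-1')  # u'\xc3\xa9'
    # CONVERT(col USING latin1) + BINARY(...): re-encode the text as
    # latin1 and treat the result as raw bytes -- now valid UTF-8.
    raw = stored.encode('latin-1')                        # '\xc3\xa9'
    # CONVERT(... USING utf8mb4): reinterpret those bytes as UTF-8.
    print repr(raw.decode('utf-8'))                       # u'\xe9', i.e. 'é'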

Upvotes: 0

Martijn Pieters

Reputation: 1123002

You have a bad Mojibake in the original webpage, probably due to bad handling of Unicode in the data entry somewhere. The actual bytes in the source are C3 3F C2 A0 when expressed in hexadecimal, and that sequence is not valid UTF-8, which is why the decode fails.

I think it was once a U+00A0 NO-BREAK SPACE. Encoded to UTF-8 that becomes C2 A0; interpret that as Latin-1 instead and encode to UTF-8 again and you get C3 82 C2 A0. But 82 is a control character when interpreted as Latin-1 again, so somewhere along the way it was substituted by a ? question mark, hex 3F when encoded.
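
A quick Python 2 sketch of that chain (the final `replace` stands in for whatever component substituted the control character):

    nbsp = u'\u00a0'                                 # NO-BREAK SPACE
    utf8 = nbsp.encode('utf-8')                      # '\xc2\xa0'
    twice = utf8.decode('latin-1').encode('utf-8')   # '\xc3\x82\xc2\xa0'
    # Something replaced the C1 control byte 82 with a question mark:
    print repr(twice.replace('\x82', '?'))           # '\xc3?\xc2\xa0'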

When you follow the link to the detail page for that venue, you get a different Mojibake for the same name: Comfort Inn And Suites Ã‚Â Blazing Stump, giving us the Unicode characters U+00C3, U+201A, U+00C2, then a &nbsp; HTML entity, i.e. the character U+00A0 again. Encode those as Windows Codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 again.
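
That round trip is easy to verify in Python 2:

    variant = u'\u00c3\u201a\u00c2\u00a0'   # the code points named above
    print repr(variant.encode('cp1252'))    # '\xc3\x82\xc2\xa0'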

You can only get rid of it by targeting this directly in the source of the page:

    pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')

This 'repairs' the data by replacing the train wreck with the originally intended UTF-8 bytes.

If you have a scrapy Response object, replace the body:

    body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
    response = response.replace(body=body)
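
For context, a minimal sketch of how this could slot into the question's spider; the container XPath is a guess, and only the two `replace` lines come from this answer:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class TestSpider(BaseSpider):
        # name, allowed_domains and start_urls omitted; illustrative only.

        def parse(self, response):
            # Repair the Mojibake before any parsing touches the bytes.
            body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
            response = response.replace(body=body)
            hxs = HtmlXPathSelector(response)
            # Hypothetical container XPath; adjust to the real page.
            for business in hxs.select('//div[@class="vcard"]'):
                print repr(business.select('a[@class="name"]/text()').extract()[0])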

Upvotes: 5
