Sagar Grover

Reputation: 65

UnicodeEncodeError: handling special characters

I am trying to scrape a web page. To take care of all characters other than ASCII, I have written this code:

    mydata = ''.join([i if ord(i) < 128 else ' ' for i in response.text])

and processed it further using the Beautiful Soup Python library. However, this does not handle some special characters that appear on the webpage, like [tick] and [star] (can't show a picture here). Any clue on how to escape these characters and replace them with a space? Right now I have this error:

    UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 62: character maps to <undefined>

Upvotes: 1

Views: 2020

Answers (3)

bobince

Reputation: 536319

fp = open("output.txt","w")

gives you a file open for writing text using the default encoding, which in your case is an encoding that doesn't have the character (probably cp1252), hence the error. Open the file with an encoding that supports it and you'll be fine:

    fp = open('output.txt', 'w', encoding='utf-8')
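
For instance, a quick check (assuming Python 3; 'output.txt' is just the name from your code):

    with open('output.txt', 'w', encoding='utf-8') as fp:
        fp.write('result: \u2713\n')  # succeeds: UTF-8 can encode U+2713
    # with the default cp1252 encoding, the same write raises UnicodeEncodeError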

Note also that:

print("result: "+ str(ele))

can fail if your console doesn't support Unicode, which under Windows it likely will not. Use print(ascii(...)) to get an ASCII-safe representation for debugging purposes.
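
For example:

    >>> print(ascii('\u2713 done'))
    '\u2713 done'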

The probable reason your attempt to get rid of non-ASCII characters fails is that you are removing them before parsing the HTML, rather than from the values you get after parsing. So a literal ✓ would be removed, but if a character reference like &#x2713; were used, it would be left alone, get parsed by bs4, and end up as ✓ in your output.
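
A quick way to see this (a sketch with made-up markup):

    from bs4 import BeautifulSoup

    html = '<p>done &#x2713;</p>'
    # Every byte of the reference &#x2713; is plain ASCII, so the
    # ord(c) < 128 filter leaves it untouched...
    stripped = ''.join(c if ord(c) < 128 else ' ' for c in html)
    # ...and the parser then turns it back into the real character.
    print(BeautifulSoup(stripped, 'html.parser').get_text())  # 'done \u2713'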

(I am sad that the default reaction to Unicode errors always seems to be to try to get rid of non-ASCII characters completely, instead of fixing the code to handle them correctly.)

You're also extracting text in a pretty weird way, using str() to get markup and then trying to pick out the tags and remove them. This is unreliable—HTML is not that straightforward to parse, which is why BeautifulSoup is a thing—and needless because you already have a perfectly good HTML parser that can give you the pure text in an element (get_text()).
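
For instance (made-up markup again):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<span class="st">Hello <b>world</b></span>', 'html.parser')
    span = soup.find('span', 'st')
    print(str(span))        # markup: <span class="st">Hello <b>world</b></span>
    print(span.get_text())  # just the text: Hello world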

Upvotes: 3

Daniel

Reputation: 42748

Most of your code is not necessary. requests is already doing the correct decoding for you, BeautifulSoup is doing the text extraction for you, and Python is doing the correct encoding for you when writing to a file:

    import requests
    from bs4 import BeautifulSoup

    #keyterm = input("Enter a keyword to search:")
    URL = 'https://www.google.com/search?q=jaguar&num=30'
    #NO_OF_LINKS_TO_BE_EXTRACTED = 10
    print("Requesting data from %s" % URL)
    response = requests.get(URL)
    # name the parser explicitly so bs4 doesn't warn and pick one for you
    soup = BeautifulSoup(response.text, 'html.parser')

    #print(soup.prettify())
    metaM = soup.find_all("span", "st")
    #metaM = soup.find("div", { "class" : "f slp" })
    with open("output.txt", "w", encoding='utf8') as fp:
        for ele in metaM:
            print("result: %r" % ele)
            fp.write(ele.get_text().replace('\n', ' ') + '\n')

Upvotes: 0

Mike Bessonov

Reputation: 686

It's always preferable to process everything in Unicode, and convert to any specific encoding only before storage or transfer. For example,

s = u"Hi, привет, ciao"

> s
u'Hi, \u043f\u0440\u0438\u0432\u0435\u0442, ciao'

> s.encode('ascii', 'ignore')
'Hi, , ciao'

> s.encode('ascii', 'replace')
'Hi, ??????, ciao'

If you need to replace non-ASCII characters specifically with spaces, you can write and register your own conversion error handler; see codecs.register_error().
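
A minimal sketch (assuming Python 3; the handler name 'replace_with_space' is made up):

    import codecs

    def replace_with_space(error):
        # return a space for each character the codec could not encode,
        # plus the position at which to resume encoding
        return (' ' * (error.end - error.start), error.end)

    codecs.register_error('replace_with_space', replace_with_space)

    print('Hi, привет, ciao'.encode('ascii', 'replace_with_space'))
    # each Cyrillic letter comes out as a single space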

Upvotes: 2
