Reputation: 2734
I'm fairly new to Python, so I'm hoping this is something simple that I'm just missing.
I'm running Python 2.7 on Windows 7.
I'm trying to run a basic Twitter scraping program through the command line. However, I keep getting the following error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 79: character maps to (undefined)
I understand basically what's happening here, that it's trying to print to the console in cp437 and it's getting confused by the unicode characters in the tweets that it's grabbing.
All I'm trying to do is either get it to replace those characters with "?" or just get it to drop those characters altogether. I have read a bunch of posts about this and I can't figure out how to do it.
I opened the cp437.py file that's referenced in the error and I changed all the errors='strict' to errors='ignore' but that didn't solve the problem.
I then tried to go into the C:\Python27\Lib\codecs.py file and change all the errors='strict' to errors='ignore' but that didn't solve the problem either.
Any ideas? Like I said, hopefully I'm just missing something basic but I've read a bunch of posts on this and I can't seem to puzzle it out.
Thanks a lot. Seth
Upvotes: 1
Views: 2207
Reputation: 5565
I would not suggest changing the built-in libraries. They are designed to let you handle encoding errors without being modified, and once you have changed them it is no longer clear that a solution that works for everyone else will work for you.
You may just want to pass errors='ignore' into whatever encoding function you are using to skip the offending character, or errors='replace' to substitute a placeholder that signals there was a problem ("?" when encoding; the replacement character \ufffd when decoding). [ errors='strict' is the default if you don't pass any value. ]
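As a rough illustration of the two error handlers, here is a minimal sketch. The sample string is made up for the example; it contains the same u'\u2019' character as the traceback, and cp437 is used because that is the codec named in the error.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; u'\u2019' is the character from the traceback.
tweet = u"It\u2019s working"

# 'replace' swaps each character cp437 cannot represent for "?".
replaced = tweet.encode('cp437', 'replace')

# 'ignore' silently drops each character cp437 cannot represent.
ignored = tweet.encode('cp437', 'ignore')

print(replaced)
print(ignored)
```

Either call returns a byte string that contains only characters the console code page can show, so printing it should no longer raise UnicodeEncodeError.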
However, if you are printing to the command line, you probably don't want to send it the full Unicode text anyway, but ASCII instead, since Unicode includes characters that the command line can't print. (I suspect that is what is causing the errors, rather than there being non-standard Unicode characters in the response you are getting from Twitter.)
Try e.g.
print original_data.encode('ascii', 'ignore')
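If you would rather keep the characters the console can show (accented letters, for instance) instead of reducing everything to ASCII, one possible variation on the line above is to encode against the console's own encoding. This is only a sketch: console_safe is a hypothetical helper name, and it assumes sys.stdout.encoding reports the console code page, falling back to ASCII when it does not.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding can't represent replaced by '?'."""
    if encoding is None:
        # Assumption: stdout reports the console code page (e.g. cp437);
        # it can be None when output is piped, so fall back to ASCII.
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Round-trip through the encoding so only printable characters remain.
    return text.encode(encoding, 'replace').decode(encoding)

print(console_safe(u"It\u2019s working"))
```

Passing the encoding explicitly, e.g. console_safe(original_data, 'cp437'), lets you check the result without depending on the terminal you happen to be running in.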
Upvotes: 3