Reputation: 1171
After some issues getting the Chrome Compact Language Detection (CLD) library installed on Windows, I installed CLD from this easy_install package.
I can now use CLD, but I am getting some encoding issues.
I am pulling tweets into a Python script and, after stripping out the hashtags and links, passing them to CLD to detect the language.
The following is a simplified version of my code:
import cld

s = "I am a tweet from Twitter"
# CLD expects clean UTF-8 bytes, so encode before detecting
clean_s = s.encode('utf-8')
lan = cld.detect(clean_s, pickSummaryLanguage=True, removeWeakMatches=True)
Four times out of five, this works as expected and returns a response naming the detected language.
However, this error keeps popping up:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 15: character maps to <undefined>
I did read that:
"You must provide CLD clean (interchange-valid) UTF-8, so any encoding issues must be sorted out before-hand."
However, I thought I had this covered with my call to encode to UTF-8?
I assume that I need to pass CLD a string that preserves the characters of non-Latin scripts such as Arabic and Asian languages.
This is my first Python project, so this is likely a rookie mistake. Can anyone point out my mistake and how to rectify it?
Let me know in the comments if I need to gather more info, and I will edit my question to provide it.
EDIT: If it helps, here is my rookie code (cut down to replicate the issue). I am running Python 2.7 32-bit.
After running this code for a while, I get this error. Let me know if I have not implemented the error reporting correctly.
Raw: Traceback (most recent call last):
  File "LanguageTesting.py", line 71, in <module>
    parse_tweet(tweet)
  File "LanguageTesting.py", line 43, in parse_tweet
    print "Raw:", raw
  File "C:\Python27\ArcGIS10.1\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-32: character maps to <undefined>
Upvotes: 0
Views: 321
Reputation: 1186
It looks like you are failing on the print statement, right? This means Python cannot encode the unicode string into what it thinks the console's stdout encoding is (check it with "print sys.stdout.encoding"; your traceback shows it is using cp850).
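To see why the encode-to-UTF-8 step is probably not the culprit, here is a minimal sketch that reproduces the failure without a console. The tweet text is made up for illustration, and cp850 is assumed only because it is the codec named in the traceback:

```python
import sys

# What Python believes the console expects; may be None when output is piped.
console_encoding = getattr(sys.stdout, "encoding", None)

# Hypothetical tweet containing a curly apostrophe (U+2019).
tweet = u"I don\u2019t speak French"

# UTF-8 can represent every Unicode character, so the step before CLD succeeds.
utf8_bytes = tweet.encode("utf-8")

# cp850 has no mapping for U+2019, so encoding for that console fails,
# which is what print does implicitly with a unicode string.
try:
    tweet.encode("cp850")
    cp850_ok = True
    failing_position = None
except UnicodeEncodeError as exc:
    cp850_ok = False
    failing_position = exc.start  # index of the first unencodable character
```

So the bytes handed to CLD are fine; it is only the implicit conversion to the console's code page that raises.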
If Python is wrong about what your terminal expects, you can set the PYTHONIOENCODING environment variable ("export PYTHONIOENCODING=UTF-8", or "set PYTHONIOENCODING=UTF-8" in a Windows command prompt) and it will encode your printed strings as UTF-8. Alternatively, before printing, you can encode to whatever charset your terminal expects (you will likely have to ignore or replace errors to avoid exceptions like the one you hit).
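The second option might look roughly like this. It is only a sketch: the helper name to_console_bytes is made up, and the fallback to ASCII when stdout reports no encoding is my own assumption:

```python
import sys

def to_console_bytes(text, encoding=None, errors="replace"):
    """Encode unicode `text` for the console, substituting unencodable characters."""
    # Fall back to ASCII when stdout has no declared encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors)

raw = u"It\u2019s a tweet"  # hypothetical tweet text

# cp850 is passed explicitly here only to mirror the traceback;
# normally you would omit it and let the helper ask sys.stdout.
safe = to_console_bytes(raw, "cp850")                   # curly apostrophe becomes "?"
dropped = to_console_bytes(raw, "cp850", errors="ignore")  # curly apostrophe is removed
```

In Python 2.7 you would then write print "Raw:", safe instead of printing the unicode object directly. Only do this for display; keep passing the UTF-8 encoded bytes to CLD so no characters are lost before detection.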
Upvotes: 1