sunwarr10r

Reputation: 4797

How to handle encoding in Python 2.7 and SQLAlchemy 🏴‍☠️

I wrote code in Python 3.5 that uses Tweepy and SQLAlchemy, with the following lines, to load tweets into a database, and it worked well:

twitter = Twitter(str(tweet.user.name).encode('utf8'), str(tweet.text).encode('utf8'))
session.add(twitter)
session.commit()

Running the same code in Python 2.7 raises an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 139: ordinal not in range(128)

What's the solution? My MySQL configuration is as follows:

Server side --> utf8mb4 encoding

Client side --> create_engine('mysql+pymysql://abc:def@abc/def', encoding='utf8', convert_unicode=True)

UPDATE

It seems that there is no solution, at least not with Python 2.7 + SQLAlchemy. Here is what I have found so far; please correct me if I am wrong.

Tweepy, at least in Python 2.7, returns unicode type objects.

In Python 2.7: tweet = u'☠' is of type unicode (<type 'unicode'>)

In Python 3.5: tweet = u'☠' is of type str (<class 'str'>)

This means Python 2.7 raises a UnicodeEncodeError if I call str(tweet), because it then tries to encode the character '☠' as ASCII, which is impossible: ASCII covers only 128 basic characters.
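The implicit ASCII step can be reproduced explicitly. A minimal sketch, assuming that str() on a unicode object in Python 2 behaves like .encode('ascii') under the default encoding; the explicit call below fails the same way in Python 2.7 and 3.x:

```python
# -*- coding: utf-8 -*-
# Reproduce the failure explicitly: encoding a non-ASCII character as ASCII.
tweet = u'\u2620'  # the skull and crossbones character

try:
    tweet.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding the same character as UTF-8 succeeds and yields three bytes.
utf8_bytes = tweet.encode('utf-8')
```

So the character itself is not the problem; only the choice of ASCII as the target encoding is.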

Conclusion:

Passing tweet.user.name unchanged in the SQLAlchemy line gives me the following error:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)

Using either tweet.user.name.encode('utf-8') or str(tweet.user.name.encode('utf-8')) in the SQLAlchemy line should work, but the database then shows garbled characters:

ð´ââ ï¸Jack Sparrow
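Output like this is what UTF-8 bytes look like when something reads them back as Latin-1, one character per byte. A sketch that reproduces the effect; that Latin-1 is the charset involved here is an assumption, based on the 'latin-1' error above:

```python
# -*- coding: utf-8 -*-
name = u'\U0001f3f4\u200d\u2620\ufe0f'  # the pirate flag sequence

utf8_bytes = name.encode('utf-8')

# Misreading the UTF-8 bytes as Latin-1 yields one character per byte.
garbled = utf8_bytes.decode('latin-1')

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode('latin-1').decode('utf-8')
```

If the round trip restores the original, the stored bytes are probably intact and only the charset used to interpret them is wrong.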

This is what I want it to show:

Printed: 🏴‍☠️ Jack Sparrow

Special characters unicode: u'\U0001f3f4\u200d\u2620\ufe0f'

Special characters UTF-8 encoding: '\xf0\x9f\x8f\xb4\xe2\x80\x8d\xe2\x98\xa0\xef\xb8\x8f'
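The two representations above can be checked against each other. A short sketch, meant to run on Python 3 (on a narrow Python 2 build, iterating over the string would yield surrogate halves instead of whole code points):

```python
# -*- coding: utf-8 -*-
flag = u'\U0001f3f4\u200d\u2620\ufe0f'

utf8_bytes = flag.encode('utf-8')

# Number of UTF-8 bytes needed for each code point in the sequence.
widths = [len(ch.encode('utf-8')) for ch in flag]
```

The first code point needs four bytes, which is why it does not fit in MySQL's older three-byte utf8 charset and requires utf8mb4.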

Upvotes: 0

Views: 1199

Answers (1)

Rick James

Reputation: 142528

Do not use any encode/decode functions; they only compound the problems.

Do set the connection to be UTF-8.
Do set the column/table to utf8mb4 instead of utf8.
Do use # -*- coding: utf-8 -*- at the beginning of Python code.
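For the connection part, one common way to request UTF-8 is a charset parameter in the SQLAlchemy URL, which is passed on to the PyMySQL driver. A sketch only: build_mysql_url is a hypothetical helper, and the credentials are the placeholders from the question.

```python
# -*- coding: utf-8 -*-
def build_mysql_url(user, password, host, database, charset='utf8mb4'):
    """Return a SQLAlchemy URL for PyMySQL that requests the given charset.

    Hypothetical helper for illustration; it does not escape special
    characters in the user name or password.
    """
    return 'mysql+pymysql://{0}:{1}@{2}/{3}?charset={4}'.format(
        user, password, host, database, charset)


url = build_mysql_url('abc', 'def', 'abc', 'def')

# With SQLAlchemy and PyMySQL installed, the URL would be used like this:
#   from sqlalchemy import create_engine
#   engine = create_engine(url)
```

With the charset requested this way, the expectation is that unicode objects such as tweet.user.name can be passed to the model as they are, without any encode() calls.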

See "More Python tips". Note that it has a link to "Python 2.7 issues; improvements in Python 3".

Upvotes: 0
