Reputation: 1
I have a project where I am using Python scripts to collect data from various APIs (Google Analytics, Facebook, Instagram, etc.). I write the collected data to a flat file, and then use SSIS to extract the data from the file, do some ETL work, and insert it into our data warehouse.
The issue I am having is with Unicode values: it looks like they are not being encoded/decoded correctly, and a different character ends up in the database than the one that was actually collected. Here's the process involved:
I encode the data value and write to a file using the csv module:
import csv
import logging

with open('{0}{1}.txt'.format(file_path, file_name), 'ab+') as f:
    writer = csv.writer(f, delimiter='\t')
    try:
        writer.writerow(data['name'].encode('utf-8'))
    except Exception as ex:
        logging.exception(ex)
When I open the file in a text editor like Sublime Text, all of the Unicode characters display correctly.
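For reference, here is a self-contained sketch of the write step with a hard-coded example record standing in for the real API data (the record, file name, and field names below are just illustrations); csv.writer.writerow() takes a sequence of fields, so each encoded value goes in as its own element:

# -*- coding: utf-8 -*-
# (The coding line is only needed because this sketch hard-codes a non-ASCII literal.)
import csv

# Illustrative record standing in for a real API response (hypothetical values).
data = {'name': u'Катерина Александровна', 'followers': '42'}

with open('sample_output.txt', 'ab+') as f:
    writer = csv.writer(f, delimiter='\t')
    # writerow() expects a sequence of fields; each unicode value is encoded to
    # UTF-8 bytes so the byte-oriented Python 2 csv module writes it unchanged.
    writer.writerow([data['name'].encode('utf-8'), data['followers']])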
Once all the data is done writing to the file, I start collecting it with SSIS. In SSIS, I have a Flat File Source task that pulls in the data. I've defined the data type for the 'name' column in the connection manager as DT_WSTR (length 4000), and the code page for the flat file connection is 65001 (UTF-8).
The destination database I am writing to is a SQL Azure database with collation SQL_Latin1_General_CP1_CI_AS, and the destination column is defined as nvarchar(max). If I try to write to a regular SQL Server database with the same collation, the result is the same.
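One sanity check I can run on the Python side is to confirm that the flat file really contains valid UTF-8 before SSIS reads it; a rough sketch (the file name below is just a placeholder):

with open('exported_data.txt', 'rb') as f:
    for line_no, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as ex:
            # Any hit here means the bytes were not written as UTF-8,
            # so the problem is upstream of SSIS.
            print 'Line %d is not valid UTF-8: %r' % (line_no, ex)

If every line decodes cleanly, the file matches code page 65001 and the issue is more likely in the SSIS or database settings.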
What am I doing wrong here? There are lots of emoji-type characters that I collect and don't care much about; what is important is accented and other non-English characters. If I need to provide any more detail, please let me know.
Upvotes: 0
Views: 244
Reputation: 24138
Based on your description, I reviewed the document "International Considerations for Integration Services" (https://msdn.microsoft.com/en-us/library/ms141186(v=sql.90).aspx) and put together some sample code for Python encoding/decoding.
I think the value data['name'] does not need to be encoded with UTF-8, because data read from a file in binary mode is treated by default as the charset ISO-8859-1 (Latin-1) in Python.
For reference, here is my experiment code for encoding/decoding data from a txt file with UTF-8 and ISO-8859-1, using the codecs package and the encode('<charset>') method.
Data in a txt file called input.txt:
Катерина Александровна
import codecs
iso8859_1c = codecs.lookup('iso-8859-1')
utf8c = codecs.lookup('utf-8')
fpw = open('output.txt', 'wb')
fpr = open('input.txt', 'rb')
origin = fpr.readline()
print "Origin>", origin
print (bytearray(origin), )
print
fpw.write('Origin>')
fpw.write(origin)
fpw.write('\n')
fpr.close()
dec_iso8859_1, n = iso8859_1c.decode(origin)
print 'ISO-8859-1 decoding>', dec_iso8859_1, n
print (dec_iso8859_1, n)
print 'UTF-8 encoding ISO-8859-1>', dec_iso8859_1.encode('utf-8')
print 'ISO-8859-1 encoding ISO-8859-1>', dec_iso8859_1.encode('iso-8859-1')
print
fpw.write('UTF-8 encoding ISO-8859-1: dec_iso8859_1.encode("utf-8")>')
fpw.write(dec_iso8859_1.encode('utf-8'))
fpw.write('\n')
dec_utf8, n = utf8c.decode(origin)
print 'UTF-8 decoding>', dec_utf8, n
print (dec_utf8, n)
print 'UTF-8 encoding UTF-8>', dec_utf8.encode('utf-8')
# Error for code below: 'latin-1' codec can't encode characters in position 0-7: ordinal not in range(256)
# print 'ISO-8859-1 encoding UTF-8>', dec_utf8.encode('iso-8859-1')
print
fpw.write('UTF-8 encoding UTF-8: dec_utf8.encode("utf-8")>')
fpw.write(dec_utf8.encode('utf-8'))
fpw.write('\n')
enc_utf8_iso8859_1, n = utf8c.encode(dec_iso8859_1)
print 'UTF-8 encoding ISO-8859-1>', enc_utf8_iso8859_1, n
print (enc_utf8_iso8859_1, n)
print
fpw.write('UTF-8 encoding ISO-8859-1>')
fpw.write(enc_utf8_iso8859_1)
fpw.write('\n')
enc_iso8859_1_iso8859_1, n = iso8859_1c.encode(dec_iso8859_1)
print 'ISO-8859-1 encoding ISO-8859-1>', enc_iso8859_1_iso8859_1, n
print (enc_iso8859_1_iso8859_1, n)
fpw.write('ISO-8859-1 encoding ISO-8859-1>')
fpw.write(enc_iso8859_1_iso8859_1)
fpw.write('\n')
fpw.flush()
fpw.close()
The output in the console:
Origin> Катерина Александровна
(bytearray(b'\xd0\x9a\xd0\xb0\xd1\x82\xd0\xb5\xd1\x80\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\x90\xd0\xbb\xd0\xb5\xd0\xba\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb4\xd1\x80\xd0\xbe\xd0\xb2\xd0\xbd\xd0\xb0'),)
ISO-8859-1 decoding> ÐаÑеÑина ÐлекÑандÑовна 43
(u'\xd0\x9a\xd0\xb0\xd1\x82\xd0\xb5\xd1\x80\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\x90\xd0\xbb\xd0\xb5\xd0\xba\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb4\xd1\x80\xd0\xbe\xd0\xb2\xd0\xbd\xd0\xb0', 43)
UTF-8 encoding ISO-8859-1> ÐаÑеÑина ÐлекÑандÑовна
ISO-8859-1 encoding ISO-8859-1> Катерина Александровна
UTF-8 decoding> Катерина Александровна 43
(u'\u041a\u0430\u0442\u0435\u0440\u0438\u043d\u0430 \u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u043d\u0430', 43)
UTF-8 encoding UTF-8> Катерина Александровна
UTF-8 encoding ISO-8859-1> ÐаÑеÑина ÐлекÑандÑовна 43
('\xc3\x90\xc2\x9a\xc3\x90\xc2\xb0\xc3\x91\xc2\x82\xc3\x90\xc2\xb5\xc3\x91\xc2\x80\xc3\x90\xc2\xb8\xc3\x90\xc2\xbd\xc3\x90\xc2\xb0 \xc3\x90\xc2\x90\xc3\x90\xc2\xbb\xc3\x90\xc2\xb5\xc3\x90\xc2\xba\xc3\x91\xc2\x81\xc3\x90\xc2\xb0\xc3\x90\xc2\xbd\xc3\x90\xc2\xb4\xc3\x91\xc2\x80\xc3\x90\xc2\xbe\xc3\x90\xc2\xb2\xc3\x90\xc2\xbd\xc3\x90\xc2\xb0', 43)
ISO-8859-1 encoding ISO-8859-1> Катерина Александровна 43
('\xd0\x9a\xd0\xb0\xd1\x82\xd0\xb5\xd1\x80\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\x90\xd0\xbb\xd0\xb5\xd0\xba\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb4\xd1\x80\xd0\xbe\xd0\xb2\xd0\xbd\xd0\xb0', 43)
And the output in the file called output.txt:
Origin>Катерина Александровна
UTF-8 encoding ISO-8859-1: dec_iso8859_1.encode("utf-8")>ÐаÑеÑина ÐлекÑандÑовна
UTF-8 encoding UTF-8: dec_utf8.encode("utf-8")>Катерина Александровна
UTF-8 encoding ISO-8859-1>ÐаÑеÑина ÐлекÑандÑовна
ISO-8859-1 encoding ISO-8859-1>Катерина Александровна
Upvotes: 0