LifeofaBit

Reputation: 1

Inserting Unicode strings in SSIS from a flat file

I have a project where I am using Python scripts to collect data from various APIs (Google Analytics, Facebook, Instagram, etc.). I write the collected data to a flat file, then use SSIS to extract the data from the file, do some ETL work, and insert it into our data warehouse.

The issue I am having is with Unicode values: it looks like they are not being encoded/decoded correctly, and a different character ends up in the database than the one that was collected. Here's the process involved:

I encode the data value and write to a file using the csv module:

import csv
import logging

with open('{0}{1}.txt'.format(file_path, file_name), 'ab+') as f:
    writer = csv.writer(f, delimiter='\t')
    try:
        writer.writerow([data['name'].encode('utf-8')])
    except Exception, ex:
        logging.exception(ex)

When I open the file in a text editor like Sublime Text, all of the Unicode characters are displayed correctly.
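Since text editors generally auto-detect encodings, a quick byte-level check is a more reliable way to confirm the file really is valid UTF-8 on disk. A minimal sketch, reusing the same file_path and file_name from the script above:

with open('{0}{1}.txt'.format(file_path, file_name), 'rb') as f:
    raw = f.read()

print repr(raw[:100])  # inspect the raw byte values
raw.decode('utf-8')    # raises UnicodeDecodeError if the file is not valid UTF-8
print 'File decodes cleanly as UTF-8'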

Once all the data has been written to the file, I start loading it with SSIS. In SSIS, I have a Flat File Source that pulls in the data. I've defined the data type for the 'name' column in the connection manager as DT_WSTR (length 4000). The code page for the flat file connection is 65001 (UTF-8).

The destination database I am writing to is a SQL Azure database with collation SQL_Latin1_General_CP1_CI_AS. The destination column is defined as nvarchar(max). If I write to a regular SQL Server database with the same collation, the result is the same.
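One way to isolate whether the mangling happens in the SSIS data flow or at the database layer would be a direct parameterized insert from Python, for example with pyodbc. The connection details and scratch table below are hypothetical placeholders:

# -*- coding: utf-8 -*-
import pyodbc

# Hypothetical connection string and scratch table -- substitute your own.
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=yourserver.database.windows.net;DATABASE=yourdb;'
                      'UID=youruser;PWD=yourpassword')
cursor = conn.cursor()
cursor.execute("CREATE TABLE dbo.unicode_test (name nvarchar(max))")
cursor.execute("INSERT INTO dbo.unicode_test (name) VALUES (?)",
               u'Катерина Александровна')
conn.commit()
row = cursor.execute("SELECT name FROM dbo.unicode_test").fetchone()
print row[0] == u'Катерина Александровна'  # True means the value round-tripped intact
conn.close()

If the value survives a direct insert, the problem is more likely in the flat file or the SSIS data flow than in the database or its collation.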

What am I doing wrong here? There are lots of emoji-type characters that I collect and don't care much about; what is important is accented and other non-English characters. If I need to provide any more detail, please let me know.

Upvotes: 0

Views: 244

Answers (1)

Peter Pan

Reputation: 24138

Based on your description, I reviewed the document "International Considerations for Integration Services" (https://msdn.microsoft.com/en-us/library/ms141186(v=sql.90).aspx) and put together a sample of Python encoding/decoding.

I think data['name'] does not need to be encoded as UTF-8, because when Python reads data from a file opened in binary mode, the bytes are treated as ISO-8859-1 (Latin-1) by default.

For reference, here is my experiment code for encoding/decoding data from a txt file with UTF-8 and ISO-8859-1, using the codecs package and the encode('<charset>') method.

Data in a txt file called input.txt

Катерина Александровна

import codecs

iso8859_1c = codecs.lookup('iso-8859-1')
utf8c = codecs.lookup('utf-8')

fpw = open('output.txt', 'wb')

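# Read one line of raw bytes from input.txt (UTF-8 on disk), print them, and echo them to output.txt unchanged.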
fpr = open('input.txt', 'rb')
origin = fpr.readline()
print "Origin>", origin
print (bytearray(origin), )
print
fpw.write('Origin>')
fpw.write(origin)
fpw.write('\n')

fpr.close()

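# Decode the raw bytes as ISO-8859-1: every byte maps to a Latin-1 character, so the result is mojibake.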
dec_iso8859_1, n = iso8859_1c.decode(origin)
print 'ISO-8859-1 decoding>', dec_iso8859_1, n
print (dec_iso8859_1, n)
print 'UTF-8 encoding ISO-8859-1>', dec_iso8859_1.encode('utf-8')
print 'ISO-8859-1 encoding ISO-8859-1>', dec_iso8859_1.encode('iso-8859-1')
print
fpw.write('UTF-8 encoding ISO-8859-1: dec_iso8859_1.encode("utf-8")>')
fpw.write(dec_iso8859_1.encode('utf-8'))
fpw.write('\n')

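# Decode the same bytes as UTF-8: this recovers the original Cyrillic text.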
dec_utf8, n = utf8c.decode(origin)
print 'UTF-8 decoding>', dec_utf8, n
print (dec_utf8, n)
print 'UTF-8 encoding UTF-8>', dec_utf8.encode('utf-8')
# Error for code below: 'latin-1' codec can't encode characters in position 0-7: ordinal not in range(256)
# print 'ISO-8859-1 encoding UTF-8>', dec_utf8.encode('iso-8859-1')
print
fpw.write('UTF-8 encoding UTF-8: dec_utf8.encode("utf-8")>')
fpw.write(dec_utf8.encode('utf-8'))
fpw.write('\n')

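# utf8c.encode(dec_iso8859_1) is equivalent to dec_iso8859_1.encode('utf-8'): the mojibake survives as double-encoded bytes.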
enc_utf8_iso8859_1, n = utf8c.encode(dec_iso8859_1)
print 'UTF-8 encoding ISO-8859-1>', enc_utf8_iso8859_1, n
print (enc_utf8_iso8859_1, n)
print
fpw.write('UTF-8 encoding ISO-8859-1>')
fpw.write(enc_utf8_iso8859_1)
fpw.write('\n')

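# iso8859_1c.encode(dec_iso8859_1) reverses the Latin-1 decode and restores the original UTF-8 byte sequence.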
enc_iso8859_1_iso8859_1, n = iso8859_1c.encode(dec_iso8859_1)
print 'ISO-8859-1 encoding ISO-8859-1>', enc_iso8859_1_iso8859_1, n
print (enc_iso8859_1_iso8859_1, n)
fpw.write('ISO-8859-1 encoding ISO-8859-1>')
fpw.write(enc_iso8859_1_iso8859_1)
fpw.write('\n')

fpw.flush()
fpw.close()

The output in the console:

Origin> Катерина Александровна
(bytearray(b'\xd0\x9a\xd0\xb0\xd1\x82\xd0\xb5\xd1\x80\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\x90\xd0\xbb\xd0\xb5\xd0\xba\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb4\xd1\x80\xd0\xbe\xd0\xb2\xd0\xbd\xd0\xb0'),)

ISO-8859-1 decoding> ÐаÑеÑина ÐлекÑандÑовна 43
(u'\xd0\x9a\xd0\xb0\xd1\x82\xd0\xb5\xd1\x80\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\x90\xd0\xbb\xd0\xb5\xd0\xba\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb4\xd1\x80\xd0\xbe\xd0\xb2\xd0\xbd\xd0\xb0', 43)
UTF-8 encoding ISO-8859-1> ÐаÑеÑина ÐлекÑандÑовна
ISO-8859-1 encoding ISO-8859-1> Катерина Александровна

UTF-8 decoding> Катерина Александровна 43
(u'\u041a\u0430\u0442\u0435\u0440\u0438\u043d\u0430 \u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u043d\u0430', 43)
UTF-8 encoding UTF-8> Катерина Александровна

UTF-8 encoding ISO-8859-1> ÐаÑеÑина ÐлекÑандÑовна 43
('\xc3\x90\xc2\x9a\xc3\x90\xc2\xb0\xc3\x91\xc2\x82\xc3\x90\xc2\xb5\xc3\x91\xc2\x80\xc3\x90\xc2\xb8\xc3\x90\xc2\xbd\xc3\x90\xc2\xb0 \xc3\x90\xc2\x90\xc3\x90\xc2\xbb\xc3\x90\xc2\xb5\xc3\x90\xc2\xba\xc3\x91\xc2\x81\xc3\x90\xc2\xb0\xc3\x90\xc2\xbd\xc3\x90\xc2\xb4\xc3\x91\xc2\x80\xc3\x90\xc2\xbe\xc3\x90\xc2\xb2\xc3\x90\xc2\xbd\xc3\x90\xc2\xb0', 43)

ISO-8859-1 encoding ISO-8859-1> Катерина Александровна 43
('\xd0\x9a\xd0\xb0\xd1\x82\xd0\xb5\xd1\x80\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\x90\xd0\xbb\xd0\xb5\xd0\xba\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb4\xd1\x80\xd0\xbe\xd0\xb2\xd0\xbd\xd0\xb0', 43)

And the output in the file called output.txt:

Origin>Катерина Александровна
UTF-8 encoding ISO-8859-1: dec_iso8859_1.encode("utf-8")>ÐаÑеÑина ÐлекÑандÑовна
UTF-8 encoding UTF-8: dec_utf8.encode("utf-8")>Катерина Александровна
UTF-8 encoding ISO-8859-1>ÐаÑеÑина ÐлекÑандÑовна
ISO-8859-1 encoding ISO-8859-1>Катерина Александровна

Upvotes: 0
