Goran Jovic

Reputation: 9508

How to write UTF-8 characters using bulk insert in SQL Server?

I am doing a BULK INSERT into SQL Server and it is not inserting UTF-8 characters into the database properly. The data file contains these characters, but the database rows contain garbage characters after the bulk insert executes.

My first suspect was the last line of the format file:

10.0
3
1 SQLCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"

But after reading this official page, it seems to me that this is actually a bug in how the bulk insert operation reads the data file in SQL Server 2008. We are using version 2008 R2.

What is the solution to this problem or at least a workaround?

Upvotes: 27

Views: 145769

Answers (13)

Janis Rudovskis

Reputation: 193

  1. In Excel, save the file as CSV (comma delimited)
  2. Open the saved CSV file in Notepad++
  3. Encoding -> Convert to UCS-2 Big Endian
  4. Save
  5. Run the bulk insert:
BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',  --CSV field delimiter
    ROWTERMINATOR = '\n',   --row terminator (moves to the next row)
    TABLOCK
);
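
If the UCS-2 file from step 3 is still read as garbage, BULK INSERT may also need to be told to treat the file as wide characters. A minimal sketch of that variant, reusing the same #tmpData table and file path (the DATAFILETYPE option is an addition, not part of the original answer):

BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
    DATAFILETYPE = 'widechar',  -- read the file as UTF-16/UCS-2
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n',
    TABLOCK
);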

Done.

Upvotes: 11

Martin Lietz

Reputation: 113

Just to share: I had a similar problem. I had Portuguese accents in a file and bcp imported garbage characters (e.g. À became ┴). I tried -C with almost all code pages without success. After hours I found a hint on the bcp MS help page.

Format file code pages take priority over the -C attribute.

This means that in the format file I had to use "" as the collation (as in the LastName row below). Once I changed that, the -C 65001 option imported the UTF-8 file without any problem:

13.0
4
1       SQLCHAR             0       7       ","      1     PersonID               ""
2       SQLCHAR             0       25      ","      2     FirstName              SQL_Latin1_General_CP1_CI_AS
3       SQLCHAR             0       30      ","      3     LastName               ""
4       SQLCHAR             0       11      "\r\n"   4     BirthDate              ""
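
The matching bcp invocation would then look roughly like this (a sketch only; the database, table, file and server names below are placeholders, not from the original post):

bcp MyDatabase.dbo.Person in persons.csv -f persons.fmt -C 65001 -S myserver -T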

Upvotes: 1

Tom-K

Reputation: 731

I came here looking for a solution for bulk inserting special characters, and I didn't like the workaround with UTF-16 (which would double the size of the CSV file). I found out that you definitely CAN do it, and it's very easy: you don't need a format file. This answer is for other people who are looking for the same, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-English-speaking people.

The solution is: just add CODEPAGE='65001' inside the WITH statement of the bulk insert (65001 is the code page number for UTF-8). It might not work for all Unicode characters, as suggested by Michael-O, but it at least works perfectly for Latin Extended, Greek and Cyrillic, and probably many others too.

Note: the MSDN documentation says UTF-8 is not supported; don't believe it. For me this works perfectly in SQL Server 2008, though I didn't try other versions.

e.g.:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = '65001',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );

If all your special characters are in the range 160-255 (ISO-8859-1 or Windows-1252), you could also use:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = 'ACP',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );

Upvotes: 73

markthegrea

Reputation: 3851

Thought I would add my thoughts to this. We were trying to load data into SQL Server using bcp and had lots of trouble.

bcp does not, in most versions, support any type of UTF-8 files. We discovered that UTF-16 would work, but it is more complex than is shown in these posts.

Using Java we wrote the file using this code:

PrintStream fileStream = new PrintStream(NEW_TABLE_DATA_FOLDER + fileName, "x-UTF-16LE-BOM"); // writes UTF-16 little-endian with a BOM

This gave us the correct data to insert: UTF-16 little-endian, with a BOM.

We tried using plain UTF-16 and kept getting EOF errors. This was because we were missing the BOM at the start of the file. From Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If these bytes are not present, the file won't work. So we have the file, but there is one more detail that needs to be addressed: when constructing your command line, you must include -w to tell bcp what type of data it is. (When using just English data, you can use -c for character data.) So the command will look something like this:

bcp dbo.blah in C:\Users\blah\Desktop\events\blah.txt -S tcp:databaseurl,someport -d thedatabase -U username -P password -w

When this is all done you get some sweet looking data!

Good little endian!

Upvotes: 1

Charles Burns

Reputation: 10602

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT (as in the original question), and OPENROWSET.
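
For example, an OPENROWSET query on SQL Server 2016 or later could look roughly like this (a sketch; the data file and format file paths are hypothetical):

SELECT *
FROM OPENROWSET(
    BULK 'C:\data\myUtf8File.csv',
    FORMATFILE = 'C:\data\myUtf8File.fmt',
    CODEPAGE = '65001'   -- UTF-8
) AS r;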

Upvotes: 2

Jon Morisi

Reputation: 101

Microsoft just added UTF-8 support to SQL Server 2014 SP2:

https://support.microsoft.com/en-us/kb/3136780

Upvotes: 7

I have tested the bulk insertion with UTF-8 format. It works fine in SQL Server 2012.

string bulkInsertQuery = @"DECLARE @BulkInsertQuery NVARCHAR(max) = 'bulk insert [dbo].[temp_Lz_Post_Obj_Lvl_0]
                                      FROM ''C:\Users\suryan\Desktop\SIFT JOB\New folder\POSTdata_OBJ5.dat''
                                      WITH ( FIELDTERMINATOR = ''' + CHAR(28) + ''', ROWTERMINATOR = ''' + CHAR(10) + ''')'
                                      EXEC SP_EXECUTESQL @BulkInsertQuery";

I was using a *.DAT file with FS (the file separator character, CHAR(28)) as the column separator.

Upvotes: 0

Heber

Reputation: 21

Use these options: DATAFILETYPE = 'char' and CODEPAGE = '1252'.
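
That is, roughly (a sketch; the table name, file path and field terminator below are placeholders):

BULK INSERT dbo.MyTable
FROM 'C:\data\myLatin1File.csv'
WITH
(
    DATAFILETYPE = 'char',   -- plain (non-Unicode) character data
    CODEPAGE = '1252',       -- Windows-1252 / Latin-1
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n'
);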

Upvotes: 2

beloblotskiy

Reputation: 1008

My exported data is in TSV format, from a DB which has Latin-1 encoding.

This is easy to check: SELECT DATABASEPROPERTYEX('DB', 'Collation') AS SQLCollation;

The extract file is in UTF-8 format.

BULK INSERT doesn't work with UTF-8, so I convert UTF-8 to ISO-8859-1 (aka Latin-1) with a simple Clojure script:

(spit ".\\dump\\file1.txt" (slurp ".\\dump\\file1_utf8.txt" :encoding "UTF-8") :encoding "ISO-8859-1")

To execute: correct the paths and run java.exe -cp clojure-1.6.0.jar clojure.main utf8_to_Latin1.clj

Upvotes: 0

JYatesDBA

Reputation: 35

I managed to do this using SSIS and an ADO NET destination instead of OLE DB.

Upvotes: 0

Michael-O

Reputation: 18405

You can't. You should first use an N-type data field (nchar/nvarchar), convert your file to UTF-16 and then import it. The database does not support UTF-8.
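
A rough sketch of that approach (the table, column size and file path below are illustrative only; the data file is assumed to have been re-saved as UTF-16):

CREATE TABLE dbo.MyImport
(
    CONTENTTEXT NVARCHAR(MAX)            -- N type column, holds Unicode
);

BULK INSERT dbo.MyImport
FROM 'C:\data\content_utf16.txt'         -- the data file re-encoded as UTF-16
WITH
(
    DATAFILETYPE = 'widechar',           -- read the file as Unicode
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);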

Upvotes: 33

sayap

Reputation: 6277

You can re-encode the data file as UTF-16. That's what I did anyway.

Upvotes: 4

AdaTheDev

Reputation: 147264

Shouldn't you be using SQLNCHAR instead of SQLCHAR for the Unicode data?
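
For the format file in the question, that would mean something along these lines (a sketch; it also assumes the data file itself is re-saved as UTF-16, since SQLNCHAR describes wide-character data):

10.0
3
1 SQLNCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLNCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLNCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"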

Upvotes: 1
