Reputation: 9508
I am doing a BULK INSERT into SQL Server and it is not inserting UTF-8 characters into the database properly. The data file contains these characters, but the database rows contain garbage characters after the bulk insert executes.
My first suspect was the last line of the format file:
10.0
3
1 SQLCHAR 0 0 "{|}" 1 INSTANCEID ""
2 SQLCHAR 0 0 "{|}" 2 PROPERTYID ""
3 SQLCHAR 0 0 "[|]" 3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"
But after reading this official page, it seems to me that this is actually a bug in how the BULK INSERT operation reads the data file in SQL Server 2008. We are using version 2008 R2.
What is the solution to this problem or at least a workaround?
Upvotes: 27
Views: 145769
Reputation: 193
BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',  --CSV field delimiter
    ROWTERMINATOR = '\n',   --Use to shift the control to next row
    TABLOCK
);
Done.
Upvotes: 11
Reputation: 113
Just to share: I had a similar problem. I had Portuguese accents in a file, and bcp imported garbage characters (e.g. À became ┴). I tried -C with almost all code pages, without success. After hours, I found a hint on the bcp MS help page:
Format file code pages take priority over the -C attribute.
This means that in the format file I had to use "" as the collation, as in the LastName field below. Once I changed that, the -C 65001 attribute imported the UTF-8 file without any problem:
13.0
4
1 SQLCHAR 0 7 "," 1 PersonID ""
2 SQLCHAR 0 25 "," 2 FirstName SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 30 "," 3 LastName ""
4 SQLCHAR 0 11 "\r\n" 4 BirthDate ""
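For reference, a bcp invocation combining such a format file with -C 65001 might look like the following (the database, table, and file names here are placeholders, not from the original post):
bcp MyDb.dbo.Person in persons.csv -f persons.fmt -C 65001 -S myserver -T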
Upvotes: 1
Reputation: 731
I came here looking for a solution for bulk inserting special characters. I didn't like the workaround with UTF-16 (which would double the size of the CSV file). I found out that you definitely CAN do it, and it's very easy: you don't need a format file. This answer is for other people who are looking for the same, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-English-speaking people.

The solution is: just add CODEPAGE='65001' inside the WITH statement of the BULK INSERT (65001 is the code page number for UTF-8). It might not work for all Unicode characters, as suggested by Michael O, but at least it works perfectly for Latin Extended, Greek and Cyrillic, and probably many others too.
Note: the MSDN documentation says UTF-8 is not supported; don't believe it. For me this works perfectly in SQL Server 2008; I didn't try other versions, however.
e.g.:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
CODEPAGE = '65001',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
If all your special characters are in the range 160–255 (ISO-8859-1 or Windows-1252), you could also use:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
CODEPAGE = 'ACP',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
Upvotes: 73
Reputation: 3851
Thought I would add my thoughts to this. We were trying to load data into SQL Server using bcp and had lots of trouble.
bcp does not, in most versions, support any type of UTF-8 file. We discovered that UTF-16 would work, but it is more complex than is shown in these posts.
Using Java we wrote the file using this code:
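// the "x-UTF-16LE-BOM" charset writes little-endian UTF-16 and emits the BOM automatically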
PrintStream fileStream = new PrintStream(NEW_TABLE_DATA_FOLDER + fileName, "x-UTF-16LE-BOM");
This gave us the correct data to insert.
We tried using just UTF-16 and kept getting EOF errors. This is because we were missing the BOM part of the file. From Wikipedia:
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
If these bytes are not present, the file won't work. So we have the file, but there is one more secret that needs to be addressed: when constructing your command line, you must include -w to tell bcp what type of data it is. (When using just English data, you can use -c, for character.) So that will look something like this:
bcp dbo.blah in C:\Users\blah\Desktop\events\blah.txt -S tcp:databaseurl,someport -d thedatabase -U username -P password -w
When this is all done you get some sweet looking data!
Upvotes: 1
Reputation: 10602
Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT (as was part of the original question), and OPENROWSET.
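For example, on SQL Server 2016 and later, an OPENROWSET query over a UTF-8 file could look something like this sketch (the file and format-file paths are illustrative, not from the answer):
SELECT *
FROM OPENROWSET(
    BULK 'D:\somefolder\myCSV.txt',
    FORMATFILE = 'D:\somefolder\myCSV.fmt',
    CODEPAGE = '65001'
) AS rows;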
Upvotes: 2
Reputation: 101
Microsoft just added UTF-8 support to SQL Server 2014 SP2:
https://support.microsoft.com/en-us/kb/3136780
Upvotes: 7
Reputation: 1
I have tested the bulk insertion with UTF-8 format. It works fine in SQL Server 2012.
string bulkInsertQuery = @"DECLARE @BulkInsertQuery NVARCHAR(max) = 'bulk insert [dbo].[temp_Lz_Post_Obj_Lvl_0]
FROM ''C:\\Users\\suryan\\Desktop\\SIFT JOB\\New folder\\POSTdata_OBJ5.dat''
WITH ( FIELDTERMINATOR = ''' + CHAR(28) + ''', ROWTERMINATOR = ''' + CHAR(10) + ''')'
EXEC SP_EXECUTESQL @BulkInsertQuery";
I was using a *.DAT file with FS (the file separator character, CHAR(28)) as the column separator.
Upvotes: 0
Reputation: 1008
My exported data is in TSV format, from a DB which has Latin-1 encoding.
This is easy to check:
SELECT DATABASEPROPERTYEX('DB', 'Collation') SQLCollation;
The extract file is in UTF-8 format.
BULK INSERT isn't working with UTF-8, so I convert the file from UTF-8 to ISO-8859-1 (aka Latin-1) with a simple Clojure script:
(spit ".\\dump\\file1.txt"
(slurp ".\\dump\\file1_utf8.txt" :encoding "UTF-8")
:encoding "ISO-8859-1")
To execute it, correct the paths and run:
java.exe -cp clojure-1.6.0.jar clojure.main utf8_to_Latin1.clj
Upvotes: 0
Reputation: 35
I managed to do this using SSIS and an ADO.NET destination instead of OLE DB.
Upvotes: 0
Reputation: 18405
You can't. You should first use an N-type data field, convert your file to UTF-16, and then import it. The database does not support UTF-8.
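A minimal sketch of that workaround (the table and file names are hypothetical): give the target column an N type, re-save the file as UTF-16, and pass DATAFILETYPE = 'widechar' so BULK INSERT expects UTF-16 data:
-- hypothetical staging table with an N-type (Unicode) column
CREATE TABLE #myTempTable (ID INT, CONTENTTEXT NVARCHAR(MAX));

BULK INSERT #myTempTable
FROM 'D:\somefolder\myData_utf16.txt'  -- the data file, re-saved as UTF-16
WITH
(
    DATAFILETYPE = 'widechar',  -- tell BULK INSERT to expect UTF-16
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);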
Upvotes: 33
Reputation: 6277
You can re-encode the data file with UTF-16. That's what I did anyway.
Upvotes: 4
Reputation: 147264
Shouldn't you be using SQLNCHAR instead of SQLCHAR for the Unicode data?
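If so, the question's format file would use SQLNCHAR in each column description, something like this sketch (assuming the data file is also re-saved as UTF-16; untested against the original data):
10.0
3
1 SQLNCHAR 0 0 "{|}" 1 INSTANCEID ""
2 SQLNCHAR 0 0 "{|}" 2 PROPERTYID ""
3 SQLNCHAR 0 0 "[|]" 3 CONTENTTEXT ""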
Upvotes: 1