softwarenewbie7332
softwarenewbie7332

Reputation: 56

How do I fix the Code Page in SSIS Lookup Transformation to be 65001?

I have an SQL server 2019, DB and tables ALL set to Latin1_General_100_CI_AS_SC_UTF8, relevant table has code and desc columns both varchar.

In SSIS project, single Data Flow component:

I have a UTF-8 CSV file read with flat file connection, text column code to match is DT_STR, 65001

I have a Lookup that is set to "Full Cache" and loads the Latin1_General_100_CI_AS_SC_UTF8 table, but SSIS thinks the varchar columns are DT_STR, 1252

Finally the code in both CSV and lookup are matched and desc is sent to destination table which is on the same Latin1_General_100_CI_AS_SC_UTF8 collation. The destination component is set to AlwaysUseDefaultCodePage True and DefaultCodePage 65001.

I now get an error saying the column has more than one code page and cannot run the package.

If not for the mislabeled 1252, this package should run. I believe its something to do with ExternalMetadataXml, which is read-only and says all my lookup varchar columns are CodePage="1252".

If I manually edit the package .dtsx with npp and replace all instances of 1252 with 65001, the package can run and seems to do what I expected, as long as I never touch the lookup component again.. That seems a bit messed up of a solution tho, I am hoping there's someone else who has a cleaner way to fix this. Thanks.

Upvotes: 2

Views: 5942

Answers (2)

billinkc
billinkc

Reputation: 61231

With the disclaimer that I'm a "dumb American" who doesn't deal with non-English data but did work with a friend recently on using bulk import with UTF-8 data, here's what I see.

I have a pipe separated value file that looks like this

level|name
7|"Ovasino Poste de Santé"

Notepad++ indicates I have saved it as UTF-8.

I created two flat file connection managers in SSIS: Codepage65001STR and Codepage65001WSTR. They both use a Code Page of 65001 (UTF-8)

enter image description here

In the advanced tab for the STR variant, I left the data type as DT_STR enter image description here

In the advanced tab for WSTR variant, I changed the data type to DT_WSTR

enter image description here

I also created a table and loaded it with the same data

DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.dba_286478
(
    level
,   name
)
VALUES
(
    7 -- level - int
,   'Ovasino Poste de Santé' -- name - varchar(75)
);

DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
    level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

I then created a data flow task with a Flat File Source using the different Flat File Connection Managers and added data viewers between them and an empty derived column (so I had an anchor point for the data viewer).

I did the same thing with an OLE DB Source pointing at my table as well as a custom query of

SELECT
    T.level
,   CAST(T.name AS varchar(75)) AS name
FROM 
    dbo.dba_286478 AS T;

as well as explicitly defining the collation as it makes no different in SSIS

,   CAST(T.name COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS varchar(75)) AS name

The results all show the same, the final word is an accented Sante. If the UTF-8 hadn't happened, it'd show as Santé

enter image description here

At this point, it doesn't matter whether we DT_STR or DT_WSTR in our flat file source column definition, the component understands UTF-8 and UTF-16.

Properties, metadata of each. The Codepage 65001 STR looks as expected. code page of 65001 and data type DT_STR

enter image description here

Unicode, DT_WSTR looks good

enter image description here

The OLE components however, they're a different animal. The component is returning a metadata of DT_WSTR (full Unicode/UTF-16) regardless of whether we do an explicit cast to DT_STR, optionally specifying collation, or let the natural metadata flow through.

Either way, it doesn't detect the code page/collation stuff and just says Nope, you're Unicode

enter image description here

So, when we get to trying to use a Lookup task with an OLE DB connection manager, we can expect and receive the same inability to delineate between UTF-8 string/varchar and UTF-16/nvarchar

enter image description here

The error would indicate and that's true, DT_STR can't match DT_WSTR

Cannot map the input column, 'name', to the lookup column, 'name', because the data types do not match.

So what do I do?

You must have type alignment to make the lookup component work which means the source data needs to be of type DT_WSTR. You can either bring the data in from the Flat File as Unicode or leave it as string with code page 65001. If you go the latter route, then you need to make a copy, Derived Column or Data Conversion work, of that column and use it in the Lookup component.

If you're pulling text out of the lookup component, that's now in your pipeline as Unicode so you probably want to then convert that to a string type with code page. Again, Derived Column or Data Conversion will be used.

enter image description here

SSIS OLE components don't understand UTF8

We saw with the source and lookup component that SSIS is going to treat the UTF-8 strings as UTF-16 but I assumed it would handle storing to the table just fine. Not so much.

My server collation is Latin1_General_100_CI_AI_SC_UTF8 and while I switched accent sensitivities between server and table definition of dbo.dba_286478, it doesn't matter in this case as it's UTF-8 all the way down.

For my Flat File Source, I use the STR based file which has the metadata shown above with the yellow highlighting. The Codepage 65001 for data type DT_STR is what we want.

I added an OLE DB Destination and pointed it at my table which again has the "name" column defined as UTF-8

name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8

Check this error!

enter image description here

Validation error. Data Flow Task OLE DB Destination [138]: The column "name" cannot be processed because more than one code page (65001 and 1252) are specified for it.

We only have code page 65001 at play in this data flow and yet, something in SSIS space is inferring/defaulting to a 1252 code page during validation.

Making it work

The componentry in a data flow task was built with OLE DB connections in mind. That's why the Lookup task supported OLE DB Connections for 2005, 2008 and maybe 2008R2? Long time ago now, I know but the Cache Connection Manager (aka anything else) option was added in later iterations because of the need to use something besides OLE connection managers especially given the push then was to deprecate the OLE driver.

An ADO.NET Connection Manager does slightly better than OLE in this case and that's likely what you're going to have to use to work with UTF8 data in an SSIS package. It will be implicitly converting to UTF-16 when it presents to the table and then SQL Server is going to snap it back into UTF-8 space (best I can tell).

For reference, bringing UTF-8 data into the pipeline with ADO Source will still be flagged as DT_WSTR/UTF-16/unicode.

enter image description here

But you can land DT_STR code page 65001 to an ADO.NET Destination without a code page mismatch error like I'm seeing for OLE DB Destination.

enter image description here

The data from the database is going to appear as DT_WSTR regardless of how you bring it into the pipeline. That means you can define and OLE and an ADO connection manager to use the Lookup component as is.

Or you can add a precursor data flow step to populate the Cache Connection Manager and only have an ADO.NET connection manager. Were you to go that route, convert the DT_WSTR data to DT_STR with codepage 65001 and store that data into the cache.

DFT - Populate Cache -> DFT - Load data

DFT - Populate Cache

ADO.NET Source -> Data Conversion -> Cache Connection Manager

DFT - Load Data

Flat File Source -> Lookup Component -> ADO.NET Destination
    

Cross answered from https://dba.stackexchange.com/questions/286478/how-do-i-fix-the-code-page-in-ssis-lookup-transformation-to-be-65001/286520#286520

Upvotes: 1

Thom A
Thom A

Reputation: 95588

It sounds like you haven't changed the Code page on your Flat File Connection Manager. Open your Connection Manager, and on there there is a drop down menu for Code Page, select 65001 for UTF-8 there.

You'll then likely need to change your Data Flow task, as the nodes (prior to any Derived Column Transformations you have to convert to code page) will likely be treating the data as 1252 and you'll get an error, as SSIS doesn't allow for implicit conversions.

Upvotes: 0

Related Questions