Michał Niklas
Michał Niklas

Reputation: 54342

Fetching wrongly encoded data via native JDBC Informix driver

I have Informix database configured with:

DB_LOCALE=pl_pl.CP1250

(Polish locale with Windows CP1250 character encoding).

In this database there is a table with VARCHAR column in which most of the data is encoded in CP1250 but some records are encoded in UTF-8. I suspect they were inserted by ODBC and wrongly encoded .csv import.

When I use ODBC this wrongly encoded data can be fetched. It does not look pretty:

nazw:┼?UKASIK

but can be displayed and end-user can edit such data. Those "strange" chars are UTF-8 characters of 'Ł' letter.

When I use native JDBC driver I cannot fetch such data. Instead of String I got exception:

JDBC Error: -79783
IX000
Kodowanie lub zestaw kodów znaków nie są obsługiwane.

Explanation in English:

-79783 Encoding or code set not supported
Explanation: The encoding or code set entered in the DB_LOCALE or CLIENT_LOCALE variable is not valid.

I created test program in Jython that connect to database using native JDBC driver and JDBC-ODBC bridge. I got exception only with native driver. I also tried to get this data via other JDBC getXXX() methods to get byte[] or Stream but they also raised exceptions. I use JDBC URL as:

jdbc:informix-sqli://test-informix:9088/test:informixserver=ol_testifx;DB_LOCALE=pl_PL.CP1250;CLIENT_LOCALE=pl_PL.CP1250;charSet=CP1250

Server version: IBM Informix Dynamic Server Version 11.50.FC4

Native driver: 3.70.JC5DE; major: 3; minor: 70

ODBC driver used by JDBC-ODBC bridge: 2.0001 (3.70.TC5DE); major: 2; minor: 1

My question is:

Is there any way of getting such wrongly encoded data? I would like to see '?' characters instead of wrongly encoded characters. I don't want exceptions because they do not allow end-users to see and correct wrongly encoded data.

Upvotes: 1

Views: 5911

Answers (2)

Michał Niklas
Michał Niklas

Reputation: 54342

I got help from IBM Polska and they have found that JDBC connect string can be extended by IFX_USE_STRENC=true: http://www-01.ibm.com/support/docview.wss?uid=swg21502902

This allowed JDBC to fetch wrongly encoded data. Now I can read:

nazw:Przemysław
nazw:Ĺ?UKASIK

(2nd record is with Polish letter Ł in wrong encoding)

Thank you IBM Polska!

Upvotes: 2

Jonathan Leffler
Jonathan Leffler

Reputation: 755006

If I understand CP1250 correctly, then the Ł is at CP1250 code point 0xA3, but corresponds to Unicode U+0142 LATIN CAPITAL LETTER L WITH STROKE. The UTF-8 encoding of U+0142 is 0xC5 0x81. And 0xC5 in CP1250 is Ĺ, Unicode U+0139 LATIN CAPITAL LETTER L WITH ACUTE, but 0x81 is unassigned.

At an educated guess, it is the 0x81 byte that is causing Informix to generate an error. The 0x81 byte cannot appear in well-formed CP1250 text (in the same way that neither the 0xC0 byte nor the 0xC1 byte, to name but two, can appear in well-formed UTF-8 text). It is not clear how the malformed data was inserted into the database, but presumably someone had the locale information misset when they did so.

(To add to the fun, the symbol I see in the question as appears to be UTF-8 bytes 0xE2 0x94 0xBC = U+253C BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL, and it is not clear to me how that is related to Ł at all.)

What can you do to work around this problem? Good question!

Confession: I can spell JDBC — E, then S, then Q, then L, then /, then C? No...Oh!

There is an ONCONFIG parameter, EILSEQ_COMPAT_MODE, which could be set to 1 (and the server restarted) that allows otherwise illegal characters into the database. This might help, or then again, it might not because it is already set and that's how the bogus data got into the database in the first place.

There might be some other tricks that will help if that doesn't (leave a comment below), but this is edge-case stuff; you shouldn't have been able to get the bogus data into the database and it is hard to get it back out again.

Upvotes: 0

Related Questions