Dev_Karl

Reputation: 53

Encoding problems exporting file

I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.

A Lithuanian file was imported to the AS/400, where text is stored in EBCDIC. The data is then exported to an ANSI file and read as windows-1257. ASCII characters work fine, and so do some Lithuanian ones, but the rest come out garbled, with characters like ~, ¶ and ].

Example string going through the pipeline

Start file
Tuskulënö

as400
Tuskulënö
EAA9A9596
34224335A

exported file (after conversion to windows-1257)
Tuskulėnö

expected result for exported file
Tuskulėnų

Any ideas?

Regards, Karl

Upvotes: 1

Views: 352

Answers (1)

Joachim Sauer

Reputation: 308061

EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.

So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
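To illustrate why the codepage matters, here is a minimal sketch that takes the bytes from the hex display in the question and decodes them with several EBCDIC codepages. It assumes the two rows are a vertical hex dump (top row is the high nibble, bottom row the low nibble). The helper names and the list of candidate codepages are my own choices for illustration, limited to codecs that ship with Python; I am not claiming any of them is the one configured on your machine.

```python
def from_vertical_hex(high_row, low_row):
    """Combine a two-row hex display (high nibbles over low nibbles) into bytes."""
    return bytes(int(h + l, 16) for h, l in zip(high_row, low_row))


def decode_with_each(raw, codepages):
    """Decode the same bytes once per codepage and return {codepage: text}."""
    return {cp: raw.decode(cp, errors="replace") for cp in codepages}


# Rows copied from the question's as400 display.
raw = from_vertical_hex("EAA9A9596", "34224335A")

# Hypothetical candidates: EBCDIC codecs available in Python's standard library.
CANDIDATES = ("cp037", "cp273", "cp500", "cp1140")

for cp, text in decode_with_each(raw, CANDIDATES).items():
    print(f"{cp:>7}: {text}")
```

The plain letters should come out the same under every candidate, while the bytes for the accented characters are the ones to watch. That is exactly the part that depends on knowing the real codepage.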

A similar problem exists with ANSI: when used as the name of an encoding, it refers to the Windows default encoding. Unfortunately, the default encoding of a Windows installation varies with the configured locale.

So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
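As a small sketch of how two members of the Windows-* family disagree, the snippet below writes the expected string from the question as windows-1257 and reads the same bytes back as Windows-1252. The `reinterpret` helper is a hypothetical name used only for this illustration, and this is not meant as a diagnosis of what actually happens in your export.

```python
def reinterpret(text, written_as, read_as):
    """Encode text with one codepage, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


expected = "Tuskulėnų"  # the expected result from the question

# Same bytes, two different readings.
print(expected.encode("cp1257").hex())
print(reinterpret(expected, "cp1257", "cp1257"))  # round trip, unchanged
print(reinterpret(expected, "cp1257", "cp1252"))  # wrong codepage on the reading side
```

The ASCII letters survive either way because both codepages agree on them; only the bytes above 0x7F change meaning.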

Once you actually know what encoding you have and want at each point, you can go towards the second step: fixing it.

My personal preference for this kind of problem is this: have only one step where encodings are converted. Take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary, convert UTF-8 to some other encoding in the last step (but avoid this if possible).
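A minimal sketch of that single conversion step could look like the following. The function names and the `source_encoding` parameter are assumptions for illustration; you would pass in whatever encoding you determined the initial tool really produces. Decoding is strict on purpose, so a wrong guess fails loudly instead of silently writing garbage.

```python
from pathlib import Path


def to_utf8_bytes(raw, source_encoding):
    """Decode bytes from the known source encoding and re-encode them as UTF-8."""
    # Strict decoding: raises UnicodeDecodeError if the bytes do not fit the encoding.
    return raw.decode(source_encoding).encode("utf-8")


def convert_file_to_utf8(src, dst, source_encoding):
    """Read src in its original encoding and write a UTF-8 copy to dst."""
    Path(dst).write_bytes(to_utf8_bytes(Path(src).read_bytes(), source_encoding))
```

For example, `convert_file_to_utf8("export.txt", "export.utf8.txt", "cp1257")` would be the one place in the pipeline that knows about windows-1257 (the file names here are made up). Everything downstream then opens the file with `encoding="utf-8"`.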

Upvotes: 5
