Reputation: 53
I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the AS/400, where text is stored in EBCDIC. The data is then exported to an ANSI file and read as windows-1257. ASCII characters work fine and so do some Lithuanian characters, but the rest come out garbled, with characters like ~, ¶ and ].
Example string going through the pipeline
Start file
Tuskulënö
as400
Tuskulënö
EAA9A9596
34224335A
exported file (after conversion to windows-1257)
Tuskulėnö
expected result for exported file
Tuskulėnų
Any ideas?
Regards, Karl
Upvotes: 1
Views: 352
Reputation: 308061
EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within each family share about half of their codes, the ones for "basic" letters (roughly what is present in ASCII), and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
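To illustrate why the codepage matters, here is a minimal Python sketch. It assumes two codepages that Python happens to ship codecs for (cp037 and cp500); neither is necessarily the one used on your system. The first six bytes are the ones for "Tuskul" from the hex dump in the question, and 0x5A is one of the code points on which these two codepages disagree.

```python
# Bytes for "Tuskul" as shown in the question's hex dump, followed by 0x5A,
# a code point that cp037 and cp500 assign to different characters.
raw = bytes.fromhex("E3A4A292A493") + b"\x5a"

as_cp037 = raw.decode("cp037")  # EBCDIC US/Canada (assumed for illustration)
as_cp500 = raw.decode("cp500")  # EBCDIC International (assumed for illustration)

# The basic letters agree, the last character does not.
print(as_cp037)
print(as_cp500)
```

The letters come out the same under both codecs, but the final byte turns into a different punctuation character, which is the same kind of damage as the stray ] and ~ in the question.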
A similar problem exists with ANSI: when used for an encoding it refers to a Windows default encoding. Unfortunately the default encoding of a Windows installation can vary based on the locale configured.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
Once you actually know what encoding you have and want at each point, you can move on to the second step: fixing it.
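The same effect can be shown for "ANSI" with a small Python sketch. It takes the expected string from the question, encodes it as windows-1257, and then reads the bytes back with two different Windows codepages. That the wrong reader would be Windows-1252 is only an assumption for the example.

```python
text = "Tuskulėnų"

# Bytes as a Baltic (windows-1257) "ANSI" file would contain them.
raw = text.encode("cp1257")

as_baltic = raw.decode("cp1257")   # read with the matching codepage
as_western = raw.decode("cp1252")  # read with the Western European default

print(raw.hex())
print(as_baltic)
print(as_western)
```

Only the two non-ASCII characters change between the readings, which is why a file can look "almost right" when the wrong Windows codepage is assumed.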
My personal preference for this kind of problem is this: have only one step where encodings are converted. Take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary, convert UTF-8 to some other encoding in the last step (but avoid this if possible).
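A rough Python sketch of that approach follows. The helper names `to_utf8` and `from_utf8` are made up for this example, and windows-1257 as the source and target encoding is an assumption; substitute whatever encodings you actually identified.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: convert once, at the edge, into UTF-8."""
    # Strict decoding (the default) raises on bytes that are invalid in
    # source_encoding instead of silently replacing them.
    return raw.decode(source_encoding).encode("utf-8")


def from_utf8(utf8_bytes: bytes, target_encoding: str) -> bytes:
    """Hypothetical helper: convert away from UTF-8 only in the last step."""
    return utf8_bytes.decode("utf-8").encode(target_encoding)


# Stand-in for what the initial tool produces (assumed to be windows-1257).
exported = "Tuskulėnų".encode("cp1257")

canonical = to_utf8(exported, "cp1257")  # first step: everything becomes UTF-8
final = from_utf8(canonical, "cp1257")   # last step, only if a consumer needs it

print(canonical.decode("utf-8"))
```

Keeping every intermediate step in UTF-8 means there are only two places where a wrong codepage guess can corrupt the data, and both are easy to test in isolation.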
Upvotes: 5