Harinie R

Reputation: 307

Decode EBCDIC to ASCII/readable text in Python

I have an IBM mainframe file, reportedly encoded in 'cp500', that needs to be decoded to ASCII or readable text. The file was taken from a Unix server and transferred to Windows using the IPSwitch tool.

I already tried the code below but couldn't get the result I want:

sample data = 'ðñðòðõÅäù@@@@@@@ððð :BÄÑðò÷øò@@@JaÈK' - in txt file

import codecs

with open(file, "rb") as ebcdic:
    ascii_txt = codecs.decode(ebcdic, "cp500")
    print(ascii_txt)

This produces a TypeError:

"TypeError: decoding with 'cp500' codec failed (TypeError: a bytes-like object is required, not '_io.BufferedReader')"

Then I tried these two:

with open(file, 'r', encoding='cp500') as f:
    for line in f:
        print(line)

with codecs.open(file, 'r', encoding='cp500') as f:
    for line in f:
        print(line)

I also tried the international "cp1140" encoding:

with open(file, 'r', encoding="cp1140") as f:
    for line in f:
        print(line)

I expect a readable output - a copybook layout - something like this...

0001***********
0002...........
0003...........

But all three of the above print this output:

C¢C£C¢C¥C¢C§CeCuC¾       C¢C¢C¢âCdCjC¢C¥C¼C½C¥   [/Ch.

And I also tried reading the file in "rb" mode:

with open(file, 'rb') as f:
    for line in f:
        print(line)

And this produces the output below:

b'\xc3\xb0\xc3\xb1\xc3\xb0\xc3\xb2\xc3\xb0\xc3\xb5\xc3\x85\xc3\xa4\xc3\xb9@@@@@@@\xc3\xb0\xc3\xb0\xc3\xb0 :B\xc3\x84\xc3\x91\xc3\xb0\xc3\xb2\xc3\xb7\xc3\xb8\xc3\xb2@@@Ja\xc3\x88K'

This is the first time I'm dealing with EBCDIC/mainframe files. Any help in decoding this would be appreciated!

Thanks in advance :)

Upvotes: 3

Views: 14020

Answers (1)

lenz

Reputation: 5817

I suspect the EBCDIC data were decoded with Latin-1 and saved with UTF-8 in the TXT file you are using right now.

Let's try to reconstruct with an abbreviated version of your example:

>>> copybook = '0102 [/H.'

This is what was originally produced. This text was encoded with EBCDIC:

>>> '0102 [/H.'.encode('cp500')
b'\xf0\xf1\xf0\xf2@Ja\xc8K'

So that's the sequence of bytes that was written in the original mainframe file. You could also write it like this in a general (non-Python) representation:

F0 F1 F0 F2 40 4A 61 C8 4B
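If you want to produce that hex notation yourself, for example to compare against a dump from the mainframe side, something like this should work. It is only a small sketch and assumes Python 3.8 or later, where `bytes.hex()` accepts a separator:

```python
# Encode the abbreviated copybook line as EBCDIC (code page 500).
ebcdic_bytes = '0102 [/H.'.encode('cp500')

# Render the bytes as space-separated uppercase hex pairs.
hex_dump = ebcdic_bytes.hex(' ').upper()
print(hex_dump)
```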

Now these bytes were decoded with Latin-1, or maybe CP-1252 ("Windows Latin-1"). That's what might happen if you do this on a Windows machine:

>>> with open(file) as f:
...     text = f.read()
>>> text
'ðñðò@JaÈK'

You can simulate this mis-encoding like this:

>>> '0102 [/H.'.encode('cp500').decode('latin1')
'ðñðò@JaÈK'

That's the string you show in the beginning of your post. It's already worse than the mere problem of having to deal with mainframe files – it's a mojibake of a mainframe file!

Now, to make things even worse, this string was saved to a file using UTF-8. Let's try that too:

>>> '0102 [/H.'.encode('cp500').decode('latin1').encode('utf8')
b'\xc3\xb0\xc3\xb1\xc3\xb0\xc3\xb2@Ja\xc3\x88K'

These are the bytes that are contained in your TXT file, according to the last snippet (where you open with 'rb' mode and print the output).

Now these bytes aren't valid EBCDIC anymore. The encoding round-trip with Latin-1 and UTF-8 distorted the contents:

>>> '0102 [/H.'.encode('cp500').decode('latin1').encode('utf8').decode('cp500')
'C¢C£C¢C¥ [/Ch.'

That's the output you reported for the attempts that read the file as EBCDIC.
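To see where the extra "C" characters come from, it may help to follow a single byte through the same chain. This is just an illustration of the distortion described above, using the EBCDIC byte for the digit "0"; the variable names are made up for the example:

```python
# One original mainframe byte: 0xF0 is the digit '0' in cp500.
original_byte = b'\xf0'

# Step 1: wrongly decoded as Latin-1, it becomes the letter 'ð'.
as_latin1 = original_byte.decode('latin1')

# Step 2: saved as UTF-8, that one character turns into TWO bytes.
as_utf8 = as_latin1.encode('utf8')

# Step 3: reading those two bytes as cp500 yields two characters,
# because 0xC3 is 'C' and 0xB0 is '¢' in that code page.
misread = as_utf8.decode('cp500')

print(repr(as_latin1), as_utf8, repr(misread))
```

Every byte above 0x7F gets doubled this way, which is why the digits in your output show up as pairs starting with "C".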

In order to recover from the situation, you need to undo the distortion:

>>> distorted = '0102 [/H.'.encode('cp500').decode('latin1').encode('utf8')
>>> distorted
b'\xc3\xb0\xc3\xb1\xc3\xb0\xc3\xb2@Ja\xc3\x88K'
>>> recovered = distorted.decode('utf8').encode('latin1').decode('cp500')
>>> recovered
'0102 [/H.'

... or when reading from a file, you can let open do the first decode step for you:

>>> with open(file, encoding='utf8') as f:
...     data = f.read()
...     text = data.encode('latin1').decode('cp500')
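If you need this more than once, the same steps could be wrapped in a small function. The following is a sketch, not a tested tool: `recover_ebcdic_text` is a name I made up, it assumes the file really went through the Latin-1 then UTF-8 round trip described above, and the temporary file only stands in for your TXT file:

```python
import os
import tempfile


def recover_ebcdic_text(path, ebcdic_codec='cp500'):
    """Hypothetical helper: undo a Latin-1 -> UTF-8 mis-conversion of an EBCDIC file."""
    # newline='' keeps line-ending characters untouched, since in the
    # original data they are ordinary bytes, not line breaks.
    with open(path, encoding='utf8', newline='') as f:
        mojibake = f.read()
    # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
    # restores the original mainframe bytes before the EBCDIC decode.
    return mojibake.encode('latin1').decode(ebcdic_codec)


# Demo: write the distorted bytes from the example to a temporary file.
distorted = b'\xc3\xb0\xc3\xb1\xc3\xb0\xc3\xb2@Ja\xc3\x88K'
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(distorted)

recovered = recover_ebcdic_text(tmp.name)
os.remove(tmp.name)
print(recovered)
```

If the intermediate decoding was actually CP-1252 rather than Latin-1, the `encode('latin1')` step can raise a `UnicodeEncodeError` for some characters; in that case trying `'cp1252'` there would be the next thing to check.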

For the full example line, this yields the following text:

'010205EU9       000\x80\x9aâDJ02782   [/H.'

I'm not 100% sure this is the original text. It contains some control characters (80, 9A) and a non-ASCII letter ("â"). Maybe the 000...782 block has to be interpreted as a binary blob. But I hope this analysis helps you get further in this problem!
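To narrow down which bytes are suspicious, one option is to recover the raw mainframe bytes for the full line and list the positions that do not decode to printable ASCII. This is only an inspection aid under the same assumptions as above; how those bytes should really be interpreted depends on the copybook layout, which I don't know:

```python
# Bytes of the TXT file, as printed by the 'rb' snippet in the question.
file_bytes = (
    b'\xc3\xb0\xc3\xb1\xc3\xb0\xc3\xb2\xc3\xb0\xc3\xb5\xc3\x85\xc3\xa4\xc3\xb9'
    b'@@@@@@@\xc3\xb0\xc3\xb0\xc3\xb0 :B\xc3\x84\xc3\x91\xc3\xb0\xc3\xb2'
    b'\xc3\xb7\xc3\xb8\xc3\xb2@@@Ja\xc3\x88K'
)

# Undo the UTF-8 / Latin-1 round trip to get the original mainframe bytes.
raw = file_bytes.decode('utf8').encode('latin1')
text = raw.decode('cp500')

# Collect (offset, hex value) for every byte whose decoded character
# is not printable ASCII.
suspicious = [
    (offset, f'{byte:02X}')
    for offset, (byte, char) in enumerate(zip(raw, text))
    if not (char.isascii() and char.isprintable())
]

print(repr(text))
print(suspicious)
```

Here only the three bytes 20 3A 42 stand out, which supports the idea that this part of the record is not plain text.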

Upvotes: 3
