Reading fixed width files into Pandas with binary data

Question

I'm trying to read some fixed-width data from an IBM mainframe into Pandas. The fields are stored in a mix of EBCDIC, numbers saved as binary (e.g., 255 stored as 0xFF), and binary coded decimal (e.g., 255 stored as 0x0255). I know the field lengths and types ahead of time.

Can read_fwf deal with this kind of data? Are there better alternatives?

Example -- I'm trying to read in an arbitrary number of records structured like this:

import tempfile

databin = 0xF0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908

#column 1 -- ten bytes, EBCDIC.  Should be 0315139925.
#column 2 -- four bytes, binary number.  Should be 180534149.
#column 3 -- ten characters, EBCDIC.  Should be 04/20/2018.
#column 4 -- twenty six characters, EBCDIC.  Should be 2018-03-23-15.45.54.592918.
#column 5 -- four bytes, packed binary coded decimal.  Should be 4908.  I know the scale ahead of time.

rawbin = databin.to_bytes((databin.bit_length() + 7) // 8, 'big') or b'\0'

with tempfile.TemporaryFile() as fp:
    fp.write(rawbin)

CT Zhu · Accepted Answer

Most likely you will have to write some code to decode the data record by record; I think it is unlikely that pandas can read it as is. The decoding can be broken down into the following components, where B holds the raw bytes of one record (rawbin in the question). The BCD part is shamelessly copied from "How to split a byte string into separate bytes in python":

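def bcdDigits(chars):
    # chars: iterable of single-byte bytes objects, e.g. [b'\x49', b'\x08']
    for char in chars:
        char = ord(char)
        # each byte packs two decimal digits: high nibble, then low nibble
        for val in (char >> 4, char & 0xF):
            # a 0xF nibble marks the end of the digits
            if val == 0xF:
                return
            yield val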


In [40]: B
Out[40]: b'\xf0\xf3\xf1\xf5\xf1\xf3\xf9\xf9\xf2\xf5\n\xc2\xbb\x85\xf0\xf4a\xf2\xf0a\xf2\xf0\xf1\xf8\xf2\xf0\xf1\xf8`\xf0\xf3`\xf2\xf3`\xf1\xf5K\xf4\xf5K\xf5\xf4K\xf5\xf9\xf2\xf9\xf1\xf8\x00\x00I\x08'

In [41]: import codecs

In [43]: codecs.decode(B[0:10], "cp500")
Out[43]: '0315139925'

In [44]: int.from_bytes(B[10:14], byteorder='big')
Out[44]: 180534149

In [45]: codecs.decode(B[14:24], "cp500")
Out[45]: '04/20/2018'

In [46]: codecs.decode(B[24:50], "cp500")
Out[46]: '2018-03-23-15.45.54.592918'

In [48]: list(bcdDigits([B[i: i+1] for i in range(50, 54)]))
Out[48]: [0, 0, 0, 0, 4, 9, 0, 8]

Note: to get the last piece back as a single integer rather than a list of digits:

In [63]: import numpy as np

In [64]: (list(bcdDigits([B[i: i+1] for i in range(50, 54)])) * (10 ** np.arange(8)[::-1])).sum()
Out[64]: 4908
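
Putting the pieces together, a record-by-record decoder could look roughly like the sketch below. It uses only the standard library. The field names col1..col5 and the function name decode_records are hypothetical, the layout (10 + 4 + 10 + 26 + 4 bytes) is read off the sample in the question, and the BCD field is assumed to be unsigned with no sign nibble.

```python
import codecs
import struct

# Record layout from the question: 10 + 4 + 10 + 26 + 4 = 54 bytes.
# ">" = big-endian without padding, "s" = raw bytes, "I" = unsigned 32-bit int.
# Assumption: the binary field is unsigned; use "i" if it can be negative.
RECORD = struct.Struct(">10sI10s26s4s")


def decode_records(data):
    """Yield one dict per 54-byte record found in `data`."""
    for col1, col2, col3, col4, col5 in RECORD.iter_unpack(data):
        yield {
            "col1": codecs.decode(col1, "cp500"),
            "col2": col2,
            "col3": codecs.decode(col3, "cp500"),
            "col4": codecs.decode(col4, "cp500"),
            # Unsigned packed BCD: the hex text of the bytes is the decimal number.
            "col5": int(col5.hex()),
        }


sample = bytes.fromhex(
    "F0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8"
    "F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908"
)
rows = list(decode_records(sample * 2))  # pretend the file holds two records
print(rows[0])
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) to get one row per record. For a real file, read it in binary mode and hand the bytes to decode_records; iter_unpack raises struct.error if the length is not a multiple of the record size.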
