Reputation: 641
I'm trying to read some fixed-width data from an IBM mainframe into Pandas. The fields are stored in a mix of EBCDIC, numbers saved as binary (e.g., 255 stored as 0xFF), and binary coded decimal (e.g., 255 stored as 0x0255). I know the field lengths and types ahead of time.
Can read_fwf deal with this kind of data? Are there better alternatives?
Example: I have an arbitrary number of records, structured like this, that I'm trying to read in.
import tempfile
databin = 0xF0F3F1F5F1F3F9F9F2F50AC2BB85F0F461F2F061F2F0F1F8F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F800004908
#column 1 -- ten bytes, EBCDIC. Should be 0315139925.
#column 2 -- four bytes, binary number. Should be 180534149.
#column 3 -- ten characters, EBCDIC. Should be 04/20/2018.
#column 4 -- twenty six characters, EBCDIC. Should be 2018-03-23-15.45.54.592918.
#column 5 -- four bytes, packed binary coded decimal. Should be 4908. I know the scale ahead of time.
rawbin = databin.to_bytes((databin.bit_length() + 7) // 8, 'big') or b'\0'
with tempfile.TemporaryFile() as fp:
    fp.write(rawbin)
Upvotes: 1
Views: 703
Reputation: 54340
Most likely you will have to write some code to decode the data record by record; I think it is unlikely that pandas can read it as-is. A record can be broken down into its components as follows (the BCD part is shamelessly copied from "How to split a byte string into separate bytes in python"):
def bcdDigits(chars):
    for char in chars:
        char = ord(char)
        for val in (char >> 4, char & 0xF):
            if val == 0xF:
                return
            yield val
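The function above calls ord() on each element, so it expects a sequence of one-byte strings rather than a bytes object. As a minimal sketch, assuming Python 3 (where iterating over bytes yields integers), a variant could take a bytes slice directly. The name bcd_digits is hypothetical and not from the linked answer:

```python
def bcd_digits(data):
    """Yield the decimal digits packed two per byte in `data` (a bytes object).

    Hypothetical Python 3 variant of bcdDigits; like the original,
    it stops at the first 0xF nibble.
    """
    for byte in data:  # iterating over bytes yields ints in Python 3
        for nibble in (byte >> 4, byte & 0xF):
            if nibble == 0xF:
                return
            yield nibble


digits = list(bcd_digits(b"\x00\x00\x49\x08"))
print(digits)
```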
In [40]: B
Out[40]: b'\xf0\xf3\xf1\xf5\xf1\xf3\xf9\xf9\xf2\xf5\n\xc2\xbb\x85\xf0\xf4a\xf2\xf0a\xf2\xf0\xf1\xf8\xf2\xf0\xf1\xf8`\xf0\xf3`\xf2\xf3`\xf1\xf5K\xf4\xf5K\xf5\xf4K\xf5\xf9\xf2\xf9\xf1\xf8\x00\x00I\x08'
In [41]: import codecs
In [43]: codecs.decode(B[0:10], "cp500")
Out[43]: '0315139925'
In [44]: int.from_bytes(B[10:14], byteorder='big')
Out[44]: 180534149
In [45]: codecs.decode(B[14:24], "cp500")
Out[45]: '04/20/2018'
In [46]: codecs.decode(B[24:50], "cp500")
Out[46]: '2018-03-23-15.45.54.592918'
In [48]: list(bcdDigits([B[i: i+1] for i in range(50, 54)]))
Out[48]: [0, 0, 0, 0, 4, 9, 0, 8]
Note: if you want the last piece returned as an integer rather than a list of digits:
In [63]: import numpy as np
In [64]: (list(bcdDigits([B[i: i+1] for i in range(50, 54)])) * (10 ** np.arange(8)[::-1])).sum()
Out[64]: 4908
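The pieces above can be combined to handle an arbitrary number of records. This is a minimal sketch using only the standard library, assuming every record is exactly 54 bytes with the layout from the question, that the binary field is an unsigned big-endian integer, and that the BCD field is unsigned with no sign nibble (as in the sample). The names RECORD, parse_record and parse_records are hypothetical. It relies on the fact that the hex representation of unsigned BCD bytes is the decimal number itself.

```python
import struct

# Layout assumed from the question: 10 bytes EBCDIC, 4-byte big-endian
# unsigned int, 10 bytes EBCDIC, 26 bytes EBCDIC, 4 bytes unsigned BCD.
RECORD = struct.Struct(">10sI10s26s4s")  # 54 bytes, no padding


def parse_record(raw):
    """Decode one 54-byte record into a tuple of Python values."""
    ident, number, date, timestamp, bcd = RECORD.unpack(raw)
    return (
        ident.decode("cp500"),
        number,
        date.decode("cp500"),
        timestamp.decode("cp500"),
        int(bcd.hex()),  # hex digits of unsigned BCD are the decimal digits
    )


def parse_records(data):
    """Decode a bytes object holding back-to-back fixed-width records."""
    size = RECORD.size
    return [parse_record(data[i:i + size]) for i in range(0, len(data), size)]


# The sample record from the question, repeated twice.
sample = bytes.fromhex(
    "F0F3F1F5F1F3F9F9F2F5" "0AC2BB85" "F0F461F2F061F2F0F1F8"
    "F2F0F1F860F0F360F2F360F1F54BF4F54BF5F44BF5F9F2F9F1F8" "00004908"
)
rows = parse_records(sample * 2)
print(rows[0])
```

The resulting list of tuples can be passed to pandas.DataFrame together with column names of your choice, and the last field divided by the known scale. If the real data carries a sign nibble in the last half-byte, int(bcd.hex()) will raise ValueError and that nibble has to be handled separately.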
Upvotes: 1