Reputation: 369
I'm trying to read a fixed-width file using pandas.read_fwf. Here is a sample of the file:
0000123456700123
0001234567800045
Say, column 0-11 is the balance (with format %12.2f), and column 11-16 is the interest rate (with format %6.2f). So my expected output data frame should look like this:
Balance Int_Rate
0 12345.67 1.23
1 123456.78 0.45
Here's my code for reading the file without formatting:
import pandas as pd
colspecs = [(0, 11), (11, 16)]
header = ['Balance', 'Int_Rate']
df = pd.read_fwf("dataset", colspecs=colspecs, names=header)
I've checked the documentation for pandas.read_fwf, but there doesn't seem to be an option for formatting the columns during import. Do I have to fix the formats afterwards, or is there a better way to do it?
Upvotes: 4
Views: 6571
Reputation: 85
I had the same problem a while back. I used struct to split each line into fields, then loaded the result into pandas:
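As a minimal sketch of the struct part on its own (the widths and the sample bytes here are made up for illustration, they are not from my real file): a positive width becomes an 'Ns' field that is read, and a negative width becomes an 'Nx' pad that is skipped.

```python
import struct

# Hypothetical widths: read 4 bytes, skip 2 bytes, read 3 bytes
fieldwidths = (4, -2, 3)

# 'Ns' reads N bytes as one field, 'Nx' skips N padding bytes
fmt = "".join(f"{abs(w)}{'x' if w < 0 else 's'}" for w in fieldwidths)

# unpack_from returns one bytes object per 's' field; pad bytes are dropped
fields = struct.Struct(fmt).unpack_from(b"ABCD..XYZ")
print(fmt, fields)
```

Note that unpack_from needs the buffer to be at least as long as the total of the widths, so short lines would raise struct.error.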
import struct
import pandas as pd

def parse_data_file(fieldwidths, fn):
    # See https://docs.python.org/3.0/library/struct.html for formatting and other info.
    # Build the struct format: 'Ns' reads N bytes, 'Nx' skips N padding bytes.
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    unpack = fieldstruct.unpack_from
    # This part dissects each line according to your field widths.
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
    rows = []
    with open(fn, 'r') as f:
        for line in f:
            rows.append(parse(line))
    return rows

# test.txt file content, per below
# 6332 x102340 Darwin 080007Darwin 1101
# 6332 x102342 Sydney 200001Sydney 1101
file_location = "test.txt"
fieldwidths = (10, 10, 100, 4, 2, 50, 4)  # negative widths represent ignored padding fields
column_names = ['ID', 'LocationID', 'LocationName', 'PostCode', 'StateID', 'Address', 'CountryID']
fields = parse_data_file(fieldwidths=fieldwidths, fn=file_location)

# Pandas display options
pd.options.display.width = 500
pd.options.display.colheader_justify = 'left'

# Load the list of tuples into a DataFrame
df = pd.DataFrame(fields)
df.columns = column_names
print(df)
Output:
ID LocationID LocationName PostCode StateID Address CountryID
6332 x102340 Darwin 0800 07 Darwin 1101
6332 x102342 Sydney 2000 01 Sydney 1101
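For the file in the question specifically, a pandas-only sketch may be enough. It assumes the last two digits of each field are implied decimals (which matches the expected output shown in the question), and it uses io.StringIO as a stand-in for the real file:

```python
import io
import pandas as pd

# Sample lines from the question; io.StringIO stands in for the real file
sample = io.StringIO("0000123456700123\n0001234567800045\n")

colspecs = [(0, 11), (11, 16)]
names = ["Balance", "Int_Rate"]

# Read every field as a string first, so nothing is converted implicitly
df = pd.read_fwf(sample, colspecs=colspecs, names=names, dtype=str)

# Assumption: two implied decimal places, so convert to int and divide by 100
df = df.astype("int64") / 100
print(df)
```

As far as I know read_fwf has no option for a printf-style format, so scaling the columns after the read (as here) seems the simplest route.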
Upvotes: 1