Nip

Reputation: 369

Formatting the columns when reading fixed width files in Python

I'm trying to read a fixed-width file using pandas.read_fwf. Here is a sample of the file:

0000123456700123  
0001234567800045  

Say columns 0-11 hold the balance (format %12.2f) and columns 11-16 hold the interest rate (format %6.2f), so the last two digits of each field are the decimal places. My expected output data frame should look like this:

     Balance  Int_Rate  
0   12345.67      1.23  
1  123456.78      0.45

Here's my code for reading the file without formatting:

import pandas as pd

colspecs = [(0, 11), (11, 16)]
header = ['Balance', 'Int_Rate']
df = pd.read_fwf("dataset", colspecs=colspecs, names=header)

I've checked the documentation for pandas.read_fwf, but it doesn't seem to offer an option for formatting the columns during import. Do I have to update the formats afterwards, or is there a better way to do it?

Upvotes: 4

Views: 6571

Answers (1)

fo2bug

Reputation: 85

I had the same problem a while back. I used struct to split each line into fields, then loaded the result into pandas:

import struct
import pandas as pd

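def parse_data_file(fieldwidths, fn):
    # Build a struct format string from the field widths: positive widths
    # become 's' (bytes) fields, negative widths become 'x' (skipped padding).
    # See https://docs.python.org/3.0/library/struct.html for formatting and other info.
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    unpack = fieldstruct.unpack_from

    # Split one line into its fields according to fieldwidths.
    # Widths are counted in bytes, so this assumes single-byte characters,
    # and each line must be at least as long as the total width.
    def parse(line):
        return tuple(s.decode() for s in unpack(line.encode()))

    rows = []
    with open(fn, 'r') as f:
        for line in f:
            rows.append(parse(line))
    return rows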

# Contents of test.txt:
# 6332      x102340   Darwin                                                                                              080007Darwin                                            1101
# 6332      x102342   Sydney                                                                                              200001Sydney                                            1101
file_location = "test.txt"
fieldwidths = (10, 10, 100, 4, 2, 50, 4)  # negative widths represent ignored padding fields

column_names = ['ID', 'LocationID', 'LocationName', 'PostCode', 'StateID', 'Address', 'CountryID']
fields = parse_data_file(fieldwidths=fieldwidths, fn=file_location)

# pandas display options
pd.options.display.width = 500
pd.options.display.colheader_justify = 'left'

# Load the list of tuples into a DataFrame
df = pd.DataFrame(fields, columns=column_names)

print(df)

Output

    ID    LocationID  LocationName  PostCode StateID Address CountryID
    6332  x102340     Darwin        0800     07      Darwin  1101    
    6332  x102342     Sydney        2000     01      Sydney  1101   
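
For the two-column sample in the question, where the last two digits of each field are implied decimals, the scaling can also be done in pandas right after reading. This is only a minimal sketch: it assumes every row follows that two-decimal convention, and io.StringIO stands in for the real "dataset" file.

```python
import io

import pandas as pd

# Stand-in for the real file; same two rows as the sample in the question.
sample = "0000123456700123\n0001234567800045\n"

colspecs = [(0, 11), (11, 16)]
header = ["Balance", "Int_Rate"]

# read_fwf parses the zero-padded digit strings as integers.
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, names=header)

# Assumption: both columns carry two implied decimal places.
df = df / 100

print(df)
```

If the columns had different numbers of implied decimals, each one could be divided separately, for example `df["Int_Rate"] / 1000`.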

Upvotes: 1
