How to parse this text file?

Question

                                                    VENDOR
ITEM NUMBER           WH ITEM DESCRIPTION               PRODUCT NUMBER      PRICE    DISC %   MAIN-WH    ALT-WH  BIN#
--------------- ----- -- ------------------------------ --------------- ---------    ------ --------- --------- ------
                                                                             0.00 EA   0.00         0         0

10.5PLC/TLED/26V/27K     14.5W 4PIN CFL REPL 2700K VERT 458406              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/30K     14.5W 4PIN CFL REPL 3000K VERT 458414              20.00 EA   0.00         3         0 PAYOFF I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/35K     14.5W 4PIN CFL REPL 3500K VERT 458422              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/40K     14.5W 4PIN CFL REPL 4000K VERT 458430              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

I want to read each line item and get the item Number, description, Vendor product number and price.

I tried using this python code

def readInventoryFile():
    # dataFile = open("inventoryFiles/INV.txt","r")
    with open('inventoryFiles/INV.txt') as dataFile:
        for lineItem in dataFile:
            itemProperties = lineItem.split("   ")
            while("" in itemProperties) :
                itemProperties.remove("")
            print(itemProperties)
            try:
                itemNum = itemProperties[0]
                itemDesc = itemProperties[1]
                partNumb = itemProperties[2]
                price = itemProperties[3]

                itemSummry = {
                    "Name": itemDesc,
                    "Price": price,
                    "PN": partNumb,
                }

                print(lineItem, "
 ",itemProperties,"
 Summary ",itemSummry)
            except Exception as e:
                print(e)

The code partially works but it's hard to split the line by spaces or other factors because there are spaces that separated and within the content of each line. How could I get the desired product properties?

martineau · Accepted Answer

I think my answer to the question How to efficiently parse fixed width files? can be adapted to do what you want.

The main modification to the code in that answer was to make it also strip any leading and trailing spaces in each field. Below is Python 3.x code illustrating this:

from __future__ import print_function
import struct


HEADER_LINES = 5

# Indices       0       1        2      3      4      5       6      7
fieldwidths = (20, -5, 37, -10, 12, -1, 6, -1, 9, -1, 9, -1, 10, -1, 7)

# Convert fieldwidths into a format compatible with struct module.
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                    for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
#print('fmtstring: {!r}, recsize: {} chars
'.format(fmtstring, fieldstruct.size))

unpack_from = fieldstruct.unpack_from  # To optimize calls.


def parse(line):
    """ Return unpacked fields in string line, stripped of any leading and
        trailing whitespace.
    """
    return list(s.decode().strip() for s in unpack_from(line.encode()))


def readInventoryFile(filename):
    with open(filename) as invfile:
        for _ in range(HEADER_LINES):
            next(invfile)  # Skip header lines.

        for line in invfile:
            if len(line) < fieldstruct.size:  # Pad line if it's too short.
                line = line + (' ' * (fieldstruct.size-len(line)))
            fields = parse(line)
            if fields[0]:  # First field non-blank?
                print(fields)

readInventoryFile('inventoryFiles_INV.txt')

Results:

['10.5PLC/TLED/26V/27K', '14.5W 4PIN CFL REPL 2700K VERT 458406', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/30K', '14.5W 4PIN CFL REPL 3000K VERT 458414', '20.00 EA', '0.00', '3', '0', 'PAYOFF I68', 'I68']
['10.5PLC/TLED/26V/35K', '14.5W 4PIN CFL REPL 3500K VERT 458422', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/40K', '14.5W 4PIN CFL REPL 4000K VERT 458430', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['1000PAR64/FFR', '1000W PAR64 HALOGEN GX16D BASE 56217', '50.00 EA', '0.00', '0', '0', 'I10', '']
['1000PAR64/WFL/S', '1000W PAR64 HALOGEN GX16D BASE S4673', '0.00 EA', '0.00', '0', '0', '', 'I105']
['100A/99', '100W A19 EXTENDED SERVICE      229781', '2.62 EA', '0.00', '0', '0', 'W6-2   I70', 'I11']
['100A/CL', '100W A19 130V CLEAR            375279', '0.99 EA', '0.00', '0', '0', 'A2-2   I70', 'I11']

How this works

In a nutshell, this code leverages Python's struct module ability to split or "unpack" a "buffer" full of data into fixed-with "fields" each containing a certain number of characters.

Although more commonly applied to binary data, it also works on character strings that have been converted into arrays of bytes (which isn't necessary in Python 2.x). Basically you give it a format string specifying the characteristics of each these fields (type and size), along with the data to parse (a line from the file in this case), which it then unpacks accordingly and returns the results as a list of values.

How to parse this text file?

Answers (2)

How this works

Related Questions