e.iluf
e.iluf

Reputation: 1659

How to parse this text file?

I have this text file

                                                    VENDOR
ITEM NUMBER           WH ITEM DESCRIPTION               PRODUCT NUMBER      PRICE    DISC %   MAIN-WH    ALT-WH  BIN#
--------------- ----- -- ------------------------------ --------------- ---------    ------ --------- --------- ------
                                                                             0.00 EA   0.00         0         0

10.5PLC/TLED/26V/27K     14.5W 4PIN CFL REPL 2700K VERT 458406              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/30K     14.5W 4PIN CFL REPL 3000K VERT 458414              20.00 EA   0.00         3         0 PAYOFF I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/35K     14.5W 4PIN CFL REPL 3500K VERT 458422              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

10.5PLC/TLED/26V/40K     14.5W 4PIN CFL REPL 4000K VERT 458430              20.00 EA   0.00         0         0        I68    I68
                                                         (00029  )

I want to read each line item and get the item Number, description, Vendor product number and price.

I tried using this python code

def readInventoryFile():
    # dataFile = open("inventoryFiles/INV.txt","r")
    with open('inventoryFiles/INV.txt') as dataFile:
        for lineItem in dataFile:
            itemProperties = lineItem.split("   ")
            while("" in itemProperties) :
                itemProperties.remove("")
            print(itemProperties)
            try:
                itemNum = itemProperties[0]
                itemDesc = itemProperties[1]
                partNumb = itemProperties[2]
                price = itemProperties[3]

                itemSummry = {
                    "Name": itemDesc,
                    "Price": price,
                    "PN": partNumb,
                }

                print(lineItem, "\n ",itemProperties,"\n Summary ",itemSummry)
            except Exception as e:
                print(e)

The code partially works but it's hard to split the line by spaces or other factors because there are spaces that separated and within the content of each line. How could I get the desired product properties?

Upvotes: 0

Views: 125

Answers (2)

martineau
martineau

Reputation: 123531

I think my answer to the question How to efficiently parse fixed width files? can be adapted to do what you want.

The main modification to the code in that answer was to make it also strip any leading and trailing spaces in each field. Below is Python 3.x code illustrating this:

from __future__ import print_function
import struct


HEADER_LINES = 5

# Indices       0       1        2      3      4      5       6      7
fieldwidths = (20, -5, 37, -10, 12, -1, 6, -1, 9, -1, 9, -1, 10, -1, 7)

# Convert fieldwidths into a format compatible with struct module.
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                    for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
#print('fmtstring: {!r}, recsize: {} chars\n'.format(fmtstring, fieldstruct.size))

unpack_from = fieldstruct.unpack_from  # To optimize calls.


def parse(line):
    """ Return unpacked fields in string line, stripped of any leading and
        trailing whitespace.
    """
    return list(s.decode().strip() for s in unpack_from(line.encode()))


def readInventoryFile(filename):
    with open(filename) as invfile:
        for _ in range(HEADER_LINES):
            next(invfile)  # Skip header lines.

        for line in invfile:
            if len(line) < fieldstruct.size:  # Pad line if it's too short.
                line = line + (' ' * (fieldstruct.size-len(line)))
            fields = parse(line)
            if fields[0]:  # First field non-blank?
                print(fields)

readInventoryFile('inventoryFiles_INV.txt')

Results:

['10.5PLC/TLED/26V/27K', '14.5W 4PIN CFL REPL 2700K VERT 458406', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/30K', '14.5W 4PIN CFL REPL 3000K VERT 458414', '20.00 EA', '0.00', '3', '0', 'PAYOFF I68', 'I68']
['10.5PLC/TLED/26V/35K', '14.5W 4PIN CFL REPL 3500K VERT 458422', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/40K', '14.5W 4PIN CFL REPL 4000K VERT 458430', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['1000PAR64/FFR', '1000W PAR64 HALOGEN GX16D BASE 56217', '50.00 EA', '0.00', '0', '0', 'I10', '']
['1000PAR64/WFL/S', '1000W PAR64 HALOGEN GX16D BASE S4673', '0.00 EA', '0.00', '0', '0', '', 'I105']
['100A/99', '100W A19 EXTENDED SERVICE      229781', '2.62 EA', '0.00', '0', '0', 'W6-2   I70', 'I11']
['100A/CL', '100W A19 130V CLEAR            375279', '0.99 EA', '0.00', '0', '0', 'A2-2   I70', 'I11']

How this works

In a nutshell, this code leverages Python's struct module ability to split or "unpack" a "buffer" full of data into fixed-with "fields" each containing a certain number of characters.

Although more commonly applied to binary data, it also works on character strings that have been converted into arrays of bytes (which isn't necessary in Python 2.x). Basically you give it a format string specifying the characteristics of each these fields (type and size), along with the data to parse (a line from the file in this case), which it then unpacks accordingly and returns the results as a list of values.

Upvotes: 1

JacoSolari
JacoSolari

Reputation: 1404

Since your file is formatted in an inconvenient way I can only see this trick as solution: You can look at the headers (feature names) and parse all the lines according to the indices at which they appear, like such

import numpy as np
with open('/Users/Copo1/Desktop/aaa.txt') as dataFile:
    lines = dataFile.readlines()
headers = ['ITEM NUMBER','WH ITEM DESCRIPTION', 'PRODUCT NUMBER', 'PRICE']
starts = [lines[1].find(h) for h in headers]
starts.append(len(lines[0]))
headers.append(' ')
items = [[line[starts[i]:starts[i+1]] for line in lines] for i,h in zip(range(len(starts)-1), headers[:-1]) ]

this produces the following output for the items list (pasting only the first element, corresponding to 'ITEM NUMBER', the other elements are correct as well you can check).

[['                      ',
  'ITEM NUMBER           ',
  '--------------- ----- ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/27K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/30K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/35K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/40K  ',
  '                      ',
  ' \n',
  '1000PAR64/FFR         ',
  '                      ',
  ' \n',
  '1000PAR64/WFL/S       ',
  '                      ',
  ' \n',
  '100A/99               ',
  '                      ',
  ' \n',
  '100A/CL               ',
  '                      ',
  ' '],

There might be some extra easy polishing left to do after this (like removing empty strings and '\n's) but I am sure you can figure out yourself.

Upvotes: 0

Related Questions