How to parse colon-separated text files with different data types and a changing, but consistent, structure, using Python

Question

I have a data acquirer that gives me a header file and a data file that I need to parse in order to do some computations. The header file contains a little over a hundred variables that follow the pattern in the sample header below.

Sample header:

fileName :                              C:\Path\To\File\prefix.289.13.name.00000.ext
date :                                  2013-10-16 15:46:16.978 EDT
var1 (unit) :                           1381952777
var2 (unit) :                           [ 10000 0 0 0  ]
var3 (0.1unit) :                        400
var4 (unit):                            1.03125
var5 :                                  3
var6 (description (unit)) :
[ 1.1 -0.5 0.1 ]
[ 1.1 -0.5 0.1 ]
[ 1.1 -0.5 0.1 ]


          COMMENTS
------------------------------

Here var5 gives the number of rows in the var6 matrix. The variables are separated from their values by a colon in all but the last case, where the values start on the line after the colon. A variable may or may not have units specified in parentheses between its name and the colon. When units are specified, there is sometimes a scale factor prepended to them. The values may be strings, dates, integers, floats, or arrays of integers or floats. The last value is separated from an unneeded comment section by a few empty lines.
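One way to read the layout described above is as `name (optional scale and unit) : value`. A minimal sketch of a regular expression for that reading follows. `KEY_RE`, `SCALE_RE` and `split_header_line` are hypothetical names, and treating a leading number inside the parentheses as the scale factor is an assumption drawn only from the `var3 (0.1unit)` line.

```python
import re

# Assumed layout, inferred from the sample header only:
#   name [ (optional-scale unit-or-description) ] : value
KEY_RE = re.compile(r"""
    ^\s*(?P<name>\w+)          # variable name
    \s*(?:\((?P<paren>.*)\))?  # optional parenthesised unit/description
    \s*:\s*                    # colon separating key from value
    (?P<value>.*)$             # rest of the line (may be empty)
""", re.VERBOSE)

# Assumption: a number at the start of the parentheses is a scale factor
SCALE_RE = re.compile(r"^(?P<scale>\d*\.?\d+)\s*(?P<unit>.*)$")


def split_header_line(line):
    """Return (name, scale, unit, raw_value), or None if the line has no key."""
    m = KEY_RE.match(line)
    if m is None:
        return None
    paren = m.group("paren")
    scale, unit = 1.0, paren
    if paren:
        s = SCALE_RE.match(paren)
        if s:
            scale, unit = float(s.group("scale")), s.group("unit")
    return m.group("name"), scale, unit, m.group("value").strip()
```

Lines such as the matrix rows or the comment banner do not start with a name followed by a colon, so the function returns `None` for them. A unit that itself begins with a digit (for example `1/s`) would be misread as a scale factor, so the pattern may need adjusting for the real files.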

Expected output:

fileName = C:\Path\To\File\prefix.289.13.name.00000.hdr
date = 2013-10-16 15:46:16.978 EDT
var1 = 1381952777
var2 = np.array( [10000, 0, 0, 0] )
var3 = 40.0
var4 = 1.03125
var5 = 3
var6 = np.array([[1.1, -0.5, 0.1], [2.1, 0.01, 0.5], [3.2, 0.4, 1.2]])

Ideally, all the variables would be contained in a dictionary, but I'm new enough at this that I'll take suggestions. With the variables I will be able to find the data files, which are huge, and dimension the arrays for them.
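To get from the raw text of each value to the types in the expected output, a best-effort converter is one option. The sketch below is only an illustration: `convert_value` is a hypothetical helper, it assumes arrays are always whitespace-separated numbers inside square brackets, and it leaves paths and dates as plain strings.

```python
def convert_value(text, scale=1.0):
    """Best-effort conversion of a raw header value (assumed formats only)."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        # Bracketed, whitespace-separated numbers become a list
        return [convert_value(tok, scale) for tok in text[1:-1].split()]
    try:
        number = int(text)
    except ValueError:
        try:
            number = float(text)
        except ValueError:
            return text  # paths, dates and other strings stay as text
    # Only multiply when a scale factor was given, so plain ints stay ints
    return number * scale if scale != 1.0 else number
```

The returned lists can be wrapped in `np.array(...)` afterwards if NumPy arrays are wanted, and the results stored in a dictionary keyed by variable name.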

My Attempts so far:

I'm using Python to parse the file. My first approach was

hdr = r'C:\Path\To\File\prefix.289.13.name.00000.hdr'
with open(hdr, 'r') as header:
    for line in header:
        # Stop at the first Line Feed or Carriage Return
        if line in ['\n', '\r\n']:
            break
        else:
            (' '.join(line.strip().split(':')).split())

which does a good enough job of giving me the variable name as a first element of the list and the value as a last element, as long as it's not an array. It botches the filename and date because of the colon, and the arrays because of the square brackets.
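The filename and date break because `split(':')` cuts at every colon. A small sketch of one alternative is to cut at the first colon only with `str.partition`. The function name `split_on_first_colon` is made up for illustration, and the approach assumes variable names and units never contain a colon themselves.

```python
def split_on_first_colon(line):
    """Split 'key : value' at the first colon only; later colons stay in the value."""
    key, sep, value = line.partition(":")
    if not sep:
        return None  # no colon at all, e.g. a matrix row
    return key.strip(), value.strip()
```

With this, the drive letter in the path and the time in the date remain part of the value, and bracketed array values come through as a single string that can be handled separately.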

My second attempt involved regular expressions:

import re
hdr = r'C:\Path\To\File\prefix.289.13.name.00000.hdr'
with open(hdr, 'r') as header:
    for line in header:
        # Stop at the first Line Feed or Carriage Return
        if line in ['\n', '\r\n']:
            break
        else:
            m = re.search('\w*', line)
            if m:
                m.group()
            else:
                print 'No match'

With this approach I successfully got the variable names up until the last part of the file, where the vectors are not preceded by a variable name, which output an empty string. I changed the regular expression to \w+ and then the last part output the first digit of the first element of the vector. It was at this point that I admitted to myself that I was no better than a blindfolded person taking swings at a piñata. So here I am.
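The behaviour described here follows from how `re.search` treats the two quantifiers, and it can be reproduced on a single matrix row. The snippet below is only a demonstration of that behaviour; the variable names are arbitrary.

```python
import re

row = "[ 1.1 -0.5 0.1 ]"

# \w* may match zero characters, so search() succeeds at position 0 with ''
empty = re.search(r"\w*", row).group()

# \w+ needs at least one word character, so search() scans ahead to the '1'
first_digit = re.search(r"\w+", row).group()

# match() is anchored at the start, so a bracketed row is rejected outright
anchored = re.match(r"\w+", row)

# A named line still yields its variable name
name = re.match(r"\w+", "var5 :   3").group()
```

Using `re.match` instead of `re.search` gives `None` for the matrix rows, which makes them easy to tell apart from lines that start with a variable name.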

My question is, how should I approach this problem? It's a vague question, but all the other questions I've found on this site about parsing have nicely formatted files.

sabhiram · Accepted Answer

Here is some pseudo-code (assumes your header will NEVER have errors):

import re

# I like getting the lines into a list, so I can
# more freely manipulate the index of the line I
# am messing with.
with open(fpath, "r") as file_in:
    lines = file_in.readlines()

out_lines = []
re_STATIC   = re.compile(r"^([^\s]+)\s+:\s+(.*)$")
re_VAR      = ...  # regex to detect the var name, unit multiplier and unit value
re_VAR_SIZE = ...  # regex to detect that a variable-sized array is upon us

# Walk the list with an explicit index so the array branch can skip ahead
idx = 0
while idx < len(lines):
    line = lines[idx]

    matches_static = re_STATIC.match(line)
    if matches_static:
        out_lines.append("%s = %s" % (matches_static.group(1), matches_static.group(2)))

    matches_regular_var = re_VAR.match(line)
    if matches_regular_var:
        ...

    matches_variable_size = re_VAR_SIZE.match(line)
    if matches_variable_size:
        var_name = matches_variable_size.group(1)
        arr_size = int(matches_variable_size.group(2))

        # Here we can increment index as we see fit
        arr_list = []
        for j in range(arr_size):
            idx += 1
            arr_list.append(lines[idx].strip())
        out_lines.append("%s = np.array(%s)" % (var_name, ",".join(arr_list)))

    idx += 1

Note: This is probably riddled with errors, but you should get the general idea :)
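For reference, the index-walking idea can be sketched as a small function that runs on a list of lines. This is not the exact method above: `read_header` is a hypothetical name, values are kept as raw strings, and instead of reading a row count it gathers every bracketed row that follows a key with an empty value. It also assumes, as the question's own code does, that the first blank line ends the variable section.

```python
def read_header(lines):
    """Walk the lines by index; return {key: raw value or list of raw rows}."""
    header = {}
    idx = 0
    while idx < len(lines):
        line = lines[idx].strip()
        idx += 1
        if not line:
            break  # assumed: the first blank line ends the variable section
        key, sep, value = line.partition(":")
        if not sep:
            continue  # stray line without a colon
        key, value = key.strip(), value.strip()
        if value:
            header[key] = value
            continue
        # Empty value: gather the bracketed rows that follow
        rows = []
        while idx < len(lines) and lines[idx].strip().startswith("["):
            rows.append(lines[idx].strip())
            idx += 1
        header[key] = rows
    return header
```

Calling it as `read_header(open(fpath).readlines())` gives a dictionary whose keys still include any units in parentheses; splitting those off and converting the values to numbers can be done as a second pass.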
