Write the values of each column after its header on a single line

Question

I have a text file such as:

blahhh blaahhh blahhh
 some thing write this long 23.78, lat 45.45
      g.m.  occ/yr  r(event)   g.m. occ/yr   r(event)
      0.125  0.254   12.587    0.258 2.568   1.369
      0.785  0.365   10.258    0.897 2.987   9.365
something note write here blahh blahhh blahhh

I want a single output line such as the following:

long 23.78 lat 45.45 g.m. 0.125, 0.785 occ/yr 0.254, 0.365 r(event) 12.587,10.258 g.m 0.258, 0.897 occ/yr 2.568, 2.987 r(event) 1.369, 9.365

This is my code:

file = open('geotechnic.txt').readlines()
i =0
while i < len(file):
    for line in file:
        wordList = re.sub("[^\w\./()]", " ",  line).split()
        try:
            print wordList[i]
        except:
            pass
i+=1

Joe Young · Accepted Answer

The following will have to be adapted to your use case:

parsegeo.py

import re

data = '''blahhh blaahhh blahhh
 some thing write this long 23.78, lat 45.45
      g.m.  occ/yr  r(event)   g.m. occ/yr   r(event)
      0.125  0.254   12.587    0.258 2.568   1.369
      0.785  0.365   10.258    0.897 2.987   9.365
something note write here blahh blahhh blahhh'''

lines = data.split('\n')
matchobj = re.match('^.*(long \d+\.\d+),\s+(lat \d+\.\d+)', lines[1])
longval = matchobj.group(1)
latval = matchobj.group(2)

headers = lines[2].strip().split()
dataline1 = lines[3].strip().split()
dataline2 = lines[4].strip().split()

zippeddata = zip(dataline1, dataline2)

outputlist = [longval, latval]
for i in range(0, len(headers)):
    segment = '{header} {valtuple}'.format(header=headers[i], valtuple=', '.join(zippeddata[i]))
    outputlist.append(segment)

print " ".join(outputlist)

Output:

(parsegeo)macbook:parsegeo user$ python parsegeo.py
long 23.78 lat 45.45 g.m. 0.125, 0.785 occ/yr 0.254, 0.365 r(event) 12.587, 10.258 g.m. 0.258, 0.897 occ/yr 2.568, 2.987 r(event) 1.369, 9.365

What's happening:

You'll have to adapt this to work with your readlines() call, as I'm just using a long string as the data source. I split the data source on the newline character to get individual lines and assign them to the lines list.
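
A minimal sketch of that adaptation, assuming the file follows the same layout as the sample string. The helper name read_lines is made up for illustration and is not part of the original script:

```python
def read_lines(path):
    """Return the file's lines without their trailing newline characters."""
    with open(path) as f:
        # splitlines() gives the same kind of list as data.split('\n'),
        # but without an empty last element when the file ends in a newline.
        return f.read().splitlines()
```

With this in place, lines = read_lines('geotechnic.txt') could stand in for the data.split step, and the rest of the script would stay the same.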

I skip the first line. On the second line I use a regular expression with capture groups: the first group (denoted by the parentheses) captures the text long followed by a float, and the second group captures lat followed by its float. Both capture groups are accessible via the matchobj variable.
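
To see the capture groups in isolation, here is a small sketch that applies the same pattern to the sample's second line. The raw-string prefix and the None check are additions for illustration, not part of the original script:

```python
import re

line = ' some thing write this long 23.78, lat 45.45'

# Same pattern as in parsegeo.py, written as a raw string so the
# backslashes reach the regex engine untouched.
matchobj = re.match(r'^.*(long \d+\.\d+),\s+(lat \d+\.\d+)', line)
if matchobj is None:
    # re.match returns None when the line does not fit the pattern.
    raise ValueError('no long/lat pair found in: {!r}'.format(line))

longval, latval = matchobj.groups()
print(longval)  # long 23.78
print(latval)   # lat 45.45
```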

On the next 3 lines, I use strip to remove extraneous whitespace, and use split to tokenize the remaining data (splitting on the default whitespace) and assign the tokens to lists.

Next, I zip the two data-line lists together to form a list of 2-tuples, one tuple per column.
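
A short sketch of what the zip step produces, using the two data rows from the sample. The list() call is an addition: it is assumed to be needed on Python 3, where zip() returns an iterator that cannot be indexed, and it is harmless on Python 2:

```python
dataline1 = ['0.125', '0.254', '12.587', '0.258', '2.568', '1.369']
dataline2 = ['0.785', '0.365', '10.258', '0.897', '2.987', '9.365']

# Pair the values position by position: element i holds column i
# from both rows.
zippeddata = list(zip(dataline1, dataline2))

print(zippeddata[0])  # ('0.125', '0.785')
print(zippeddata[2])  # ('12.587', '10.258')
```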

I iterate over the indices of the headers list and, for each column, append to outputlist a segment containing the column header followed by that column's two values, joined with a comma and a space.

Once the loop is done, I join the outputlist list using whitespace and print it out.
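
The script above uses Python 2 syntax (the print statement and an indexable zip result). As a hedged sketch, the same steps could be wrapped in a function that runs on Python 3 as well. The function name summarize is hypothetical, and the code assumes the same line layout as the sample:

```python
import re

def summarize(lines):
    """Build the one-line summary from lines laid out like the sample."""
    matchobj = re.match(r'^.*(long \d+\.\d+),\s+(lat \d+\.\d+)', lines[1])
    if matchobj is None:
        raise ValueError('second line has no long/lat pair')
    outputlist = list(matchobj.groups())

    headers = lines[2].split()
    # One tuple per column, holding that column's value from each data row.
    columns = zip(lines[3].split(), lines[4].split())

    # Iterating over the pairs directly avoids indexing into zip().
    for header, values in zip(headers, columns):
        outputlist.append('{} {}'.format(header, ', '.join(values)))
    return ' '.join(outputlist)

sample = [
    'blahhh blaahhh blahhh',
    ' some thing write this long 23.78, lat 45.45',
    '      g.m.  occ/yr  r(event)   g.m. occ/yr   r(event)',
    '      0.125  0.254   12.587    0.258 2.568   1.369',
    '      0.785  0.365   10.258    0.897 2.987   9.365',
    'something note write here blahh blahhh blahhh',
]
print(summarize(sample))
```

Returning the string instead of printing it inside the function makes the result easy to write to a file or collect in a list when there are many blocks.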

EDIT: Solution for parsing the data file linked in the comments.

I've included below a solution for parsing the data file you linked in the comments. You didn't specify which block of data you wanted parsed (the zero attenuation variability data or the variability in atten data), so I only display the zero attenuation variability data. The variability in atten data has been tokenized and added to the var_atten_data list. If you want to display it, you'll have to zip(), join() and string-format that list yourself; I'll leave that as an exercise for you.

updated parsegeo.py

import re

with open('geotechnic.txt', 'r') as f:
    in_attenuation_block = skipped_first = skipped_second = parsed_header = False
    longval = latval = None
    zero_atten_headers = []
    var_atten_headers = []
    zero_atten_data = []
    var_atten_data = []
    for line in f:
        matchobj = re.match('^.*site at long\s+(\d+\.\d+),\s+lat\s+(\d+\.\d+)', line)
        if matchobj:
            longval = matchobj.group(1)
            latval = matchobj.group(2)
            in_attenuation_block = True
            continue
        if in_attenuation_block:
            if skipped_first:
                if skipped_second:
                    data_line = line.strip().split()
                    if len(data_line) > 5:
                        if 'g.m.' in data_line[0]:
                            zero_atten_headers = data_line[0:5]
                            var_atten_headers = data_line[5:]
                        elif re.match('^\d+\.\d+\s+\d+\.\d', line.strip()):
                            zero_atten_data.append(data_line[0:5])
                            var_atten_data.append(data_line[5:])
                        elif re.match('^total yearly events', line.strip()):
                            # Reached the end of data block, print out summary
                            zippeddata = zip(*zero_atten_data)
                            outputlist = ["long", longval, "lat", latval]
                            for i in range(0, len(zero_atten_headers)):
                                segment = '{header} {valtuple}'.format(header=zero_atten_headers[i], valtuple=', '.join(zippeddata[i]))
                                outputlist.append(segment)
                            print " ".join(outputlist)
                            # Reset all of the flags, arrays, and vars for the next block of data
                            in_attenuation_block = skipped_first = skipped_second = parsed_header = False
                            longval = latval = None
                            zero_atten_headers = []
                            var_atten_headers = []
                            zero_atten_data = []
                            var_atten_data = []
                            continue
                        else:
                            print 'Unable to parse current line. Skipping to next line.  Current line: {}'.format(line)
                    else:
                        print 'Unable to parse current line. Skipping to next line.  Current line: {}'.format(line)
                else:
                    skipped_second = True
            else:
                skipped_first = True

Truncated output (5 lines):

(parsegeo)macbook:parsegeo user$ python parsegeo.py
long 46.766 lat 32.305 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24 occ/yr 0.15773, 0.00734, 0.00084, 0.00030, 0.00011, 0.00004, 0.00002, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.00865, 0.00132, 0.00047, 0.00017, 0.00006, 0.00002, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 19.2, 126.4, 352.8, 974.5, 2574.4, 8231.0, 70366.1, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 115.6, 759.7, 2120.4, 5856.8, 15472.2, 49469.3, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 46.884 lat 32.306 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30 occ/yr 0.15085, 0.01156, 0.00285, 0.00070, 0.00023, 0.00010, 0.00005, 0.00002, 0.00001, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.01553, 0.00397, 0.00112, 0.00042, 0.00019, 0.00009, 0.00004, 0.00002, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 10.7, 41.9, 148.2, 394.3, 879.0, 1798.1, 4235.4, 8361.3, 25064.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 64.4, 251.6, 890.6, 2369.5, 5283.2, 10806.6, 25455.0, 50252.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 46.765 lat 32.405 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26 occ/yr 0.15628, 0.00842, 0.00111, 0.00036, 0.00012, 0.00006, 0.00002, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.01010, 0.00168, 0.00057, 0.00021, 0.00009, 0.00003, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 16.5, 98.8, 292.0, 800.9, 1930.1, 5871.5, 19010.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 99.0, 593.8, 1755.0, 4813.5, 11599.9, 35288.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 46.883 lat 32.406 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34 occ/yr 0.14909, 0.01221, 0.00351, 0.00101, 0.00032, 0.00013, 0.00006, 0.00003, 0.00002, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.01730, 0.00509, 0.00158, 0.00058, 0.00026, 0.00012, 0.00006, 0.00003, 0.00001, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 9.6, 32.7, 105.0, 287.4, 646.3, 1349.7, 2697.5, 5679.3, 11947.6, 31177.0, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 57.8, 196.4, 631.2, 1727.5, 3884.1, 8111.6, 16212.1, 34133.4, 71806.2, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 47.700 lat 33.300 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22 occ/yr 0.15767, 0.00717, 0.00095, 0.00046, 0.00011, 0.00003, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.00872, 0.00155, 0.00060, 0.00015, 0.00003, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 19.1, 107.4, 275.1, 1143.4, 5364.2, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 114.7, 645.2, 1653.4, 6872.1, 32239.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
...
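
For the variability in atten block, which the answer leaves as an exercise, one possible approach is a small helper that formats any header list and row list the same way. This is only a sketch: the name format_block is hypothetical, and the header and row values below are made up for illustration rather than taken from the data file:

```python
def format_block(longval, latval, headers, rows):
    """Join each column's values behind its header on one line."""
    outputlist = ['long', longval, 'lat', latval]
    # zip(*rows) transposes the list of rows into one tuple per column.
    for header, values in zip(headers, zip(*rows)):
        outputlist.append('{} {}'.format(header, ', '.join(values)))
    return ' '.join(outputlist)

# Made-up values; in the script these would be var_atten_headers
# and var_atten_data.
headers = ['g.m.', 'occ/yr']
rows = [['0.02', '0.15'], ['0.04', '0.01']]
print(format_block('46.766', '32.305', headers, rows))
# long 46.766 lat 32.305 g.m. 0.02, 0.04 occ/yr 0.15, 0.01
```

Calling it with var_atten_headers and var_atten_data just before the script resets its lists should give a second summary line for each site, assuming every data row has as many values as there are headers.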
