Reputation: 115
I have a text file such as:
blahhh blaahhh blahhh
some thing write this long 23.78, lat 45.45
g.m. occ/yr r(event) g.m. occ/yr r(event)
0.125 0.254 12.587 0.258 2.568 1.369
0.785 0.365 10.258 0.897 2.987 9.365
something note write here blahh blahhh blahhh
I want a single line of output such as the one below:
long 23.78 lat 45.45 g.m. 0.125, 0.785 occ/yr 0.254, 0.365 r(event) 12.587,10.258 g.m 0.258, 0.897 occ/yr 2.568, 2.987 r(event) 1.369, 9.365
This is my code:
import re

file = open('geotechnic.txt').readlines()
i = 0
while i < len(file):
    for line in file:
        # Replace everything except word characters, '.', '/', '(' and ')' with spaces.
        wordList = re.sub(r"[^\w\./()]", " ", line).split()
        try:
            print(wordList[i])
        except:
            pass
    i += 1
Upvotes: 0
Views: 47
Reputation: 5875
The following will have to be adapted to your use case:
parsegeo.py
import re

data = '''blahhh blaahhh blahhh
some thing write this long 23.78, lat 45.45
g.m. occ/yr r(event) g.m. occ/yr r(event)
0.125 0.254 12.587 0.258 2.568 1.369
0.785 0.365 10.258 0.897 2.987 9.365
something note write here blahh blahhh blahhh'''

lines = data.split('\n')

# Capture "long <float>" and "lat <float>" from the second line.
matchobj = re.match(r'^.*(long \d+\.\d+),\s+(lat \d+\.\d+)', lines[1])
longval = matchobj.group(1)
latval = matchobj.group(2)

# Tokenize the header row and the two data rows on whitespace.
headers = lines[2].strip().split()
dataline1 = lines[3].strip().split()
dataline2 = lines[4].strip().split()

# Pair the values column by column; list() keeps the result indexable on Python 3.
zippeddata = list(zip(dataline1, dataline2))

outputlist = [longval, latval]
for i in range(len(headers)):
    segment = '{header} {valtuple}'.format(header=headers[i], valtuple=', '.join(zippeddata[i]))
    outputlist.append(segment)

print(" ".join(outputlist))
Output:
(parsegeo)macbook:parsegeo user$ python parsegeo.py
long 23.78 lat 45.45 g.m. 0.125, 0.785 occ/yr 0.254, 0.365 r(event) 12.587, 10.258 g.m. 0.258, 0.897 occ/yr 2.568, 2.987 r(event) 1.369, 9.365
What's happening:
You'll have to adapt this to work with your readlines call, as I'm just using a long string as the data source. I split the data source on the newline character to get individual lines and assign them to the lines list.
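As a rough sketch of that adaptation, assuming the file has the same six-line layout as the sample: the temporary file below only stands in for geotechnic.txt so that the snippet runs on its own, and the variable names are illustrative.

```python
import os
import tempfile

sample = """blahhh blaahhh blahhh
some thing write this long 23.78, lat 45.45
g.m. occ/yr r(event) g.m. occ/yr r(event)
0.125 0.254 12.587 0.258 2.568 1.369
0.785 0.365 10.258 0.897 2.987 9.365
something note write here blahh blahhh blahhh
"""

# Stand-in for geotechnic.txt so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# readlines() keeps the trailing "\n" on every line, so strip it off
# to get a list shaped like the one produced by splitting the string.
with open(path) as f:
    lines = [line.rstrip("\n") for line in f.readlines()]

os.remove(path)
print(lines[1])
```

With your own file you would drop the temporary-file part and open 'geotechnic.txt' directly.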
I skip the first line. On the second line I use a regular expression with capture groups to capture the text long followed by its float into the first capture group (denoted by the parentheses), and the text lat followed by its float into the second capture group. These capture groups are accessible via the matchobj variable.
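A small standalone illustration of those capture groups, using the sample line from the question. The None check is an addition of mine, not part of the script: re.match returns None when a line does not fit the pattern.

```python
import re

line = "some thing write this long 23.78, lat 45.45"

# Each parenthesised part of the pattern is one capture group.
matchobj = re.match(r'^.*(long \d+\.\d+),\s+(lat \d+\.\d+)', line)

if matchobj is None:
    # No match: the line does not contain "long <float>, lat <float>".
    print("no coordinates found")
else:
    print(matchobj.group(1))  # long 23.78
    print(matchobj.group(2))  # lat 45.45
```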
On the next 3 lines, I use strip to remove extraneous whitespace, and use split to tokenize the remaining data (splitting on the default whitespace) and assign the tokens to lists.
Next, I zip the two dataline lists together to form a list of 2-tuples.
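To make the tokenizing and zipping steps concrete, here is a minimal sketch on the two data rows from the question (the extra leading and trailing whitespace is added only to show what strip does). Note that on Python 3 zip returns an iterator, so it is wrapped in list() to allow indexing.

```python
row1 = "  0.125 0.254 12.587 0.258 2.568 1.369\n"
row2 = "0.785 0.365 10.258 0.897 2.987 9.365  "

# strip() drops surrounding whitespace, split() breaks on runs of whitespace.
tokens1 = row1.strip().split()
tokens2 = row2.strip().split()

# zip pairs the n-th token of each row, i.e. one tuple per column.
columns = list(zip(tokens1, tokens2))

print(columns[0])  # ('0.125', '0.785')
print(columns[2])  # ('12.587', '10.258')
```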
I iterate over the number of elements in the headers list and append to the outputlist list a segment containing the column header, followed by the 2 dataline values for that column, joined together with a comma and a space.
Once the loop is done, I join the outputlist list using whitespace and print it out.
EDIT: Solution for parsing the data file linked in the comments.
I've included below a solution for parsing the data file you linked in the comments. You didn't specify which block of data you wanted parsed (the zero attenuation variability data or the variability in atten data), so I only display the zero attenuation variability data. The variability in atten data has been tokenized and added to the var_atten_data list. If you want to display it, you'll have to zip(), join() and string-format that list yourself. I'll leave that as an exercise for you.
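For that last step, a minimal sketch of the zip(), join() and format() combination, separate from the script itself. The header names and numbers here are made up for illustration; the real ones would come from var_atten_headers and var_atten_data as filled in by the parser.

```python
# Hypothetical values: the real headers and rows come from the parsed file.
var_atten_headers = ['g.m.', 'occ/yr', 'exc/yr']
var_atten_data = [
    ['0.02', '0.15', '0.01'],
    ['0.04', '0.01', '0.00'],
]

# zip(*rows) transposes row-wise tokens into one tuple per column.
columns = list(zip(*var_atten_data))

segments = []
for header, column in zip(var_atten_headers, columns):
    # e.g. "g.m. 0.02, 0.04"
    segments.append('{} {}'.format(header, ', '.join(column)))

summary = ' '.join(segments)
print(summary)
# g.m. 0.02, 0.04 occ/yr 0.15, 0.01 exc/yr 0.01, 0.00
```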
updated parsegeo.py
import re

with open('geotechnic.txt', 'r') as f:
    in_attenuation_block = skipped_first = skipped_second = parsed_header = False
    longval = latval = None
    zero_atten_headers = []
    var_atten_headers = []
    zero_atten_data = []
    var_atten_data = []
    for line in f:
        # A "site at long ..., lat ..." line starts a new block of data.
        matchobj = re.match(r'^.*site at long\s+(\d+\.\d+),\s+lat\s+(\d+\.\d+)', line)
        if matchobj:
            longval = matchobj.group(1)
            latval = matchobj.group(2)
            in_attenuation_block = True
            continue
        if in_attenuation_block:
            # The first two lines after the coordinates are skipped.
            if skipped_first:
                if skipped_second:
                    data_line = line.strip().split()
                    if len(data_line) > 5:
                        if 'g.m.' in data_line[0] and len(data_line) > 5:
                            # Header row: first 5 columns are zero attenuation, the rest variable.
                            zero_atten_headers = data_line[0:5]
                            var_atten_headers = data_line[5:]
                        elif re.match(r'^\d+\.\d+\s+\d+\.\d', line.strip()):
                            zero_atten_data.append(data_line[0:5])
                            var_atten_data.append(data_line[5:])
                        elif re.match(r'^total yearly events', line.strip()):
                            # Reached the end of data block, print out summary
                            # list() keeps the zipped columns indexable on Python 3.
                            zippeddata = list(zip(*zero_atten_data))
                            outputlist = ["long", longval, "lat", latval]
                            for i in range(len(zero_atten_headers)):
                                segment = '{header} {valtuple}'.format(header=zero_atten_headers[i], valtuple=', '.join(zippeddata[i]))
                                outputlist.append(segment)
                            print(" ".join(outputlist))
                            # Reset all of the flags, arrays, and vars for the next block of data
                            in_attenuation_block = skipped_first = skipped_second = parsed_header = False
                            longval = latval = None
                            zero_atten_headers = []
                            var_atten_headers = []
                            zero_atten_data = []
                            var_atten_data = []
                            continue
                        else:
                            print('Unable to parse current line. Skipping to next line. Current line: {}'.format(line))
                    else:
                        print('Unable to parse current line. Skipping to next line. Current line: {}'.format(line))
                else:
                    skipped_second = True
            else:
                skipped_first = True
Truncated output (5 lines):
(parsegeo)macbook:parsegeo user$ python parsegeo.py
long 46.766 lat 32.305 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24 occ/yr 0.15773, 0.00734, 0.00084, 0.00030, 0.00011, 0.00004, 0.00002, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.00865, 0.00132, 0.00047, 0.00017, 0.00006, 0.00002, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 19.2, 126.4, 352.8, 974.5, 2574.4, 8231.0, 70366.1, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 115.6, 759.7, 2120.4, 5856.8, 15472.2, 49469.3, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 46.884 lat 32.306 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30 occ/yr 0.15085, 0.01156, 0.00285, 0.00070, 0.00023, 0.00010, 0.00005, 0.00002, 0.00001, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.01553, 0.00397, 0.00112, 0.00042, 0.00019, 0.00009, 0.00004, 0.00002, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 10.7, 41.9, 148.2, 394.3, 879.0, 1798.1, 4235.4, 8361.3, 25064.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 64.4, 251.6, 890.6, 2369.5, 5283.2, 10806.6, 25455.0, 50252.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 46.765 lat 32.405 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26 occ/yr 0.15628, 0.00842, 0.00111, 0.00036, 0.00012, 0.00006, 0.00002, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.01010, 0.00168, 0.00057, 0.00021, 0.00009, 0.00003, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 16.5, 98.8, 292.0, 800.9, 1930.1, 5871.5, 19010.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 99.0, 593.8, 1755.0, 4813.5, 11599.9, 35288.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 46.883 lat 32.406 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34 occ/yr 0.14909, 0.01221, 0.00351, 0.00101, 0.00032, 0.00013, 0.00006, 0.00003, 0.00002, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.01730, 0.00509, 0.00158, 0.00058, 0.00026, 0.00012, 0.00006, 0.00003, 0.00001, 0.00001, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 9.6, 32.7, 105.0, 287.4, 646.3, 1349.7, 2697.5, 5679.3, 11947.6, 31177.0, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 57.8, 196.4, 631.2, 1727.5, 3884.1, 8111.6, 16212.1, 34133.4, 71806.2, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
long 47.700 lat 33.300 g.m. 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22 occ/yr 0.15767, 0.00717, 0.00095, 0.00046, 0.00011, 0.00003, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 exc/yr 0.00872, 0.00155, 0.00060, 0.00015, 0.00003, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000 r(events) 19.1, 107.4, 275.1, 1143.4, 5364.2, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9 r(yrs) 114.7, 645.2, 1653.4, 6872.1, 32239.4, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9, 99999.9
...
Upvotes: 1