Reputation: 13
I am trying to read a large text file containing variable names and their corresponding values (see below for a small example). The names are all upper case, and each value is usually separated from its name by periods and whitespace; if the variable name is too long, it is separated by whitespace only.
WATER DEPTH .......... 20.00 M TENSION AT TOUCHDOWN . 382.47 KN
TOUCHDOWN X-COORD. ... -206.75 M BOTTOM SLOPE ANGLE ... 0.000 DEG
PROJECTED SPAN LENGTH 166.74 M PIPE LENGTH GAIN ..... 1.72 M
I am able to find the values using the following expression:
line = ' PROJECTED SPAN LENGTH 166.74 M PIPE LENGTH GAIN ..... 1.72 M \n'
re.findall(r"[-+]?\d*\.\d+|\d+", line)
['166.74', '1.72']
But when I try to extract the variable names using the expression below, I get leading and trailing whitespace, which I would like to leave out.
re.findall(r'(?<=\s.)[A-Z\s]+', line)
[' PROJECTED SPAN LENGTH ', ' PIPE LENGTH GAIN ', ' ', ' \n']
I believe it needs something like ^\s, but I can't get it to work. Once this works, I'd like to store the data in a DataFrame, with the variable names as the index and the values as a column.
Upvotes: 1
Views: 1738
Reputation: 10746
Use [A-Z]{2,}(?:\s+[A-Z]+)*
[A-Z]{2,}
matches an uppercase word at least 2 letters long
(?:\s+[A-Z]+)*
is a non-capturing group that matches any further words when the label has more than one word
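A minimal sketch of how this pattern behaves with re.findall(), using the sample lines from the question. The variable names (label_rx, labels, other) are illustrative only. Because the group is non-capturing, findall() returns the whole match rather than just the group contents.

```python
import re

# Pattern from the answer above: one uppercase word of 2+ letters,
# optionally followed by more whitespace-separated uppercase words.
label_rx = re.compile(r"[A-Z]{2,}(?:\s+[A-Z]+)*")

line = " PROJECTED SPAN LENGTH 166.74 M PIPE LENGTH GAIN ..... 1.72 M \n"
labels = label_rx.findall(line)
print(labels)  # ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']

# Caveat: the single-letter unit M is skipped, but longer units such as KN
# are uppercase words too, so they show up in the result.
other = label_rx.findall(
    "WATER DEPTH .......... 20.00 M TENSION AT TOUCHDOWN . 382.47 KN"
)
print(other)  # ['WATER DEPTH', 'TENSION AT TOUCHDOWN', 'KN']
```

The matches carry no leading or trailing whitespace, but multi-letter units would still need to be filtered out afterwards.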
EDIT
To handle the case in your comment I'd recommend:
[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*
Just make sure there is at least one space after the last period in R.O.W. and before the ...
[A-Z-\/]{2,}
matches a run of 2 or more characters made up of uppercase letters, - and /
(?:\s*[A-Z-\/]+(?:\.)*)*
is a non-capturing group for additional words and/or words with periods in them
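As a rough illustration, here is the extended pattern applied to the hyphenated label from the question's sample. The names ext_rx, found and cleaned are hypothetical, and the rstrip() step is an added suggestion rather than part of the answer.

```python
import re

# Extended pattern from the answer: also allows '-' and '/' inside words,
# and periods directly after a word.
ext_rx = re.compile(r"[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*")

line = "TOUCHDOWN X-COORD. ... -206.75 M BOTTOM SLOPE ANGLE ... 0.000 DEG"
found = ext_rx.findall(line)
print(found)  # ['TOUCHDOWN X-COORD.', 'BOTTOM SLOPE ANGLE', 'DEG']

# The period that directly follows a word is part of the match, so strip it
# if the bare label is wanted (assumption: labels never end in a real period).
cleaned = [label.rstrip(".") for label in found]
print(cleaned)  # ['TOUCHDOWN X-COORD', 'BOTTOM SLOPE ANGLE', 'DEG']
```

As with the shorter pattern, a multi-letter unit such as DEG is matched as well, and the leading minus sign of -206.75 is not picked up only because a single - followed by a digit is too short to match.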
Upvotes: 0
Reputation: 43169
You can use the following expression along with re.finditer():
(?P<category>[A-Z][A-Z- ]+[A-Z])
[. ]+
(?P<value>-?\d[.\d]+)\s+
(?P<unit>M|DEG|KN)
In Python this would be:
import re
rx = re.compile(r'''
(?P<category>[A-Z][A-Z- ]+[A-Z])  # label: uppercase letters, hyphens, spaces
[. ]+                             # filler periods and/or spaces
(?P<value>-?\d[.\d]+)\s+          # signed number, then whitespace
(?P<unit>M|DEG|KN)                # units that occur in the sample
''', re.VERBOSE)
string = '''
WATER DEPTH .......... 20.00 M TENSION AT TOUCHDOWN . 382.47 KN
TOUCHDOWN X-COORD. ... -206.75 M BOTTOM SLOPE ANGLE ... 0.000 DEG
PROJECTED SPAN LENGTH 166.74 M PIPE LENGTH GAIN ..... 1.72 M
'''
matches = [(m.group('category'), m.group('value'), m.group('unit'))
           for m in rx.finditer(string)]
print(matches)
# [('WATER DEPTH', '20.00', 'M'), ('TENSION AT TOUCHDOWN', '382.47', 'KN'), ('TOUCHDOWN X-COORD', '-206.75', 'M'), ('BOTTOM SLOPE ANGLE', '0.000', 'DEG'), ('PROJECTED SPAN LENGTH', '166.74', 'M'), ('PIPE LENGTH GAIN', '1.72', 'M')]
See a demo on ideone.com.
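The question also asks for the result in a DataFrame with the variable names as the index. One possible way to get there is sketched below, assuming pandas is installed. The helper names parse_records and to_dataframe, the column names, and the float conversion are illustrative choices, not part of the answer.

```python
import re

# Named-group pattern in the style of the answer above.
rx = re.compile(r"""
    (?P<category>[A-Z][A-Z- ]+[A-Z])   # label
    [. ]+                              # filler periods and/or spaces
    (?P<value>-?\d[.\d]+)\s+           # signed number, then whitespace
    (?P<unit>M|DEG|KN)                 # units from the sample only
""", re.VERBOSE)

sample = """
WATER DEPTH .......... 20.00 M TENSION AT TOUCHDOWN . 382.47 KN
TOUCHDOWN X-COORD. ... -206.75 M BOTTOM SLOPE ANGLE ... 0.000 DEG
PROJECTED SPAN LENGTH 166.74 M PIPE LENGTH GAIN ..... 1.72 M
"""

def parse_records(text):
    """Return one dict per label/value/unit triple found in text."""
    return [
        {
            "name": m.group("category"),
            "value": float(m.group("value")),  # numeric rather than string
            "unit": m.group("unit"),
        }
        for m in rx.finditer(text)
    ]

def to_dataframe(records):
    """Build a DataFrame indexed by variable name (requires pandas)."""
    import pandas as pd  # third-party; imported here so parsing works without it
    return pd.DataFrame(records).set_index("name")

records = parse_records(sample)
print(records[0])  # {'name': 'WATER DEPTH', 'value': 20.0, 'unit': 'M'}
```

Calling to_dataframe(records) should then give a frame with value and unit columns. If the real file contains other units, the unit alternation would need to be extended.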
Upvotes: 1
Reputation: 546
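If you ever want to remove leading/trailing whitespace, you can use the .strip() method. Matches that consist only of whitespace become empty strings after stripping, so filter those out as well:
stripped_values = [raw.strip() for raw in re.findall(r'(?<=\s.)[A-Z\s]+', line) if raw.strip()]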
Upvotes: 0