EmielT

Reputation: 13

Extract variable names and values using REGEX in Python from a text file

I am trying to read a large text file containing variable names and their corresponding values (see the small example below). The names are all upper case, and each value is usually separated from its name by periods and whitespace; if the variable name is too long, it is separated by whitespace only.

WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG 

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M  

I am able to find the values using the following expression:

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'
re.findall(r"[-+]?\d*\.\d+|\d+", line)
['166.74', '1.72']

But when I try to extract the variable names using the expression below, I get leading and trailing whitespace, which I would like to leave out.

re.findall(r'(?<=\s.)[A-Z\s]+', line)
[' PROJECTED SPAN LENGTH     ', '      PIPE LENGTH GAIN ', '    ', '   \n']

I believe it should have something like ^\s, but I can't get it to work. Once that works, I'd like to store the data in a dataframe, with the variable names as the index and the values as a column.

Upvotes: 1

Views: 1738

Answers (3)

depperm

Reputation: 10746

Use [A-Z]{2,}(?:\s+[A-Z]+)*

[A-Z]{2,} matches a run of at least 2 uppercase letters, i.e. the first word of the label

(?:\s+[A-Z]+)* is a non-capturing group that matches any further words when the label consists of multiple words
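
As a quick check, here is a minimal sketch that applies this pattern to the sample line from the question with re.findall(). The names pattern and names are only illustrative.

```python
import re

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'

# The group is non-capturing, so findall() returns the whole match for each label.
pattern = re.compile(r"[A-Z]{2,}(?:\s+[A-Z]+)*")

names = pattern.findall(line)
print(names)  # ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']
```

The single-letter unit M is skipped because of {2,}, but longer units such as KN or DEG would also be returned, so on the full file they may need to be filtered out afterwards.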

EDIT

To handle the case in your comment I'd recommend:

[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*

Just make sure there is at least one space between the last period in R.O.W. and the ... that follows it.

[A-Z-\/]{2,} matches a run of at least 2 characters made up of uppercase letters, -, and /

(?:\s*[A-Z-\/]+(?:\.)*)* is a non-capturing group for multiple words and/or words with periods in them
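
A small sketch of how this extended pattern behaves on one of the labels from the question that contains a hyphen and a period. Note that the match keeps the trailing period of the abbreviation; the rstrip('.') step is my own addition, not part of the pattern, and it would also strip a period you might want to keep.

```python
import re

pattern = re.compile(r"[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*")

sample = 'TOUCHDOWN X-COORD. ...   -206.75 M'

raw = pattern.findall(sample)
print(raw)  # ['TOUCHDOWN X-COORD.']

# Optional clean-up (an assumption about what is wanted): drop the trailing period.
labels = [label.rstrip('.') for label in raw]
print(labels)  # ['TOUCHDOWN X-COORD']
```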

Upvotes: 0

Jan

Reputation: 43169

You can use the following expression along with re.finditer():

(?P<category>[A-Z][A-Z- ]+[A-Z])
[. ]+
(?P<value>-?\d[.\d]+)\ 
(?P<unit>M|DEG|KN)

See a demo on regex101.com.


In Python this would be:

import re

rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])
    [. ]+
    (?P<value>-?\d[.\d]+)\ 
    (?P<unit>M|DEG|KN)
''', re.VERBOSE)

string = '''
WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG 

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M  
'''

matches = [(m.group('category'), m.group('value'), m.group('unit'))
           for m in rx.finditer(string)]
print(matches)
# [('WATER DEPTH', '20.00', 'M'), ('TENSION AT TOUCHDOWN', '382.47', 'KN'), ('TOUCHDOWN X-COORD', '-206.75', 'M'), ('BOTTOM SLOPE ANGLE', '0.000', 'DEG'), ('PROJECTED SPAN LENGTH', '166.74', 'M'), ('PIPE LENGTH GAIN', '1.72', 'M')]

See a demo on ideone.com.
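
Since the question asks for a dataframe in the end, here is a rough sketch of one way to get there from these matches. It only uses the standard library and builds a list of records plus a lookup by name; the names records and by_name are mine. The pattern is the same as above but written on one line, with [ ] standing in for the escaped space. If pandas is installed (an assumption), such a list of dicts can be passed to pd.DataFrame(records).set_index("name").

```python
import re

# Same pattern as above on a single line; [ ] replaces the escaped space.
rx = re.compile(
    r"(?P<category>[A-Z][A-Z- ]+[A-Z])[. ]+(?P<value>-?\d[.\d]+)[ ](?P<unit>M|DEG|KN)"
)

text = """
WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN
TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG
PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M
"""

# One record per match, with the value converted to float.
records = [
    {"name": m.group("category"),
     "value": float(m.group("value")),
     "unit": m.group("unit")}
    for m in rx.finditer(text)
]

# Lookup keyed by variable name (hypothetical helper structure).
by_name = {rec["name"]: rec for rec in records}
print(by_name["TOUCHDOWN X-COORD"]["value"])  # -206.75

# With pandas available, something like this should give the desired frame:
# df = pd.DataFrame(records).set_index("name")
```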

Upvotes: 1

gregbert

Reputation: 546

If you ever want to take out leading/trailing white space, you can use the .strip() method.

Python String strip

stripped_values = [raw.strip() for raw in re.findall(r'(?<=\s.)[A-Z\s]+', line)]
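
One thing to watch for: with the expression from the question, two of the matches consist of whitespace only, so stripping them leaves empty strings in the list. A minimal sketch that also filters those out (the names raw_matches and names are just for illustration):

```python
import re

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'

raw_matches = re.findall(r'(?<=\s.)[A-Z\s]+', line)

# strip() trims each match; the `if` drops matches that were whitespace only.
names = [raw.strip() for raw in raw_matches if raw.strip()]
print(names)  # ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']
```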

Upvotes: 0
