Reputation: 17

Python - Splitting a line of numbers where some of them contain dashes

I have a data set where lines resemble:

G. Smith      2.2 - 1.6     1.0 - 1.5     33 - 55     660
F. John       1.4 - 0.2.                  60          850 (early) - 1100 (late)
R. Kahn       1.9 - 1.4     1.9                       700 - 850
L. Terry      0.6 - 0.35.   1.8.          60          1100 - 1400

I am trying to get the average of the last column. For numbers that are separated by dashes I would like to consider them as their own averages (e.g. 700 - 850 would be counted as its average, 775, when calculating the overall average of the column)

So far my approach has been to split the line and then average the values at the last index every time I loop through a line. However, I can only ever get it to split every value, making the hyphenated numbers 2 different values, or get it to look like its split perfectly but the values aren't separated by commas so I cannot access specific values in the line. Heres what I've tried with the 2nd row output as example (note that I have not even attempted to include averaging calculations as I have not been able to get the line to split correctly):

with open('file.txt', 'r') as f:
    l = f.readline()
    while l:
        current_line = re.split('\W\W\W+', l)
        print(current_line)
        l = f.readline()

Returns:

['F. John', '1.4', '0.2', '60', '850 (early', '1100 (late)\n']

And...

with open('file.txt', 'r') as f:
    l = f.readline()
    while l:
        current_line = re.split('^-\W\W\W+', l)
        print(current_line)
        l = f.readline()

Returns:

 ['F. John           1.4 - 0.2                    60          850 (early) - 1100 (late)\n']

The second outcome is preferable since it is keeping the dashed values together, however as you can see all of the values are grouped together instead of each being separated by a column.

Desired Outcome:

 ['F. John', '1.4 - 0.2', '60', '850 (early) - 1100 (late)\n']

There is also the matter of that "early/late" in parenthesis on one of the lines. I haven't even begun to think about how to get around that, so tips on that as well would be appreciated.

Also, is there a way to get it to recognize blank spaces? I ask this because later on I will also have to get the largest value of the 3rd column. The way I'm thinking now is that if I try to access index 2 of every row (where the 3rd column's values should be) on the 2nd row I will be reading the 4th columns value instead since the 3rd column is missing that value and in the list it just continues to the next value without making a place holder for the empty field.

If this is possible the desired outcome would instead look like this for the 2nd row:

['F. John', '1.4 - 0.2', ' ', '60', '850 (early) - 1100 (late)\n']

or maybe even:

['F. John', '1.4 - 0.2', NULL, '60', '850 (early) - 1100 (late)\n']

Upvotes: 1

Answers (3)

Francisco

Reputation: 503

last_col = re.split("\s{2,}", l)[-1] # eg '850 (early) - 1100 (late)' or '600'
patt = re.compile("(?P<num1>[0-9]+)[a-zA-z(\)\- ]+(?P<num2>[0-9]+)")
g = patt.search(last_col)

if g:
    val = (int(g.group('num1')) + int(g.group('num2'))) / 2
else:
    val = int(last_col)

Upvotes: 1