Crustater Rocks

Reputation: 59

Any suggestions to improve Python string parsing?

I'm running Python 3.6.8. I need to sum values that appear in a log file. Each line may contain 1 to 14 {index, value} pairs; a typical line with 8 pairs is shown in the code below (the variable called 'log_line'). The line format, including the '- -' separator, is consistent. I have working code, but I'm not sure it is the most elegant or best way to parse this string; it feels a bit clunky. Any suggestions?

    import re
    
    # version 1
    log_line = 'Some explanatory text was here:      - -{0, 8} {1, 24} {2, 24} {3, 5} {4, 5} {5, 12} {6, 12} {7, 5}'
    log_line_values = log_line.split('- -')[1]
    values = re.findall(r'{\d+,\s\d+}',log_line_values)
    sum_of_values = 0
    for v in values:
        sum_of_values += int(v.replace('{','').replace('}','').replace(' ','').split(',')[1])
    print(f'1) sum_of_values:{sum_of_values}')

    # version 2, essentially the same, but more concise (some may say confusing)
    sum_of_values = sum([int(v.replace('{','').replace('}','').replace(' ','').split(',')[1]) for v in re.findall(r'{\d+,\s\d+}',log_line.split('- -')[1])])
    print(f'2) sum_of_values:{sum_of_values}')

Upvotes: 0

Views: 104

Answers (3)

aviso

Reputation: 2841

Assuming you've already identified that the line is one that matches the pattern, you can simplify your logic a lot by using a generator expression within sum().

import re

# Compile your regular expression for reuse
# Just pull out the last value from each pair
re_extract_val = re.compile(r'{\d+, (\d+)}')

log_line = 'Some explanatory text was here:      - -{0, 8} {1, 24} {2, 24} {3, 5} {4, 5} {5, 12} {6, 12} {7, 5}'

# Use a generator expression within sum() to add all values
sum_of_values = sum(int(val) for val in re_extract_val.findall(log_line))

You could also use map(), but I find the generator expression clearer:

sum_of_values = sum(map(int, re_extract_val.findall(log_line)))
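
If you haven't yet checked that a line matches the pattern, a small helper can do the check and the sum together. This is only a sketch: the name sum_log_line and the choice to return None for lines without the '- -' separator are my own assumptions, not something from the question.

```python
import re

# Capture only the second number of each {index, value} pair.
PAIR_VALUE = re.compile(r'{\d+, (\d+)}')

def sum_log_line(line, separator='- -'):
    """Sum the pair values after the separator; return None if the separator is absent.

    Hypothetical helper: the name and the None convention are assumptions.
    """
    # partition() returns (head, separator, tail); the middle item is '' when not found.
    _, found, tail = line.partition(separator)
    if not found:
        return None
    return sum(int(val) for val in PAIR_VALUE.findall(tail))

lines = [
    'Some explanatory text was here:      - -{0, 8} {1, 24} {2, 24} {3, 5} {4, 5} {5, 12} {6, 12} {7, 5}',
    'An unrelated line with no pairs',
]
totals = [sum_log_line(line) for line in lines]
print(totals)  # [95, None]
```

Returning None rather than 0 keeps "no pairs on this line" distinguishable from "pairs that sum to zero".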

Upvotes: 1

Vlad Havriuk

Reputation: 1451

This is an ideal use case for regular expression capture groups:

import re

log_line = 'Some explanatory text was here:      - -{0, 8} {1, 24} {2, 24} {3, 5} {4, 5} {5, 12} {6, 12} {7, 5}'
pattern = r'{(\d+), (\d+)}'

s = sum([int(e[1]) for e in re.findall(pattern, log_line.split('- -')[1])])

print(s) # 95

Here I use re.findall to match the number pairs in the input string, and a list comprehension to convert the second number of each pair to an int before summing.

The advantage of using {(\d+), (\d+)} pattern is the ability to extract first number too (if you need it).
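
As a rough illustration of using both groups, the pairs can be collected into a dict keyed by index. This assumes each index appears at most once per line (a repeated index would overwrite the earlier value); the name pairs is just for this sketch.

```python
import re

log_line = 'Some explanatory text was here:      - -{0, 8} {1, 24} {2, 24} {3, 5} {4, 5} {5, 12} {6, 12} {7, 5}'
pattern = r'{(\d+), (\d+)}'

# With two capture groups, findall returns a list of (index, value) string tuples.
# Assumption: indexes are unique within a line, so a dict loses nothing.
pairs = {int(index): int(value) for index, value in re.findall(pattern, log_line)}

total = sum(pairs.values())
print(pairs[1], total)  # 24 95
```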

Upvotes: 0

Mattwmaster58

Reputation: 2579

First, there is no need to get rid of the prefix: the regex will not match it. Second, we can use a capturing group to capture only the values we care about, which in our case is the second value in each comma-separated pair. We can then use map(int, iterable) to convert each matched string to an int, and call sum on the result.

Putting it all together:

import re

log_line = 'Some explanatory text was here:      - -{0, 8} {1, 24} {2, 24} {3, 5} {4, 5} {5, 12} {6, 12} {7, 5}'
values = re.findall(r'{\d+,\s(\d+)}', log_line)
sum_of_values = sum(map(int, values))
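
One caveat about skipping the prefix: that is only safe as long as the explanatory text never contains something that looks like a pair. The line below is made up for illustration (it is not from the question's log) and shows how the two approaches would differ in that case.

```python
import re

pattern = re.compile(r'{\d+,\s(\d+)}')

# Hypothetical line whose explanatory text happens to contain a pair-like token.
tricky_line = 'Retry {9, 100} reported:      - -{0, 8} {1, 24}'

# Matching against the whole line also picks up the {9, 100} in the prefix.
whole_line_sum = sum(map(int, pattern.findall(tricky_line)))

# Splitting on the separator first restricts matching to the real pairs.
after_separator_sum = sum(map(int, pattern.findall(tricky_line.split('- -', 1)[1])))

print(whole_line_sum, after_separator_sum)  # 132 32
```

If the prefix can never contain braces, matching against the whole line is fine and simpler.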

Upvotes: 2
