Regular expression to match comma separated list of key=value where value can contain html?

Question

I am trying to match a comma-separated list of key=value pairs where the value can contain, quite frankly, a whole host of things.

The pattern I am using is taken from this related question:

split_up_pattern = re.compile(r'([^=]+)=([^=]+)(?:,|$)', re.X|re.M)

However, it causes problems when the value contains HTML, because HTML attributes bring their own equals signs and commas.

Here is an example script:

import re

text = '''package_contents=The basic Super 1050 machine includes the following:
 





uper 1150 machine


 With dies fitted.

The Super 1050




,second_attribute=something else'''

split_up_pattern = re.compile(r'([\w_^=]+)=([^=]+)(?:,|$)', re.X|re.M)

matches = split_up_pattern.findall(text)

import ipdb; ipdb.set_trace()

print(matches)
The output:
ipdb> matches[0]
('package_contents', 'The basic Super 1050 machine includes the following:

 
')
ipdb> matches[1]
('border', '""1"">


')
ipdb> matches[2]
('style', '""width: 200px;"">



uper 1150 machine











 With dies fitted.



The Super 1050







')
ipdb> matches[3]
('second_attribute', 'something else')
The output I want is:
matches[0]:
('package_contents', 'The basic Super 1050 machine includes the following:
 






uper 1150 machine  With dies fitted.The Super 1050
',)
matches[1]:
('second_attribute', 'something else')

FMc · Accepted Answer

Instead of basing the parsing narrowly on the delimiters (comma or equal sign), you can leverage the fact that the next key-value pair starts with something like this:

,WORD=

Here's a sketch of the idea:

import re

text = '''...your example...'''

# Start of the string or our ,WORD= pattern.
rgx_spans = re.compile(r'(\A|,)\w+=')

# Get the start-end positions of all matches.
spans = [m.span() for m in rgx_spans.finditer(text)]

# Use those positions to break up the string into parsable chunks.
for i, s1 in enumerate(spans):
    try:
        s2 = spans[i + 1]
    except IndexError:
        s2 = (None, None)

    start = s1[0]
    end = s2[0]
    key, val = text[start:end].lstrip(',').split('=', 1)

    print()
    print(s1, s2)
    print((key, val))
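
If you want a list of tuples back, as `findall` would give you, the same idea can be wrapped in a function. What follows is only a sketch: the names `split_pairs` and `split_pairs_regex` and the `sample` string are made up for illustration, and the HTML in `sample` is a stand-in, not your real data. It assumes that keys consist only of `\w` characters and that no value contains a comma immediately followed by a word and an equals sign, since that sequence is what marks the next pair.

```python
import re

# Hypothetical sample: the HTML here is a stand-in for the real value.
sample = (
    'package_contents=Includes:\n'
    '<table border="1" style="width: 200px;">'
    '<tr><td>With dies fitted.</td></tr></table>\n'
    ',second_attribute=something else'
)

# A pair begins at the start of the string or right after a comma,
# and is introduced by WORD=.
PAIR_START = re.compile(r'(?:\A|,)\w+=')


def split_pairs(text):
    """Return (key, value) tuples, cutting the text at each ,WORD= boundary."""
    starts = [m.start() for m in PAIR_START.finditer(text)]
    pairs = []
    for i, start in enumerate(starts):
        # The chunk runs up to the next boundary, or to the end of the text.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        chunk = text[start:end]
        if chunk.startswith(','):
            chunk = chunk[1:]  # drop the separating comma
        key, val = chunk.split('=', 1)
        pairs.append((key, val))
    return pairs


def split_pairs_regex(text):
    """Same result with one pattern: a lazy value ended by a lookahead."""
    # re.S lets '.' match newlines, so multi-line values stay in one piece.
    pattern = r'(?:\A|,)(\w+)=(.*?)(?=,\w+=|\Z)'
    return re.findall(pattern, text, re.S)


for key, val in split_pairs(sample):
    print(repr(key), repr(val))
```

Both functions keep the `border="1"` and `style="width: 200px;"` attributes inside the first value, because neither attribute is preceded by a comma. If your real values can contain something like `,name=`, this boundary rule will split in the wrong place and you would need a stricter rule, for example a fixed list of allowed keys.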
