Reputation: 11098
I am trying to match a comma-separated list of key=value pairs, where the value can contain a whole host of things, quite frankly.
The pattern I am using is exactly from this related question:
split_up_pattern = re.compile(r'([^=]+)=([^=]+)(?:,|$)', re.X|re.M)
However, it causes problems when the value contains HTML, because HTML attributes contain `=` characters of their own.
Here is an example script:
import re
text = '''package_contents=<p>The basic Super 1050 machine includes the following:</p>
<p> </p>
<table style="" height: 567px;"" border=""1"">
<tbody>
<tr>
<td style=""width: 200px;"">
<ul>
<li>uper 1150 machine</li>
</ul>
</td>
<td> With dies fitted.
<ul>
<li>The Super 1050</li>
</ul>
</td>
</tr>
</tbody>
<table>,second_attribute=something else'''
split_up_pattern = re.compile(r'([\w_^=]+)=([^=]+)(?:,|$)', re.X|re.M)
matches = split_up_pattern.findall(text)
import ipdb; ipdb.set_trace()
print(matches)
The output:
ipdb> matches[0]
('package_contents', '<p>The basic Super 1050 machine includes the following:</p>\n\n<p> </p>\n')
ipdb> matches[1]
('border', '""1"">\n\n<tbody>\n\n<tr>\n')
ipdb> matches[2]
('style', '""width: 200px;"">\n\n<ul>\n\n<li>uper 1150 machine</li>\n\n</ul>\n\n</td>\n\n<td> With dies fitted.\n\n<ul>\n\n<li>The Super 1050</li>\n\n</ul>\n\n</td>\n\n</tr>\n</tbody>\n<table>')
ipdb> matches[3]
('second_attribute', 'something else')
The output I want is:
matches[0]:
('package_contents', '<p>The basic Super 1050 machine includes the following:</p><p> </p><table style="" height: 567px;"" border=""1""><tbody><tr><td style=""width: 200px;""><ul><li>uper 1150 machine</li></ul></td><td> With dies fitted.<ul><li>The Super 1050</li></ul></td></tr>
</tbody><table>',)
matches[1]:
('second_attribute', 'something else')
Upvotes: 0
Views: 461
Reputation: 42421
Instead of basing the parsing narrowly on the delimiters (comma or equal sign), you can leverage the fact that the next key-value pair starts with something like this:
,WORD=
Here's a sketch of the idea:
import re
text = '''...your example...'''
# Start of the string or our ,WORD= pattern.
rgx_spans = re.compile(r'(\A|,)\w+=')
# Get the start-end positions of all matches.
spans = [m.span() for m in rgx_spans.finditer(text)]
# Use those positions to break up the string into parsable chunks.
for i, s1 in enumerate(spans):
    try:
        s2 = spans[i + 1]
    except IndexError:
        # Last chunk: slice through to the end of the string.
        s2 = (None, None)
    start = s1[0]
    end = s2[0]
    key, val = text[start:end].lstrip(',').split('=', 1)
    print()
    print(s1, s2)
    print((key, val))
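If you want the pairs back as a list rather than printed, the same idea can be wrapped in a small function. This is only a sketch: the names `KEY_START` and `split_pairs` and the short `sample` string are made up for illustration, and it assumes that keys consist solely of `\w` characters and that no value contains a literal `,WORD=` sequence (if one does, it will be split there).

```python
import re

# Hypothetical names for illustration. A new pair starts at the
# beginning of the string or at a ",WORD=" boundary.
KEY_START = re.compile(r'(?:\A|,)(\w+)=')


def split_pairs(text):
    """Return (key, value) tuples, splitting only at ',WORD=' boundaries."""
    matches = list(KEY_START.finditer(text))
    pairs = []
    for i, m in enumerate(matches):
        # The value runs from the end of this "key=" up to the start
        # of the next ",key=" (or to the end of the string).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pairs.append((m.group(1), text[m.end():end]))
    return pairs


# Shortened stand-in for the HTML in the question.
sample = 'a=<td style="x: 1px;" border="1">,b=something else'
print(split_pairs(sample))
```

The `style=` and `border=` attributes are left inside the value because they follow a space, not a comma. If you also want the newlines stripped, as in your desired output, you could call `val.replace('\n', '')` on each value afterwards.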
Upvotes: 1