Reputation: 27575
I have some well-behaved XML files that I want to reformat (NOT PARSE!) using regex. The goal is to have every <trkpt>...</trkpt> pair on a single line.
The following code works, but I'd like to get the operations performed in a single regex substitution instead of the loop, so that I don't need to concatenate the strings back.
import re
xml = """
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
</trkseg>
"""
for trkpt in re.findall(r'<trkpt.*?</trkpt>', xml, re.DOTALL):
    # No flag here: the pattern contains no '.', and the 4th positional argument of re.sub is count, not flags
    print(re.sub(r'>\s*<', '><', trkpt))
An answer using sed would also be welcome.
Thanks for reading
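A side note on the loop above, since it is easy to miss: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is taken as count. The sketch below uses a made-up two-line string (not the GPX data) and passes int(re.DOTALL) as count explicitly to show the effect.

```python
import re

text = "<a>\n</a>"          # made-up sample, only for illustration
pattern = r"<a>.*</a>"

# Flag passed by keyword: '.' matches the newline, so the whole string is replaced.
with_flags = re.sub(pattern, "X", text, flags=re.DOTALL)

# Same value passed as count (what a 4th positional argument means):
# flags stay at 0, '.' does not match the newline, nothing is replaced.
as_count = re.sub(pattern, "X", text, count=int(re.DOTALL))

print(repr(with_flags))
print(repr(as_count))
```

Passing flags by keyword avoids the confusion entirely.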
Upvotes: 2
Views: 1317
Reputation: 336128
How about this:
>>> regex = re.compile(
...     r"""\n[ \t]*        # Match a newline plus following whitespace
...     (?=                 # only if...
...      (?:                # ...the following can be matched:
...       (?!<trkpt)        # (unless an opening <trkpt> tag occurs first)
...       .                 # any character
...      )*                 # any number of times,
...      </trkpt>           # followed by a closing </trkpt> tag
...     )                   # End of lookahead""",
...     re.DOTALL | re.VERBOSE)
>>> print(regex.sub("", xml))
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
Upvotes: 2
Reputation: 2664
Another one-liner is
print(re.sub(r"(<trkpt.+?>).*?(<time>.+?</time>).*?(<ele>.+?</ele>).*?(</trkpt>)",
             r'\1\2\3\4', xml, flags=re.DOTALL))
produces
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
This has the advantage of being easy to change for other tags.
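To make the "easy to change" point concrete, here is a minimal sketch that builds the same kind of pattern from a list of tag names. The helper name collapse_element and the <wpt>/<name>/<sym> sample are assumptions for illustration, not part of the question; the children must appear in the listed order and carry no attributes, just as in the pattern above.

```python
import re

def collapse_element(xml, parent, children):
    """Join a parent element and the listed children onto one line (hypothetical helper)."""
    # One capture group per piece, mirroring the hand-written pattern.
    parts = [r"(<%s.+?>)" % re.escape(parent)]
    for tag in children:
        parts.append(r"(<{0}>.+?</{0}>)".format(re.escape(tag)))
    parts.append(r"(</%s>)" % re.escape(parent))
    # Lazily skip whatever sits between the pieces (here: newlines and indentation).
    pattern = ".*?".join(parts)
    # Replacement keeps only the captured groups: \g<1>\g<2>...
    repl = "".join(r"\g<%d>" % i for i in range(1, len(parts) + 1))
    return re.sub(pattern, repl, xml, flags=re.DOTALL)

sample = '<gpx>\n<wpt lat="1">\n<name>A</name>\n<sym>Flag</sym>\n</wpt>\n</gpx>'
print(collapse_element(sample, "wpt", ["name", "sym"]))
```

For the question's data the call would be collapse_element(xml, "trkpt", ["time", "ele"]).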
Upvotes: 1
Reputation: 777
Do you want to keep the <trkseg>? If so, this could work for you:
print(re.sub(r'([^gt])>\s*<', r'\g<1>><', xml))
This removes all whitespace between tags unless the tag on the left ends in t or g, so the line breaks after <trkseg> and </trkpt> are kept.
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
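The same idea can probably be written with a negative lookbehind instead of a capture group, so the replacement string does not need a backreference. This is only a sketch on a shortened, made-up sample; the limitation is the same as above: any inner tag whose name ends in t or g (for example <sat>) would keep its line break.

```python
import re

# Shortened sample with made-up attribute values, same shape as the question's data.
sample = (
    '<trkseg>\n'
    '<trkpt lon="1" lat="2">\n'
    '<time>t</time>\n'
    '<ele>0</ele>\n'
    '</trkpt>\n'
    '</trkseg>'
)

# (?<![gt]) asserts that the character before '>' is neither g nor t,
# which spares the breaks after <trkseg> and </trkpt>.
result = re.sub(r'(?<![gt])>\s*<', '><', sample)
print(result)
```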
Upvotes: 1
Reputation: 28846
This isn't really what you were asking for, but here's a one-liner for the sake of being a one-liner:
>>> print(re.sub(r'(<trkpt.*?</trkpt>)',
...              lambda m: re.sub(r'>\s*<', '><', m.group(1)),
...              xml, flags=re.DOTALL))
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
Also note that this approach will break if any attribute value contains the string "<trkpt", which probably won't happen, but that's the problem with not using a real parser.
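For contrast only (the question explicitly asks not to parse), here is roughly what the parser route could look like with the standard library's xml.etree.ElementTree on Python 3. The helper name collapse_trkpts is made up, and the sketch assumes the input is a single well-formed element without namespaces.

```python
import xml.etree.ElementTree as ET

def collapse_trkpts(xml_text):
    """Drop the whitespace inside every <trkpt>, leaving the rest untouched (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    for trkpt in root.iter("trkpt"):
        trkpt.text = None          # whitespace between <trkpt ...> and its first child
        for child in trkpt:
            child.tail = None      # whitespace after each child element
    return ET.tostring(root, encoding="unicode")

sample = '\n<trkseg>\n<trkpt lon="1" lat="2">\n<time>t</time>\n<ele>0</ele>\n</trkpt>\n</trkseg>\n'
print(collapse_trkpts(sample))
```

Because the parser only ever sees real elements, a "<trkpt" inside an attribute value cannot confuse it.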
Upvotes: 1