Reputation: 805
I have this file which contains several Math tags like so:
<Math
<Unique 262963>
<BRect 1.02176" 0.09096" 1.86024" 0.40658">
<MathFullForm `equal[therefore[char[tau]],plus[indexes[0,1,char[tau],char[c]],minus[times[indexes[
0,1,char[tau],char[s]],string[" and "],over[times[char[d],char[omega]],times[char[
d],char[t]]]]]],over[char[tau],char[I]]]'
> # end of MathFullForm
<MathLineBreak 138.88883">
<MathOrigin 1.95188" 0.32125">
<MathAlignment Center>
<MathSize MathMedium>
> # end of Math
And like so:
<Math
<Unique 87795>
<Separation 0>
<ObColor `Black'>
<RunaroundGap 0.0 pt>
<BRect 0.01389" 0.01389" 0.17519" 0.22013">
<MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm
And I want to extract the contents of the Unique tag and the MathFullForm tag, but I am at a loss as to how to do so. Note that Unique tags exist elsewhere in the file, outside of Math tags.
I've tried using regex but that doesn't work too well and misses many of the tags. I then thought about using an XML parser, but that wouldn't work because the code isn't valid XML.
Can anyone steer me in the right direction to do this in Python? A regex solution is acceptable.
Upvotes: 0
Views: 298
Reputation: 805
I have found the solution by using the following regex:
<Math\s*<Unique[^>]*>\s*(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*\s*(?:<RunaroundGap[^>]*>)*\s*<BRect[^>]*>\s*<MathFullForm `[^']*'
This matches the whole tag, so I can use two more regexes to extract the necessary information.
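As a possible shortcut, the same pattern can carry two capturing groups so that a single pass returns both values. This is only a sketch: the helper name extract_math and the sample string are made up for illustration, and it assumes the Unique value is always a run of digits and that the MathFullForm body never contains a single quote.

```python
import re

# Same structure as the regex above, with capturing groups added for the
# Unique id and the MathFullForm body (assumption: the id is all digits).
MATH_RE = re.compile(
    r"<Math\s*<Unique\s+(\d+)>\s*"
    r"(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*\s*(?:<RunaroundGap[^>]*>)*\s*"
    r"<BRect[^>]*>\s*<MathFullForm\s*`([^']*)'"
)

def extract_math(text):
    """Return (unique_id, full_form) pairs, one per Math block found in text."""
    return [(m.group(1), m.group(2)) for m in MATH_RE.finditer(text)]

# Hypothetical sample: one stray Unique tag followed by one Math block.
sample = '''<Unique 111>
<Math
<Unique 87795>
<Separation 0>
<ObColor `Black'>
<RunaroundGap 0.0 pt>
<BRect 0.01389" 0.01389" 0.17519" 0.22013">
<MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm
> # end of Math
'''

print(extract_math(sample))
```

Because the pattern is anchored on <Math, the stray <Unique 111> outside a Math block is skipped. A MathFullForm body that spans several lines is returned with its line breaks intact, since [^'] also matches newlines.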
Upvotes: 0
Reputation: 774
You could use a loop to retrieve the tags: re.finditer() iterates over every match in the text.
Check the code below and see if it works for you.
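import re

# text is assumed to already hold the file contents.
# Flatten line breaks so that a tag spanning several lines can still match.
text = re.sub(r'\r|\n', ' ', text)
for m in re.finditer(r'(\<Unique\s).*?\>', text):
    print(m.group())
for m in re.finditer(r'(\<MathFullForm\s).*?\>', text):
    print(m.group())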
Upvotes: 1
Reputation: 6036
You can use this regex, specifying the DOTALL flag (otherwise the . would not match \n):
<(Unique|MathFullForm)(.*?)>
The first capturing group tells you whether the match belongs to the Unique or the MathFullForm tag, whereas the second holds the content of the tag.
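A minimal sketch of how this regex might be applied with re.finditer() and re.DOTALL. The sample string and the results list are illustrative only. Note that this pattern also picks up Unique tags outside of Math blocks, and that the MathFullForm content keeps its surrounding quote characters.

```python
import re

# re.DOTALL lets "." match newlines, so a tag split across lines still matches.
TAG_RE = re.compile(r"<(Unique|MathFullForm)(.*?)>", re.DOTALL)

# Hypothetical sample with a MathFullForm body spanning two lines.
sample = (
    "<Math\n"
    "<Unique 262963>\n"
    "<MathFullForm `over[char[tau],\nchar[I]]'\n"
    "> # end of MathFullForm\n"
)

results = []
for m in TAG_RE.finditer(sample):
    # group(1) is the tag name, group(2) is everything up to the closing ">".
    results.append((m.group(1), m.group(2).strip()))

for name, content in results:
    print(name, repr(content))
```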
Upvotes: 0