Beta Decay

Reputation: 805

Parsing an XML-like file in Python

I have a file that contains several Math tags like this:

<Math 
   <Unique 262963>
   <BRect  1.02176" 0.09096" 1.86024" 0.40658">
   <MathFullForm `equal[therefore[char[tau]],plus[indexes[0,1,char[tau],char[c]],minus[times[indexes[
0,1,char[tau],char[s]],string[" and  "],over[times[char[d],char[omega]],times[char[
d],char[t]]]]]],over[char[tau],char[I]]]'
   > # end of MathFullForm
   <MathLineBreak  138.88883">
   <MathOrigin  1.95188" 0.32125">
   <MathAlignment Center>
   <MathSize MathMedium>
> # end of Math

And like so:

<Math 
   <Unique 87795>
   <Separation 0>
   <ObColor `Black'>
   <RunaroundGap  0.0 pt>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm

I want to extract the contents of the Unique tag and the MathFullForm tag inside each Math tag, but I am at a loss as to how to do so. Note that Unique tags also exist elsewhere in the file, outside of Math tags.

I tried using a regex, but it misses many of the tags. I then thought about using an XML parser, but that won't work because the file isn't valid XML.

Can anyone steer me in the right direction to do this in Python? (A regex solution is acceptable.)

Upvotes: 0

Views: 298

Answers (3)

Beta Decay

Reputation: 805

I found a solution using the following regex:

<Math\s*<Unique[^>]*>\s*(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*\s*(?:<RunaroundGap[^>]*>)*\s*<BRect[^>]*>\s*<MathFullForm `[^']*'

This matches each Math tag from its opening through the end of the MathFullForm contents, so I can use two more regexes on each match to extract the Unique and MathFullForm values.
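As a rough sketch of that two-step approach, the snippet below applies the regex above and then runs two follow-up patterns on each match. The names `UNIQUE`, `FULL_FORM`, and `extract_math` are hypothetical and not part of the original answer, and it assumes line breaks inside a MathFullForm value are only wrapping and can be dropped.

```python
import re

# Pattern from the answer above, unchanged (split over two string literals).
MATH_TAG = re.compile(
    r"<Math\s*<Unique[^>]*>\s*(?:<Separation[^>]*>)*\s*(?:<ObColor[^>]*>)*"
    r"\s*(?:<RunaroundGap[^>]*>)*\s*<BRect[^>]*>\s*<MathFullForm `[^']*'"
)
# Hypothetical follow-up patterns for the two fields of interest.
UNIQUE = re.compile(r"<Unique\s+(\d+)>")
FULL_FORM = re.compile(r"<MathFullForm `([^']*)'")


def extract_math(text):
    """Return (unique_id, full_form) pairs, one per matched Math tag."""
    results = []
    for block in MATH_TAG.finditer(text):
        chunk = block.group()
        unique_id = UNIQUE.search(chunk).group(1)
        # Assumption: line breaks inside the formula are only wrapping.
        full_form = FULL_FORM.search(chunk).group(1)
        full_form = full_form.replace("\r", "").replace("\n", "")
        results.append((unique_id, full_form))
    return results


# The first Unique tag sits outside any Math tag and should be ignored.
sample = """<Unique 1>
<Math
   <Unique 87795>
   <Separation 0>
   <ObColor `Black'>
   <RunaroundGap  0.0 pt>
   <BRect  0.01389" 0.01389" 0.17519" 0.22013">
   <MathFullForm `indexes[0,1,char[m,0,0,1,0,0],char[i]]'
> # end of MathFullForm
"""
print(extract_math(sample))
```

Because the follow-up patterns only search inside each matched Math tag, Unique tags elsewhere in the file are not picked up.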

Upvotes: 0

Rohan Amrute

Reputation: 774

You could use a loop to pull out the tags. re.finditer() iterates over every match of a pattern, so you can extract the tags one at a time.

Try the code below and see if it works for you.

import re

# text holds the file contents; collapse line breaks so multi-line tags match.
text = re.sub(r'\r|\n', ' ', text)
for m in re.finditer(r'<Unique\s.*?>', text):
    print(m.group())
for m in re.finditer(r'<MathFullForm\s.*?>', text):
    print(m.group())

Upvotes: 1

Marco Luzzara

Reputation: 6036

You can use this regex with the DOTALL flag (otherwise the . would not match \n, and tags that span several lines would be missed):

<(Unique|MathFullForm)(.*?)>

The first capturing group tells you whether the match belongs to a Unique or a MathFullForm tag, and the second contains the content of the tag.
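A minimal sketch of how this regex might be used from Python follows. The helper name `parse_tags` and the shortened sample text are illustrative assumptions, not part of the answer.

```python
import re

# Regex from the answer; re.DOTALL lets "." match line breaks as well.
TAG = re.compile(r"<(Unique|MathFullForm)(.*?)>", re.DOTALL)


def parse_tags(text):
    """Return (tag_name, content) pairs with surrounding whitespace removed."""
    return [(name, content.strip()) for name, content in TAG.findall(text)]


# Shortened, made-up sample in the same shape as the question's file.
sample = """<Math
   <Unique 262963>
   <MathFullForm `over[char[tau],
char[I]]'
   > # end of MathFullForm
> # end of Math
"""

for name, content in parse_tags(sample):
    print(name, repr(content))
```

The MathFullForm content still carries its backtick and quote delimiters and any internal line breaks, so it may need further cleanup. The pattern also matches Unique tags that sit outside Math tags, which the question says exist in the file, so those would have to be filtered out separately.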

Upvotes: 0
