Dan Halbert

Reputation: 2925

Representing XML tags as character ranges

Do you know of any XML libraries that convert XML markup to and from character range or offset information based on the original unmarked text? (I don't care much about the base platform of the libraries: it could be Java, Python, Perl, etc.)

For instance, suppose I have this unmarked text:

the calico cat and the black dog

which is marked up as

the <PET>calico</PET> cat and the <PET>black do</PET>g

The markup has positional errors (as demonstrated above). I know how to fix those errors: that's not the question here. But it's fairly painful to do this with conventional hierarchy-minded XML parsers. It would be easier if the XML markup were converted to out-of-band character ranges, relative to the unmarked text, which I could easily adjust:

PET: 4-10    # "calico"  (should be 4-14 "calico cat" )
PET: 23-31   # "black do" (should be 23-32 "black dog" )

After fixing the offsets I would regenerate the XML.
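
For illustration, the regeneration step might look roughly like the Python sketch below. `apply_ranges` is a hypothetical helper, not part of any library, and it assumes the ranges are half-open (end exclusive, as in the listing above), non-empty, and either disjoint or properly nested.

```python
from xml.sax.saxutils import escape


def apply_ranges(text, ranges):
    """Wrap text[start:end] in <name>...</name> for each (name, start, end).

    Assumes half-open, non-empty ranges that are disjoint or properly nested.
    """
    opens = {}   # offset -> tag names to open there, outermost first
    closes = {}  # offset -> tag names to close there, innermost first
    # Sort so that, at the same start offset, the longer (outer) range opens first.
    for name, start, end in sorted(ranges, key=lambda r: (r[1], -r[2])):
        opens.setdefault(start, []).append(name)
        # Inner ranges are seen later, so put them at the front of the close list.
        closes.setdefault(end, []).insert(0, name)

    out = []
    for i in range(len(text) + 1):
        out.extend(f"</{n}>" for n in closes.get(i, []))
        out.extend(f"<{n}>" for n in opens.get(i, []))
        if i < len(text):
            out.append(escape(text[i]))  # escape &, < and > in the plain text
    return "".join(out)


text = "the calico cat and the black dog"
print(apply_ranges(text, [("PET", 4, 14), ("PET", 23, 32)]))
# the <PET>calico cat</PET> and the <PET>black dog</PET>
```

Overlapping (non-nested) ranges would produce ill-formed XML with this sketch, so they would need to be split or rejected first.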

I've only found a few XML parsing libraries that return character offset information, and the offsets are based on the XML text, not the unmarked text. Also the offsets can be wrong (cf. Java, XMLEvent location Characters).

Upvotes: 1

Views: 79

Answers (4)

rlayers

Reputation: 21

You can get the character indices for all elements, attributes, tags, text, etc. in an XML document using Pawpaw.

Code:

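import sys
# Block the C accelerator so that ElementTree falls back to its pure-Python
# implementation; the C implementation can't be hooked by a custom parser.
sys.modules['_elementtree'] = None
import xml.etree.ElementTree as ET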

from pawpaw import xml, visualization

text = \
"""<?xml version="1.0"?>
<music xmlns:mb="http://musicbrainz.org/ns/mmd-1.0#" xmlns="http://mymusic.org/xml/">
    <?display table-view?>
    <album genre="R&amp;B" mb:id="123-456-789-0">
        Robson Jorge &amp; Lincoln Olivetti <!-- 1982, Vinyl -->
    </album>
</music>"""

root = ET.fromstring(text, parser=xml.XmlParser())
print(visualization.pepo.Tree().dumps(root.ito))

Output:

(22, 271) 'element' : '<music xmlns:mb="htt… </album>\n</music>'
├──(22, 107) 'start_tag' : '<music xmlns:mb="htt…/mymusic.org/xml/">'
│  ├──(23, 28) 'tag' : 'music'
│  │  └──(23, 28) 'name' : 'music'
│  └──(29, 106) 'attributes' : 'xmlns:mb="http://mus…//mymusic.org/xml/"'
│     ├──(29, 74) 'attribute' : 'xmlns:mb="http://mus…nz.org/ns/mmd-1.0#"'
│     │  ├──(29, 37) 'tag' : 'xmlns:mb'
│     │  │  ├──(29, 34) 'namespace' : 'xmlns'
│     │  │  └──(35, 37) 'name' : 'mb'
│     │  └──(39, 73) 'value' : 'http://musicbrainz.org/ns/mmd-1.0#'
│     └──(75, 106) 'attribute' : 'xmlns="http://mymusic.org/xml/"'
│        ├──(75, 80) 'tag' : 'xmlns'
│        │  └──(75, 80) 'name' : 'xmlns'
│        └──(82, 105) 'value' : 'http://mymusic.org/xml/'
├──(107, 139) 'text' : '\n    <?display table-view?>\n    '
│  └──(112, 134) 'pi' : '<?display table-view?>'
│     └──(114, 132) 'value' : 'display table-view'
├──(139, 262) 'element' : '<album genre="R&amp;…l -->\n    </album>'
│  ├──(139, 184) 'start_tag' : '<album genre="R&amp;…id="123-456-789-0">'
│  │  ├──(140, 145) 'tag' : 'album'
│  │  │  └──(140, 145) 'name' : 'album'
│  │  └──(146, 183) 'attributes' : 'genre="R&amp;B" mb:id="123-456-789-0"'
│  │     ├──(146, 161) 'attribute' : 'genre="R&amp;B"'
│  │     │  ├──(146, 151) 'tag' : 'genre'
│  │     │  │  └──(146, 151) 'name' : 'genre'
│  │     │  └──(153, 160) 'value' : 'R&amp;B'
│  │     └──(162, 183) 'attribute' : 'mb:id="123-456-789-0"'
│  │        ├──(162, 167) 'tag' : 'mb:id'
│  │        │  ├──(162, 164) 'namespace' : 'mb'
│  │        │  └──(165, 167) 'name' : 'id'
│  │        └──(169, 182) 'value' : '123-456-789-0'
│  ├──(184, 254) 'text' : '\n        Robson Jor…82, Vinyl -->\n    '
│  │  └──(229, 249) 'comment' : '<!-- 1982, Vinyl -->'
│  │     └──(233, 246) 'value' : ' 1982, Vinyl '
│  └──(254, 262) 'end_tag' : '</album>'
│     └──(256, 261) 'tag' : 'album'
│        └──(256, 261) 'name' : 'album'
└──(263, 271) 'end_tag' : '</music>'
   └──(265, 270) 'tag' : 'music'
      └──(265, 270) 'name' : 'music'

Upvotes: 0

JLRishe

Reputation: 101738

Here's how this can be accomplished with an XmlReader in .NET:

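using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

class MarkupSpan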
{
    internal string Name;
    internal int Start;
    internal int Stop;
    internal List<object> ChildItems;

    internal MarkupSpan(string name, int start)
    {
        Name = name;
        Start = start;
        ChildItems = new List<object>();
    }

    public override string ToString()
    {
        return string.Concat(ChildItems);
    }
}


private static string ProcessMarkup(string text)
{
    Stack<MarkupSpan> inputStack = new Stack<MarkupSpan>();

    StringReader sr = new StringReader("<n>" + text + "</n>");

    XmlReader xr = XmlReader.Create(sr);
    int pos = 0;
    StringBuilder output = new StringBuilder();

    while (xr.Read())
    {
        if (xr.Depth > 0)
        {
            switch (xr.NodeType)
            {
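                // Whitespace-only runs and CDATA sections are reported as their
                // own node types, but they still count toward the text offset.
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                case XmlNodeType.CDATA:
                case XmlNodeType.Text: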
                    pos += xr.Value.Length;
                    if (inputStack.Count != 0)
                    {
                        inputStack.Peek().ChildItems.Add(xr.Value);
                    }
                    break;
                case XmlNodeType.Element:
                    MarkupSpan ms = new MarkupSpan(xr.LocalName, pos);
                    if (inputStack.Count != 0)
                    {
                        inputStack.Peek().ChildItems.Add(ms);
                    }
                    inputStack.Push(ms);
                    break;
                case XmlNodeType.EndElement:
                    ms = inputStack.Pop();
                    ms.Stop = pos;
                    if (inputStack.Count == 0)
                    {
                        output.Append(OutputSpan(ms));
                    }
                    break;
            }
        }
    }

    return output.ToString();
}

private static string OutputSpan(MarkupSpan ms)
{
    string nameAndRange = string.Format("{0}: {1}-{2}",
                                        ms.Name, ms.Start, ms.Stop);
    return string.Format("{0,-14}# \"{1}\"", nameAndRange, ms) +
           Environment.NewLine +
           string.Concat(ms.ChildItems.OfType<MarkupSpan>().Select(OutputSpan));
}

When run on your sample input, the result is:

PET: 4-10     # "calico"
PET: 23-31    # "black do"

When run on a more interesting example (with nested tags):

the <PET><COLOR>calico</COLOR></PET> cat and the <PET><COLOR>bla</COLOR>ck do</PET>g

The result is:

PET: 4-10     # "calico"
COLOR: 4-10   # "calico"
PET: 23-31    # "black do"
COLOR: 23-26  # "bla"

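
The same running-offset idea carries over to other platforms. Here is a rough Python sketch using only the standard library's `xml.etree.ElementTree`; `extract_ranges` is a hypothetical helper name, and the `<n>` wrapper is the same trick used above to make the fragment well-formed. It returns the unmarked text together with half-open ranges into it.

```python
import xml.etree.ElementTree as ET


def extract_ranges(marked_up):
    """Return (plain_text, [(tag, start, end), ...]); offsets index plain_text."""
    # Wrap the fragment in a dummy root so it parses as a single document.
    root = ET.fromstring("<n>" + marked_up + "</n>")
    parts = []   # text pieces in document order
    ranges = []  # (tag, start, end), end exclusive
    pos = 0      # number of plain-text characters seen so far

    def visit(elem, is_root):
        nonlocal pos
        start = pos
        if not is_root:
            slot = len(ranges)
            ranges.append(None)  # reserve a slot so parents precede children
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            visit(child, False)
            if child.tail:  # text that follows the child's end tag
                parts.append(child.tail)
                pos += len(child.tail)
        if not is_root:
            ranges[slot] = (elem.tag, start, pos)

    visit(root, True)
    return "".join(parts), ranges


plain, ranges = extract_ranges(
    "the <PET>calico</PET> cat and the <PET>black do</PET>g")
print(plain)   # the calico cat and the black dog
print(ranges)  # [('PET', 4, 10), ('PET', 23, 31)]
```

Note that ElementTree reports namespaced tags in `{uri}name` form, so a namespaced document would need the tag names mapped back before regenerating markup.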
Upvotes: 1

JLRishe

Reputation: 101738

I've provided a .NET answer, but here's how this can be done with XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:variable name="space" select="'                  '" />
  <xsl:variable name="spaceLen" select="string-length($space)" />

  <xsl:template match="text()" />

  <xsl:template match="*/*">
    <xsl:param name="parentLeading" select="0" />
    <xsl:variable name="leadingText">
      <xsl:apply-templates select="preceding-sibling::node()" mode="value" />
    </xsl:variable>

    <xsl:variable name="leading" select="$parentLeading + 
                                             string-length($leadingText)" />

    <xsl:variable name="nameAndRange" 
                  select="concat(local-name(), ' ', $leading, 
                                 '-', $leading + string-length())" />
    <xsl:variable name="spacing"
                  select="substring($space, 1, 14 - string-length($nameAndRange))" />
    <xsl:value-of select="concat($nameAndRange, $spacing, 
                                 '# &quot;', ., '&quot;&#xA;')"/>
    <xsl:apply-templates>
      <xsl:with-param name="parentLeading" select="$leading" />
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="node()" mode="value">
    <xsl:value-of select="." />
  </xsl:template>
</xsl:stylesheet>

When run on this input:

<n>the <PET>calico</PET> cat and the <PET>black do</PET>g</n>

The result is:

PET 4-10      # "calico"
PET 23-31     # "black do"

And when run on this input:

<n>the <PET><COLOR>calico</COLOR></PET> cat and the <PET><COLOR>bla</COLOR>ck do</PET>g</n>

The result is:

PET 4-10      # "calico"
COLOR 4-10    # "calico"
PET 23-31     # "black do"
COLOR 23-26   # "bla"

Upvotes: 1

Cᴏʀʏ

Reputation: 107586

Are you opposed to .NET?

You might want to tackle this from the standpoint of HTML. There is a library called the HtmlAgilityPack that can parse HTML (which is close enough to XML for this purpose). In doing so, your example would look something like a list of nodes, broken up between text nodes and HTML (XML) PET nodes:

HtmlNode[n]
|
+--[0] "the " (text node)
|
+--[1] <PET>
|   |
|   +--[0] "calico" (text node)
|
+--[2] " cat and the " (text node)
|
+--[3] <PET>
|   |
|   +--[0] "black do" (text node)
|
+--[4] "g" (text node)

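Each HtmlNode object has a LinePosition property that would give you a starting position. Note that it is the column within the line of the marked-up source, so it maps directly to an offset only for single-line input, and the lengths of any preceding tags have to be subtracted to get an offset into the unmarked text. The end offsets can be calculated by adding on the length of the node's text (the InnerText property) or by subtracting 1 from the next node's LinePosition.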
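
To make the difference between source positions and unmarked-text offsets concrete, here is a small Python sketch. It uses the standard library's expat bindings as a stand-in, not HtmlAgilityPack, and `tag_positions` is a hypothetical helper name. For each start tag it records both the position the parser reports in the marked-up input and the offset obtained by summing the lengths of the text nodes seen so far. It assumes ASCII input, since expat reports byte indexes rather than character indexes.

```python
import xml.parsers.expat


def tag_positions(marked_up):
    """Return [(name, offset_in_markup, offset_in_plain_text), ...]."""
    prefix = "<n>"  # dummy root so the fragment is well-formed
    parser = xml.parsers.expat.ParserCreate()
    found = []
    state = {"plain_pos": 0, "seen_root": False}

    def on_start(name, attrs):
        if not state["seen_root"]:
            state["seen_root"] = True  # skip the dummy wrapper element
            return
        # CurrentByteIndex points at the '<' of the tag in the parser input.
        markup_pos = parser.CurrentByteIndex - len(prefix)
        found.append((name, markup_pos, state["plain_pos"]))

    def on_text(data):
        # Text may arrive in several chunks; only the total length matters.
        state["plain_pos"] += len(data)

    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_text
    parser.Parse(prefix + marked_up + "</n>", True)
    return found


for row in tag_positions(
        "the <PET>calico</PET> cat and the <PET>black do</PET>g"):
    print(row)
```

The two offsets agree for the first tag and drift apart afterwards by the total length of the tags that precede each element, which is the adjustment a LinePosition-based approach would also need.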

I don't know if you think this approach is less painful, but it's where I would start (having never tackled a problem like this before).

There's a list of HTML parsing libraries in various languages here.

Upvotes: 1
