bandrzej

Reputation: 533

XSLT 2.0: Creating child elements from an element's text value via known semantic hierarchy

I'm a bit stuck on this one. The data is provided in the following format (unimportant content snipped):

<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
  <Indexes>
    <!--SNIP-->
    <Index Level="3" HasChildren="0">
      <!--SNIP-->
      <Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt; 
            (I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks; 
            &lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..; 
            &lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats; 
            &lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt; 
            (2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement 
            specifying ...; &lt;/p&gt;</Content>
    </Index>
    <!--SNIP-->
  </Indexes>
</Content>

I need to take the text value of the Content element, which contains a semantic hierarchy:

(1)
 +-(a)
    +-(I)
       +-(A)

...and use an XSLT 2.0 transformation to express it as parent-child element relationships in the final output:

    <?xml version="1.0" encoding="UTF-8"?>
    <law>
       <!--SNIP-->
       <content>
          <section prefix="(1)">
            <section prefix="(a)">The statutes ...</section>
            <section prefix="(b)">To ensure public ..:
              <section prefix="(I)">Shall authorize ...;</section>
              <section prefix="(II)">May authorize and ...:
                <section prefix="(A)">Compact disks;</section>
                <section prefix="(B)">On-line public ...;</section>
                <section prefix="(C)">Electronic applications for ..;</section>
                <section prefix="(D)">Electronic books or ...</section>
                <section prefix="(E)">Other electronic products or formats;</section>
              </section>
              <section prefix="(III)">May, pursuant ...</section>
              <section prefix="(IV)">Recognizes that ...</section>        
            </section>      
          </section>
          <section prefix="(2)">
            <section prefix="(a)">Any person, ...:
              <section prefix="(I)">A statement specifying ...;</section>
              <section prefix="(II)">A statement specifying ...;</section>
            </section>      
          </section>
          <section prefix="(3)">Level 1 node with no children</section>
       </content>
    </law>

I was able to tokenize Content's text value on the HTML-encoded closing P tags, but I have no clue how to make the dynamically created elements conditionally nest as children of one another.

My XSLT 2.0 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">
        <content>
            <!-- Loop through HTML encoded P tag endings -->
            <xsl:for-each select="tokenize(.,'&lt;/p&gt;')">

                <!-- Set Token to a variable and remove P opening tags -->
                <xsl:variable name="sectionText">
                    <xsl:value-of select="normalize-space(replace(current(),'&lt;p&gt;',''))"/>  
                </xsl:variable>    

                <!-- Output -->
                <xsl:if test="string-length($sectionText)!=0">
                    <section>
                        <!-- Set the section element's prefix attribute (if exists) -->
                        <xsl:analyze-string select="$sectionText" regex="^(\(([\w]+)\)){{1,3}}">
                            <xsl:matching-substring >
                                <xsl:attribute name="prefix" select="." />
                            </xsl:matching-substring>
                        </xsl:analyze-string>

                        <!-- Set the section element's value -->
                        <xsl:value-of select="$sectionText"/>
                    </section>
                </xsl:if>

            </xsl:for-each>
        </content>
    </xsl:template>
</xsl:stylesheet> 

...which gets me this far, but the output lacks the semantic hierarchy within the section elements:

<?xml version="1.0" encoding="UTF-8"?>
<law>
   <structure>
      <content>
         <section prefix="(1)(a)">(1)(a)The statutes ...</section>
         <section prefix="(b)">(b)To ensure public ..:</section>
         <section prefix="(I)">(I)Shall authorize ...;</section>
         <section prefix="(II)">(II)May authorize and ...:</section>
         <section prefix="(A)">(A)Compact disks;</section>
         <section prefix="(B)">(B)On-line public ...;</section>
         <section prefix="(C)">(C)Electronic applications for ..;</section>
         <section prefix="(D)">(D)Electronic books or ...</section>
         <section prefix="(E)">(E)Other electronic products or formats;</section>
         <section prefix="(III)">(III)May, pursuant ...</section>
         <section prefix="(IV)">(IV)Recognizes that ...</section>
         <section prefix="(2)(a)">(2)(a)Any person, ...:</section>
         <section prefix="(I)">(I)A statement specifying ...;</section>
         <section prefix="(II)">(II)A statement specifying ...;</section>
         <section prefix="(3)">(3)Level 1 section with no children ...;</section>
      </content>
   </structure>
</law>

Since the section elements are created dynamically by the XSLT 2.0 stylesheet by tokenizing on the closing P tags, how do I build the parent-child relationships dynamically from the known semantic hierarchy expressed in the prefix attribute?

Experience with other programming languages points me toward recursion over the tokens, with logic comparing each prefix to the previous one to decide nesting. I'm struggling to find information on how to do this in XSLT 2.0 with my limited knowledge (I last used v1.0 roughly ten years ago). I know I could just parse this with an external Python script and be done, but I'm trying to stick to an XSLT 2.0 stylesheet solution for maintainability.

Any help getting me on the right track, or a solution, is appreciated.

Upvotes: 1

Views: 1038

Answers (2)

Martin Honnen

Reputation: 167696

I played with this a bit and came up with the following stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:mf="http://example.com/mf"
 xmlns:d="data:,dpc" 
 exclude-result-prefixes="xs d mf">

    <xsl:include href="htmlparse.xml"/>

    <xsl:param name="patterns" as="element(pattern)*" xmlns="">
      <pattern value="^\s*(\([0-9]+\))" group="1" next="1"/>
      <pattern value="^\s*(\([0-9]+\))?\s*(\([a-z]\))" group="2" next="0"/>
      <pattern value="^\s*(\((I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\))" group="1" next="0"/>
      <pattern value="^\s*(\([A-Z]\))" group="1" next="0"/>
    </xsl:param>

    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:function name="mf:group" as="element(section)*">
      <xsl:param name="paragraphs" as="element(p)*"/>
      <xsl:param name="patterns" as="element(pattern)*"/>
      <xsl:variable name="pattern1" as="element(pattern)?" select="$patterns[1]"/>
      <xsl:for-each-group select="$paragraphs" group-starting-with="p[matches(., $pattern1/@value)]">
        <xsl:variable name="prefix" as="xs:string?">
          <xsl:analyze-string select="." regex="{$pattern1/@value}">
            <xsl:matching-substring>
              <xsl:sequence select="string(regex-group(xs:integer($pattern1/@group)))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <section prefix="{$prefix}">
          <xsl:choose>
            <xsl:when test="xs:boolean(xs:integer($pattern1/@next))">
              <xsl:sequence select="mf:group(current-group(), $patterns[position() gt 1])"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="node()">
                <xsl:with-param name="pattern" as="element(pattern)" select="$pattern1" tunnel="yes"/>
              </xsl:apply-templates>
              <xsl:sequence select="mf:group(current-group() except ., $patterns[position() gt 1])"/>
            </xsl:otherwise>
          </xsl:choose>
        </section>
      </xsl:for-each-group>
    </xsl:function>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">

        <content>
            <xsl:sequence select="mf:group(d:htmlparse(., '', true())/*, $patterns)"/>
        </content>
    </xsl:template>

    <xsl:template match="p/text()[1]">
      <xsl:param name="pattern" as="element(pattern)" tunnel="yes"/>
      <xsl:value-of select="replace(., $pattern/@value, '')"/>
    </xsl:template>
</xsl:stylesheet> 

It makes use of http://web-xslt.googlecode.com/svn/trunk/htmlparse/htmlparse.xsl, an HTML tag soup parser written in XSLT 2.0, to parse the escaped HTML fragment markup into nodes which are then grouped using the function mf:group in the stylesheet. The grouping is driven by a sequence of regular expression patterns passed in as a parameter.

When applying the stylesheet with Saxon 9.5 to your input sample, I get the following result:

<law>
   <structure>
      <content>
         <section prefix="(1)">
            <section prefix="(a)">The statutes ... </section>
            <section prefix="(b)">To ensure public ..: <section prefix="(I)">Shall authorize ...; </section>
               <section prefix="(II)">May authorize and ...: <section prefix="(A)">Compact disks;
            </section>
                  <section prefix="(B)">On-line public ...; </section>
                  <section prefix="(C)">Electronic applications for ..;
            </section>
                  <section prefix="(D)">Electronic books or ... </section>
                  <section prefix="(E)">Other electronic products or formats;
            </section>
               </section>
               <section prefix="(III)">May, pursuant ... </section>
               <section prefix="(IV)">Recognizes that ... </section>
            </section>
         </section>
         <section prefix="(2)">
            <section prefix="(a)">Any person, ...: <section prefix="(I)">A statement specifying ...; </section>
               <section prefix="(II)">A statement
            specifying ...; </section>
            </section>
         </section>
      </content>
   </structure>
</law>

If there can be more than 13 (XIII) sections at that level, you would need to extend the regular expression pattern for Roman numerals in the parameter, as I have only listed the numerals up to XIII.
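As a possible alternative to listing every numeral, a character class could stand in for the alternation. This is only a sketch and not tested against your full data. It assumes the Roman numerals at that level stay within I, V and X (so up to the thirties), and it also accepts malformed numerals such as (IIII). The letters L, C, D and M are deliberately left out, because (C) and (D) occur as capital-letter items in your sample and would otherwise be grouped at the wrong level.

```xml
<!-- Looser Roman numeral pattern (an assumption, not part of the tested stylesheet):
     any run of I, V and X in parentheses. It would replace the third pattern element. -->
<pattern value="^\s*(\([IVX]+\))" group="1" next="0"/>
```

Like the explicit list, this still treats a capital-letter item (I), (V) or (X) as a Roman numeral, so that ambiguity remains.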

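Based on the comment and the edited input sample, I have adapted the stylesheet a bit. The xsl:when test in mf:group now also checks that the paragraph matches the next pattern, so a top-level section without children, such as (3), gets its text output directly: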

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:mf="http://example.com/mf"
 xmlns:d="data:,dpc" 
 exclude-result-prefixes="xs d mf">

    <xsl:include href="htmlparse.xml"/>

    <xsl:param name="patterns" as="element(pattern)*" xmlns="">
      <pattern value="^\s*(\([0-9]+\))" group="1" next="1"/>
      <pattern value="^\s*(\([0-9]+\))?\s*(\([a-z]\))" group="2" next="0"/>
      <pattern value="^\s*(\((I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\))" group="1" next="0"/>
      <pattern value="^\s*(\([A-Z]\))" group="1" next="0"/>
    </xsl:param>

    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:function name="mf:group" as="element(section)*">
      <xsl:param name="paragraphs" as="element(p)*"/>
      <xsl:param name="patterns" as="element(pattern)*"/>
      <xsl:variable name="pattern1" as="element(pattern)?" select="$patterns[1]"/>
      <xsl:for-each-group select="$paragraphs" group-starting-with="p[matches(., $pattern1/@value)]">
        <xsl:variable name="prefix" as="xs:string?">
          <xsl:analyze-string select="." regex="{$pattern1/@value}">
            <xsl:matching-substring>
              <xsl:sequence select="string(regex-group(xs:integer($pattern1/@group)))"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <section prefix="{$prefix}">
          <xsl:choose>
            <xsl:when test="xs:boolean(xs:integer($pattern1/@next)) and matches(., $patterns[2]/@value)">
              <xsl:sequence select="mf:group(current-group(), $patterns[position() gt 1])"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="node()">
                <xsl:with-param name="pattern" as="element(pattern)" select="$pattern1" tunnel="yes"/>
              </xsl:apply-templates>
              <xsl:sequence select="mf:group(current-group() except ., $patterns[position() gt 1])"/>
            </xsl:otherwise>
          </xsl:choose>
        </section>
      </xsl:for-each-group>
    </xsl:function>

    <xsl:template match="/Content">
        <!-- Work from the lowest index level with no children up -->
        <xsl:apply-templates select=".//Index[@HasChildren=0]"/>
    </xsl:template>  

    <xsl:template match="Index[@HasChildren=0]">
        <law>
            <structure>
                <xsl:apply-templates select="Content"/>
            </structure>
        </law>
    </xsl:template>

    <!-- Template for Content element from original -->
    <xsl:template match="Content">

        <content>
            <xsl:sequence select="mf:group(d:htmlparse(., '', true())/*, $patterns)"/>
        </content>
    </xsl:template>

    <xsl:template match="p/text()[1]">
      <xsl:param name="pattern" as="element(pattern)" tunnel="yes"/>
      <xsl:value-of select="replace(., $pattern/@value, '')"/>
    </xsl:template>
</xsl:stylesheet> 

Now it transforms

<?xml version="1.0" encoding="UTF-8"?>
<Content Type="Statutes">
  <Indexes>
    <!--SNIP-->
    <Index Level="3" HasChildren="0">
      <!--SNIP-->
      <Content>&lt;p&gt; (1)(a)The statutes ... &lt;/p&gt;&lt;p&gt; (b)To ensure public ..: &lt;/p&gt;&lt;p&gt; 
            (I)Shall authorize ...; &lt;/p&gt;&lt;p&gt; (II)May authorize and ...: &lt;/p&gt;&lt;p&gt; (A)Compact disks; 
            &lt;/p&gt;&lt;p&gt; (B)On-line public ...; &lt;/p&gt;&lt;p&gt; (C)Electronic applications for ..; 
            &lt;/p&gt;&lt;p&gt; (D)Electronic books or ... &lt;/p&gt;&lt;p&gt; (E)Other electronic products or formats; 
            &lt;/p&gt;&lt;p&gt; (III)May, pursuant ... &lt;/p&gt;&lt;p&gt; (IV)Recognizes that ... &lt;/p&gt;&lt;p&gt; 
            (2)(a)Any person, ...: &lt;/p&gt;&lt;p&gt; (I)A statement specifying ...; &lt;/p&gt;&lt;p&gt; (II)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (3)A statement 
            specifying ...; &lt;/p&gt;&lt;p&gt; (4)A statement 
            specifying ...; &lt;/p&gt;</Content>
    </Index>
    <!--SNIP-->
  </Indexes>
</Content>

to

<law>
   <structure>
      <content>
         <section prefix="(1)">
            <section prefix="(a)">The statutes ... </section>
            <section prefix="(b)">To ensure public ..: <section prefix="(I)">Shall authorize ...; </section>
               <section prefix="(II)">May authorize and ...: <section prefix="(A)">Compact disks;
            </section>
                  <section prefix="(B)">On-line public ...; </section>
                  <section prefix="(C)">Electronic applications for ..;
            </section>
                  <section prefix="(D)">Electronic books or ... </section>
                  <section prefix="(E)">Other electronic products or formats;
            </section>
               </section>
               <section prefix="(III)">May, pursuant ... </section>
               <section prefix="(IV)">Recognizes that ... </section>
            </section>
         </section>
         <section prefix="(2)">
            <section prefix="(a)">Any person, ...: <section prefix="(I)">A statement specifying ...; </section>
               <section prefix="(II)">A statement
            specifying ...; </section>
            </section>
         </section>
         <section prefix="(3)">A statement
            specifying ...; </section>
         <section prefix="(4)">A statement
            specifying ...; </section>
      </content>
   </structure>
</law>

Upvotes: 1

Michael Kay

Reputation: 163458

You've tackled one tricky phase of the problem: creating an intermediate output with elements like this:

<section prefix="(1)(a)">text</section>

My next step would be to compute a level number, so it looks like this:

<section level="1" prefix="(1)(a)">text</section>

Computing the level number is simply a question of seeing which of several regular expressions the prefix matches: (1) gives you level 1, (b) gives you level 2, etc.
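A minimal sketch of that level computation, assuming the flat section elements from the question as input. The function name f:level and its namespace are made up for illustration, and the Roman numeral test is deliberately checked before the capital-letter test, so (I), (V) and (X) always count as level 3:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f">

  <!-- Hypothetical helper: maps the leading marker of a prefix to a depth.
       Returns 0 when no marker is recognised. -->
  <xsl:function name="f:level" as="xs:integer">
    <xsl:param name="prefix" as="xs:string"/>
    <xsl:sequence select="
      if (matches($prefix, '^\([0-9]+\)')) then 1
      else if (matches($prefix, '^\([a-z]+\)')) then 2
      else if (matches($prefix, '^\((I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII|XIII)\)')) then 3
      else if (matches($prefix, '^\([A-Z]+\)')) then 4
      else 0"/>
  </xsl:function>

  <!-- Copy each flat section, adding the computed level attribute. -->
  <xsl:template match="section">
    <section level="{f:level(string(@prefix))}">
      <xsl:copy-of select="@prefix, node()"/>
    </section>
  </xsl:template>

  <!-- Copy every other element unchanged and keep descending. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

A combined prefix such as (1)(a) is classified by its first marker only, so it comes out as level 1. Splitting it into two elements would be a separate step.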

Once you've got level numbers you can use recursive positional grouping as described in this paper: http://www.saxonica.com/papers/ideadb-1.1/mhk-paper.xml
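A minimal sketch of what such recursive grouping could look like, not necessarily the exact formulation in the paper. It assumes the input is a content element holding flat section elements that already carry a numeric level attribute, and that levels never skip a step. The function name f:nest and its namespace are hypothetical:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical helper: nests a flat run of sections, starting at $level.
       Each group begins with a section at $level; the rest of the group is
       handed to the same function one level deeper. -->
  <xsl:function name="f:nest" as="element(section)*">
    <xsl:param name="sections" as="element(section)*"/>
    <xsl:param name="level" as="xs:integer"/>
    <xsl:for-each-group select="$sections"
        group-starting-with="section[xs:integer(@level) = $level]">
      <!-- The context item here is the first section of the group. -->
      <section prefix="{@prefix}">
        <xsl:copy-of select="node()"/>
        <xsl:sequence select="f:nest(current-group()[position() gt 1], $level + 1)"/>
      </section>
    </xsl:for-each-group>
  </xsl:function>

  <!-- Replace the flat children of content with the nested structure. -->
  <xsl:template match="content">
    <content>
      <xsl:sequence select="f:nest(section, 1)"/>
    </content>
  </xsl:template>

  <!-- Copy every other element unchanged and keep descending. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The recursion ends when a group has no members beyond its first section, because for-each-group over an empty sequence produces nothing. A section with a combined prefix such as (1)(a) stays a single level 1 element here, with the following (b) nested inside it.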

Upvotes: 2
