J. Nicholas

Reputation: 105

XSLT Streaming complex documents

Most of the examples I see of XSLT 3.0 streaming are fairly simple, and take inputs of the form

<rootTag>
 <repeatingThing>
   <CDataTag>text</CDataTag> 
   <CDataTag2>text</CDataTag2>
 </repeatingThing>
 <repeatingThing>...</repeatingThing>
</rootTag>

Assume you need to touch all the tags inside repeatingThing. In this case streaming works well enough: do a copy-of inside your repeatingThing template, and the memory footprint drops to roughly 1/X of the original, where X is the number of repeatingThing tags.
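
A minimal sketch of that simple case, using the element names from the skeleton above. The output names (items, item, first, second) are made up for illustration, and the stylesheet assumes a streaming-capable processor such as Saxon-EE:

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Declare the default mode streamable; nodes without a matching template are skipped -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <!-- The match pattern is motionless; the body makes a single downward pass -->
  <xsl:template match="rootTag">
    <items>
      <xsl:apply-templates select="repeatingThing"/>
    </items>
  </xsl:template>

  <xsl:template match="repeatingThing">
    <!-- copy-of(.) materializes only this one subtree in memory -->
    <xsl:variable name="thing" select="copy-of(.)"/>
    <!-- $thing is an ordinary in-memory tree, so it can be navigated freely -->
    <item first="{$thing/CDataTag}" second="{$thing/CDataTag2}"/>
  </xsl:template>

</xsl:stylesheet>
```

Only one repeatingThing subtree is held in memory at a time, which is where the saving comes from.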

However, I deal with XML that is highly nested. Additionally, because of the nature of my stylesheet (JSON<->XML conversion), I need to touch all the tags in the XML source document. The copy-of approach won't work here: the content is spread over many levels of child nodes, so I'd be copying the entire XML into memory, just more explicitly.

I'm at a loss as to how to use streaming in this case. A skeleton of such a "hierarchical" document is below:

<n1:ElectionReport xmlns:n1="NIST_V2_election_results_reporting.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <n1:Election>
        <n1:BallotCounts>
            <n1:DeviceClass>
                <n1:Manufacturer/>
                <n1:Model/>
                <n1:Type/>
                <n1:OtherType/>
            </n1:DeviceClass>
            <n1:GpUnitId/>
            <n1:IsSuppressedForPrivacy/>
            <n1:Round/>
            <n1:Type/>
            <n1:OtherType/>
            <n1:BallotsCast/>
            <n1:BallotsOutstanding/>
            <n1:BallotsRejected/>
        </n1:BallotCounts>
        <n1:BallotStyle>
            <n1:ExternalIdentifier>
                <n1:Type/>
                <n1:OtherType/>
                <n1:Value/>
            </n1:ExternalIdentifier>
            <n1:GpUnitIds/>
            <n1:ImageUri/>
            <n1:OrderedContent xsi:type="n1:OrderedContest">
                <n1:ContestId/>
                <n1:OrderedContestSelectionIds/>
            </n1:OrderedContent>
            <n1:PartyIds/>
        </n1:BallotStyle>
        <n1:Candidate ObjectId="">
            <n1:BallotName>
                <n1:Text Language=""/>
            </n1:BallotName>
            <n1:CampaignSlogan>
                <n1:Text Language=""/>
            </n1:CampaignSlogan>
            <n1:ContactInformation>
                <n1:AddressLine/>
                <n1:Directions>
                    <n1:Text Language=""/>
                </n1:Directions>
                <n1:Email/>
                <n1:Fax/>
                <n1:LatLng>
                    <n1:Latitude/>
                    <n1:Longitude/>
                    <n1:Source/>
                </n1:LatLng>
                <n1:Name/>
                <n1:Phone/>
                <n1:Schedule>
                    <n1:Hours>
                        <n1:Day/>
                        <n1:StartTime/>
                        <n1:EndTime/>
                    </n1:Hours>
                    <n1:IsOnlyByAppointment/>
                    <n1:IsOrByAppointment/>
                    <n1:IsSubjectToChange/>
                    <n1:StartDate/>
                    <n1:EndDate/>
                </n1:Schedule>
                <n1:Uri/>
            </n1:ContactInformation>
            <n1:ExternalIdentifier>
                <n1:Type/>
                <n1:OtherType/>
                <n1:Value/>
            </n1:ExternalIdentifier>
            <n1:FileDate/>
            <n1:IsIncumbent/>
            <n1:IsTopTicket/>
            <n1:PartyId/>
            <n1:PersonId/>
            <n1:PostElectionStatus/>
            <n1:PreElectionStatus/>
        </n1:Candidate>
    </n1:Election>
    <n1:SequenceStart/>
    <n1:SequenceEnd/>
    <n1:Status/>
    <n1:TestType/>
    <n1:VendorApplicationId/>
</n1:ElectionReport>

Using Saxon-EE 9.8.0.12

Upvotes: 1

Views: 410

Answers (1)

Martin Honnen

Reputation: 167706

The sample you linked to is too long for me to judge in full, but at least some templates are written in a style that seems too verbose even if you don't want to use streaming, e.g.

<xsl:template name="cdf:LatLng" match="element(*, cdf:LatLng)">
    <xsl:param name="set_type" select="false()"/>
    <xsl:where-populated>
        <string key="Label">
            <xsl:value-of select="@Label"/>
        </string>
    </xsl:where-populated>
    <xsl:where-populated>
        <number key="Latitude">
            <xsl:value-of select="cdf:Latitude"/>
        </number>
    </xsl:where-populated>
    <xsl:where-populated>
        <number key="Longitude">
            <xsl:value-of select="cdf:Longitude"/>
        </number>
    </xsl:where-populated>
    <xsl:where-populated>
        <string key="Source">
            <xsl:value-of select="cdf:Source"/>
        </string>
    </xsl:where-populated>
    <xsl:if test="not($set_type)">
        <string key="@type">ElectionResults.LatLng</string>
    </xsl:if>
</xsl:template>

seems to be doable as

<xsl:template match="LatLng">
    <xsl:param name="set_type" select="false()"/>
    <xsl:apply-templates select="@*"/>
    <xsl:apply-templates/>
    <xsl:if test="not($set_type)">
        <string key="@type">ElectionResults.LatLng</string>
    </xsl:if>
</xsl:template>

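and then, for the child elements and attributes that you know have simple types, you would simply use the approach suggested in my comment, e.g.

<!-- Assumes expand-text="yes" is in effect (e.g. on xsl:stylesheet) so that {.} is evaluated as a text value template -->
<xsl:template match="element(*, xs:string)">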
    <string key="{local-name()}">{.}</string>
</xsl:template>

<xsl:template match="element(*, xs:double) | element(*, xs:decimal)">
    <number key="{local-name()}">{.}</number>
</xsl:template>

Of course this assumes that the child elements are to be processed in the order in which they appear and that you want all of them processed, but the latter restriction can be eased even with streaming if you use e.g. <xsl:apply-templates select="*[self::foo or self::bar]"/>.
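
To put these pieces in context, here is a rough sketch of how the stylesheet skeleton might look. It is only an outline under several assumptions: the schema location is a guess based on the namespace in your sample, the input has to be schema-validated while it is streamed (with Saxon on the command line, e.g. -val:strict) so that the type-based patterns can match, Latitude and Longitude are assumed to be typed as xs:double or xs:decimal, and the map wrapper is only illustrative:

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.w3.org/2005/xpath-functions"
    xpath-default-namespace="NIST_V2_election_results_reporting.xsd"
    expand-text="yes">

  <!-- Assumption: the schema document sits next to the stylesheet -->
  <xsl:import-schema namespace="NIST_V2_election_results_reporting.xsd"
                     schema-location="NIST_V2_election_results_reporting.xsd"/>

  <!-- Unmatched elements fall through to the built-in (streamable) rules -->
  <xsl:mode streamable="yes"/>

  <!-- Complex element: one downward pass over a chosen subset of children.
       The predicate only inspects the child itself, so it stays motionless. -->
  <xsl:template match="LatLng">
    <map key="LatLng">
      <xsl:apply-templates select="*[self::Latitude or self::Longitude]"/>
    </map>
  </xsl:template>

  <!-- Simple types: local-name() is motionless, {.} is the single consuming read -->
  <xsl:template match="element(*, xs:string)">
    <string key="{local-name()}">{.}</string>
  </xsl:template>

  <xsl:template match="element(*, xs:double) | element(*, xs:decimal)">
    <number key="{local-name()}">{.}</number>
  </xsl:template>

</xsl:stylesheet>
```

Each template reads its own children at most once and in document order, which is what keeps the whole thing streamable however deep the nesting goes.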

So, at least where you simply want to map your known schema types to JSON and have spelled out a lot of different templates for the various elements, I think that using apply-templates instead of spelling out the various child selections can help to make the code streamable. For the types where minOccurs=0 and maxOccurs=unbounded are possible, I think you can live with

<!-- group-adjacent forms one group per run of adjacent siblings with the same name -->
<xsl:for-each-group select="*" group-adjacent="node-name()">
  <xsl:variable name="sibling-group" select="copy-of(current-group())"/>
  <xsl:choose>
     <xsl:when test="tail($sibling-group)">
        <array key="{local-name()}">
           <xsl:apply-templates select="$sibling-group"/>
        </array>
     </xsl:when>
     <xsl:otherwise>
        <xsl:apply-templates select="$sibling-group"/>
     </xsl:otherwise>
  </xsl:choose>
</xsl:for-each-group>

instead of the apply-templates. That will of course "materialize" the adjacent sibling group of elements of the same name, but as you seem to have spelled out the explicit creation of arrays in dedicated templates where you need it, you can just rewrite these dedicated templates and don't run the risk of using that approach in general for any element.

If you want to keep the verbose style, with the explicit selection of various child elements in the same template, then you could try how well Saxon does with xsl:fork, e.g.

<xsl:template name="cdf:LatLng" match="element(*, cdf:LatLng)">
    <xsl:param name="set_type" select="false()"/>
    <xsl:fork>
        <xsl:sequence>
            <xsl:where-populated>
                <string key="Label">
                    <xsl:value-of select="@Label"/>
                </string>
            </xsl:where-populated>
        </xsl:sequence>
        <xsl:sequence>
            <xsl:where-populated>
                <number key="Latitude">
                    <xsl:value-of select="cdf:Latitude"/>
                </number>
            </xsl:where-populated>
        </xsl:sequence>
        <xsl:sequence>
            <xsl:where-populated>
                <number key="Longitude">
                    <xsl:value-of select="cdf:Longitude"/>
                </number>
            </xsl:where-populated>
        </xsl:sequence>
        <xsl:sequence>
            <xsl:where-populated>
                <string key="Source">
                    <xsl:value-of select="cdf:Source"/>
                </string>
            </xsl:where-populated>
        </xsl:sequence>
    </xsl:fork>
    <xsl:if test="not($set_type)">
        <string key="@type">ElectionResults.LatLng</string>
    </xsl:if>
</xsl:template>

The call-template use you also have will not, in general, be possible with streaming. It also seems to be used in this stylesheet to process XML elements in an order different from the input order: it seems to output any subelements declared in abstract types after the ones declared in extended types. That of course doesn't work well with the forwards-only, node-by-node processing of streaming. So I guess there you have to decide whether you can output the base subelements first in the JSON.

Upvotes: 1
