I have the following HTML document.
<div class="figure-wrapper" id="figure1">...</div>
<p class="para">Lorem Ipsum (see Fig. 1). Lorem Ipsum (see Fig. 2).</p>
<div class="figure-wrapper" id="figure3">...</div>
<p class="para">Lorem Ipsum (see Fig. 3). Lorem Ipsum (see Fig. 1).</p>
<div class="figure-wrapper" id="figure2">...</div>
What I want to achieve
I want to move each figure (the <div class="figure-wrapper"> element) after the one paragraph that contains the first reference to it.
Example and ideal output
The <div class="figure-wrapper" id="figure1"> element should be placed after the first paragraph only, since that is the first of all paragraphs that references this figure.
<p class="para">Lorem Ipsum (see Fig. 1). Lorem Ipsum (see Fig. 2).</p>
<div class="figure-wrapper" id="figure1">...</div>
<div class="figure-wrapper" id="figure2">...</div>
<p class="para">Lorem Ipsum (see Fig. 3). Lorem Ipsum (see Fig. 1).</p>
<div class="figure-wrapper" id="figure3">...</div>
No explicit references (in terms of HTML elements) to the figure elements exist in the input document. Thus I have to analyze the paragraph contents (e.g. for occurrence of certain values like Fig. x etc.) to infer if a reference to the figure has been made within the paragraph.
What I have come up with so far is the following solution.
I tried a strange mixture of the identity transform pattern, keys and a multi-pass approach, which, however, I can't fully think through.
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xd=""
    xmlns:fn="">

<!-- maximum number of figure references within one paragraph -->
<xsl:variable name="figThreshold" select="100" />
<!-- index of all figure elements -->
<xsl:key name="figure-index" match="node()[@class='figure-wrapper']" use="@id" />

<!-- transformation init -->
<xsl:template match="/">
  <xsl:variable name="pass1">
    <xsl:apply-templates mode="pass1" />
  </xsl:variable>
  <xsl:variable name="pass2">
    <xsl:for-each select="$pass1">
      <xsl:apply-templates mode="pass2" />
    </xsl:for-each>
  </xsl:variable>
  <xsl:copy-of select="$pass2" />
</xsl:template>

<!-- pass 1 start -->
<xsl:template match="node() | @*" mode="pass1">
  <xsl:copy>
    <xsl:apply-templates select="node() | @*" mode="pass1" />
  </xsl:copy>
</xsl:template>
<xsl:template match="node()[name()='p']" mode="pass1" priority="1">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()" mode="pass1" />
  </xsl:copy>
  <xsl:call-template name="locate-and-move-figures" />
</xsl:template>

<!-- iterates x times (see value of figThreshold) over the paragraph text, incrementing the figure number to look for each time -->
<xsl:template name="locate-and-move-figures">
  <xsl:param name="figCount" select="1" />
  <xsl:variable name="figureId" select="concat('figure',$figCount)" />
  <xsl:variable name="searchStringText" select="contains(., concat('Fig. ',$figCount))" />
  <!-- if a figure reference is found within the paragraph, insert the appropriate figure after it -->
  <xsl:if test="$searchStringText">
    <xsl:copy-of select="key('figure-index',$figureId)" />
  </xsl:if>
  <!-- recursive call of template unless threshold value is reached -->
  <xsl:if test="$figCount &lt; $figThreshold">
    <xsl:call-template name="locate-and-move-figures">
      <xsl:with-param name="figCount" select="$figCount + 1" />
    </xsl:call-template>
  </xsl:if>
</xsl:template>

<xsl:template match="node()[@class='figure-wrapper']" mode="pass1" />
<!-- pass 1 end -->

<!-- pass 2 start - elimination of all duplicates -->
<xsl:template match="node() | @*" mode="pass2">
  <xsl:copy>
    <xsl:apply-templates select="node() | @*" mode="pass2" />
  </xsl:copy>
</xsl:template>
<!-- pass 2 end -->

</xsl:stylesheet>
The output I get is this:
<p class="para">Lorem Ipsum (see Fig. 1). Lorem Ipsum (see Fig. 2).</p>
<div class="figure-wrapper" id="figure1">...</div>
<div class="figure-wrapper" id="figure2">...</div>
<p class="para">Lorem Ipsum (see Fig. 3). Lorem Ipsum (see Fig. 1).</p>
<div class="figure-wrapper" id="figure1">...</div>
<div class="figure-wrapper" id="figure3">...</div>
The Problems
The output contains duplicate <div class="figure-wrapper"> elements. I tried to get rid of them in the 2nd pass, but I can't get my head around duplicate removal in combination with the identity transformation pattern. Any help with these problems is highly appreciated.
Upvotes: 2
Views: 410
Reputation: 167651
Here is my suggestion for XSLT 2.0, which in a first step uses analyze-string to transform e.g. (see Fig. 3) into an element <ref name="figure" idref="3"/> and then, in a second step, uses keys to identify the first reference in a p element in order to output the div[@class = 'figure-wrapper']. The second step also transforms the ref elements back into inline text:
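<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>

<!-- step 1: copy of the input in which each "(see Fig. N)" is a ref element -->
<xsl:variable name="references">
  <xsl:apply-templates mode="references"/>
</xsl:variable>

<!-- identity transformation, shared by both steps -->
<xsl:template match="@* | node()" mode="#all">
  <xsl:copy>
    <xsl:apply-templates select="@* , node()" mode="#current"/>
  </xsl:copy>
</xsl:template>

<!-- might want to use match="p[@class = 'para']//text()" -->
<xsl:template match="text()" mode="references" priority="5">
  <xsl:analyze-string select="." regex="\(see Fig\. ([0-9]+)\)">
    <xsl:matching-substring>
      <ref name="figure" idref="{regex-group(1)}"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

<xsl:key name="refs" match="div[@class = 'figure-wrapper']" use="@id"/>
<xsl:key name="fig-refs" match="ref" use="concat(@name, @idref)"/>

<!-- step 2: process the intermediate tree -->
<xsl:template match="/">
  <xsl:apply-templates select="$references/node()"/>
</xsl:template>

<!-- figures are dropped from their original position -->
<xsl:template match="div[@class = 'figure-wrapper']"/>

<!-- a paragraph holding the first reference to some figure: copy it, then append those figures -->
<xsl:template match="p[@class = 'para'][.//ref[. is key('fig-refs', concat(@name, @idref))[1]]]">
  <xsl:next-match/>
  <xsl:variable name="first-refs" select=".//ref[. is key('fig-refs', concat(@name, @idref))[1]]"/>
  <xsl:copy-of select="key('refs', $first-refs/concat(@name, @idref))"/>
</xsl:template>

<!-- turn each ref back into inline text -->
<xsl:template match="ref">
  <xsl:text>(see Fig. </xsl:text>
  <xsl:value-of select="@idref"/>
  <xsl:text>)</xsl:text>
</xsl:template>

</xsl:stylesheet>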
Applying that XSLT with Saxon 9.5 to your input I get
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<p class="para">Lorem Ipsum (see Fig. 1). Lorem Ipsum (see Fig. 2).</p>
<div class="figure-wrapper" id="figure1">...</div>
<div class="figure-wrapper" id="figure2">...</div>
<p class="para">Lorem Ipsum (see Fig. 3). Lorem Ipsum (see Fig. 1).</p>
<div class="figure-wrapper" id="figure3">...</div>
which I think is the order of elements you want.
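To make the first step easier to follow on its own, here is a minimal sketch of just the analyze-string part. It is an illustration under simplifying assumptions, not part of the stylesheet above: it matches whole p elements instead of text nodes and does nothing else.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- simplified for illustration: works on the string value of each p -->
<xsl:template match="p">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:analyze-string select="." regex="\(see Fig\. ([0-9]+)\)">
      <!-- each "(see Fig. N)" becomes a ref element carrying N -->
      <xsl:matching-substring>
        <ref name="figure" idref="{regex-group(1)}"/>
      </xsl:matching-substring>
      <!-- all other text is passed through unchanged -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

With the first sample paragraph as input, the content of the p should become Lorem Ipsum <ref name="figure" idref="1"/>. Lorem Ipsum <ref name="figure" idref="2"/>. which is the form the keys in the second step operate on.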
Upvotes: 1
Reputation: 117043
Here's a different approach you could explore. I did this in XSLT 1.0, but the differences are not essential to the method.
The basic idea is to attach the id of the parent para to each reference contained by the para. Then, using Muenchian grouping, we leave only the first occurrence of each reference. And since each of these retains the id of the original parent, we know where it needs to appear in the final output.
Note that it is assumed there are no independent reference elements (i.e. elements that are not referenced in at least one para).
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:key name="tokens" match="token" use="." />
<xsl:key name="ref" match="div[@class='figure-wrapper']" use="@id" />

<xsl:variable name="root" select="/"/>

<!-- 1. collect all references, along with their parent id -->
<xsl:variable name="references">
    <xsl:for-each select="//p[@class='para']">
        <xsl:call-template name="cat_ref">
            <xsl:with-param name="string" select="."/>
            <xsl:with-param name="pid" select="generate-id()"/>
        </xsl:call-template>
    </xsl:for-each>
</xsl:variable>

<!-- 2. keep only unique references -->
<xsl:variable name="unique-ref" select="exsl:node-set($references)/token[count(. | key('tokens', .)[1]) = 1]"/>

<!-- 3. output -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="p[@class='para']">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    <!-- append my references -->
    <xsl:for-each select="$unique-ref[@pid=generate-id(current())]">
        <xsl:variable name="ref-key" select="."/>
        <!-- switch back to document in order to use key -->
        <xsl:for-each select="$root">
            <!-- figure ids are 'figure' followed by the reference number -->
            <xsl:copy-of select="key('ref', concat('figure', $ref-key))"/>
        </xsl:for-each>
    </xsl:for-each>
</xsl:template>

<!-- suppress references -->
<xsl:template match="div[@class='figure-wrapper']"/>

<!-- proc template -->
<xsl:template name="cat_ref">
    <xsl:param name="string"/>
    <xsl:param name="pid"/>
    <xsl:param name="prefix" select="'(see Fig. '" />
    <xsl:param name="suffix" select="')'" />
    <xsl:if test="contains($string, $prefix) and contains(substring-after($string, $prefix), $suffix)">
        <token pid="{$pid}">
            <xsl:value-of select="substring-before(substring-after($string, $prefix), $suffix)" />
        </token>
        <!-- recursive call -->
        <xsl:call-template name="cat_ref">
            <xsl:with-param name="string" select="substring-after(substring-after($string, $prefix), $suffix)" />
            <xsl:with-param name="pid" select="$pid" />
        </xsl:call-template>
    </xsl:if>
</xsl:template>

</xsl:stylesheet>
Applied to your input, the following result is obtained:
<?xml version="1.0" encoding="UTF-8"?>
<p class="para">Lorem Ipsum (see Fig. 1). Lorem Ipsum (see Fig. 2).</p>
<div class="figure-wrapper" id="figure1">...</div>
<div class="figure-wrapper" id="figure2">...</div>
<p class="para">Lorem Ipsum (see Fig. 3). Lorem Ipsum (see Fig. 1).</p>
<div class="figure-wrapper" id="figure3">...</div>
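The de-duplication in step 2 relies on Muenchian grouping, which can be hard to read when it is embedded in a larger stylesheet. The following is a minimal sketch of that step in isolation. The tokens root element and the unique wrapper are hypothetical names used only for this illustration; the token elements mirror the ones built by the cat_ref template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- index every token by its string value (the figure number) -->
  <xsl:key name="tokens" match="token" use="."/>

  <!-- assumed input for this sketch:
       <tokens>
         <token pid="p1">1</token>
         <token pid="p1">2</token>
         <token pid="p2">3</token>
         <token pid="p2">1</token>
       </tokens> -->
  <xsl:template match="/tokens">
    <unique>
      <!-- a token is kept only if it is the first node
           the key returns for its own value -->
      <xsl:copy-of select="token[count(. | key('tokens', .)[1]) = 1]"/>
    </unique>
  </xsl:template>

</xsl:stylesheet>
```

For the assumed input, the last token (value 1, pid p2) is dropped because an earlier token with the same value exists. The three remaining tokens keep their pid attributes, and that is what tells the main stylesheet which paragraph each figure belongs after.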
Upvotes: 1