cforster

Reputation: 577

Use XSLT to transform XML to text with Maximum Width

I am using XSLT (XSLT 2.0 is fine) to transform XML (TEI) to readable plaintext (with some minor modifications/challenges—preserving space for poetry; making titles all upper case).

So far everything is working as I'd like, but in the interests of readability I'd additionally like to limit the length of a line of text output by this transformation to some value (like 80 chars wide), splitting only on spaces (not breaking words apart, etc.). I want to set a maximum length for output lines (of, say, 80 chars), not just output the first, say, 80 chars.

Does anyone have suggestions about the best approach? Is a template that matches all text() and then uses XSLT's built-in string functions the way to go? I'm trying to imagine using string functions (string-length and substring or similar) to do this, but I'm not having any luck yet.

(I could do this separately, using a python script, pretty easily, so perhaps "do it afterwards" may be the best answer. I'd love to know if I'm overlooking a simple solution though.)

Upvotes: 3

Views: 1757

Answers (1)

Dimitre Novatchev

Reputation: 243529

I. Here is a solution I wrote more than 10 years ago.

This transformation (from the FXSL library):

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 xmlns:str-split2lines-func="f:str-split2lines-func"
 exclude-result-prefixes="f str-split2lines-func">

   <xsl:import href="str-foldl.xsl"/>
   <xsl:output method="text"/>

   <str-split2lines-func:str-split2lines-func/>

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>

      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>

      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
          <xsl:call-template name="str-foldl">
            <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
            <xsl:with-param name="pStr" select="$pStr"/>
            <xsl:with-param name="pA0" select="$vrtfParams"/>
          </xsl:call-template>
      </xsl:variable>

      <xsl:for-each select="$vResult/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*" mode="f:FXSL">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>

      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>

      <xsl:choose>
        <xsl:when test="contains($arg1/*[1], $arg2)">
          <xsl:if test="string($arg1/word)">
             <xsl:call-template name="fillLine">
               <xsl:with-param name="pLine" select="$arg1/line[last()]"/>
               <xsl:with-param name="pWord" select="$arg1/word"/>
               <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
             </xsl:call-template>
          </xsl:if>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$arg1/line[last()]"/>
          <word><xsl:value-of select="concat($arg1/word, $arg2)"/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

      <!-- Test if the new word fits into the last line -->
    <xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />

      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + string-length($pWord) > $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

when applied to the following XML document:

<text>
Dec. 13 — As always for a presidential inaugural, security and surveillance were
extremely tight in Washington, DC, last January. But as George W. Bush prepared to
take the oath of office, security planners installed an extra layer of protection: a
prototype software system to detect a biological attack. The U.S. Department of
Defense, together with regional health and emergency-planning agencies, distributed
a special patient-query sheet to military clinics, civilian hospitals and even aid
stations along the parade route and at the inaugural balls. Software quickly
analyzed complaints of seven key symptoms — from rashes to sore throats — for
patterns that might indicate the early stages of a bio-attack. There was a brief
scare: the system noticed a surge in flulike symptoms at military clinics.
Thankfully, tests confirmed it was just that — the flu.
</text>

wraps the text into lines at most 64 characters long (any length can be specified as the value of the parameter $pLineLength), and the result is:

Dec. 13 — As always for a presidential inaugural, security and 
surveillance were extremely tight in Washington, DC, last 
January. But as George W. Bush prepared to take the oath of 
office, security planners installed an extra layer of 
protection: a prototype software system to detect a biological 
attack. The U.S. Department of Defense, together with regional 
health and emergency-planning agencies, distributed a special 
patient-query sheet to military clinics, civilian hospitals and 
even aid stations along the parade route and at the inaugural 
balls. Software quickly analyzed complaints of seven key 
symptoms — from rashes to sore throats — for patterns that might 
indicate the early stages of a bio-attack. There was a brief 
scare: the system noticed a surge in flulike symptoms at 
military clinics. Thankfully, tests confirmed it was just that — 
the flu. 

The separate stylesheet that is imported in the above transformation is:

str-foldl.xsl:


<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 exclude-result-prefixes="f">
    <xsl:template name="str-foldl">
      <xsl:param name="pFunc" select="/.."/>
      <xsl:param name="pA0"/>
      <xsl:param name="pStr"/>

      <xsl:choose>
         <xsl:when test="not(string($pStr))">
            <xsl:copy-of select="$pA0"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="vFunResult">
              <xsl:apply-templates select="$pFunc[1]" mode="f:FXSL">
                <xsl:with-param name="arg0" select="$pFunc[position() > 1]"/>
                <xsl:with-param name="arg1" select="$pA0"/>
                <xsl:with-param name="arg2" select="substring($pStr,1,1)"/>
              </xsl:apply-templates>
            </xsl:variable>

            <xsl:call-template name="str-foldl">
              <xsl:with-param name="pFunc" select="$pFunc"/>
              <xsl:with-param name="pStr" select="substring($pStr,2)"/>
              <xsl:with-param name="pA0" select="$vFunResult"/>
            </xsl:call-template>
         </xsl:otherwise>
      </xsl:choose>

    </xsl:template>
</xsl:stylesheet>

Do note that this is essentially an XSLT 1.0 solution. A shorter XSLT 2.0 solution is possible using the regular expression capabilities of XSLT 2.0.
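
As a rough illustration of what a shorter XSLT 2.0 version could look like (a sketch, not part of FXSL): the greedy rule that fillLine applies above, namely append the next word if it still fits and otherwise start a new line, can be written as a recursive function over the words returned by tokenize(). The function name w:wrap and the urn:example:wrap namespace are made up for this example, and words are assumed to be separated by whitespace only.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:w="urn:example:wrap"
 exclude-result-prefixes="xs w">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: returns the wrapped lines as a sequence of strings.
       $words = words still to place, $line = line built so far,
       $max = maximum line length -->
  <xsl:function name="w:wrap" as="xs:string*">
    <xsl:param name="words" as="xs:string*"/>
    <xsl:param name="line" as="xs:string"/>
    <xsl:param name="max" as="xs:integer"/>

    <xsl:choose>
      <!-- No words left: emit the pending line, if there is one -->
      <xsl:when test="empty($words)">
        <xsl:sequence select="$line[. ne '']"/>
      </xsl:when>
      <!-- Empty line: the next word always goes on it, even if too long -->
      <xsl:when test="$line eq ''">
        <xsl:sequence select="w:wrap(remove($words, 1), $words[1], $max)"/>
      </xsl:when>
      <!-- The word (plus one separating space) still fits: append it -->
      <xsl:when test="string-length($line) + 1 + string-length($words[1]) le $max">
        <xsl:sequence
         select="w:wrap(remove($words, 1), concat($line, ' ', $words[1]), $max)"/>
      </xsl:when>
      <!-- It does not fit: emit the line and start a new one -->
      <xsl:otherwise>
        <xsl:sequence select="$line, w:wrap($words, '', $max)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template match="/">
    <xsl:value-of
     select="w:wrap(tokenize(normalize-space(.), ' '), '', 64)"
     separator="&#10;"/>
  </xsl:template>
</xsl:stylesheet>
```

In this sketch a word longer than the limit ends up on a line of its own. The recursion depth grows with the length of the text, so for very long text nodes the processor's recursion limits may matter.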


II. Using XSLT 2.0 Regex

Here is a function, f:getLine(), that, when passed a string and a maximum line length, returns the first line of that string: the longest initial substring of the first maximum-line-length chunk that ends on a word boundary. The transformation below uses this function to produce the first line of the wanted multi-line result.

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()">
    <xsl:sequence select="f:getLine(., 64)"/>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>

When this transformation is applied to the same XML document, the correct first line is produced:

Dec. 13 — As always for a presidential inaugural, security and

Finally, the complete XSLT 2.0 transformation with RegEx:

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()" name="reformat">
    <xsl:param name="pText" select="translate(., '&#xA;', ' ')"/>
    <xsl:param name="pMaxLength" select="64"/>
    <xsl:param name="pTotalLength" select="string-length(.)"/>
    <xsl:param name="pLengthFormatted" select="0"/>

    <xsl:if test="not($pLengthFormatted >= $pTotalLength)">
        <xsl:variable name="vNextLine" 
         select="f:getLine(substring($pText, $pLengthFormatted+1), $pMaxLength)"/>
        <xsl:sequence select="concat($vNextLine, '&#xA;')"/>

        <xsl:call-template name="reformat">
          <xsl:with-param name="pText" select="$pText"/>
          <xsl:with-param name="pMaxLength" select="$pMaxLength"/>
          <xsl:with-param name="pTotalLength" select="$pTotalLength"/>
          <xsl:with-param name="pLengthFormatted" 
                    select="$pLengthFormatted + string-length($vNextLine)"/>
        </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>
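
For comparison, another compact option in XSLT 2.0 is a single replace() call instead of recursion. This is only a sketch and not part of the code above: the stylesheet parameter pMaxLength is named here for illustration, and whitespace is assumed to be collapsible with normalize-space(), which would not suit poetry lines where spacing has to be preserved.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed parameter: maximum line length -->
  <xsl:param name="pMaxLength" select="64"/>

  <xsl:template match="/*">
    <!-- Collapse whitespace and add a trailing space so that
         the last word is also followed by a space -->
    <xsl:variable name="vText" select="concat(normalize-space(.), ' ')"/>
    <!-- Builds a regex such as '(.{0,64}) ' -->
    <xsl:variable name="vRegex" select="concat('(.{0,', $pMaxLength, '}) ')"/>
    <!-- Replace the space that ends each chunk with a newline -->
    <xsl:value-of select="replace($vText, $vRegex, '$1&#xA;')"/>
  </xsl:template>
</xsl:stylesheet>
```

The regex greedily takes up to pMaxLength characters that are followed by a space and replaces that space with a newline. A word longer than the limit is left unbroken, so its line can exceed the limit.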

Upvotes: 6
