Dan

Reputation: 536

Group words by every line via XSL

I'm working with an OCR document that's been converted into XML. This means that the words on the page are quite oddly arranged (path-wise) in the document.

In the XML document, words are laid out like this /document/...../ln/wd
What I'd like my XSLT document to do is print the words in each line on their own output line (i.e. detect the words in the XML document and 'preserve' their formatting).

What I have so far is this, which just prints every wd in the document, regardless of formatting/location.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
<xsl:template match="/">
  <html>
    <body>
        <xsl:value-of select="/document::descendant::wd"/>
    </body>
  </html>
</xsl:template>
</xsl:stylesheet>

Thanks for your help

Upvotes: 0

Views: 90

Answers (2)

michael.hor257k

Reputation: 116993

What I'd like my XSLT document to do is print the words in each line on their own output line (i.e. detect the words in the XML document and 'preserve' their formatting).

Perhaps you could do this simply with:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">

<xsl:output method="text" encoding="utf-8" />

<xsl:template match="/">
<xsl:for-each select="descendant::ss:ln">
    <xsl:for-each select="descendant::ss:wd">
        <xsl:value-of select="." />
        <xsl:if test="position()!=last()">
            <xsl:text>, </xsl:text>
        </xsl:if>
    </xsl:for-each> 
    <xsl:if test="position()!=last()">
        <xsl:text>&#10;</xsl:text>
    </xsl:if>
</xsl:for-each> 
</xsl:template>
</xsl:stylesheet>

I'm not sure what "'preserve' their formatting" means, especially when the output is plain text.

Upvotes: 0

Ian Roberts

Reputation: 122364

From your previous question, the format you're working with is (simplified):

<document xmlns="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <!-- other intervening elements -->
  <ln>
    <wd>First</wd>
    <space/>
    <wd>line</wd>
  </ln>
  <ln>
    <wd>Second</wd>
    <space/>
    <wd>line</wd>
  </ln>
  <ln>
    <run>
      <wd>Word</wd>
      <tab />
    </run>
    <run>
      <wd>another</wd>
      <space/>
    </run>
  </ln>
</document>

So you can approach this quite nicely using template matching:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <xsl:output method="text" />

  <xsl:template match="/">
    <xsl:apply-templates select="//ss:ln" />
  </xsl:template>

  <!-- for a ln, process the descendant words, spaces and tabs in document order -->
  <xsl:template match="ss:ln">
    <xsl:apply-templates select=".//ss:wd | .//ss:space | .//ss:tab" />
    <xsl:text>&#10;</xsl:text><!-- and add a newline character to the end -->
  </xsl:template>

  <!-- replace <space/> with a single space character -->
  <xsl:template match="ss:space">
    <xsl:text> </xsl:text>
  </xsl:template>

  <!-- replace <tab/> with a single tab character -->
  <xsl:template match="ss:tab">
    <xsl:text>&#09;</xsl:text>
  </xsl:template>

  <!-- wd elements use the default built in template rule that will
       just output their contained text -->
</xsl:stylesheet>

If you have any wd elements that contain leading or trailing whitespace then you might want to add an explicit template to handle those:

<xsl:template match="ss:wd">
  <xsl:value-of select="normalize-space()" />
</xsl:template>
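
If you want HTML rather than plain text, as the html/body wrapper in the question's stylesheet suggests, the same templates can be adapted. This is only a sketch under a few assumptions: each ln should end with a br element, and a tab is rendered as an ordinary space because HTML collapses whitespace anyway. Neither choice comes from the question, so adjust to taste.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd"
         exclude-result-prefixes="ss">
  <xsl:output method="html" />

  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="//ss:ln" />
      </body>
    </html>
  </xsl:template>

  <!-- for a ln, emit its words and separators in document order,
       then an HTML line break (assumption: one br per ln) -->
  <xsl:template match="ss:ln">
    <xsl:apply-templates select=".//ss:wd | .//ss:space | .//ss:tab" />
    <br/>
  </xsl:template>

  <!-- assumption: space and tab both become a single space in HTML -->
  <xsl:template match="ss:space | ss:tab">
    <xsl:text> </xsl:text>
  </xsl:template>

  <!-- output each word with leading/trailing whitespace trimmed -->
  <xsl:template match="ss:wd">
    <xsl:value-of select="normalize-space()" />
  </xsl:template>
</xsl:stylesheet>
```

The exclude-result-prefixes attribute keeps the ss namespace declaration from being copied onto the html element in the output.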

Upvotes: 2
