Reputation: 536
I'm working with an OCR document that's been converted into XML. This means that the words on the page are quite oddly arranged (path-wise) in the document.
In the XML document, words are laid out like this: /document/...../ln/wd
What I'd like my XSLT document to do is print the words in each line on their own output line (i.e. detect the words in the XML document and 'preserve' their formatting).
What I have so far is this, which just prints every wd in the document, regardless of formatting/location.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <xsl:template match="/">
    <html>
      <body>
        <xsl:value-of select="/document::descendant::wd"/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
Thanks for your help
Upvotes: 0
Views: 90
Reputation: 116993
What I'd like my XSLT document to do is print the words in each line on their own output line (i.e. detect the words in the XML document and 'preserve' their formatting).
Perhaps you could do this simply by:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <xsl:output method="text" encoding="utf-8" />
  <xsl:template match="/">
    <xsl:for-each select="descendant::ss:ln">
      <!-- words within a line, separated by ", " -->
      <xsl:for-each select="descendant::ss:wd">
        <xsl:value-of select="." />
        <xsl:if test="position()!=last()">
          <xsl:text>, </xsl:text>
        </xsl:if>
      </xsl:for-each>
      <!-- newline between lines, but not after the last one -->
      <xsl:if test="position()!=last()">
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
Not sure what "'preserve' their formatting" means - esp. when the output is plain text.
Upvotes: 0
Reputation: 122364
From your previous question, the format you're working with is (simplified):
<document xmlns="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <!-- other intervening elements -->
  <ln>
    <wd>First</wd>
    <space/>
    <wd>line</wd>
  </ln>
  <ln>
    <wd>Second</wd>
    <space/>
    <wd>line</wd>
  </ln>
  <ln>
    <run>
      <wd>Word</wd>
      <tab />
    </run>
    <run>
      <wd>another</wd>
      <space/>
    </run>
  </ln>
</document>
So you can approach this quite nicely using template matching:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <xsl:output method="text" />
  <xsl:template match="/">
    <xsl:apply-templates select="//ss:ln" />
  </xsl:template>
  <!-- for a ln, process the descendant words and spaces in document order -->
  <xsl:template match="ss:ln">
    <xsl:apply-templates select=".//ss:wd | .//ss:space | .//ss:tab" />
    <xsl:text>&#10;</xsl:text><!-- and add a newline character to the end -->
  </xsl:template>
  <!-- replace <space/> with a single space character -->
  <xsl:template match="ss:space">
    <xsl:text> </xsl:text>
  </xsl:template>
  <!-- replace <tab/> with a single tab character -->
  <xsl:template match="ss:tab">
    <xsl:text>&#9;</xsl:text>
  </xsl:template>
  <!-- wd elements use the default built in template rule that will
       just output their contained text -->
</xsl:stylesheet>
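As a side note on the attempt in the question: /document::descendant::wd is not valid XPath (document is not an axis name), and because the source elements are in the OmniPage default namespace, an unprefixed wd would match nothing in XSLT 1.0 anyway. Also, xsl:value-of on a node-set in 1.0 only outputs the first node. A minimal sketch of what that attempt was presumably aiming for (every word, ignoring line structure), assuming the ss prefix is bound as above:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd">
  <xsl:output method="text" />
  <xsl:template match="/">
    <!-- ss:wd (with the prefix) matches wd elements in the OmniPage namespace -->
    <xsl:for-each select="//ss:wd">
      <xsl:value-of select="." />
      <!-- separate words with a space; the last word gets a trailing one too -->
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This is only for comparison; it flattens everything onto one line, which is exactly what the template-based version avoids.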
If you have any wd elements that contain leading or trailing whitespace then you might want to add an explicit template to handle those:
<xsl:template match="ss:wd">
  <xsl:value-of select="normalize-space()" />
</xsl:template>
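If you actually need HTML output, as the html/body wrapper in your original attempt suggests, a newline character won't show up as a line break in a browser. A rough sketch of an HTML variant follows; emitting a br element after each ln and turning tabs into plain spaces are my assumptions about what you'd want, not something taken from your document:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ss="http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd"
    exclude-result-prefixes="ss">
  <xsl:output method="html" encoding="utf-8" />
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="//ss:ln" />
      </body>
    </html>
  </xsl:template>
  <!-- one output line per ln, terminated by a br element (assumption) -->
  <xsl:template match="ss:ln">
    <xsl:apply-templates select=".//ss:wd | .//ss:space | .//ss:tab" />
    <br />
  </xsl:template>
  <!-- browsers collapse tabs, so treat space and tab alike (assumption) -->
  <xsl:template match="ss:space | ss:tab">
    <xsl:text> </xsl:text>
  </xsl:template>
  <!-- trim stray whitespace inside each word -->
  <xsl:template match="ss:wd">
    <xsl:value-of select="normalize-space()" />
  </xsl:template>
</xsl:stylesheet>
```

The exclude-result-prefixes attribute just keeps the ss namespace declaration out of the generated html element.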
Upvotes: 2