Reputation: 33
I've been given a piece of messy HTML, which I've cleaned with HTML Tidy. I am trying to turn it into a version of DITA.
I want to get the first element with text in it, and turn it into a chapter title.
I've got a file (simplified):
<html><head></head>
<body>
<p><img src="i.gif" alt="int.gif (792 bytes)" border="0" width="105" height="18" />
<strong>
<a class="c1" name="flag" id="flag">Flags</a>
</strong>
</p>
<!-- the elements between the first p and the actual text may vary. -->
<!--more --></body></html>
Or sometimes it is like this:
<html><head></head>
<body>
<table border="0" cellpadding="3" cellspacing="0" width="100%">
<tbody> <!-- sometimes this is missing !! -->
<tr>
<td class="c3" width="100%">
<span class="c2">
<a class="c1" name="Errors" id="errors">Error-Codes</a> <strong>with troubleshooting</strong>
</span>
</td></tr></tbody></table> <!--more --></body></html>
Or could potentially be something else.
I've tried these:
<xsl:template match="body">
<xsl:element name="chapter">
<xsl:element name="title">
<!-- <xsl:value-of select="table[1]//td[1]"/> first td, but not p -->
<!-- <xsl:value-of select="./p[1]//text()"/> first para -->
<!-- <xsl:value-of select="table[1]//td()[1] or p[1]"/> invalid syntax -->
<!-- <xsl:value-of select="text()[1]"/> nothing -->
<!-- <xsl:value-of select="//text()[1]"/> gets all text in document -->
</xsl:element>
</xsl:element>
</xsl:template>
I've also tried
<!-- <xsl:value-of select=".//*[@class='c1'][1]"/> gets the first child node with class="c1" of every subnode, of which there are often many -->
By popular request ;-) this is what I want:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE chapter SYSTEM "our.dtd">
<chapter template-version="01">
<title>Flags</title>
<!-- blabbity blab -->
</chapter>
or
<chapter template-version="01">
<title>Error codes with troubleshooting</title>
<!-- I would also accept just "Error codes",
I could leave some billable work for later -->
<!-- blabbity blab -->
</chapter>
Upvotes: 0
Views: 631
Reputation: 338158
"I want to get the first element with text in it, and turn it into a chapter title."
This is not quite as easy as it may sound. What is "the first element with text in it", anyway?
In your first example, it would be this:
<a class="c1" name="flag" id="flag">Flags</a>
Easy enough. In your second example, by the same logic, it would be this:
<a class="c1" name="Errors" id="errors">Error-Codes</a>
But of course it's not that easy, because you actually want this:
<span class="c2">
<a class="c1" name="Errors" id="errors">Error-Codes</a> <strong>with troubleshooting</strong>
</span>
So what's the defining characteristic of the element that you want to use as your title?
I'll make an educated guess and define it as:
The first non-inline element that does not contain other non-inline elements and has non-empty text.
"Non-inline" here means all block-level elements, plus elements such as <td> that technically differ from block-level elements in ways that are irrelevant in this case.
So using this definition with your first example gets us to:
<p><img src="i.gif" alt="int.gif (792 bytes)" border="0" width="105" height="18" />
<strong>
<a class="c1" name="flag" id="flag">Flags</a>
</strong>
</p>
whose text value is still "Flags".
In your second example, we would end up with this element:
<td class="c3" width="100%">
<span class="c2">
<a class="c1" name="Errors" id="errors">Error-Codes</a> <strong>with troubleshooting</strong>
</span>
</td>
whose text value would be "Error-Codes with troubleshooting".
So the definition seems to work for the examples you gave.
XPath that matches all relevant "non-inline" elements could look like this:
//*[self::p|self::td|self::div|self::and-so-on]
Add more container element types as you need them.
When we include the condition that it should not contain other non-inline elements, we end up with:
//*[self::p|self::td|self::div|self::and-so-on][
not(.//*[self::p|self::td|self::div|self::and-so-on])
]
Adding the condition that it must contain some text:
//*[self::p|self::td|self::div|self::and-so-on][
not(.//*[self::p|self::td|self::div|self::and-so-on])
and normalize-space() != ''
]
...and of all those fulfilling this condition throughout the document, we only need the first one:
(//*[self::p|self::td|self::div|self::and-so-on][
not(.//*[self::p|self::td|self::div|self::and-so-on])
and normalize-space() != ''
])[1]
and of that first one, we want the normalized text value:
normalize-space(
(//*[self::p|self::td|self::div|self::and-so-on][
not(.//*[self::p|self::td|self::div|self::and-so-on])
and normalize-space() != ''
])[1]
)
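A note on the parentheses around the path: they are what makes the [1] apply to the whole document rather than to each parent separately. The following is a minimal sketch to try this out, assuming an XSLT 1.0 processor and plain, un-namespaced input. The td element is used only as a convenient illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- //td[1] is short for /descendant-or-self::node()/child::td[1]:
         it selects every td that is the first td child of its parent -->
    <xsl:value-of select="count(//td[1])"/>
    <xsl:text> vs </xsl:text>
    <!-- the parentheses build the full node-set first, in document order,
         and only then is [1] applied to it -->
    <xsl:value-of select="count((//td)[1])"/>
  </xsl:template>

</xsl:stylesheet>
```

Run against a table with several rows, the first number counts one td per row, while the second is never greater than 1. This is also why //text()[1] in the question selects many text nodes rather than one.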
All of this in XSLT:
<xsl:template match="body">
<title>
<xsl:value-of select="
normalize-space(
(//*[self::p|self::td|self::div|self::and-so-on][
not(.//*[self::p|self::td|self::div|self::and-so-on])
and normalize-space() != ''
])[1]
)
" />
</title>
</xsl:template>
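To get from this template to the output you asked for, the title can be wrapped in the chapter element. What follows is only a sketch under some assumptions: XSLT 1.0, and input without an XHTML namespace, as in your samples. HTML Tidy can add a namespace depending on its options, in which case the element names would need a prefix. The th and li elements are added to the container list purely as examples of "and-so-on". The DOCTYPE and the template-version attribute are copied from your desired output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- DOCTYPE and encoding as shown in the desired output of the question -->
  <xsl:output method="xml" encoding="UTF-8" indent="yes"
      doctype-system="our.dtd"/>

  <!-- Start at the root and go straight to body, so that nothing
       from head leaks into the result through built-in templates -->
  <xsl:template match="/">
    <xsl:apply-templates select="html/body"/>
  </xsl:template>

  <xsl:template match="body">
    <chapter template-version="01">
      <title>
        <!-- first container without nested containers that has text;
             the element list is an assumption, extend it as needed -->
        <xsl:value-of select="normalize-space(
          (.//*[self::p or self::td or self::th or self::div or self::li]
               [not(.//*[self::p or self::td or self::th or self::div or self::li])
                and normalize-space() != ''])[1]
        )"/>
      </title>
      <!-- the remaining chapter content would be generated here -->
    </chapter>
  </xsl:template>

</xsl:stylesheet>
```

With your two samples this should produce the titles "Flags" and "Error-Codes with troubleshooting", in line with the walkthrough above. The path starts at the body context node (.//) instead of the document root, which makes no difference here because head contains no text.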
Upvotes: 1