

XSLT find first text node

I've been given a piece of messy HTML, which I've cleaned with HTML Tidy. I am trying to turn it into a version of DITA.

I want to get the first element with text in it, and turn it into a chapter title.

I've got a file (simplified):

<html><head></head>
<body>
<p><img src="i.gif" alt="int.gif (792 bytes)" border="0" width="105" height="18" /> 
    <strong>
       <a class="c1" name="flag" id="flag">Flags</a>
     </strong>
 </p>
<!-- the elements between the first p and the actual text may vary. -->
<!--more -->

Or sometimes it is like this:

  <html><head></head>
    <body>
    <table border="0" cellpadding="3" cellspacing="0" width="100%">
    <tbody> <!-- sometimes this is missing !! -->
    <tr>
    <td class="c3" width="100%">
        <span class="c2">
            <a class="c1" name="Errors" id="errors">Error-Codes</a>   <strong>with troubleshooting</strong>
        </span>
     </td></tr></tbody></table>  <!--more --></body></html>

Or could potentially be something else.

I've tried these:

<xsl:template match="body">
    <xsl:element name="chapter">
        <xsl:element name="title">
            <!-- <xsl:value-of select="table[1]//td[1]"/> first td, but not p -->
            <!-- <xsl:value-of select="./p[1]//text()"/> first para -->
            <!-- <xsl:value-of select="table[1]//td()[1] or p[1]"/> invalid syntax -->
            <!-- <xsl:value-of select="text()[1]"/>  nothing -->
            <!-- <xsl:value-of select="//text()[1]"/> gets all text in document -->
        </xsl:element>

I've also tried

<!--  <xsl:value-of select=".//*[@class='c1'][1]"/> gets the first instance of a child node with class="c1" in every subnode, of which there are often many -->

By popular request ;-) this is what I want:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE chapter SYSTEM "our.dtd">
<chapter template-version="01">
    <title>Flags</title>
<!-- blabbity blab -->
</chapter>

or

<chapter template-version="01">
    <title>Error codes with troubleshooting</title>
      <!-- I would also accept just "Error codes", 
           I could leave some billable work for later -->
<!-- blabbity blab -->
</chapter>


Answers (1)

Tomalak


I want to get the first element with text in it, and turn it into a chapter title.

This is not quite as easy as it may sound. What is "the first element with text in it", anyway?

In your first example, it would be this:

<a class="c1" name="flag" id="flag">Flags</a>

Easy enough. In your second example, by the same logic, it would be this:

<a class="c1" name="Errors" id="errors">Error-Codes</a>

But of course it's not that easy, because you actually want this:

<span class="c2">
    <a class="c1" name="Errors" id="errors">Error-Codes</a>   <strong>with troubleshooting</strong>
</span>

So what's the defining characteristic of the element that you want to use as your title?

I'll make an educated guess and define it as:

The first non-inline element that does not contain other non-inline elements and has non-empty text.

"Non-inline" means all block-level elements as well as <td> and so on, which have technical differences to block-level elements that are irrelevant in this case.


So using this definition with your first example gets us to:

<p><img src="i.gif" alt="int.gif (792 bytes)" border="0" width="105" height="18" /> 
    <strong>
       <a class="c1" name="flag" id="flag">Flags</a>
     </strong>
</p>

whose text value is still "Flags".

In your second example, we would end up with this element:

<td class="c3" width="100%">
    <span class="c2">
        <a class="c1" name="Errors" id="errors">Error-Codes</a>   <strong>with troubleshooting</strong>
    </span>
</td>

whose text value would be "Error-Codes with troubleshooting".

Seems the definition works for the examples you gave.


An XPath expression that matches all relevant "non-inline" elements could look like this:

//*[self::p|self::td|self::div|self::and-so-on]

Here and-so-on is only a placeholder: replace it with whatever further container element types you need.
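
To see which elements such a list actually picks up in a given file, a throwaway stylesheet along these lines could help. It is only a sketch: the concrete names p, td, th, div and li are my assumption standing in for the open-ended list, and the input is assumed to be in no namespace (if Tidy emitted namespaced XHTML, the names would need a prefix bound to that namespace). I've written the test with or instead of |; inside a predicate both select the same elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain-text report, one line per matched element. -->
  <xsl:output method="text" />

  <xsl:template match="/">
    <!-- ASSUMPTION: p, td, th, div and li stand in for the open-ended list. -->
    <xsl:for-each select="//*[self::p or self::td or self::th or self::div or self::li]">
      <!-- Position in document order, element name, whitespace-normalized text. -->
      <xsl:value-of select="concat(position(), ' ', name(), ': ', normalize-space(), '&#10;')" />
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If an element you expected is missing from the report, its name probably needs to be added to the list.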

When we include the condition that it should not contain any other such container elements, we end up with:

//*[self::p|self::td|self::div|self::and-so-on][
    not(.//*[self::p|self::td|self::div|self::and-so-on])
]

Adding the condition that it must contain some text:

//*[self::p|self::td|self::div|self::and-so-on][
    not(.//*[self::p|self::td|self::div|self::and-so-on])
    and normalize-space() != ''
]

...and that of all those fulfilling this condition throughout the document, we only need the first one:

(//*[self::p|self::td|self::div|self::and-so-on][
    not(.//*[self::p|self::td|self::div|self::and-so-on])
    and normalize-space() != ''
])[1]

and of that first one, we want the normalized text value:

normalize-space(
    (//*[self::p|self::td|self::div|self::and-so-on][
        not(.//*[self::p|self::td|self::div|self::and-so-on])
        and normalize-space() != ''
    ])[1]
)

All of this in XSLT:

<xsl:template match="body">
  <title>
    <xsl:value-of select="
        normalize-space(
            (//*[self::p|self::td|self::div|self::and-so-on][
                not(.//*[self::p|self::td|self::div|self::and-so-on])
                and normalize-space() != ''
            ])[1]
        )
    " />
  </title>
</xsl:template>
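
For completeness, here is a minimal sketch of a whole stylesheet that wraps the title in the chapter element from your desired output. The doctype and the template-version attribute are taken from your question; the concrete element list (p, td, th, div, li) is my assumption and should be adjusted to your files. It also assumes the tidied HTML is in no namespace, and it searches below body only (.//) rather than the whole document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Emits the XML declaration and <!DOCTYPE chapter SYSTEM "our.dtd">. -->
  <xsl:output method="xml" encoding="UTF-8" indent="yes"
      doctype-system="our.dtd" />

  <!-- Process only the body, so no stray text from head leaks into the output. -->
  <xsl:template match="/">
    <xsl:apply-templates select="html/body" />
  </xsl:template>

  <xsl:template match="body">
    <chapter template-version="01">
      <title>
        <!-- ASSUMPTION: p, td, th, div and li are the container elements.
             Take the first one (in document order) that contains no other
             container and has non-empty text, then normalize its text. -->
        <xsl:value-of select="normalize-space(
            (.//*[self::p or self::td or self::th or self::div or self::li][
                not(.//*[self::p or self::td or self::th or self::div or self::li])
                and normalize-space() != ''
            ])[1]
        )" />
      </title>
      <!-- The rest of the chapter content would be generated here. -->
    </chapter>
  </xsl:template>

</xsl:stylesheet>
```

If no element matches, the title simply comes out empty, so you may want to check for that case in files that follow neither of your two layouts.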
