JenPann

Reputation: 61

Java child elements stripped using .getTextContent

I need to get the text with the <br /> tags intact. I was using getTextContent, but it strips the inner tags.

Code:

var nodeList = root.getElementsByTagName("tr");
var nodeCount = nodeList.getLength();
for (row = 0; row < nodeList.getLength(); row++) {
    var node = nodeList.item(row);
    // Legacy Data Key e.g "ENV"
    var DOORSKey = new java.lang.String(node.getElementsByTagName("td").item(0).getTextContent().trim());
    var DOORSKeyCount = DOORSKey.length();
    // DOORSVal e.g. "ALL"
    var DOORSVal = new java.lang.String(node.getElementsByTagName("td").item(1).getNodeValue());

Sample HTML:

<table border="1" cellpadding="3" cellspacing="0">
    <tbody>
        <tr>
            <td>Customer</td>
            <td></td>
        </tr>
        <tr>
            <td>ENV</td>
            <td>ALL</td>
        </tr>
        <tr>
            <td>Module</td>
            <td>6DOF</td>
        </tr>
        <tr>
            <td>Object Level</td>
            <td>5</td>
        </tr>
        <tr>
            <td>XML Profile</td>
            <td>DHS_CBP_HW<br />DHS_CBP_TRAIN<br />GE_B0_HW<br />GE_B0_TRAIN<br />GE_B1_HW<br />GE_B1_JSIL_TRAIN<br />GE_B1_TRAIN<br />GE-ER_HW<br />GE-ER_TRAIN<br />GTS_MQ9<br />ITALY_HW<br />ITALY_TRAIN<br />MQ1_HW<br />MQ1_PMATS_TRAIN<br />MQ1_TRAIN<br />MQ9_BLOCK5_BW_HW<br />MQ9_BLOCK5_BW_TRAIN<br />MQ9_BLOCK5_HW<br />MQ9_BLOCK5_JSIL_TRAIN<br />MQ9_BLOCK5_PMATS_TRAIN<br />MQ9_BLOCK5_TRAIN<br />MQ9_BW_HW<br />MQ9_BW_PMATS_TRAIN<br />MQ9_BW_TRAIN<br />MQ9_HW<br />MQ9_IKHANA_TRAIN<br />MQ9_JSIL_TRAIN<br />MQ9_PMATS_TRAIN<br />MQ9_SPECIAL_HW<br />MQ9_SPECIAL_TRAIN<br />MQ9_TAMLG_TRAIN<br />MQ9_TEST<br />MQ9_TRAIN<br />ORGANIC_DEPOT_HW<br />ORGANIC_DEPOT_BLOCK5_HW<br />ORGANIC_DEPOT_TRAIN<br />ORGANIC_DEPOT_BLOCK5_TRAIN<br />PREDA_ITALY_TRAIN<br />PREDB_ITALY_TRAIN<br />PREDC_AC2_HW<br />PREDC_AC2_TRAIN<br />PREDC_HW<br />PREDC_TRAIN<br />PREDEP_TRAIN<br />PREDXP_HW<br />PREDXP_TRAIN<br />RITI_HW<br />RITI_TRAIN<br />WARRIOR_A_HW<br />WARRIOR_A_JSIL_TRAIN<br />WARRIOR_A_TRAIN</td>
        </tr>
    </tbody>
</table>

I also tried to get the child tags using .getNodeValue, but received an error from the database:

var DOORSVal = new java.lang.String(node.getElementsByTagName("td").item(1).getNodeValue());

Upvotes: 2

Views: 38

Answers (1)

VGR

Reputation: 44413

As you’ve discovered, getTextContent cannot be used for this. You will need to use XSLT in order to preserve both text and elements. (Only XSLT 1.0 is currently supported by Java SE, but this is more than sufficient for your task.)
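
To make the problem concrete, here is a minimal sketch of what the two DOM calls from the question return for a cell that contains <br /> tags. The class name TextContentDemo and the helper parseRoot are made up for this illustration, and the one-cell input is a shortened stand-in for the real table:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class TextContentDemo {
    // Hypothetical helper: parses a small XML fragment and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element td = parseRoot("<td>A<br />B</td>");

        // getTextContent concatenates the text of all descendants and drops the tags.
        System.out.println(td.getTextContent());

        // getNodeValue is defined to be null for element nodes, so wrapping it
        // in new java.lang.String(...) has nothing to work with.
        System.out.println(td.getNodeValue());
    }
}
```

Neither call can return markup: one flattens the cell to plain text, the other yields null for any element.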

You’ll want a template that always analyzes <td> elements, copies only text and <br> child elements, and ignores everything else:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

    <xsl:template match="td" priority="2">
        <xsl:apply-templates select="@*|node()"/>
    </xsl:template>

    <xsl:template match="br|text()" priority="1">
        <xsl:copy-of select="."/>
    </xsl:template>

    <xsl:template match="@*|node()"/>

</xsl:stylesheet>

Java applies an XSLT stylesheet through the Transformer class. The usage looks something like this:

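import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Element;

// Text blocks (""") require Java 15 or later.
String xslt = """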
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="1.0">

        <xsl:template match="td" priority="2">
            <xsl:apply-templates select="@*|node()"/>
        </xsl:template>

        <xsl:template match="br|text()" priority="1">
            <xsl:copy-of select="."/>
        </xsl:template>

        <xsl:template match="@*|node()"/>

    </xsl:stylesheet>
""";

Transformer transformer =
    TransformerFactory.newInstance().newTransformer(
        new StreamSource(new StringReader(xslt)));
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

var trList = root.getElementsByTagName("tr");
var nodeCount = trList.getLength();
for (int row = 0; row < nodeCount; row++) {
    Element tr = (Element) trList.item(row);

    var tdList = tr.getElementsByTagName("td");

    StringWriter DOORSKey = new StringWriter();
    transformer.transform(new DOMSource(tdList.item(0)),
                          new StreamResult(DOORSKey));

    StringWriter DOORSVal = new StringWriter();
    transformer.transform(new DOMSource(tdList.item(1)),
                          new StreamResult(DOORSVal));

    System.out.println("key=" + DOORSKey + ", value=" + DOORSVal);
}
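
For anyone who wants to try this without the original document, the following is a self-contained sketch of the same approach. The class name TdMarkupDemo, the method cellMarkup, and the shortened one-row input are assumptions made for this illustration. The stylesheet is built with string concatenation only so that the sketch does not depend on text blocks:

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TdMarkupDemo {
    // Same three templates as the stylesheet above.
    private static final String XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:template match='td' priority='2'>"
      + "<xsl:apply-templates select='@*|node()'/></xsl:template>"
      + "<xsl:template match='br|text()' priority='1'>"
      + "<xsl:copy-of select='.'/></xsl:template>"
      + "<xsl:template match='@*|node()'/>"
      + "</xsl:stylesheet>";

    // Hypothetical helper: returns the text-plus-<br> content of every <td> in the input.
    static List<String> cellMarkup(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> cells = new ArrayList<>();
        NodeList tdList = doc.getElementsByTagName("td");
        for (int i = 0; i < tdList.getLength(); i++) {
            // Each <td> is transformed on its own, as in the loop above.
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(tdList.item(i)), new StreamResult(out));
            cells.add(out.toString());
        }
        return cells;
    }

    public static void main(String[] args) throws Exception {
        String row = "<tr><td>XML Profile</td>"
                   + "<td>DHS_CBP_HW<br />DHS_CBP_TRAIN</td></tr>";
        List<String> cells = cellMarkup(row);
        System.out.println("key=" + cells.get(0) + ", value=" + cells.get(1));
    }
}
```

The serializer decides how the empty element is written, so the <br /> tags may come back in a slightly different spelling (for example without the space). If the exact original form matters, normalize the result afterwards.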

By the way, there is no reason to use new java.lang.String, since all String objects are immutable and can be safely shared. new String(otherString) accomplishes nothing.

Upvotes: -1
