Reputation: 407

Java XPath API Stripping HTML Tags from Text

I am currently using the Java XPath API to extract some text from a String.

This String, however, often contains HTML formatting (<b>, <em>, <sub>, etc.). When I run my code, the HTML tags are stripped out. Is there any way to avoid this?

Here is a sample input:

<document>
    <summary>
    The <b>dog</b> jumped over the fence.
    </summary>
</document>

Here is a snippet of my code:

XPathFactory factory = XPathFactory.newInstance();  
XPath xPath = factory.newXPath();
InputSource source = new InputSource(new StringReader(xml));
String output = xPath.evaluate("/document/summary", source);

Here is the current output:

The dog jumped over the fence.

Here is the output I want:

The <b>dog</b> jumped over the fence.

Thanks in advance for all your help.

Upvotes: 1

Answers (3)

eDog

Reputation: 173

The parser reads the text as XML and classifies the contents of the summary node as text, element, text. When you evaluate /document/summary as a string, XPath returns the string value of the selected node, which is the concatenation of all its descendant text nodes. This gives you text + element text + text, which is why you lose the bold tag. To keep the markup, the input string inside summary should be either:

HTML encoded -or-
Wrapped in a CDATA tag.

Wrapping the contents in a CDATA section makes the parser treat them as text:

<document>
<summary>
    <![CDATA[The <b>dog</b> jumped over the fence.]]>
</summary>
</document>
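To illustrate the CDATA option, here is a minimal sketch that reuses the XPath call from the question. The class name CdataSummary and the helper summaryText are made up for this example, and the trim() is only there to drop the whitespace around the CDATA section.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class CdataSummary {

    // Evaluates /document/summary as a string, exactly as in the question.
    static String summaryText(String xml) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        // trim() removes the whitespace surrounding the CDATA section
        return xPath.evaluate("/document/summary", source).trim();
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<document><summary>"
                + "<![CDATA[The <b>dog</b> jumped over the fence.]]>"
                + "</summary></document>";
        // The tags survive because the parser sees them as character data
        System.out.println(summaryText(xml));
    }
}
```

The same helper applied to the original input (without CDATA) returns the text with the b tag removed, which matches the behaviour described in the question.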

The problem with your approach is that the parser expects the contents of summary to be well-formed XML. If there were an unbalanced tag inside summary, you would get an exception.

The direct answer to your question is to loop over the child nodes, collecting the text while preserving the element names. This works for your example, but it breaks as soon as the content contains an unbalanced tag:

The <b>dog</b> jumped over <br> the fence

Don't rely on that approach to parse the data inside the summary tag. Instead, either use CDATA or use a regex to get the content between the start and end tags.
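The regex route could look roughly like the sketch below. It assumes a single summary element with no attributes and no nested summary tags, and it does not decode entities such as &amp;amp;. The names SummaryRegex and extractSummary are made up for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class SummaryRegex {

    // DOTALL lets '.' match line breaks; the reluctant '.*?' stops at the
    // first closing </summary> tag. Assumes <summary> has no attributes.
    private static final Pattern SUMMARY =
            Pattern.compile("<summary>(.*?)</summary>", Pattern.DOTALL);

    // Returns the raw text between the summary tags, or empty if there is none.
    static Optional<String> extractSummary(String xml) {
        Matcher m = SUMMARY.matcher(xml);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String xml = "<document>\n<summary>\n"
                + "The <b>dog</b> jumped over <br> the fence\n"
                + "</summary>\n</document>";
        // The unbalanced <br> is harmless here because nothing is parsed as XML
        System.out.println(extractSummary(xml).orElse("(no summary found)"));
    }
}
```

Since the input never goes through an XML parser, the unbalanced tag does not raise an exception, but you also lose everything the parser would do for you (entity decoding, well-formedness checks).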

Upvotes: 0

Vitaliy

Reputation: 489

The <b>dog</b> jumped over the fence

Get the child nodes of the summary element. You will have two Text nodes and one Element node. Treat them accordingly.
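A rough sketch of what treating them accordingly might look like: text nodes are appended as they are, element nodes are wrapped in their tag name again. ChildWalker, innerMarkup and summaryMarkup are names invented for this example. It deliberately ignores attributes, does not re-escape characters such as & or <, and assumes the document contains a summary element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class ChildWalker {

    // Rebuilds the markup of all children of 'parent', recursing into elements.
    static String innerMarkup(Node parent) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Text node: copy the characters as they are (no re-escaping)
                sb.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Element node: re-create the tags around its own children
                // (attributes are dropped in this sketch)
                String name = child.getNodeName();
                sb.append('<').append(name).append('>')
                  .append(innerMarkup(child))
                  .append("</").append(name).append('>');
            }
        }
        return sb.toString();
    }

    // Parses the XML and rebuilds the content of the first summary element.
    static String summaryMarkup(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node summary = doc.getElementsByTagName("summary").item(0);
        return innerMarkup(summary).trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><summary>"
                + "The <b>dog</b> jumped over the fence."
                + "</summary></document>";
        System.out.println(summaryMarkup(xml));
    }
}
```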

Upvotes: 0

vanje

Reputation: 10383

A simple, straightforward (but maybe not very efficient) solution:

/**
 * Serializes an XML node to a string representation without the XML declaration
 * 
 * @param node The XML node
 * @return The string representation
 * @throws TransformerFactoryConfigurationError
 * @throws TransformerException
 */
private static String node2String(Node node) throws TransformerFactoryConfigurationError, TransformerException {
  final Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
  final StringWriter writer = new StringWriter();
  transformer.transform(new DOMSource(node), new StreamResult(writer));
  return writer.toString();
}

/**
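 * Serializes the inner (child) nodes of an XML element.
 * @param el The XML element whose child nodes are serialized
 * @return The concatenated string representation of the child nodes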
 * @throws TransformerFactoryConfigurationError
 * @throws TransformerException
 */
private static String elementInner2String(Element el) throws TransformerFactoryConfigurationError, TransformerException {
  final NodeList children = el.getChildNodes();
  final StringBuilder sb = new StringBuilder();
  for(int i = 0; i < children.getLength(); i++) {
    final Node child = children.item(i);
    sb.append(node2String(child));
  }
  return sb.toString();
}

Then have the XPath evaluation return the node instead of the string (assuming doc is the parsed DOM Document and xpath is an XPath instance):

Element summaryElement = (Element) xpath.evaluate("/document/summary", doc, XPathConstants.NODE);
String output = elementInner2String(summaryElement);
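For completeness, here is a compact, self-contained sketch of how the pieces might be wired together, including how doc and xpath could be obtained. It is a variation on the code above, not the same code: it reuses one Transformer and one StringWriter for all children instead of creating them per node. SummaryExtractor and innerXml are names made up for this example, and the sketch assumes the expression matches an element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class SummaryExtractor {

    // Parses the XML, selects one element via XPath and serializes its children.
    static String innerXml(String xml, String expression) throws Exception {
        // Build the DOM first so XPath can hand back a node, not a string
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Element element = (Element) xpath.evaluate(expression, doc, XPathConstants.NODE);

        // Identity transformer, reused for every child node
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter writer = new StringWriter();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // Each child (text or element) is appended to the same writer
            transformer.transform(new DOMSource(children.item(i)), new StreamResult(writer));
        }
        return writer.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document><summary>"
                + "The <b>dog</b> jumped over the fence."
                + "</summary></document>";
        System.out.println(innerXml(xml, "/document/summary"));
    }
}
```

Because the children are serialized by an XML serializer, special characters in the text come back escaped, so the result stays well-formed markup.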

Upvotes: 2
