Reputation: 3056
I need some XSLT (or something - see below) to replace newlines in all attributes with an alternative character.
I am having to process legacy XML which stores all data as attributes, and uses new-lines to express cardinality. For example:
<sample>
<p att="John
Paul
Ringo"></p>
</sample>
These new-lines are being replaced with spaces when I parse the file in Java (as per the XML spec's attribute-value normalisation), but I want to treat them as list delimiters, so this behaviour isn't particularly useful.
My 'solution' was to use XSLT to replace all newlines in all attributes with some other delimiter - but I have zero knowledge of XSLT. All examples I've seen thus far have either been very specific or have replaced node content instead of attribute values.
I have dabbled with XSLT 2.0's replace() but am having a hard time putting everything together.
Is XSLT even the correct solution? The following XSLT:
<xsl:template match="sample/*">
  <xsl:for-each select="@*">
    <xsl:value-of select="replace(current(), '\n', '|')"/>
  </xsl:for-each>
</xsl:template>
applied to the sample XML outputs the following using Saxon:
John Paul Ringo
Obviously this format isn't what I'm after - this is just to experiment with replace() - but have the newlines already been normalised by the time we get to XSLT processing? If so, are there any other ways to parse these values as written using a Java parser? I've only used JAXB thus far.
Upvotes: 2
Views: 1756
Reputation: 3056
I have solved(ish) the issue by preprocessing the XML with JSoup (which is a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.
My code is as follows:
@Test
public void verifyNewlineEscaping() throws IOException {
    // readFileToString throws IOException, hence the throws clause
    final List<Node> nodes = Parser.parseXmlFragment(
            FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");
    fixAttributeNewlines(nodes);

    // Reconstruct XML
    final StringBuilder output = new StringBuilder();
    for (final Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}
/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 *
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 *
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {
    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();
        for (final Attribute attribute : attributes) {
            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }
        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}
For the sample XML in my question, the output of this method is:
<sample>
<p att="John|Paul|Ringo"></p>
</sample>
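To show what the regular expression in fixAttributeNewlines does in isolation, here is a small JDK-only sketch. The class and method names (DelimiterDemo, collapseNewlines, toList) are made up for illustration and are not part of the code above; it also shows how the delimited value might be split back into a list once the cleansed XML has been parsed.

```java
import java.util.Arrays;
import java.util.List;

class DelimiterDemo {

    // Same pattern as in fixAttributeNewlines: a line break plus any
    // whitespace around it (including indentation) becomes one '|'.
    static String collapseNewlines(String value) {
        return value.replaceAll("\\s*\r?\n\\s*", "|");
    }

    // '|' is a regex metacharacter, so it has to be escaped for split().
    static List<String> toList(String delimited) {
        return Arrays.asList(delimited.split("\\|"));
    }

    public static void main(String[] args) {
        String raw = "John\n   Paul\r\n Ringo";
        String delimited = collapseNewlines(raw);
        System.out.println(delimited);         // John|Paul|Ringo
        System.out.println(toList(delimited)); // [John, Paul, Ringo]
    }
}
```

This only works if '|' never occurs in the real attribute values, which is an assumption about the data.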
Note that I am not using a character reference such as &#10; as the delimiter, because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalent, so time will tell whether or not this is a passable solution.
Upvotes: 1
Reputation: 163458
XSLT only sees the XML after it has been processed by the XML parser, which will have done the attribute value normalization.
I think that some XML parsers have an option to suppress attribute value normalization. If you don't have access to such a parser, I think that doing a textual replace of (\r?\n) by &#x0A; prior to parsing might be your best escape route. Newlines that are escaped in this way don't get splatted by attribute value normalization.
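A rough sketch of that pre-processing step in Java, assuming the replacement is the numeric character reference &#10; (line feed). The class AttributeNewlineEscaper and its methods are hypothetical names for illustration. Rather than replacing every newline in the file (which would also hit newlines between attributes inside a tag), this version only touches line breaks inside quoted attribute values, and it assumes the input has no comments, CDATA sections or processing instructions containing '<', '>' or quotes.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeNewlineEscaper {

    // Rewrites raw line breaks inside quoted attribute values as &#10;.
    // Simplifying assumption: no comments, CDATA sections or PIs to confuse the scan.
    static String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        boolean inTag = false;
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote of the attribute value
                } else if (c == '\r' || c == '\n') {
                    if (c == '\r' && i + 1 < xml.length() && xml.charAt(i + 1) == '\n') {
                        i++; // treat \r\n as a single line break
                    }
                    out.append("&#10;");
                    continue;
                }
            } else if (inTag) {
                if (c == '"' || c == '\'') {
                    quote = c;
                } else if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            }
            out.append(c);
        }
        return out.toString();
    }

    // Parses the XML with the JDK's DOM parser and returns the "att" value of the first <p>.
    static String readAtt(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sample>\n<p att=\"John\nPaul\nRingo\"></p>\n</sample>";
        String value = readAtt(escape(xml));
        // The escaped newlines survive parsing, so the value can be split into a list.
        System.out.println(Arrays.asList(value.split("\n"))); // [John, Paul, Ringo]
    }
}
```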
Upvotes: 1
Reputation: 3428
It seems to be hard to do this. As I found in "Are line breaks in XML attribute values allowed?", a newline character in an attribute is valid, but the XML parser normalizes it (https://stackoverflow.com/a/8188290/1324394), so it is probably lost before processing (and thus before replacing).
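The normalization is easy to reproduce with the JDK's built-in DOM parser. The sketch below (NormalisationDemo and readAtt are illustrative names only) parses the sample from the question and shows that the line breaks have already become spaces by the time the attribute value is read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NormalisationDemo {

    // Parses the XML and returns the "att" value of the first <p> element.
    static String readAtt(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sample>\n<p att=\"John\nPaul\nRingo\"></p>\n</sample>";
        String value = readAtt(xml);
        System.out.println(value);                // John Paul Ringo
        System.out.println(value.contains("\n")); // false
    }
}
```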
Upvotes: 2