nullPainter

Reputation: 3056

Replacing newlines in XML attributes with XSLT

I need some XSLT (or something - see below) to replace newlines in all attributes with an alternative character.

I have to process legacy XML which stores all data as attributes and uses newlines to express cardinality. For example:

<sample>
    <p att="John
    Paul
    Ringo"></p>
</sample>

These newlines are replaced with spaces when I parse the file in Java (as per the XML spec), but I want to treat them as list separators, so this behaviour isn't particularly useful.

My 'solution' was to use XSLT to replace all newlines in all attributes with some other delimiter - but I have zero knowledge of XSLT. All examples I've seen thus far have either been very specific or have replaced node content instead of attribute values.

I have dabbled with XSLT 2.0's replace() but am having a hard time putting everything together.

Is XSLT even the correct solution? The XSLT below:

<xsl:template match="sample/*">
    <xsl:for-each select="@*">
        <xsl:value-of select="replace(current(), '\n', '|')"/>
    </xsl:for-each>
</xsl:template>

applied to the sample XML outputs the following using Saxon:

John Paul Ringo

Obviously this format isn't what I'm after (this is just to experiment with replace()), but have the newlines already been normalised by the time we get to XSLT processing? If so, are there any other ways to parse these values as written using a Java parser? I've only used JAXB thus far.

Upvotes: 2

Views: 1756

Answers (3)

nullPainter

Reputation: 3056

I have solved(ish) the issue by preprocessing the XML with JSoup (a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows:

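// Requires jsoup (org.jsoup.nodes.Attribute, org.jsoup.nodes.Node, org.jsoup.parser.Parser),
// Apache Commons IO (FileUtils) and Apache Commons Lang (StringUtils).
@Test
public void verifyNewlineEscaping() throws IOException {
    // sourcePath is a java.nio.file.Path field (not shown) pointing at the XML file to cleanse
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");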

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

Note that I am not using &#10; because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalent, so time will tell whether or not this is a passable solution.

Upvotes: 1

Michael Kay

Reputation: 163458

XSLT only sees the XML after it has been processed by the XML parser, which will have done the attribute value normalization.
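
To see the normalization at the parser level, before any XSLT runs, here is a minimal sketch using the JDK's built-in DOM parser. The class and method names (NormalizationDemo, readAtt) are made up for illustration, and the input is a simplified version of the sample from the question without the indentation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NormalizationDemo {

    /** Parses the XML and returns the "att" attribute of the first <p> element. */
    static String readAtt(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element p = (Element) doc.getElementsByTagName("p").item(0);
            return p.getAttribute("att");
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Literal line breaks inside the attribute value
        String xml = "<sample><p att=\"John\nPaul\nRingo\"></p></sample>";
        String value = readAtt(xml);

        System.out.println(value.contains("\n")); // false
        System.out.println(value);                // John Paul Ringo
    }
}
```

Each literal line break has become a single space by the time the value is handed to the application, so a stylesheet working on the parsed tree has nothing left to match with '\n'.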

I think that some XML parsers have an option to suppress attribute value normalization. If you don't have access to such a parser, I think that doing a textual replace of (\r?\n) by &#x0A; prior to parsing might be your best escape route. Newlines that are escaped in this way don't get splatted by attribute value normalization.
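
A rough sketch of that textual replacement in Java follows. It is an assumption-laden illustration rather than a robust tool: the class and method names (AttributeNewlineEscaper, escape, readAtt) are invented here, and the scanner only understands plain tags, so comments, CDATA sections, processing instructions or a DOCTYPE containing quotes or angle brackets would confuse it. It restricts the replacement to quoted attribute values so that line breaks between attributes inside a tag are left alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeNewlineEscaper {

    /** Replaces \r\n, \r and \n inside quoted attribute values with &#x0A;. */
    static String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        boolean inTag = false; // true between '<' and '>'
        char quote = 0;        // the open quote character, or 0 outside a value
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;
                    out.append(c);
                } else if (c == '\r') {
                    // Treat \r\n (and a lone \r) as one line break
                    if (i + 1 < xml.length() && xml.charAt(i + 1) == '\n') {
                        i++;
                    }
                    out.append("&#x0A;");
                } else if (c == '\n') {
                    out.append("&#x0A;");
                } else {
                    out.append(c);
                }
            } else {
                if (inTag) {
                    if (c == '"' || c == '\'') {
                        quote = c;
                    } else if (c == '>') {
                        inTag = false;
                    }
                } else if (c == '<') {
                    inTag = true;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Parses the XML and returns the "att" attribute of the first <p> element. */
    static String readAtt(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return ((Element) doc.getElementsByTagName("p").item(0)).getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<sample>\n    <p att=\"John\n    Paul\n    Ringo\"></p>\n</sample>";
        String value = readAtt(escape(raw));
        // The escaped line breaks survive parsing, so the value can be split into a list
        for (String item : value.split("\n")) {
            System.out.println(item.trim());
        }
    }
}
```

Because the line breaks arrive as character references, the parser keeps them in the attribute value, and an XSLT replace() on '\n' would then also have something to match.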

Upvotes: 1

Jirka Š.

Reputation: 3428

This seems hard to do. As I found in "Are line breaks in XML attribute values allowed?", a newline character in an attribute value is valid, but the XML parser normalizes it (https://stackoverflow.com/a/8188290/1324394), so it is probably lost before XSLT processing (and thus before any replacement).

Upvotes: 2
