Reputation: 223
I have a text document that I want to convert to XML using XSLT for easier processing. The source file is pretty general, such as this:
[{c=1,d=2},{cc=11,dd=22}]%{f=4,g=5,h={i=6,j=[7,8]}}%
I'd like to transform this to an XML file such as this:
<document>
<header>
<item>
<c>1</c>
<d>2</d>
</item>
<item>
<cc>11</cc>
<dd>22</dd>
</item>
</header>
<content>
<f>4</f>
<g>5</g>
<h>
<i>6</i>
<j>
<elt>7</elt>
<elt>8</elt>
</j>
</h>
</content>
</document>
So in essence, the string before an "=" is the tag name and everything after it is the content (with nesting); the only additions are the document, header, content and elt nodes. The original file will likely contain each value and all "}" on separate lines, but that is not guaranteed (I don't know if that matters or not).
I found some answers for similar cases where text is converted to XML, but there the resulting node names and nesting levels are always known beforehand. My gut feeling is that there should be a relatively simple solution to this, but unfortunately I know only that XSLT is powerful and useful, not how to write it...
Thanks in advance for the help, DeColaman
Upvotes: 0
Views: 2505
Reputation: 5256
As Michael suggested, this indeed looks like a nice exercise for REx. The sample shows some similarity to JSON, but for demonstration, let's guess an even simpler REx grammar:
source ::= item '%' item '%' eof
item ::= '{' ( named-item ( ',' named-item )* )? '}'
| '[' ( item ( ',' item )* )? ']'
| element
named-item ::= name '=' item
<?TOKENS?>
name ::= [a-z]+
element ::= [0-9]+
eof ::= $
Put it in a file named source.ebnf, and use REx to generate an XSLT-coded parser from it, by configuring options XSLT and parse tree, or using command line -xslt -tree.
The parser contains a function named p:parse-source that accepts the input as a string and turns it into a concrete syntax tree according to the above grammar. The syntax tree contains an element for each nonterminal or named token, and a TOKEN element for each unnamed token.
That syntax tree must then be transformed into the target structure. Import the generated parser from file source.xslt into the XSLT below:
<xsl:stylesheet xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
xmlns:p="source">
<xsl:import href="source.xslt"/>
<xsl:output indent="yes"/>
<xsl:variable name="input" select="'[{c=1,d=2},{cc=11,dd=22}]%{f=4,g=5,h={i=6,j=[7,8]}}%'"/>
<xsl:template match="/">
<xsl:variable name="parse-tree" select="p:parse-source($input)"/>
<xsl:choose>
<xsl:when test="not($parse-tree/self::source)">
<xsl:sequence select="$parse-tree"/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="item">
<xsl:apply-templates select="$parse-tree/item"/>
</xsl:variable>
<xsl:element name="document">
<xsl:element name="header">
<xsl:sequence select="$item/*[1]/node()"/>
</xsl:element>
<xsl:element name="content">
<xsl:sequence select="$item/*[2]/node()"/>
</xsl:element>
</xsl:element>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="item">
<xsl:variable name="items">
<xsl:apply-templates select="*[not(self::TOKEN)]"/>
</xsl:variable>
<xsl:choose>
<xsl:when test="count($items/*) eq 1">
<xsl:sequence select="$items"/>
</xsl:when>
<xsl:otherwise>
<xsl:element name="item">
<xsl:sequence select="$items"/>
</xsl:element>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="named-item">
<xsl:element name="{name}">
<xsl:variable name="item">
<xsl:apply-templates select="item"/>
</xsl:variable>
<xsl:sequence select="$item/*/node()"/>
</xsl:element>
</xsl:template>
<xsl:template match="element">
<xsl:element name="elt">
<xsl:sequence select="node()"/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Running the above on an XSLT 2.0 processor, e.g. Saxon, will generate the desired result.
Upvotes: 1
Reputation: 163595
You're basically trying to write a parser for some grammar. That is quite feasible, but it helps to know exactly what the grammar is, and it helps to know a little bit about how to write a recursive descent parser. From your sample it looks like a recursive grammar, which means you can't do it purely using regular expressions.
You might like to take a look at REx, Gunther Rademacher's tool for generating parsers in XQuery or (recently) XSLT. It's not well documented but it's very powerful.
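As a rough illustration of the recursive descent approach mentioned above, here is a minimal sketch for the sample input. It is written in Python rather than XSLT purely to keep it short, and it is not taken from any of the answers. The grammar it assumes (lowercase names, unsigned integer values, maps in braces, lists in brackets, two sections each terminated by "%") is guessed from the single sample in the question, and the names Parser and parse_document are made up for this example. The question does not show how a list nested directly inside another list should be rendered; wrapping it in an item element is an assumption.

```python
import xml.etree.ElementTree as ET


def is_name_char(ch):
    return "a" <= ch <= "z"


def is_digit(ch):
    return "0" <= ch <= "9"


class Parser:
    """Recursive descent parser: one method per rule of the assumed grammar."""

    def __init__(self, text):
        # Assumption: whitespace and line breaks carry no meaning, so drop them.
        self.text = "".join(text.split())
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at offset {self.pos}")
        self.pos += 1

    def take_while(self, accept, what):
        start = self.pos
        while self.pos < len(self.text) and accept(self.text[self.pos]):
            self.pos += 1
        if self.pos == start:
            raise SyntaxError(f"expected {what} at offset {start}")
        return self.text[start:self.pos]

    def parse_value(self, parent):
        # value ::= map | list | number
        if self.peek() == "{":
            self.parse_map(parent)
        elif self.peek() == "[":
            self.parse_list(parent)
        else:
            parent.text = self.take_while(is_digit, "a number")

    def parse_map(self, parent):
        # map ::= '{' ( name '=' value ( ',' name '=' value )* )? '}'
        self.expect("{")
        if self.peek() != "}":
            while True:
                name = self.take_while(is_name_char, "a name")
                self.expect("=")
                self.parse_value(ET.SubElement(parent, name))
                if self.peek() != ",":
                    break
                self.pos += 1
        self.expect("}")

    def parse_list(self, parent):
        # list ::= '[' ( value ( ',' value )* )? ']'
        self.expect("[")
        if self.peek() != "]":
            while True:
                # Numbers become <elt>; maps (and, by assumption, nested lists) become <item>.
                tag = "elt" if is_digit(self.peek()) else "item"
                self.parse_value(ET.SubElement(parent, tag))
                if self.peek() != ",":
                    break
                self.pos += 1
        self.expect("]")


def parse_document(text):
    # document ::= value '%' value '%'   (first value -> header, second -> content)
    parser = Parser(text)
    doc = ET.Element("document")
    for section in ("header", "content"):
        parser.parse_value(ET.SubElement(doc, section))
        parser.expect("%")
    if parser.peek():
        raise SyntaxError(f"unexpected trailing input at offset {parser.pos}")
    return doc


if __name__ == "__main__":
    sample = "[{c=1,d=2},{cc=11,dd=22}]%{f=4,g=5,h={i=6,j=[7,8]}}%"
    print(ET.tostring(parse_document(sample), encoding="unicode"))
```

Each parse_* method handles exactly one grammar rule and calls the others for nested parts, which is what lets the nesting depth be arbitrary. The same shape carries over to recursive functions or templates in XSLT 2.0, or to a parser generated by a tool such as REx.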
Upvotes: 1
Reputation: 56893
In XSLT 2.0 there is a function called unparsed-text() which reads the resource at a given URI (for example, a file) and returns its contents as a string.
You could then use one or more of the regular expression instructions or functions (such as tokenize() or xsl:analyze-string) to break the string up into a sequence and process the parts.
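To make the "break the string up into a sequence" step concrete, here is a small sketch of regex-based tokenising. It uses Python only for illustration; in a stylesheet the same job would fall to tokenize() or xsl:analyze-string, whose regular expression syntax is not identical to Python's. The token set (lowercase names, digits, and the punctuation seen in the sample) and the function name tokenize are assumptions based on the question's example input.

```python
import re

# One alternative per token kind: names, numbers, single-character punctuation.
TOKEN = re.compile(r"[a-z]+|[0-9]+|[\[\]{}=,%]")


def tokenize(text):
    """Split the input into a flat list of tokens; reject unexpected characters."""
    compact = "".join(text.split())  # assumption: whitespace is insignificant
    tokens = TOKEN.findall(compact)
    # findall silently skips text it cannot match, so verify nothing was dropped.
    if "".join(tokens) != compact:
        raise ValueError("input contains characters outside the assumed token set")
    return tokens


if __name__ == "__main__":
    print(tokenize("{f=4,j=[7,8]}%"))
```

Note that this only yields a flat sequence. Rebuilding the nesting from the brackets and braces still needs a recursive step on top of it.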
Elements can be created in a stylesheet using the xsl:element instruction, like this:
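<xsl:variable name="elementName" select="'f'"/>
<!-- The curly braces make the name attribute an attribute value template, so the variable is evaluated -->
<xsl:element name="{$elementName}">
..
</xsl:element>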
Obviously you would be getting the element name from your string, but hopefully you see the pattern used.
Upvotes: 0