Reputation: 227
I need to match patterns on a set of XML documents (all with the same schema), and when a pattern matches, I need to retrieve the content and do some specific transformations on it.
I have a list of such "patterns", which are similar to regular expressions but operate on elements and attributes.
Pseudo-pattern example:
(//ELEMENTx) (node())* (//ELEMENTy[@ATTRIBUTEz]) (node())* (//@ATTRIBUTEw)
I used XPath syntax inside the parentheses only. Other quantifiers could be used...
This would match when the XML has ELEMENTx as its first element, ends with an element that has ATTRIBUTEw, and has an ELEMENTy with ATTRIBUTEz somewhere in between.
Note that I need to match the whole document for each whole pattern, not just part of it.
The nesting of elements does not matter in this case (ELEMENTy could be a child of ELEMENTx, or not), but they need to have that specific order.
EDIT: To clarify, the XML documents contain trees with syntactic information. I need to match syntactic patterns.
Example:
      TOP
     /   \
    X     Y
    |\    |\
    1 2   3 4
Matching patterns could be (node names, assuming no attributes):
X Y
1 * Y
X 3 4
1 * 4
I could use XPath to get each individual part of the pattern, but then I lose the sense of order: if I do two XPath queries, I don't know the positions of the results relative to each other.
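To make the ordering problem concrete, here is a minimal sketch in Python (standard library only) of one way the results of two independent queries could be compared by position. The sample document, the leaf names `n1`..`n4` (XML names cannot start with a digit), and the helper `in_document_order` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mirroring the example tree; leaf names are invented.
root = ET.fromstring("<TOP><X><n1/><n2/></X><Y><n3/><n4/></Y></TOP>")

# Element.iter() walks the tree in document order, so the enumeration index
# gives each element a position that is comparable across separate queries.
position = {elem: i for i, elem in enumerate(root.iter())}

def in_document_order(*nodes):
    """True if the given elements appear in strictly increasing document order."""
    indexes = [position[n] for n in nodes]
    return all(a < b for a, b in zip(indexes, indexes[1:]))

first = root.find(".//n1")  # result of one query
last = root.find(".//n4")   # result of a second, independent query
print(in_document_order(first, last))  # True
```

This only recovers relative order; it does not by itself check that a pattern covers the whole document.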
After matching, I will have rules for each pattern that specify transformations on the content (changing the order, etc.).
Is there any way to do something like this using XPath or XQuery? I could use DOM and write the pattern-matching code myself, but maybe there is already a better way to do this.
Thanks for any pointers.
Upvotes: 1
Views: 562
Reputation: 163342
> I need to match patterns on a set of XML documents (all with the same schema), and when a pattern matches, I need to retrieve the content and do some specific transformations on it.
So far that sounds like a pretty good description of XSLT. Until you say that you want a rule to match a sequence of nodes, rather than a single node.
But if the sequence of nodes you are matching is the sequence of children of some parent node, then you can recast this as a rule for matching the parent node.
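As a rough illustration of recasting a sequence rule as a rule on the parent, the sketch below is written in Python rather than XSLT, purely to show the idea. It selects every element whose sequence of child names matches a regular expression. The function name `parents_matching` and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def parents_matching(root, child_pattern):
    """Return the elements whose space-separated child names fully match the regex."""
    regex = re.compile(child_pattern)
    return [parent for parent in root.iter()
            if regex.fullmatch(" ".join(child.tag for child in parent))]

# Hypothetical sample: X has children (a, b), Y has children (b, a).
doc = ET.fromstring("<TOP><X><a/><b/></X><Y><b/><a/></Y></TOP>")
print([p.tag for p in parents_matching(doc, r"a b")])  # ['X']
```

Once the parent is identified this way, the transformation can be attached to it as an ordinary single-node rule.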
The pattern matching language in XSLT isn't as powerful as you are looking for, but it could perhaps be adapted to your needs. Two possibilities that come to mind are (a) convert the structural information that you want to match on into a string, and use regular expression matching to assess the string, or (b) write XSD complex type definitions for the grammar that you want to match, and use the XSLT validate-by-type capability (in conjunction with XSLT 3.0's try/catch) to test whether the sequence of nodes matches a named complex type in the schema.
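Option (a) could look something like the following minimal sketch, again in Python rather than XSLT. It assumes a made-up encoding in which every element below the root becomes a `name@attr1@attr2;` token in document order, and it expresses the pseudo-pattern from the question as an ordinary regular expression over that string. The names `encode`, `element`, `ANY`, and `matches` are illustrative, not part of any library, and skipping the root element is an assumption about what "first element" means.

```python
import re
import xml.etree.ElementTree as ET

ATTRS = r"(?:@[^;@]+)*"  # zero or more attribute names on one token
ANY = r"(?:[^;]*;)*"     # plays the role of (node())*: any run of tokens

def encode(root):
    """Flatten the descendants of root, in document order, into
    'name@attr1@attr2;' tokens (attribute values are ignored)."""
    tokens = []
    for elem in root.iter():
        if elem is root:
            continue  # assumption: the wrapper root is not part of the pattern
        attrs = "".join("@" + a for a in sorted(elem.attrib))
        tokens.append(elem.tag + attrs + ";")
    return "".join(tokens)

def element(name=None, attr=None):
    """Regex for a single token; name and attr are optional constraints."""
    name_re = re.escape(name) if name else r"[^;@]+"
    attr_re = ATTRS + "@" + re.escape(attr) if attr else ""
    return name_re + attr_re + ATTRS + ";"

# (//ELEMENTx) (node())* (//ELEMENTy[@ATTRIBUTEz]) (node())* (//@ATTRIBUTEw)
PATTERN = re.compile(
    element("ELEMENTx") + ANY
    + element("ELEMENTy", "ATTRIBUTEz") + ANY
    + element(attr="ATTRIBUTEw")
)

def matches(xml_text):
    """True if the whole encoded document matches PATTERN."""
    return PATTERN.fullmatch(encode(ET.fromstring(xml_text))) is not None
```

Because the match is anchored to the whole string, it tests the entire document rather than a fragment. Note that this flat encoding discards nesting, so patterns that depend on which subtree an element covers would need a richer encoding.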
Upvotes: 1