Jan Krupa
Jan Krupa

Reputation: 506

Improve SQL query for xpath Evaluating

Database

I've got database for saving XML document. Database looks like this:

enter image description here

So I can save any XML file to my universal database.

XPath Query

Then I translate XPATH query to SQL query, for selecting elements.

1) //EMPTY[./PERIOD]

 SELECT e2.docId
 , e2.startPos
 , e2.endPos
 , p2.NodeName
 , p2.levelEl
 , p2.pathID

 From Path p2
 , Element e2
 , Path p3
 , Element e3

 WHERE e2.docID = p2.docID
 AND e2.pathID = p2.pathID
 AND p2.NodeName =  'EMPTY'
 AND p2.levelEl >= 1
 AND e3.docID = p3.docID
 AND e3.pathID = p3.pathID
 AND p3.NodeName =  '_PERIOD_'
 AND e2.startPos < e3.startPos
 AND e2.endPos > e3.endPos
 AND e2.docId = e3.docId
 AND p2.levelEl = p3.levelEl - 1

 AND e2.docId
 = 3147524262 GROUP BY e2.docId
 , e2.startPos
 , e2.endPos
 , p2.NodeName
 , p2.levelEl
 , p2.pathID

 ORDER BY startPos;

2) //EMPTY[./PERIOD]/S/NP

 SELECT e5.docId
 , e5.startPos
 , e5.endPos
 , p5.NodeName
 , p5.levelEl
 , p5.pathID

 From Path p2
 , Element e2
 , Path p3
 , Element e3
 , Path p4
 , Element e4
 , Path p5
 , Element e5

 WHERE e2.docID = p2.docID
 AND e2.pathID = p2.pathID
 AND p2.NodeName =  'EMPTY'
 AND p2.levelEl >= 1
 AND e3.docID = p3.docID
 AND e3.pathID = p3.pathID
 AND p3.NodeName =  '_PERIOD_'
 AND e2.startPos < e3.startPos
 AND e2.endPos > e3.endPos
 AND e2.docId = e3.docId
 AND p2.levelEl = p3.levelEl - 1
 AND e4.docID = p4.docID
 AND e4.pathID = p4.pathID
 AND p4.NodeName =  'S'
 AND e2.startPos < e4.startPos
 AND e2.endPos > e4.endPos
 AND e2.docId = e4.docId
 AND p2.levelEl = p4.levelEl - 1
 AND e5.docID = p5.docID
 AND e5.pathID = p5.pathID
 AND p5.NodeName =  'NP'
 AND e4.startPos < e5.startPos
 AND e4.endPos > e5.endPos
 AND e4.docId = e5.docId
 AND p4.levelEl = p5.levelEl - 1

 AND e5.docId
 = 3147524262 GROUP BY e5.docId
 , e5.startPos
 , e5.endPos
 , p5.NodeName
 , p5.levelEl
 , p5.pathID

 ORDER BY startPos;

Problem

As we can see, with every other node in xpath, Iam adding a table element and path into the FROM statements, which represents that node, with some where conditions to determine where this node should be located.

The problem is, that Query 1) runs pretty fast, but with every new node, the sql is much slower. For example Query 1) takes 24ms but Query 2) Runs something about 5 minutes.

I have added indexes, so Query Plan uses only index seek (according to Microsoft sql server).

enter image description here

Question

Do you know of any solution how to otherwise generate SQL query to make execution faster? Or some improvements for an existing query (DB table engines, etc.)?

Iam creating TREE from Xpath which looks like this

ROOT-ROOT   (type: ROOT) False                                                                                            
   //-EMPTY   (type: NODE) False                                                                                             
       /-_PERIOD_   (type: NODE) False                                                                                         
       /-S   (type: NODE) False                                                                                                  
          /-NP   (type: NODE) True  

And from that tree i Generate the SQL

Upvotes: 1

Views: 391

Answers (1)

Radim Bača
Radim Bača

Reputation: 10701

There is a lot of research on this topic and your approach will inevitably lead to a lot of self joins which is slow. It is very close to this solution. I recommend you to use some native XQuery database such as BaseX, or Saxon which are inherently optimized to process the XQueries without the need to rewrite them into the SQL.

However, if you really want to rewrite XQueries into SQL then read for example XPath accelerator proposed by Torsten Grust. Ideas behind his work are implemented in MonetDB XQuery engine. He is using slightly different labeling scheme than you, but I guess the ideas can be implemented in your approach as well.

Upvotes: 4

Related Questions