Reputation: 7739
How can I optimize a SPARQL query like the following?
The intent of this query is:
countryCode = "US"
)Unfortunately, the OPTIONAL
blocks are being evaluated before the parent block, which causes the query engine to load all data for all countries.
What I want is something like a LEFT OUTER JOIN
behavior, but the query engine is not handling it this way.
What can I do to improve query performance?
SELECT *
WHERE
{
?type (rdfs:subClassOf)* gj:Country .
?this_0 rdf:type ?type ;
gn:countryCode "US"
# each of these blocks is executed as a standalone query in the engine
OPTIONAL
{ ?this_0 gn:countryCode ?countryCode_1}
OPTIONAL
{ ?this_0 gn:name ?name_2}
OPTIONAL
{ ?this_0 gj:cscId ?cscId_3}
}
I am using the SPARQL REST endpoint in MarkLogic 8.4.
Update:
I have tried querying with the optimize=2
option, but it did not give me a significant performance improvement:
/v1/graphs/sparql?optimize=2
Related: How do I specify options in the SPARQL REST endpoint for MarkLogic?
Update 2:
Even if I make one of the optional properties required, the query is still running slow:
WHERE
{
?type (rdfs:subClassOf)* gj:Country .
?this_0 rdf:type ?type ;
gn:countryCode "US"; gj:cscId ?cscId_3 ;
}
Do I need to do something special to index this gj:cscId property?
Update 3:
Here is the profile information from the Query Console.
Update 4:
Here is the diagnostic trace information:
2017-04-27 13:30:17.238 Info: [Event:id=SPARQL Value Frequencies] sessionKey=13846462700334370907 namedGraphs=0 values=
2017-04-27 13:30:17.238 Info: <triple-value-statistics count="154569757" unique-subjects="25445373" unique-predicates="104" unique-objects="67520361" xmlns="cts:triple-value-statistics">
2017-04-27 13:30:17.238 Info: <triple-value-entries>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="181">
2017-04-27 13:30:17.238 Info: <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="179" unique-subjects="179" unique-predicates="4"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="15">
2017-04-27 13:30:17.238 Info: <triple-value>http://www.w3.org/2000/01/rdf-schema#subClassOf</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="15" unique-subjects="15" unique-objects="5"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="8739716">
2017-04-27 13:30:17.238 Info: <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="8359510" unique-subjects="8341619" unique-objects="14"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="8697064">
2017-04-27 13:30:17.238 Info: <triple-value>http://www.geonames.org/ontology#countryCode</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="8323137" unique-subjects="8323137" unique-objects="517"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="1" unique-subjects="1" unique-predicates="1"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="2119305">
2017-04-27 13:30:17.238 Info: <triple-value datatype="http://www.w3.org/2001/XMLSchema#string">US</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="2061783" unique-subjects="2061783" unique-predicates="3"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="13946907">
2017-04-27 13:30:17.238 Info: <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="3" unique-predicates="3" unique-objects="3"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="11739004" unique-subjects="11739004" unique-objects="11739004"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: </triple-value-entries>
2017-04-27 13:30:17.238 Info: </triple-value-statistics>
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.239 Info: initialPlan=SPARQLModule[
2017-04-27 13:30:17.239 Info: Prolog[]
2017-04-27 13:30:17.239 Info: SPARQLSelect[SPARQLProject[order()
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.239 Info: SPARQLLeftNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info: SPARQLNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info: SPARQLScatterJoin[order(0,1) hash(0==0) scatter(0 = 0)
2017-04-27 13:30:17.239 Info: SPARQLZeroOrOne[
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info: SPARQLScatterOneOrMore[
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info: TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.239 Info: TriplePattern[order(0,1) OPS
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]]]
2017-04-27 13:30:17.239 Info: TriplePattern[order(1) SOP
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.239 Info: GraphNode[Literal "US"]]]
2017-04-27 13:30:17.239 Info: TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.239 Info: GraphNode[Var cscId_3 2]]]]]]
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=30 seed=7088858925989728751
2017-04-27 13:30:17.239 Info: initialCost=(m:5.99223e+11,r:0,io:(52.9404/167736/1.17487e+09),cpu(1):(0/1.77017e+08/1.18652e+12),mem:8185,c:1.03266e+07,crd:[14,2.06178e+06,1.03266e+07])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=0
2017-04-27 13:30:17.320 Info: cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=1
2017-04-27 13:30:17.320 Info: cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=2
2017-04-27 13:30:17.326 Info: cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info: bestCost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info: plan=SPARQLModule[
2017-04-27 13:30:17.326 Info: Prolog[]
2017-04-27 13:30:17.326 Info: SPARQLSelect[SPARQLProject[order(1,0)
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.326 Info: SPARQLRightMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info: TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.326 Info: GraphNode[Var cscId_3 2]]
2017-04-27 13:30:17.326 Info: SPARQLHashJoin[order(1,0) hash(0==0) scatter()
2017-04-27 13:30:17.326 Info: SPARQLZeroOrOne[
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info: SPARQLBloomOneOrMore[
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info: TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.326 Info: SPARQLMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info: TriplePattern[order(1) OPS
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.326 Info: GraphNode[Literal "US"]]
2017-04-27 13:30:17.326 Info: TriplePattern[order(1,0) PSO
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]]]]]]]]
Update 5:
In some use cases, I found that I could eliminate the ?type
property path expression from the query. In one such case, performance improved by two orders of magnitude:
WHERE
{
?this_0 rdf:type gj:Country ;
gn:countryCode "US"
# each of these blocks is executed as a standalone query in the engine
OPTIONAL
{ ?this_0 gn:countryCode ?countryCode_1}
OPTIONAL
{ ?this_0 gn:name ?name_2}
OPTIONAL
{ ?this_0 gj:cscId ?cscId_3}
}
Since this solution changes the output of the query, it doesn't solve all of our use cases.
It seems that the problem is not with the OPTIONALs per se, but has something to do with the property path expression confusing the query planner, so that the properties in the OPTIONAL blocks are looked up indepedently (which is not performant).
Upvotes: 3
Views: 1095
Reputation: 51
Marklogic 8 seems to have a performance issue with property paths using *
. Try replacing
?type (rdfs:subClassOf)* gj:Country .
with
{
BIND(gj:Country AS ?type)
} UNION {
?type (rdfs:subClassOf)+ gj:Country .
}
Upvotes: 0
Reputation: 958
The query optimizer relies on using statistics to determine the best order for operations. Often there will be a restrictive triple pattern that can be used to restrict further operations using a scatter join.
In your case, the statistics don't provide such an obvious restrictive triple pattern. You can see by looking in the triple value statistics output that the string "US" occurs 2061783 times as an object - so that's not terribly restrictive.
The gj:Country IRI is restrictive (179 times in the object position), but unfortunately you need to use that on the right hand side of a transitive closure operator. It's very hard to predict how many results a transitive closure operator will return, as it depends a great deal on the actual data.
You'll find that using a property path like the one below will allow MarkLogic to avoid the zero-or-one operator, which may be a small performance boost:
?this_0 a/rdfs:subClassOf* gj:Country .
Going further if you know that (for instance) there is only one gj:Country with the country code of "US", you could add a limit to that part of the query to give the optimizer hints of how to treat the query, ie:
select * {
{
select * {
?this_0 a/rdfs:subClassOf* gj:Country .
?this_0 gn:countryCode 'US' .
} limit 1
}
OPTIONAL { ?this_0 gj:cscId ?cscId_3 }
}
Upvotes: 5