Reputation: 16311
I've got a function that takes in an arbitrary list of semantic types, and I need to generate a SPARQL query that finds a path containing those types, starting from a start node. Currently I'm writing a pretty naive function, but I'm hoping that it can be improved. I'm using RDF4J 4.2.2, with a MemoryStore containing about 39k nodes.
Sample query:
SELECT DISTINCT ?e0 ?e1 ?e2 ?e3 ?e4 ?e5 WHERE {
BIND(<http://example.com/data#16f7b974-f1ee-4ef1-883d-72a11c65b9be> AS ?e0)
?e0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#FeatureValue> .
?e1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#OperatorExpression> .
?e2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#ParameterMembership> .
?e3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#Feature> .
?e4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#FeatureValue> .
?e5Type <http://example.com/data#hasParent>* <http://example.com/data#LiteralExpression> .
?e5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?e5Type .
?e0 ?e0_e1 ?e1 .
?e1 ?e1_e2 ?e2 .
?e2 ?e2_e3 ?e3 .
?e3 ?e3_e4 ?e4 .
?e4 ?e4_e5 ?e5 .
FILTER(?e0 != ?e5)
}
I have a specific starting node, and from there I want to find any relationship to an OperatorExpression, then any relationship to a ParameterMembership, etc. The last node should be of any type that is a child of LiteralExpression. The output should be the nodes that match the paths. Unfortunately, any predicate can link the nodes together so it's a very unbounded query.
The issue is that in some cases this query can run very slow; for smaller paths the runtime might be in the 300-500 ms range, for a path like this it's usually more like 8-12 seconds. In trying to find the cause, I've gotten the query path explanation:
Distinct (resultSizeActual=0, totalTimeActual=8.3s, selfTimeActual=0.001ms)
Projection (resultSizeActual=0, totalTimeActual=8.3s, selfTimeActual=0.001ms)
├──ProjectionElemList
│ ProjectionElem "e0"
│ ProjectionElem "e1"
│ ProjectionElem "e2"
│ ProjectionElem "e3"
│ ProjectionElem "e4"
│ ProjectionElem "e5"
└──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.3s, selfTimeActual=0.071ms)
╠══Extension (resultSizeActual=1, totalTimeActual=0.005ms, selfTimeActual=0.004ms)
║ ├──SingletonSet (resultSizeActual=1, totalTimeActual=0.001ms, selfTimeActual=0.001ms)
║ └──ExtensionElem (e0)
║ ValueConstant (value=http://example.com/data#aa994b89-12e0-41b9-bd99-8574b43d47fe)
╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=0.012ms)
├──ArbitraryLengthPath (costEstimate=6, resultSizeEstimate=12, resultSizeActual=6, totalTimeActual=0.204ms, selfTimeActual=0.204ms)
│ Var (name=e5Type)
│ StatementPattern (resultSizeEstimate=6)
│ Var (name=e5Type)
│ Var (name=_const_2ae5e92b_uri, value=http://example.com/data#hasParent, anonymous)
│ Var (name=_const_cd806ce_uri, value=http://example.com/data#LiteralExpression, anonymous)
│ Var (name=_const_cd806ce_uri, value=http://example.com/data#LiteralExpression, anonymous)
└──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=0.445ms)
╠══StatementPattern (costEstimate=21, resultSizeEstimate=64, resultSizeActual=360, totalTimeActual=0.261ms, selfTimeActual=0.261ms)
║ Var (name=e1)
║ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
║ Var (name=_const_1ff49915_uri, value=http://example.com/data#OperatorExpression, anonymous)
╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=5.32ms)
├──StatementPattern (costEstimate=31, resultSizeEstimate=3.9K, resultSizeActual=5.5K, totalTimeActual=2.66ms, selfTimeActual=2.66ms)
│ Var (name=e5)
│ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
│ Var (name=e5Type)
└──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=358ms)
╠══StatementPattern (costEstimate=52, resultSizeEstimate=155, resultSizeActual=837.0K, totalTimeActual=167ms, selfTimeActual=167ms)
║ Var (name=e2)
║ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
║ Var (name=_const_8b67b0c6_uri, value=http://example.com/data#ParameterMembership, anonymous)
╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=7.7s, selfTimeActual=110ms)
├──StatementPattern (costEstimate=34, resultSizeEstimate=39.7K, resultSizeActual=44.6K, totalTimeActual=59.8ms, selfTimeActual=59.8ms)
│ Var (name=e1)
│ Var (name=e1_e2)
│ Var (name=e2)
└──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=7.5s, selfTimeActual=3.9s)
╠══StatementPattern (costEstimate=65, resultSizeEstimate=194, resultSizeActual=8.6M, totalTimeActual=1.6s, selfTimeActual=1.6s)
║ Var (name=e4)
║ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
║ Var (name=_const_2fa7cd94_uri, value=http://example.com/data#FeatureValue, anonymous)
╚══Filter (resultSizeActual=0, totalTimeActual=2.0s, selfTimeActual=759ms)
├──Compare (!=)
│ Var (name=e0)
│ Var (name=e5)
└──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=1.3s, selfTimeActual=796ms)
╠══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K, resultSizeActual=53.7K, totalTimeActual=406ms, selfTimeActual=406ms)
║ Var (name=e4)
║ Var (name=e4_e5)
║ Var (name=e5)
╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=50.1ms, selfTimeActual=31.0ms)
├──StatementPattern (costEstimate=97, resultSizeEstimate=194, resultSizeActual=53.7K, totalTimeActual=11.0ms, selfTimeActual=11.0ms)
│ Var (name=e0)
│ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
│ Var (name=_const_2fa7cd94_uri, value=http://example.com/data#FeatureValue, anonymous)
└──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.13ms, selfTimeActual=5.35ms)
╠══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K, resultSizeActual=0, totalTimeActual=2.78ms, selfTimeActual=2.78ms)
║ Var (name=e0)
║ Var (name=e0_e1)
║ Var (name=e1)
╚══Join (JoinIterator)
├──StatementPattern (costEstimate=201, resultSizeEstimate=602)
│ Var (name=e3)
│ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
│ Var (name=_const_6c51f5dd_uri, value=http://example.com/data#Feature, anonymous)
└──Join (JoinIterator)
╠══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K)
║ Var (name=e2)
║ Var (name=e2_e3)
║ Var (name=e3)
╚══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K)
Var (name=e3)
Var (name=e3_e4)
Var (name=e4)
Looking at that explanation, if I'm reading it correctly it seems that a very large number of results are being returned when they definitely shouldn't be, for example:
╠══StatementPattern (costEstimate=65, resultSizeEstimate=194, resultSizeActual=8.6M, totalTimeActual=1.6s, selfTimeActual=1.6s)
║ Var (name=e4)
║ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
║ Var (name=_const_2fa7cd94_uri, value=http://example.com/data#FeatureValue, anonymous)
seems to be saying that ?e4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#FeatureValue> .
is returning 8.6 million results, and I'm not seeing how that can happen when there are only 39k nodes in the store.
My question here is twofold:
Thanks!
Upvotes: 2
Views: 46