Matt McMinn
Matt McMinn

Reputation: 16311

Efficient SPARQL query for specifically typed paths

I've got a function that takes in an arbitrary list of semantic types, and I need to generate a SPARQL query that finds a path containing those types, starting from a start node. Currently I'm writing a pretty naive function, but I'm hoping that it can be improved. I'm using RDF4J 4.2.2, with a MemoryStore containing about 39k nodes.

Sample query:

SELECT DISTINCT ?e0 ?e1 ?e2 ?e3 ?e4 ?e5  WHERE {
    BIND(<http://example.com/data#16f7b974-f1ee-4ef1-883d-72a11c65b9be> AS ?e0)

    ?e0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#FeatureValue> .
    ?e1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#OperatorExpression> .
    ?e2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#ParameterMembership> .
    ?e3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#Feature> .
    ?e4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#FeatureValue> .
    ?e5Type <http://example.com/data#hasParent>* <http://example.com/data#LiteralExpression> .
    ?e5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?e5Type .

    ?e0 ?e0_e1 ?e1 .
    ?e1 ?e1_e2 ?e2 .
    ?e2 ?e2_e3 ?e3 .
    ?e3 ?e3_e4 ?e4 .
    ?e4 ?e4_e5 ?e5 .

    FILTER(?e0 != ?e5)
}

I have a specific starting node, and from there I want to find any relationship to an OperatorExpression, then any relationship to a ParameterMembership, etc. The last node should be of any type that is a child of LiteralExpression. The output should be the nodes that match the paths. Unfortunately, any predicate can link the nodes together so it's a very unbounded query.

The issue is that in some cases this query can run very slow; for smaller paths the runtime might be in the 300-500 ms range, for a path like this it's usually more like 8-12 seconds. In trying to find the cause, I've gotten the query path explanation:

Distinct (resultSizeActual=0, totalTimeActual=8.3s, selfTimeActual=0.001ms)
   Projection (resultSizeActual=0, totalTimeActual=8.3s, selfTimeActual=0.001ms)
   ├──ProjectionElemList
   │     ProjectionElem "e0"
   │     ProjectionElem "e1"
   │     ProjectionElem "e2"
   │     ProjectionElem "e3"
   │     ProjectionElem "e4"
   │     ProjectionElem "e5"
   └──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.3s, selfTimeActual=0.071ms)
      ╠══Extension (resultSizeActual=1, totalTimeActual=0.005ms, selfTimeActual=0.004ms)
      ║  ├──SingletonSet (resultSizeActual=1, totalTimeActual=0.001ms, selfTimeActual=0.001ms)
      ║  └──ExtensionElem (e0)
      ║        ValueConstant (value=http://example.com/data#aa994b89-12e0-41b9-bd99-8574b43d47fe)
      ╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=0.012ms)
         ├──ArbitraryLengthPath (costEstimate=6, resultSizeEstimate=12, resultSizeActual=6, totalTimeActual=0.204ms, selfTimeActual=0.204ms)
         │     Var (name=e5Type)
         │     StatementPattern (resultSizeEstimate=6)
         │        Var (name=e5Type)
         │        Var (name=_const_2ae5e92b_uri, value=http://example.com/data#hasParent, anonymous)
         │        Var (name=_const_cd806ce_uri, value=http://example.com/data#LiteralExpression, anonymous)
         │     Var (name=_const_cd806ce_uri, value=http://example.com/data#LiteralExpression, anonymous)
         └──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=0.445ms)
            ╠══StatementPattern (costEstimate=21, resultSizeEstimate=64, resultSizeActual=360, totalTimeActual=0.261ms, selfTimeActual=0.261ms)
            ║     Var (name=e1)
            ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
            ║     Var (name=_const_1ff49915_uri, value=http://example.com/data#OperatorExpression, anonymous)
            ╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=5.32ms)
               ├──StatementPattern (costEstimate=31, resultSizeEstimate=3.9K, resultSizeActual=5.5K, totalTimeActual=2.66ms, selfTimeActual=2.66ms)
               │     Var (name=e5)
               │     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
               │     Var (name=e5Type)
               └──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.2s, selfTimeActual=358ms)
                  ╠══StatementPattern (costEstimate=52, resultSizeEstimate=155, resultSizeActual=837.0K, totalTimeActual=167ms, selfTimeActual=167ms)
                  ║     Var (name=e2)
                  ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                  ║     Var (name=_const_8b67b0c6_uri, value=http://example.com/data#ParameterMembership, anonymous)
                  ╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=7.7s, selfTimeActual=110ms)
                     ├──StatementPattern (costEstimate=34, resultSizeEstimate=39.7K, resultSizeActual=44.6K, totalTimeActual=59.8ms, selfTimeActual=59.8ms)
                     │     Var (name=e1)
                     │     Var (name=e1_e2)
                     │     Var (name=e2)
                     └──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=7.5s, selfTimeActual=3.9s)
                        ╠══StatementPattern (costEstimate=65, resultSizeEstimate=194, resultSizeActual=8.6M, totalTimeActual=1.6s, selfTimeActual=1.6s)
                        ║     Var (name=e4)
                        ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                        ║     Var (name=_const_2fa7cd94_uri, value=http://example.com/data#FeatureValue, anonymous)
                        ╚══Filter (resultSizeActual=0, totalTimeActual=2.0s, selfTimeActual=759ms)
                           ├──Compare (!=)
                           │     Var (name=e0)
                           │     Var (name=e5)
                           └──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=1.3s, selfTimeActual=796ms)
                              ╠══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K, resultSizeActual=53.7K, totalTimeActual=406ms, selfTimeActual=406ms)
                              ║     Var (name=e4)
                              ║     Var (name=e4_e5)
                              ║     Var (name=e5)
                              ╚══Join (JoinIterator) (resultSizeActual=0, totalTimeActual=50.1ms, selfTimeActual=31.0ms)
                                 ├──StatementPattern (costEstimate=97, resultSizeEstimate=194, resultSizeActual=53.7K, totalTimeActual=11.0ms, selfTimeActual=11.0ms)
                                 │     Var (name=e0)
                                 │     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                 │     Var (name=_const_2fa7cd94_uri, value=http://example.com/data#FeatureValue, anonymous)
                                 └──Join (JoinIterator) (resultSizeActual=0, totalTimeActual=8.13ms, selfTimeActual=5.35ms)
                                    ╠══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K, resultSizeActual=0, totalTimeActual=2.78ms, selfTimeActual=2.78ms)
                                    ║     Var (name=e0)
                                    ║     Var (name=e0_e1)
                                    ║     Var (name=e1)
                                    ╚══Join (JoinIterator)
                                       ├──StatementPattern (costEstimate=201, resultSizeEstimate=602)
                                       │     Var (name=e3)
                                       │     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
                                       │     Var (name=_const_6c51f5dd_uri, value=http://example.com/data#Feature, anonymous)
                                       └──Join (JoinIterator)
                                          ╠══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K)
                                          ║     Var (name=e2)
                                          ║     Var (name=e2_e3)
                                          ║     Var (name=e3)
                                          ╚══StatementPattern (costEstimate=34, resultSizeEstimate=39.7K)
                                                Var (name=e3)
                                                Var (name=e3_e4)
                                                Var (name=e4)

Looking at that explanation, if I'm reading it correctly it seems that a very large number of results are being returned when they definitely shouldn't be, for example:

╠══StatementPattern (costEstimate=65, resultSizeEstimate=194, resultSizeActual=8.6M, totalTimeActual=1.6s, selfTimeActual=1.6s)
║     Var (name=e4)
║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
║     Var (name=_const_2fa7cd94_uri, value=http://example.com/data#FeatureValue, anonymous)

seems to be saying that ?e4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/data#FeatureValue> . is returning 8.6 million results, and I'm not seeing how that can happen when there are only 39k nodes in the store.

My question here is twofold:

  1. How can I rewrite the SPARQL query to be more efficient?
  2. Am I reading the explanation correctly, and if so, how do I explain the large number of results being returned for seemingly simple patterns?

Thanks!

Upvotes: 2

Views: 46

Answers (0)

Related Questions