Pantelis Natsiavas
Pantelis Natsiavas

Reputation: 5369

Complex SPARQL query - Virtuoso performance hints?

I have a rather complex SPARQL query, which is executed thousands of times in parallel threads (400 threads). The query is here somewhat simplified (namespaces, properties, and variables have been reduced) for readability, but the complexity is left untouched (unions, number of graphs, etc.). The query is run against 4 graphs, the biggest of which contains 5,561,181 triples.

PREFIX graphA: <GraphABaseURI:>

ASK
FROM NAMED <GraphBURI>
FROM NAMED <GraphCURI>
FROM NAMED <GraphABaseURI>
FROM NAMED <GraphDBaseURI>
WHERE{
   {  
      GRAPH <GraphABaseURI>{
         ?variableA a graphA:ClassA .
         ?variableA graphA:propertyA ?variableB .
         ?variableB dcterms:title ?variableC .
         ?variableA graphA:propertyB ?variableD .
         ?variableL<GraphABaseURI:propertyB> ?variableD .
         ?variableD <propertyBURI> ?variableE
      }
      .
      GRAPH <GraphBURI>{
        ?variableF <propertyCURI>/<propertyDURI> ?variableG .
        ?variableF <propertyEURI> ?variableH
      }
      .
      GRAPH <GraphCURI>{
        ?variableI <http://www.w3.org/2004/02/skos/core#notation> ?variableJ .
        ?variableI <http://www.w3.org/2004/02/skos/core#prefLabel> ?variableK .
        FILTER (isLiteral(?variableK) && REGEX(?variableK, "literalA", "i"))
      }
      . 
      FILTER (isLiteral(?variableJ) && ?variableG = ?variableJ) .
      FILTER (?variableE = ?variableH)
   }
   UNION
   {
       GRAPH <GraphABaseURI>{
          ?variableA a graphA:ClassA .
          ?variableA graphA:propertyA ?variableB .
          ?variableB dcterms:title ?variableC .
          ?variableA graphA:propertyB ?variableD .
          ?variableL<propertyBURI> ?variableE .
          ?variableL <propertyFURI> ?variableD .
       }
       .
       GRAPH <GraphDBaseURI>{
          ?variableM <propertyGURI> ?variableN .
          ?variableM <propertyHURI> ?variableO .
          FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
       }
       .
       FILTER (?variableE = ?variableN) .

   }
   UNION
   {
       GRAPH <GraphABaseURI>{
          ?variableA a graphA:ClassA .
          ?variableA graphA:propertyA ?variableB .
          ?variableB dcterms:title ?variableC .
          ?variableA graphA:propertyB ?variableD .
          ?variableL<propertyBURI> ?variableE .
          ?variableL <propertyIURI> ?variableD .
       }
       .
       GRAPH <GraphDBaseURI>{
          ?variableM <propertyGURI> ?variableN .
          ?variableM <propertyHURI> ?variableO .
          FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
       }
       .
       FILTER (?variableE = ?variableN) .
   }
   . FILTER (isLiteral(?variableC) && REGEX(?variableC, "literalB", "i")) .  
}

I would not expect someone to transform the above query (of course...). I am only posting the query to demonstrate the complexity and all the SPARQL structures used.

My questions:

  1. Would I gain regarding performance if I had all my triples in one graph? This way I would avoid unions and simplify my query, however, would this also benefit in terms of performance?
  2. Are there any kind of indexes that I could built and they could be of any help with the above query? I am not really confident on data indexing, however reading in the RDF Index Scheme section of RDF Performance Tuning, I wonder if Virtuoso 7's default indexing scheme is suitable for queries like the above. While the predicates are defined in the above query's SPARQL triple patterns, there are many triple patterns that have not defined subject or predicate. Could this be a major problem regarding performance?
  3. Perhaps there is a SPARQL syntax structure that I am not aware of and could be of great help in the above query. Could you suggest something? For example, I have already improved performance by removing STR() casts and using the isLiteral() function. Could you suggest anything else?
  4. Perhaps you could suggest overusing a complex SPARQL syntax structure?

Please note that I use Virtuoso Open source edition, built on Ubuntu, Version: 07.20.3214, Build: Oct 14 2015.

Regards, Pantelis Natsiavas

Upvotes: 3

Views: 847

Answers (1)

TallTed
TallTed

Reputation: 9434

First thing -- your Virtuoso build is long outdated; updating to 7.20.3217 as of April 2016 (or later) is strongly recommended.

Optimization suggestions are naturally limited when looking at a simplified query., but here are several thoughts, in no particular order...

  • Index Scheme Selection, the RDF Performance Tuning doc section following RDF Index Scheme, offers a couple of alternative and/or additional indexes which may make sense for your queries and data. As you say that some of your patterns will have defined graph and object, and undefined subject and predicate, some other indexes may also make sense (e.g., GOPS, GOSP), depending on some other factors.

  • Depending on how much your data has changed since original load, it may be worth rebuilding the free-text indexes, with this SQL command (which may be issued through any SQL interface -- iSQL, ODBC, JDBC, etc.) —

    VT_INC_INDEX_DB_DBA_RDF_OBJ ()
    
  • Using the bif:contains predicate can result in substantially better performance than regex() filters, for instance replacing —

    FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i")) .
    

    — with —

    ?variableO bif:contains "'literalA'" .
    FILTER ( isLiteral(?variableO) ) .
    
  • Explain() and profile() can be helpful in query optimization efforts. Much of this output is meant for analysis by Development, so it may not mean much to you, but providing it to other Virtuoso users can still yield helpful suggestions.

  • For a number of reasons, the rdf:type predicate (often expressed as a, thanks to SPARQL/Turtle semantic sugar) can be a performance killer. Removing those predicates from your graph pattern is likely to boost performance substantially. If needed, there are other ways to limit the solution set (such as by testing for attributes only possessed by entities your desired rdf:type) which do not have such negative performance impacts.

(ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)

Upvotes: 4

Related Questions