Reputation: 5369
I have a rather complex SPARQL query, which is executed thousands of times in parallel threads (400 threads). The query shown here is somewhat simplified for readability (namespaces, properties, and variables have been reduced), but its complexity is left untouched (unions, number of graphs, etc.). The query is run against 4 graphs, the biggest of which contains 5,561,181 triples.
PREFIX graphA: <GraphABaseURI:>
ASK
FROM NAMED <GraphBURI>
FROM NAMED <GraphCURI>
FROM NAMED <GraphABaseURI>
FROM NAMED <GraphDBaseURI>
WHERE{
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL <GraphABaseURI:propertyB> ?variableD .
?variableD <propertyBURI> ?variableE
}
.
GRAPH <GraphBURI>{
?variableF <propertyCURI>/<propertyDURI> ?variableG .
?variableF <propertyEURI> ?variableH
}
.
GRAPH <GraphCURI>{
?variableI <http://www.w3.org/2004/02/skos/core#notation> ?variableJ .
?variableI <http://www.w3.org/2004/02/skos/core#prefLabel> ?variableK .
FILTER (isLiteral(?variableK) && REGEX(?variableK, "literalA", "i"))
}
.
FILTER (isLiteral(?variableJ) && ?variableG = ?variableJ) .
FILTER (?variableE = ?variableH)
}
UNION
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL <propertyBURI> ?variableE .
?variableL <propertyFURI> ?variableD .
}
.
GRAPH <GraphDBaseURI>{
?variableM <propertyGURI> ?variableN .
?variableM <propertyHURI> ?variableO .
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
}
.
FILTER (?variableE = ?variableN) .
}
UNION
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL <propertyBURI> ?variableE .
?variableL <propertyIURI> ?variableD .
}
.
GRAPH <GraphDBaseURI>{
?variableM <propertyGURI> ?variableN .
?variableM <propertyHURI> ?variableO .
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
}
.
FILTER (?variableE = ?variableN) .
}
. FILTER (isLiteral(?variableC) && REGEX(?variableC, "literalB", "i")) .
}
I would not expect someone to transform the above query (of course...). I am only posting the query to demonstrate the complexity and all the SPARQL structures used.
My questions:
STR() casts and using the isLiteral() function. Could you suggest anything else?
Please note that I use Virtuoso Open Source Edition, built on Ubuntu, Version: 07.20.3214, Build: Oct 14 2015.
Regards, Pantelis Natsiavas
Upvotes: 3
Views: 847
Reputation: 9434
First thing -- your Virtuoso build is long outdated; updating to 7.20.3217 as of April 2016 (or later) is strongly recommended.
Optimization suggestions are naturally limited when looking at a simplified query, but here are several thoughts, in no particular order:
Index Scheme Selection, the RDF Performance Tuning doc section following RDF Index Scheme, offers a couple of alternative and/or additional indexes which may make sense for your queries and data. As you say that some of your patterns will have defined graph and object, and undefined subject and predicate, some other indexes may also make sense (e.g., GOPS, GOSP), depending on some other factors.
Depending on how much your data has changed since original load, it may be worth rebuilding the free-text indexes, with this SQL command (which may be issued through any SQL interface, such as iSQL, ODBC, or JDBC):
VT_INC_INDEX_DB_DBA_RDF_OBJ ()
Using the bif:contains predicate can result in substantially better performance than regex() filters, for instance replacing:
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i")) .
with:
?variableO bif:contains "'literalA'" .
FILTER ( isLiteral(?variableO) ) .
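As a rough sketch of how that replacement might look in context, here is the GRAPH <GraphDBaseURI> block from the question, reduced to a standalone ASK. The placeholder IRIs are the ones from the question. It assumes that free-text indexing covers these literals, and note that bif:contains matches whole words or phrases rather than arbitrary substrings, so results may not be identical to the case-insensitive REGEX.

```sparql
# Sketch only: the placeholder IRIs come from the simplified query in the question.
# Assumes Virtuoso's free-text index covers the ?variableO literals.
ASK
FROM NAMED <GraphDBaseURI>
WHERE {
  GRAPH <GraphDBaseURI> {
    ?variableM <propertyGURI> ?variableN .
    ?variableM <propertyHURI> ?variableO .
    # Word-based free-text match answered from the text index,
    # instead of evaluating REGEX against every candidate literal.
    ?variableO bif:contains "'literalA'" .
    FILTER ( isLiteral(?variableO) )
  }
}
```

If substring matching is really required (for example, "literalA" appearing inside a longer word), the REGEX filter may have to stay for that pattern.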
Explain() and profile() can be helpful in query optimization efforts. Much of this output is meant for analysis by Development, so it may not mean much to you, but providing it to other Virtuoso users can still yield helpful suggestions.
For a number of reasons, the rdf:type predicate (often expressed as "a", thanks to SPARQL/Turtle syntactic sugar) can be a performance killer. Removing those predicates from your graph pattern is likely to boost performance substantially. If needed, there are other ways to limit the solution set (such as by testing for attributes only possessed by entities of your desired rdf:type) which do not have such negative performance impacts.
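To illustrate, here is a minimal sketch of the GRAPH <GraphABaseURI> part of the question's query with the rdf:type triple dropped. It rests on an assumption that would need checking against the real data: that only instances of graphA:ClassA carry graphA:propertyA and graphA:propertyB, so those patterns already restrict ?variableA to the intended entities. The dcterms prefix declaration is added here because the simplified query omits it.

```sparql
# Sketch only: assumes graphA:propertyA and graphA:propertyB occur
# exclusively on graphA:ClassA instances in this dataset.
PREFIX graphA:  <GraphABaseURI:>
PREFIX dcterms: <http://purl.org/dc/terms/>
ASK
FROM NAMED <GraphABaseURI>
WHERE {
  GRAPH <GraphABaseURI> {
    # "?variableA a graphA:ClassA ." is intentionally omitted;
    # the property patterns below do the restricting instead.
    ?variableA graphA:propertyA ?variableB .
    ?variableB dcterms:title ?variableC .
    ?variableA graphA:propertyB ?variableD .
  }
  FILTER (isLiteral(?variableC) && REGEX(?variableC, "literalB", "i"))
}
```

If other classes also use those properties, the type check (or some other discriminating attribute) is still needed for correct results.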
(ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)
Upvotes: 4