Reputation: 3897
My database has information about documents, where each document has a category, e.g.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX: <http://example.com>
:doc1 :hasCategory :category1 .
:category1 rdfs:label "Law" .
There are about 100k statements like this.
Running a simple query to get counts of documents per category:
SELECT ?category (count(distinct ?doc) as ?count) WHERE {
?doc :hasCategory ?category .
} GROUP BY ?category
takes about 0.1s to run.
But to return the category labels as well:
SELECT ?category ?label (count(distinct ?doc) as ?count) WHERE {
?doc :hasCategory ?category .
?category rdfs:label ?label .
} GROUP BY ?category ?label
this query takes more than 7s to run.
Why would the difference be so large, and is there a more optimised query I can use to get the labels?
Upvotes: 3
Views: 407
Reputation: 1193
GraphDB versions before 8.6 release implement the GROUP BY operation with a naive LinkedHashMap, where the hash key is composed of all elements part of the projection. To calculate the hashcode the engine will translate the internal identifier to a RDF value. If the strings are longer, they will be read from an external collection resulting an extra disk operation and additional CPU to calculate the hashcode.
The only way to optimise the query is to switch to GraphDB 8.6 (currently it's a late release candidate), which implements a more optimised aggregate algorithm or reduce the GROUP BY projection like you did in your answer.
Upvotes: 2
Reputation: 3897
I found I can get the desired result in 0.2s with the following query:
SELECT ?category (sample(?lbl) as ?label) ?count WHERE {
?category rdfs:label ?lbl .
{
SELECT ?category (count(distinct ?doc) as ?count) WHERE {
?doc :hasCategory ?category .
} GROUP BY ?category
}
} GROUP BY ?category ?count
But I don't really understand why it's more efficient.
Upvotes: 3