Roman Puchkovskiy
Roman Puchkovskiy

Reputation: 11865

Is there a way to efficiently execute ORDER BY + LIMIT using an index in Blazegraph?

I am playing with Blazegraph. I insert some triples representing 'events', each of 'event' contains 3 triples and looks like this:

<%event-iri%> <http://predicates/timestamp> '2020-01-02T03:04:05.000Z'^^xsd:dateTime .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .

Timestamps represent consecutive moments of time, each next event is 1 minute later than the previous one.

I have 10 million 'events' (so 30 million triples) in the graph.

I run the following query:

select ?event
where {
  ?event <http://predicates/timestamp> ?timestamp .
}
order by ?timestamp
limit 15

I expect it to be executed effectively using POS index for ordering, but it looks like it attempts to do a sort of all 10 million timestamps in memory as the query takes more than 2 minutes to execute.

Adding a hint saying that a range query is safe does not help either (well, the hint is about range queries, but I decided to still try it as it seems to actually be about 'all objects of this predicate are of the same type'):

select ?event
where {
  ?event <http://predicates/timestamp> ?timestamp .
  hint:Prior hint:rangeSafe true .
}
order by ?timestamp
limit 15

Still same 2 minutes and 20 seconds.

In relational databases and MongoDB you just add an index and such queries work fast.

Is there a way to execute such a query efficiently on Blazegraph?

Upvotes: 0

Views: 244

Answers (1)

Stanislav Kralin
Stanislav Kralin

Reputation: 11479

From Query Hints:

Blazegraph stores data in their natural (lexicographic) order in the POS(C) index and the OSP(C) index.

That order is not preserved through joins, so one can try something like this:

SELECT ?event ?timestamp {
  {
    hint:Query hint:optimizer "None" 
    SELECT ?event ?timestamp {
      ?event <http://predicates/timestamp> ?timestamp .
      # prevent "random" sampling
      hint:Query hint:pipelinedHashJoin false 
    } LIMIT 200 
  }
  ?event <http://predicates/c> :c 
} ORDER BY ?timestamp LIMIT 150 

However, my experiments show that result constancy is not guaranteed even using subtler hints. And obviously, this approach can not be extended to DESC ordering.

If events are so regular, you could estimate an upper bound yourself:

SELECT ?event {
  VALUES ?startTime {"2000-11-26T12:00:00"^^xsd:dateTime}
  BIND (STRDT(CONCAT("PT", STR(200), "M"), xsd:duration) AS ?duration)
  BIND (?startTime + ?duration AS ?endTime)
  ?event <http://predicates/timestamp> ?timestamp .
  hint:Prior hint:rangeSafe true 
  FILTER (?timestamp < ?endTime) 
  ?event <http://predicates/c> :c
} ORDER BY ?timestamp LIMIT 150

Upvotes: 2

Related Questions