silian-rail

Reputation: 144

Slow query using `pull` (datahike) to retrieve attributes on 400 entities

I am using Datahike 0.6.1531 (not Datomic) on the JVM. I have a list of book titles to display in a web app. If the book is "notable", I do something special, like apply a background-color or append an emoji to it.

I would like to return a vector that resembles something like this:

[{:db/id 339, :resource-name "Notation as a Tool of Thought"}
 {:db/id 338, :resource-name "The Science of Radio", :notable? :true}
 {:db/id 337, :resource-name "Journey Into Mathematics"}
 {:db/id 336, :resource-name "Street Fighting Mathematics"}
 ...]

Performing the following pull-many query with 3 attr-ids (including :db/id) on a range of 400 or so entities requires ~2,900 ms:

(require '[datahike.api :as d]) ; version 0.6.1531
(d/pull-many @conn [:db/id :resource-name :notable?] 
  (range 1 400))

Is the slow query time an inherent trade-off of EAV databases, or am I failing to optimize in some very obvious way?

Upvotes: 0

Views: 138

Answers (2)

silian-rail

Reputation: 144

This issue was addressed and solved by Datahike maintainers in this pull request: https://github.com/replikativ/datahike/pull/653

Upvotes: 1

claj

Reputation: 5402

This question initially came up in the Datahike channel in the Clojurians Slack, so I have edited my answers from there into a single, longer post. Datahike is one of many implementations of a Datalog query engine in Clojure/ClojureScript.

Since you are using ordinary JVM Clojure (which tends to be the most performant option), I would expect a quicker result. In my experience with Datomic, such a query should be many times faster.

DataScript (the Datalog implementation that Datahike was initially based upon) can sometimes be somewhat slow, but this result still seems too slow for a Clojure/ClojureScript environment.

The implementations of the pull API in DataScript and in Datahike are quite different: Datahike seems to do somewhat more bookkeeping and checking than DataScript does (which makes Datahike much slower, but also makes it easier to get working correctly).

JIT/better performance metric

Your query touches at most 400 entities in the database. This is probably too few to trigger JIT recompilations and optimizations (which are what make this kind of code quicker in long-running production processes). A tool like Criterium (Clojure) runs the code many times to make sure that the JVM is "hot", meaning that most of the JIT optimizations are already in place. In ClojureScript, a profiling library such as Tufte can play a similar role when the code is measured over repeated runs.

I suggest you test the performance with either Criterium or Tufte.
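As a rough sketch of what such a measurement could look like with Criterium: the snippet below assumes that `conn` is the Datahike connection from the question and that Criterium is already on the classpath. The `doall` is only a precaution in case the result is lazy.

```clojure
(require '[criterium.core :as crit]
         '[datahike.api :as d])

;; Assumption: `conn` is the Datahike connection from the question.
;; Deref once up front so the benchmark measures only the pull itself.
(def db @conn)

;; quick-bench warms up the JVM, runs the expression repeatedly,
;; and prints timing statistics (mean, std-deviation, quantiles).
(crit/quick-bench
  (doall (d/pull-many db [:db/id :resource-name :notable?] (range 1 400))))
```

If the mean reported here is still in the range of seconds, JIT warm-up is unlikely to be the explanation and the pull implementation itself is the more probable cause.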

Datahike is still more experimental (but usable) and has a different scope than DataScript, which might make its query engine much slower, but it should not be this slow (IMHO).

Upvotes: 0
