ArangoDB get document by _id with cache per query

Question

Have document collections "items", and "dictionary1", "dictionary2"

collection "items" have 50000 records, like
{
label:"..."
dict1: "dictionary1/33333"
dict2: "dictionary2/44444"
....
}

When I join dictionaries with

FOR item IN items
LET dictname = FIRST(FOR d IN dictionary1 FILTER d._id == item.dict1 RETURN d.name)
RETURN { _id: item._id, name: item.name, dict: dictname }

Query execution time ~ 150ms its to much for this simple task. For experiment I am force dict for one _id

FOR item IN items
LET dictname = FIRST(FOR d IN dictionary1 FILTER d._id == "dictionary1/10000" RETURN d.name)
RETURN { _id: item._id, name: item.name, dict: dictname }

Query execution time ~ 130ms

Also I an tried DOCUMENT function

FOR item IN items
LET dictname = DOCUMENT("dictionary1", "dictionary1/10000")
RETURN { _id: item._id, name: item.name, dict: dictname.name }

Query execution time ~ 1500ms :((

And simple read from collection:

FOR item IN items
RETURN { _id: item._id, name: item.name }

Query execution time ~ 30ms

So result not cached when I get same document by _id 50000 times as join. Variant with duplicate content in one collection not useful for me. Now I move logic to client side, select all dictionaries in memory and join after query, and have ~60ms, but I don't like this method, it wrong.

Any way to get result with joined dictionaries(80% same _id) with short time?

stj · Accepted Answer

If you're sure that there is a matching item in dictionary1 for each items that you loop over in the outer loop, the query can be simplified to a simpler equi-join like this:

FOR item IN items
  FOR d IN dictionary1 
    FILTER d._id == item.dict1
    RETURN { _id: item._id, name: item.name, dict: d.name }

This removes the need for the correlated subquery and the function call. However, its result will only be equivalent to the original subquery solution if there are matching dictionary entries for each item. If there is no dictionary entry for an item, the equi-join query will simply suppress it.

If it's unclear whether there'll be an entry in dictionary1, and you want nulls to be returned for non-matches, the subquery solution can be simplified to (this removes the function call to FIRST):

FOR item IN items
  LET sub = (FOR d IN dictionary1 
    FILTER d._id == item.dict1
    RETURN d.name
  )
  RETURN { _id: item._id, name: item.name, dict: sub[0] }

Apart from that, joining on the _key attribute instead of joining on _id may provide minor speedups, as _key is shorter and less complex to produce than _id. But that will require storing the join values in items without the dictionary collection names prefixes.

The above suggestions shouldn't provide order-of-magnitude runtime improvements, but maybe they help.

ArangoDB get document by _id with cache per query

Answers (1)

Related Questions