Reputation: 55
I have a collection in ArangoDB where each document contains some attributes like
{
"contributor_name": "Rizano",
"action": "create",
"id": 3633,
"type": "newusers",
"logtitle": "What to do",
"timestamp": "2006-07-05",
"contributor_id": 7878
}
The collection contains millions of documents. Now I want to find out which contributor_name
is most occurring in the documents and their count.
Upvotes: 3
Views: 322
Reputation: 11865
You can simply group by contributor_name
and use the special COLLECT syntax variant WITH COUNT INTO ...
to efficient compute how often each value occurs in the dataset:
FOR doc IN coll
COLLECT name = doc.contributor_name WITH COUNT INTO count
RETURN { name, count }
The result may look like this:
[
{ "name": "Rizano", "count": 5 },
{ "name": "Felipe", "count": 8 },
...
]
You may merge the result together like this if you prefer that format:
[
{
"Rizano": 5,
"Felipe": 8
}
...
]
Query:
RETURN MERGE(
FOR doc IN coll
COLLECT name = doc.contributor_name WITH COUNT INTO count
RETURN { [name]: count }
)
You may also sort by count and limit the result to the most occurring values, e.g. like this (only top contributor):
FOR doc IN coll
COLLECT name = doc.contributor_name WITH COUNT INTO count
SORT count DESC
LIMIT 1
RETURN { name, count }
There's also COLLECT AGGREGATE, although there should be no difference in performance for this particular query:
FOR doc IN coll
COLLECT name = doc.contributor_name AGGREGATE count = LENGTH(1)
SORT count DESC
LIMIT 1
RETURN { name, count }
The value passed to LENGTH
doesn't really matter, all we want is that it returns a length of 1 (thus increasing the counter by 1 for the given contributor).
Upvotes: 3