Ryan

Reputation: 1242

How to use sideEffect and dedup in a Gremlin traversal?

Background

We're building a query that, for each vertex label, returns the neighbors with that label for a list of vertex ids. We have about five labels, so we build up one traversal that handles all of them in a single query.

Goal

For each label, we want to collect up to N neighbors, deduplicated across the neighbors of every vertex in the idsToBatchQuery list.

The problem is that the dedup inside the side-effect only applies per start vertex. For example, if idsToBatchQuery holds 10 vertices and each vertex has 3 neighbors, dedup runs over each group of 3 neighbors separately instead of over the combined list of 30 neighbors.
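To make the difference concrete, here is a minimal plain-Java sketch of the two behaviors. It is only a model and does not use TinkerPop: the class name LocalDedupDemo, both method names, and the sample vertex ids are hypothetical, and the map stands in for "start vertex to its neighbors".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical model of the problem; not TinkerPop code.
class LocalDedupDemo {

    // Models dedup() inside sideEffect(): the "seen" set starts empty
    // again for every start vertex, so repeats across vertices survive.
    static List<String> perVertexDedup(Map<String, List<String>> neighbors) {
        List<String> aggregated = new ArrayList<>();
        for (List<String> adjacent : neighbors.values()) {
            aggregated.addAll(new LinkedHashSet<>(adjacent));
        }
        return aggregated;
    }

    // Models the desired behavior: one "seen" set shared by all start vertices.
    static List<String> globalDedup(Map<String, List<String>> neighbors) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (List<String> adjacent : neighbors.values()) {
            seen.addAll(adjacent);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        Map<String, List<String>> neighbors = new LinkedHashMap<>();
        neighbors.put("v1", List.of("a", "b", "c"));
        neighbors.put("v2", List.of("b", "c", "d"));

        System.out.println(perVertexDedup(neighbors)); // [a, b, c, b, c, d]
        System.out.println(globalDedup(neighbors));    // [a, b, c, d]
    }
}
```

The first result is what we currently get per label; the second is what we are after.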

Question

How can we modify the query below to grab the first maxBreadth unique neighbors across all the vertices in idsToBatchQuery?

        var traversal = g.V(idsToBatchQuery.toArray());

        for (String label : labels) {
          traversal =
              traversal.sideEffect(
                  outE(EDGE_LABEL)
                      .inV()
                      .hasLabel(label)
                      .dedup()
                      .by(T.id)
                      .limit(maxBreadth)
                      .aggregate(label));
        }

Upvotes: 1

Views: 593

Answers (1)

Kelvin Lawrence

Reputation: 14391

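Using the air-routes data set, here is a similar query in which the dedup step has been replaced with where(without('labels')). Each candidate is checked against everything already aggregated under 'labels', so this gives a more global deduplication across all values seen so far, not just the two "local" ones found for the current start vertex.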

g.V('1','2','3','56').
  sideEffect(
    outE('route').
    inV().
    hasLabel('airport').
    where(without('labels')).
    limit(2).
    aggregate('labels')).
  sideEffect(
    inE('contains').
    outV().
    hasLabel('country').
    where(without('labels')).
    limit(2).
    aggregate('labels')).
  cap('labels').
  unfold().
  values('desc')

Here's the output from running the query. Had dedup been used as in your original query, the results would have contained duplicates.

1   Ontario International Airport
2   Greater Rochester International Airport
3   United States
4   Fairbanks International Airport
5   Reykjavik, Keflavik International Airport
6   Charlotte Douglas International Airport
7   Cancun International Airport
8   Phnom Penh International Airport
9   Gold Coast Airport
10  Singapore

With the original dedup in place of where(without('labels')), the results contain duplicates:

1   Ontario International Airport
2   Ontario International Airport
3   Greater Rochester International Airport
4   United States
5   United States
6   United States
7   Fairbanks International Airport
8   Reykjavik, Keflavik International Airport
9   Charlotte Douglas International Airport
10  Phnom Penh International Airport
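For intuition, the where(without(...)) / limit / aggregate pattern can be modeled in plain Java roughly as follows. This is a simplified sketch and not TinkerPop itself: the class name WithoutFilterDemo, the collect method, and the sample data are hypothetical, and the list named sideEffect merely plays the role of the 'labels' side effect.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of where(without(key)).limit(n).aggregate(key); not TinkerPop code.
class WithoutFilterDemo {

    static List<String> collect(Map<String, List<String>> neighbors, int limit) {
        // Shared across all start vertices, like a named side effect.
        List<String> sideEffect = new ArrayList<>();
        for (List<String> adjacent : neighbors.values()) {
            List<String> batch = new ArrayList<>();
            for (String candidate : adjacent) {
                if (batch.size() == limit) {
                    break;                         // limit(n), applied per start vertex
                }
                if (!sideEffect.contains(candidate)) {
                    batch.add(candidate);          // where(without(key))
                }
            }
            sideEffect.addAll(batch);              // aggregate(key)
        }
        return sideEffect;
    }

    public static void main(String[] args) {
        Map<String, List<String>> neighbors = new LinkedHashMap<>();
        neighbors.put("v1", List.of("a", "b", "c"));
        neighbors.put("v2", List.of("a", "b", "d"));
        neighbors.put("v3", List.of("b", "d", "e"));

        System.out.println(collect(neighbors, 2)); // [a, b, d, e]
    }
}
```

In this model the limit still applies to each start vertex separately, so the combined result can hold more than limit entries; only values already collected for earlier vertices are filtered out.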

Upvotes: 1
