Reputation: 5550
I'm running into a warning when profiling a slow Gremlin traversal:
WARNING: >> OrderGlobalStep([[[CoalesceStep([[VertexStep(IN,[view],edge), ProfileStep, NeptuneHasStep([isActive.eq(true)]), ProfileStep, EdgeVertexStep(OUT), ProfileStep, NeptuneHasStep([~id.eq(63b944e2-481d-42c8-a1a3-c0bc3ad24484)]), ProfileStep, RangeGlobalStep(0,1), ProfileStep, CountGlobalStep, ProfileStep], [ConstantStep(0), ProfileStep]]), ProfileStep], asc], [value(rekognitionModerationDate), desc], [value(createdDate), desc]]) << (or one of its children) is not supported natively yet
The profiler is reporting this step takes up 62% of the total execution time so I'd like to optimize it. Here is a simplified version of the complete traversal:
g.V()
  .hasLabel("post")
  .order()
    .by(
      __.coalesce(
        __.inE("view")
          .has("isActive", true)
          .outV()
          .hasId(userId)
          .limit(1)
          .count(),
        __.constant(0)
      ),
      order.asc
    )
The goal is to output post vertices that do not have an incoming view edge first. In other words, show posts that haven't been viewed by the requesting user, followed by posts they have viewed. The current traversal works but is very slow. How can I refactor this to be 'native' so it will execute faster?
Edit: Apparently the problem is that Neptune doesn't have native support for order().by() with a custom comparator, as explained here:
https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-step-support.html
I am still interested in ideas of how to refactor this for pure native support.
Upvotes: 0
Views: 234
Reputation: 14371
The current Amazon Neptune query engine will optimize order ... by steps in general. However, if any of the child traversals associated with the order cannot be optimized, the entire step will not be optimized. As you noticed in the documentation, there are limitations today on what can appear within the by modulator when it is used with order, in terms of optimization. Also worth noting are the conditions under which a coalesce step will not get optimized. The query optimizer is quite good at optimizing coalesce steps, but there is a case where today it does not: when the LHS and RHS of the coalesce yield different types of value, or a constant is used. So if a coalesce, for example, always yields a vertex from each possible path, it will likely get optimized. However, when the RHS is a constant, that often causes the coalesce not to be optimized.
You can observe this with a query such as
g.V('3').coalesce(out().count(),constant(0))
as the result from a CountGlobalStep is not the same type as the result from a ConstantStep. This does not always mean you will see bad performance, but it is the reason why, in this case, you are seeing the warning in the profile. In general, when a constant is used with coalesce you will see the warning with the current version of the engine. As with many things, these are point-in-time behaviors.
In your specific case, however, I think we can simplify things and potentially get the query optimized. As you are using count, if no paths exist the count will be 0 without the need for the pesky coalesce. Here is an air-routes example that gets optimized.
g.V().hasLabel('airport').
  order().
    by(in('route').count()).
  limit(10).
  project('code','count').
    by('code').
    by(in('route').count())
which yields
1 {'count': 0, 'code': 'BVS'}
2 {'count': 0, 'code': 'TWB'}
3 {'count': 0, 'code': 'EKA'}
4 {'count': 0, 'code': 'TKQ'}
5 {'count': 0, 'code': 'ISL'}
6 {'count': 0, 'code': 'RIG'}
7 {'count': 0, 'code': 'INT'}
8 {'count': 0, 'code': 'APA'}
9 {'count': 0, 'code': 'BWU'}
10 {'count': 0, 'code': 'BID'}
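Applied to the traversal in the question, the same idea might look like the sketch below. It drops the coalesce and constant(0) and relies on count() returning 0 when no matching view edge exists. It assumes userId is bound as in the original query. Whether the remaining child traversal (in particular has, hasId and limit(1) inside by) is handled natively is not confirmed here, so it is worth running the result through the profiler again to check that the warning is gone.

```groovy
// Sketch only: the question's traversal with the coalesce/constant removed.
// count() is a reducing step, so it emits 0 for posts with no matching
// view edge and 1 (because of limit(1)) for posts the user has viewed.
g.V()
  .hasLabel("post")
  .order()
    .by(
      __.inE("view")
        .has("isActive", true)
        .outV()
        .hasId(userId)   // userId: assumed to be bound by the caller
        .limit(1)
        .count(),
      order.asc          // 0 (not viewed) sorts before 1 (viewed)
    )
```

The secondary sort keys shown in the warning (rekognitionModerationDate and createdDate, both descending) could be appended as further by modulators in the same way as in the full query.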
Upvotes: 1