In gremlin (specifically tinkerpop), how do you query and then make a second query without ANY of the results from the first query?

Question

I'm making a face matching database. I have a list of faces that may be similar to another face. I'm trying to get the top 10 faces with the most similar faces. The first face should include all the faces. The second face should exclude ANY similar face that matches the first face.

// The graph
graph = TinkerFactory.createModern()
g = graph.traversal()

g.addV('face').property('faceId','face1')
g.addV('face').property('faceId','face2')
g.addV('face').property('faceId','face3')
g.addV('face').property('faceId','face4')
g.addV('face').property('faceId','face5')
g.addV('face').property('faceId','face6')
g.addV('face').property('faceId','face7')
g.addV('face').property('faceId','face8')
g.addV('face').property('faceId','face9')

g.V().has('face','faceId','face1').as('a').V().has('face','faceId','face2').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face1').as('a').V().has('face','faceId','face3').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face1').as('a').V().has('face','faceId','face4').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face1').as('a').V().has('face','faceId','face5').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face2').as('a').V().has('face','faceId','face3').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face2').as('a').V().has('face','faceId','face4').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face5').as('a').V().has('face','faceId','face6').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face6').as('a').V().has('face','faceId','face7').as('b').addE('is similar').from('a').to('b')
g.V().has('face','faceId','face8').as('a').V().has('face','faceId','face9').as('b').addE('is similar').from('a').to('b')

This gets me the initial list. Only the top number is right.

:> g.V()
    .hasLabel('face')
    .as('a')
    .project('faceId','count')
    .by(select('a').values('faceId'))
    .by(bothE('is similar').otherV().id().dedup().count())
    .order()
    .by('count',desc)
    .range(0,10)

That gets me this:

    ==>[faceId:face1,count:4]
    ==>[faceId:face2,count:3]
    ==>[faceId:face5,count:3]
    ==>[faceId:face3,count:2]
    ==>[faceId:face4,count:2]
    ==>[faceId:face6,count:2]
    ==>[faceId:face7,count:1]

I've tried this to get additional lists

g.V().has('face','faceId',without(V().has('face','faceId','face1').bothE('is similar').bothV().values('faceId').dedup().fold())).valueMap()

Ideally, I want to get something like this:

==>[faceId:face1,count:4]
==>[faceId:face6,count:1]
==>[faceId:face9,count:1]

I'm okay fetching the 'master' list first and then processing the first one to get the similar faceIds then running the list again to get the next most if I need to. I just can't figure out how to get it to work.

This works, but I'm not sure how to combine that into the without() section of the first query.

g.V().has('face','faceId',without('face1','face2','face3','face4','face5')).as('a').project('faceId','count').by(select('a').values('faceId')).by(bothE('is similar').otherV().id().dedup().count()).order().by('count',desc).range(0,10)

stephen mallette · Accepted Answer

I'm not quite sure if this is what you want but it returns your result. It generally sounds like you don't want to traverse the same vertex more than once in which case you could just lazily store() those vertices and filter them out as you go:

gremlin> g.V().hasLabel('face').
......1>   project('faceId','count').
......2>     by('faceId').
......3>     by(where(without('a')).both('is similar').
......4>        where(without('a')).
......5>        store('a').
......6>        count())
==>[faceId:face1,count:4]
==>[faceId:face9,count:1]
==>[faceId:face2,count:0]
==>[faceId:face3,count:0]
==>[faceId:face4,count:0]
==>[faceId:face5,count:0]
==>[faceId:face6,count:1]
==>[faceId:face7,count:0]
==>[faceId:face8,count:0]

Depending on the graph you use this sort of query may not be deterministic as Gremlin doesn't assume the order of items in the traversal stream to be the same on each execution. You'd need to add your own order() steps to enforce that if it matters. This could end up being an expensive query as it is being globally executed over the graph and needs to track all the vertices traversed. In practice, you might need to further limit the paths traversed or perhaps use a different algorithm to determine similarity.

In gremlin (specifically tinkerpop), how do you query and then make a second query without ANY of the results from the first query?

Answers (1)

Related Questions