cts

Reputation: 1820

Grouping on properties from different vertices

I have a graph that looks something like the following:

pathway -> pathway_component -> gene -> organism

You can make an example graph like so:

m1 = g.addV('pathway').property('pathway_name', 'M00002').next()
m2 = g.addV('pathway').property('pathway_name', 'M00527').next()

c1 = g.addV('pathway_component').property('name', 'K00001').next()
c2 = g.addV('pathway_component').property('name', 'K00002').next()
c3 = g.addV('pathway_component').property('name', 'K00003').next()

g.addE('partof').from(c1).to(m1).iterate()
g.addE('partof').from(c2).to(m1).iterate()
g.addE('partof').from(c3).to(m2).iterate()

g1 = g.addV('gene').property('name', 'G00001').next()
g2 = g.addV('gene').property('name', 'G00002').next()
g3 = g.addV('gene').property('name', 'G00003').next()
g4 = g.addV('gene').property('name', 'G00004').next()
g5 = g.addV('gene').property('name', 'G00005').next()
g6 = g.addV('gene').property('name', 'G00006').next()
g7 = g.addV('gene').property('name', 'G00007').next()
g8 = g.addV('gene').property('name', 'G00008').next()

g.addE('isa').from(g1).to(c1).iterate()
g.addE('isa').from(g2).to(c3).iterate()

g.addE('isa').from(g3).to(c1).iterate()
g.addE('isa').from(g4).to(c2).iterate()
g.addE('isa').from(g5).to(c3).iterate()

g.addE('isa').from(g6).to(c1).iterate()
g.addE('isa').from(g7).to(c1).iterate()
g.addE('isa').from(g8).to(c2).iterate()

o1 = g.addV('organism').property('name', 'O000001').next()
o2 = g.addV('organism').property('name', 'O000002').next()
o3 = g.addV('organism').property('name', 'O000003').next()
o4 = g.addV('organism').property('name', 'O000004').next()

g.addE('partof').from(g1).to(o1).iterate()
g.addE('partof').from(g2).to(o1).iterate()
g.addE('partof').from(g3).to(o2).iterate()
g.addE('partof').from(g4).to(o2).iterate()
g.addE('partof').from(g5).to(o3).iterate()
g.addE('partof').from(g6).to(o3).iterate()
g.addE('partof').from(g7).to(o4).iterate()
g.addE('partof').from(g8).to(o4).iterate()

I'd like to count the genes per pathway per organism, so that the results look something like:

organism_1 pathway_1 gene_count
organism_1 pathway_2 gene_count
organism_2 pathway_1 gene_count
organism_2 pathway_2 gene_count

But so far I haven't figured it out. I tried the following:

g.V().has('pathway', 'pathway_name', within('M00002', 'M00527')).project('organism', 'pathway', 'count'). 
   by(__.in().hasLabel('pathway_component').
       in().hasLabel('gene').
       out().hasLabel('organism').
       values('name')).
   by('pathway_name').
   by(__.in().hasLabel('pathway_component').
       in().hasLabel('gene').
       count())

But it looks like the grouping is wrong:

==>[organism:O000001,pathway:M00002,count:6]
==>[organism:O000001,pathway:M00527,count:2]

It seems that each pathway reports only one organism (O000001), together with the count of all genes in that pathway across all four organisms. I'd expect to see something like:

O000001 M00002 1
O000001 M00527 1
O000002 M00002 2
O000002 M00527 0
O000003 M00002 1
O000003 M00527 1
O000004 M00002 2
O000004 M00527 0

How can I split out the results by both different organisms and different pathways?

Upvotes: 2

Views: 91

Answers (1)

Kelvin Lawrence

Reputation: 14371

Hopefully the final query below helps. I showed the steps I used to get there, part of which was making sure I understood the structure of your data.

First I wanted to see the shape of the graph.

gremlin> g.V().hasLabel('pathway').
......1>       in().hasLabel('pathway_component').
......2>       in().hasLabel('gene').
......3>       out().hasLabel('organism').
......4>       path().
......5>         by('pathway_name').
......6>         by('name').
......7>         by('name').
......8>         by('name')
==>[M00002,K00001,G00006,O000003]
==>[M00002,K00001,G00007,O000004]
==>[M00002,K00001,G00001,O000001]
==>[M00002,K00001,G00003,O000002]
==>[M00002,K00002,G00004,O000002]
==>[M00002,K00002,G00008,O000004]
==>[M00527,K00003,G00005,O000003]
==>[M00527,K00003,G00002,O000001] 

Then I used path and group to see which of these paths belong to each organism.

gremlin> g.V().hasLabel('pathway').
......1>       in().hasLabel('pathway_component').
......2>       in().hasLabel('gene').
......3>       out().hasLabel('organism').as('org').
......4>       group().
......5>         by(select('org').by('name')).
......6>         by(
......7>       path().
......8>         by('pathway_name').
......9>         by('name').
.....10>         by('name').
.....11>         by('name').fold()).
.....12>       unfold()
==>O000004=[path[M00002, K00001, G00007, O000004], path[M00002, K00002, G00008, O000004]]
==>O000003=[path[M00002, K00001, G00006, O000003], path[M00527, K00003, G00005, O000003]]
==>O000002=[path[M00002, K00001, G00003, O000002], path[M00002, K00002, G00004, O000002]]
==>O000001=[path[M00002, K00001, G00001, O000001], path[M00527, K00003, G00002, O000001]] 

Next, I changed the above query to nest two groups:

gremlin> g.V().hasLabel('pathway').as('pathway').
......1>       in().hasLabel('pathway_component').
......2>       in().hasLabel('gene').as('gene').
......3>       out().hasLabel('organism').as('org').
......4>       group().
......5>         by(select('org').by('name')).
......6>         by(
......7>           group().
......8>             by(select('pathway').by('pathway_name')).
......9>             by(select('gene').by('name').fold())).
.....10>       unfold()
==>O000004={M00002=[G00007, G00008]}
==>O000003={M00002=[G00006], M00527=[G00005]}
==>O000002={M00002=[G00003, G00004]}
==>O000001={M00002=[G00001], M00527=[G00002]}  

This yields the organism, the pathway name and the genes.
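
For readers who want to check the nesting logic outside Gremlin, here is a minimal plain-Python sketch of the same idea. It is not part of the traversal: the PATHS list is copied by hand from the path() output above, and the function name nest_by_organism_and_pathway is made up for illustration.

```python
from collections import defaultdict

# The eight (pathway, component, gene, organism) paths printed by the
# path() query above, copied by hand (an assumption of this sketch).
PATHS = [
    ("M00002", "K00001", "G00006", "O000003"),
    ("M00002", "K00001", "G00007", "O000004"),
    ("M00002", "K00001", "G00001", "O000001"),
    ("M00002", "K00001", "G00003", "O000002"),
    ("M00002", "K00002", "G00004", "O000002"),
    ("M00002", "K00002", "G00008", "O000004"),
    ("M00527", "K00003", "G00005", "O000003"),
    ("M00527", "K00003", "G00002", "O000001"),
]


def nest_by_organism_and_pathway(paths):
    """Group gene names by organism first, then by pathway (hypothetical helper)."""
    nested = defaultdict(lambda: defaultdict(list))
    for pathway, _component, gene, organism in paths:
        # Outer key mirrors the outer group(), inner key the inner group().
        nested[organism][pathway].append(gene)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {org: dict(by_pathway) for org, by_pathway in nested.items()}


genes_by_org = nest_by_organism_and_pathway(PATHS)
for organism in sorted(genes_by_org, reverse=True):
    print(organism, genes_by_org[organism])
```

The outer dictionary plays the role of the outer group() keyed on the organism, and the inner dictionary plays the role of the inner group() keyed on the pathway.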

Building on that I changed the query again to generate the counts. I hope this is close to what you needed.

gremlin> g.V().hasLabel('pathway').as('pathway').
......1>       in().hasLabel('pathway_component').
......2>       in().hasLabel('gene').as('gene').
......3>       out().hasLabel('organism').as('org').
......4>       group().
......5>         by(select('org').by('name')).
......6>         by(
......7>           group().
......8>             by(select('pathway').by('pathway_name')).
......9>             by(select('gene').by('name').fold().count(local))).
.....10>       unfold()
==>O000004={M00002=2}
==>O000003={M00002=1, M00527=1}
==>O000002={M00002=2}
==>O000001={M00002=1, M00527=1}
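
Note that the result above has no entry for a pathway in which an organism has no genes (for example, M00527 is absent for O000002 and O000004), whereas the expected output in the question lists those rows with a count of 0. One possible way to get flat rows with the zeros filled in is to post-process the results on the client. The following is only a sketch in plain Python, not a Gremlin feature: PATHS is copied by hand from the path() output earlier, and count_rows is a made-up helper name.

```python
from collections import Counter
from itertools import product

# The eight (pathway, component, gene, organism) paths from the path()
# output earlier in this answer, copied by hand (an assumption of this sketch).
PATHS = [
    ("M00002", "K00001", "G00006", "O000003"),
    ("M00002", "K00001", "G00007", "O000004"),
    ("M00002", "K00001", "G00001", "O000001"),
    ("M00002", "K00001", "G00003", "O000002"),
    ("M00002", "K00002", "G00004", "O000002"),
    ("M00002", "K00002", "G00008", "O000004"),
    ("M00527", "K00003", "G00005", "O000003"),
    ("M00527", "K00003", "G00002", "O000001"),
]


def count_rows(paths):
    """Return (organism, pathway, count) rows, including zero counts.

    This counts paths, which equals the gene count only when each gene
    reaches a pathway through a single component, as in the sample data.
    """
    counts = Counter((organism, pathway) for pathway, _comp, _gene, organism in paths)
    # Only organisms and pathways that occur in at least one path are known here.
    organisms = sorted({path[3] for path in paths})
    pathways = sorted({path[0] for path in paths})
    # A Counter returns 0 for missing keys, which fills in the empty pairs.
    return [(org, pw, counts[(org, pw)]) for org, pw in product(organisms, pathways)]


rows = count_rows(PATHS)
for organism, pathway, gene_count in rows:
    print(organism, pathway, gene_count)
```

With the sample graph this produces one row per organism and pathway pair, in the same layout as the expected output in the question.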

Upvotes: 3
