Reputation: 73
I have an OrientDB graph database with nodes connected in series with link of type NEXT
. (There are several distinct series in my data, and no node has more than one incoming and one outgoing "NEXT" link). The nodes all have a property called "name". I would like to find all the sequences of name that occur in traversing the paths from start to end.
I.e. to get one sequence of names, start at a node which has no incoming NEXT link, follow the NEXT links until you have reached a node with no outgoing "NEXT" links, and gather up into a list all the names of the nodes you have passed.
e.g. a sub-graph of form
(Bob)-[NEXT]->(Sharon)-[NEXT]->(Carl)
should give the list
["Bob", "Sharon", "Carl"]
For clarification, here is a Cypher (Neo4j) query which gets me all the possible lists.
match (start) -[:NEXT*]-> (end),
p = shortestPath(start-[:NEXT*]-> end)
where not ()-[:NEXT]->(start) and not (end)-[:NEXT]->()
return extract( s in nodes(p) | s.name ) as path
However, I need to do this in OrientDB, which doesn't use Cypher.
I would like to know whether this is possible in OrientDB, and if so whether it is simpler in the SQL language or Gremlin.
As a secondary question, ideally I would like to not return all the lists of names, because what I really care about is how often each list occurs. So instead I want to return the unique lists with how often that particular list was found. Is this possible to do in OrientDB, or would I have to retrieve all the path data from OrientDB as above and do the aggregation somewhere else?
UPDATE
I have created some example data here to match the statement of my original question.
create database plocal:people
create class Person extends V
create property Person.name string
create property Person.age float
create property Person.ident integer
insert into Person(name,age,ident) VALUES ("Bob", 30.5, 1)
insert into Person(name,age,ident) VALUES ("Bob", 30.5, 2)
insert into Person(name,age,ident) VALUES ("Carol", 20.3, 3)
insert into Person(name,age,ident) VALUES ("Carol", 19, 4)
insert into Person(name,age,ident) VALUES ("Laura", 75, 5)
insert into Person(name,age,ident) VALUES ("Laura", 60.5, 6)
insert into Person(name,age,ident) VALUES ("Laura", 46, 7)
insert into Person(name,age,ident) VALUES ("Mike", 16.3, 8)
insert into Person(name,age,ident) VALUES ("David", 86, 9)
insert into Person(name,age,ident) VALUES ("Alice", 5, 10)
insert into Person(name,age,ident) VALUES ("Nigel", 69, 11)
insert into Person(name,age,ident) VALUES ("Carol", 60, 12)
insert into Person(name,age,ident) VALUES ("Mike", 16.3, 13)
insert into Person(name,age,ident) VALUES ("Alice", 5, 14)
insert into Person(name,age,ident) VALUES ("Mike", 16.3, 15)
create class NEXT extends E
create edge NEXT from (select from Person where ident = 1) to (select from Person where ident = 3)
create edge NEXT from (select from Person where ident = 2) to (select from Person where ident = 4)
create edge NEXT from (select from Person where ident = 8) to (select from Person where ident = 12)
create edge NEXT from (select from Person where ident = 5) to (select from Person where ident = 15)
create edge NEXT from (select from Person where ident = 15) to (select from Person where ident = 14)
create edge NEXT from (select from Person where ident = 7) to (select from Person where ident = 13)
create edge NEXT from (select from Person where ident = 13) to (select from Person where ident = 10)
This should give me the following final result
["Bob", "Carol"]
["Laura", "Mike", "Alice"]
["Laura"]
["Mike", "Carol"]
["David"]
["Nigel"]
Here is what I get using the suggestions of neRok
Firstly selecting all starting nodes - this works as expected
orientdb {db=people}> select from Person where in_NEXT is null
----+------+------+-----+----+-----+--------
# |@RID |@CLASS|name |age |ident|out_NEXT
----+------+------+-----+----+-----+--------
0 |#11:0 |Person|Bob |30.5|1 |[#12:0]
1 |#11:1 |Person|Bob |30.5|2 |[#12:1]
2 |#11:4 |Person|Laura|75.0|5 |[#12:3]
3 |#11:5 |Person|Laura|60.5|6 |null
4 |#11:6 |Person|Laura|46.0|7 |[#12:5]
5 |#11:7 |Person|Mike |16.3|8 |[#12:2]
6 |#11:8 |Person|David|86.0|9 |null
7 |#11:10|Person|Nigel|69.0|11 |null
----+------+------+-----+----+-----+--------
Now if I try to get the arrays of names obtained by traversing out from these nodes
select $series.name from (select from Person where in_NEXT is null ) let $series = (traverse out('NEXT') from $current)
----+------+-------
# |@CLASS|$series
----+------+-------
0 |null |[0]
1 |null |[0]
2 |null |[0]
3 |null |[0]
4 |null |[0]
5 |null |[0]
6 |null |[0]
7 |null |[0]
----+------+-------
I think this means that it gets no results from the traversal, or it can't generate an array of names?
The final aggregation step treats all these rows as the same:
orientdb {db=people}> select series, sum(1) as number from (select $series.name as series from (select from Person where in_NEXT is null) let $series = (traverse out('NEXT') from $current)) group by series
----+------+------+------
# |@CLASS|series|number
----+------+------+------
0 |null |[0] |8
----+------+------+------
so I'm not yet getting the result I'd like.
I think the issue is extracting the array of names from the traversal? A single traversal query does find the expected traversal, but I can't work out how to manipulate the data to provide the array of names from that traversal.
Here is an example traversal from one node: orientdb {db=people}> traverse out('NEXT') from (select from Person where ident = 7)
----+------+------+-----+----+-----+--------+-------
# |@RID |@CLASS|name |age |ident|out_NEXT|in_NEXT
----+------+------+-----+----+-----+--------+-------
0 |#11:6 |Person|Laura|46.0|7 |[#12:5] |null
1 |#11:12|Person|Mike |16.3|13 |[#12:6] |[#12:5]
2 |#11:9 |Person|Alice|5.0 |10 |null |[#12:6]
----+------+------+-----+----+-----+--------+-------
Upvotes: 1
Views: 505
Reputation: 1005
First of all you need a query to get the nodes without incoming edges. This query will be used as a sub-query. The orientdb manual suggests something like select from Nodes where in('NEXT').size() = 0
, but the following seems a bit faster to me (YMMV) select from Nodes where in_NEXT is null
.
Now that we have a list of starting nodes, we can use traverse to get all the outgoing edges.
select $series.name from (
select from Nodes where in_NEXT is null
)
let $series = (traverse out('NEXT') from $current)
This query will return rows with the data ["Bob","Sharon","Carl"]
, like you wanted.
Now to your bit about counting the occurrence of each series. I always fumble my way through when dealing with unique/distinct and counts (so this may not be a good method), but the following seems to work;
select series, sum(1) as number from (
select $series.name as series from (
select from Nodes where in_NEXT is null
)
let $series = (traverse out('NEXT') from $current)
)
group by series
So we have wrapped the previous query in another query, which groups by the series, and then gets the count of each row.
On a side note, this seems like an 'expensive' query. Perhaps you can create your database in a more efficient manner? That would be another question/discussion though.
Update to address your question update; Using your database setup, all my queries work in the studio web-app. There are some 'quirks' in the console though. The first query gives me the following in the console;
orientdb {db=stackpeople}> select $series.name from (select from Person where in_NEXT is null ) let $series = (traverse out('NEXT') from $current)
----+------+-------
# |@CLASS|$series
----+------+-------
0 |null |[2]
1 |null |[2]
2 |null |[3]
3 |null |[1]
4 |null |[3]
5 |null |[2]
6 |null |[1]
7 |null |[1]
----+------+-------
I don't know why your results are showing every $series row as [0]. Perhaps you aren't querying on the correct fields (ie you aren't using your demo database from above). My other guess is that there is a problem in the OrientDB version you are using - I am using 2.1-rc6.
Null class is correct, because this is a projected query, not a record. It could be a record by changing the start of the query to select *, $series.name from ...
.
For whatever reason, $series is not expanded in the console, but it is showing the number of records that $series contains ($series must be a list). I don't usually use console, so I don't know if this is the intended result, or a bug (my guess is this is just the way it shows lists). I found a way to show the names though, with the following query;
orientdb {db=stackpeople}> select *, $series.name.asString() as names from (select from Person where in_NEXT is null ) let $series = (traverse out('NEXT') from $current)
----+------+------+-----+----+-----+--------+--------------------
# |@RID |@CLASS|name |age |ident|out_NEXT|names
----+------+------+-----+----+-----+--------+--------------------
0 |#11:0 |Person|Bob |30.5|1 |[size=1]|[Bob, Carol]
1 |#11:1 |Person|Bob |30.5|2 |[size=1]|[Bob, Carol]
2 |#11:4 |Person|Laura|75.0|5 |[size=1]|[Laura, Mike, Alice]
3 |#11:5 |Person|Laura|60.5|6 |null |[Laura]
4 |#11:6 |Person|Laura|46.0|7 |[size=1]|[Laura, Mike, Alice]
5 |#11:7 |Person|Mike |16.3|8 |[size=1]|[Mike, Carol]
6 |#11:8 |Person|David|86.0|9 |null |[David]
7 |#11:10|Person|Nigel|69.0|11 |null |[Nigel]
----+------+------+-----+----+-----+--------+--------------------
Regarding the last query (count of names), my query gives;
orientdb {db=stackpeople}> select series, sum(1) as number from (select $series.name as series from (select from Person where in_NEXT is null) let $series = (traverse out('NEXT') from $current)) group by series
----+------+------+------
# |@CLASS|series|number
----+------+------+------
0 |null |[2] |2
1 |null |[3] |2
2 |null |[1] |1
3 |null |[2] |1
4 |null |[1] |1
5 |null |[1] |1
----+------+------+------
Which is correct again (albeit the series is show as a count of results). This could be tweaked with the following;
orientdb {db=stackpeople}> select names.asString(), sum(1) as number from (select $series.name as names from (select from Person where in_NEXT is null) let $series = (traverse out('NEXT') from $current)) group by names
----+------+--------------------+------
# |@CLASS|names |number
----+------+--------------------+------
0 |null |[Bob, Carol] |2
1 |null |[Laura, Mike, Alice]|2
2 |null |[Laura] |1
3 |null |[Mike, Carol] |1
4 |null |[David] |1
5 |null |[Nigel] |1
----+------+--------------------+------
Note that changing the list to a string means you will need to split this string up in whatever program/code is receiving the query results, in order to know each name. Thus it's probably better to not change it to string, then the query will actually return a list.
Upvotes: 0