Reputation: 6616
I'm not sure the title is worded very well, but I can't think of a better way to phrase it. I'm populating a Neo4j database with some data.
The data is mainly generated from data I have about pairs of users. There is a percentage relationship between users, such as:
   80
A ---> B
But the reverse relation is not the same:
   60
A <--- B
I could put both relations in, but I think what I might do is use the mean:
   70
A <--> B
But I'm open to suggestions here.
What I want to do in Neo4j is get groups of related users. For example, I want to get a group of users with mean % relation > 50. So, if we have:
     A
40 /   \ 60
  B --- C ------ D
    20      70
We'd get back a subset, something like:
     A
       \ 60
        C ------ D
            70
I have no idea how to do that. Another thing is that I'm pretty sure it won't be possible to reach every node from every other node; I think the graph is disconnected, made up of several large separate graphs. But I'd like to be able to get everything that meets the criterion above, even if some groups of nodes are completely separate from the others.
To give an idea of scale, there will be around 100,000 nodes and 550,000 edges.
Upvotes: 1
Views: 440
Reputation: 6331
Maybe something like http://tinyurl.com/c8tbth4 is applicable here?
start a=node(*) match p=a-[r]-()-[t]-() where r.percent>50 AND t.percent>50 return p, r.percent, t.percent
Upvotes: 1
Reputation: 14809
A couple thoughts.
First, it's fine if your graph isn't connected, but you need some way to access every component you want to analyze. In a disconnected graph in Neo4j, that either means Lucene indexing, some external "index" that holds node or relationship ids, or iterating through all nodes or relationships in the DB (which is slow, but might be required by your problem).
Second, though I don't know your domain, realize that sticking with the original representation and weights might be fine. Neo doesn't have undirected edges (though you can just ignore edge direction), and you might need that data later. OTOH, your modification taking the mean of the weights does simplify your analysis.
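To make the trade-off concrete, here is a minimal sketch in plain Python (not Neo4j code) of collapsing the two directed percentages into one mean while leaving the original directed data untouched. The function name `mean_undirected` and the `(src, dst, percent)` triple layout are assumptions for illustration only. A pair with only one direction present just keeps that single value, which is also an assumption you may want to change.

```python
from collections import defaultdict

def mean_undirected(directed):
    """Collapse directed (src, dst, percent) triples into one mean per unordered pair."""
    buckets = defaultdict(list)
    for src, dst, percent in directed:
        # Sort the endpoints so A->B and B->A land in the same bucket.
        key = tuple(sorted((src, dst)))
        buckets[key].append(percent)
    # Assumption: a pair with a single direction keeps that one value as its mean.
    return {pair: sum(values) / len(values) for pair, values in buckets.items()}

directed = [("A", "B", 80), ("B", "A", 60)]
print(mean_undirected(directed))  # {('A', 'B'): 70.0}
```

Since the input list is not modified, you can store both the directed weights and the derived mean and decide later which one your queries use.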
Third, depending on the size and characteristics of your graph, this could be very slow. It sounds like you want all connected components in the subgraph built from the edges with a weight greater than 50. AFAIK, that requires an O(N) operation over your database, where N is the number of edges in the DB. You could iterate over all edges in your DB, filter based on their weight, and then cluster from there.
Using Gremlin/Groovy, you can do this pretty easily. Check out the Gremlin docs.
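Independent of Gremlin, the filter-then-cluster idea can be sketched in plain Python. This is only an illustration under stated assumptions: the edges have already been exported from the database as `(node, node, weight)` triples with the mean weight, and `groups_above` is a made-up name, not a Neo4j or Gremlin API. It makes one pass over the edges to filter them, then a breadth-first search to collect each connected component.

```python
from collections import defaultdict, deque

def groups_above(edges, threshold=50):
    """Return connected components of the subgraph whose edges exceed threshold."""
    adjacency = defaultdict(set)
    for a, b, weight in edges:
        if weight > threshold:
            # Treat the edge as undirected.
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    groups = []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(component)
    return groups

# The example graph from the question.
edges = [("A", "B", 40), ("A", "C", 60), ("B", "C", 20), ("C", "D", 70)]
print(groups_above(edges))
```

For the question's example this yields a single group containing A, C and D. B drops out because neither of its edges is above 50. Nodes with no qualifying edge are never added to `adjacency`, so disconnected parts of the graph simply show up as separate groups.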
Another approach might be some sort of iterative clustering as you insert your data. It occurs to me that this could be a significant real-world performance improvement, though it doesn't get around the O(N) necessity.
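One way to picture clustering at insert time is a union-find (disjoint-set) structure that is updated for every edge you load. This is a swapped-in technique chosen for illustration, not something built into Neo4j, and the class name `IncrementalGroups` and its methods are hypothetical. It also assumes edges are only ever added: removing an edge or lowering a weight later would require rebuilding the structure.

```python
from collections import defaultdict

class IncrementalGroups:
    """Union-find over nodes joined by edges whose weight exceeds a threshold."""

    def __init__(self, threshold=50):
        self.threshold = threshold
        self.parent = {}

    def find(self, node):
        """Return the representative of node's group, registering node if new."""
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[node] != root:
            next_node = self.parent[node]
            self.parent[node] = root
            node = next_node
        return root

    def add_edge(self, a, b, weight):
        """Merge the groups of a and b if the edge qualifies; otherwise ignore it."""
        if weight > self.threshold:
            root_a, root_b = self.find(a), self.find(b)
            if root_a != root_b:
                self.parent[root_b] = root_a

    def groups(self):
        """Return the current groups as a list of sets."""
        by_root = defaultdict(set)
        for node in list(self.parent):
            by_root[self.find(node)].add(node)
        return list(by_root.values())

tracker = IncrementalGroups(threshold=50)
for a, b, weight in [("A", "B", 40), ("A", "C", 60), ("B", "C", 20), ("C", "D", 70)]:
    tracker.add_edge(a, b, weight)  # call this alongside each database insert
print(tracker.groups())
```

Each edge is still looked at once, so the total work stays linear in the number of edges, but it is spread across the load instead of requiring a separate full scan afterwards.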
Upvotes: 2