natsbat4ws
natsbat4ws

Reputation: 137

Find largest connected components AWS Neptune

In an AWS Neptune graph with billions of nodes and edges, how would one go about finding the largest connected components efficiently? The reason I am trying to find the answer to this question is because usually large connected components in my domain indicate fraud. Most nodes in my graph only are connected to like tens of other nodes. It is suspicious when nodes are connected to hundreds or thousands of other nodes.

I have several questions:

  1. Is AWS Neptune an appropriate database for finding large connected components in a graph with billions of nodes and edges?
  2. Would it be more efficient to calculate PageRank for the graph? A high PageRank would similarly indicate fraud I believe. If so, how would I go about calculating PageRank?
  3. What architecture and algorithm could find the largest connected components?
  4. I am not just trying to find fraud that happened in the past but I am also trying to identify fraud in real time. As data is ingested, what would be a good way to identify a fraudulent node in real time? I am thinking that Neptune Streams and doing DFS on the node to get the entire connected component would be appropriate here.
  5. Eventually, years from now, when I've identified enough fraud, I am thinking I could do some sort of supervised machine learning. Not sure what the benefit of this would be though since most large connected components are fraud. It might be better at identifying harder to distinguish cases?
  6. Similarly to connected components and PageRank, are there other graph attributes I should look into that might indicate fraud in my case? I know this might be difficult to answer since I haven't revealed my domain.

Any help is much appreciated!

Upvotes: 2

Views: 828

Answers (1)

Kelvin Lawrence
Kelvin Lawrence

Reputation: 14371

Connected component finding queries can be expressed in Gremlin, but whether or not those queries will be efficient, is going to depend on the complexity of the graph. I would start by looking at the Gremlin Recipes document.

You will find several algorithms discussed there.

At very large scale, you may want to export data from the graph and run a Spark job (or similar) to find the fraud rings etc.

UPDATE 2024-01-29

In December 2023, Neptune Analytics was released. It includes support for built in algorithms, including many that can be used for community detection use cases. The documentation for the algorithms is here

Upvotes: 2

Related Questions