Reputation: 5218

Patterns for adding inferred relationships to a neo4j database

We are importing a large amount of data into a Neo4j database using the batch insertion API. The database will be used to power a read-only API (embedded server).

The data we are importing is a very close copy of the domain concepts/entities that are held in the existing database schema. We are exploiting these relationships to find additional relationships in the data and drive additional features on our website.

For example, if we had person-[:reads]->book-[:writtenBy]->person, we might decide that this implies an additional relationship person-[:isAFanOf]->person. This makes our code a little more expressive (we can talk about the "is a fan of" relationship directly) and makes many of our queries and traversals much faster, as there is no need to hop across two entities.

Where would be the best place to do this? We came up with a number of suggestions:

In the batch insert code, after all the relevant entities have been imported.
In a process that 'spiders' the network: it looks for users that need inferred relationships, adds them, and then schedules their neighbours for the same treatment.
At read time in the API. This is not ideal, as it could mean quite a long initial load time for the user consuming the data.

Another complication is that the database will be updated every 24 hours with newly created data, so we need an approach that works for both our full and our partial imports.

Examples/experience very much welcome.

Upvotes: 1

Answers (3)

Nicholas

Reputation: 7521

Are you required to use Cypher? The Cypher query listed by tstorms looks nice and works fine, but since you have 60 million nodes it might be tough: it would wrap all of those in a single transaction, which could lead to high memory usage.

If you are not tied to Cypher, you could use the Java API (I'm assuming you're using Java) to do this manually.

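        // Imports for the embedded Java API used below.
        import org.neo4j.graphdb.Direction;
        import org.neo4j.graphdb.DynamicRelationshipType;
        import org.neo4j.graphdb.Node;
        import org.neo4j.graphdb.Relationship;
        import org.neo4j.graphdb.RelationshipType;
        import org.neo4j.graphdb.Transaction;
        import org.neo4j.tooling.GlobalGraphOperations;

        // db is assumed to be an already opened GraphDatabaseService.
        RelationshipType readsRelationshipType = DynamicRelationshipType.withName("reads");
        RelationshipType writtenByRelationshipType = DynamicRelationshipType.withName("writtenBy");
        RelationshipType isAFanOfRelationshipType = DynamicRelationshipType.withName("isAFanOf");
        int counter = 0;
        Transaction tx = db.beginTx();
        try {
            for (Node reader : GlobalGraphOperations.at(db).getAllNodes()) {
                // First hop: reader -[:reads]-> book
                for (Relationship reads : reader.getRelationships(Direction.OUTGOING, readsRelationshipType)) {
                    Node book = reads.getOtherNode(reader);
                    // Second hop: book -[:writtenBy]-> author
                    for (Relationship writtenBy : book.getRelationships(Direction.OUTGOING, writtenByRelationshipType)) {
                        // Take the author from the writtenBy relationship, not from reads.
                        Node author = writtenBy.getOtherNode(book);
                        try {
                            reader.createRelationshipTo(author, isAFanOfRelationshipType);
                        } catch (Exception e) {
                            // TODO: Something for exception
                        }
                    }
                }
                counter++;
                // Commit every 100,000 nodes so one transaction does not hold everything.
                if (counter % 100000 == 0) {
                    tx.success();
                    tx.finish();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } catch (Exception e) {
            // Marks the current batch for rollback; the exception itself is swallowed here.
            tx.failure();
        } finally {
            tx.finish();
        }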

This code leaves the error handling as a TODO and commits every 100,000 nodes; you can adjust both as you need.

Upvotes: 3

RaduK

Reputation: 1463

I think the query @tstorms proposed will not work in a reasonable amount of time for 60 million nodes.

If you really want to do it, there are some improvements you can make to @tstorms' solution:

Use indexes for the start entities (for instance, person in your case) and start the queries from those.
You mentioned that you have to do this operation incrementally, so you probably need to keep indexes for the last batch operation so that you don't have to iterate over already processed nodes.
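
To make the incremental idea concrete, here is a minimal sketch in plain Java collections rather than the Neo4j API. It assumes you can obtain, from whatever index you keep for the last batch, just the reads edges that were added in that batch. The class name, method name, and map layout are all made up for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model for illustration only; not the Neo4j API.
class IncrementalFanInference {

    /**
     * Adds inferred fan edges for the reads edges of the latest batch only.
     *
     * newReads:     reader -> books first seen in the latest import
     * writtenBy:    book -> authors (the full mapping)
     * existingFans: reader -> authors already linked; updated in place
     * Returns the number of fan edges that were actually new.
     */
    static int applyBatch(Map<String, Set<String>> newReads,
                          Map<String, Set<String>> writtenBy,
                          Map<String, Set<String>> existingFans) {
        int created = 0;
        for (Map.Entry<String, Set<String>> entry : newReads.entrySet()) {
            Set<String> fansOfReader =
                    existingFans.computeIfAbsent(entry.getKey(), k -> new HashSet<>());
            for (String book : entry.getValue()) {
                for (String author : writtenBy.getOrDefault(book, Collections.<String>emptySet())) {
                    // Set.add returns false when the edge already exists.
                    if (fansOfReader.add(author)) {
                        created++;
                    }
                }
            }
        }
        return created;
    }
}
```

Only readers touched by the latest batch are visited, so the cost follows the size of the batch rather than the size of the graph. This sketch does not cover writtenBy edges that are newly added to books imported earlier; those would need a similar pass starting from the book side.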

I personally wouldn't do it unless it's really necessary. For performance, I'd wait and see rather than optimizing upfront; for query simplification, you can use named paths in Cypher (http://docs.neo4j.org/chunked/milestone/query-match.html#match-named-path) or user-defined steps in Gremlin (https://github.com/tinkerpop/gremlin/wiki/User-Defined-Steps).

Upvotes: 0

tstorms

Reputation: 5001

I'd probably do it right after the import. The following Cypher statement should do the trick:

START p=node(*)
MATCH p-[:reads]->book-[:writtenBy]->p2
CREATE p-[:isAFanOf]->p2
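
One thing worth checking with this pattern: if a person reads two books by the same author, the MATCH should yield that pair twice, so a plain CREATE would add two isAFanOf relationships between them. The sketch below shows the same two-hop inference in plain Java collections, not the Neo4j API, with results collected in a set so that each pair appears once. The class name, method name, and map layout are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model for illustration only; not the Neo4j API.
class FanInference {

    /**
     * reads:     person -> books that person reads
     * writtenBy: book -> authors of that book
     * Returns person -> authors the person is inferred to be a fan of.
     */
    static Map<String, Set<String>> inferFans(Map<String, Set<String>> reads,
                                              Map<String, Set<String>> writtenBy) {
        Map<String, Set<String>> fans = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : reads.entrySet()) {
            String reader = entry.getKey();
            for (String book : entry.getValue()) {
                for (String author : writtenBy.getOrDefault(book, Collections.<String>emptySet())) {
                    // The set collapses duplicates: two books by one author give one edge.
                    fans.computeIfAbsent(reader, k -> new LinkedHashSet<>()).add(author);
                }
            }
        }
        return fans;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> reads = new LinkedHashMap<>();
        reads.put("alice", new LinkedHashSet<>(Arrays.asList("book1", "book2")));
        Map<String, Set<String>> writtenBy = new LinkedHashMap<>();
        writtenBy.put("book1", Collections.singleton("carol"));
        writtenBy.put("book2", Collections.singleton("carol"));
        // alice reads two books by carol but ends up with a single inferred edge.
        System.out.println(inferFans(reads, writtenBy));
    }
}
```

In the graph itself, the equivalent would be to check for an existing isAFanOf relationship before creating a new one.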

Upvotes: 2
