Reputation: 5218
We are importing a large amount of data into a Neo4j database using the batch insertion API. The database will be used to power a read-only API (embedded server).
The data we are importing is a very close copy of the domain concepts/entities that are held in the existing database schema. We are exploiting these relationships to find additional relationships in the data and drive additional features on our website.
For example, if we had the following: person-[:reads]->book-[:writtenBy]->person, we might decide that this implies an additional relationship person-[:isAFanOf]->person. This makes our code a little more expressive (as we can talk about the "is a fan of" relationship directly), and many of our queries and traversals a lot more performant, as there is no need to hop across two entities.
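To make the derivation concrete, here is a minimal sketch in plain Java, independent of Neo4j. The class name `FanDerivation` and the map-based representation of the graph are assumptions made purely for illustration; they are not part of our import code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: models the graph as plain maps instead of Neo4j nodes.
final class FanDerivation {

    /**
     * reads:     reader -> books the reader has read
     * writtenBy: book   -> authors of that book
     * Returns reader -> authors the reader "is a fan of".
     */
    static Map<String, Set<String>> deriveFans(Map<String, Set<String>> reads,
                                               Map<String, Set<String>> writtenBy) {
        Map<String, Set<String>> fans = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : reads.entrySet()) {
            for (String book : entry.getValue()) {
                // Books with no known author contribute nothing.
                for (String author : writtenBy.getOrDefault(book, Collections.emptySet())) {
                    // A set collapses duplicates when a reader has read
                    // several books by the same author.
                    fans.computeIfAbsent(entry.getKey(), k -> new TreeSet<>()).add(author);
                }
            }
        }
        return fans;
    }
}
```

Note that the two-hop pattern can match the same (reader, author) pair several times, once per book, so whichever approach we end up with probably needs to decide whether duplicate isAFanOf relationships are acceptable.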
Where would be the best place to do this? We came up with a number of suggestions:
Another complication is that the database will be updated every 24 hours with newly created data, so we need an approach that works for both our full import and our partial (incremental) import.
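For the partial import, one option we can imagine is to derive relationships only from the newly imported reads edges and skip pairs that already exist. The sketch below shows that idea in plain Java; `IncrementalFans`, the `"reader->author"` string encoding of a pair, and the in-memory set of existing pairs are all assumptions for illustration, not something we have built.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for the daily partial import.
final class IncrementalFans {

    /**
     * newReads:      each element is {reader, book} for a reads edge added in this import
     * writtenBy:     book -> authors of that book
     * existingPairs: fan pairs already stored, encoded as "reader->author"
     * Returns the fan pairs that still have to be created, in encounter order.
     */
    static Set<String> newFanPairs(List<String[]> newReads,
                                   Map<String, Set<String>> writtenBy,
                                   Set<String> existingPairs) {
        Set<String> toCreate = new LinkedHashSet<>();
        for (String[] edge : newReads) {
            String reader = edge[0];
            String book = edge[1];
            for (String author : writtenBy.getOrDefault(book, Collections.emptySet())) {
                String pair = reader + "->" + author;
                // Only keep pairs that are not already present in the database.
                if (!existingPairs.contains(pair)) {
                    toCreate.add(pair);
                }
            }
        }
        return toCreate;
    }
}
```

This only covers newly added reads edges; newly added writtenBy edges for books that already have readers would need the same treatment from the other direction.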
Examples/experience very much welcome.
Upvotes: 1
Views: 718
Reputation: 7521
Are you required to use Cypher? If not, bear in mind that with 60 million nodes the Cypher query posted by tstorms, which looks nice and works fine, might struggle: it would wrap all of that work in a single transaction, which could lead to high memory usage.
You could use the Java API (I'm assuming you're using Java) to do this manually.
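import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;

// The method name is illustrative; db is the embedded database instance.
void createFanRelationships(GraphDatabaseService db) {
    RelationshipType readsRelationshipType = DynamicRelationshipType.withName("reads");
    RelationshipType writtenByRelationshipType = DynamicRelationshipType.withName("writtenBy");
    RelationshipType isAFanOfRelationshipType = DynamicRelationshipType.withName("isAFanOf");
    int counter = 0;
    Transaction tx = db.beginTx();
    try {
        for (Node reader : GlobalGraphOperations.at(db).getAllNodes()) {
            for (Relationship reads : reader.getRelationships(Direction.OUTGOING, readsRelationshipType)) {
                Node book = reads.getOtherNode(reader);
                for (Relationship writtenBy : book.getRelationships(Direction.OUTGOING, writtenByRelationshipType)) {
                    // Follow the writtenBy relationship (not reads) to reach the author.
                    Node author = writtenBy.getOtherNode(book);
                    try {
                        reader.createRelationshipTo(author, isAFanOfRelationshipType);
                    } catch (Exception e) {
                        // TODO: Something for exception
                    }
                }
            }
            counter++;
            // Commit every 100,000 nodes to keep each transaction's memory footprint bounded.
            if (counter % 100000 == 0) {
                tx.success();
                tx.finish();
                tx = db.beginTx();
            }
        }
        tx.success();
    } catch (Exception e) {
        tx.failure();
    } finally {
        tx.finish();
    }
}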
This code leaves the error handling as a stub and picks an arbitrary commit interval (every 100,000 nodes), but you can adjust both as you need.
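The commit-every-N pattern on its own can be sketched in plain Java, without Neo4j, like this. `BatchRunner` and its parameters are hypothetical names for illustration; in the real code `commit` would stand in for the success/finish/beginTx sequence.

```java
import java.util.function.Consumer;

// Hypothetical helper illustrating the periodic-commit pattern.
final class BatchRunner {

    /**
     * Applies work to every item and runs commit after each full batch,
     * plus once at the end if a partial batch is still pending.
     * Returns the number of commits performed.
     */
    static <T> int runInBatches(Iterable<T> items, int batchSize,
                                Consumer<T> work, Runnable commit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        int commits = 0;
        for (T item : items) {
            work.accept(item);
            pending++;
            if (pending == batchSize) {
                commit.run();
                commits++;
                pending = 0;
            }
        }
        // Flush whatever is left over from the last, incomplete batch.
        if (pending > 0) {
            commit.run();
            commits++;
        }
        return commits;
    }
}
```

A smaller batch size lowers peak memory per transaction at the cost of more commits, so the right value is something to measure on your own data.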
Upvotes: 3
Reputation: 1463
I think the query @tstorms proposed will not work in a reasonable amount of time for 60 million nodes.
If you really want to do it, there are some improvements you can make to @tstorms's solution:
I personally wouldn't do it unless it's really necessary. On the performance side, I'd wait and see rather than optimize up front; for query simplification, you can use named paths in Cypher (http://docs.neo4j.org/chunked/milestone/query-match.html#match-named-path) or user-defined steps in Gremlin (https://github.com/tinkerpop/gremlin/wiki/User-Defined-Steps).
Upvotes: 0
Reputation: 5001
I'd probably do it right after the import. The following Cypher statement should do the trick:
START p=node(*)
MATCH p-[:reads]->book-[:writtenBy]->p2
CREATE p-[:isAFanOf]->p2
Upvotes: 2