mcsoini

Reputation: 6642

Neo4j data loading performance: driver vs custom procedure

I'm switching from Neo4j Java custom procedures to an approach based on the Neo4j Java driver. I want to end up with some sort of microservice running my graph algorithm, instead of calling the custom procedure through Cypher. I implemented the traversal using a bunch of standard HashMaps. Once the data is loaded from Neo4j into these HashMaps, the graph traversal is much faster than in my original custom procedure, so that's very promising.
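A minimal sketch of what a HashMap-based traversal of this kind might look like. The class name `HashMapGraph`, the adjacency layout, and the choice of a breadth-first search are illustrative assumptions, not the actual algorithm from the post:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: an in-memory graph held in plain HashMaps.
final class HashMapGraph {

    // Adjacency map: node id -> ids of the nodes its outgoing edges point to.
    private final Map<Long, List<Long>> outgoing = new HashMap<>();

    // Record a directed edge between two node ids.
    void addEdge(long from, long to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Breadth-first search: every node id reachable from start, including start.
    Set<Long> reachableFrom(long start) {
        Set<Long> visited = new HashSet<>();
        Deque<Long> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            long current = queue.poll();
            List<Long> neighbours =
                    outgoing.getOrDefault(current, Collections.<Long>emptyList());
            for (long next : neighbours) {
                // Set.add returns false for ids that were already visited.
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        HashMapGraph g = new HashMapGraph();
        g.addEdge(1L, 2L);
        g.addEdge(2L, 3L);
        g.addEdge(4L, 1L);
        System.out.println(g.reachableFrom(1L).size()); // prints 3
    }
}
```

Once the maps are filled, each step of the traversal is a hash lookup in memory rather than a call into the database, which is consistent with the speed-up described above.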

Now to my question: within the custom procedure I was able to load the graph (40 million edges, 10 million nodes) into the HashMaps like so:

@Context
public GraphDatabaseService db;
...

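HashMap<Long, Long> mapNodeIdProperty = new HashMap<>();
// try-with-resources closes the transaction once all nodes have been read
try ( Transaction tx = db.beginTx() ) {
    tx.getAllNodes().stream().forEach((org.neo4j.graphdb.Node n)
                  -> mapNodeIdProperty.put(n.getId(),
                         (Long) n.getProperty("combProp")));
}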

This takes about a minute, which I consider acceptable as the service's start-up time.

Now, the best solution I could find using the driver goes something like this:

driver = GraphDatabase.driver( uri, AuthTokens.basic( user, password ) );

...

try ( Session session = driver.session() )
{
    // The query only reads data, so a read transaction is sufficient.
    String status = session.readTransaction( new TransactionWork<String>()
    {
        @Override
        public String execute( Transaction tx )
        {
            // stream() already yields Record objects, so no casts are needed.
            tx.run( "MATCH (n) RETURN n" ).stream()
              .forEach( record -> listNodes.add( record.get( "n" ).asNode() ) );
            return "ok; length=" + listNodes.size();
        }
    });
    System.out.println( status );
}

This takes too much time, even if I limit the number of returned nodes to just a few thousand in the Cypher query. I never waited for the full graph to load.

What would be my best bet for getting the same speed using the driver (or alternative approaches) as with the custom procedure? Are there fundamental limitations that would prevent this?

Upvotes: 0

Views: 244

Answers (2)

mcsoini

Reputation: 6642

As InverseFalcon points out, transferring data over the Bolt protocol comes with speed limitations that procedures do not have. What I was looking for is the embedded DBMS described here. I now use it as follows:

public class GraphAccess implements AutoCloseable
{
    ...

    private final DatabaseManagementService managementService =
        new DatabaseManagementServiceBuilder( NEO4J_DATA_PATH )
            .setConfig( GraphDatabaseSettings.read_only, true )
            .setConfig( GraphDatabaseSettings.logs_directory, NEO4J_LOGS_DIRECTORY )
            .build();
    public final GraphDatabaseService db = managementService.database( NEO4J_DATABASE_NAME );

    public final HashMap<Integer, String> mapNodeProp = new HashMap<>();

    ...

    private void getNodeData() {
        try ( Transaction tx = db.beginTx() ) {
            tx.getAllNodes().forEach((Node n) -> {
                // Narrowing the long node id to int assumes ids stay below Integer.MAX_VALUE.
                final Integer nodeId = (int) n.getId();
                mapNodeProp.put(nodeId, (String) n.getProperty("name"));
            });
        }
    }
}

This has two advantages: high throughput, since it bypasses the driver and accesses the Neo4j database data directly, and the possibility of using it inside a Spring app, since it is not implemented as a custom procedure.

Upvotes: 0

InverseFalcon

Reputation: 30397

Keep in mind that procedure code executes on the server itself; it is effectively embedded with Neo4j.

Compare that to the need to transfer all nodes and their properties over the network. That's a lot of extra I/O that wasn't needed with procedures.

Upvotes: 1
