Liam

Reputation: 493

Neo4j Adding Multiple Nodes and Edges Efficiently

I have the below example.

Example

I was wondering what the best and quickest way is to add a list of nodes and edges in a single transaction. I use the standard C# Neo4j .NET packages but am open to Neo4jClient, as I've read that it's faster. Anything that supports .NET and 4.5, to be honest.

I have a list of about 60,000 FooA objects that need to be added to Neo4j, and it can take hours!

First, FooB objects hardly change, so I don't have to add them every day. The performance issue is with adding new FooA objects twice a day.

Each FooA object has two lists containing the relationships I need to add: RelA and RelB (see below).

public class FooA
{
  public long Id {get;set;} //UniqueConstraint
  public string Name {get;set;}
  public long Age {get;set;}
  public List<RelA> ListA {get;set;}
  public List<RelB> ListB {get;set;}
}

public class FooB
{
  public long Id {get;set;} //UniqueConstraint
  public string Prop {get;set;}
}

public class RelA
{
  public string Val1 {get;set;}
  public NodeTypeA Node {get;set;}
}

public class RelB
{
 public FooB Start {get;set;}
 public FooB End {get;set;}
 public string ValExample {get;set;} 

}

Currently, I check if Node 'A' exists by matching by Id. If it does then I completely skip and move onto the next item. If not, I create Node 'A' with its own properties. I then create the edges with their own unique properties.

That's quite a few transactions per item. Match node by Id -> add nodes -> add edges.

    foreach (var ntA in FooAList)
    {
        // First transaction.
        MATCH (n:FooA {Id: ntA.Id}) RETURN n

        if not exists
        {
            // Second transaction.
            CREATE (n:FooA {Id: 1234, Name: "Example", Age: toInteger(24)})

            // Multiple transactions.
            foreach (var a in ntA.ListA)
            {
                MATCH (n:FooA {Id: ntA.Id}), (n2:FooB {Id: a.Node.Id}) WITH n, n2 LIMIT 1
                CREATE (n)-[:RelA {Prop: a.Val1}]->(n2)
            }

            foreach (var b in ntA.ListB)
            {
                MATCH (n:FooB {Id: b.Start.Id}), (n2:FooB {Id: b.End.Id}) WITH n, n2 LIMIT 1
                CREATE (n)-[:RelA {Prop: b.ValExample}]->(n2)
            }
        }
    }

How would one go about adding a list of FooA's using for example Neo4jClient and UNWIND or any other way apart from CSV import.

Hope that makes sense, and thanks!

Upvotes: 1

Views: 969

Answers (1)

Charlotte Skardon

Reputation: 6270

The biggest problem is the nested lists, which force you into your foreach loops, so you end up executing a minimum of 4 queries per FooA. For 60,000 of them, well, that's a lot!

Quick Note RE: Indexing

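First and foremost, you need an index on the Id property of your FooA and FooB nodes; this will speed up your queries dramatically. A uniqueness constraint on Id (which your class comments suggest you have) is backed by an index, so that covers it too.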
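As a rough sketch of what that could look like in Cypher: the exact syntax depends on your Neo4j version, so treat these as assumptions to check against the version you run. Pick the older or the newer form, not both. The PROFILE line is just a way to confirm that a lookup by Id uses the index rather than scanning every node with the label.

```cypher
// Older syntax (Neo4j 3.x and early 4.x). A uniqueness constraint also creates a backing index.
CREATE CONSTRAINT ON (a:FooA) ASSERT a.Id IS UNIQUE;
CREATE CONSTRAINT ON (b:FooB) ASSERT b.Id IS UNIQUE;

// Newer syntax (Neo4j 4.4 and later). Use this instead of the block above on recent versions.
CREATE CONSTRAINT FOR (a:FooA) REQUIRE a.Id IS UNIQUE;
CREATE CONSTRAINT FOR (b:FooB) REQUIRE b.Id IS UNIQUE;

// Check the plan: you want an index seek here, not a NodeByLabelScan.
PROFILE MATCH (a:FooA {Id: 1234}) RETURN a;
```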

I've played a bit with this, and have it storing 60,000 FooA entries, and creating 96,000 RelB instances in about 12-15 seconds on my aging computer.

The Solution

I've split it into 2 sections - FooA and RelB:

FooA

I've had to normalise the FooA class into something I can use in Neo4jClient - so let's introduce that:

public class CypherableFooA
{
    public CypherableFooA(FooA fooA){
        Id = fooA.Id;
        Name = fooA.Name;
        Age = fooA.Age;
    }
    
    public long Id { get; set; }
    public string Name { get; set; }
    public long Age { get; set; }
    
    public string RelA_Val1 {get;set;}
    public long RelA_FooBId {get;set;}
}

I've added the RelA_Val1 and RelA_FooBId properties to be able to access them in the UNWIND. I convert your FooA using a helper method:

public static IList<CypherableFooA> ConvertToCypherable(FooA fooA){
    var output = new List<CypherableFooA>();

    foreach (var element in fooA.ListA)
    {
        var cfa = new CypherableFooA(fooA);
        cfa.RelA_FooBId = element.Node.Id;
        cfa.RelA_Val1 = element.Val1;
        output.Add(cfa);
    }
    
    return output;
}

This combined with:

var cypherable = fooAList.SelectMany(a => ConvertToCypherable(a)).ToList();

flattens the FooA instances, so I end up with one CypherableFooA for each item in the ListA property of a FooA. For example, if you had 2 items in ListA on every FooA and 5,000 FooA instances, cypherable would contain 10,000 items. Note that a FooA with an empty ListA produces no items, so it would not be written at all.
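To make the flattened shape concrete, here's a small standalone Cypher sketch you could paste into the Neo4j browser. The literal list stands in for the parameter that gets sent to the server, and all the values are made up for illustration. It shows one FooA (Id 1) with two ListA entries turning into two rows.

```cypher
// Two flattened rows for the same FooA (Id 1), one per ListA entry.
// All values are invented for illustration.
UNWIND [
  {Id: 1, Name: 'Example', Age: 24, RelA_Val1: 'first',  RelA_FooBId: 10},
  {Id: 1, Name: 'Example', Age: 24, RelA_Val1: 'second', RelA_FooBId: 11}
] AS faItem
RETURN faItem.Id AS fooAId, faItem.RelA_FooBId AS fooBId, faItem.RelA_Val1 AS val1;
```

Because the same FooA Id can appear several times in a batch, the query further down uses MERGE rather than CREATE for the FooA node.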

Now, with cypherable I call my AddFooAs method:

public static void AddFooAs(IGraphClient gc, IList<CypherableFooA> fooAs, int batchSize = 10000, int startPoint = 0)
{
    var batch = fooAs.Skip(startPoint).Take(batchSize).ToList();
    Console.WriteLine($"FOOA--> {startPoint} to {batchSize + startPoint} (of {fooAs.Count}) = {batch.Count}");
    
    if (batch.Count == 0)
        return;

    gc.Cypher
        .Unwind(batch, "faItem")
        .Merge("(fa:FooA {Id: faItem.Id})")
        .OnCreate().Set("fa = faItem")
        .Merge("(fb:FooB {Id: faItem.RelA_FooBId})")
        .Create("(fa)-[:RelA {Prop: faItem.RelA_Val1}]->(fb)")
        .ExecuteWithoutResults();
    
    AddFooAs(gc, fooAs, batchSize, startPoint + batch.Count);
}

This splits the work into batches of 10,000 (by default). It takes about 5-6 seconds on my machine, about the same as if I send all 60,000 in one go.
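If you'd rather stay on the official driver, the fluent query above should boil down to roughly the following Cypher. This is a sketch, not the exact text Neo4jClient generates; $batch is a parameter name I've picked for illustration and would hold the list of flattened items. I've set the properties explicitly here, because SET fa = faItem copies every key in the map onto the node, including RelA_Val1 and RelA_FooBId.

```cypher
// $batch (assumed parameter name): list of maps with Id, Name, Age, RelA_Val1, RelA_FooBId.
UNWIND $batch AS faItem
MERGE (fa:FooA {Id: faItem.Id})
  ON CREATE SET fa.Name = faItem.Name, fa.Age = faItem.Age
MERGE (fb:FooB {Id: faItem.RelA_FooBId})
CREATE (fa)-[:RelA {Prop: faItem.RelA_Val1}]->(fb);
```

Swapping the final CREATE for MERGE would stop a re-run from adding duplicate RelA relationships, at the cost of an extra lookup per row.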

RelB

You store RelB in your example with FooA, but the query you're writing doesn't use the FooA at all, so what I've done is extract and flatten all the RelB instances in the ListB property:

var relBs = fooAList.SelectMany(a => a.ListB).ToList();

Then I add them to Neo4j like so:

public static void AddRelBs(IGraphClient gc, IList<RelB> relbs, int batchSize = 10000, int startPoint = 0)
{
    var batch = relbs.Select(r => new { StartId = r.Start.Id, EndId = r.End.Id, r.ValExample }).Skip(startPoint).Take(batchSize).ToList();
    Console.WriteLine($"RELB--> {startPoint} to {batchSize + startPoint} (of {relbs.Count}) = {batch.Count}");
    if(batch.Count == 0)
        return;

    var query = gc.Cypher
        .Unwind(batch, "rbItem")
        .Match("(fb1:FooB {Id: rbItem.StartId}),(fb2:FooB {Id: rbItem.EndId})")
        .Create("(fb1)-[:RelA {Prop: rbItem.ValExample}]->(fb2)");

    query.ExecuteWithoutResults();
    AddRelBs(gc, relbs, batchSize, startPoint + batch.Count);
}

Again, batching defaulted to 10,000.
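One thing to watch with the MATCH in this query: if either FooB is missing, that row produces no match and no relationship is created, without any error. A quick way to spot such rows before running a batch might look like this (again a sketch; $batch is an assumed parameter name holding the StartId / EndId / ValExample items):

```cypher
// Return the batch rows whose start or end FooB does not exist yet.
UNWIND $batch AS rbItem
OPTIONAL MATCH (fb1:FooB {Id: rbItem.StartId})
OPTIONAL MATCH (fb2:FooB {Id: rbItem.EndId})
WITH rbItem, fb1, fb2
WHERE fb1 IS NULL OR fb2 IS NULL
RETURN rbItem.StartId AS startId, rbItem.EndId AS endId;
```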

Obviously the time will vary depending on the number of rels in ListA and ListB. My tests had one item in ListA and two in ListB.

Upvotes: 1
