Reputation: 493
I have the below example.
I was wondering what the best and quickest way is to add a list of nodes and edges in a single transaction. I use the standard C# Neo4j .NET packages but am open to Neo4jClient, as I've read that it's faster. Anything that supports .NET 4.5, to be honest.
I have a list of about 60,000 FooA objects that need to be added to Neo4j, and it can take hours!
Firstly, FooB objects hardly change, so I don't have to add them every day. The performance issue is with adding new FooA objects twice a day.
Each FooA object has two lists containing the relationships I need to add: RelA and RelB (see below).
public class FooA
{
    public long Id {get;set;} //UniqueConstraint
    public string Name {get;set;}
    public long Age {get;set;}
    public List<RelA> ListA {get;set;}
    public List<RelB> ListB {get;set;}
}
public class FooB
{
    public long Id {get;set;} //UniqueConstraint
    public string Prop {get;set;}
}
public class RelA
{
    public string Val1 {get;set;}
    public NodeTypeA Node {get;set;}
}
public class RelB
{
    public FooB Start {get;set;}
    public FooB End {get;set;}
    public string ValExample {get;set;}
}
Currently, I check if Node 'A' exists by matching by Id. If it does then I completely skip and move onto the next item. If not, I create Node 'A' with its own properties. I then create the edges with their own unique properties.
That's quite a few transactions per item. Match node by Id -> add nodes -> add edges.
foreach (var ntA in FooAList)
{
    // First transaction.
    MATCH (n:FooA {Id: ntA.Id})
    if not exists
    {
        // 2nd transaction
        CREATE (n:FooA {Id: 1234, Name: "Example", Age: toInteger(24)})
        // Multiple transactions.
        foreach (var a in ntA.ListA)
        {
            MATCH (n:FooA {Id: ntA.Id}), (n2:FooB {Id: a.Node.Id}) WITH n, n2 LIMIT 1
            CREATE (n)-[:RelA {Prop: a.Val1}]->(n2)
        }
        foreach (var b in ntA.ListB)
        {
            MATCH (n:FooB {Id: b.Start.Id}), (n2:FooB {Id: b.End.Id}) WITH n, n2 LIMIT 1
            CREATE (n)-[:RelA {Prop: b.ValExample}]->(n2)
        }
    }
}
How would one go about adding a list of FooAs using, for example, Neo4jClient and UNWIND, or any other way apart from CSV import?
Hope that makes sense, and thanks!
Upvotes: 1
Views: 969
Reputation: 6270
The biggest problem is the nested lists, which mean you have to do your foreach loops, so you end up executing a minimum of 4 queries per FooA, which for 60,000 - well - that's a lot!
First and foremost - you need an index on the Id property of your FooA and FooB nodes; this will speed up your queries dramatically.
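The question's comments already mark Id as a unique constraint on both classes, and a uniqueness constraint in Neo4j is backed by an index, so this may already be in place. If it is not, a sketch of the statements to run once might look like the following. The exact syntax depends on the Neo4j version, which the question does not state, so treat the version split below as an assumption to check against your server.

```cypher
// Neo4j 3.x / 4.x syntax
CREATE CONSTRAINT ON (a:FooA) ASSERT a.Id IS UNIQUE;
CREATE CONSTRAINT ON (b:FooB) ASSERT b.Id IS UNIQUE;

// Neo4j 4.4+ / 5.x syntax (the older form above was removed in 5.x)
// CREATE CONSTRAINT FOR (a:FooA) REQUIRE a.Id IS UNIQUE;
// CREATE CONSTRAINT FOR (b:FooB) REQUIRE b.Id IS UNIQUE;
```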
I've played a bit with this, and have it storing 60,000 FooA entries, and creating 96,000 RelB instances in about 12-15 seconds on my aging computer.
I've split it into 2 sections - FooA and RelB:
I've had to normalise the FooA class into something I can use in Neo4jClient - so let's introduce that:
public class CypherableFooA
{
    public CypherableFooA(FooA fooA)
    {
        Id = fooA.Id;
        Name = fooA.Name;
        Age = fooA.Age;
    }

    public long Id { get; set; }
    public string Name { get; set; }
    public long Age { get; set; }
    public string RelA_Val1 { get; set; }
    public long RelA_FooBId { get; set; }
}
I've added the RelA_Val1 and RelA_FooBId properties to be able to access them in the UNWIND. I convert your FooA using a helper method:
public static IList<CypherableFooA> ConvertToCypherable(FooA fooA)
{
    var output = new List<CypherableFooA>();
    foreach (var element in fooA.ListA)
    {
        var cfa = new CypherableFooA(fooA);
        cfa.RelA_FooBId = element.Node.Id;
        cfa.RelA_Val1 = element.Val1;
        output.Add(cfa);
    }
    return output;
}
This combined with:
var cypherable = fooAList.SelectMany(a => ConvertToCypherable(a)).ToList();
flattens the FooA instances, so I end up with 1 CypherableFooA for each item in the ListA property of a FooA. E.g. if you had 2 items in ListA on every FooA and you have 5,000 FooA instances, you would end up with cypherable containing 10,000 items.
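To make the row count concrete, here is a small standalone sketch of the same flattening with no Neo4j dependency. The classes are trimmed-down, hypothetical stand-ins for the ones in the question (RelA.Node is assumed to be a FooB), and an anonymous type stands in for CypherableFooA.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down stand-ins for the classes in the question.
public class FooB { public long Id { get; set; } }
public class RelA { public string Val1 { get; set; } public FooB Node { get; set; } }
public class FooA
{
    public long Id { get; set; }
    public List<RelA> ListA { get; set; }
}

public static class FlattenDemo
{
    public static void Main()
    {
        // 5,000 FooA instances, each with 2 RelA entries.
        var fooAList = Enumerable.Range(1, 5000)
            .Select(i => new FooA
            {
                Id = i,
                ListA = new List<RelA>
                {
                    new RelA { Val1 = "x", Node = new FooB { Id = 1 } },
                    new RelA { Val1 = "y", Node = new FooB { Id = 2 } }
                }
            })
            .ToList();

        // One flat row per (FooA, RelA) pair, mirroring ConvertToCypherable.
        var rows = fooAList
            .SelectMany(a => a.ListA.Select(r => new
            {
                a.Id,
                RelA_Val1 = r.Val1,
                RelA_FooBId = r.Node.Id
            }))
            .ToList();

        Console.WriteLine(rows.Count); // 10000
    }
}
```

One consequence worth knowing: a FooA whose ListA is empty produces no rows at all, so with this flattening it would never reach the UNWIND and would not be created. If that can happen in your data, those nodes would need a separate pass.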
Now, with cypherable I call my AddFooAs method:
public static void AddFooAs(IGraphClient gc, IList<CypherableFooA> fooAs, int batchSize = 10000, int startPoint = 0)
{
    // Take the next slice of rows; an empty slice means we are done.
    var batch = fooAs.Skip(startPoint).Take(batchSize).ToList();
    Console.WriteLine($"FOOA--> {startPoint} to {batchSize + startPoint} (of {fooAs.Count}) = {batch.Count}");
    if (batch.Count == 0)
        return;

    gc.Cypher
        .Unwind(batch, "faItem")
        .Merge("(fa:FooA {Id: faItem.Id})")
        // Copies every property of the row onto a newly created node,
        // including the RelA_* helper fields.
        .OnCreate().Set("fa = faItem")
        // MERGE creates the FooB if it does not exist yet.
        .Merge("(fb:FooB {Id: faItem.RelA_FooBId})")
        .Create("(fa)-[:RelA {Prop: faItem.RelA_Val1}]->(fb)")
        .ExecuteWithoutResults();

    // Recurse for the next batch.
    AddFooAs(gc, fooAs, batchSize, startPoint + batch.Count);
}
This splits the query into batches of 10,000 (by default). It takes about 5-6 seconds on my machine, about the same as if I try all 60,000 in one go.
You store RelB in your example with FooA, but the query you're writing doesn't use the FooA at all, so what I've done is extract and flatten all the RelB instances in the ListB property:
var relBs = fooAList.SelectMany(a => a.ListB).ToList();
Then I add them to Neo4j like so:
public static void AddRelBs(IGraphClient gc, IList<RelB> relbs, int batchSize = 10000, int startPoint = 0)
{
    // Project each RelB to a flat row of ids plus the property value, then take the next slice.
    var batch = relbs
        .Select(r => new { StartId = r.Start.Id, EndId = r.End.Id, r.ValExample })
        .Skip(startPoint)
        .Take(batchSize)
        .ToList();
    Console.WriteLine($"RELB--> {startPoint} to {batchSize + startPoint} (of {relbs.Count}) = {batch.Count}");
    if (batch.Count == 0)
        return;

    // MATCH (not MERGE): rows whose FooB nodes are missing are silently skipped.
    var query = gc.Cypher
        .Unwind(batch, "rbItem")
        .Match("(fb1:FooB {Id: rbItem.StartId}),(fb2:FooB {Id: rbItem.EndId})")
        .Create("(fb1)-[:RelA {Prop: rbItem.ValExample}]->(fb2)");
    query.ExecuteWithoutResults();

    // Recurse for the next batch.
    AddRelBs(gc, relbs, batchSize, startPoint + batch.Count);
}
Again, batching defaulted to 10,000.
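Both methods share the same Skip/Take-and-recurse pattern. If you would rather keep the slicing in one place, a loop-based helper is one option. This is only a sketch: InBatches is a hypothetical name, not part of Neo4jClient, and it is written with plain LINQ so it also works on older .NET Framework versions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Batching
{
    // Hypothetical helper: yields consecutive slices of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IList<T> items, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (var start = 0; start < items.Count; start += batchSize)
        {
            yield return items.Skip(start).Take(batchSize).ToList();
        }
    }

    public static void Main()
    {
        var items = Enumerable.Range(0, 25000).ToList();
        foreach (var batch in InBatches(items, 10000))
        {
            // In AddFooAs / AddRelBs, the Unwind(...).ExecuteWithoutResults()
            // call would go here, with batch as the UNWIND parameter.
            Console.WriteLine(batch.Count);
        }
        // Prints 10000, 10000, 5000 (one per line).
    }
}
```

With 60,000 rows and batches of 10,000 the recursion in the original methods is only a handful of calls deep, so this is a readability choice rather than a fix.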
Obviously the time will vary depending on the number of rels in ListB and ListA - my tests have one item in ListA and 2 in ListB.
Upvotes: 1