Reputation: 340
I've constructed a multithreaded program that reads an 80GB XML file and inserts entities into SQL Server using EF Core bulk extensions.
The XML is a list of graphs that I want to insert as fast as possible.
I need to insert the objects as a graph to maintain their relations; afterwards I intend to run a series of cleanups to remove duplicates.
My main problem right now is that
BulkInsert(entities, new BulkConfig { IncludeGraph = true })
isn't as fast as I want it to be.
The entities are created using model-first, where I've disabled the CascadeDelete option on the foreign key constraints.
Each navigation property is automatically generated and creates a nonclustered index (which I suppose is OK).
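For context, cascade delete on such a relationship is typically turned off with EF Core's fluent API roughly like this (a sketch with placeholder Parent/Child entity names, not the actual model):

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Placeholder Parent/Child entities; the real model has many such relationships.
    modelBuilder.Entity<Child>()
        .HasOne(c => c.Parent)
        .WithMany(p => p.Children)
        .HasForeignKey(c => c.ParentId)
        .OnDelete(DeleteBehavior.Restrict);   // cascade delete disabled on the FK
}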
Any advice on how to improve the insert speed?
Upvotes: 0
Views: 2784
Reputation: 340
After reading the source code of EFCore.BulkExtensions, it's pretty obvious that the way it inserts a whole graph of objects is by first bulk inserting all the parents, then retrieving their IDs, updating the first-level children's parent IDs, then bulk inserting those, and so on down to the final leaf nodes.
This is suboptimal, since it takes several round trips to complete a moderately nested graph.
The only way I can imagine making this faster is to do an identity insert: manually create the identity values outside of SQL, then traverse the graph and update all child elements with their parent ID.
THEN bulk insert using identity insert, without any round trips.
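A minimal sketch of that idea, assuming a simple Parent/Child shape and that identity values are kept via BulkConfig.SqlBulkCopyOptions (names here are placeholders, not the actual model, and the exact option depends on the EFCore.BulkExtensions version):

// Assign identity values client-side, copy them onto the children,
// then bulk insert each level with identity insert - no ID round trips.
var nextId = 1;
foreach (var parent in parents)
{
    parent.Id = nextId++;                  // pre-assigned identity value
    foreach (var child in parent.Children)
        child.ParentId = parent.Id;        // FK set without reading it back from SQL
}
var keepIdentity = new BulkConfig { SqlBulkCopyOptions = SqlBulkCopyOptions.KeepIdentity };
db.BulkInsert(parents.ToList(), keepIdentity);
db.BulkInsert(parents.SelectMany(p => p.Children).ToList(), keepIdentity);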
UPDATE:
I made it all work quite nicely. It inserts about 30,000 rows per second (50-60 MB/s SQL write speed) and finishes the 80GB XML file in about 45 minutes, which is quite acceptable compared to the previous 12+ hour process.
I've made a series of threads that each perform a single task:
Thread 1. Read from the zipped XML archive into a ConcurrentQueue
Thread 2. Scan the queue for start/stop elements, merging each complete element with a StringBuilder into a ConcurrentQueue
Thread 3. Remove the namespace from the XML entity and add it to a ConcurrentQueue
Thread 4. Deserialize the XML into an object and add it to a ConcurrentQueue
Thread 5. Batch the objects into lists of 1000 and add them to a ConcurrentQueue<List>
Thread 6. Use GraphUtil.GetTopologicallySortedGraph from EFCore.BulkExtensions to split the objects into lists per entity type and insert them into the DB using BulkInsert (KeepIdentity = true):
// Topologically sort the graph so parents come before their children, then group by entity type.
var grp = GraphUtil.GetTopologicallySortedGraph(db, list).GroupBy(g => g.Entity.GetType());
foreach (var typecollection in grp)
{
    var entityClrType = typecollection.Key;
    var sublist = typecollection.Select(s => s.Entity).ToList();
    ObjectCount += sublist.Count;
    // Bulk insert one entity type at a time, keeping the client-generated identity values.
    db.BulkInsert(sublist, config, type: entityClrType);
}
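The config in the snippet above is along these lines (a sketch; exactly how KeepIdentity is expressed depends on the EFCore.BulkExtensions version):

var config = new BulkConfig
{
    // Keep the client-generated identity values instead of letting SQL Server assign new ones.
    SqlBulkCopyOptions = SqlBulkCopyOptions.KeepIdentity,
    PreserveInsertOrder = true
};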
Note: When deserializing the XML, the entity IDs are generated from a static counter:
private static int _EntityID;

public Entity()
{
    // Not thread-safe: only one deserializer thread may construct entities (see below).
    EntityID = ++_EntityID;
}
Note: I'm using EF Core code-first mechanics, which automatically handle all navigation properties and their relation IDs. Very important: updating the child's parent ID automatically ensures the parent relation.
Note: I'm running one deserializer thread (I can't run more than one because of the static identity counters) and six BulkInserter threads.
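If more than one deserializer were ever needed, the counter could be made thread-safe with Interlocked, e.g. (a sketch, not part of the current setup):

// requires using System.Threading;
private static int _EntityID;

public Entity()
{
    // Atomic increment: safe even with several deserializer threads.
    EntityID = Interlocked.Increment(ref _EntityID);
}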
Upvotes: 1