Nick Reeve

Reputation: 1680

Entity Framework Bulk Insert/Update with Relationships

I have a scenario where I need to perform DB updates for a large amount of data. There are foreign key relationships that need to be added at the same time, so I fetch a list of the foreign objects up front so I don't have to hit the database on every row to check whether they exist, add them, etc.:

using(DbEntities db = new DbEntities())
{
   // Get list of all books so don't have to hit every time
   Dictionary<int, Book> books = db.Books.ToDictionary(k => k.BookId, v => v);

   // Loop through big file to import each row
   foreach(var item in bigFile)
   {
      Person person = new Person { FirstName = item.FirstName, LastName = item.LastName };

      foreach(var book in item.Books)
      {
         if(!books.ContainsKey(book.BookId))
         {
            // Add book to DB if doesn't exist
            Book bookToAdd = new Book { BookId = book.BookId, Name = book.Name };
            db.Books.Add(bookToAdd);

            books.Add(bookToAdd.BookId, bookToAdd);
         }

         person.Books.Add(books[book.BookId]);
      }

      db.People.Add(person);
   }

   db.SaveChanges();
}

The problem with this solution is that the import starts fast but slows down as it goes, eventually becoming really slow. This seems to be down to the context getting bloated with change tracking.

I have seen posts suggesting using db.Configuration.AutoDetectChangesEnabled = false but when I do that the relationships do not get added. I can make that work by forcing DetectChanges() but that seems to defeat the purpose as I will have to do it every iteration of the loop.
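
For reference, that attempt looked roughly like the sketch below (not my exact code; db.Configuration.AutoDetectChangesEnabled and db.ChangeTracker.DetectChanges() are the standard EF6 members, the rest mirrors the loop above):

using(DbEntities db = new DbEntities())
{
   // Skip automatic DetectChanges calls to speed up Add()
   db.Configuration.AutoDetectChangesEnabled = false;

   Dictionary<int, Book> books = db.Books.ToDictionary(k => k.BookId, v => v);

   foreach(var item in bigFile)
   {
      Person person = new Person { FirstName = item.FirstName, LastName = item.LastName };

      // ... same inner book loop as above ...

      db.People.Add(person);

      // The relationships only get fixed up if I force this here,
      // which defeats the point of turning auto-detection off
      db.ChangeTracker.DetectChanges();
   }

   db.SaveChanges();
}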

So, I moved the DB context inside the loop so it is recreated each time. Doing this means I can no longer keep the detached list of books, so I have to make an .Any() and a .Single() call to the DB for each row (I don't know if that is a major performance issue, but I always try to hit the DB as infrequently as possible):

// Loop through big file to import each row
foreach(var item in bigFile)
{
   // Create DB context each time
   using(DbEntities db = new DbEntities())
   {
      Person person = new Person { FirstName = item.FirstName, LastName = item.LastName };

      foreach(var book in item.Books)
      {
         if(!db.Books.Any(m => m.BookId = book.BookId))
         {
            // Add book to DB if doesn't exist
            Book bookToAdd = new Book { BookId = book.BookId, Name = book.Name };

            db.Books.Add(bookToAdd);
         }

         person.Books.Add(db.Books.Single(m => m.BookId = book.BookId));
      }

      db.People.Add(person);

      db.SaveChanges();
   }
}

This speeds it up considerably, but it still starts to slow down after around 5,000-10,000 rows, and I was wondering what my options are? ...other than just doing it all with stored procedures!

Upvotes: 2

Views: 2123

Answers (3)

Ivan Stoev

Reputation: 205589

IMO neither solution is good. The first loads the whole existing Books table into memory (and into the db context), and the second performs two db queries per person's book - one with Any and one with Single.

Since my tests do not show performance issues with context change tracking, I would use a variation of the first approach with elements from the second. Instead of loading the whole Books table, I would use a local dictionary populated on demand, with a single db query for each new book Id:

using (DbEntities db = new DbEntities())
{
    // The local book dictionary
    Dictionary<int, Book> books = new Dictionary<int, Book>();

    // Loop through big file to import each row
    foreach (var item in bigFile)
    {
        Person person = new Person { FirstName = item.FirstName, LastName = item.LastName };

        foreach (var itemBook in item.Books)
        {
            Book book;

            // Try get from local dictionary
            if (!books.TryGetValue(itemBook.BookId, out book))
            {
                // Try get from db
                book = db.Books.FirstOrDefault(e => e.BookId == itemBook.BookId);
                if (book == null)
                {
                    // Add book to DB if doesn't exist
                    book = new Book { BookId = itemBook.BookId, Name = itemBook.Name };
                    db.Books.Add(book);
                }
                // add to local dictionary
                books.Add(book.BookId, book);
            }

            person.Books.Add(book);
        }

        db.People.Add(person);
    }

    db.SaveChanges();
}

Upvotes: 1

Jonathan Magnan

Reputation: 11337

db.Books.Any

I don't know if that is a major performance issue but I always try to hit the DB as infrequently as possible

Yes, this is a major performance issue. For every book, you make a database round-trip which is quite inefficient.

Proposed Solution

(Yes, this is the same solution as your first example)

Make one database round-trip and use a dictionary instead.

// One database round-trip to load all existing books
var books = db.Books.ToDictionary(k => k.BookId, v => v);

// ...code...

if(!books.ContainsKey(book.BookId))
{
}

Add + AutoDetectChangesEnabled = false vs AddRange

I have seen posts suggesting using db.Configuration.AutoDetectChangesEnabled = false but when I do that the relationships do not get added.

Disabling AutoDetectChanges lets Add reach roughly the same performance as AddRange. However, if the relationships are not added when it is disabled, that is definitely an issue!

Proposed Solution

Use AddRange instead

using(DbEntities db = new DbEntities())
{
   var listToAdd = new List<Book>();
   var personToAdd = new List<Person>();

   // Get list of all books so don't have to hit every time
   Dictionary<int, Book> books = db.Books.ToDictionary(k => k.BookId, v => v);

   // Loop through big file to import each row
   foreach(var item in bigFile)
   {
      Person person = new Person { FirstName = item.FirstName, LastName = item.LastName };

      foreach(var book in item.Books)
      {
         if(!books.ContainsKey(book.BookId))
         {
            // Add book to DB if doesn't exist
            Book bookToAdd = new Book { BookId = book.BookId, Name = book.Name };

            // ADD to list instead
            listToAdd.Add(bookToAdd);

            // Also add to the local dictionary so the lookup below still works
            books.Add(bookToAdd.BookId, bookToAdd);
         }

         person.Books.Add(books[book.BookId]);
      }

      // ADD to list instead
      personToAdd.Add(person);
   }

   // USE AddRange here instead
   db.Books.AddRange(listToAdd);
   db.People.AddRange(personToAdd);

   db.SaveChanges();
}

SaveChanges

For every book and for every person that you need to add or update, a database round-trip is performed.

So if you need to insert 10,000 books, 10,000 database round-trips will be performed, which is INSANELY SLOW.

Proposed Solution

Use a library that allows you to perform bulk operations.


Disclaimer: I'm the owner of the project Entity Framework Extensions

This library allows you to perform all bulk operations:

  • BulkSaveChanges
  • BulkInsert
  • BulkUpdate
  • BulkDelete
  • BulkMerge
  • BulkSynchronize

Example:

// Easy to use
context.BulkSaveChanges();

// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);

// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);

// Customize Primary Key
context.BulkMerge(customers, operation => {
   operation.ColumnPrimaryKeyExpression = 
        customer => customer.Code;
});

EDIT: Answer sub-question

I see the benefit of AddRange but what if I was to extend that code example to not only add new Person entities but also update if they already exist?

You can use the same solution as you currently do with books, if everything can be loaded in memory.

var people = db.People.ToDictionary(k => k.PersonId, v => v);

Then simply add new people or edit the existing tracked entities.
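
Something like the following rough sketch (it assumes the import rows carry a PersonId to match on - that field is made up here; use whatever natural key your file actually has):

// item.PersonId is assumed to exist in the import file
var people = db.People.ToDictionary(k => k.PersonId, v => v);

foreach(var item in bigFile)
{
   Person person;

   // Update the already-tracked entity if it exists, otherwise add a new one
   if(people.TryGetValue(item.PersonId, out person))
   {
      person.FirstName = item.FirstName;
      person.LastName = item.LastName;
   }
   else
   {
      person = new Person { PersonId = item.PersonId, FirstName = item.FirstName, LastName = item.LastName };
      db.People.Add(person);
      people.Add(person.PersonId, person);
   }

   // ...attach books as in the previous example...
}

db.SaveChanges();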

Keep in mind that if you don't have any concurrency check, you may overwrite modified values if the import takes a long time.
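
If you need such a check, a rowversion concurrency token is the usual approach in EF6 - a minimal sketch (the RowVersion property is an addition for illustration, it is not part of the original model):

public class Person
{
   public int PersonId { get; set; }
   public string FirstName { get; set; }
   public string LastName { get; set; }

   // Assumed addition: SQL Server rowversion column used as a concurrency token
   [Timestamp]
   public byte[] RowVersion { get; set; }

   public virtual ICollection<Book> Books { get; set; }
}

// A stale RowVersion then surfaces during SaveChanges as a
// DbUpdateConcurrencyException, which you can catch to reload or skip the row.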

Be careful of a common pitfall:

  • Some people will recommend the AddOrUpdate method, which makes a database round-trip every time and so increases the import time.

Another technique could be using the BulkMerge method from my library.
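
Applied to this import, that could look something like this (a sketch only; it assumes the library is installed and that peopleToImport is the list of Person entities built from the file):

// Upsert the whole list in one bulk operation, matching on PersonId
db.BulkMerge(peopleToImport, operation => {
   operation.ColumnPrimaryKeyExpression =
        person => person.PersonId;
});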

Upvotes: 0

bower

Reputation: 166

It sounds like you have a memory leak. I've previously used PerfView to compare what objects are in memory at different times. At a guess, I'd say your context classes aren't being disposed of (i.e. they are being retained for some reason). If you have some experience with performance tools like these then that would be a good place to start, but there is a steep learning curve if you haven't used them before.

Personally, I would use a single stored procedure and one or more table-valued parameters to do large data imports like yours. In my experience they are considerably faster.
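
Very roughly, the idea looks like this (a sketch only - the table type dbo.BookImportType, the procedure dbo.ImportBooks, the column names and connectionString are all placeholders for illustration):

// SQL side (one-off setup, illustrative names):
//   CREATE TYPE dbo.BookImportType AS TABLE (BookId INT, Name NVARCHAR(200));
//   CREATE PROCEDURE dbo.ImportBooks @Books dbo.BookImportType READONLY AS ...

// C# side (System.Data / System.Data.SqlClient): build a DataTable matching
// the table type and pass it as a structured parameter, so the whole batch
// goes over in a single round-trip.
var table = new DataTable();
table.Columns.Add("BookId", typeof(int));
table.Columns.Add("Name", typeof(string));

foreach (var item in bigFile)
   foreach (var book in item.Books)
      table.Rows.Add(book.BookId, book.Name);

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.ImportBooks", connection))
{
   command.CommandType = CommandType.StoredProcedure;

   var parameter = command.Parameters.AddWithValue("@Books", table);
   parameter.SqlDbType = SqlDbType.Structured;
   parameter.TypeName = "dbo.BookImportType";

   connection.Open();
   command.ExecuteNonQuery();
}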

*edit

Just noticed some mistakes in your code... you're missing some comparison operators:

// Loop through big file to import each row
foreach(var item in bigFile)
{
   // Create DB context each time
   using(DbEntities db = new DbEntities())
   {
      Person person = new Person { FirstName = item.FirstName, LastName = item.LastName };

      foreach(var book in item.Books)
      {
         if(!db.Books.Any(m => m.BookId == book.BookId))
         {
            // Add book to DB if doesn't exist
            Book bookToAdd = new Book { BookId = book.BookId, Name = book.Name };

            db.Books.Add(bookToAdd);
         }

         person.Books.Add(db.Books.Single(m => m.BookId == book.BookId));
      }

      db.People.Add(person);

      db.SaveChanges();
   }
}

Upvotes: 0
