Hubro

Reputation: 59368

How do I find out how many Mongo documents were actually inserted?

I have a function that looks like this:

def insert_multiple_cakes(cake_list)
  ensure_indexes

  insert_list = cake_list.map { |cake| mongofy_values(cake.to_hash) }

  inserted = db[CAKE_COLLECTION].insert(insert_list, w: 0)

  return inserted.length
end

The goal of the function is to insert all cakes from cake_list into the Mongo database. Any cake that already exists in the database should be ignored. The function should return the number of cakes inserted, so if cake_list contains 5 cakes and 2 of those cakes already exist in the database, the function should return 3.
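For completeness, the deduplication relies on a unique index. Roughly, ensure_indexes does something like this (a sketch, assuming the 1.x Ruby driver's ensure_index API and that cakes are unique by name):

# Hypothetical sketch: enforce uniqueness by cake name so that
# duplicate inserts are rejected by the server.
def ensure_indexes
  db[CAKE_COLLECTION].ensure_index({ "name" => 1 }, unique: true)
end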

My problem is that after an hour of experimenting, I have concluded that the insert call gives me no way of finding out how many documents were actually inserted.

So it seems to me that the only way to achieve this is to iterate over the input list, perform a search for EVERY document and filter away those that already exist. My application is going to deal with (at least) thousands of inserts at a time, so I like this plan about as much as I'd like to jump off a bridge.

Have I misunderstood something, or is the Ruby client perhaps bugged?


To demonstrate, this function does exactly what I want and works:

def insert_multiple_cakes(cake_list)
  ensure_indexes

  collection = db[CAKE_COLLECTION]

  # Filters away any cakes that already exist in the database.
  filtered_list = cake_list.reject { |cake|
    collection.count(query: {"name" => cake.name}) == 1
  }

  insert_list = filtered_list.map { |cake| mongofy_values(cake.to_hash) }

  inserted = collection.insert(insert_list)

  return inserted.length
end

The problem is that it performs one count query for every single cake (a gazillion searches, give or take) where it should really only have to do one insert.


Documentation for Mongo::Collection#insert

Upvotes: 2

Views: 276

Answers (2)

Randall Hunt

Reputation: 12582

You can do something like this (source):

coll = MongoClient.new.db('test').collection('cakes')

bulk = coll.initialize_unordered_bulk_op
bulk.insert({'_id' => "strawberry"})
bulk.insert({'_id' => "strawberry"}) # duplicate key
bulk.insert({'_id' => "chocolate"})
bulk.insert({'_id' => "chocolate"}) # duplicate key

begin
  bulk.execute({:w => 1}) # :w => 1 is the default, but don't change it to 0 or you won't get the errors
rescue => ex
  p ex
  p ex.result
end

ex.result contains nInserted along with the reason each failing insert was rejected:

{"ok"=>1,
 "n"=>2,
 "code"=>65,
 "errmsg"=>"batch item errors occurred",
 "nInserted"=>2,
 "writeErrors"=>
  [{"index"=>1,
    "code"=>11000,
    "errmsg"=>
     "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.cakes.$_id_  dup key: { : \"strawberry\" }"},
   {"index"=>3,
    "code"=>11000,
    "errmsg"=>
     "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.cakes.$_id_  dup key: { : \"chocolate\" }"}]}

Upvotes: 5

Hubro

Reputation: 59368

Bulk operations were the way to go. I'm accepting ranman's answer, but I thought I should share my final code:

def insert_documents(collection_name, documents)
  collection = db[collection_name]

  bulk = collection.initialize_unordered_bulk_op
  inserts = 0

  documents.each { |doc|
    bulk.insert doc

    inserts += 1
  }

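  # execute raises Mongo::BulkWriteError if any of the inserts failed
  # (e.g. duplicate keys); its result still reports how many documents
  # actually made it in.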
  begin
    bulk.execute
  rescue Mongo::BulkWriteError => e
    inserts = e.result["nInserted"]
  end

  return inserts
end

def insert_cakes(cakes)
  ensure_cake_indexes

  doc_list = cakes.map { |cake|
    mongofy_values(cake.to_hash)
  }

  return insert_documents(CAKE_COLLECTION, doc_list)
end
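
For example (hypothetical data: two of the five cakes already exist in the database):

insert_cakes(five_cakes) # => 3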

Upvotes: 1
