Reputation: 59368
I have a function that looks like this:
def insert_multiple_cakes(cake_list)
  ensure_indexes
  insert_list = cake_list.map { |cake| mongofy_values(cake.to_hash) }
  inserted = db[CAKE_COLLECTION].insert(insert_list, w: 0)
  return inserted.length
end
The goal of the function is to insert all cakes from cake_list into the Mongo database. Any cake that already exists in the database should be ignored. The function should return the number of cakes inserted, so if cake_list contains 5 cakes and 2 of those cakes already exist in the database, the function should return 3.
My problem is that after an hour of experimenting, I have concluded the following:
If the write concern (the :w option) is 0, then the insert call silently ignores all duplicate inserts, and the return value contains all the input documents, even those that weren't inserted. No matter what I set :continue_on_error or :collect_on_error to, the return value always contains all the documents, and the list of collected errors is always empty.
If the write concern is 1, then the insert call fails with a Mongo::OperationFailure if there are any duplicates among the input documents. No matter what I set :continue_on_error or :collect_on_error to, the insert always fails when there are duplicates.
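Roughly, the two variants I tried look like this (just a sketch; db, CAKE_COLLECTION and mongofy_values are the same helpers as in the function above):
# Variant 1: w: 0 -- duplicates are silently dropped, but the return value still lists every document.
insert_list = cake_list.map { |cake| mongofy_values(cake.to_hash) }
db[CAKE_COLLECTION].insert(insert_list, w: 0, continue_on_error: true, collect_on_error: true)

# Variant 2: w: 1 -- raises Mongo::OperationFailure as soon as the batch contains a duplicate,
# regardless of :continue_on_error / :collect_on_error.
begin
  db[CAKE_COLLECTION].insert(insert_list, w: 1, continue_on_error: true, collect_on_error: true)
rescue Mongo::OperationFailure => e
  # No way (that I can find) to get the number of successful inserts from here.
end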
So it seems to me that the only way to achieve this is to iterate over the input list, perform a search for EVERY document and filter away those that already exist. My application is going to deal with (at least) thousands of inserts at a time, so I like this plan about as much as I'd like to jump off a bridge.
Have I misunderstood something, or is the Ruby client perhaps bugged?
To demonstrate, this function does exactly what I want and works:
def insert_multiple_cakes(cake_list)
  ensure_indexes
  collection = db[CAKE_COLLECTION]
  # Filter away any cakes that already exist in the database.
  filtered_list = cake_list.reject { |cake|
    collection.count(query: {"name" => cake.name}) == 1
  }
  insert_list = filtered_list.map { |cake| mongofy_values(cake.to_hash) }
  inserted = collection.insert(insert_list)
  return inserted.length
end
The problem is that it performs about a gazillion searches when it should really only have to do a single insert.
Documentation for Mongo::Collection#insert
Upvotes: 2
Views: 276
Reputation: 12582
You can do something like this (source):
coll = MongoClient.new().db('test').collection('cakes')
bulk = coll.initialize_unordered_bulk_op
bulk.insert({'_id' => "strawberry"})
bulk.insert({'_id' => "strawberry"}) # duplicate key
bulk.insert({'_id' => "chocolate"})
bulk.insert({'_id' => "chocolate"}) # duplicate key
begin
  bulk.execute({:w => 1}) # this is the default, but don't change it to 0 or you won't get the errors
rescue => ex
  p ex
  p ex.result
end
ex.result contains nInserted and, for each insert that failed, the reason why:
{"ok"=>1,
"n"=>2,
"code"=>65,
"errmsg"=>"batch item errors occurred",
"nInserted"=>2,
"writeErrors"=>
[{"index"=>1,
"code"=>11000,
"errmsg"=>
"insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.cakes.$_id_ dup key: { : \"strawberry\" }"},
{"index"=>3,
"code"=>11000,
"errmsg"=>
"insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.cakes.$_id_ dup key: { : \"chocolate\" }"}]}
Upvotes: 5
Reputation: 59368
Bulk operations were the way to go. I'm accepting ranman's answer, but I thought I should share my final code:
def insert_documents(collection_name, documents)
  collection = db[collection_name]
  bulk = collection.initialize_unordered_bulk_op
  inserts = 0
  documents.each { |doc|
    bulk.insert doc
    inserts += 1
  }
  begin
    bulk.execute
  rescue Mongo::BulkWriteError => e
    # Some inserts failed (e.g. duplicates); the result tells us how many actually went in.
    inserts = e.result["nInserted"]
  end
  return inserts
end
def insert_cakes(cakes)
  ensure_cake_indexes
  doc_list = cakes.map { |cake|
    mongofy_values(cake.to_hash)
  }
  return insert_documents(CAKE_COLLECTION, doc_list)
end
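Usage is then just (sketch; cakes is an array of Cake objects as in the question):
count = insert_cakes(cakes)
puts "Inserted #{count} new cakes"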
Upvotes: 1