daveharris
daveharris

Reputation: 395

Query Mongo Embedded Documents with a size

I have a ruby on rails app using Mongoid and MongoDB v2.4.6.

I have the following MongoDB structure, a record which embeds_many fragments:

{
  "_id" : "76561198045636214",
  "fragments" : [
    {
        "id" : 76561198045636215,
        "source_id" : "source1"
    },
    {
        "id" : 76561198045636216,
        "source_id" : "source2"
    },
    {
        "id" : 76561198045636217,
        "source_id" : "source2"
    }
  ]
}

I am trying to find all records in the database that contain fragments with duplicate source_ids.

I'm pretty sure I need to use $elemMatch as I need to query embedded documents.

I have tried

Record.elem_match(fragments: {source_id: 'source2'})

which works but doesn't restrict to duplicates.

I then tried

Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})

which returns no results (but is a valid query). The query Mongoid produces is:

selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}

Once that works I need to update it to $size is >1.

Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)

Any help is much appreciated!

I have been able to achieve desired outcome but in testing it's far too slow (will take many weeks to run across our production system). The problem is double query per record (we have ~30 million records in production).

Record.where('fragments.source_id' => 'source2').each do |record|
  query = record.fragments.where(source_id: 'source2')
  if query.count > 1
    # contains duplicates, delete all but latest
    query.desc(:updated_at).skip(1).delete_all
  end
  # needed to trigger after_save filters
  record.save!
end

Upvotes: 3

Views: 306

Answers (1)

Neil Lunn
Neil Lunn

Reputation: 151122

The problem with the current approach in here is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way. This is essentially what you need in order to "find the duplicates" within your documents here.

For this, MongoDB provides the aggregation framework as probably the best approach to finding this. There is no direct "mongoid" style approach to the queries as those are geared towards the existing "rails" style of dealing with relational documents.

You can access the "moped" form though through the .collection accessor on your class model:

Record.collection.aggregate([

    # Find arrays two elements or more as possibles
    { "$match" => {
        "$and" => [
            { "fragments" => { "$not" => { "$size" => 0 } } },
            { "fragments" => { "$not" => { "$size" => 1 } } }
        ]
    }},

    # Unwind the arrays to "de-normalize" as documents
    { "$unwind" => "$fragments" },

    # Group back and get counts of the "key" values
    { "$group" => {
        "_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
        "fragments" => { "$push" => "$fragments.id" },
        "count" => { "$sum" => 1 }
    }},

    # Match the keys found more than once
    { "$match" => { "count" => { "$gte" => 2 } } }
])

That would return you results like this:

{
    "_id" : { "_id": "76561198045636214", "source_id": "source2" },
    "fragments": ["76561198045636216","76561198045636217"],
    "count": 2
}

That at least gives you something to work with on how to deal with the "duplicates" here

Upvotes: 1

Related Questions