user3467349

Reputation: 3191

MongoDB Collection with Non-repeated field value

So I'm rather new to MongoDB. Here is an imaginary database with the following format.

{
    "_id": "message_id", 
    "headers": {
        "from": <from_email>,
        "to": <to_email>, 
        "timestamp": <timestamp>
    },   
    "message": {
        "message": <the message contents>, 
        "signature": <signature contents>
    }
} 

Suppose all emails received are inserted into it, and sometimes emails are double-sent. How can one return a collection of emails from an author without any double sends?

I thought this might do it but it doesn't seem to work as expected:

db.mycoll.find({"headers.from": <authorname>}).distinct("message.message")

Edit: Please excuse me, it seems I had been making some kind of typo; the above query works, but it only returns message.message values without the headers. How would I keep the headers intact as well?

Upvotes: 1

Views: 631

Answers (5)

user3467349

Reputation: 3191

Building on Neil Lunn's answer: I think one can do

db.collection.aggregate([{"$match": {"headers.from": <from email>} } ,
{"$group": { "_id": "$message.message"}, 
"headers": {"$first": "$headers"}, 
"signature": {"$first": "$message.signature"},
 "message_id": "$_id" }}, 
{"$project" : { "_id": "$message_id", 
"headers": "$headers", 
"message": { "message": "$_id", "signature": "$signature" } } }]) 

Since the group _id must be unique, duplicate messages will not make it into the result, and $project then restructures each document back to the original shape with the correct key names.

I guess I only have one question in this regard: is there a way to force uniqueness without grouping on _id, or is this generally considered the correct way to do it in MongoDB?

Upvotes: 0

Neil Lunn

Reputation: 151170

It is hard to determine from your question which part is the "duplicate" and therefore should be unique. It stands to reason, though, that things such as the message "_id" and "timestamp" are not going to duplicate, so this really only leaves the message content, with the possible additional paranoia of that message being "from" the same person.

Document reshaping is generally best handled by the aggregation framework:

db.collection.aggregate([
    { "$group": {
        "_id": { "message": "$message.message", "from": "$headers.from" },
        "message_id": { "$first": "$_id" },
        "headers": { "$first": "$headers" },
        "message": { "$first": "$message" }
    }},
    { "$project": {
        "_id": "$message_id",
        "headers": 1,
        "message": 1
    }}
])

The $group filters out any duplicated message content, with the $first operations selecting only the "first" found value for each field on the document grouping boundary.

There is an assumption in here that the existing order is by "timestamp" but if not then you might want to apply a $sort as the first pipeline stage before the others:

{ "$sort": { "headers.timestamp": 1 } }

The final $project really just restores the original document form and removes the "grouping key" that was supplied earlier. Just prettier than duplicating information and/or putting things out of place.

Upvotes: 1

Barno

Reputation: 3341

db.mycoll.aggregate([
    // keep only messages from the given author
    { $match: { "headers.from": <authorname> } },
    // collect the set of distinct message bodies for that author
    // (note: this returns one document per author; the headers are not preserved)
    { $group: {
        _id: "$headers.from",
        "message": { $addToSet: "$message.message" }
    }}
])

Upvotes: 0

user926321

Reputation: 400

What you're looking for is not currently implemented (at least as far as I know). One workaround would be this

Upvotes: 0

manu2013

Reputation: 1235

You could use distinct() to return an array of distinct messages from a specific author as follows:

db.collection.distinct('message.message', {"headers.from": <authorname>})

Upvotes: 0
