Reputation: 3191
I'm rather new to MongoDB. Here is an imaginary collection whose documents have the following format:
{
"_id": "message_id",
"headers": {
"from": <from_email>,
"to": <to_email>,
"timestamp": <timestamp>
},
"message": {
"message": <the message contents>,
"signature": <signature contents>
}
}
Suppose all emails received are inserted into it, and sometimes emails are sent twice. How can one return a collection of emails from an author without any of the double sends?
I thought this might do it but it doesn't seem to work as expected:
db.mycoll.find({"headers.from": <authorname>}).distinct("message.message")
Edit: Please excuse me, it seems I had made some kind of typo. The above query works, but it only returns the message.message values without the headers. How would I keep the headers intact as well?
Upvotes: 1
Views: 631
Reputation: 3191
Building on Neil Lunn's answer above: I think one can do
db.collection.aggregate([{"$match": {"headers.from": <from email>} },
{"$group": { "_id": "$message.message",
"headers": {"$first": "$headers"},
"signature": {"$first": "$message.signature"},
"message_id": {"$first": "$_id"} }},
{"$project" : { "_id": "$message_id",
"headers": "$headers",
"message": { "message": "$_id", "signature": "$signature" } } }])
Since _id must be unique within the $group output, duplicate messages will not make the list, and then $project will restructure each result to the original object structure with the correct key names.
I only have one question in this regard: is there a way to force uniqueness without aggregating into _id, or is this generally considered the correct way to do it in MongoDB?
Upvotes: 0
Reputation: 151170
It's hard to determine from your question which part is the "duplicate" and therefore should be unique. It stands to reason, though, that things such as the message "_id" and "timestamp" are not going to be duplicated, so this really only leaves the message content, with the possible additional precaution of requiring that the message is "from" the same person.
Document reshaping is generally best handled by the aggregation framework:
db.collection.aggregate([
{ "$group": {
"_id": { "message": "$message.message", "from": "$headers.from" },
"message_id": { "$first": "$_id" },
"headers": { "$first": "$headers" },
"message": { "$first": "$message" }
}},
{ "$project": {
"_id": "$message_id",
"headers": 1,
"message": 1
}}
])
The $group stage collapses documents with matching message content, with the $first operators selecting only the "first" item found for each field within a group.
This assumes that the existing order is by "timestamp"; if it is not, you might want to apply a $sort as the first pipeline stage, before the others:
{ "$sort": { "headers.timestamp": 1 } }
The final $project really just restores the original document form and removes the "grouping key" that was supplied earlier. That is just prettier than duplicating information and/or putting things out of place.
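If it helps to see the grouping logic outside the database, here is a rough in-memory sketch in plain JavaScript of what the $sort, $group with $first, and $project stages do together. It is not MongoDB API: dedupeByMessage and the sample documents are made up for illustration, and it assumes timestamps are numeric.

```javascript
// In-memory illustration only. dedupeByMessage is a hypothetical helper,
// not part of MongoDB; timestamps are assumed to be numbers.
function dedupeByMessage(docs) {
  // Like the optional $sort stage: oldest first by headers.timestamp.
  const sorted = [...docs].sort(
    (a, b) => a.headers.timestamp - b.headers.timestamp
  );
  const firstPerGroup = new Map();
  for (const doc of sorted) {
    // Like the $group _id: message content plus sender.
    const key = JSON.stringify([doc.message.message, doc.headers.from]);
    // Like $first: keep only the first document seen for each key.
    if (!firstPerGroup.has(key)) firstPerGroup.set(key, doc);
  }
  // Like the final $project: return documents in their original shape.
  return [...firstPerGroup.values()];
}

// Made-up sample data: "m1" and "m2" are a double send.
const sample = [
  { _id: "m2", headers: { from: "a@example.com", to: "b@example.com", timestamp: 2 },
    message: { message: "hello", signature: "A" } },
  { _id: "m1", headers: { from: "a@example.com", to: "b@example.com", timestamp: 1 },
    message: { message: "hello", signature: "A" } },
  { _id: "m3", headers: { from: "a@example.com", to: "b@example.com", timestamp: 3 },
    message: { message: "bye", signature: "A" } },
];

console.log(dedupeByMessage(sample).map((d) => d._id)); // [ 'm1', 'm3' ]
```

One difference to keep in mind: this sketch returns results in first-seen order, whereas you should not rely on the order of documents coming out of $group unless you sort again afterwards.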
Upvotes: 1
Reputation: 3341
db.mycoll.aggregate([
{
$match:{"headers.from": <authorname>}
},{
$group:{
_id:"$headers.from",
"message":{$addToSet:"$message.message"}
}
}
])
Upvotes: 0
Reputation: 400
What you're looking for is not currently implemented (at least as far as I know). One work around would be this
Upvotes: 0
Reputation: 1235
You could use distinct() to return an array of distinct messages from a specific author as follows:
db.collection.distinct('message.message', {"headers.from": <authorname>})
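Note that this gives back only the distinct field values, which is why the headers go missing. A rough in-memory sketch in plain JavaScript of that behaviour (distinctMessages and the sample data are hypothetical, not MongoDB API):

```javascript
// In-memory illustration only. distinctMessages is a hypothetical helper
// that mimics distinct on "message.message" with a "headers.from" filter.
function distinctMessages(docs, author) {
  const values = new Set();
  for (const doc of docs) {
    // Like the query argument: only documents from the given author.
    if (doc.headers.from === author) values.add(doc.message.message);
  }
  // Only the field values survive; headers and _id are not included.
  return [...values];
}

// Made-up sample data with one double send.
const inbox = [
  { _id: "m1", headers: { from: "a@example.com", timestamp: 1 },
    message: { message: "hello", signature: "A" } },
  { _id: "m2", headers: { from: "a@example.com", timestamp: 2 },
    message: { message: "hello", signature: "A" } },
  { _id: "m3", headers: { from: "c@example.com", timestamp: 3 },
    message: { message: "other", signature: "C" } },
  { _id: "m4", headers: { from: "a@example.com", timestamp: 4 },
    message: { message: "bye", signature: "A" } },
];

console.log(distinctMessages(inbox, "a@example.com")); // [ 'hello', 'bye' ]
```

So this is fine if you only need the message bodies; to keep the whole document you would need something else.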
Upvotes: 0