Reputation: 1313
I'm building a data model using Mongoengine to store some metadata for, in this case, email files. The very stripped-down model, with only the field relevant to my issue, is:
from mongoengine import Document, DictField

class Email(Document):
    headers = DictField()
The headers dictionary will contain header data extracted from emails, stored as key-value pairs.
In this header data I know that, at times, there will be a header present with the name (and dictionary key) x-mailer (all header keys are automatically lower-cased). I've built a simple query to see if a header contains this key, like this:
xmailer_emails = Email.objects.filter(headers__exists='x-mailer')
However, the results don't contain all the entries in the email collection, yet they include quite a few that don't have x-mailer as a key in the headers dictionary. Here is some code I used to check the resulting data:
xmailer_emails = Email.objects.filter(headers__exists='x-mailer')
log(xmailer_emails._query)
log('Total email count: ' + str(Email.objects.count()))
log('X-Mailer email count: ' + str(xmailer_emails.count()))
no_xmailer = len([e for e in xmailer_emails if 'x-mailer' not in e.headers])
log('Filtered no x-mailer count: ' + str(no_xmailer))
has_xmailer = len([e for e in xmailer_emails if 'x-mailer' in e.headers])
log('Filtered has x-mailer count: ' + str(has_xmailer))
Here is the output I get:
[21:32:54 05/27/19] {'headers': {'$exists': 'x-mailer'}}
[21:32:54 05/27/19] Total email count: 86
[21:32:54 05/27/19] X-Mailer email count: 79
[21:32:54 05/27/19] Filtered no x-mailer count: 55
[21:32:54 05/27/19] Filtered has x-mailer count: 24
So while there are 86 entries in the collection, the filter for x-mailer in headers pulls back 79 of them. Of those, only 24 actually have that key in the dictionary. I seem to be doing something wrong, but I don't know what.
Here's a short dump of the keys of a few of the returned items that do not contain the x-mailer key:
dict_keys(['x-receiver', 'to', 'mime-version', 'received', 'x-priority', 'x-sender', 'date', 'content-type', 'message-id', 'subject', 'x-riferimento-message-id', 'from'])
dict_keys(['content-type', 'mime-version'])
dict_keys(['x-wum-to', 'date', 'x-uidl', 'message-id', 'in-reply-to', 'x-wum-replyto', 'x-message-delivery', 'mime-version', 'received', 'x-savecopy', 'authentication-results', 'x-wum-nature', 'x-account-key', 'x-wum-from', 'message-context', 'x-originalarrivaltime', 'x-me-spamrating', 'x-me-spamlevel', 'to', 'x-dkim-result', 'x-mozilla-status', 'x-mozilla-status2', 'references', 'x-auth-result', 'x-store-info', 'return-path', 'x-message-info', 'x-sid-pra', 'x-message-status', 'sender', 'x-wum-cci', 'content-type', 'reply-to', 'subject', 'from'])
dict_keys(['x-ms-has-attach', 'to', 'mime-version', 'received', 'thread-topic', 'x-auto-response-suppress', 'x-ms-exchange-organization-authsource', 'content-language', 'content-type', 'date', 'subject', 'message-id', 'thread-index', 'reply-to', 'from'])
I can't figure out what is going on: the exists operator returns the matching documents, but it also includes non-matching ones, though not every document in the collection.
I've also raised this question as an issue on the mongoengine repository, with no reply so far (besides a linked SO question about the exists operator): https://github.com/MongoEngine/mongoengine/issues/2059
Upvotes: 0
Views: 1022
Reputation: 6354
The pymongo query you actually need to execute is the following:
c.find({'headers.x-mailer': {'$exists': True}})
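For comparison, the original filter compiles to {'headers': {'$exists': 'x-mailer'}}, which asks whether the headers field itself exists. The string is most likely just treated as a truthy flag, which would explain why documents without the key come back. The sketch below emulates that difference in plain Python; field_exists is a hypothetical helper written for illustration, not part of pymongo or mongoengine, and the sample documents are made up.

```python
def field_exists(doc, dotted_path):
    """Rough emulation of {'<dotted_path>': {'$exists': True}} for nested dicts."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return True

# Made-up sample documents, for illustration only.
docs = [
    {"headers": {"x-mailer": "Example Mailer", "from": "a@example.com"}},
    {"headers": {"content-type": "text/plain", "mime-version": "1.0"}},
    {},  # a document stored without any headers field
]

# What the original filter effectively tested: is "headers" present at all?
original_matches = [d for d in docs if field_exists(d, "headers")]
# What was intended: is "x-mailer" a key inside "headers"?
intended_matches = [d for d in docs if field_exists(d, "headers.x-mailer")]

print(len(original_matches), len(intended_matches))
```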
If you have simple keys (e.g. "xmailer" instead of "x-mailer"), you could achieve this in mongoengine with:
Email.objects.filter(headers__xmailer__exists=True)
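Because "x-mailer" contains a hyphen, it cannot be written as part of a Python keyword argument name, but you can achieve the same thing using the raw operator: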
Email.objects.filter(__raw__={'headers.x-mailer': {'$exists': True}})
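If you need the same check for several headers, the raw filter can be built by a small helper. This is only a sketch: header_exists_query is a hypothetical name, and it assumes the keys are stored lower-cased (as in the question) and that the header name contains no dots, since a dot would be read as a nested path.

```python
def header_exists_query(header_name, field="headers"):
    """Build a raw MongoDB filter matching documents whose `field`
    dictionary contains `header_name` as a key."""
    key = header_name.lower()  # the question states keys are lower-cased
    if "." in key:
        # A dot would be interpreted as a nested path, so reject it here.
        raise ValueError("header names containing '.' are not supported")
    return {field + "." + key: {"$exists": True}}

query = header_exists_query("X-Mailer")
print(query)
```

The resulting dictionary can then be passed as Email.objects.filter(__raw__=query).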
Upvotes: 1