mongoengine query for duplicates in embedded documentlist

Question

I'm making a Python app with mongoengine. I have a MongoDB database of n users, and each user holds n daily records. I have a list of n new records per user that I want to add to the database.

I want to check whether a record for a given date already exists for a user before adding a new record to that user.

What I found in the docs is to iterate through every embedded document in the list to check for duplicate fields, but that's an O(n^2) algorithm and it took a solid 5 seconds for 300 records, which is too long. An abbreviated version of the code is below.

There's got to be a better way to query, right? I tried accessing something like user.records.date, but that throws a not found error.

import mongoengine

# snippet here is abbreviated and does not run
# zone of interest: conditional_insert(), line 16

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required = True)
    #contents = ...

class User(mongoengine.Document):
    #meta{}
    #account details
    records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)


def conditional_insert(user, new_record):
    # the docs tell me to iterate through every record in the user
    # there has to be a better way
    for r in user.records:
        if str(new_record.date) == str(r.date): # I had to do this in my program
            # because Python kept converting the datetime obj to str
            return
    # if no record with a duplicate date is found, insert the new record
    save_record(user, new_record)

def save_record(user, new_record): pass


if __name__ == "__main__":
    lst_to_insert = [] # list of (user, record_to_insert)

    for user, record in lst_to_insert: #O(n)
        conditional_insert(user, record) #O(n)

    # and I have n lst_to_insert, so in reality I'm currently at O(n^3)

rubhemaju · Accepted Answer

Hi everyone (and future me, who will probably search for the same question 10 years later).

I optimized the code using the idea of a search tree. Instead of putting all records in a single list in User, I broke it down by year and month:

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required = True)
    #contents = ...

# Month is defined before Year because Year refers to it
class Month(mongoengine.EmbeddedDocument):
    daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

class Year(mongoengine.EmbeddedDocument):
    monthly_records = mongoengine.EmbeddedDocumentListField(Month)

class User(mongoengine.Document):
    #meta{}
    #account details
    yearly_records = mongoengine.EmbeddedDocumentListField(Year)

Because it's MongoDB, I can later partition by decades, heck, even centuries, but by that point I don't think this code will be relevant.

I then group the data to insert by month into separate pandas DataFrames and feed each DataFrame in separately. The data flow thus looks like:

0) get the monthly df
1) loop through years until we get the right one (let's say 10 steps; I don't think my program will live that long)
2) loop through months until we get the right one (12 steps)
3) for each record in the df, loop through each daily record in the month to check for duplicates
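
The steps above can be sketched in plain Python, with nested dictionaries standing in for the Year and Month embedded documents. This is only an illustration under assumptions: `make_buckets` and `insert_if_new` are hypothetical names that are not part of the original code, and the dictionary lookups take the place of the loops over `yearly_records` and `monthly_records` that the real version would need.

```python
from collections import defaultdict
from datetime import date

def make_buckets():
    # Hypothetical stand-in for one user's records:
    # buckets[year][month] is the list of dates already stored for that month.
    return defaultdict(lambda: defaultdict(list))

def insert_if_new(buckets, new_date):
    """Append new_date to its month bucket unless it is already there.

    Returns True if the date was inserted, False if it was a duplicate.
    """
    # Steps 1 and 2: find the right year, then the right month.
    month_bucket = buckets[new_date.year][new_date.month]
    # Step 3: scan the month, which holds at most 31 entries.
    if new_date in month_bucket:
        return False
    month_bucket.append(new_date)
    return True

buckets = make_buckets()
results = [
    insert_if_new(buckets, date(2020, 1, 15)),
    insert_if_new(buckets, date(2020, 1, 15)),  # same day again
    insert_if_new(buckets, date(2020, 2, 15)),
]
print(results)  # [True, False, True]
```

The duplicate check in step 3 only ever looks at one month, which is why the cost per record stays small no matter how many years of data the user has.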

The insert-with-check algorithm is still O(n^2), but since there are at most 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (I didn't actually time it, but as long as it feels instant it won't matter much in my use case).
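
If the single-list schema from the question is kept, the per-record scan could also be avoided by collecting the existing dates into a set once per user and testing membership against it. This is a different technique from the year/month bucketing, sketched here with a plain dataclass standing in for the embedded document; `Record` and `filter_new_records` are hypothetical names, and nothing here touches the database.

```python
import datetime
from dataclasses import dataclass

@dataclass
class Record:
    # Hypothetical stand-in for EmbeddedRecord; only the date matters here.
    date: datetime.date

def filter_new_records(existing, candidates):
    """Return the candidates whose date is not already present.

    The set of known dates is built once, so each check is a hash lookup
    instead of a scan over every existing record.
    """
    seen = {r.date for r in existing}
    fresh = []
    for rec in candidates:
        if rec.date not in seen:
            seen.add(rec.date)  # also rejects duplicates within candidates
            fresh.append(rec)
    return fresh

existing = [Record(datetime.date(2020, 1, 1)), Record(datetime.date(2020, 1, 2))]
candidates = [
    Record(datetime.date(2020, 1, 2)),  # already stored
    Record(datetime.date(2020, 1, 3)),  # new
    Record(datetime.date(2020, 1, 3)),  # repeated within the batch
]
fresh = filter_new_records(existing, candidates)
print([r.date.isoformat() for r in fresh])  # ['2020-01-03']
```

Comparing `datetime.date` objects directly, rather than their string forms, assumes both sides really are dates; if one side arrives as a string it would need converting first.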
