Reputation: 39
I'm making a python app with mongoengine where i have a mongodb database of n users and each user holds n daily records. I have a list of n new record per user that I want to add to my db
I want to check if a record for a certain date already exists for an user before adding a new record to the user
what i found in the docs is to iterate through every embedded document in the list to check for duplicate fields but thats an O(n^2) algorithm and took 5 solid seconds for 300 records, too long. below an abbreviated version of the code
There's gotta be a better way to query right? I tried accessing something like user.records.date but that throws a not found
import mongoengine
#snippet here is abbreviated and does not run
# xone of interest in conditional_insert(), line 16
class EmbeddedRecord(mongoengine.EmbeddedDocument):
date = mongoengine.DateField(required = True)
#contents = ...
class User(mongoengine.Document):
#meta{}
#account details
records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)
def conditional_insert(user, new_record):
# the docs tell me to iterate tthrough every record in the user
# there has to be a better way
for r in user.records:
if str(new_record.date) == str(r.date): #i had to do that in my program
#because python kep converting datetime obj to str
return
# if record of duplicate date not found, insert new record
save_record(user, new_record)
def save_record(): pass
if __name__ == "__main__":
lst_to_insert = [] # list of (user, record_to_insert)
for object in lst_to_insert: #O(n)
conditional_insert(object[0],object[1]) #O(n)
#and I have n lst_to_insert so in reality I'm currently at O(n^3)
Upvotes: 1
Views: 196
Reputation: 39
Hi everyone (and future me who will probably search for the same question 10 years later)
I optimized the code using the idea of a search tree. Instead of putting all records in a single List in User I broke it down by year and month
class EmbeddedRecord(mongoengine.EmbeddedDocument):
date = mongoengine.DateField(required = True)
#contents = ...
class Year(mongoengine.EmbeddedDocument):
monthly_records = mongoengine.EmbeddedDocumentListField(Month)
class Month(mongoengine.EmbeddedDocument):
daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)
class User(mongoengine.Document):
#meta{}
#account details
yearly_records = mongoengine.EmbeddedDocumentListField(Year)
because it's mongodb, I can later partition by decades, heck even centuries but by that point I dont think this code will be relevant
I then group my data to insert by months into separate pandas dataframe and feed each dataframe separately. The data flow thus looks like:
0) get monthly df
1) loop through years until we get the right one (lets say 10 steps, i dont think my program will live that long)
2) loop through months until we get the right one (12 steps)
3) for each record in df loop through each daily record in month to check for duplicates
The algorithm to insert with check is still O(n^2) but since there are maximum 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (didnt actually time it but as long as it feels instant it wont matter that much in my use case)
Upvotes: 1
Reputation: 20460
Mongo cannot conveniently offer you suitable indexes, very sad.
You frequently iterate over user.records
.
If you can afford to allocate the memory for 300 users,
just iterate once and throw them into a set
, which
When you save a user, also make note of if with cache.add((user_id, str(new_record.date)))
.
EDIT
If you can't afford the memory for all those (user_id, date)
tuples,
then sure, a relational database JOIN is fair,
it's just an out-of-core merge of sorted records.
I have had good results with using sqlalchemy to hit local sqlite (memory or file-backed), or heavier databases like Postgres or MariaDB.
Bear in mind that relational databases offer lovely ACID guarantees,
but you're paying for those guarantees. In this application
it doesn't sound like you need such properties.
Something as simple as /usr/bin/sort
could do an out-of-core
ordering operation that puts all of a user's current
records right next to his historic records,
letting you filter them appropriately.
Sleepycat is not an RDBMS, but its B-tree does offer
external sorting, sufficient for the problem at hand.
(And yes, one can do transactions with sleepycat,
but again this problem just needs some pretty pedestrian reporting.)
Bench it and see. Without profiling data for a specific workload, it's pretty hard to tell if any extra complexity would be worth it. Identify the true memory or CPU bottleneck, and focus just on that.
You don't necessarily need ordering, as hashing would suffice, given enough core. Send those tuples to a redis cache, and make it his problem to store them somewhere.
Upvotes: 0