Edgard Knive

Reputation: 835

pymongo find_one_and_update performance on millions of inserts/updates

Update: If you came here to improve your insert/update speeds, first check whether the pymongo C extensions are enabled on your system by running pymongo.has_c() from a Python console. If it returns False, you either need to compile pymongo with the C extensions or run pip install --upgrade pymongo.

That change alone took my workflow from 17 seconds for 10K rows down to about 0.57 seconds.
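If you want a quick one-liner to check from a shell (a minimal sketch; pymongo.has_c() is the same call as above):

python -c "import pymongo; print(pymongo.has_c())"

It prints True when the C extensions are compiled in.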

I have thousands of txt files containing millions of rows of data that I am trying to import into a MongoDB collection.

I'm currently using the following function:

import re, pymongo

coll = pymongo.MongoClient().database.collection
rx = re.compile(r'[:; ]')                # field separators
rx_email = re.compile(r'\S+@\S+\.\S+$')  # rough email shape

def parser(path):
    with open(path, "rb") as f:
        for line in f:
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:  # the last field is the password
                        username = field
                password = fields[-1]
                if email:
                    coll.find_one_and_update({'email': email}, {'$addToSet': {'passwords': password}}, upsert=True)
                elif username:
                    coll.find_one_and_update({'username': username}, {'$addToSet': {'passwords': password}}, upsert=True)
                else:
                    pass
            except UnicodeDecodeError:
                pass  # skip lines that fail to decode

if __name__ == "__main__":
    parser('path/to/file.txt')

When I run the script on a file with 10K rows, it takes about 74.6 seconds. I assume this is due to the number of documents MongoDB has to match against for each upsert? Running the same loop without any db interaction takes about 0.023 seconds.
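For anyone reproducing these numbers, a minimal timing sketch (standard library only; the path is a placeholder):

import time

start = time.perf_counter()
parser('path/to/file.txt')
print(time.perf_counter() - start, 'seconds')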

EDIT: As suggested in Fast or Bulk Upsert in pymongo, I have also tried using UpdateOne with bulk_write, as follows:

from pymongo import MongoClient, UpdateOne

coll = MongoClient().database.collection

def parser(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                coll.bulk_write(ops, ordered=False)  # send a batch of 1000 upserts at once
                ops = []
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(UpdateOne({'identifier': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(UpdateOne({'identifier': username}, {'$addToSet': {'passwords': password}}, upsert=True))
            except UnicodeDecodeError:
                pass  # skip lines that fail to decode
    if ops:
        coll.bulk_write(ops, ordered=False)  # flush the final partial batch

Completing 10K lines now takes 17 seconds, which is still way too slow for the number of files and rows I am trying to update.

Are there any better (and hopefully faster) ways of doing this?

Some requirements:

  1. email and/or username should be unique.
  2. The array containing passwords should only list each password once (also unique); see the sketch after this list.
  3. 1M rows should (if possible) take less than 1 minute to insert.
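For requirements 1 and 2, the upsert filter keeps one document per identifier and $addToSet skips values already in the array. A minimal sketch of that behaviour (update_one used purely for illustration; the database and collection names are placeholders):

from pymongo import MongoClient

coll = MongoClient().database.collection

# Running the same upsert twice creates one document and stores the password once.
coll.update_one({'email': 'a@b.com'}, {'$addToSet': {'passwords': 'hunter2'}}, upsert=True)
coll.update_one({'email': 'a@b.com'}, {'$addToSet': {'passwords': 'hunter2'}}, upsert=True)
print(coll.find_one({'email': 'a@b.com'})['passwords'])  # ['hunter2']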

Upvotes: 2

Views: 1367

Answers (1)

Edgard Knive

Reputation: 835

With some guidance from @JohnnyHK in the comments, I managed to get my initial upsert time down from ~74 to ~0.5 seconds for 10K rows by making the following changes to my initial code:

import re, pymongo

coll = pymongo.MongoClient().database.collection  # create the client once, not per batch
rx = re.compile(r'[:; ]')
rx_email = re.compile(r'\S+@\S+\.\S+$')

def parse(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                coll.bulk_write(ops, ordered=False)
                ops = []
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(pymongo.UpdateOne({'_id': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(pymongo.UpdateOne({'_id': username}, {'$addToSet': {'passwords': password}}, upsert=True))
                else:
                    pass  # logic removed
            except UnicodeDecodeError:
                pass  # logic removed
    if ops:
        coll.bulk_write(ops, ordered=False)  # flush the final partial batch

if __name__ == "__main__":
    parse('path/to/file.txt')

I found that the pymongo C extensions were missing on my system:

>>> import pymongo
>>> pymongo.has_c()
False

From there I ran pip install --upgrade pymongo (which, luckily for me, was enough), and has_c() then returned True.

I also used _id instead of identifier for the unique field, which improved speed even more, since _id always has an index.
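If you can't put the value in _id, a unique index on the filter field should give a similar lookup speedup; this is an assumption based on _id's automatic index, and identifier is the field name from my earlier attempt:

coll.create_index('identifier', unique=True)  # one-time setup; makes the upsert filter an indexed lookup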

Hope this helps people down the road. I will update with more findings as I learn.

Upvotes: 1
