Reputation: 835
Update: If you are reading this to improve your insert/update speeds, first check whether you have the pymongo C extensions enabled on your system by running pymongo.has_c() from a Python console. If it returns False, you either need to compile pymongo with the C extensions or run pip install --upgrade pymongo.
It improved my workflow from 17 seconds on 10K rows to about 0.57 seconds.
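As a quick sketch of that check in script form (the warning message is just illustrative):

import pymongo

# has_c() reports whether pymongo's C extensions are available
if not pymongo.has_c():
    print("pymongo C extensions missing; try: pip install --upgrade pymongo")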
I have thousands of txt files, containing millions of rows of data that I am trying to import into a mongodb collection.
I'm currently using the following def:
import re, pymongo

coll = pymongo.MongoClient().database.collection
rx = re.compile(r'[:; ]')
rx_email = re.compile(r'\S+@\S+\.\S+$')

def parser(path):
    with open(path, "rb") as f:
        for line in f:
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    coll.find_one_and_update({'email': email}, {'$addToSet': {'passwords': password}}, upsert=True)
                elif username:
                    coll.find_one_and_update({'username': username}, {'$addToSet': {'passwords': password}}, upsert=True)
            except UnicodeDecodeError:
                pass

if __name__ == "__main__":
    parser('path/to/file.txt')
When I run the script on a file with 10K rows, it takes about 74.6 seconds. I assume this is due to the number of documents MongoDB has to match against on each upsert? Running the same loop without the db interaction takes about 0.023 seconds.
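(For context: without an index on the queried fields, each upsert has to scan the collection. A minimal sketch of indexing the lookup fields, assuming the coll handle from the code above; I ended up using _id instead, see my answer below:)

# indexing the fields used in the upsert filters avoids a full
# collection scan per find_one_and_update
coll.create_index('email')
coll.create_index('username')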
EDIT: As suggested in Fast or Bulk Upsert in pymongo, I have also tried to use UpdateOne with bulk_write as follows:
from pymongo import UpdateOne

# rx, rx_email and LOCAL_DB (the target collection; bulk_write is a
# Collection method) are set up as in the first snippet
def parser(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                LOCAL_DB.bulk_write(ops, ordered=False)
                ops = []
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(UpdateOne({'identifier': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(UpdateOne({'identifier': username}, {'$addToSet': {'passwords': password}}, upsert=True))
            except UnicodeDecodeError:
                pass
    if ops:  # flush the remaining (< 1000) ops at end of file
        LOCAL_DB.bulk_write(ops, ordered=False)
Time to complete 10K lines is 17 seconds, which is still far too slow for the number of files and rows I am trying to update.
Are there any better (and hopefully faster) ways of doing this?
Upvotes: 2
Views: 1367
Reputation: 835
It seems I managed, with some guidance from @JohnnyHK in the comments, to get my initial upsert time from ~74 to ~0.5 seconds for 10K rows by making the following changes to my initial code:
import re, pymongo

# create the client and collection handle once, rather than per batch
coll = pymongo.MongoClient().database.collection
rx = re.compile(r'[:; ]')
rx_email = re.compile(r'\S+@\S+\.\S+$')

def parse(path):
    ops = []
    with open(path, "rb") as f:
        for line in f:
            if len(ops) == 1000:
                coll.bulk_write(ops, ordered=False)
                ops = []
            try:
                fields = rx.split(line.decode('utf-8'))
                email = ''
                username = ''
                for field in fields:
                    if rx_email.match(field):
                        email = field
                    elif field != fields[-1]:
                        username = field
                password = fields[-1]
                if email:
                    ops.append(pymongo.UpdateOne({'_id': email}, {'$addToSet': {'passwords': password}}, upsert=True))
                elif username:
                    ops.append(pymongo.UpdateOne({'_id': username}, {'$addToSet': {'passwords': password}}, upsert=True))
                else:
                    pass  # logic removed
            except UnicodeDecodeError:
                pass  # logic removed
    if ops:  # flush the remaining (< 1000) ops at end of file
        coll.bulk_write(ops, ordered=False)

if __name__ == "__main__":
    parse('path/to/file.txt')
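For anyone wanting to reproduce the timing, a minimal sketch (the path is a placeholder):

import time

start = time.perf_counter()
parse('path/to/file.txt')  # placeholder path
print(f"took {time.perf_counter() - start:.2f} seconds")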
I found that the pymongo C extensions were missing on my system:

>>> import pymongo
>>> pymongo.has_c()
False

From there I ran pip install --upgrade pymongo (luckily for me) and it then returned True.
I also used _id instead of identifier for the unique fields, which improved speed even more.
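For context, every MongoDB collection has a built-in unique index on _id, so the upsert filter becomes an index seek instead of a scan. A quick sketch to confirm, with a hypothetical key:

# _id carries a default unique index, so this lookup is an index seek;
# explain() shows the query plan (look for IXSCAN / IDHACK)
print(coll.find({'_id': 'user@example.com'}).explain())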
Hope this helps people down the road. I will update with more findings as I learn.
Upvotes: 1