How to speed up this PyMongo query/loop

Question

I am using the Riot API for League of Legends to make pulls in Python. I have one collection that is only IDs for matchmade games that I have not pulled from the API and another that is the pulled matchmade games. Altogether I currently have 94,040 pulled matches and 224,346 match IDs that I have not pulled yet.

In my code I am checking out profiles of people for new match IDs to add against the matches I already have before adding it to the collection of IDs I haven't pulled yet. It is taking around 7 seconds for each iteration through the second collection.find, and I am not sure why. I have both of the collections indexed, and when I was using collection.find({}).sort({}) it was actually taking longer for some reason.

Here is my code:

import pymongo
import time


client = pymongo.MongoClient('mongodb://localhost')
match_db = client.matched_games
summoner_db = client.summoner_match_history
match_id_db = client.match_id

matches = match_db.matches
summoners = summoner_db.summoners
match_id = match_id_db.match


def main():
    for matchid in summoners.find({}, {'matches.gameId': 1, '_id': 0}, no_cursor_timeout=True):
        timer = time.time()
        try:
            for match in matchid['matches']:
                print(match)
                match = match['gameId']
                for k in matches.find({}):
                    k = k['gameId']
                    if k == match:
                        print(time.time() - timer)
                        timer = time.time()
                        print('Already have match #' + str(k))
                        break
                else:
                    print(time.time() - timer)
                    timer = time.time()
                    print('Inserting match #' + str(match) + '.')
                    match_id.insert({'match_id': match})
        except KeyError as e:
            print(e)
            pass
        except pymongo.errors.DuplicateKeyError as e:
            # print(e)
            pass

if __name__ == '__main__':
    main()

Generic Guy · Accepted Answer

Instead of iterating through all of them one at a time, you can do a query like:

if matches.find('game_id': 'match_id').count() == 0 //Match does not exist in the database.

So you were having the database return every single document in the matches collection, which I think is the database that has ~95,000 entries (correct me if I am wrong), and then iterating through each of them to check if the id already existed. So that means:

~95,000 iterations by the database
~95,000 iterations by your program checking for a matching id

So you are iterating ~190,000 times, and that is not including all of the overhead of your program storing and fetching the database results for you to iterate through.

However, by using matches.find().count(), you are putting the databases indexes to work for you. Instead of iterating through all 95,000 documents sequentially, the database can use its indexes to see if what you are trying to find exists much faster. Additionally, you don't have to iterate the additional 95,000 times once the database returns the value of .count(), since you only need an if statement to see if that value is 0 or 1. This saves iterations and reduces the overhead of your program significantly, thus speeding up the executing.

I am not sure I made that clear enough, so please ask questions in the comments to make sure you fully understand what is going on.

How to speed up this PyMongo query/loop

Answers (1)

Related Questions