Reputation: 33
I am not an IT engineer but a mechanical engineer, so do not hesitate to ask me for more details.
I have a huge collection of Magic: The Gathering cards and wrote a program that reads a card from a picture via OpenCV. It processes the picture, extracts the card name, searches for it in a JSON file, and appends it to my library.
I am trying to optimize the reading of the JSON file, as the program goes through every possible entry to match the card name detected in the picture. The JSON file gathering all the data is about 210 MB and is available online at https://mtgjson.com/downloads/all-files/
In the example below, assuming a card name has already been extracted into the variable "keyVal", the search takes about 10 seconds:
import json
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

keyVal = "Arc électrique"
json_file = open("AllPrintings.json", "r", encoding="utf-8")
bdd = json.load(json_file)
for item in bdd["data"]:
    for card in bdd["data"][item]["cards"]:
        for langue in card["foreignData"]:
            if similar(langue["name"], keyVal) > 0.85:
                print(langue["name"], card["name"], card["type"], card["artist"],
                      bdd["data"][item]["name"], card["number"],
                      card["identifiers"]["multiverseId"])
        if similar(card["name"], keyVal) > 0.85:
            print(card["name"], card["type"], card["artist"],
                  bdd["data"][item]["name"], card["number"],
                  card["identifiers"]["multiverseId"])
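For reference, one inexpensive speed-up that stays inside this pure-Python approach (a sketch, not a full solution): SequenceMatcher exposes real_quick_ratio() and quick_ratio(), which are cheap upper bounds on ratio(), so names that cannot possibly reach the cutoff can be rejected without the expensive full comparison:

```python
from difflib import SequenceMatcher

def similar(a, b, cutoff=0.85):
    """Return the similarity ratio, short-circuiting hopeless pairs."""
    m = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so if either falls below the cutoff, ratio() cannot reach it.
    if m.real_quick_ratio() < cutoff or m.quick_ratio() < cutoff:
        return 0.0
    return m.ratio()

print(similar("Arc Lightning", "Arc électrique"))  # rejected by the cheap bound: 0.0
print(similar("Arc Lightning", "Arc Lightning"))   # 1.0
```

This does not change which cards match above the 0.85 threshold; it only skips the costly ratio() computation for clearly dissimilar names.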
My first attempt was to read the JSON file and record only the data I need, but that turned into a very, very large file...
Do you have any ideas on how to improve the search time?
Thanks, and do not hesitate to ask for clarifications.
Upvotes: 1
Views: 671
Reputation: 33
I did two things based on your answer, @Tomalak. Using sqlitebrowser, I created and saved a dedicated table in the SQLite database containing only the data I need, with this code:
CREATE TABLE res AS SELECT
    c.number, c.name, c.artist, fd.name AS local_name, fd.language, st.name AS local_print
FROM
    cards AS c
    LEFT JOIN foreign_data AS fd ON fd.uuid = c.uuid
    LEFT JOIN sets AS st ON st.code = c.setCode
WHERE
    fd.language IS NULL OR fd.language = 'French'
Then I queried it from Python using FTS4, running two requests in succession: the first measures the virtual-table initialisation plus one query, the second measures a single query against the same virtual table:
import sqlite3
import time
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

init = time.time()
keyVal = "Act of Treason"
conn = sqlite3.connect(r"AllPrintings.sqlite")
conn.create_function("SIMILAR", 2, similar)
cur = conn.cursor()
cur.execute('''DROP TABLE IF EXISTS mtgsearch''')
cur.execute('''CREATE VIRTUAL TABLE mtgsearch
               USING fts4(number, name, artist, namefr, language, local_print)''')
cur.execute('''INSERT INTO mtgsearch(number, name, artist, namefr, language, local_print)
               SELECT c.number, c.name, c.artist, c.local_name, c.language, c.local_print
               FROM res AS c''')
conn.commit()
stock = cur.execute('''SELECT * FROM mtgsearch WHERE name = ?''', [keyVal])
for row in stock:
    print(row[0], row[1], row[2], row[3], row[4], row[5])
print(time.time() - init)

init = time.time()
keyVal = "Air Elemental"
stock = cur.execute('''SELECT * FROM mtgsearch WHERE name = ?''', [keyVal])
for row in stock:
    print(row[0], row[1], row[2], row[3], row[4], row[5])
print(time.time() - init)
The results are mind-blowing compared to before: the first query returns in 0.4 s, the second in 0.015 s.
If I use SequenceMatcher instead, the first request takes 1.7 s and the second 1.3 s. So the next target is to find a faster similarity algorithm. Any ideas?
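One detail worth noting about the snippet above: `WHERE name = ?` is an exact comparison and does not use the full-text machinery at all. FTS4 tables are queried through the MATCH operator, which tokenizes the text and also supports prefix queries. A minimal, self-contained sketch (the table here is a toy stand-in, not the real AllPrintings data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Toy stand-in for the mtgsearch virtual table
cur.execute("CREATE VIRTUAL TABLE mtgsearch USING fts4(name, artist)")
cur.executemany("INSERT INTO mtgsearch VALUES (?, ?)",
                [("Act of Treason", "Stan"), ("Air Elemental", "Kev"),
                 ("Arc Lightning", "Seb")])
# MATCH uses the full-text index; 'air*' is a case-insensitive prefix query
hits = [row[0] for row in cur.execute(
    "SELECT name FROM mtgsearch WHERE name MATCH ?", ["air*"])]
print(hits)  # ['Air Elemental']
```

MATCH is still token-based rather than fuzzy, so it complements, but does not replace, a similarity function for misread card names.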
Thanks anyway for the help; I learned a lot about SQLite, which I knew nothing about when writing my first post.
Upvotes: 1
Reputation: 338178
Here is your program, translated to an SQL-based approach using Python's own sqlite3
and the SQLite database that https://mtgjson.com/downloads/all-files/ conveniently already offers:
import sqlite3
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

conn = sqlite3.connect(r"C:\Users\Tomalak\Downloads\AllPrintings.sqlite")
conn.create_function("SIMILAR", 2, similar)

def find_similar_cards(key_val):
    return conn.execute("""
        SELECT
            c.number, c.name, c.type, c.artist, c.multiverseId,
            fd.name AS local_name, fd.language
        FROM
            cards AS c
            INNER JOIN foreign_data AS fd ON fd.uuid = c.uuid
        WHERE
            SIMILAR(fd.name, ?) > 0.85
        """, [key_val])

for row in find_similar_cards("Arc électrique"):
    print(row)
You can see that it's instantly a lot more obvious what the program does when reading it, so that's already a big plus.
SQLite allows registering user-defined functions and makes them available in SQL queries, so I have exposed the SequenceMatcher-based similar() as SIMILAR.
Unfortunately, this is also the culprit. The query has to scan through each one of the 237,000 records in foreign_data and evaluate each name for its similarity. This is slow, and not a whole lot can be done about it. On my (older) laptop it takes a little over 10 seconds to complete this query and print
('97', 'Arc Lightning', 'Sorcery', 'Seb McKinnon', '386478', 'Arc électrique', 'French')
('97', 'Arc Lightning', 'Sorcery', 'Seb McKinnon', '394068', 'Arc électrique', 'French')
('174', 'Arc Lightning', 'Sorcery', 'Andrew Goldhawk', '5733', 'Arc électrique', 'French')
But there is room for optimization. The foreign_data table only contains 160,000 distinct names. It would be possible to create a helper table with those unique names that is faster to scan through, and then join back to the cards table. But no matter what you do, searching for "fuzzy" values will always take some time.
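The helper-table idea can be sketched like this (a tiny in-memory database stands in for the real file, and the schema is heavily simplified; the distinct_names table name is made up):

```python
import sqlite3
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# Tiny in-memory stand-in for AllPrintings.sqlite (schema heavily simplified)
conn = sqlite3.connect(":memory:")
conn.create_function("SIMILAR", 2, similar)
conn.executescript("""
    CREATE TABLE cards (uuid TEXT, name TEXT);
    CREATE TABLE foreign_data (uuid TEXT, name TEXT, language TEXT);
    INSERT INTO cards VALUES ('u1', 'Arc Lightning'), ('u2', 'Arc Lightning');
    INSERT INTO foreign_data VALUES
        ('u1', 'Arc électrique', 'French'),
        ('u2', 'Arc électrique', 'French');
""")

# Step 1: run the expensive SIMILAR() over the distinct names only
conn.execute("""
    CREATE TEMP TABLE distinct_names AS
    SELECT DISTINCT name FROM foreign_data
""")
matches = [r[0] for r in conn.execute(
    "SELECT name FROM distinct_names WHERE SIMILAR(name, ?) > 0.85",
    ["Arc electrique"])]

# Step 2: cheap exact-match join back for the few names that passed
results = []
for name in matches:
    results += conn.execute("""
        SELECT c.name, fd.name
        FROM cards AS c
        JOIN foreign_data AS fd ON fd.uuid = c.uuid
        WHERE fd.name = ?
    """, [name]).fetchall()
print(results)
```

The fuzzy comparison then runs 160,000 times instead of 237,000, and everything after it is plain equality lookups.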
In general, the options to improve search times are to avoid searching for calculated values and to have proper indexes in place.
Apart from that, the downloaded SQLite DB has no indexes defined at all; depending on what kind of data you query often, there is room for improvement there, too.
As soon as you're no longer searching for calculated values and proper indexes are in place, this will be blazing fast.
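For example (a sketch; the index name is an invented one, and the exact EXPLAIN QUERY PLAN wording varies between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (uuid TEXT, name TEXT)")
conn.execute("INSERT INTO cards VALUES ('u1', 'Air Elemental')")
# Without an index, an exact-name lookup is a full table scan;
# with one, it becomes a B-tree search.
conn.execute("CREATE INDEX idx_cards_name ON cards(name)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cards WHERE name = ?",
    ["Air Elemental"]).fetchone()
print(plan[-1])  # the detail column mentions idx_cards_name
```

The same CREATE INDEX statement can be run once against the downloaded database file and persists with it.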
Upvotes: 3