Reputation: 33
I am not an IT engineer but a mechanical engineer, so do not hesitate to ask me for more details.
I have a huge collection of Magic: The Gathering cards and wrote a program that reads a card from a picture via OpenCV. It processes the picture, extracts the card name, searches for it in a JSON file, and appends it to my library.
I am trying to optimize the reading of the JSON file, as the program goes through every possible entry to match the card name detected in the picture. The JSON file gathering all the data is about 210 MB and is available online at https://mtgjson.com/downloads/all-files/
In the example below, assuming a card name has already been extracted into the variable "keyVal", the search takes about 10 seconds:
import json
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

keyVal = "Arc électrique"
json_file = open("AllPrintings.json", "r", encoding="utf-8")
bdd = json.load(json_file)
for item in bdd["data"]:
    for card in bdd["data"][item]["cards"]:
        for langue in card["foreignData"]:
            if similar(langue["name"], keyVal) > 0.85:
                print(langue["name"], card["name"], card["type"], card["artist"],
                      bdd["data"][item]["name"], card["number"],
                      card["identifiers"]["multiverseId"])
        if similar(card["name"], keyVal) > 0.85:
            print(card["name"], card["type"], card["artist"],
                  bdd["data"][item]["name"], card["number"],
                  card["identifiers"]["multiverseId"])
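For reference, one inexpensive speed-up that stays inside this pure-Python approach (a sketch, not a full solution): SequenceMatcher exposes real_quick_ratio() and quick_ratio(), which are cheap upper bounds on ratio(), so names that cannot possibly reach the cutoff can be rejected without the expensive full comparison:

```python
from difflib import SequenceMatcher

def similar(a, b, cutoff=0.85):
    """Return the similarity ratio, short-circuiting hopeless pairs."""
    m = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so if either falls below the cutoff, ratio() cannot reach it.
    if m.real_quick_ratio() < cutoff or m.quick_ratio() < cutoff:
        return 0.0
    return m.ratio()

print(similar("Arc Lightning", "Arc électrique"))  # rejected by the cheap bound: 0.0
print(similar("Arc Lightning", "Arc Lightning"))   # 1.0
```

This does not change which cards match above the 0.85 threshold; it only skips the costly ratio() computation for clearly dissimilar names.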
My first attempt was to read the JSON file and record only the data I need, but that turned into a very, very large file...
Do you have any ideas on how to improve the search time?
Thanks, and do not hesitate to ask for clarifications.
Upvotes: 1
Views: 671
Reputation: 33
I did two things based on your answer, @Tomalak. Using sqlitebrowser, I created and saved a dedicated table in the SQLite database containing only the data I need, with this code:
CREATE TABLE res AS SELECT
    c.number, c.name, c.artist, fd.name AS local_name, fd.language, st.name AS local_print
FROM
    cards AS c
    LEFT JOIN foreign_data AS fd ON fd.uuid = c.uuid
    LEFT JOIN sets AS st ON st.code = c.setCode
WHERE
    fd.language IS NULL OR fd.language = 'French'
Then I queried it from Python using FTS4, running two requests in succession: the first measures the virtual-table initialisation plus one query, the second measures a single query against the same virtual table:
import sqlite3
import time
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

init = time.time()
keyVal = "Act of Treason"
conn = sqlite3.connect(r"AllPrintings.sqlite")
conn.create_function("SIMILAR", 2, similar)
cur = conn.cursor()
cur.execute('''DROP TABLE IF EXISTS mtgsearch''')
cur.execute('''CREATE VIRTUAL TABLE mtgsearch
               USING fts4(number, name, artist, namefr, language, local_print)''')
cur.execute('''INSERT INTO mtgsearch(number, name, artist, namefr, language, local_print)
               SELECT c.number, c.name, c.artist, c.local_name, c.language, c.local_print
               FROM res AS c''')
conn.commit()
stock = cur.execute('''SELECT * FROM mtgsearch WHERE name = ?''', [keyVal])
for row in stock:
    print(row[0], row[1], row[2], row[3], row[4], row[5])
print(time.time() - init)

init = time.time()
keyVal = "Air Elemental"
stock = cur.execute('''SELECT * FROM mtgsearch WHERE name = ?''', [keyVal])
for row in stock:
    print(row[0], row[1], row[2], row[3], row[4], row[5])
print(time.time() - init)
The results are mind-blowing compared to before: the first query returns in 0.4 s, the second in 0.015 s.
If I use SequenceMatcher instead, the first request takes 1.7 s and the second 1.3 s. So the next target is to find a faster similarity algorithm. Any ideas?
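One detail worth noting about the snippet above: `WHERE name = ?` is an exact comparison and does not use the full-text machinery at all. FTS4 tables are queried through the MATCH operator, which tokenizes the text and also supports prefix queries. A minimal, self-contained sketch (the table here is a toy stand-in, not the real AllPrintings data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Toy stand-in for the mtgsearch virtual table
cur.execute("CREATE VIRTUAL TABLE mtgsearch USING fts4(name, artist)")
cur.executemany("INSERT INTO mtgsearch VALUES (?, ?)",
                [("Act of Treason", "Stan"), ("Air Elemental", "Kev"),
                 ("Arc Lightning", "Seb")])
# MATCH uses the full-text index; 'air*' is a case-insensitive prefix query
hits = [row[0] for row in cur.execute(
    "SELECT name FROM mtgsearch WHERE name MATCH ?", ["air*"])]
print(hits)  # ['Air Elemental']
```

MATCH is still token-based rather than fuzzy, so it complements, but does not replace, a similarity function for misread card names.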
Thanks anyway for the help; I learned a lot about SQLite, which I knew nothing about when writing my first post.
Upvotes: 1
Reputation: 338178
Here is your program, translated to an SQL-based approach using Python's own sqlite3
and the SQLite database that https://mtgjson.com/downloads/all-files/ conveniently already offers:
import sqlite3
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

conn = sqlite3.connect(r"C:\Users\Tomalak\Downloads\AllPrintings.sqlite")
conn.create_function("SIMILAR", 2, similar)

def find_similar_cards(key_val):
    return conn.execute("""
        SELECT
            c.number, c.name, c.type, c.artist, c.multiverseId,
            fd.name AS local_name, fd.language
        FROM
            cards AS c
            INNER JOIN foreign_data AS fd ON fd.uuid = c.uuid
        WHERE
            SIMILAR(fd.name, ?) > 0.85
        """, [key_val])

for row in find_similar_cards("Arc électrique"):
    print(row)
You can see that it's instantly a lot more obvious what the program does when reading it, so that's already a big plus.
SQLite allows registering user-defined functions and makes them available in SQL queries, so I have exposed the SequenceMatcher-based similar() as SIMILAR.
Unfortunately, this is also the culprit. The query has to scan through each one of the 237,000 records in foreign_data and evaluate each name for its similarity. This is slow, and not a whole lot can be done about it. On my (older) laptop it takes a little over 10 seconds to complete this query and print
('97', 'Arc Lightning', 'Sorcery', 'Seb McKinnon', '386478', 'Arc électrique', 'French')
('97', 'Arc Lightning', 'Sorcery', 'Seb McKinnon', '394068', 'Arc électrique', 'French')
('174', 'Arc Lightning', 'Sorcery', 'Andrew Goldhawk', '5733', 'Arc électrique', 'French')
But there is room for optimization. The foreign_data table only contains 160,000 distinct names. It would be possible to create a helper table with those unique names that is faster to scan through, and then join back to the cards table. But no matter what you do, searching for "fuzzy" values will always take some time.
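The helper-table idea can be sketched like this (a tiny in-memory database stands in for the real file, and the schema is heavily simplified; the distinct_names table name is made up):

```python
import sqlite3
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# Tiny in-memory stand-in for AllPrintings.sqlite (schema heavily simplified)
conn = sqlite3.connect(":memory:")
conn.create_function("SIMILAR", 2, similar)
conn.executescript("""
    CREATE TABLE cards (uuid TEXT, name TEXT);
    CREATE TABLE foreign_data (uuid TEXT, name TEXT, language TEXT);
    INSERT INTO cards VALUES ('u1', 'Arc Lightning'), ('u2', 'Arc Lightning');
    INSERT INTO foreign_data VALUES
        ('u1', 'Arc électrique', 'French'),
        ('u2', 'Arc électrique', 'French');
""")

# Step 1: run the expensive SIMILAR() over the distinct names only
conn.execute("""
    CREATE TEMP TABLE distinct_names AS
    SELECT DISTINCT name FROM foreign_data
""")
matches = [r[0] for r in conn.execute(
    "SELECT name FROM distinct_names WHERE SIMILAR(name, ?) > 0.85",
    ["Arc electrique"])]

# Step 2: cheap exact-match join back for the few names that passed
results = []
for name in matches:
    results += conn.execute("""
        SELECT c.name, fd.name
        FROM cards AS c
        JOIN foreign_data AS fd ON fd.uuid = c.uuid
        WHERE fd.name = ?
    """, [name]).fetchall()
print(results)
```

The fuzzy comparison then runs 160,000 times instead of 237,000, and everything after it is plain equality lookups.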
In general, the options to improve search times are to avoid searching for calculated values and to have proper indexes in place.
Apart from that, the downloaded SQLite DB has no indexes defined at all; depending on what kind of data you query often, there is room for improvement there, too.
As soon as you're no longer searching for calculated values and proper indexes are in place, this will be blazing fast.
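For example (a sketch; the index name is an invented one, and the exact EXPLAIN QUERY PLAN wording varies between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (uuid TEXT, name TEXT)")
conn.execute("INSERT INTO cards VALUES ('u1', 'Air Elemental')")
# Without an index, an exact-name lookup is a full table scan;
# with one, it becomes a B-tree search.
conn.execute("CREATE INDEX idx_cards_name ON cards(name)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM cards WHERE name = ?",
    ["Air Elemental"]).fetchone()
print(plan[-1])  # the detail column mentions idx_cards_name
```

The same CREATE INDEX statement can be run once against the downloaded database file and persists with it.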
Upvotes: 3