ShellRox

Reputation: 2602

Why is QuerySet iteration so slow?

I'm trying to create an accurate and efficient search algorithm for the system. I installed PostgreSQL to use its trigram similarity queries, and this is how I searched for objects:

objects_found = Question.objects.extra(where=["CHAR_LENGTH(answer) > 300"]).filter(question__trigram_similar=message)

This was incredibly quick: most queries took less than 0.5s. All the objects in the objects_found queryset are similar to the query text, but I needed to find the most similar one.

I know two algorithms that are really good for this: the first is cosine similarity, and the second is Ratcliff/Obershelp pattern recognition (which has a built-in implementation in Python's difflib).

I tried iterating over the results and testing each algorithm. Cosine similarity was around 1.5x faster in the majority of cases (as expected, since vectors can be compared much more quickly), but SequenceMatcher gave more accurate results, so I chose SequenceMatcher anyway. Please note that this iteration took a really long time.
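For reference, here is a minimal, self-contained sketch of the two comparisons: difflib's SequenceMatcher (Python's built-in Ratcliff/Obershelp) and a simple cosine similarity over character-trigram counts. The trigram tokenization here is my own assumption for illustration, not taken from the code above.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

def trigrams(text):
    # Count overlapping character trigrams (similar in spirit to pg_trgm).
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine_similarity(a, b):
    # Cosine of the angle between the two trigram-count vectors.
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

query = "how do i reset my password"
candidate = "how to reset my password"

print(cosine_similarity(query, candidate))
print(SequenceMatcher(None, query, candidate).ratio())
```

Both return a score between 0 and 1; the cosine version only counts shared trigrams, while SequenceMatcher also rewards long contiguous matching runs, which is why it tends to feel more accurate on sentence-length strings.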

Finally, I tried implementing SequenceMatcher in the code:

objects_found = Question.objects.extra(
    where=["CHAR_LENGTH(answer) > 300"]
).filter(question__trigram_similar=message).iterator()

zsim = ("", 0)
for i in objects_found:
    rsim = _search.ratcliff_obershelp(querytext, i.question)
    if zsim[1] < rsim:
        zsim = (i.answer, rsim)
        if rsim > 0.75:  # works in most of the cases
            break
response = zsim[0]
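The same ranking logic, stripped of the ORM so it runs standalone (the rows here are hypothetical (question, answer) pairs standing in for Question objects, and difflib's SequenceMatcher stands in for _search.ratcliff_obershelp):

```python
from difflib import SequenceMatcher

def best_answer(querytext, rows, threshold=0.75):
    """Return the answer whose question is most similar to querytext.

    rows is an iterable of (question, answer) pairs; the loop stops
    early once the similarity threshold is exceeded, mirroring the
    break in the code above.
    """
    best = ("", 0.0)
    for question, answer in rows:
        # ratio() is Python's built-in Ratcliff/Obershelp similarity.
        rsim = SequenceMatcher(None, querytext, question).ratio()
        if rsim > best[1]:
            best = (answer, rsim)
            if rsim > threshold:
                break
    return best[0]

rows = [
    ("how to install django", "Use pip install django."),
    ("how do I reset my password", "Click 'forgot password' on the login page."),
]
print(best_answer("how do i reset my password", rows))
```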

The database holds ~5 million rows (~1GB), and it takes PostgreSQL less than 0.5s to pick the matching rows with trigram similarity. Of those ~5 million rows, only 10-90 pass the filter, yet queryset iteration takes around 62s to find the most similar one.

This happens even when the iteration breaks early: for example, if only 4 rows need to be checked before similarity exceeds 75%, Django still loads all 90 rows.

I really doubt that the similarity algorithm itself is the issue; it seems to be the queryset that takes too long to load rows. Once they are loaded, the algorithm finishes almost instantly.

Why is this happening? Is there any way to make QuerySet iteration more time-efficient? Would database-level iteration produce much faster results?

P.S. Times were measured with Python's time module.

Upvotes: 1

Views: 1644

Answers (1)

Adam Barnes

Reputation: 3203

The confusion you're facing is caused by Django's deferred evaluation of QuerySets. The 0.5s you measured is actually only Django preparing the SQL; that is, converting the ORM calls into a SQL query (or several SQL queries) that it can execute later.

QuerySets are evaluated as late as possible and then cached, so to find out how long the query actually takes through Django, you'll need to force evaluation of the QuerySet so that Django executes the SQL. You can do this in a couple of ways, such as:

print(objects_found)

or

list(objects_found)

or

for item in objects_found:
    pass
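A rough sketch of why the original timing was misleading, using a plain Python generator as a stand-in for a lazy QuerySet (slow_fetch is a made-up stand-in for the database round-trip, not Django code):

```python
import time

def slow_fetch():
    # Stand-in for the database work: nothing runs until iteration starts.
    for i in range(3):
        time.sleep(0.1)  # pretend each row is expensive to load
        yield i

start = time.perf_counter()
rows = slow_fetch()          # "building the query": effectively instant
built = time.perf_counter() - start

start = time.perf_counter()
result = list(rows)          # forcing evaluation: the real cost shows up here
evaluated = time.perf_counter() - start

print(f"build: {built:.3f}s, evaluate: {evaluated:.3f}s")
```

The same pattern holds in Django: constructing the QuerySet is cheap, and the time you care about only appears once something iterates it.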

Upvotes: 6
