Reputation: 80
EDIT: Best solution, thanks to Hakan:
queriedForms.filter(
    pk__in=list(
        formtype.form_set.all().filter(
            formrecordattributevalue__record_value__contains=constraint['TVAL'],
            formrecordattributevalue__record_attribute_type__pk=rtypePK,
        ).values_list('pk', flat=True)
    )
).count()
I tried more of his suggestions, but I can't avoid an INNER JOIN--this seems to be a stable solution that does get me small but predictable speed increases across the board. Look through his answer for more details!
I've been struggling with a problem I haven't seen an answer to online.
When chaining two filters in Django, e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
masterQuery.count()
#returns 100,000 results in < 1 second
#test filter--all 100,000+ names have "test x" where x is 0-9
storedCount = masterQuery.filter(name__contains="9").count()
#returns ~50,000 results but takes 5-6 seconds
Trying a slightly different way:
masterQuery = masterQuery.filter(name__contains="9")
masterQuery.count()
#also returns ~50,000 results in 5-6 seconds
Performing an & merge seems to ever so slightly improve performance, e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
(masterQuery & masterQuery.filter(name__contains="9")).count()
It seems as if count() takes significantly longer once the queryset has more than a single filter.
I assume it may have something to do with MySQL, which apparently doesn't handle nested statements well--and I assume that the two filters are creating a nested query that slows MySQL down, regardless of the SELECT COUNT(*) Django uses.
So my question is: is there any way to speed this up? I'm getting ready to do a lot of regular nested querying using only queryset counts (I don't need the actual model values, and I want to avoid database hits to load the models). E.g. I don't need to load 100,000 models from the database; I just need to know there are 100,000 there. It's obviously much faster to do this through querysets than with len(), but even at 5 secs per count, running the 40 counts an entire complex query needs takes 3+ minutes--I'd prefer it to be under a minute. Am I just fantasizing, or does someone have a suggestion as to how this could be accomplished outside of increasing the server's processor speed?
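For reference, the SQL that the chained filters generate can be inspected without running the query--here's a quick sketch against the example above (connection.queries additionally requires DEBUG=True):
from django.db import connection
qs = bigmodel.relatedmodel_set.filter(name__contains="test").filter(name__contains="9")
#str(qs.query) compiles the queryset to SQL without executing it
print(str(qs.query))
#with DEBUG=True, connection.queries records each executed statement and its time
qs.count()
print(connection.queries[-1])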
EDIT: If it's helpful--the time.clock() reading is .3 secs for the chained filter() count, but the actual time to console and Django view output is 5-6s.
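(A likely explanation for that gap: on Unix, time.clock() measures CPU time rather than wall-clock time, so a count that spends most of its time waiting on MySQL looks artificially fast. A wall-clock sketch:)
import time
start = time.time()  #wall clock; time.clock() is CPU time on Unix
masterQuery.filter(name__contains="9").count()
print(time.time() - start)  #expected to reflect the full 5-6s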
EDIT 2: To answer any questions about indexing: the filters use both an indexed and a non-indexed value for each link in the chain:
mainQuery = bigmodel.relatedmodel_set.all()
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Where "record_attribute_type" is another foreign key being used as a filter
mainQuery.count() #produces 100,000 results in < 1sec
mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="9", reverseforeignkeytestmodel__record_attribute_type__pk=5).count()
#produces ~50,000 results in 5-6 secs
So each filter in the chain is functionally similar: it is an AND filter(condition, condition) where one condition is indexed and the other is not. I can't index both conditions.
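For illustration only, here is a hypothetical reconstruction of the model shape implied by the lookups above (all names are made up):
from django.db import models

class ReverseForeignKeyTestModel(models.Model):
    #the reverse FK that the chained lookups traverse
    relatedmodel = models.ForeignKey('RelatedModel', on_delete=models.CASCADE)
    #indexed condition: foreign key columns get an index by default
    record_attribute_type = models.ForeignKey('RecordAttributeType', on_delete=models.CASCADE)
    #non-indexed condition: __contains becomes LIKE '%...%', which MySQL
    #can't serve from a B-tree index anyway
    record_value = models.TextField()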
EDIT 3: Similar queries with smaller result sets, e.g. < 10,000, are much faster regardless of the nesting--e.g. the first filter in the chain produces 10,000 results in under ~1 sec, and the second filter in the chain produces 5,000 results in under ~1 sec.
EDIT 4: Still not working based on @Hakan's solution:
mainQuery = bigmodel.relatedmodel_set.all()
#Set up the first filter as normal
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Grab a values list for the second chained filter instead of chaining it
values = bigmodel.relatedmodel_set.all().filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=8).values_list('pk', flat=True)
#filter the first query based on the values_list rather than a second filter
mainQuery = mainQuery.filter(pk__in=values)
mainQuery.count()
#Still takes on average the same amount of time after enough test runs--seems to be slightly faster than average--similar to the (querysetA & querysetB) merge solution I tried.
It's possible I did this wrong--but the count results are consistent between the new values_list filter technique and the old chained filters, i.e. I'm getting the same # of results. So it's definitely working--but it seemingly takes the same amount of time.
EDIT 5: Also based on @Hakan's solution, with some slight tweaks:
mainQuery.filter(
    pk__in=list(
        formtype.form_set.all().filter(
            formrecordattributevalue__record_value__contains=constraint['TVAL'],
            formrecordattributevalue__record_attribute_type__pk=rtypePK,
        ).values_list('pk', flat=True)
    )
).count()
This seems to operate faster for larger result sets, e.g. > 50,000, but is actually much slower on smaller result sets, e.g. < 50,000. Where chained filters used to take < 1 sec (sometimes 2-3 of them running within a single second), each count now takes a full second individually. Essentially, the speed gains on the larger querysets have been nullified by the speed loss on the smaller ones.
I'm still going to try to break up the queries further, as per his suggestion--but I'm not sure I'm able to. I'll update again (possibly on Monday) when I figure that out and let everyone interested know the progress.
Upvotes: 4
Views: 8634
Reputation: 23074
Not sure if this helps, since I don't have a MySQL project to test with.
The QuerySet API reference contains a section about the performance of nested queries.
Performance considerations
Be cautious about using nested queries and understand your database server’s performance characteristics (if in doubt, benchmark!). Some database backends, most notably MySQL, don’t optimize nested queries very well. It is more efficient, in those cases, to extract a list of values and then pass that into the second query. That is, execute two queries instead of one:
values = Blog.objects.filter(name__contains='Cheddar').values_list('pk', flat=True)
entries = Entry.objects.filter(blog__in=list(values))
Note the list() call around the Blog QuerySet to force execution of the first query. Without it, a nested query would be executed, because QuerySets are lazy.
So, maybe you can improve the performance by trying something like this:
masterQuery = bigmodel.relatedmodel_set.all()
pks = list(masterQuery.filter(name__contains="test").values_list('pk', flat=True))
count = masterQuery.filter(pk__in=pks, name__contains="9").count()
Since your initial MySQL performance is so slow, it might even be faster to do the second step in Python instead of in the database.
names = masterQuery.filter(name__contains='test').values_list('name', flat=True)  #flat=True so each item is a string, not a 1-tuple
count = sum('9' in n for n in names)
Edit:
From your updates, I see that you are querying fields in related models, which results in multiple SQL JOIN operations. That's likely a big reason why the query is slow.
To avoid joins, you could try something like this. The goal is to avoid doing deeply chained lookups across relations.
# query only RelatedModel, avoid JOIN
related_pks = RelatedModel.objects.filter(
record_value__contains=constraint['TVAL'],
record_attribute_type=rtypePK,
).values_list('pk', flat=True)
# list(queryset) will do a database query, resulting in a list of integers.
pks_list = list(related_pks)
# use that result to filter your main model.
count = MainModel.objects.filter(
formrecordattributevalue__in=pks_list
).count()
I'm assuming that the relation is defined as a foreign key from MainModel to RelatedModel.
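To see where the time actually goes, each variant can be wrapped in Django's CaptureQueriesContext, which records the SQL and per-query time--a sketch reusing the hypothetical names above:
from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    count = MainModel.objects.filter(
        formrecordattributevalue__in=pks_list,
    ).count()

#each captured entry is a dict with 'sql' and 'time' keys
for query in ctx.captured_queries:
    print(query['time'], query['sql'])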
Upvotes: 2