Sort by number of matches on queries based on m2m field

Question

I hope the title is not misleading.

Anyway, I have two models, both have m2m relationships with a third model.

class Model1: keywords = m2m(Keyword)
class Model2: keywords = m2m(Keyword)

Given the keywords for a Model2 instance like this:

keywords2 = model2_instance.keywords.all()

I need to retrieve the Model1 instances which have at least a keyword that is in keywords2, something like:

Model1.objects.filter(keywords__in=keywords2)

and sort them by the number of keywords that match (dont think its possible via 'in' field lookup). Question is, how do i do this?

I'm thinking of just manually interating through each of Model1 instances, appending them to a dictionary of results for every match, but I need this to scale, for say tens of thousands of records. Here is how I imagined it would be like:

result = {}
keywords2_ids = model2.keywords.all().values_list('id',flat=True)
for model1 in Model1.objects.all():
    keywords_matched = model1.keywords.filter(id__in=keywords2_ids).count()
    objs = result.get(str(keywords_matched), [])
    result[str(keywords_matched)] = objs.append(obj)

There must be an faster way to do this. Any ideas?

vim · Accepted Answer

You can just switch to raw SQL. What you have to do is to write a custom manager for Model1 to return the sorted set of ids of Model1 objects based on the keyword match counts. The SQL is simple as joining the two many to many tables(Django automatically creates a table to represent a many to many relationship) on keyword ids and then grouping on Model1 ids for COUNT sql function. Then using an ORDER BY clause on those counts will produce the sorted Model1 id list you need. In MySQL,

SELECT appname_model1_keywords.model1_id, count(*) as match_count FROM appname_model1_keywords
JOIN appname_model2_keywords 
ON (appname_model1_keywords.keyword_id = appname_model2_keywords.keyword_id)
WHERE appname_model2_keywords.model2_id = model2_object_id
GROUP BY appname_model1_keywords.model1_id
ORDER BY match_count

Here model2_object_id is the model2_instance id. This will definitely be faster and more scalable.

Sort by number of matches on queries based on m2m field

Answers (1)

Related Questions