Edwin Harly
Edwin Harly

Reputation: 439

How do I query a pair of objects from a single Django Model efficiently?

I have a Track model and currently I do a nested loop through the models id to get the pair and then pass it into a function to count the similarity between these two non-equivalent track object

track_set = Track.objects.all()

track_ids = [track.id for track in track_set]
pointer_a = 0
pointer_b = 1

for pointer_a in range(len(track_ids) - 1):
    for pointer_b in range(pointer_a + 1, len(track_ids)):
        track_a = Track.objects.get(pk=track_ids[pointer_a])
        track_b = Track.objects.get(pk=track_ids[pointer_b])
        counter += 1
        count_it_sim(track_a, track_b)

I think the way I get the object is not very efficient, is there any way to optimize it ?

Edit: this count_it_sim will calculate the similarity value between track_a and track_b, and I need to calculate it for all the pairs in Track model
models.py

class Tag(models.Model):
    name = models.CharField(max_length=255, unique=True)

class Tagged(models.Model):
    track = models.ForeignKey(Track, on_delete=models.CASCADE)
    tag = models.ForeignKey(Tag, on_delete=models.CASCADE)
    frequency = models.IntegerField(
        default=0,
        validators=[MinValueValidator(0)],
    )
    class Meta:
        unique_together = ('track', 'tag')

class Track(models.Model):
    track_id = models.CharField(max_length=24)
    title = models.CharField(max_length=120)
    link = models.URLField(max_length=120, blank=True)
    tags = models.ManyToManyField(Tag, through='Tagged', blank=True)
    similarity = models.ManyToManyField(
        'self',
        blank=True,
        through='Similar',
        related_name='similar_to',
        symmetrical=False
    )
    users = models.ManyToManyField(User, through='PlayTrack', blank=True)

class Similar(models.Model):
    track1 = models.ForeignKey(Track, on_delete=models.CASCADE, related_name='track1')
    track2 = models.ForeignKey(Track, on_delete=models.CASCADE, related_name='track2')
    similarity = models.FloatField(
        validators=[MinValueValidator(0), MaxValueValidator(1)],
    )

and what count_it_sim will do is, it will get all tag's frequency of track_a and track_b through a associative entity, which is Tagged model, and calculate it to get the similarity value between track_a and track_b

def count_it_sim(track_a: Track, track_b: Track):
    tag_set = Tag.objects.all()
    part1 = 0
    part2 = 0
    part3 = 0
    for tag in tag_set:
        try:
            freq_tag_of_track_a = Tagged.objects.get(track=track_a, tag=tag).frequency
        except Tagged.DoesNotExist:
            continue
        try:
            freq_tag_of_track_b = Tagged.objects.get(track=track_b, tag=tag).frequency
        except Tagged.DoesNotExist:
            continue
        part1 += freq_tag_of_track_a * freq_tag_of_track_b
        part2 += freq_tag_of_track_a ** 2
        part3 += freq_tag_of_track_b ** 2
    try:
        it_sim = part1 / (math.sqrt(part2) * math.sqrt(part3))
    except ZeroDivisionError:
        it_sim = None

Edit 2: On count_it_sim, instead of iterate through all of the tags from Tag.objects.all(), I query only those tags which exist in Tagged and the result is way faster than the previous one, here is my current code now

def count_it_sim(track_a: Track, track_b: Track):
    filtered_tagged = Tagged.objects.filter(Q(track=track_a) | Q(track=track_b))
    tag_ids = filtered_tagged.values_list('tag', flat=True).distinct()
    part1 = 0
    part2 = 0
    part3 = 0
    for tag_id in tag_ids:
        try:
            freq_tag_of_track_a = filtered_tagged.get(track=track_a, tag__id=tag_id).frequency
        except Tagged.DoesNotExist:
            freq_tag_of_track_a = 0
        try:
            freq_tag_of_track_b = filtered_tagged.get(track=track_b, tag__id=tag_id).frequency
        except Tagged.DoesNotExist:
            freq_tag_of_track_b = 0
        part1 += freq_tag_of_track_a * freq_tag_of_track_b
        part2 += freq_tag_of_track_a ** 2
        part3 += freq_tag_of_track_b ** 2
    try:
        it_sim = part1 / (math.sqrt(part2) * math.sqrt(part3))
    except ZeroDivisionError:
        it_sim = None

Edit 3: There are some changes in the models. Instead of storing the frequency of each tag of a track, now the frequency will be calculated by counting how many users have tag a track with a particular tag. Here is the update

# models.py
...
class Tagged(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE, default=1)
    track = models.ForeignKey(Track, on_delete=models.CASCADE)
    tag = models.ForeignKey(Tag, on_delete=models.CASCADE)
...

and the count_it_sim function become

def count_it_sim(track_a: Track, track_b: Track):
    filtered_tagged = Tagged.objects.filter(Q(track=track_a) | Q(track=track_b))
    tag_ids = filtered_tagged.values_list('tag', flat=True).distinct()
    part1 = 0
    part2 = 0
    part3 = 0
    for tag_id in tag_ids:
        try:
            freq_tag_of_track_a = filtered_tagged.filter(track=track_a, tag__id=tag_id).count()  # UPDATED LINE
        except Tagged.DoesNotExist:
            freq_tag_of_track_a = 0

        try:
            freq_tag_of_track_b = filtered_tagged.filter(track=track_b, tag__id=tag_id).count()  #UPDATED LINE
        except Tagged.DoesNotExist:
            freq_tag_of_track_b = 0
        part1 = accumulate(part1, freq_tag_of_track_a * freq_tag_of_track_b)
        part2 = accumulate(part2, freq_tag_of_track_a ** 2)
        part3 = accumulate(part3, freq_tag_of_track_b ** 2)
    try:
        it_sim = final_calc(part1, part2, part3)
    except ZeroDivisionError:
        it_sim = None
    return it_sim

Upvotes: 0

Views: 420

Answers (2)

raratiru
raratiru

Reputation: 9636

As per our discussion in the comments, the aim is to get the the needed data using the least number of database queries, in a structure that will easily allow further data manipulation.

We will try to fetch all the Tagged.frequency data that correspond to each Track in the following dictionary structure:

{
    track_id1: list(all_related_frequencies), 
    track_id2: list(all_related_frequencies),
}

The idea is to get all the Track instances and prefetch their related Tagged instances in an attribute called prefetched_tagged.

Each instance of Track has an attribute tagged_set which allows reverse access to all the Tagged instances related to it. This is the query that prefetches them:

from django.db.models import Prefetch

the_data = (
    Track.objects.prefetch_related(
        Prefetch(
            'tagged_set',
            to_attr='prefetched_tagged',
        )
    ).all()
)

Now, we saved in the_data variable, all instances of Track where each instance has an attribute prefetched_tagged that contains a list of all Tagged instances related to each Track instance.

With the following dictionary comprehension, we iterate the_data variable to create a dictionary with all Track.track_id as keys. Each key, will have a list as its value wich will contain all its related Tagged.frequency.

To create the list, we will use a list comprehension, inside the dictionary comprehension:

result = {
    each_track.track_id: [
        each_tagged.frequency for each_tagged in each_track.prefetched_tagged
    ] for each_track in the_data
}

Now, the variable result contains the data in the structure we need in order to further manipulate them. It took 2 database hits in order to load all the database data into memory. This is the whole code that also measures the database hits:

from django import db
from django.db.models import Prefetch

db.reset_queries()    

the_data = (
    Track.objects.prefetch_related(
        Prefetch(
            'tagged_set',
            to_attr='prefetched_tagged',
        )
    ).all()
)

result = {
    each_track.track_id: [
        each_tagged.frequency for each_tagged in each_track.prefetched_tagged
    ] for each_track in the_data
}

print(result)
print ("Queries Used: {0}".format(len(db.connection.queries))

Upvotes: 1

Burhan Khalid
Burhan Khalid

Reputation: 174682

I think the way I get the object is not very efficient, is there any way to optimize it ?

Well, you already have the objects (in track_set) so you don't need to fetch them again; you just need to get pairs of objects.

What I need is (1,2) (1,3) (1,4) (2,3) (2,4) (3,4)

For that, you can use itertools.combinations:

import itertools

for a,b in itertools.combinations(track_set, 2):
   count_it_sim(a, b)

You will have to make sure you fetch the objects from the database in the right order; because there is no guarantee on how the items will be returned:

If a query doesn’t have an ordering specified, results are returned from the database in an unspecified order. A particular ordering is guaranteed only when ordering by a set of fields that uniquely identify each object in the results.

In your case, it seems you need them in primary key order; so I would modify the initial query to:

track_set = Track.objects.order_by('pk')

The queryset documentation has details on order_by and the model reference has details on specifying the default ordering.

Upvotes: 3

Related Questions