Reputation: 439
I have a Track model and currently I do a nested loop through the models id to get the pair and then pass it into a function to count the similarity between these two non-equivalent track object
track_set = Track.objects.all()
track_ids = [track.id for track in track_set]
pointer_a = 0
pointer_b = 1
for pointer_a in range(len(track_ids) - 1):
for pointer_b in range(pointer_a + 1, len(track_ids)):
track_a = Track.objects.get(pk=track_ids[pointer_a])
track_b = Track.objects.get(pk=track_ids[pointer_b])
counter += 1
count_it_sim(track_a, track_b)
I think the way I get the object is not very efficient, is there any way to optimize it ?
Edit: this count_it_sim
will calculate the similarity value between track_a and track_b, and I need to calculate it for all the pairs in Track model
models.py
class Tag(models.Model):
name = models.CharField(max_length=255, unique=True)
class Tagged(models.Model):
track = models.ForeignKey(Track, on_delete=models.CASCADE)
tag = models.ForeignKey(Tag, on_delete=models.CASCADE)
frequency = models.IntegerField(
default=0,
validators=[MinValueValidator(0)],
)
class Meta:
unique_together = ('track', 'tag')
class Track(models.Model):
track_id = models.CharField(max_length=24)
title = models.CharField(max_length=120)
link = models.URLField(max_length=120, blank=True)
tags = models.ManyToManyField(Tag, through='Tagged', blank=True)
similarity = models.ManyToManyField(
'self',
blank=True,
through='Similar',
related_name='similar_to',
symmetrical=False
)
users = models.ManyToManyField(User, through='PlayTrack', blank=True)
class Similar(models.Model):
track1 = models.ForeignKey(Track, on_delete=models.CASCADE, related_name='track1')
track2 = models.ForeignKey(Track, on_delete=models.CASCADE, related_name='track2')
similarity = models.FloatField(
validators=[MinValueValidator(0), MaxValueValidator(1)],
)
and what count_it_sim will do is, it will get all tag's frequency of track_a and track_b through a associative entity, which is Tagged model, and calculate it to get the similarity value between track_a and track_b
def count_it_sim(track_a: Track, track_b: Track):
tag_set = Tag.objects.all()
part1 = 0
part2 = 0
part3 = 0
for tag in tag_set:
try:
freq_tag_of_track_a = Tagged.objects.get(track=track_a, tag=tag).frequency
except Tagged.DoesNotExist:
continue
try:
freq_tag_of_track_b = Tagged.objects.get(track=track_b, tag=tag).frequency
except Tagged.DoesNotExist:
continue
part1 += freq_tag_of_track_a * freq_tag_of_track_b
part2 += freq_tag_of_track_a ** 2
part3 += freq_tag_of_track_b ** 2
try:
it_sim = part1 / (math.sqrt(part2) * math.sqrt(part3))
except ZeroDivisionError:
it_sim = None
Edit 2: On count_it_sim, instead of iterate through all of the tags from Tag.objects.all()
, I query only those tags which exist in Tagged
and the result is way faster than the previous one, here is my current code now
def count_it_sim(track_a: Track, track_b: Track):
filtered_tagged = Tagged.objects.filter(Q(track=track_a) | Q(track=track_b))
tag_ids = filtered_tagged.values_list('tag', flat=True).distinct()
part1 = 0
part2 = 0
part3 = 0
for tag_id in tag_ids:
try:
freq_tag_of_track_a = filtered_tagged.get(track=track_a, tag__id=tag_id).frequency
except Tagged.DoesNotExist:
freq_tag_of_track_a = 0
try:
freq_tag_of_track_b = filtered_tagged.get(track=track_b, tag__id=tag_id).frequency
except Tagged.DoesNotExist:
freq_tag_of_track_b = 0
part1 += freq_tag_of_track_a * freq_tag_of_track_b
part2 += freq_tag_of_track_a ** 2
part3 += freq_tag_of_track_b ** 2
try:
it_sim = part1 / (math.sqrt(part2) * math.sqrt(part3))
except ZeroDivisionError:
it_sim = None
Edit 3: There are some changes in the models. Instead of storing the frequency
of each tag
of a track
, now the frequency
will be calculated by counting how many users
have tag a track
with a particular tag
. Here is the update
# models.py
...
class Tagged(models.Model):
user = models.ForeignKey(User, on_delete=models.CASCADE, default=1)
track = models.ForeignKey(Track, on_delete=models.CASCADE)
tag = models.ForeignKey(Tag, on_delete=models.CASCADE)
...
and the count_it_sim function become
def count_it_sim(track_a: Track, track_b: Track):
filtered_tagged = Tagged.objects.filter(Q(track=track_a) | Q(track=track_b))
tag_ids = filtered_tagged.values_list('tag', flat=True).distinct()
part1 = 0
part2 = 0
part3 = 0
for tag_id in tag_ids:
try:
freq_tag_of_track_a = filtered_tagged.filter(track=track_a, tag__id=tag_id).count() # UPDATED LINE
except Tagged.DoesNotExist:
freq_tag_of_track_a = 0
try:
freq_tag_of_track_b = filtered_tagged.filter(track=track_b, tag__id=tag_id).count() #UPDATED LINE
except Tagged.DoesNotExist:
freq_tag_of_track_b = 0
part1 = accumulate(part1, freq_tag_of_track_a * freq_tag_of_track_b)
part2 = accumulate(part2, freq_tag_of_track_a ** 2)
part3 = accumulate(part3, freq_tag_of_track_b ** 2)
try:
it_sim = final_calc(part1, part2, part3)
except ZeroDivisionError:
it_sim = None
return it_sim
Upvotes: 0
Views: 420
Reputation: 9636
As per our discussion in the comments, the aim is to get the the needed data using the least number of database queries, in a structure that will easily allow further data manipulation.
We will try to fetch all the Tagged.frequency
data that correspond to each Track
in the following dictionary structure:
{
track_id1: list(all_related_frequencies),
track_id2: list(all_related_frequencies),
}
The idea is to get all the Track
instances and prefetch their related Tagged
instances in an attribute called prefetched_tagged
.
Each instance of Track
has an attribute tagged_set
which allows reverse access to all the Tagged
instances related to it. This is the query that prefetches them:
from django.db.models import Prefetch
the_data = (
Track.objects.prefetch_related(
Prefetch(
'tagged_set',
to_attr='prefetched_tagged',
)
).all()
)
Now, we saved in the_data
variable, all instances of Track
where each instance has an attribute prefetched_tagged
that contains a list
of all Tagged
instances related to each Track
instance.
With the following dictionary comprehension, we iterate the_data
variable to create a dictionary with all Track.track_id
as keys. Each key, will have a list
as its value wich will contain all its related Tagged.frequency
.
To create the list, we will use a list comprehension, inside the dictionary comprehension:
result = {
each_track.track_id: [
each_tagged.frequency for each_tagged in each_track.prefetched_tagged
] for each_track in the_data
}
Now, the variable result
contains the data in the structure we need in order to further manipulate them. It took 2 database hits in order to load all the database data into memory. This is the whole code that also measures the database hits:
from django import db
from django.db.models import Prefetch
db.reset_queries()
the_data = (
Track.objects.prefetch_related(
Prefetch(
'tagged_set',
to_attr='prefetched_tagged',
)
).all()
)
result = {
each_track.track_id: [
each_tagged.frequency for each_tagged in each_track.prefetched_tagged
] for each_track in the_data
}
print(result)
print ("Queries Used: {0}".format(len(db.connection.queries))
Upvotes: 1
Reputation: 174682
I think the way I get the object is not very efficient, is there any way to optimize it ?
Well, you already have the objects (in track_set
) so you don't need to fetch them again; you just need to get pairs of objects.
What I need is (1,2) (1,3) (1,4) (2,3) (2,4) (3,4)
For that, you can use itertools.combinations
:
import itertools
for a,b in itertools.combinations(track_set, 2):
count_it_sim(a, b)
You will have to make sure you fetch the objects from the database in the right order; because there is no guarantee on how the items will be returned:
If a query doesn’t have an ordering specified, results are returned from the database in an unspecified order. A particular ordering is guaranteed only when ordering by a set of fields that uniquely identify each object in the results.
In your case, it seems you need them in primary key order; so I would modify the initial query to:
track_set = Track.objects.order_by('pk')
The queryset documentation has details on order_by
and the model reference has details on specifying the default ordering.
Upvotes: 3