Reputation: 3289
I have very large dataset and growing, and I need to create many filters but it is going to quickly get out of control and was hoping someone can help me combine some of the queries into a single call. Below is the start of my view.
Call #1 - for loop to display table of all results
traffic = Traffic.objects.all()
Call #2 - Combined aggregate sum query
totals = Traffic.objects.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
Call #3, #4, #5, #6 and so on... - To filter the database by month and day of week
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=2).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=3).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=4).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=5).aggregate(Sum('sessions'))
total_sessions_2014_m = Traffic.objects.filter(created__year='2014', created__week_day=6).aggregate(Sum('sessions'))
The problem is , I need to create several dozen more filters because I have 3 years of data with multiple data points per column that we need totals the sum for.
Questions:
As you can see, this is going to get out of control very quickly. Any help would be hugely appreciated. Thank you.
Updated to add Traffic Model
class Timestamp(models.Model):
created = models.DateField()
class Meta:
abstract = True
class Traffic(Timestamp):
sessions = models.IntegerField(blank=True, null=True)
new_users = models.IntegerField(blank=True, null=True)
reminder = models.IntegerField(blank=True, null=True)
campaigns = models.IntegerField(blank=True, null=True)
new_sales = models.IntegerField(blank=True, null=True)
sales_renewals = models.IntegerField(blank=True, null=True)
# Meta and String
class Meta:
verbose_name = 'Traffic'
verbose_name_plural = 'Traffic Data'
def __str__(self):
return "%s" % self.created
Upvotes: 12
Views: 9242
Reputation: 8970
There are dozens of ways to optimize your database queries with the Django ORM. As usual, the Django documentation is great and has a good list of them. Here's some quick tips for query optimization:
1) iterator()
If you are accessing the queryset
only once. So for example you can use this as,
traffic = Traffic.objects.all()
for t in traffic.iterator():
...
...
While defining fields of your models
. As the Django documentation says,
This is a number one priority, after you have determined from profiling what indexes should be added. Use Field.db_index or Meta.index_together to add these from Django. Consider adding indexes to fields that you frequently query using filter(), exclude(), order_by(), etc. as indexes may help to speed up lookups.
Hence you can modify your model as,
class Traffic(Timestamp):
sessions = models.IntegerField(blank=True, null=True, db_index=True)
new_users = models.IntegerField(blank=True, null=True, db_index=True)
reminder = models.IntegerField(blank=True, null=True, db_index=True)
campaigns = models.IntegerField(blank=True, null=True, db_index=True)
new_sales = models.IntegerField(blank=True, null=True, db_index=True)
3) prefetch_related()
or select_related()
If you have relations within your models
, using prefetch_related
or select_related
would be a choice. As per the Django documentation,
select_related
works by creating a SQL join
and including the fields of the related object in the SELECT statement. For this reason, select_related
gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a 'many' relationship, select_related
is limited to single-valued relationships - foreign key and one-to-one.
prefetch_related
, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to prefetch
many-to-many and many-to-one objects, which cannot be done using
select_related
, in addition to the foreign key and one-to-one relationships that are supported by select_related
.
select_related
does a join
, prefetch_related
does two separate queries. Using these you can make your queries upto 30% faster.
If your template
design allows you to display results in multiple pages your can use Pagination
.
You also need to understand that the Django Querysets are lazy which means that it won't query the database untill its being used/evaluated. A queryset in Django represents a number of rows in the database, optionally filtered by a query. For example,
traffic = Traffic.objects.all()
The above code doesn’t run any database queries. You can can take the traffic
queryset and apply additional filters, or pass it to a function, and nothing will be sent to the database. This is good, because querying the database is one of the things that significantly slows down web applications. To fetch the data from the database, you need to iterate over the queryset:
for t in traffic.iterator():
print(t.sessions)
Django Debug Toolbar is a configurable set of panels that display various debug information about the current request/response and when clicked, display more details about the panel's content. This includes:
Modifying your code: (remember Querysets are Lazy)
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
total_sessions = totals.get('sessions__sum')
total_new_users = totals.get('new_users__sum')
total_reminder = totals.get('reminder__sum')
total_campaigns = totals.get('campaigns__sum')
total_new_sales = totals.get('new_sales__sum')
total_sales_renewals = totals.get('sales_renewals__sum')
t_2014 = traffic.filter(created__year='2014')
t_sessions_2014_wd2 = t_2014.filter(created__week_day=2).aggregate(Sum('sessions'))
...
...
For Call #1 in template (for loop to display table of all results):
{% for t in traffic.iterator %}
{{ t.sessions }}
...
...
{% endfor %}
Upvotes: 24
Reputation: 9446
Not addressing the questions directly, but I think you should consider a different approach.
Based on my understanding:
There is no need to perform the same queries each time someone requests the view.
Load all the data in one step and perform manipulations inside the view. You can use a library like Pandas and create complicated data sets. The view will be now CPU bound so use a caching system like Redis to avoid recalculating. Invalidate when the data has changed.
Another approach: perform the calculations periodically by using a task queue like Celery and populate Redis.
Upvotes: 2
Reputation: 31
As for Question 1, it shouldn't be a problem to reuse the queryset from the first call.
traffic = Traffic.objects.all()
totals = traffic.aggregate(Sum('sessions'), Sum('new_users'), Sum('reminder'), Sum('campaigns'), Sum('new_sales'), Sum('sales_renewals'))
This should spare you an additional call to the database.
Regarding Question 2, you can again reuse the queryset from the first call, and filter the year, which gives you a new queryset, e.g.
traffic_2014 = traffic.filter(created__year='2014')
You can then continue filtering the days and aggregating with this new queryset, like you did before, or create new querysets for each day, assuming you aggregate multiple attributes every day, thus saving you another dozen database calls.
I hope this helps you.
Upvotes: 3