Reputation: 991
I am working with Django to see how to handle large databases. I use a database with fields name, age, date of birth(dob) and height. The database has about 500000 entries. I have to find the average height of persons of (1) same age and (2) born in same year. The aggregate function in querying table takes about 10s. Is it usual or am I missing something?
For age:
age = [i[0] for i in Data.objects.values_list('age').distinct()]
ht = []
for each in age:
aggr = Data.objects.filter(age=each).aggregate(ag_ht=Avg('height')
ht.append(aggr)
From dob,
age = [i[0].year for i in Data.objects.values_list('dob').distinct()]
for each in age:
aggr = Data.objects.filter(dob__contains=each).aggregate(ag_ht=Avg('height')
ht.append(aggr)
The year has to be extracted from dob. It is SQLite and I cannot use __year (join).
Upvotes: 0
Views: 1648
Reputation:
full version with time compare loop and query set version
import time
from dd.models import Data
from django.db.models import Avg
from django.db.models.functions import ExtractYear
for age
start = time.time()
age = [i[0] for i in Data.objects.values_list('age').distinct()]
ht = []
for each in age:
aggr = Data.objects.filter(age=each).aggregate(ag_ht=Avg('height'))
ht.append(aggr)
end = time.time()
loop_time = end - start
start = time.time()
qs = Data.objects.values('age').annotate(ag_ht=Avg('height')).order_by('age')
ht_qs = qs.values_list('age', 'ag_ht')
end = time.time()
qs_time = end - start
print loop_time / qs_time
for dob year, with easy refactoring your version(add set in the years)
start = time.time()
years = set([i[0].year for i in Data.objects.values_list('dob').distinct()])
ht_year_loop = []
for each in years:
aggr = Data.objects.filter(dob__contains=each).aggregate(ag_ht=Avg('height'))
ht_year_loop.append((each, aggr.get('ag_ht')))
end = time.time()
loop_time = end - start
start = time.time()
qs = Data.objects.annotate(dob_year=ExtractYear('dob')).values('dob_year').annotate(ag_ht=Avg('height'))
ht_qs = qs.values_list('dob_year', 'ag_ht')
end = time.time()
qs_time = end - start
print loop_time / qs_time
Upvotes: 0
Reputation: 180290
For these queries to be efficient, you have to create indexes on the age
and dob
columns.
You will get a small additional speedup by using covering indexes, i.e., using two-column indexes that also include the height
column.
Upvotes: 1