Reputation: 303
It said that
Record.objects.order_by('?')[:n]
have performance issues, and recommend doing something like this: (here)
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
Since that, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea about when these code running what is django actually doing in background. Please tell me the one-line-code at last is more efficient or not? why?
================Edit 2013-5-12 23:21 UCT+8 ========================
I spent my whole afternoon to do this test.
My computer : CPU Intel i5-3210M RAM 8G
System : Win8.1 pro x64 Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12) Python2.7.5 Django1.4.6
What I did was:
CharField
content, then Syncdb
. settings.py
that Django
can output log into console. This is views.py
:
def test1(request):
start = datetime.datetime.now()
result = Record.objects.order_by('?')[:20]
l = list(result) # Queryset是惰性的,强制将Queryset转为list
end = datetime.datetime.now()
return HttpResponse("time: <br/> %s" % (end-start).microseconds/1000))
def test2(request):
start = datetime.datetime.now()
sample = random.sample(xrange(Record.objects.count()),20)
result = [Record.objects.all()[i] for i in sample]
l = list(result)
end = datetime.datetime.now()
return HttpResponse("time: <br/> %s" % (end-start)
def test3(request):
start = datetime.datetime.now()
result = random.sample(Record.objects.all(),20)
l = list(result)
end = datetime.datetime.now()
return HttpResponse("time: <br/> %s" % (end-start)
As @Yeo said,result = random.sample(Record.objects.all(),n)
is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n]
always better then others, especially the table smaller then 1m rows. Here is the data:
and the charts:
So, what's happened?
In the last test, 5,195,536 rows in tatget table, result = random.sample(Record.objects.all(),n)
actually did ths:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Every one is right. And it used 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did ths:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you see, get one row, cost 3 seconds. I find that the larger index, the more time needed.
But... why?
My think is:
If there is some way can speed up the large index query,
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
should be the best. Except(!) the table is smaller then 1m rows.
Upvotes: 1
Views: 1128
Reputation: 56517
The problem with .order_by(?)
is that under the hood it does ORDER BY RAND()
(or equivalent, depending on DB) which basically has to create a random number for each row and do the sorting. This is a heavy operation and requires lots of time.
On the other hand doing Record.objects.all()
forces your app to download all objects and then you choose from it. It is not that heavy on the database side (it will be faster then sorting) but it is heavy on network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i]
gets translated to SELECT * FROM table LIMIT 1 OFFSET i
, depending on DB).
However it may still be inefficient since .count
might be slow (as usual: depends on DB).
Upvotes: 2
Reputation: 11794
Record.objects.count()
gets translated into very light SQL Query.
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0]
is also translated into a very light SQL Query.
SELECT * FROM TABLE LIMIT 1
Record.objects.all()
usually the results get slice off to increase the performance
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all())
will query all the data and put it into a list data structure.
SELECT * FROM TABLE
Thus, any time you convert a Queryset into a list, that's where the expensive happened
In your example, random.sample()
will convert into a list. (If I'm not wrong).
Thus when you do result = random.sample(Record.objects.all(),n)
it will do the Full Queryset and convert into a list and then random pick the list.
Just imagine if you have millions of records. Are you going to query and store it into a list with millions element? or would you rather query one by one
Upvotes: 1