Reputation: 528
I have written a few simple Rails applications, accessing the database via the ActiveRecord abstraction, so I am afraid that I don't know much about the inner workings of the PostgreSQL engine. However, I am now writing a Rails app that needs to support 100,000+ rows with dynamically updated content, and I am wondering whether I am using the random functions efficiently:
Database migration schema setting:
t.float :attribute1
t.integer :ticks
add_index :mytable, :attribute1
add_index :mytable, :ticks
Basically, I want to have the following random function distribution:
a) a row in the top 10% by attribute1 = 30% chance of being selected
b) a row in the middle 60% (by attribute1) = 50% chance of being selected
c) a row in the lowest 30% (by attribute1) with fewer than 100 ticks = 15% chance of being selected
d) a row in the lowest 30% (by attribute1) with at least X ticks (use X = 100 in this question) = 5% chance of being selected
At the moment I have the following code:
@rand = rand()
if @rand > 0.7
  # a) pick a random offset within the top 10% by attribute1
  @randRowNum = Integer(rand * 0.1 * Mytable.count)
  @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
elsif @rand > 0.2
  # b) pick a random offset within the middle 60% by attribute1
  @randRowNum = Integer((rand * 0.6 + 0.1) * Mytable.count)
  @row = Mytable.find(:first, :offset => @randRowNum, :order => '"attribute1" DESC')
else
  # c)/d) fetch the attribute1 value at the 70% mark, then split on ticks
  @row = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
  unless @row.nil?
    if @rand > 0.05
      @row = Mytable.find(:first, :order => 'RANDOM()',
                          :conditions => ['"attribute1" <= ? AND "ticks" < ?', @row.attribute1, 100])
    else
      @row = Mytable.find(:first, :order => 'RANDOM()',
                          :conditions => ['"attribute1" <= ? AND "ticks" >= ?', @row.attribute1, 100])
    end
  end
end
1) One thing I would like to do is avoid the use of :order => 'RANDOM()', because according to my research it seems that each time it is called, the SQL engine first scans through all the rows and assigns each of them a random value. Hence the use of @randRowNum = Integer(rand * 0.1 * Mytable.count) and the offset by @randRowNum for a) and b). Am I actually improving the efficiency? Is there a better way?
2) Should I be doing the same as in 1) for c) and d)? And by using :conditions => ['"attribute1" <= ? AND "ticks" >= ?', @row.attribute1, 100], am I forcing the SQL engine to scan through all the rows anyway? Is there anything apart from indexing that I can do to improve the efficiency of this call (with as little space/storage overhead as possible)?
3) There is a chance that the total number of entries in Mytable may change between the Mytable.count and Mytable.find calls. I could put the two calls within a transaction, but it seems excessive to lock the entire table just for a read operation (at the moment, I have extra code that falls back to a simple random row selection if I get a nil @row from the above code). Is it possible to move that .count call into a single atomic SQL query in Rails? Or would it have the same efficiency as locking via a transaction in Rails?
4) I have also been reading up on stored procedures in PostgreSQL... but for this particular case, is there any efficiency gain worth moving the code from the ActiveRecord abstraction to a stored procedure?
Many thanks!
P.S. development/deployment environment: Ruby 1.8.7, Rails 2.3.14, PostgreSQL 8 (on Heroku)
Upvotes: 1
Views: 437
Reputation: 33592
Your question seems a bit vague, so correct me if my interpretation is wrong.
If you didn't split (c) and (d), I would just convert the uniform random variable over [0,1) to a biased random variable over [0,1) and use that to select your row offset.
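For instance, a minimal sketch of that idea, ignoring the (c)/(d) split and using the percentages from your question (biased_offset_fraction is just a hypothetical helper), might look like:

# Map a uniform random number onto a biased offset fraction:
# 30% of the time land in the top 10% by attribute1,
# 50% in the middle 60%, and 20% in the bottom 30%.
def biased_offset_fraction
  u = rand
  if u < 0.3
    rand * 0.1        # top 10%
  elsif u < 0.8
    0.1 + rand * 0.6  # middle 60%
  else
    0.7 + rand * 0.3  # bottom 30%
  end
end

offset = Integer(biased_offset_fraction * Mytable.count)
@row = Mytable.find(:first, :offset => offset, :order => '"attribute1" DESC')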
Unfortunately, (c) and (d) are split based on the value of "ticks", which is a different column. That's the hard part, and it also makes the query much less efficient.
After you fetch the value of attribute1 at 70% from the bottom, also fetch the number of (c) rows; something like SELECT COUNT(*) FROM foo WHERE attribute1 <= partition_30 AND ticks < 100. Then use that count to find the offset into either the ticks < 100 or the ticks >= 100 case. (You probably want an index on (attribute1, ticks) or something; which order is best depends on your data.)
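A rough sketch of that counting approach in ActiveRecord might look like the following; partition_30 is just an illustrative name, and handling of empty subsets is omitted:

# Find the attribute1 value that marks the bottom-30% boundary.
boundary = Mytable.find(:first, :offset => Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
partition_30 = boundary.attribute1

# Count the case (c) rows: bottom 30% by attribute1 with fewer than 100 ticks.
c_count = Mytable.count(:conditions => ['"attribute1" <= ? AND "ticks" < ?', partition_30, 100])

# Pick a random offset inside that subset instead of using ORDER BY RANDOM().
if c_count > 0
  @row = Mytable.find(:first,
                      :conditions => ['"attribute1" <= ? AND "ticks" < ?', partition_30, 100],
                      :order      => '"attribute1" ASC',
                      :offset     => rand(c_count))
end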
If the "ticks" threshold is known in advance and doesn't need to change often, you can cache it in a column (BOOL ticks_above_threshold
or whatever) which makes the query much more efficient if you have an index on (ticks_above_threshold, attribute1)
(note the reversal). Of course, every time you change the threshold you need to write to every row.
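For illustration, a migration for that cached flag could look something like this (the column and index names are just placeholders, and the backfill assumes a threshold of 100 ticks):

class AddTicksAboveThreshold < ActiveRecord::Migration
  def self.up
    add_column :mytable, :ticks_above_threshold, :boolean
    # Backfill the flag from the current threshold (100 ticks here).
    execute 'UPDATE mytable SET ticks_above_threshold = (ticks >= 100)'
    # Note the column order: equality column first, then the range column.
    add_index :mytable, [:ticks_above_threshold, :attribute1]
  end

  def self.down
    remove_index :mytable, :column => [:ticks_above_threshold, :attribute1]
    remove_column :mytable, :ticks_above_threshold
  end
end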
(I think you can also use a "materialized view" to avoid cluttering the main table with an extra column, but I'm not sure what the difference in efficiency is.)
There are obviously some efficiency gains possible by using stored procedures. I wouldn't worry about it too much, unless latency to the server is particularly high.
EDIT:
To answer your additional questions:
Indexing (BOOL ticks_above_threshold, attribute1) should work better than (ticks, attribute1) (I may have the order wrong, though) because it lets you compare the first index column for equality. This is important.
Indexes generally use some sort of balanced tree to do a lookup on what is effectively a sorted list. For example, take A4 B2 C3 D1 (ordered by letter, then number) and look up "the number of rows with letter greater than B and number greater than 2". The best you can do is start after the Bs and iterate over the rest of the table. If you instead order by number, then letter, you get 1D 2B 3C 4A. Again, you have to start after the 2s and scan the rest.
If your index is instead on is_greater_than_2, letter, it looks like false,B false,D true,A true,C. You can ask the index for the position of (true, B), which is between true,A and true,C, and count the number of entries from there to the end. Counting the number of items between two index positions is fast.
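As a toy illustration of that argument (not real index code), you can model the index as a sorted array of [flag, letter] pairs and see that the matching rows form one contiguous slice:

# The four example rows, sorted by (is_greater_than_2, letter).
entries = [[false, "B"], [false, "D"], [true, "A"], [true, "C"]]

# "letter > B AND number > 2" becomes: flag == true AND letter > "B",
# which is the contiguous run starting after [true, "B"]:
start = entries.index { |flag, letter| flag == true && letter > "B" }  # => 3
count = entries.length - start                                         # => 1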
Google App Engine's Restrictions on Queries goes one step further:
Inequality Filters Are Allowed on One Property Only
A query may only use inequality filters (<, <=, >=, >, !=) on one property across all of its filters.
...
The query mechanism relies on all results for a query to be adjacent to one another in the index table, to avoid having to scan the entire table for results. A single index table cannot represent multiple inequality filters on multiple properties while maintaining that all results are consecutive in the table.
If none of your other queries benefit from an index on ticks, then yes.
In some cases, it might be faster to index a instead of a,b (or b,a) if the clause including b is almost always true and you're fetching row data (not just getting a COUNT()). For example, if 1000 <= a AND a <= 1010 matches a million rows and b > 100 only fails for two rows, then it might end up being faster to do two extra row lookups than to work with the bigger index.
Upvotes: 1
Reputation: 19889
As long as rows aren't being removed between the call to count() and the call to find(), I wouldn't worry about transactions. Definitely get rid of all the calls ordering by RANDOM(), as there is no way to optimize them. Make sure that attribute1 has an index on it. I haven't tested it, but something like this should be pretty quick:
total_rows = Mytable.count
r = rand()
if r > 0.7    # 90-100%
  lower = total_rows * 0.9
  upper = total_rows
elsif r > 0.2 # 30-89%
  lower = total_rows * 0.3
  upper = total_rows * 0.89
else          # 0-29%
  lower = 0
  upper = total_rows * 0.29
end
offset = [lower + (upper - lower) * rand(), total_rows - 1].min.to_i
@row = Mytable.find(:first, :offset => offset, :order => 'attribute1 ASC')
Upvotes: 0