I have a website where people can vote on cars. 4 cars are showed to the user and he/she can vote on the car they like most. The table cars has the important columns: car_id int(10) (not auto_increment, so has gaps) views int(7) points int(7) car_type int(1) (value = 1, 2 or 3) At the moment I use a mapping table for all car_types, which has a PK which has no gaps. I select the max ID of the mapping table and create 4 random numbers (PHP), select those rows from mapping and get the corresponding car_id's. These numbers I use to select the cars from cars table. The problem is that cars added later to the database have less chance to get on the same points as cars added earlier. My question is how to show 4 cars which have the same amount of points(random) ordered by the least number of views (views asc). Also the important notes: The select should only query cars with at least 1 point. The database will contain more than 30M cars, it's not about cars but for the question I think it's easier :). When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points. The query will be used to show 4 cars to the visitor, we all know users are impatient so the faster the query the better :) I could (if needed) use a mapping table, which will have no gaps in the PK (as I have now). Cars only within a certain car_type will be showed. Example, 4 randoms of car type 2 (which is family cars), since I don't want to show sport cars and family cars at same time. If you know another solution to solve the problem above, I'm willing to accept all kind of solutions (PHP/SQL). Bounty because it's a bigger question (/answer) than the average Stackoverflow question. Bounty will be rewarded to person which describes the solution or (preferred) the code of the solution. Anyways it's my way to thank the persons helping me and to ensure I appreciate your help greatly. UPDATE: Thanks for all the answers so far! You're all right in your answers. I did think a lot about it last few hours and I started to realize that databases were actually never build for things like this (showing random data), it was created to show precise and accurate data with fast access. That's why selects on PK's with 30M rows or more are still very fast. Thats why I'm thinking about doing all the random stuff in PHP. So I generate in PHP like 40 random numbers and select those 40 rows from the mapping table of the right car type. This select with IN is really fast (like 0.0006 seconds). After this select I got 40 car_ids which I also select with IN from the cars table. I loop the cars and put them in an array and do some custom sorts (based on points and views). After this I select a random number from all the points within the 40 cars and grab the cars from the array closest to this number of points and with the least views. This way PHP takes care of the randomness and the views part and the queries because you ask for precise data are very fast (0.0006 seconds each).

Reputation: 311

SQL Random rows in a big table (with where clause)

I have a website where people can vote on cars. 4 cars are showed to the user and he/she can vote on the car they like most.

The table cars has the important columns:

car_id   int(10) (not auto_increment, so has gaps)
views    int(7)
points   int(7)
car_type int(1) (value = 1, 2 or 3)

At the moment I use a mapping table for all car_types, which has a PK which has no gaps. I select the max ID of the mapping table and create 4 random numbers (PHP), select those rows from mapping and get the corresponding car_id's. These numbers I use to select the cars from cars table.

The problem is that cars added later to the database have less chance to get on the same points as cars added earlier.

My question is how to show 4 cars which have the same amount of points(random) ordered by the least number of views (views asc). Also the important notes:

The select should only query cars with at least 1 point.
The database will contain more than 30M cars, it's not about cars but for the question I think it's easier :).
When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.
The query will be used to show 4 cars to the visitor, we all know users are impatient so the faster the query the better :)
I could (if needed) use a mapping table, which will have no gaps in the PK (as I have now).
Cars only within a certain car_type will be showed. Example, 4 randoms of car type 2 (which is family cars), since I don't want to show sport cars and family cars at same time.

If you know another solution to solve the problem above, I'm willing to accept all kind of solutions (PHP/SQL).

Bounty because it's a bigger question (/answer) than the average Stackoverflow question. Bounty will be rewarded to person which describes the solution or (preferred) the code of the solution. Anyways it's my way to thank the persons helping me and to ensure I appreciate your help greatly.

UPDATE:

Thanks for all the answers so far! You're all right in your answers. I did think a lot about it last few hours and I started to realize that databases were actually never build for things like this (showing random data), it was created to show precise and accurate data with fast access. That's why selects on PK's with 30M rows or more are still very fast. Thats why I'm thinking about doing all the random stuff in PHP. So I generate in PHP like 40 random numbers and select those 40 rows from the mapping table of the right car type. This select with IN is really fast (like 0.0006 seconds). After this select I got 40 car_ids which I also select with IN from the cars table. I loop the cars and put them in an array and do some custom sorts (based on points and views). After this I select a random number from all the points within the 40 cars and grab the cars from the array closest to this number of points and with the least views. This way PHP takes care of the randomness and the views part and the queries because you ask for precise data are very fast (0.0006 seconds each).

Upvotes: 3

Answers (7)

Ali Avcı

Reputation: 870

Oracle uses Rownum to represent a fake ID that has no gaps.

Select rownum, c.* from Cars where points > 1

will give you a resultset that has an ordered ID.

With that given hint, you can simulate the same thing with the following mysqlCode referred here

SELECT @rownum:=@rownum+1 rownum, c.* 
  FROM (SELECT @rownum:=0) r, Cars c where points > 1;

Now having the expected result set similar to as in oracle you can select any 4 rows by using subselects. Assuming you know the rowcount of cars that have more than one point.

select * from 
 (the above select)
where rownum in (Random[0], Random[1], Random[2], Random[3])

or any way you want to construct the query.

This won't move the whole data in between mysql and php, but it will put some pressure on mysql.

Upvotes: 0

mproffitt

Reputation: 2527

I think you can do this in vanilla sql referencing a post from http://www.kahunaburger.com/2008/10/13/selecting-random-weighted-records-from-mysql/ which I traced via this link: MySQL: Select Random Entry, but Weight Towards Certain Entries

The difficult bit with your question is the probability and that post posed the same problem. The solution is given in the comments as being ORDER BY -LOG(1.0 - RAND()) / weighting.

SELECT * FROM cars
WHERE points >= 1
    AND car_type = ROUND((RAND() * 2) + 1)
ORDER BY -LOG(1.0 - RAND()) / 70
LIMIT 4;

Hope this helps.

Upvotes: 0

adamkonrad

Reputation: 7122

Make sure you have combined index on car_type and points columns. First, fetch the total count of rows that you're interested in:

$condition = "car_type=? AND points > 0";
$q_count = "SELECT count(*) FROM cars WHERE {$condition}";
$r_count = mysql_query($q_count);
$car_count = mysql_result($r_count, 0, 0);

$car_count holds the max integer we can generate to retrieve a car. We're going to use this in the random number generator. Now SELECT all rows you're interested in:

$q_cars = "SELECT car_id FROM cars WHERE {$condition}";
$r_cars = mysql_query($q_cars);
$car_ids = array();
for($i = 0; $i < 4; ++$i)
{
    $random_row = rand(0, $car_count);
    $car_ids[] = mysql_result($r_cars, $random_row, 0);
}

I'm assuming normal distribution from the random number generator. Not checking for empty table or duplicate records, because in 30M records the probability of that is very low. You may want to consider adjustments to your table schema. Table of 30M records is slow to query, ORDER BY RAND() is really bad as well as using LIMIT and OFFSET.

Upvotes: 1

fmodos

Reputation: 4568

I am not sure if you can do that, but what about if you forget the 'random' and create a different formula to simulate that? My suggestion is to create one column lastViewed of date type in the cars table, so during the update of the views column, it will also update the lastViewed with the current date

Then the select query can be done of this way:

select * from cars where points=?, car_type=? order by views desc, lastViewed limit 4

This sql will always return 'random' results for the visitor based in the low views and the latest date it was viewed. The cool part of this solution is that it will give priority for the one that haven't been viewed for a while. So when insert a new car, the default value for lastViewed can be a date like from year 1900.

Upvotes: 2

Denis de Bernardy

Reputation: 78483

I'd love to give a specific answer, but I'd need help to understand your thought process...

You start by writing:

I have a website where people can vote on (...) the car they like most.

The problem is that cars added later to the database have less chance to get on the same points as cars added earlier.

But you then go on on writing:

When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.

To me, the latter spec makes little sense in light of the first remark.

Imho, what you really want is for users to have the same amount of opportunities to vote on each car. Or more precisely, to vote for each car compared to each other car.

If you assume that the (car) variables are independent, then you need to keep a count of how many times a choice came up, rather than how many times it got voted for, and adjust your decision process accordingly. It's a math problem, it's not so ugly, and it can then be translated into SQL for better or worse -- I'd venture that it'll probably be worse.

If you assume, like I do, that they're not independent, you also need to account for correlations -- and store how many times they came up with each other too. Because, well, there's an infinitely slim chance that you won't prefer this Mercedes rather than that Tata, that Xinkai or that AvtoVAZ. But given a choice between the same Mercedes, a BMW, a Porsche and a Ferrari, the decision might not be so clearcut.

In other words, your spec doesn't answer the problem as you presented it at all.

I'm currently begging to agree with the answer posted two hours ago: pick them really random, and you'll be satisfied without extra code...

As a side note, if your ids really have no gaps, generate four ids in php or whatever and fetch them using an in() statement. You won't get more efficient than that.

Upvotes: 4

Plamen

Reputation: 390

So, it looks that your main problem is the speed. In that case you can do some pre-processing, let's say - to have a table which to use like a queue, containing groups of 4 cars, ready to be shown. Of course, that will discriminate last moment viewed/voted cars, but you can refresh this queue on a regular intervals.

When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.

In case you pick them really random, that would be satisfied without extra code.

Upvotes: 1

Wallack

Reputation: 792

Yo could write a store procedure that does the follows:

(Don't take the syntax as correct as is mostly pseudocode)

First select the points:

SELECT @varpoints = points FROM cars ORDER BY RAND() LIMIT 1

This way we can obtain a random value of the points column.

Store that value in a var and do something like this (pseudocode):

WHILE (SELECT COUNT(car_id) FROM cars WHERE points = @varpoints ORDER BY views ASC) > 4
{
     SET @varpoints = @varpoints - 1;
}

Now retrieve just the SQL with the needed results:

SELECT car_id FROM cars WHERE points = @varpoints ORDER BY views ASC

And that should do the work.

This will take a random points value and do a query of cars with that value. If it doesn't get at least 4, it will substract 1 and try again. Some sort of try catch would be nice if exists a chance of having less than 4 cars of each points.

Upvotes: 1

SQL Random rows in a big table (with where clause)

Answers (7)

Related Questions