Frank
Frank

Reputation: 534

select 30 random rows where sum amount = x

I have a table

items
id int unsigned auto_increment primary key,
name varchar(255)
price DECIMAL(6,2)

I want to get at least 30 random items from this table where the total sum of price equals 500, what's the best approach to accomplish this?

I have seen this solution which looks have a similar issue MySQL Select 3 random rows where sum of three rows is less than value

And I am wondering if there are any other solutions that are easier to implement and/or more efficient

Upvotes: 16

Views: 1539

Answers (7)

Jannes Botis
Jannes Botis

Reputation: 11242

There is a solution if your product list satisfies the following assumption:

You have products for all prices between 0.00 and 500.00. eg. 0.01, 0.02 etc to 499.99. or maybe 0.05, 0.10 etc to 499.95.

The algorithm is based on the following:

In a collection of n positive numbers that sum up to S, at least one of them will be less than S divided by n (S/n)

In this case, the steps are:

  1. Select a product randomly where price < 500/30. Get its price, lets say X.
  2. Select a product randomly where price < (500 - X)/29. Get its price, assume Y.
  3. Select a product randomly where price < (500 - X - Y)/28.

Repeat this 29 times and get 29 products. For the last product, select one where price = remaining price. (or price <= remaining price and order by price desc and hopefully you could get close enough).

For the table items:

Get random product max price:

CREATE PROCEDURE getRandomProduct (IN maxPrice INT, OUT productId INT, productPrice DECIMAL(8,2))
BEGIN
   DECLARE productId INT;
   SET productId = 0;
       SELECT id, price INTO productId, productPrice
       FROM items
       WHERE price < maxPrice
       ORDER BY RAND()
       LIMIT 1;
END

Get 29 random products:

CREATE PROCEDURE get29products(OUT str, OUT remainingPrice DECIMAL(8,2))
BEGIN
  DECLARE x INT;
  DECLARE id INT;
  DECLARE price DECIMAL(8,2);
  SET x = 30;
  SET str = '';
  SET remainingPrice = 500.00;

  REPEAT
    CALL getRandomProduct(remainingPrice/x, @id, @price);
    SET str = CONCAT(str,',', @id);
    SET x = x - 1;
    SET remainingPrice = remainingPrice - @price;
    UNTIL x <= 1
  END REPEAT;
END

Call the procedure:

CALL `get29products`(@p0, @p1); SELECT @p0 AS `str`, @p1 AS `remainingPrice`;

and in the end try to find the last product to get to 500.

Alternatively, you could select 28 and use the solution on the linked question you provided to get a couple of products that sum to the remaining price.

Note that duplicate products are allowed. To avoid duplicates, you could extend getRandomProduct with an additional IN parameter of the products already found and add a condition NOT IN to exclude them.

Update: You could overcome the above limitation, so that you always find collections that sum to 500 by using a cron process as described at the 2nd section below.

2nd section: Using a cron process

Building on @Michael Zukowski `s suggestion, you could

  • create a table to hold the collections found
  • define a cron process that runs the above algorithm a number of times (in example 10 times) eg. every 5 min
  • if a collection is found that matches the sum, add it to the new table

This way you can find collections that always sum exactly to 500. When a user makes a request, you could select a random collection from the new table.

Even with a match rate of 20%, a cron process that runs the algorithm 10 times every 5 minutes in 24h you could more than 500 collections.

Using a cron process has the following advantages and disadvantages in my opinion:

Advantages

  • find exact matches
  • no process on client request
  • even with a low match rate, you can find several collections

disadvantages

  • if the price data are updated frequently, you could have inconsistent results, maybe using a cron process is not gonna work.
  • have to discard or filter old collections
  • it will probably be not random per client, as different client will probably see the same collection.

Upvotes: 2

jferard
jferard

Reputation: 8180

I'm suprised that nobody suggested, for the record, the brute force solution:

SELECT 
    i1.id, 
    i2.id, 
    ..., 
    i30.id, 
    i1.price + i2.price + ... + i30.price
FROM items i1 
INNER JOIN items i2 ON i2.id NOT IN (i1.id)
...
INNER JOIN items i30 ON i30.id NOT IN (i1.id, i2.id, ..., i29.id)
ORDER BY ABS(x - (i1.price + i2.price + ... + i30.price))

Such a request may be generated by a program to avoid mistakes. That's almost a joke, because the time is O(n^30) (the generic https://en.wikipedia.org/wiki/Subset_sum_problem is NP complete, but if you fix the size of the subset, it is not. ), but it is possible and may have a meaning for precomputations. When the set of prices does not change, use a precomputed set of prices and find random items that have thoses prices.

There is a dynamic programming solution (see Wikipedia), but it may take too long for your needs. There is also a polynomial time approximate algorithm, but the naive implementation would be O(n) in queries (I didn't search another implementation).

I propose another possibility, without the assumptions from Jannes Botis The principle is a greedy "hill climbing", with some retreats because the greedy method won't fit to every situation.

First of all, a summary: take the total of the 30 cheapest items, then progress as fast as possible to x (be greedy) by replacing cheap items by expensive ones; if you outreach x, then make a maximum step back and resume the climb unless you are done or tired.

And now, the details (should use PHP + MySQL, and not only MySQL):

Let N = 30

Step 1: initialization

Sort items by ascending price and select the first N ones

  • It the total price is x, you are done.
  • If the total price is more than x, give up: you can't produce a total equal to x.
  • Else continue with the N cheapest items.

With a B-tree index on prices, it should be fast

Step 2: climb

Thus, x - total > 0, and we want the difference to be the closest to 0.

Select every pair of items (with a join) where:

  1. the first item i1 is in the N selected items
  2. the second item i2 is not in the N selected items,
  3. the price of i1 is more than the price of i2: p1 - p2 > 0.
  4. (x - total) - (p1 - p2) >= 0

Order the result by ascending (x - total) - (p1 - p2).

  • If there is no matching row, there are two cases (so maybe use two queries if you allow N to grow):

    1. no items so that p1 - p2 > 0: increase N and add the item with the lowest price. If N == n, you can't reach x, else go to step 2.
    2. no items so that (x - total) - (p1 - p2) >= 0: you will outreach the limit x. Go to step 3.
  • Else take the first row (the closest to the peak) and replace i1 by i2 in the items: the new total is total - p1 + p2, and now x - total >= 0 and you are closer to 0.

    • If it is zero, then we are done.
    • Else loop to step 2.

*The join will take some O(n): N items i1 * [(n-N) items i2 minus the one with p2 > p1]*

Step 3: retreat

There are many way to retreat. Here's one.

  • If you have just retreated, give up: you're stuck.
  • If you have already retreated more than n times or if you are close enough to 0, you may give up. This avoids endless loops.
  • Else: Remove the item with the maximum price of the list, and replace it by the item that is not in the list with the minimum price (max and min to ensure you go down enough). Then update total and go back to step 2.

With a B-tree index on prices, it should be fast

I hope this is clear. You can tune it to decide when you have done enough and use a precomputed set of 30 items with a total price of x. I believe the time complexity is O(n) in average case. I did some tests (python + sqlite) with 200 items, random prices between 0 and 1000 and no retreat. On 1000 tests, 22 failures to reach 5000 (0.44%), 708 successes in 3 tries, 139 successes in 4 tries, 126 successes in 3 tries, 4 successes in 5 tries and 1 success in 1 try (a "try" is the try of a set of items different from the 30 cheapest items: k tries means times the query of step 2). This will depend on the number of items, the prices, ...

You can also make variations, e.g. start with a random set of items and try to narrow x, oscillate around x instead of retreating, ...

Upvotes: 1

Aishik kirtaniya
Aishik kirtaniya

Reputation: 538

  1. first select all values where sum = 500
  2. use mysql_query

then do the following code

$arr = array();
$num = 0;
while($row = mysqli_fetch_array($result))
{
    array_push($arr,$row['id']);
}
$arr2= array();
while(count($arr2!=30)
{
    $cnt = random(0,count($arr));
    if(in_array($arr[$cnt],$arr2);
    {
        array_push($arr2,$arr[$cnt]);
    }
}
print_r($arr2);

here $arr2 is the required array

Upvotes: 0

Pankaj
Pankaj

Reputation: 581

If you read the MySQL manual you might have seen the ORDER BY RAND() to randomize the the rows.

This example works fine and is fast if you only when let's say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don't forget: we only sort to throw nearly all the rows away.

A great post handling several cases, from simple, to gaps, to non-uniform with gaps.

Here is how you can do it perfectly :

SELECT id, name, price
 FROM `items` AS i1 JOIN
    (SELECT CEIL(RAND() *
                 (SELECT MAX(id)
                    FROM `items`)) AS id) AS i2
 WHERE i1.id >= i2.id AND i1.price = 500
 ORDER BY i1.id ASC
LIMIT 30;

Upvotes: -1

Mike Doe
Mike Doe

Reputation: 17576

If you want to be efficient stop wasting your time and go for eventual consitency. Create console script that does what you want to accomplish by any means necessary, then run this script in CRON or with any scheduling software once in a while.

Having 100, 1000 visitors would you want your query to be executed every time? This is time and resource consuming. Randomly ordered queries cannot be cached by DBMS's too. Go for eventual consistency: create a table to hold that records and purge it each time, lock for writing, then load with new set, every 5 minutes for instance.

At least this is how I do it in heavily loaded applications. In the code it's matter of running plain SELECT query.

Upvotes: 3

Alexey
Alexey

Reputation: 2478

The closest answer I can provide is this

set @cnt = 0;
set @cursum = 0;
set @cntchanged = 0;
set @uqid = 1;
set @maxsumid = 1;
set @maxsum = 0;
select 
    t.id,
    t.name,
    t.cnt
from (
    select 
        id + 0 * if(@cnt = 30, (if(@cursum > @maxsum, (@maxsum := @cursum) + (@maxsumid := @uqid), 0)) + (@cnt := 0) + (@cursum := 0) + (@uqid := @uqid + 1), 0) id, 
        name,  
        @uqid uniq_id,
        @cursum := if(@cursum + price <= 500, @cursum + price + 0 * (@cntchanged := 1) + 0 * (@cnt := @cnt + 1), @cursum + 0 * (@cntchanged := 0)) as cursum, if(@cntchanged, @cnt, 0) as cnt  
    from (select id, name, price from items order by rand() limit 10000) as orig
) as t

where t.cnt > 0 and t.uniq_id = @maxsumid
;

So how it works? At first we select 10k randomly ordered rows from items. After it we sum prices of items until we reach 30 items with sum less than 500. When we find 30 items we repeat the process until we walk through all the 10k selected items. While finding these 30 items we save maximum found sum. So at the end we select 30 items with greatest sum (meaning the closest to the target 500). Not sure if that's what you originally wanted, but finding the exact sum of 500 would require too much effort on DB side.

Upvotes: 7

Jonas Staudenmeir
Jonas Staudenmeir

Reputation: 25906

Depending on the average price and the price distribution you could try something like this:

  1. Randomly select a few items less than you want in total (e.g. 25). Retry until their total amount is less than x.

  2. Then use the concept linked in your question to find a combination that provides the remaining amount.

Upvotes: 0

Related Questions