Reputation: 19308
I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND()
, and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.
This article suggest using the NEWID()
function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.
Anybody ever do this before? Any ideas?
Upvotes: 378
Views: 511515
Reputation: 357
In SQl Server 2017 and later you can use the TABLESAMPLE phrase in your code like this:
SELECT * FROM TableName TABLESAMPLE (100 ROWS)
or
SELECT * FROM TableName TABLESAMPLE (10 PERCENT)
Upvotes: 0
Reputation: 123
If you know you have approximately N rows and you want approximately K random rows, you just need to pull any given row with a chance K/N. Using the RAND() function which gives you a fair distribution between 0 and 1, you could just do the following where PROB = K/N. Worked very quickly for me.
SELECT * FROM some_table WHERE RAND() < PROB
Upvotes: 1
Reputation: 1217
select * from table
where id in (
select id from table
order by random()
limit ((select count(*) from table)*55/100))
// to select 55 percent of rows randomly
Upvotes: 1
Reputation: 61975
Here is an updated and improved form of sampling. It is based on the same concept of some other answers that use CHECKSUM
/ BINARY_CHECKSUM
and modulus.
Reasons to use an implementation similar to this one, as opposed to other answers:
CHECKSUM(*)
/ BINARY_CHECKSUM(*)
issues with runs of data. When using the CHECKSUM(*)
approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.NEWID()
, such as CHECKSUM(NEWID()) % 100
, can never be stable/repeatable.CHECKSUM
only returns an int
value.ORDER BY NEWID()
, as ordering can become a significant bottleneck with large input sets. Avoiding the sorting also reduces memory and tempdb usage.TABLESAMPLE
and thus works with a WHERE
pre-filter.Cons / limitations:
CHECKSUM(*)
. Using hashbytes, as shown below, adds about 3/4 of a second of overhead per million lines. This is with my data, on my database instance: YMMV. This overhead can be eliminated if using a persisted computed column of the resulting 'well distributed' bigint
value from HASHBYTES.SELECT TOP n .. ORDER BY NEWID()
, this is not guaranteed to return "exactly N" rows. Instead, it returns a percentage row rows where such a value is pre-determined. For very small sample sizes this could result in 0 rows selected. This limitation is shared with the CHECKSUM(*)
approaches.Here is the gist:
-- Allow a sampling precision [0, 100.0000].
declare @sample_percent decimal(7, 4) = 12.3456
select
t.*
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
and ( -- sample
@sample_percent = 100
or abs(
-- Choose appropriate identity column(s) for hashbytes input.
-- For demonstration it is assumed to be a UNIQUEIDENTIFIER rowguid column.
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
Notes:
bigint
is critical as it allows 2^63 bits of 'random space' to which to apply the modulus operator; this is much more than the 2^31 range from the CHECKSUM result. This reduces the modulus error at the limit, especially as the precision is increased.1000 *
to account for the 4 digits of precision allowed in @sample_percent
.bigint
value by RAND()
to return a different row sample each run. This effectively changes the permutation of the fixed hash values.@sample_percent
is 100 the query planner can eliminate the slower calculation code entirely. Remember 'parameter sniffing' rules. This allows the code to be left in the query regardless of enabling sampling.Computing @sample_percent
, with lower/upper limits, and adding a TOP
"hint" in the query as might be useful when the sample is used in a derived table context.
-- Approximate max-sample and min-sample ranges.
-- The minimum sample percent should be non-zero within the precision.
declare @max_sample_size int = 3333333
declare @min_sample_percent decimal(7,4) = 0.3333
declare @sample_percent decimal(7,4) -- [0, 100.0000]
declare @sample_size int
-- Get initial count for determining sample percentages.
-- Remember to match the filter conditions with the usage site!
declare @rows int
select @rows = count(1)
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
-- Calculate sample percent and back-calculate actual sample size.
if @rows <= @max_sample_size begin
set @sample_percent = 100
end else begin
set @sample_percent = convert(float, 100) * @max_sample_size / @rows
if @sample_percent < @min_sample_percent
set @sample_percent = @min_sample_percent
end
set @sample_size = ceiling(@rows * @sample_percent / 100)
select *
from ..
join (
-- Not a precise value: if limiting exactly at, can introduce more bias.
-- Using 'option optimize for' avoids this while requiring dynamic SQL.
select top (@sample_size + convert(int, @sample_percent + 5))
from t
where 1=1
and t.Name = 'Mr. No Questionable Checksum Usages'
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
) sampled
on ..
Upvotes: 2
Reputation: 605
The server-side processing language in use (eg PHP, .net, etc) isn't specified, but if it's PHP, grab the required number (or all the records) and instead of randomising in the query use PHP's shuffle function. I don't know if .net has an equivalent function but if it does then use that if you're using .net
ORDER BY RAND() can have quite a performance penalty, depending on how many records are involved.
Upvotes: 0
Reputation: 709
I was using it in subquery and it returned me same rows in subquery
SELECT ID ,
( SELECT TOP 1
ImageURL
FROM SubTable
ORDER BY NEWID()
) AS ImageURL,
GETUTCDATE() ,
1
FROM Mytable
then i solved with including parent table variable in where
SELECT ID ,
( SELECT TOP 1
ImageURL
FROM SubTable
Where Mytable.ID>0
ORDER BY NEWID()
) AS ImageURL,
GETUTCDATE() ,
1
FROM Mytable
Note the where condtition
Upvotes: 1
Reputation: 860
This link have a interesting comparison between Orderby(NEWID()) and other methods for tables with 1, 7, and 13 millions of rows.
Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:
What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.
Please read the full article in MSDN.
Upvotes: 14
Reputation: 828
This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():
SELECT TOP [number]
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())
Upvotes: 12
Reputation: 6690
It appears newid() can't be used in where clause, so this solution requires an inner query:
SELECT *
FROM (
SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
FROM MyTable
) vw
WHERE Rnd % 100 < 10 --10%
Upvotes: 1
Reputation: 3979
Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId()
is insignificantly slower than rand(checksum(*))
, so you may not want to use it against large record sets.
Selection with Initial Seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
Upvotes: 4
Reputation: 18312
In MySQL you can do this:
SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;
Upvotes: 6
Reputation: 4639
If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:
DECLARE @sampleCount int = 50
SET STATISTICS TIME ON
SELECT TOP (@sampleCount) *
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()
SET STATISTICS TIME OFF
In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. (Note that TABLESAMPLE will not accept a variable)
Upvotes: 12
Reputation: 2029
Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
Upvotes: 34
Reputation: 1151
Depending on your needs, TABLESAMPLE
will get you nearly as random and better performance.
this is available on MS SQL server 2005 and later.
TABLESAMPLE
will return data from random pages instead of random rows and therefore deos not even retrieve data that it will not return.
On a very large table I tested
select top 1 percent * from [tablename] order by newid()
took more than 20 minutes.
select * from [tablename] tablesample(1 percent)
took 2 minutes.
Performance will also improve on smaller samples in TABLESAMPLE
whereas it will not with newid()
.
Please keep in mind that this is not as random as the newid()
method but will give you a decent sampling.
See the MSDN page.
Upvotes: 111
Reputation: 23
This works for me:
SELECT * FROM table_name
ORDER BY RANDOM()
LIMIT [number]
Upvotes: -1
Reputation: 29
Try this:
SELECT TOP 10 Field1, ..., FieldN
FROM Table1
ORDER BY NEWID()
Upvotes: 2
Reputation: 21108
select top 10 percent * from [yourtable] order by newid()
In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.
select * from [yourtable] where [yourPk] in
(select top 10 percent [yourPk] from [yourtable] order by newid())
The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.
Upvotes: 470
Reputation: 1973
newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.
Upvotes: 50
Reputation: 59705
Just order the table by a random number and obtain the first 5,000 rows using TOP
.
SELECT TOP 5000 * FROM [Table] ORDER BY newid();
UPDATE
Just tried it and a newid()
call is sufficent - no need for all the casts and all the math.
Upvotes: 12