powered by Slim Framework
enhanced by Nesbot.com

MSSQL/MySQL: Improving ORDER BY RAND and ORDER BY NEWID

Published on Sep 5, 2011 by Jamie Munro

Whenever you need to display "random" data, it is always so tempting to take the simplest solution, e.g. SELECT * FROM mytable ORDER BY RAND().  This of course works perfectly; however, if your database table contains a lot of records (10,000+) a query like this can take seconds to complete – instead of 1/10th of a second.  With a bit of extra effort and careful thought, this process can be improved and still provide truly random results.



Assumptions



  • You have an existing database and website

  • You have a little bit of SQL knowledge


Before I get into the solution, I think it's imperative to demonstrate some actual performance results from tests that I have run in the past.  The following query was run on a table with 360,000 records:


SELECT * FROM videos ORDER BY RAND() LIMIT 10


The results of this query took 14.3622 seconds to return back 10 results!

The reason this query takes so long is it needs to retrieve all of the records, then order them randomly, and finally return back only 10 records.  Yes, it is quite crazy that all records need to be retrieved by the server, so it's our goal to improve this.

The solution to retrieving random records is about performing multiple small queries that leverage the primary key on the table (or any other index on the table) – anything that prevents ALL results from being retrieved.

The first step to the solution is retrieve the MIN and MAX id from the table:


SELECT MIN(id) as min_id, MAX(id) as max_id FROM videos


Once you have the min and max, you must perform a loop from 1 to xx equaling the number of random records you want.

Here is some pseudo code that performs this:


// variable to keep track of ids already retrieved to prevent duplicates
var currentRecords;
var recordCount = 0;
// best be safe to include a fail-safe in case enough records cannot  be found
var retryCount = 50;
var currentCount = 0;

// set min_id and max_id from the previous query

do {
// loop until a rand num is found not already returned
do {
// generate random number
var randNum = RANDOM(min_id, max_id);

currentCount++;
} while (EXISTS(randNum, currentRecords) AND currentCount < retryCount);

// if exceeded retry limit break
if (currentCount >= retryCount)
break;

// do query to get record
record = SELECT * FROM videos WHERE id = randNum;

// if we found a record save it
if (record) {
recordCount++;
currentRecords[] = record;

currentCount = 0;
} else {
currentCount++;
}
} while (recordCount < NUM_RECORDS AND currentCount < retryCount);

// currentRecords will contain the final random records that should be used for display


Hopefully my pseudo code isn't too ugly as it is a bunch of fake syntax trying to keep the basics in mind.  An outer loop is started that continues until NUM_RECORDS are found OR the retryCount has been exceeded.  Then another loop is started to generate a random number between the min and max ids.  A check is done to ensure this id has not already been returned.  Next a check is done to see if the retry count has been exceeded without receiving a valid random number.

Then a query is performed using the randNum.  This might not always return a result in case records were deleted the ids are not sequential (more on this in a minute).  If a record is found, the recordCount is incremented and the record is added to the currentRecords.  The currentCount is also reset to prevent the retry count from coming true.  If no record is found the currentCount is incremented to ensure this process will stop if the retryCount is exceeded.

Now, as I mentioned above, it's possible that when a query is performed it might not return a result because that id might not exist in the database.  Doing an extra 5 or 10 simple queries on a specific id is still much faster than performing the ORDER BY RAND or ORDER BY NEWID, so this is something that I am willing to accept.  If this concerns you, it's also possible to alter the query to be something like this:


// do query to get record
record = SELECT * FROM videos WHERE id >= randNum AND id <= randNum + 10 LIMIT 1;

// if MSSQL
record = SELECT TOP 1 * FROM videos WHERE id >= randNum AND id <= randNum + 10;


The above example has a better chance of not failing unless more than 10 records in a row have been deleted.  And since this still leverages the primary key, no speed hit occurs.

Summary


With a little bit of ingenuity and a bit of extra work – great speed improvements can be achieved when retrieving random records from a database table by avoiding the use of ORDER BY RAND and ORDER BY NEWID!

Tags: Database | MSSQL | MySQL

<- jQuery: Creating templates for your HTML Content  Home jQuery: Transitioning AJAX Content into view with $.animate() -> 
blog comments powered by Disqus