Statistics on getting a random row from a table
2010-01-24 15:14:31
So last night I found this interesting post about the old, well-known alternative for getting a random row from a table: use two queries, the first to learn the number of rows, and the second to fetch just one row using the reserved SQL word OFFSET, where the offset is a random number drawn with the row count as its upper bound. In pseudo-SQL, something like this...
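-- table and column names here are placeholders, just to sketch the idea
SELECT COUNT(*) FROM some_table;
-- the client draws random_offset between 0 and count - 1, then:
SELECT * FROM some_table LIMIT 1 OFFSET random_offset;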
... instead of using our much hated but extremely common ORDER BY RAND():
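-- same placeholder table as above
SELECT * FROM some_table ORDER BY RAND() LIMIT 1;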
I had always thought the two-query alternative had to be slower, if only because of the overhead of the two network round trips, but this guy proved me wrong by going through all the hassle of actually testing it, and, really, in his tests the two-query approach turned out to be an order of magnitude faster than our "beloved" random ordering.
It was precisely those tests that sparked my curiosity, because they weren't quite to my taste. First, he uses MySQL and SQLite, which is a strange combination; SQLite is like the MS-DOS of databases, and there are plenty of better embedded engines if that was the point of including it. Second, what's the use of testing on MySQL without saying which storage engine the tests run on? (Although he hints in the comments that it could be MyISAM, and correctly notes that the row count is stored in the table's metadata.) And finally, the whole benchmark consists of just one query. Whoa, what a statistic.
So I ran my own tests against my server, MySQL 5.1.37-1ubuntu5 running on Ubuntu 9.10, and they show he is absolutely right:
In a table with a million rows, the two-query approach to getting a random row is an order of magnitude better than the naive random-ordering way! MyISAM performing better than InnoDB was expected, but, honestly, I thought the difference would be bigger.
The data in the table is measured in milliseconds per query (where do I apply to get that unit of measure named after me????), so obviously the lower the better; it is the average time a query took to complete, calculated over the thousands of requests the Perl-based clients made.
UPDATE: The queries, explained:
The naive approach, ordering by rand() and then getting the first row:
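Roughly, this is the plan for a query of this shape (the id column name is a guess; whatever the select list was, it is covered by the primary key, hence the "Using index"):

EXPLAIN SELECT id FROM bench_myisam ORDER BY RAND() LIMIT 1;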
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | bench_myisam | index | (NULL) | PRIMARY | 8 | (NULL) | 1000000 | Using index; Using temporary; Using filesort |
And the output for the two-query approach:
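First, the plan for the row count, presumably a plain COUNT(*) or something very close to it, which MyISAM answers straight from the table metadata (hence the "optimized away" below):

EXPLAIN SELECT COUNT(*) FROM bench_myisam;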
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | Select tables optimized away |
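And then the offset fetch, along these lines; the real offset is whatever random number the client generated, 123456 is just an example:

EXPLAIN SELECT id FROM bench_myisam LIMIT 1 OFFSET 123456;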
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | bench_myisam | index | (NULL) | PRIMARY | 8 | (NULL) | 1000000 | Using index |
The tables are incredibly boring: just a million rows with an integer primary key and a varchar(200) of sample text. And here is the code for the Perl client I wrote to test the two-query approach. The other client is almost identical, but with just one query.