Tim Biegeleisen

Reputation: 522762

By how much do SSDs narrow the performance gap between clustered and non clustered indices?

Most SQL relational databases support the concept of a clustered index on a table. A clustered index, usually implemented as a B-tree, contains the actual records of the table, physically ordered by that index on disk/storage. One advantage of this special clustered index is that after traversing the B-tree in search of a record or set of records, the actual data is immediately available at the leaf nodes.

This stands in contrast to a non clustered index. A non clustered index exists separately from the clustered index and orders its entries by one or more columns. But its leaf nodes may not contain all the columns needed by a query. In that case, the database has to do an additional disk seek back to the underlying data to get that information.
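To make the distinction concrete, here is a minimal MySQL/InnoDB sketch (table and column names are made up for illustration). The primary key acts as the clustered index; the secondary index on last_name stores only that column plus the primary key, so selecting any other column through it forces an extra lookup back into the clustered index:

    -- Hypothetical table; in InnoDB the PRIMARY KEY is the clustered index.
    CREATE TABLE customers (
        id        INT NOT NULL PRIMARY KEY,
        last_name VARCHAR(50),
        email     VARCHAR(100),
        INDEX idx_last_name (last_name)   -- non clustered (secondary) index
    ) ENGINE=InnoDB;

    -- Covered by the secondary index alone: its leaf entries hold
    -- last_name plus the primary key, so no extra lookup is needed.
    SELECT id FROM customers WHERE last_name = 'Smith';

    -- Not covered: email lives only in the clustered index, so each match
    -- requires an additional lookup (the extra seek discussed above).
    SELECT email FROM customers WHERE last_name = 'Smith';

EXPLAIN reports "Using index" for the first query (a covering index) but not for the second.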

In most database resources I have seen on Stack Overflow and elsewhere, this additional disk seek is viewed as a substantial performance penalty. My question is how would this analysis change assuming that all database files were stored on a solid state drive (SSD)?

From the Wikipedia page for SSDs, the random access time for SSDs is less than 0.1 ms, while random access times for mechanical hard disks are typically 10-100 times slower.

Do SSDs narrow the gap between clustered and non clustered indices, such that the former become less important for overall performance?

Upvotes: 3

Views: 1361

Answers (3)

Bill Karwin

Reputation: 562961

First of all, a clustered index does not guarantee that the rows are physically stored in index order. InnoDB, for example, can store the clustered index in a non-sequential way. That is, two database pages containing consecutive rows of the table might be stored physically close to each other or far apart in the tablespace, and in either order. The B-tree data structure for the clustered index has pointers to the leaf pages, but the leaf pages don't have to be stored in any particular order.

SSD is helpful for speeding up IO-based operations, particularly those involving disk seeks. It's way faster than a spinning magnetic disk. But RAM is still a couple of orders of magnitude faster than the best SSD.

The 2018 numbers:

  • Disk seek: 3,000,000 ns
  • SSD random read: 16,000 ns
  • Main memory reference: 100 ns

RAM still trumps durable storage by a wide margin. If your dataset (or at least the active subset of your dataset) fits in RAM, you won't need to worry about the difference between magnetic disk storage and SSD storage.
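If the database is MySQL/InnoDB, one way to gauge whether the active subset fits in RAM is to compare the buffer pool size with its read statistics; the 8 GB value below is only an illustrative example:

    -- Current buffer pool size, in bytes.
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Innodb_buffer_pool_reads (pages read from disk) vs.
    -- Innodb_buffer_pool_read_requests (logical reads): a low ratio
    -- means the working set is mostly served from RAM.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

    -- MySQL 5.7+ can resize the buffer pool online; older versions need
    -- a change in my.cnf and a restart. Example: 8 GB.
    SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;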


Re your comment:

The clustered index helps because when a primary key lookup traverses the B-tree and reaches a leaf node, all the other columns of the row associated with that primary key value are right there at the leaf, so no further lookup is needed.

Compare with MyISAM, where a primary key index is separate from the rows of the table. A query searches the B-tree of the primary key index, and at the leaf node finds a pointer to the location in the data file where the corresponding row is stored. So it has to do a second seek into the data file.
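A sketch of that difference, using made-up table names: the schema is identical, only the storage engine changes, and the comments describe where each engine finds the row after the B-tree search.

    -- InnoDB: the leaf page of the primary-key B-tree contains the row
    -- itself, so the lookup ends at the leaf node.
    CREATE TABLE orders_innodb (
        id    INT NOT NULL PRIMARY KEY,
        total DECIMAL(10,2)
    ) ENGINE=InnoDB;

    -- MyISAM: the leaf of the primary-key index holds only an offset into
    -- the .MYD data file, so fetching total needs a second seek.
    CREATE TABLE orders_myisam (
        id    INT NOT NULL PRIMARY KEY,
        total DECIMAL(10,2)
    ) ENGINE=MyISAM;

    SELECT total FROM orders_innodb WHERE id = 42;
    SELECT total FROM orders_myisam WHERE id = 42;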

This does not necessarily mean that the clustered index in InnoDB is stored consecutively. It might need to skip around a bit to read all the pages of the tablespace. This is why it's so helpful to have the pages in RAM in the buffer pool.

Upvotes: 3

Gordon Linoff

Reputation: 1271151

First, the additional disk seek is not really a "killer". It can be a big issue in high-transaction environments where microseconds and milliseconds count. However, for longer-running queries, it will make little difference.

This is especially true if the database intelligently does "look ahead" disk seeks. Databases are often not waiting for data because another thread is predicting what pages will be needed and working on bringing those back. This is usually done by just taking the "next" pages on a sequential scan.
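In MySQL/InnoDB, for example, this read-ahead behaviour is visible and tunable through a few server variables (the defaults noted in the comments may differ between versions):

    -- Linear read-ahead: prefetch the next extent once this many pages of
    -- the current extent have been read sequentially (default 56).
    SHOW VARIABLES LIKE 'innodb_read_ahead_threshold';

    -- Random read-ahead, off by default.
    SHOW VARIABLES LIKE 'innodb_random_read_ahead';

    -- How many prefetched pages were brought in, and how many were
    -- evicted without ever being used.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead%';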

SSDs are going to speed up pretty much all operations. They do change the optimization parameters, though. In particular, I think their throughput is comparable to that of spinning disks (although I don't keep up with the technology specifically). Their big win is in latency -- the time between issuing the request for a disk block and the moment it is retrieved.

In my experience (which is a few years old), the performance using SSD was comparable to an in-memory database for most operations.

Whether this makes clustered indexes redundant is another matter. A key place where they are used is when you want to separate a small set of related rows (say, "undeleted") from a much larger set. By putting those rows in the same data pages, the clustered index reduces the overall number of pages that have to be read -- it doesn't just make the reads faster.
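A sketch of that pattern in InnoDB terms (table and column names are hypothetical): making the deletion flag the leading column of the primary key clusters the small "undeleted" subset onto a handful of pages.

    -- Hypothetical layout: rows cluster by (is_deleted, id), so the
    -- undeleted rows sit together instead of being scattered.
    CREATE TABLE messages (
        id         INT NOT NULL,
        is_deleted TINYINT NOT NULL DEFAULT 0,
        body       TEXT,
        PRIMARY KEY (is_deleted, id)
    ) ENGINE=InnoDB;

    -- Touches only the pages that hold the undeleted subset.
    SELECT id, body FROM messages WHERE is_deleted = 0;

The trade-off is that flipping is_deleted physically moves the row to another part of the clustered index, so this layout suits read-heavy workloads.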

Upvotes: 2

ScaisEdge

Reputation: 133400

Just some suggestions (too broad for a simple comment):

Taking into account that everything depends on the distribution of the keys in the non clustered index and in its nodes (which is essentially random and can only be assessed in average terms), the fact remains that every access benefits from the performance of the SSD. The increase in performance is not linear, but it is nonetheless substantial. On average it will not be a factor of 100, precisely because of the randomness of that distribution, but in every circumstance where a random access does occur, that access is about 100 times faster; the more random the access pattern, the greater the benefit. The underlying point is that every action on disk is much more efficient, so in general a non clustered index ends up operating in a near-optimal context.

Taking this into account, the gap should be radically reduced, thanks to the context in which the entire file system underlying the database operates: from access to the logical files that compose it down to the physical sectors where the data are actually stored.

Upvotes: 0
