Reputation: 1495
My questions relate to the way relational DBs access the data, and the special case of using Index-Only-Scans, i.e. scans which check all 'where' conditions and fetch all return values from some index, without accessing the table itself.
Imagine we need to access some columns which are not in the index. We need to access them for one (or both) of two reasons: to compare with the 'where' clause and to fetch the column value as a result. How would the database act in this case: it would fetch the whole row, or it would fetch only the column it needs?
As a consequence of the first question emerges this one: does the number of columns to return in select query really matters if we are not using the Index-Only-Scan? I mean if we have to fetch or compare with 'where' clause some columns which are not indexed - does it really matter how many columns we return, or we can write "select * from ..." without any concerns in case DB fetches the whole row anyway?
When we use Index-Only-Scans we have to include all the columns the query deals with in one single index. If some column is included in another index - this will not burst the performance. Am I right?
I have read that MySQL InnoDB engine uses clustered indexes by default, i.e. all the rows in tables are physically sorted by some index. This means that searching that table using some secondary index would be less efficient, because after that search db has to make a second one over the primary index because in a clustered index db is not storing the rowId anymore. Am I right? If yes, why MySQL implements indexing in such a way, thus limiting the usage of secondary indexes?
Upvotes: 1
Views: 921
Reputation: 8846
Some of these explanations probably go over stuff you already know, but the full detail may help future readers.
The server would most likely just get the rows that are needed. However, this may be affected by how the data is stored. For example, the InnoDB engine usually stores large data (like TEXT
s and BLOB
s off page, so those probably wouldn't be fetched if not needed.
I think I need some clarification here, so correct me if I'm missing something from your question. First, it is better to only return the columns you needs and listing all of the columns instead of selecting *
is faster. As with #1, how much of a difference selecting other columns will make depends. Selecting large columns (like TEXT
s or BLOB
s) is usually more expensive than small columns.
I'm not 100% sure what you mean here, but I think I can answer the question. If you have a query like SELECT c1, c2, c3 FROM table WHERE c1 = 1 AND c2 = 2
, an index like (c1,c2,c3)
would probably be optimal; all of columns that the query needs are in the index so the server doesn't need to look up the full data rows. it does not matter if c1
, c2
, or c3
are included in any other indexes.
In your question, you say that a clustered index db is not storing the rowId anymore
, which isn't completely correct.
Assuming that the rowId
is a unique, probably numeric, identifier for the data:
In a non-clustered database table, the all of the indexes connect some columns to the physical data location. In the case of the primary index this would look like rowId -> data location
. A secondary index could look like column 1 -> column 2 -> data location
. To get any other data, the server then looks up the data based on the physical location.
In a clustered table, the physical data basically is the primary index. The primary index looks like rowId -> data
and secondary indexes look like column 1 -> column 2 -> rowId
.
For a non-clustered table, the full look-up path would look like rowId -> data location -> data
using a primary index and column 1 -> column 2 -> data location -> data
for a secondary index.
For a clustered table, it would look like rowId -> data
for the primary index and column 1 -> column 2 -> rowId -> data
for a secondary index.
So to correct the quote at the beginning of this section, the only indexes that really "store" the rowId
are secondary indexes on a clustered table.
While secondary index look-ups on clustered tables are slower than on a non-clustered table, the difference is usually negligible if you're using a short primary key. One of the main benefits of a clustered table is that primary index look-ups are faster, so if you're primarily using primary key look-ups, they are beneficial.
In response to KutaBeach's comments:
Fetching columns that aren't needed isn't helpful. When the server needs to go the the data to get rows not in an index, it doesn't always fetch all of the data for that row. Some storage configurations store some data apart from the main row since it may be very large and otherwise affect performance. An example could be a TEXT
column that is 65535 characters long for every row. If the storage engine stores that data off page, getting data from rows is much faster if the TEXT
column isn't needed.
It sounds like when you say rowId
, you mean the physical address of the row, not a unique number assigned to each row. In this case, you are correct that only secondary indexes on a clustered table do not store the rowId
; all other indexes store the rowId
. However, this is not because the data can or cannot move; the data in a table can move at any time, in which case the indexes get updated to reflect the move. In MySQL, a PRIMARY INDEX
is basically just the main index of a table. It is pretty much the same as a UNIQUE
index in that it forces values to be unique, with the only difference being that it is used as the main key for the table. Non-clustered indexes do contain every rowId
.
Upvotes: 1