user125264
user125264

Reputation: 1827

Database indexes in layman's terms

Over the years I've built many simple databases and majority of the time the record count in each table has been a few hundred. Tiny databases.

As of late, I have a table which contains about 20 columns. but it has grown to 500k in records.

I noticed query time was incredibly slow. 20 seconds to fetch 100 records or so. So I decided to look into indexes, which I'm trying to understand in the most simplest of terms.

If you have a table like mine, with hundreds of thousands of rows and all it contains is an index on the ID column, is it safe to say that when you need to increase speed for simple queries you just create an index on the column that is used frequently to identify the records?

ID companyName email firstname lastname phonenumber recordtype

If we often query for records by ID, we would have an index on the ID column, if we did the same for the company column.

If we did queries that identified the record by companyName & recordtype, we would create a specific index for those 2 columns as a clustered index? Or an index specifically for those 2 columns.

I'm trying to be very basic here with familiar concepts for myself as I have struggled to understand alot of the articles I have read online and as my queries seem like they take forever is this a common way to increase query speeds on simple table structures?

Upvotes: 3

Views: 5423

Answers (2)

RDFozz
RDFozz

Reputation: 215

Since you tagged your question as sql-server, I'll answer from that framework. Other DBMSes should work similarly.

Clustered vs. non-clustered

In SQL Server, the first thing to understand is the difference between a clustered index and a non-clustered index.

A non-clustered index basically takes the data from the indexed columns, sorts them as specified (ascending or descending order by column), and includes a pointer to the actual table row the data references. In SQL Server, you can include columns that are not actually indexed; those column aren't used to sort the data, but are stored with the pointer to the row. These indexes are separate from the table itself, and thus duplicate data from the table.

A clustered index isn't separate from the table; it defines how the data in the table is organized. If a table has a clustered index, then the data is stored in the order specified by the index.

When the underlying table has a clustered index, any non-clustered index will use the columns from the clustered index as the pointer to each row. this means those columns are included in each non-clustered index automatically.

A clustered index affects inserts into the table. Each insert must be made in the correct position, as determined by the indexes columns. If the table is indexed on an IDENTITY column, then each new row would come after the last, and all new rows are added to the end of the table. On the other hand, if the data is indexes on (for instance) a customer's name, then each row may need to be written to a different location; this can lead to page splits, which (since the database has to assign a new page to the table, and define how it fits in with the other pages) takes a bit longer to do.

How indexes are used

The database generally used an index to locate rows that match a certain set of data. In the following query:

SELECT cust_id, cust_name, address, city, state, zip, phone
  FROM customer
 WHERE cust_name = 'John Smith'
   AND state = 'OH'
;

we're trying to find rows that match a specific state and cust_name.

The database engine could use an index where:

  • the first two columns in the index are cust_name and state (or state and cust_name);
  • the first column in the index is state; or
  • the first column in the index is cust_name.

If there's an index where all the columns we're looking for are indexed columns (with no columns we're not looking for listed before the last column we are looking for), then SQL Server will probably be able to use that index to find the records in question.

Why does the order of the columns in the index matter? Because that's how the data in the index is stored. If there's an index on state and cust_name, then we find the first row where state = 'OH'; then, within the 'OH' rows, we find the first row where cust_name = 'John Smith'. We know that all rows from there until either state or cust_name changes are valid for consideration.

If the index was on state, city, and cust_name, then we can find the first row where state = 'OH'; however, finding the first row from there where cust_name = 'John Smith' would just find the first 'John Smith' for the current city (say, 'Akron'). There could be more 'John Smith's in 'Cincinnati', 'Cleveland', 'Columbus', 'Dayton', etc.); we'd have to check through the entire list of cities to find them all.

SQL Server can make use of two separate indexes in a search; however, it has to use them separately. Let's say we've got an index that starts with state, and an index that starts with cust_name. To use these to find out records, SQL has to build a list of all the rows with state = 'OH' from the state index; a list of all the rows with cust_name = 'John Smith' from the cust_name index, and then identify which rows are in both lists.

When deciding whether or not to use an index, SQL Server considers the statistics is has on its tables. If, for instance, it knows that each possible state only identifies a small number of rows (has a high degree of cardinality), and each unique cust_name identifies a small number of rows, it may be worthwhile to generate the two lists, and match them up. If, however, there are 100,000 rows in the table, and only two different values for state, it's more likely to find the possible matches based on cust_name, and then check them to see if they happen to be in the right state; the list of rows with state = 'OH' would be too long to be worth running through.

Indexes can be used in other ways when trying to locate records. In the query above, if there are 50 other columns in the customer table, and there's an index that has all the columns from any part of the query as either indexed columns or included columns, then all the information required by the query exists in that index. It can generate the query's result set without even looking at the table. This is called a covering index.

Note that non-equality searches (on a range, or using column LIKE 'S%') can still use indexes, but only up to the first column in the index where the range is applied.

Also not that some criteria can be searched using an index: column LIKE '%Smith', or criteria where the column isn't being used directly, like CAST(datestr as datetime) < '2017-12-21 14:00'.

The cost of indexes

I've already noted the cost of a clustered index above. Depending on the columns indexed, each insert into the table could be more or less likely to require the engine to break on page of data into two, in order to accommodate the new row. Similarly, if the indexes columns can be changed, then an update can cause a row to move from one spot in the table to another. This can cause the index/table to be fragmented; to store less information on each page than the page can hold, and thus to require reading more pages into memory to respond to a query.

The cost of a non-clustered index can be even higher. Each non-clustered index can be though of as a partial copy of the underlying table. When a row in added or deleted, all indexes on the table need to be changed; when a row is updated, each index may need to be changed. If you have 15 non-clustered indexes on a table, then each insert or delete is basically updating 16 tables, not one.

Also, each non-clustered index has to be stored on disk. Having 15 indexes on a table increases disk space consumption: maybe by 15%, maybe by 1000% (it depends on the indexes).

Thanks to these factors, it's not always in your best interest to toss out another index because a query is slow. At some point, too many indexes will make inserts, updates, and deletes slow, and may eat up too much disk space.

Your specific examples

If you frequently perform searches on id or companyName, you would probably want an index on each column.

If you search on companyName and recordType frequently, then an index where those two column are the first two columns listed would probably help performance. If the index is on companyName, recordType, and email, then it could help when searching for all three of those fields; for companyName and recordType, or, just for companyName. It would not help searches for recordType without companyName, or email without both companyName and recordType.

Two indexes, one on companyName alone (or not followed immediately by recordType), and one on recordType (same restriction wrt companyName) could help, if both indexes have a high cardinality. Otherwise, the database may just use whichever one should pull back fewer total records.

Upvotes: 5

Alan
Alan

Reputation: 1428

Layman's Terms: Think of an index the same way you would the index in the back of a street directory. There are two ways to find an address in a street directory. The first way is to start on map 1 in Grid A1 and search each grid on the map until eventually after some time you will come across the grid on some page that has the address (assuming you are diligent.). Or you can look up the address reference in the back of the directory by street name sorted alphabetically, with a reference to the page and grid reference number. The first approach will take an average of n/2 searches to find the location (where n is the number of street references - think rows). Using the index will enable a binary split or other technique to zero-in on the record very quickly.

There is a trade-off with indexes. They store additional space and there is an overhead writing them when saving the record. So if you have a lot of writes they can slow you down. Clustered indexes avoid the additional space and the extra lookup step because the actual data is stored in the order defined by the index.

Upvotes: 4

Related Questions