Effective indexes are one of the best ways to improve performance in a database application. Without an index, the SQL Server engine is like a reader trying to find a word in a book by examining each page. By using the index in the back of a book, a reader can complete the task in a much shorter time. In database terms, a table scan happens when there is no index available to help a query. In a table scan SQL Server examines every row in the table to satisfy the query results. Table scans are sometimes unavoidable, but on large tables, scans have a terrific impact on performance.
One of the most important jobs for the database is finding the best index to use when generating an execution plan. Most major databases ship with tools to show you execution plans for a query and help in optimizing and tuning indexes. This article outlines several good rules of thumb to apply when creating and modifying indexes for your database. First, let’s cover the scenarios where indexes help performance, and when indexes can hurt performance.
Useful Index Queries
Just like the reader searching for a word in a book, an index helps when you are looking for a specific record or set of records with a WHERE clause. This includes queries looking for a range of values, queries designed to match a specific value, and queries performing a join on two tables. For example, both of the queries against the Northwind database below will benefit from an index on the UnitPrice column.
DELETE FROM Products WHERE UnitPrice = 1 SELECT * FROM PRODUCTS WHERE UnitPrice BETWEEN 14 AND 16
Since index entries are stored in sorted order, indexes also help when processing ORDER BY clauses. Without an index the database has to load the records and sort them during execution. An index on UnitPrice will allow the database to process the following query by simply scanning the index and fetching rows as they are referenced. To order the records in descending order, the database can simply scan the index in reverse.
SELECT * FROM Products ORDER BY UnitPrice ASC
Grouping records with a GROUP BY clause will often require sorting, so a UnitPrice index will also help the following query to count the number of products at each price.
SELECT Count(*), UnitPrice FROM Products GROUP BY UnitPrice
By retrieving the records in sorted order through the UnitPrice index, the database sees matching prices appear in consecutive index entries, and can easily keep a count of products at each price. Indexes are also useful for maintaining unique values in a column, since the database can easily search the index to see if an incoming value already exists. Primary keys are always indexed for this reason.
Indexes are a performance drag when the time comes to modify records. Any time a query modifies the data in a table the indexes on the data must change also. Achieving the right number of indexes will require testing and monitoring of your database to see where the best balance lies. Static systems, where databases are used heavily for reporting, can afford more indexes to support the read only queries. A database with a heavy number of transactions to modify data will need fewer indexes to allow for higher throughput. Indexes also use disk space. The exact size will depends on the number of records in the table as well as the number and size of the columns in the index. Generally this is not a major concern as disk space is easy to trade for better performance.
Building The Best Index
There are a number of guidelines to building the most effective indexes for your application. From the columns you select to the data values inside them, consider the following points when selecting the indexes for your tables.
Having short index is beneficial for two reasons. First, database work is inherently disk intensive. Larger index keys will cause the database to perform more disk reads, which limits throughput. Secondly, since index entries are often involved in comparisons, smaller entries are easier to compare. A single integer column makes the absolute best index key because an integer is small and easy for the database to compare. Character strings, on the other hand, require a character by character comparison and attention to collation settings.
The most effective indexes are the indexes with a small percentage of duplicated values. As an analogy, think of a phone book for a town where almost everyone has the last name of Smith. A phone book in this town is not very useful if sorted in order of last name, because you can only discount a small number of records when you are looking for a Smith.
An index with a high percentage of unique values is a selective index. Obviously, a unique index is highly selective since there are no duplicate entries. Many databases will track statistics about each index so they know how selective each index is. The database uses these statistics when generating an execution plan for a query.
Indexes generally contain only the data values for the columns they index and a pointer back to the row with the rest of the data. This is similar to the index in a book: the index contains only the key word and then a page reference you can turn to for the rest of the information. Generally the database will have to follow pointers from an index back to a row to gather all the information required for a query. However, if the index contains all of he columns needed for a query, the database can save a disk read by not returning to the table for more information.
Take the index on UnitPrice we discussed earlier. The database could use just the index entries to satisfy the following query.
SELECT Count(*), UnitPrice FROM Products GROUP BY UnitPrice
We call these types of queries covered queries, because all of the columns requested in the output are covered by a single index. For your most crucial queries, you might consider creating a covering index to give the query the best performance possible. Such an index would probably be a composite index (using more than one column), which appears to go against our first guideline of keeping index entries as short as possible. Obviously this is another tradeoff you can only evaluate with performance testing and monitoring.
Many databases have one special index per table where all of the data from a row exists in the index. SQL Server calls this index a clustered index. Instead of an index at the back of a book, a clustered index is closer in similarity to a phone book because each index entry contains all the information you need, there are no references to follow to pick up additional data values.
As a general rule of thumb, every non-trivial table should have a clustered index. If you only create one index for a table, make the index a clustered index. In SQL Server, creating a primary key will automatically create a clustered index (if none exists) using the primary key column as the index key. Clustered indexes are the most effective indexes (when used, they always cover a query), and in many databases systems will help the database efficiently manage the space required to store the table.
When choosing the column or columns for a clustered index, be careful to choose a column with static data. If you modify a record and change the value of a column in a clustered index, the database might need to move the index entry (to keep the entries in sorted order). Remember, index entries for a clustered index contain all of the column values, so moving an entry is comparable to executing a DELETE statement followed by an INSERT, which can obviously cause performance problems if done often. For this reason, clustered indexes are often found on primary or foreign key columns. Key values will rarely, if ever, change.
Determining the correct indexes to use in a database requires careful analysis, benchmarking, and testing. The rules of thumb presented in this article are general guidelines. After applying these principals you need to retest your specific application in your specific environment of hardware, memory, and concurrent activity. See my previous article: SQL Server Indexes, for a more thorough introduction.