3.1.2 : Performance and Indexes

It is a goal of databases to run queries efficiently. But to do so so work is often required, first from the database designer, sometimes also from the user who can aim to write queries in more optimal ways. From the designer's point of view it often is necessary to tune databases in a particular way to assist the database engine in constructing optimal execution plans for a particular query. The main technique here is the definition of appropriate table indexes. We describe this very important concept in some more detail on this page.
Note that this is a field that requires quite some more understanding of database functionality than most first time users will not possess. Whenever you have questions about badly performing queries, please mail us the SQL at we will try to improve it if possible.

An index can be seen as a kind of table that contains a subset of the columns of the main table, ordered in a particular way. Each row in such an index points back to the row in the main table from which it was extracted. If a query contains a constraint on a subset of columns that is contained in such an index, the query engine may first find the rows obeying the constraint in the index and then retrieve the required data form the main table. As the index is ordered one can use fast algorithms to find the requested data, often in "logarithmic time". This is to be compared to a linear scan over a table of O(1 billion) rows, with total size ~400GB which may take minutes.

The user does not need to use the indexes explicitly in her queries. The database "knows" of the definition of the indexes and their relation to the table and will use this knowledge in the compilation of a user query into an execution plan. It is however often useful to know which indexes exist on a table to write a query in such a way that indexes can be used. In our databases a typical example is that users should use the "snapnum" column rather then the "redshift" column that are often both available in the tables. Whenever we created indexes that allow more efficient querying for objects at a certain snapshot, we used the integer snapnum column, rather than the floating-point redshift. (See the discussion on this page for some more details on the relation between the snapnum and redshift columns.)

Indexes are often added to tables while it is up an running, hence a static documentation of them is soon out of date. On the main web page though there is a button labelled "Show Indexes" that can help one find all indexes for a given table. Clicking that button will produce the following query in the query window:

use [database-name] 
select t.name as table_name 
,      ind.name as index_name 
,      col.name as column_name 
,      ic.index_column_id as column_rank 
  from sys.indexes ind 
  ,    sys.index_columns ic 
  ,    sys.columns col 
  ,    sys.tables t 
 where ind.object_id = ic.object_id 
   and ind.index_id = ic.index_id 
   and ind.is_primary_key = 0 
   and ind.is_unique = 0 
   and ind.is_unique_constraint = 0 
   and t.is_ms_shipped = 0 
   and ic.object_id = col.object_id 
   and ic.column_id = col.column_id 
   and ind.object_id = t.object_id 
order by t.name, ind.name, ic.index_column_id

After substituting the name of a particular database for the [database-name] and clicking "Query (browser)" button a result will appear like the following example below which shows some indexes defined on tables in the Guo2010a database. The table shows first the table on which an index has been defined, then the name of the index. Then the columns in the index shown in the order in which they have been defined in the index, sepcified by the column_rank. This is important. The index is ordered by columns in the order of their definition.

For example the index ix_guo2010a_mr_snapnum_sampling defined on Guo2012a..MR contains columns (snapnum, fofID, centralMvir, type, stellarmass,...) in this order, which implies that it is ordered first by snapnum, than fofID, then centralMvir and so on. This helps when a query has a WHERE_cluase that include constraints on on snapnum and fofId. Now a binary search (for example, DB will use more optimal algorithms) can very quickly find the range of rows that fulfills the constraint. But when the query only constrains stellarMass the whole index will still have to scanned before the apporpriate rows are found. (Though note that the reduced width of the index compared will lead to some improvement compared to scanning the whole table).

It is further important to realize that when a query requests only columns that all appear in the index, this can lead to a further performance improvement. In this case once the DB engine has found the rows obeying the constraints, all the required data can be read off from the rows in the index itself. I.e. it is not necessary to do a lookup in the main table itself. This so called "bookmark lookup" (please ask Google for more details) can kill the performance improvement the use of an index, especially if large amounts of rows obey the constraints.

For this purpose many of our indexes contain more columns than could be useful in a sorting algorithm. For example the centralMvir in the index discussed above, being a floating point value, will rarely have values shared by many rows. Hence its ordering will not produce intervals of constant values that can be further ordered by subsequent values such as the stellarMass. However if one wants to obtain the stellarMass of galaxies in a given snapshot and a given fofGroup, one needs look no further than this index. A bookmark lookup is not required.

NB the index under discussion is itself not optimal. For it contains both fofId and snapnum. But a fofID identifies a single friends-of-friends group which is always contained in a single snapnum. I.e. the snapnum is actually constrained by the fofID. Hence including the snapnum is only of interest if one has NO constraint on fofId.

table_name	index_name	column_name	column_rank
MR	ix_guo2010a_mr_snapnum_sampling	snapnum	1
MR	ix_guo2010a_mr_snapnum_sampling	fofID	2
MR	ix_guo2010a_mr_snapnum_sampling	centralMvir	3
MR	ix_guo2010a_mr_snapnum_sampling	type	4
MR	ix_guo2010a_mr_snapnum_sampling	stellarMass	5
MR	ix_guo2010a_mr_snapnum_sampling	bulgeMass	6
MR	ix_guo2010a_mr_snapnum_sampling	gDust	7
MR	ix_guo2010a_mr_snapnum_sampling	rDust	8
MR	ix_mr_descendantid	descendantId	1
MR	ix_mr_descendantid	galaxyID	2
MR	ix_mr_descendantid	lastProgenitorId	3
MR	ix_mr_descendantid	stellarMass	4
MR	ix_mr_descendantid	type	5
MR	ix_mr_descendantid	snapnum	6
MR	ix_mr_fofid_type	fofID	1
MR	ix_mr_fofid_type	type	2
MR	ix_mr_haloid_tree	haloID	1
MR	ix_mr_haloid_tree	type	2
MR	ix_mr_haloid_tree	galaxyID	3
MR	ix_mr_haloid_tree	lastProgenitorId	4