3.2.5  :  Spatial Indexes

A class of queries that are of interest and that are not trivial to treat efficiently with standard relational database design are those that aim to find objects that are spatially close to each other. A typical question is:

Find all halos within a distance of :R> Mpc from a given position (:X,:Y,:Z)
(a parameter in SQL is indicated by a name prefixed with a colon), This can be trivially translated to SQL as follows:
SELECT *
  FROM MPAHaloTrees..MHalo H
 WHERE SQRT((H.X-:X)*(H.X-:X)+(H.Y-:Y)*(H.Y-:Y)+(H.Z-:Z)*(H.Z-:Z)) <= :R
The problem with this query is that the database can not make use of standard techniques for performance improvement. Such techniques generally require a representation of the data in which desired objects are closely located to each other on disk. Disks are basically 1-dimensional structures, so an ordering in the quantity, or set of quantities, that defines the "nearness" is generally used. See the duiscussion on indexes elsewhere in this documentation or in the literature on relational databases. For spatial structures in 2 dimensions or higher, there is no exact way to accomplish such an ordering. This implies that without special treatment the database will have to scan through the whole table and evaluate the expression in the WHERE clause. During this scan the whole table will have to be read from disk and it is this IO part of the query execution which is really the expensive step and should if possible be minimized.

There are some techniques though that can speed up these queries. These require special data structures tht map better form 3D to 1D. We have incorporated a number of these in the Millennium database and describe these now.

Zone index

This index currently (2007-06-11) offers the best way to perform queries that have a spatial component in the Millennium database. To show how it works we use a simpler version of the query above, that still suffers from the same problem.
SELECT *
  FROM MPAHaloTrees..MHalo H
 WHERE SNAPNUM = 63
   AND X between 10 and 20
   AND Y BETWEEN 20 AND 30
   AND Z BETWEEN 220 AND 230
To each object (halo or galaxy) we associate integer valued positional coordinates, obtained from binning the simulation box by a relatively coarse grid. We have chosen grid cells of size about 10 h-1 Mpc, 503 cells in the Millennium, 73 in the millimil. When the size of the grid cell is L, the index columns are defined as ix=floor(x/L) and similar for y and z.

This implies that all objects with the same values for ix, iy and iz will be in the same cube of size L. An index is created on these columns which now implies that queries that include these columns explicitly can in general be executed much more quickly. The index contains the following columns in order.

snapsnum, ix,iy,iz,x,y,z,haloid/galaxyid 
The following query illustrates how to write the previous query that will allow this index to be used:
SELECT *
  FROM MPAHaloTrees..MHalo H
 WHERE SNAPNUM = 63
   AND IX = 1
   AND IY = 2
   AND IZ = 22
Note that if you want to get all the properties for the halos in this region, the index will allow you to find the halo IDs quickly, but for each of these a separate lookup needs to be performed in the main table. As these halos will be roughly randomly distributed over the main table, each lookup may require a random read. This is a relatively slow procedure as the head of the device reading the data form disk must be placed to a new position each time, which takes of the order of 1 milli second. Depending on the actual number of halos found quickly in the index lookup, this socalled "bookmark lookup" may still cause the query to time out before copletion.
In this case one may wish to break up the main query in multiple pieces based on disjunct selections of the ix/iy/iz columns. Alternativel one may first store only the haloids in a separate table, which will be fast as all the information is contained in the index itself and no lookup is required. For example one may store the results in the table MYSUBVOLUME_IDS as in
SELECT *
  INTO MYSUBVOLUME_IDS
  FROM MPAHaloTrees..MHalo H
 WHERE SNAPNUM = 63
   AND IX = 1
   AND IY = 2
   AND IZ = 22
The full information can then be obtained by joining that table to the main table. This will again require bookmark lookups, but the number of these can be controlled more explicitly, for example by walking through the table in MyDB in steps of a size that will not cause the timeout. To be able to do so it is useful to first order the table by haloid:
create clustered index ix_MYSUBVOLUME_IDS on MYSUBVOLUME_IDS(haloID)
Then one can issue the first query:
select h.*
 from (
  select top 50000 haloid 
    from MYSUBVOLUME_IDS
    order by haloid ) mt
  ,    mpahalotrees..mhalo h
 where mt.haloid = h.haloid
This will produce 50000 halos. Retrieve the haloID of the last of these and issue the next query
select h.*
 from (
  select top 50000 haloid 
    from MYSUBVOLUME_IDS
   where haloid > :LASTHALOID
    order by haloid ) mt
  ,    mpahalotrees..mhalo h
 where mt.haloid = h.haloid
and repeat this procedure until all the halos have been retrieved. This can easily be coded in a simple script, see the pages on wget, R, and IDL.

phKey

Most objects in the Millennium databases also have another column indicating their position,
phkey
. This column indicates the grid cell the object is in created by a Peano-Hilbert space filling curve. The Peano-Hilbert curve in particular has the nice property that points that are nearby on the curve are also nearby in space. For other such curves this is not always true, but often, An column containing the (discrete) distance along such a curve can therefore provide a reasonable mapping from 1D-<3D, and when indexed, can be used to implement spatial searches efficiently. The code for this is too complex to use directly in T-SQL and the functions must be implemented in a more general programming language. We have created such a library in C# and embedded this in our SQLServer database. Here we describe how to use the functions and other components of this library in your SQL queries.

For this purpose the simulation volume of the Millennium run has been subdivided in 2563 grid cells and to each cell a value has been assigned corresponding to the step number along such a curve.

References

For an interesting general discussion of this problem and various approaches solutions see the PhD thesis by Volker Markl and references therein.

A very general and comprehensive reference to spatial indexes is the following book.
Hana Samet, 2006, Foundations of Multidimentsional and Metric Data Structures
Morgan Kaufman, ISBN 13:978-0-12-369446-1