3.2.6  :  Random Sampling

For many science questions it is strictly speaking not necessary to use the full Millennium database. And for performance reasons it would often be preferrable to use a smaller dataset that is in some way a good representation of the complete set. The milli-Millennium could be used for these purposes, however the problem with it is that it misses the long wavelength modes of the initial perturbation field and does therefore not correspond to a random volume of the same size from the full Millennium. It's use is mainly in the opportunity it offers to test SQL queries that will work on the full Millennium database, while giving much faster return times.

To offer an easy option of drawing proper random samples from the Millennium run we have added a column to some of the datasets which contains an integer random number generated using the java.lang.Math.random() function. For the Millennium datasets the number ranges between 0 and 1000000, for the milli-Millennium data between 0 and 1000. For example, to choose a roughly "1 in 1000" sample of all galaxies at the final snaphot one may choose to have all galaxies with this random number in the interval [0,1000]. Note that this does not directly offer the an option to have a random sample of a prescribed size, but as the example below shows it will be not hard to do so either. Currently (2006-08-01) this column is only added to MPAGalaxies..MGalaxy and millimil..MMGalaxy. The following queries show how one might use this feature.

The first takes a random sample that returns roughly 1% of all the galaxies from the MPA Millennium galaxy catalogue at the final snapshot. This should roughly produce 200000 galaxies:

select *
  from mpagalaxies..mgalaxy
 where snapnum=63
   and random between 0 and 10000
By choosing different intervals of the same size different smaples will be chosen.

The following query picks a random sample of exactly 1000 entries:

select top 1000 *
  from mpagalaxies..mgalaxy
 where snapnum=63
   and random between 0 and 100
The selection on random would produce about 2000 galaxies, the top 1000 takes only the first 1000 of these.

This function will be most useful in combination with a select ... into ... in a MyDB.