I stumbled on an article about Nassim Taleb’s book “The Black Swan” recently. Quick summary: It’s very easy to make mistakes when applying the mathematical theory of statistics to the real world. Not everything in the world can be described as independent random variables with clearly defined properties.

So whenever you put real money behind your statistical reasoning, you had better be really sure that your calculations are correct and that all the preconditions for applying the theory are met.

Furthermore, the rare event you are tempted to ignore (the “this won’t happen in my lifetime” syndrome) may have consequences so bad that you simply shouldn’t discount it.

All this reminded me a bit of what I wrote 12 years ago in my master’s thesis on Pseudorandom Number Generation:

Classic statistical tests check whether the test statistic deviates too much from its expected value. If we are only interested in the expected outcome of a similar simulation problem, such *one-level* statistical tests are all we need in order to be confident about the accuracy of the simulation.

On the other hand, we might be interested in the distribution of the simulation’s outcome. For this goal, hitting the expected value is not enough; the variance of the result is now important, too. Thus we will demand the same behaviour from the test statistic.

Let us illustrate this principle with an example. We want to test the well-known strategy of doubling the ante in a game of roulette. It is supposed to guarantee winning the initial ante and works like this: If we win the first round, we get back twice the ante, a net win of one ante. If we do not win, the ante is doubled for the next round. If this round is won, we get back four times the initial ante while we invested three times the initial ante, resulting in a net win of one ante. In case of bad luck we double the ante again, hoping for eight times the ante for an investment of seven. As we hope that we will finally win before our capital is drained, a net win seems to be certain.
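The arithmetic of the doubling scheme can be checked with a few lines of Python. This is only a sketch to accompany the text: the function name `net_win_after_losses` and its parameters are made up for illustration, and it assumes an even-money bet that pays back twice the stake.

```python
def net_win_after_losses(losses_before_win: int, ante: int = 1) -> int:
    """Net result of one doubling series that ends with a win.

    Assumes an even-money bet: a winning round pays back twice the stake.
    """
    stake = ante
    invested = 0
    for _ in range(losses_before_win):
        invested += stake   # this stake is lost
        stake *= 2          # double the ante for the next round
    invested += stake       # the stake of the winning round
    payout = 2 * stake      # we get back twice the winning stake
    return payout - invested


# Whatever the number of losses, the series ends one ante ahead.
for k in range(5):
    print(k, net_win_after_losses(k))
```

The catch is the stake itself: after k losses the next bet is 2^k antes, so the capital needed grows exponentially with the length of the losing run.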

In order to simulate this we need random numbers to determine whether we will win the current bet. The probabilities are 18/37 for winning and 19/37 for losing each round, respectively. It seems natural to use the lengths of runs as a test statistic to test our source of PRN for its fitness to simulate a real roulette table. The probability that the maximal run length in 500 tries is greater than 15 is smaller than all usual values for α, so according to the corresponding statistical test we should reject all sequences where such runs do occur.
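The run-length probability can be computed exactly with a small dynamic program over the length of the current run. The sketch below rests on an assumption the text leaves open: it reads "run" as a run of consecutive losses (probability 19/37 each). The function name `prob_run_exceeds` is likewise only illustrative.

```python
def prob_run_exceeds(n_trials: int, p: float, max_len: int) -> float:
    """Probability that n_trials independent rounds contain a run of more
    than max_len consecutive occurrences of an outcome with probability p."""
    # state[j]: probability that the current run has length j and that
    # no run longer than max_len has occurred so far
    state = [0.0] * (max_len + 1)
    state[0] = 1.0
    for _ in range(n_trials):
        new = [0.0] * (max_len + 1)
        new[0] = (1.0 - p) * sum(state)   # the run is interrupted
        for j in range(max_len):
            new[j + 1] = p * state[j]     # the run grows by one
        # p * state[max_len] leaves the table: the run became too long
        state = new
    return 1.0 - sum(state)


# Losing runs longer than 15 within 500 rounds of roulette
print(prob_run_exceeds(500, 19 / 37, 15))
```

Under that reading the result comes out below 0.01, so it is indeed smaller than the usual choices α = 0.05 and α = 0.01.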

When we now run the simulation with these prescreened sequences, we will *never ever* experience a loss as long as we have enough money for 15 steps of doubling the ante. Thus we should conclude that the strategy works. As we know, this is not true. So what went wrong with our simulation?
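The effect of the prescreening can be reproduced with a small Monte Carlo experiment. The following is a minimal sketch, not the code from the thesis: all names are hypothetical, Python's built-in `random` module stands in for the PRN source, a "run" is again taken to be a run of losses, and going bust means losing a bet after the ante has already been doubled `max_doublings` times.

```python
import random

P_WIN = 18 / 37  # probability of winning an even-money roulette bet


def spin_sequence(rng: random.Random, n_spins: int = 500) -> list:
    """One sequence of rounds; True means the round is won."""
    return [rng.random() < P_WIN for _ in range(n_spins)]


def longest_losing_run(outcomes) -> int:
    longest = current = 0
    for won in outcomes:
        current = 0 if won else current + 1
        longest = max(longest, current)
    return longest


def play_doubling(outcomes, max_doublings: int = 15):
    """Play the doubling strategy; returns (net result in antes, busted)."""
    net, stake, doublings = 0, 1, 0
    for won in outcomes:
        if won:
            net += stake
            stake, doublings = 1, 0      # start a new series
        else:
            net -= stake
            if doublings == max_doublings:
                return net, True         # no money left to double again
            stake *= 2
            doublings += 1
    return net, False


def bust_rate(n_sequences: int, prescreen: bool,
              max_doublings: int = 15, seed: int = 0) -> float:
    """Fraction of accepted sequences in which the player goes bust."""
    rng = random.Random(seed)
    accepted = busts = 0
    for _ in range(n_sequences):
        outcomes = spin_sequence(rng)
        # The "statistical test": reject sequences with overly long runs
        if prescreen and longest_losing_run(outcomes) > max_doublings:
            continue
        accepted += 1
        busts += play_doubling(outcomes, max_doublings)[1]
    if accepted == 0:
        raise ValueError("every sequence was rejected by the prescreen")
    return busts / accepted


# A smaller budget (7 doublings) makes the effect visible in a short run.
print(bust_rate(200, prescreen=False, max_doublings=7))
print(bust_rate(200, prescreen=True, max_doublings=7))
```

With prescreening the bust rate is zero by construction, because every sequence that could drain the capital has been thrown away before the simulation starts. Each bust, however, costs 2^(max_doublings + 1) - 1 antes, which is exactly the loss the prescreened simulation can never see.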

The statistical test considered it equally important whether the sequence in question was “well-behaved” or not, whereas the simulation assigned completely different weights to those cases. Thus the area that the test considered to be insignificant (smaller than α) played a major role in the simulation (more than 1/2).

There are some other cases of simulations where we are not so much interested in the average case, but in the extreme ones. Consider for example all those safety measures in power plants or other machinery where a rare sequence of occurrences might lead to catastrophic results. When simulating these safety systems, one must not exclude unusual sequences a priori.

Please note that the distinction between level-1 and level-2 tests (tests which test the distribution of the results of a level-1 test) is arbitrary. The test statistic of a level-2 test is just another function of the underlying set of PRN, too.