devs <- rnorm(100)
tdevs <- qt(pnorm(devs), df = 1)
plot(devs, tdevs)
abline(0, 1)
Simulation
References:
- Gentle: Computational Statistics
- Monahan: Numerical Methods of Statistics
Videos (optional):
There are various videos from 2020 in the bCourses Media Gallery that you can use for reference if you want to.
- Video 1. Random number generation, part 1
- Video 2. Random number generation, part 2
- Video 3. Rejection sampling
- Video 4. Importance sampling
Many (most?) statistical papers include a simulation (i.e., Monte Carlo) study. Many papers on machine learning methods also include a simulation study. The basic idea is that closed-form mathematical analysis of the properties of a statistical or machine learning method/model is often hard to do. Even if possible, it usually involves approximations or simplifications. A canonical situation in statistics is that we have an asymptotic result and we want to know what happens in finite samples, but often we do not even have the asymptotic result. Instead, we can estimate mathematical expressions using random numbers. So we design a simulation study to evaluate the method/model or compare multiple methods. The result is that the researcher carries out an experiment (on the computer, sometimes called in silico), generally varying different factors to see what has an effect on the outcome of interest.
The basic strategy generally involves simulating data and then using the method(s) on the simulated data, summarizing the results to assess/compare the method(s).
Most simulation studies aim to approximate an integral, generally an expected value (mean, bias, variance, MSE, probability, etc.). In low dimensions, methods such as Gaussian quadrature are best for estimating an integral but these methods don’t scale well (we’ll discuss this in Unit 12 on integration/differentiation), so in higher dimensions we often use Monte Carlo techniques.
To be more concrete:
- If we have a method for estimating a model parameter (including uncertainty), such as a regression coefficient, what properties do we want the method to have and what criteria could we use?
- If we have a prediction method (including prediction uncertainty), what properties do we want the method to have and what criteria could we use?
- If we have a method for doing a hypothesis test, what criteria would we use to assess the hypothesis test? What properties do we want the test to have?
- If we have a method for finding a confidence interval or a prediction interval, what criteria would we use to assess the interval?
1. Monte Carlo considerations
Motivating example
Let's consider linear regression, with observations $Y = (y_1, \ldots, y_n)$, an $n \times p$ matrix of predictors $X$, and the least squares estimator $\hat\beta = (X^\top X)^{-1}X^\top Y$. Under the standard assumptions ($E(Y) = X\beta$ and $\mathrm{Var}(Y) = \sigma^2 I$), we can derive properties such as $E(\hat\beta)$ and $\mathrm{Var}(\hat\beta)$ analytically.
But suppose that we're interested in the properties of regression estimation when in reality the mean is not linear in $X$, or the errors do not satisfy the standard assumptions, or we want to evaluate an alternative estimator (e.g., a robust regression estimator) whose properties are hard to derive in closed form.
Instead we decide to use a Monte Carlo estimate. To keep the notation simple, let's just consider one element of the vector $\beta$ (i.e., one regression coefficient) and call it $\beta$. If we generate $m$ datasets from the assumed data-generating distribution and $\hat\beta_i$ is the estimate from the $i$th dataset, then we can estimate $E(\hat\beta)$ as
$$\widehat{E(\hat\beta)} = \bar{\hat\beta} = \frac{1}{m}\sum_{i=1}^m \hat\beta_i.$$
Or to estimate the variance, we have
$$\widehat{\mathrm{Var}(\hat\beta)} = \frac{1}{m}\sum_{i=1}^m \bigl(\hat\beta_i - \bar{\hat\beta}\bigr)^2.$$
In evaluating the performance of regression under non-standard conditions or the performance of our robust regression procedure, what decisions do we have to make to be able to carry out our Monte Carlo procedure?
Next let’s think about Monte Carlo methods in general.
Monte Carlo (MC) basics
Monte Carlo overview
The basic idea is that we often want to estimate $\phi \equiv E_f(h(Y))$ for $Y \sim f$. Note that if $h$ is an indicator function, this covers estimation of probabilities, and if $h$ involves squared terms, it covers variances and mean squared errors.
We get an MC estimate of $\phi$ based on an i.i.d. sample of $m$ draws $Y_1, \ldots, Y_m$ from $f$:
$$\hat\phi = \frac{1}{m}\sum_{i=1}^m h(Y_i),$$
which is justified by the law of large numbers.
Note that in most simulation studies, $Y$ is an entire simulated dataset rather than a single value, so each of the $m$ "draws" is a dataset, $Y_i = (Y_{i1}, \ldots, Y_{in})$, and $h$ is the estimator or other summary computed from that dataset.
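As a quick illustration (a toy example of my own, not from the notes), here is a Monte Carlo estimate of a tail probability, where $h$ is an indicator function:

## MC estimate of phi = P(Y > 2) = E(I(Y > 2)) for Y ~ N(0, 1)
set.seed(1)
m <- 100000
y <- rnorm(m)
phiHat <- mean(y > 2)                    # (1/m) * sum of h(y_i)
c(phiHat, pnorm(2, lower.tail = FALSE))  # compare to the true value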
Back to the regression example
Let's relate that back to our regression example. In that particular case, if we're interested in whether the regression estimator is biased, we want to know
$$\mathrm{bias} = E(\hat\beta) - \beta,$$
where $\beta$ is the true value used to generate the data, and we estimate $E(\hat\beta)$ by the Monte Carlo average $\bar{\hat\beta}$.
For the variance, we have
$$\mathrm{Var}(\hat\beta) = E\bigl((\hat\beta - E(\hat\beta))^2\bigr),$$
and we can use the Monte Carlo estimate of this,
$$\widehat{\mathrm{Var}(\hat\beta)} = \frac{1}{m}\sum_{i=1}^m \bigl(\hat\beta_i - \bar{\hat\beta}\bigr)^2,$$
where $\bar{\hat\beta} = \frac{1}{m}\sum_{i=1}^m \hat\beta_i$.
Finally note that we also need to use the Monte Carlo estimate of $E(\hat\beta)$, namely $\bar{\hat\beta}$, inside the variance estimate, since the true value of $E(\hat\beta)$ is unknown.
We might also be interested in the coverage of a confidence interval. In that case we have
$$\widehat{\mathrm{coverage}} = \frac{1}{m}\sum_{i=1}^m I(\beta \in \mathrm{CI}_i),$$
where $\mathrm{CI}_i$ is the interval computed from the $i$th simulated dataset. Of course we want the estimated coverage to be close to the nominal level of the interval (e.g., 95%).
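To make the regression example concrete, here is a minimal sketch of a Monte Carlo assessment of bias, variance, and confidence interval coverage for a slope coefficient. The specific choices (sample size, number of replicates, true coefficients, t-distributed errors, fixed design) are purely illustrative.

set.seed(1)
m <- 1000; n <- 50
beta <- c(1, 2)                        # true intercept and slope (illustrative)
x <- runif(n)                          # design held fixed across replicates
betaHat <- se <- numeric(m)
for (i in 1:m) {
    y <- beta[1] + beta[2] * x + rt(n, df = 3)   # errors violating normality
    fit <- lm(y ~ x)
    betaHat[i] <- coef(fit)[2]
    se[i] <- sqrt(vcov(fit)[2, 2])
}
bias <- mean(betaHat) - beta[2]        # MC estimate of bias
varHat <- var(betaHat)                 # MC estimate of Var(betaHat)
covg <- mean(abs(betaHat - beta[2]) <= qt(0.975, df = n - 2) * se)  # CI coverage
c(bias = bias, variance = varHat, coverage = covg)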
Simulation uncertainty
Since $\hat\phi$ is based on a finite number of simulation replicates, $m$, it is itself a random variable with its own uncertainty, which we call the simulation (or Monte Carlo) uncertainty.
Note that this is particularly confusing if we are using Monte Carlo to estimate a variance: we then have the simulation variance of an estimated variance.
The simulation variance is $\mathrm{Var}(\hat\phi) = \sigma^2/m$, where $\sigma^2 = \mathrm{Var}(h(Y))$, so the simulation standard error is $\sigma/\sqrt{m}$. It decreases only with the square root of the number of replicates: to cut it by a factor of 10, we need 100 times as many replicates.
Note that in the simulation setting, the randomness in the system is very well-defined (as it is in survey sampling, but unlike in most other applications of statistics), because it comes from the pseudo-random number generation that we carry out as part of our attempt to estimate $\phi$.
Back to the regression example
Some examples of simulation variances we might be interested in, in the regression example, include:
- uncertainty in our estimate of bias, $\widehat{\mathrm{Var}}\bigl(\bar{\hat\beta} - \beta\bigr)$;
- uncertainty in the estimated variance of the estimated coefficient, $\widehat{\mathrm{Var}}\bigl(\widehat{\mathrm{Var}(\hat\beta)}\bigr)$;
- uncertainty in the estimated mean square prediction error, $\widehat{\mathrm{Var}}\bigl(\widehat{\mathrm{MSPE}}\bigr)$.
In all cases we have to estimate the simulation variance from the replicates themselves, hence the outer $\widehat{\mathrm{Var}}$ notation.
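For example (an illustrative helper, not code from the notes), the simulation standard error of any Monte Carlo average can be computed directly from the per-replicate values:

## simulation standard error of a Monte Carlo average of i.i.d. replicate values
seSim <- function(vals) sd(vals) / sqrt(length(vals))
## e.g., for the regression sketch above, seSim(betaHat) is the simulation
## standard error of the estimated bias, so one might report bias +/- 2*seSim(betaHat)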
Final notes
Sometimes the simulated values are generated in a dependent fashion (e.g., in MCMC), in which case the simple $\sigma^2/m$ formula for the simulation variance no longer holds because the draws are not i.i.d.; the Monte Carlo estimate can still be valid, but its simulation uncertainty must be assessed differently (e.g., accounting for autocorrelation).
Variance reduction (optional)
There are some tools for variance reduction in MC settings. One is importance sampling (see the importance sampling discussion in Section 5). Others are the use of control variates and antithetic sampling. I haven't personally run across the latter two in practice, so I'm not sure how widely used they are and won't go into them here.
In some cases we can set up natural strata, for which we know the probability of being in each stratum. Then we would estimate the quantity of interest within each stratum and combine the stratum-specific estimates, weighting by the known stratum probabilities; this can reduce the variance relative to sampling in an unstratified way. A sketch is given below.
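A minimal sketch of the stratified idea (my own illustrative example, using two equal-probability strata for a standard normal and the inverse CDF to sample within each stratum):

## estimate E(exp(Y)) for Y ~ N(0,1) using strata Y < 0 and Y >= 0 (prob 0.5 each)
set.seed(1)
h <- function(y) exp(y)
m <- 1000                              # draws per stratum
y_neg <- qnorm(runif(m, 0, 0.5))       # stratum Y < 0, via inverse CDF
y_pos <- qnorm(runif(m, 0.5, 1))       # stratum Y >= 0
est_strat <- 0.5 * mean(h(y_neg)) + 0.5 * mean(h(y_pos))
est_simple <- mean(h(rnorm(2 * m)))    # simple MC with the same total sample size
c(est_strat, est_simple, exp(1/2))     # true value is exp(1/2)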
Another strategy that comes up in MCMC contexts is Rao-Blackwellization. Suppose we want to know $E(h(X_1, X_2))$ and we can compute the conditional expectation $E(h(X_1, X_2) \mid X_2)$ analytically. By iterated expectation we can then either draw from the joint distribution of $(X_1, X_2)$ and average $h(X_1, X_2)$, or draw from the marginal distribution of $X_2$ and average $E(h(X_1, X_2) \mid X_2)$. The latter (Rao-Blackwellized) estimator has no more, and generally less, variance, because the conditional expectation has already averaged out one source of variability.
2. Design of simulation studies
Consider the paper that is part of PS5. We can think about designing a simulation study in that context.
First, what are the key issues that need to be assessed to evaluate their methodology?
Second, what do we need to consider in carrying out a simulation study to address those issues? I.e., what are the key decisions to be made in setting up the simulations?
Basic steps of a simulation study
1. Specify what makes up an individual experiment (i.e., the individual simulated dataset) given a specific set of inputs: sample size, distribution(s) to use, parameter values, statistic of interest, etc. In other words, exactly how would you generate one simulated dataset?
2. Often you'll want to see how your results vary if you change some of the inputs, e.g., sample sizes, parameter values, data generating mechanisms; so determine what factors you'll want to vary. Each unique combination of input values will be a scenario.
3. Write code to carry out the individual experiment and return the quantity of interest, with arguments to your code being the inputs that you want to vary.
4. For each combination of inputs you want to explore (each scenario), repeat the experiment $m$ times. Note this is an easily parallelized calculation (in both the data-generating dimension and the inputs dimension(s)).
5. Summarize the results for each scenario, quantifying simulation uncertainty.
6. Report the results in graphical or tabular form.
(A minimal skeleton implementing these steps is sketched below.)
Often a simulation study will compare multiple methods, so you’ll need to do steps 3-6 for each method.
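Here is one possible skeleton of these steps; the function names, the inputs varied, and the quantity of interest are my own illustrative choices, not prescribed by the notes.

## Step 3: one experiment, returning the quantity of interest
doExpt <- function(n, errDist = c("norm", "t")) {
    errDist <- match.arg(errDist)
    x <- runif(n)
    err <- if (errDist == "norm") rnorm(n) else rt(n, df = 3)
    y <- 1 + 2 * x + err
    coef(lm(y ~ x))[2]                 # estimated slope
}

## Step 2/4: scenarios and replication (easily parallelized)
scenarios <- expand.grid(n = c(20, 100), errDist = c("norm", "t"),
                         stringsAsFactors = FALSE)
m <- 500
set.seed(1)
results <- lapply(seq_len(nrow(scenarios)), function(j)
    replicate(m, doExpt(scenarios$n[j], scenarios$errDist[j])))

## Step 5: summarize each scenario, with simulation uncertainty
summ <- data.frame(scenarios,
                   bias = sapply(results, function(r) mean(r) - 2),
                   seSim = sapply(results, function(r) sd(r) / sqrt(m)))
summ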
Various considerations
Since a simulation study is an experiment, we should use the same principles of design and analysis we would recommend when advising a practitioner on setting up a scientific experiment.
These include efficiency, reporting of uncertainty, reproducibility and documentation.
In generating the data for a simulation study, we want to think about what structure real data would have that we want to mimic in the simulation study: distributional assumptions, parameter values, dependence structure, outliers, random effects, sample size ($n$), etc.
All of these may become input variables in a simulation study. Often we compare two or more statistical methods conditioning on the data context and then assess whether the differences between methods vary with the data context choices. E.g., if we compare an MLE to a robust estimator, which is better under a given set of choices about the data generating mechanism and how sensitive is the comparison to changing the features of the data generating mechanism? So the “treatment variable” is the choice of statistical method. We’re then interested in sensitivity to the conditions (different input values).
Often we can have a large number of replicates ($m$), because an individual replicate is usually cheap to compute, so the simulation uncertainty in our summaries can be made quite small.
We might denote the outcome of interest, which could be the statistical estimator under each of two methods, as $\hat\theta_{ij}$ for simulated dataset $i = 1, \ldots, m$ and method $j$.
One can think about choosing $m$ so that the simulation standard error of the quantities of interest is below some desired level, e.g., based on a small pilot run.
When comparing methods, it’s best to use the same simulated datasets for each level of the treatment variable and to do an analysis that controls for the dataset (i.e., for the random numbers used), thereby removing some variability from the error term. A simple example is to do a paired analysis, where we look at differences between the outcome for two statistical methods, pairing based on the simulated dataset.
One can even use the "same" random number generation for the replicates under different conditions. E.g., in assessing sensitivity to a parameter of the data-generating mechanism, one can reset the seed (or reuse the same simulated datasets) for each value of that parameter, so that differences between conditions are not muddied by differences in the underlying random draws. A sketch of the paired idea follows.
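The sketch below (my own illustration) fits two estimators to the same simulated datasets and analyzes the within-dataset differences; MASS::rlm is used here simply as an example of a robust method.

set.seed(1)
m <- 500; n <- 50
diffs <- numeric(m)
for (i in 1:m) {
    x <- runif(n)
    y <- 1 + 2 * x + rt(n, df = 3)          # one simulated dataset
    betaOLS <- coef(lm(y ~ x))[2]           # method 1: least squares
    betaRob <- coef(MASS::rlm(y ~ x))[2]    # method 2: a robust fit
    diffs[i] <- (betaOLS - 2)^2 - (betaRob - 2)^2  # paired difference in squared error
}
c(mean(diffs), sd(diffs) / sqrt(m))         # paired comparison with simulation SE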
Experimental Design (optional)
A typical context is that one wants to know the effect of multiple input variables on some outcome. Often scientists, and even statisticians doing simulation studies, will vary one input variable at a time. As we know from standard experimental design, this is inefficient.
The standard strategy is to discretize the inputs, each into a small number of levels. If we have a small enough number of inputs and of levels, we can do a full factorial design (potentially with replication). For example, if we have three inputs and three levels each, we have $3^3 = 27$ scenarios (treatment combinations).
As the number of inputs and/or levels increases to the point that we can’t carry out the full factorial, a fractional factorial is an option. This carefully chooses which treatment combinations to omit. The goal is to achieve balance across the levels in a way that allows us to estimate lower level effects (in particular main effects) but not all high-order interactions. What happens is that high-order interactions are aliased to (confounded with) lower-order effects. For example you might choose a fractional factorial design so that you can estimate main effects and two-way interactions but not higher-order interactions.
In interpreting the results, I suggest focusing on the decomposition of sums of squares and not on statistical significance. In most cases, we expect the inputs to have at least some effect on the outcome, so the null hypothesis is a straw man. Better to assess the magnitude of the impacts of the different inputs.
When one has a very large number of inputs, one can use the Latin hypercube approach to sample the input space in a uniform way, spreading the points out so that each input is sampled evenly. Assume that each input has been scaled to the unit interval; a Latin hypercube design with $k$ points divides each input's range into $k$ equal subintervals and places the points so that each subinterval of each input contains exactly one point.
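For example, the lhs package (one option among several; assumed installed here) generates such designs:

## Latin hypercube sample of 10 points for 4 inputs, each scaled to (0, 1)
library(lhs)
set.seed(1)
design <- randomLHS(10, 4)      # 10 design points, 4 inputs
## each column is evenly spread: one point in each of the 10 equal subintervals
apply(design, 2, function(col) table(cut(col, breaks = seq(0, 1, by = 0.1))))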
Even amongst statisticians, taking an experimental design approach to a simulation study is not particularly common, but it’s worth considering.
3. Implementation of simulation studies
Luke Miratrix (a UCB Stats PhD alum) has prepared a nice tutorial on carrying out a simulation study, including helpful R code. So if the discussion here is not concrete enough or you want to see how to effectively implement such a study, see simulation_tutorial_miratrix.pdf and the similarly named R code file.
Computational efficiency
Parallel processing is often helpful for simulation studies. The reason is that simulation studies are embarrassingly parallel: we can send each replicate to a different processor and then collect the results back, and the speedup should scale directly with the number of processors used. Since we often need to do some sort of looping, writing code in C/C++ and compiling and linking to the code from R may also be a good strategy, albeit one not covered in this course.
Handy functions in R include expand.grid(), which gives all combinations of a set of vectors, and replicate(), which carries out the same R expression (often a function call) a specified number of times. This can replace the use of a for loop with some gains in cleanliness of your code. Storing results in an array is a natural approach.
thetaLevels <- c("low", "med", "hi")
n <- c(10, 100, 1000)
tVsNorm <- c("t", "norm")
levels <- expand.grid(thetaLevels, tVsNorm, n)

## example of replicate() -- generate m sets of correlated normals
set.seed(1)
genFun <- function(n, theta = 1){
    u <- rnorm(n)
    x <- runif(n)
    Cov <- exp(-fields::rdist(x)/theta)
    U <- chol(Cov)
    return(cbind(x, crossprod(U, u)))
}
m <- 20
simData <- replicate(m, genFun(100, 1))
dim(simData) # 100 observations by {x, y} values by 20 replicates
[1] 100 2 20
Analysis and reporting
Often results are reported simply in tables, but it can be helpful to think through whether a graphical representation is more informative (sometimes it’s not or it’s worse, but in some cases it may be much better). Since you’ll often have a variety of scenarios to display, using trellis plots in ggplot2 via the facet_wrap function will often be a good approach to display how results vary as a function of multiple inputs.
You should set the seed when you start the experiment, so that it’s possible to replicate it. It’s also a good idea to save the current value of the seed whenever you save interim results, so that you can restart simulations (this is particularly helpful for MCMC) at the exact point you left off, including the random number sequence.
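For example (an illustrative sketch; the object and file names are hypothetical), one can save the RNG state alongside interim results and restore it later:

## save interim results together with the current RNG state
interim <- list(results = results, seed = .Random.seed)
saveRDS(interim, "interim.rds")

## ... later, in a fresh session:
interim <- readRDS("interim.rds")
.Random.seed <- interim$seed    # restore the random number stream (at top level)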
To enhance reproducibility, it's good practice to post your simulation code (and potentially simulated data) on GitHub, on your website, or as supplementary material with the journal. Another person should be able to fully reproduce your results, including the exact random number generation that you did (e.g., you should provide code showing how you set the seed for your random number generator).
Many journals are requiring increasingly detailed documentation of the code and data used in your work, including code and data for simulations. Here are the American Statistical Association’s requirements on documenting computations in its journals:
“The ASA strongly encourages authors to submit datasets, code, other programs, and/or extended appendices that are directly relevant to their submitted articles. These materials are valuable to users of the ASA’s journals and further the profession’s commitment to reproducible research. Whenever a dataset is used, its source should be fully documented and the data should be made available as an online supplement. Exceptions for reasons of security or confidentiality may be granted by the Editor. Whenever specific code has been used to implement or illustrate the results of a paper, that code should be made available if possible. [...snip...] Articles reporting results based on computation should provide enough information so that readers can evaluate the quality of the results. Such information includes estimated accuracy of results, as well as descriptions of pseudorandom-number generators, numerical algorithms, programming languages, and major software components used.”
4. Random number generation (RNG)
At the core of simulations is the ability to generate random numbers, and based on that, random variables. On a computer, our goal is to generate sequences of pseudo-random numbers that behave like random numbers but are replicable. The reason that replicability is important is so that we can reproduce the simulation.
Generating random uniforms on a computer
Generating a sequence of random standard uniforms is the basis for all generation of random variables, since random uniforms (either a single one or more than one) can be used to generate values from other distributions. Most random numbers on a computer are pseudo-random. The numbers are chosen from a deterministic stream of numbers that behave like random numbers but are actually a finite sequence (recall that both integers and real numbers on a computer are actually discrete and there are finitely many distinct values), so it’s actually possible to get repeats. The seed of a RNG is the place within that sequence where you start to use the pseudo-random numbers.
Many RNG methods are sequential congruential methods. The basic idea is that the next value is a deterministic function of the current value (and possibly a fixed number, $j$, of earlier values), reduced modulo some large integer $m$, e.g., $u_k = f(u_{k-1}, \ldots, u_{k-j}) \bmod m$; dividing by $m$ gives a value that is treated as a uniform deviate on $(0, 1)$.
Given the construction, such sequences are periodic if the subsequence of the last $j$ values ever reappears, which is of course guaranteed because there is a finite number of possible subsequence values, given that all the $u_k$ are integers in $\{0, 1, \ldots, m-1\}$.
An example of a sequential congruential method is a basic linear congruential generator:
n <- 100
a <- 171
m <- 30269
u <- rep(NA, n)
u[1] <- 7306
for(i in 2:n)
    u[i] <- (a * u[i-1]) %% m
u <- u/m
uFromR <- runif(n)
par(mfrow = c(2,2), mgp = c(1.8, 0.7, 0), mai = c(.5,.5,.3,.1))
plot(1:n, u, type = 'l')
plot(1:n, uFromR, type = 'l')
hist(u, nclass = 25)
hist(uFromR, nclass = 25)
A wide variety of different RNGs have been proposed. Many have turned out to have substantial defects based on tests designed to assess whether the behavior of the RNG mimics true randomness. Some of the behavior we want to ensure is uniformity of each individual random deviate, independence of sequences of deviates, and multivariate uniformity of subsequences. One test that many RNGs don't perform well on is to assess the properties of $k$-tuples of successive deviates, which should be uniformly distributed in the $k$-dimensional unit hypercube; simple congruential generators instead produce tuples that lie on a relatively small number of hyperplanes.
Combining generators can yield better generators. The Wichmann-Hill generator is an option in R and is a combination of three linear congruential generators with $a = \{171, 172, 170\}$ and $m = \{30269, 30307, 30323\}$; the uniform deviate is the fractional part of the sum of the three scaled states, $u_i = (x_i/30269 + y_i/30307 + z_i/30323) \bmod 1$. We can see this by implementing the generator by hand and comparing against R's runif():
RNGkind("Wichmann-Hill")
set.seed(1)
saveSeed <- .Random.seed
uFromR <- runif(10)
a <- c(171, 172, 170)
m <- c(30269, 30307, 30323)
xyz <- matrix(NA, nrow = 10, ncol = 3)
xyz[1, ] <- (a * saveSeed[2:4]) %% m
for(i in 2:10)
    xyz[i, ] <- (a * xyz[i-1, ]) %% m
for(i in 1:10)
    print(c(uFromR[i], sum(xyz[i, ]/m) %% 1))
[1] 0.1297134 0.1297134
[1] 0.9822407 0.9822407
[1] 0.8267184 0.8267184
[1] 0.242355 0.242355
[1] 0.8568853 0.8568853
[1] 0.8408788 0.8408788
[1] 0.3421633 0.3421633
[1] 0.7062672 0.7062672
[1] 0.6212432 0.6212432
[1] 0.6537663 0.6537663
## we should be able to recover the current value of the seed
xyz[10, ]
[1] 24279 14851 10966
.Random.seed[2:4]
[1] 24279 14851 10966
By default R uses something called the Mersenne twister, which is in the class of generalized feedback shift registers (GFSR). The basic idea of a GFSR is to come up with a deterministic generator of bits (i.e., a way to generate sequences of 0s and 1s), $B_1, B_2, B_3, \ldots$, such that the bits behave as if they were independent coin flips; blocks of successive bits are then interpreted as the binary representation of a number in $(0, 1)$, giving the uniform deviates.
Additional notes
Generators should give you the same sequence of random numbers, starting at a given seed, whether you ask for a bunch of numbers at once, or sequentially ask for individual numbers.
When an RNG is invoked without a seed, the software generally has a method for choosing a seed for you, often based on the system clock.
There have been some attempts to generate truly random numbers based on physical randomness. One that is based on quantum physics is http://www.idquantique.com/true-random-number-generator/quantis-usb-pcie-pci.html. Another approach is based on lava lamps!
RNG in R
We can change the RNG in R using RNGkind(). We can set the seed with set.seed(). The seed is stored in .Random.seed. The first element indicates the type of RNG (and the type of normal RV generator). The remaining values are specific to the RNG. In the demo code, we've seen that for Wichmann-Hill, the remaining three numbers are the current states of the three component congruential generators (the xyz values computed above).
In R the default RNG is the Mersenne twister (see ?RNGkind), which is considered to be state-of-the-art: it has some theoretical support, has performed reasonably on standard tests of pseudorandom numbers, and has been used without evidence of serious failure. Plus it's fast (because bitwise operations are fast). In fact this points out one of the nice features of R, which is that for something as important as this, the default is generally carefully chosen by R's developers. The particular Mersenne twister used has a periodicity of $2^{19937} - 1$, and its state is stored in the remaining elements of .Random.seed (624 integers of state plus the current position in the stream).
We can set the seed by passing an integer to set.seed(), which then sets as many actual seeds as required for a given generator. Here I'll refer to the integer passed to set.seed() as the seed. Ideally, nearby seeds should not correspond to getting sequences from the stream that are closer to each other than far-away seeds. According to Gentle (CS, p. 327), the input to set.seed() should be an integer.
So we get replicability by setting the seed to a specific value at the beginning of our simulation. We can then set the seed to that same value when we want to replicate the simulation.
set.seed(1)
rnorm(10)
[1] -1.12774688 0.94127649 1.06642978 -0.40656626 0.30874760 1.42146069
[7] -1.68323660 0.43367702 -0.01607178 -1.72752716
set.seed(1)
rnorm(10)
[1] -1.12774688 0.94127649 1.06642978 -0.40656626 0.30874760 1.42146069
[7] -1.68323660 0.43367702 -0.01607178 -1.72752716
We can also save the state of the RNG and pick up where we left off. So this code will pick up where you had left off, ignoring what happened in between saving to savedSeed and resetting.
set.seed(1)
rnorm(5)
[1] -1.1277469 0.9412765 1.0664298 -0.4065663 0.3087476
savedSeed <- .Random.seed
rnorm(5)
[1] 1.42146069 -1.68323660 0.43367702 -0.01607178 -1.72752716
tmp <- sample(1:50, 2000, replace = TRUE)
.Random.seed <- savedSeed
rnorm(5)
[1] 1.42146069 -1.68323660 0.43367702 -0.01607178 -1.72752716
In some cases you might want to reset the seed upon exit from a function so that a user’s random number stream is unaffected:
f <- function(args) {
    oldseed <- .Random.seed
    ## other code
    .Random.seed <<- oldseed  # note global assignment!
}
Note the need to reassign to the global variable .Random.seed.
RNGversion() allows you to revert to RNG from previous versions of R, which is very helpful for reproducibility.
The RNGs in R generally return 32-bit (4-byte) integers converted to doubles, so there are at most $2^{32}$ distinct values that can be generated.
One way to proceed if you're using both R and C is to have C use the R RNG, calling the C functions that R uses under the hood, which are located in the Rmath library. This way you have a consistent source of random numbers and don't need to worry about issues with RNG in C. If you call C from R, this approach should also work (see details in http://statistics.berkeley.edu/computing/cpp); you could also generate all the random numbers you need in R and pass them to the C function.
Note that whenever a random number is generated, the software needs to retain information about what has been generated, so this is an example where a function must have a side effect not observed by the user. R frowns upon this sort of thing, but it’s necessary in this case.
RNG in parallel
We can generally rely on the RNG in R to give a reasonable set of values. One time when we want to think harder is when doing work with RNG in parallel on multiple processors. The worst thing that could happen is that one sets things up in such a way that every process is using the same sequence of random numbers. This could happen if you mistakenly set the same seed in each process, e.g., using set.seed(mySeed) in R on every process. More details on parallel RNG are given in Unit 6.
5. Generating random variables
There are a variety of methods for generating from common distributions (normal, gamma, beta, Poisson, t, etc.). Since these tend to be built into R and presumably use good algorithms, we won’t go into them. A variety of statistical computing and Monte Carlo books describe the various methods. Many are built on the relationships between different distributions - e.g., a beta random variable (RV) can be generated from two gamma RVs.
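For example, the beta-gamma relationship mentioned above can be used directly (a quick illustrative check, not necessarily how rbeta is implemented):

set.seed(1)
n <- 100000
g1 <- rgamma(n, shape = 2)
g2 <- rgamma(n, shape = 3)
b <- g1 / (g1 + g2)        # Beta(2, 3) random variables
c(mean(b), 2 / (2 + 3))    # compare to the theoretical mean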
Also note that you can call the C functions that implement the R distribution functions as a library (Rmath), so if you’re coding in C or another language, you should be able to make use of the standard functions: {r,p,q,d}{norm,t,gamma,binom,pois,etc.} (as well as a variety of other R math functions, which can be seen in Rmath.h). Information on this can be found in the Writing R Extensions manual on CRAN (section 6.16).
Multivariate distributions
The mvtnorm package supplies code for working with the density and CDF of multivariate normal and t distributions.
To generate a multivariate normal, in Unit 10, we’ll see the standard method based on the Cholesky decomposition:
U <- chol(covMat)
crossprod(U, rnorm(nrow(covMat)))
Side note: for a singular covariance matrix we can use the Cholesky with pivoting, setting as many rows to zero as the rank deficiency. Then when we generate the multivariate normals, they respect the constraints implicit in the rank deficiency. However, you’ll need to reorder the resulting vector because of the reordering involved in the pivoted Cholesky.
Inverse CDF
Most of you know the inverse CDF method. To generate $X \sim F$, where $F$ is a CDF with inverse $F^{-1}$, we generate $U \sim \mathcal{U}(0, 1)$ and set $X = F^{-1}(U)$. It's easy to see that $X$ has the desired distribution: $P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$.
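For example, generating exponential random variables via the inverse CDF (a standard illustration):

set.seed(1)
n <- 10000
u <- runif(n)
x <- -log(1 - u) / 2      # F^{-1}(u) for the Exp(rate = 2) distribution
c(mean(x), 1 / 2)         # compare to the theoretical mean, 1/rate
## equivalently, qexp(u, rate = 2) applies the same inverse CDF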
Rejection sampling
The basic idea of rejection sampling (RS) relies on the introduction of an auxiliary variable, $U$. If $(Y, U)$ is uniformly distributed over the region under the curve of the target density $f$, i.e., over $\{(y, u): 0 < u < f(y)\}$, then the marginal density of $Y$ is $f$.
To implement this we draw from a larger set containing that region and then only keep draws for which $u < f(y)$. The larger set is defined by a proposal (majorizing) density $g$ and a constant $c \ge 1$ such that $c\,g(y) \ge f(y)$ for all $y$:
- generate $y \sim g$;
- generate $u \sim \mathcal{U}(0, 1)$;
- if $u \le f(y)/(c\,g(y))$
then use $y$; otherwise go back to step 1.
The intuition here is graphical: we generate uniformly under the curve $c\,g(y)$, which lies everywhere above $f(y)$, and keep only the draws that also fall under $f(y)$; the accepted values of $y$ then have density $f$.
If $c$ is much larger than one, the overall acceptance probability ($1/c$ when $f$ and $g$ are properly normalized) is low and many proposals are wasted, so we want the envelope $c\,g$ to hug $f$ as closely as possible.
One example of RS is to sample from a truncated normal. Of course we can just sample from the normal and then reject, but this can be inefficient, particularly if the truncation is far in the tail (a case in which the inverse CDF suffers from numerical difficulties). Suppose the truncation point is greater than zero. Working with the standardized version of the normal, you can use a translated exponential with lower end point equal to the truncation point as the majorizing density (Robert 1995, Statistics and Computing; and see the calculations in the demo code). For truncation less than zero, just make the values negative.
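Here is a sketch of that truncated-normal sampler, using the translated exponential proposal with the rate suggested by Robert (1995); treat this as illustrative code rather than a vetted implementation:

rtnorm_tail <- function(nDraws, tau) {
    ## draws from N(0,1) truncated to (tau, Inf), tau > 0, via rejection sampling
    lambda <- (tau + sqrt(tau^2 + 4)) / 2          # rate of the translated exponential
    out <- numeric(nDraws)
    for (i in seq_len(nDraws)) {
        repeat {
            y <- tau + rexp(1, rate = lambda)      # proposal draw
            if (runif(1) <= exp(-(y - lambda)^2 / 2)) {  # acceptance probability f/(c*g)
                out[i] <- y
                break
            }
        }
    }
    out
}
set.seed(1)
draws <- rtnorm_tail(1000, tau = 3)
c(min(draws), mean(draws))   # all draws exceed 3; the mean sits just above 3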
Adaptive rejection sampling (optional)
The difficulty of RS is finding a good enveloping function. Adaptive rejection sampling refines the envelope as the draws occur, in the case of a continuous, differentiable, log-concave density. The basic idea considers the log of the density and involves using tangents or secants to define an upper envelope and secants to define a lower envelope for a set of points in the support of the distribution. The result is that we have piecewise exponentials (since we are exponentiating straight lines on the log scale) as the bounds. We can sample from the upper envelope by sampling from a discrete distribution (to choose the piece) and then from the appropriate exponential. The lower envelope is used for squeezing, i.e., for accepting cheaply without having to evaluate the density. We add points to the set that defines the envelopes whenever we accept a point that required us to evaluate $f(x)$ itself.
Importance sampling
Importance sampling (IS) allows us to estimate expected values, with some commonalities with rejection sampling. The key identity is
$$\phi = E_f(h(Y)) = \int h(y)\,\frac{f(y)}{g(y)}\,g(y)\,dy \approx \frac{1}{m}\sum_{i=1}^m h(y_i)\,\frac{f(y_i)}{g(y_i)},$$
where $y_1, \ldots, y_m$ are sampled from the importance density $g$ and $w_i = f(y_i)/g(y_i)$ act as weights.
Here we don't require the majorizing property, just that the densities have common support, but things can be badly behaved if we sample from a density with lighter tails than the density of interest. So in general we want $g$ to have heavier tails than $f$, so that the weights $f(y_i)/g(y_i)$ are bounded (or at least have small variance).
This suggests we can reduce variance in an IS context by oversampling the values of $y$ for which $|h(y)|f(y)$ is large (e.g., the tail region when estimating a small tail probability) and correcting for the oversampling through the weights.
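A small sketch of this idea (my own illustration): estimating a tiny tail probability for a standard normal by oversampling the tail with a shifted proposal and weighting by $f/g$.

set.seed(1)
m <- 100000
y <- rnorm(m, mean = 4)                  # importance density g: N(4, 1), oversamples the tail
w <- dnorm(y) / dnorm(y, mean = 4)       # importance weights f(y)/g(y)
phiHat <- mean((y > 4) * w)              # IS estimate of P(Y > 4) under f = N(0, 1)
c(phiHat, pnorm(4, lower.tail = FALSE))  # compare to the true value
## naive MC would need an enormous m, since P(Y > 4) is about 3e-5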
What if we actually want a sample from $f$ rather than just an estimate of an expected value? We can resample from the draws $y_1, \ldots, y_m$ with probabilities proportional to the normalized weights; this is known as sampling importance resampling (SIR).
Ratio of uniforms (optional)
If $U$ and $V$ are uniformly distributed over the set $C = \{(u, v): 0 \le u \le \sqrt{h(v/u)}\}$, where $h$ is a nonnegative function proportional to the density of interest, then $X = V/U$ has density proportional to $h$. In practice we draw $(U, V)$ uniformly from a bounding rectangle that encloses $C$ and keep only the pairs that fall in $C$, so this is really a form of rejection sampling.
One can also consider truncating the rectangular region, depending on the features of $h$, to reduce the proportion of rejected draws.
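As a concrete illustration (my own sketch, not from the notes), here is the classic ratio-of-uniforms generator for the standard normal, with $h(x) = e^{-x^2/2}$, so the bounding rectangle is $0 < u \le 1$, $|v| \le \sqrt{2/e}$:

rnorm_rou <- function(nDraws) {
    out <- numeric(nDraws)
    b <- sqrt(2 / exp(1))               # bound on |v| for the standard normal
    for (i in seq_len(nDraws)) {
        repeat {
            u <- runif(1)               # uniform on (0, 1)
            v <- runif(1, -b, b)        # uniform on (-b, b)
            x <- v / u
            if (x^2 <= -4 * log(u)) {   # accept if (u, v) falls in the region C
                out[i] <- x
                break
            }
        }
    }
    out
}
set.seed(1)
summary(rnorm_rou(10000))   # should look like a standard normal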
Monahan recommends the ratio of uniforms, particularly a version for discrete distributions (p. 323 of the 2nd edition).