So far in this course we have focused on covering the process and methods of principled inference, and putting them into practice. Hopefully, this all seems possible, at least in the context of tutorials guiding you through pre-solved problems. In real life, however, it's very easy to end up in a situation where plugging through this process to generate samples from the posterior distribution appears to be computationally impossible in a reasonable amount of time. What then?
The short answer, as far as these notes go, is: don't give up. There are strategies that could be described as approximate methods that fit within the formalism we've been using, as we'll summarize below.
Apologies for the exasperating question, but it really is worth asking. We try to keep the computing overhead for this class low enough that an old laptop can churn through any tutorial in a couple of minutes at most, but that's because we're just as impatient as you are. (Or more so, considering that we have to run all of them that are turned in...) For real-life problems, having to wait a few hours, or days, or even weeks might be fine, depending on the importance/finality of the inference. High-performance computing is obviously a great resource if you have access to it, and MCMC is famously "embarrassingly parallelizable".
But let's assume that simple impatience is not the issue. The next thing to ask is whether the theoretical part of the model, whatever it is, can be justifiably approximated and its evaluation thus sped up. Cows may not be spherical, but simplifying the description of complex phenomena, when it's defensible to do so, is a big part of science.
Last, we can think of ways to make the inference algorithms we already have work faster. There's a reason we spend so much time on methods in this course instead of just introducing pymc
and calling it a day. It's because mindless, brute-force application of any of the methods we've covered, even the fanciest, will not be efficient all the time. When efficiency really matters, applying some human intelligence can be a game changer. Since all samplers that we know of bog down when the dimensionality of the parameter space becomes large, a recommended approach is to see whether the number of parameters to be sampled can be reduced.
Several of the "Practice"-level tutorials explore these tricks. For the rest of these notes, we'll assume that they won't cut it, and we do need to approximate.
Ok, you've rolled your eyes through the previous section and determined that it really is necessary to find an alternative to traditional MCMC methods. The next contender for consideration is Approximate Bayesian Computation.
ABC is a family of sampling methods, all based on the idea that if we can generate a mock data set, from some set of parameter values, that is sufficiently similar to the observed data, then those parameter values are a plausible draw from the posterior PDF.
In this approach the likelihood function is never explicitly evaluated, and so it is sometimes described as being "likelihood-free". This terminology is misleading at best, since ABC relies on a generative model; as you well appreciate, a generative model is not complete without a sampling distribution, and sampling distributions are equivalent to likelihood functions. That said, there are situations where we can generate from a sampling distribution much more easily than we can evaluate its density. This is the use case for ABC.
Suppose we have a data set $d$, and a generative model, $H$, with parameters $\theta$. The simplest ABC algorithm is as follows:

1. Draw a value of $\theta$ from the prior, $p(\theta|H)$.
2. Generate a mock data set, $d'$, from the sampling distribution evaluated at that $\theta$.
3. If the distance between the mock and real data, $\rho(d,d')$, is less than some threshold, $\varepsilon$, keep $\theta$ as a sample; otherwise, reject it.
4. Repeat until enough samples have been accumulated.
Here the "distance", $\rho$, is a function we can choose. Typically it is not practical to compute the distance between datasets $\rho(d,d')$ directly. Instead, we first reduce the data into a set of summary statistics, $S(d)$, whose dimensionality is smaller than that of $d$. We then reject samples if $\rho[S(d),S(d')] = \rho(d,d') < \varepsilon$. (Note that the difference in likelihoods would make an excellent distance metric, if we could compute it!)
Similarly, we get to choose $\varepsilon$, with the goal being to make it small enough that our approximation to the posterior is good.
You may recognize a similarity to rejection sampling (or Metropolis-Hastings) above. What ABC gives us is samples from $p(\theta | \rho[d,d'(\theta)] < \varepsilon, H)$. If $\varepsilon \rightarrow 0$, this will be exactly the true posterior, $p(\theta | d, H)$. If $\varepsilon \rightarrow \infty$, it will instead be the prior, $p(\theta|H)$. In practice, what we get is something in between, and the hope is that, with a small enough $\varepsilon$, it will be close to the posterior while still not rejecting so many proposals that the computation becomes impractical.
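To make the algorithm concrete, here is a minimal sketch of the simple rejection version in Python. The prior draw, data simulator, summary statistic and tolerance (`draw_prior`, `simulate`, `summary`, and `eps`) are hypothetical placeholders standing in for problem-specific choices, and the Euclidean distance between summaries is just one possible $\rho$.

```python
import numpy as np

def abc_rejection(data, draw_prior, simulate, summary, eps, n_samples=1000):
    """Simplest ABC rejection sampler (sketch).

    draw_prior()    -> a parameter value drawn from the prior
    simulate(theta) -> a mock data set generated with parameters theta
    summary(d)      -> a vector of summary statistics, S(d)
    eps             -> the distance threshold, epsilon
    """
    s_data = summary(data)
    samples = []
    while len(samples) < n_samples:
        theta = draw_prior()                           # draw from the prior
        mock = simulate(theta)                         # generate mock data d'
        dist = np.linalg.norm(summary(mock) - s_data)  # rho[S(d), S(d')]
        if dist < eps:                                 # keep only close matches
            samples.append(theta)
    return np.array(samples)

# Toy usage: infer the mean of a Gaussian with known width, summarizing by the sample mean
rng = np.random.default_rng(42)
data = rng.normal(3.0, 1.0, size=100)
post = abc_rejection(data,
                     draw_prior=lambda: rng.uniform(0.0, 10.0),
                     simulate=lambda mu: rng.normal(mu, 1.0, size=100),
                     summary=lambda d: np.atleast_1d(np.mean(d)),
                     eps=0.1)
print(post.mean(), post.std())
```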
As you might guess, this method requires some care. If the summary statistics are not close to being sufficient statistics$^1$ for $\theta$ (i.e. encoding all of the information the data have about $\theta$), the approximate posterior will be too broad. If we try to use too many summary statistics, it may be too difficult for mock data sets to match the real data. If $\varepsilon$ is not set correctly, sampling can be either too inefficient or not meaningful.
In the end, we're left with an approximate posterior, and need to decide whether it's adequate for our purposes. Here posterior prediction and cross-validation can help.
It's worth noting again that the algorithm above is the simplest ABC implementation. As with MCMC, there are a number of modifications that can make it more efficient (in the sense of requiring fewer rejections), potentially allowing us to be more demanding with our choice of $\varepsilon$. In fact, the implementations listed below use a version of population Monte Carlo, adaptively updating the distribution sampled from and reducing $\varepsilon$ to gradually approach the true posterior.
The use of a set of summary statistics in place of a larger data set in ABC is an example of "dimensionality reduction", a generally effective and practical way to avoid computational overheads. Whether under that name or not, and whether within the context of ABC or not (usually not), this approach is ubiquitous. In general, the idea is to reduce (literally) the data by compressing it into a form that can be modeled with a much smaller number of parameters.
Any time we truncate a generative model before it goes all the way to the rawest form of recorded data, we are in principle engaging in dimensionality reduction. Usually this is well justified, particularly if the sampling distribution for the reduced model is well understood. However, it's worth questioning whether simple dimensionality reduction, as opposed to ABC, is a good idea when the sampling distribution of the reduced model is something we can only guess at (even after appealing to the central limit theorem and/or doing simulations).
Suppose that the sampling distribution is calculable, but so expensive that we are highly motivated to minimize the number of times we need to do so. (And suppose that ABC is also prohibitive, for some reason.)
As a last resort, we can hope that the central limit theorem makes our posterior distribution approximately Gaussian. We presumably cannot find its mean and covariance algebraically, but finding them numerically would at least involve a limited number of evaluations compared with exploring a more complex PDF. In particular, a Gaussian has a single peak, so numerical optimization (e.g. by gradient descent) might well take us there.
(There is something of a leap of faith here since optimization of a general function is neither more robust nor less expensive than sampling. Hence, much rests on us being able to find the actual posterior mode, and not some local maximum, relatively quickly.)
Once the putative posterior mode, $\hat{\theta}$, has been found, we can approximate the posterior as Gaussian (equivalently, the log-posterior as quadratic) by Taylor expanding around the mode:
$\ln p(\theta|d) \approx \ln p(\hat{\theta}|d) + \frac{1}{2} \frac{\partial^2 \ln p}{\partial \theta^2} \bigg\rvert_{\theta=\hat{\theta}} (\theta - \hat{\theta})^2 + O[(\theta - \hat{\theta})^3]$.
Ignoring the higher order terms and exponentiating,
$p(\theta|d) \approx p(\hat{\theta}|d) \exp \left[ -\frac{1}{2} (\theta - \hat{\theta})^T H (\theta - \hat{\theta}) \right]$,
where $H$ is the "Hessian" matrix of second derivatives,
$H_{ij} = -\frac{\partial^2 \ln p}{\partial \theta_i \partial \theta_j} \bigg\rvert_{\theta=\hat{\theta}}$.
This is a multivariate Gaussian, with covariance matrix $H^{-1}$. Note that optimizers that use or numerically compute second derivatives will generally return $H^{-1}$ evaluated at the optimum for you; since numerical optimizers are traditionally minimizers, the minus sign we included in the definition of $H$ above would be there naturally (i.e. we would have found the minimum of $-\ln p$).
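As a concrete sketch of this workflow in Python: minimize $-\ln p$ numerically and read off the approximate inverse Hessian that the optimizer builds up. The log-posterior below is a made-up stand-in (a correlated 2D Gaussian, so the answer is known), and BFGS's `hess_inv` is only an approximation; for a real problem one might instead compute the second derivatives by finite differences around $\hat{\theta}$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

# Hypothetical log-posterior: a correlated 2D Gaussian, so we can check the result
TRUE_MEAN = np.array([1.0, -2.0])
TRUE_COV = np.array([[2.0, 0.8],
                     [0.8, 1.0]])

def log_post(theta):
    return multivariate_normal.logpdf(theta, mean=TRUE_MEAN, cov=TRUE_COV)

# Find the mode by minimizing -ln p from an arbitrary starting guess
res = minimize(lambda t: -log_post(t), x0=np.zeros(2), method="BFGS")

theta_hat = res.x          # putative posterior mode
cov_approx = res.hess_inv  # approximate H^{-1}, i.e. the Gaussian covariance

print("mode:", theta_hat)            # should be close to TRUE_MEAN
print("covariance:\n", cov_approx)   # roughly TRUE_COV (BFGS's estimate is approximate)
```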
The methods in this section were invented within the frequentist framework. We'll first introduce them in that context, and then consider how they might fit into our usual inference approach.
Resampling methods try to compensate for "small sample" effects in the data, or otherwise not knowing the sampling distribution. The classic application for resampling is robustifying the estimation of a sample mean, when the sampling distribution appears to have heavy, non-Gaussian tails. But, in general, there is some quantity that we would like to infer from the data - we don't have a proper model to fit, but the hope is that we can somehow estimate it from the data more directly.
The jackknife procedure is

1. Remove one data point (or, more generally, a small subset of the data) from the data set.
2. Compute the estimate of interest from the reduced data set.
3. Repeat, removing each data point (or subset) in turn.
The average (compared with the full-data-set calculation) and scatter of these estimates provide some idea of the estimate's small-sample bias and variance. In essence, we make the data set even smaller and see how that changes the result.
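Here is a minimal sketch of the leave-one-out version in Python, assuming a generic `estimator` function that maps a data array to a number; the bias and variance formulas are the standard jackknife conventions, not something derived in these notes.

```python
import numpy as np

def jackknife(data, estimator):
    """Leave-one-out jackknife for a scalar estimator (sketch)."""
    n = len(data)
    full = estimator(data)
    # Recompute the estimate with each data point removed in turn
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - full)               # estimated small-sample bias
    var = (n - 1) / n * np.sum((loo - loo.mean())**2)  # estimated variance
    return full - bias, np.sqrt(var)

# Toy usage: the mean of some heavy-tailed data
rng = np.random.default_rng(1)
data = rng.standard_t(df=2, size=50)
print(jackknife(data, np.mean))  # bias-corrected mean and its estimated error
```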
The bootstrap is a little more sophisticated. The idea is that we have data that sample a distribution (what we normally aptly call the sampling distribution), so they can be used as a direct, if crude, estimate of that distribution without further assumptions. A key requirement is that the measured data are a fair representation of draws from that distribution. The procedure is

1. Generate a new data set of the same size as the real one by resampling the measured data points, with replacement.
2. Compute the estimate of interest from this resampled data set.
3. Repeat many times to build up a distribution of estimates.
Functionally, the procedure above approximates the sampling distribution as a sum of delta functions and attempts to marginalize over it. The resulting distribution of estimates is interpreted as being indicative of the true uncertainty arising from the sampling distribution, propagated to the estimand.
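A minimal bootstrap sketch in Python follows; the `estimator` function and the number of resamplings are again placeholders.

```python
import numpy as np

def bootstrap(data, estimator, n_resample=1000, rng=None):
    """Resample the data with replacement and recompute the estimate each time (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    return np.array([
        estimator(rng.choice(data, size=n, replace=True))  # one resampled data set
        for _ in range(n_resample)
    ])

# Toy usage: the distribution of the mean of heavy-tailed data
rng = np.random.default_rng(2)
data = rng.standard_t(df=2, size=50)
boot = bootstrap(data, np.mean, rng=rng)
print(np.mean(data), np.percentile(boot, [16, 84]))  # point estimate and a crude interval
```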
As promised, both of these methods are explicitly frequentist - they aim to estimate something (bias and/or variance) about an estimator!
Still, they do have an interpretation within the Bayesian framework. Specifically, both methods suggest a particular way to marginalize over uncertainty in the sampling distribution itself. This is not an issue we have come across yet, since step one of our approach has been to specify a generative model, including a sampling distribution. In fact, our usual advice when faced with uncertainty about the nature of the sampling distribution itself is still to model it, possibly in multiple ways and applying model selection if necessary. On those occasions when the data set is so small that such an approach doesn't seem likely to be fruitful, and if we are ultimately interested in a simple function of the data (like an estimator), resampling methods are a plausible alternative.
Sufficiency is a big deal in frequentist analysis, since the goal is to find estimators (i.e. statistics computed from the data) that work well, and taking advantage of all the information present in the data is reasonably an aspect of "working well". Frequentist estimators for a given problem (or problems similar to it) are therefore a good place to start looking for summary statistics. For example, the standard estimators for the intercept and slope might be useful summaries for a linear model, even if the assumptions underlying them do not actually hold in a given problem.
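For instance, in the ABC setting above one might compress a set of $(x,y)$ pairs into just the ordinary least-squares intercept and slope and use those as $S(d)$; the data below are purely illustrative.

```python
import numpy as np

def summary_linear(x, y):
    """Summarize (x, y) data by the ordinary least-squares intercept and slope."""
    slope, intercept = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first
    return np.array([intercept, slope])

# Illustrative data with heavy-tailed (non-Gaussian) scatter about a line
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 0.7 * x + rng.standard_t(df=3, size=x.size)

print(summary_linear(x, y))  # two numbers standing in for 50 data points
```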