In which we will
appreciate the curse of dimensionality, and some of its consequences
see how the Laplace approximation can provide an approximate posterior PDF, in very convenient form
gain a new appreciation for summary statistics, and see how ABC can provide an approximate posterior PDF, despite the data likelihood never being computed
When trying to answer interesting questions, we increasingly run into the challenges of "big data" and "big models".
"Big data" might refers to any data set whose volume, velocity or variety is sufficiently high that we need to fundamentally change the way we approach it. For example: SDSS gave astronomy a big data problem, in that we needed web-based SQL queries to allow us to make initial subsamples that we can work with. In the before time, we would have just downloaded the whole dataset.
"Big models", with large numbers of parameters, tend to follow big data. This is because larger data sets ought to provide information about more complex phenomena, as well as because parameters tend to proliferate in hierarchical models.
The fundamental problem is that posterior characterization gets more computationally difficult when the dimensionality of the model parameter space increases. We saw this when we abandoned simple grids for MCMC samples - but sampling can only get us so far.
Consider an $N$-dimensional hypersphere of radius $R$. With a little algebra, you can show that the fraction of the sphere's volume contained in a thin shell at the surface approaches 1 for large $N$. More intuitively, thinking about cubes in 1, 2, and 3 dimensions, more and more of the total volume is found in the "corners" of the space.
A consequence of this is that uniform priors become harder to justify (all that weight in the corners!), and similarly methods like SMC that sample from the prior will become increasingly inefficient in high dimensions (unless those priors are very informative). MCMC sampling can go a long way to helping, but it inevitably becomes more challenging to move around a high-dimensional space, unless the posterior is very simple.
Beyond this, we often find that evaluating the posterior for really interesting problems is too computationally expensive for our preferred methods to be feasible. We may not be in the position to evaluate it thousands of times, let alone tens or hundreds of thousands. We might even not be able to write the posterior down concretely enough to evaluate it in the first place.
"Dimensionality reduction" is an excellent, practical way to avoid the curse. The dimensionality of the data can be reduced by compressing it into a (much) smaller set of "summary statistics". The resulting compressed dataset may be modelable with a much smaller number of parameters.
The "data" are automatically measured galaxy shapes, whose tiny distortions are due to weak gravitational lensing by a complex distribution of matter between the galaxies and us. Rather then hierarchically model all this structure, with cosmological hyper-parameters, we compress the shape data into "correlation function" summary statistics. Importantly, these summary statistics can be directly predicted (on average) from a cosmological model, without having to generate a specific realization of the large scale structure. There is an issue, however: the exact form of the sampling distribution for the correlation function summary data is unknown (and dependent on cosmology). This leads us to the question of how to deal with...
If the sampling distribution of our reduced-dimensionality summary statistics is something we can understand from our model, then we're in good shape. We may have lost some information by reducing the data to summaries, but otherwise our usual approaches will work. On the other hand, if we no longer understand the sampling distribution from first principles, as in the example above, we need a different approach. Options include
ABC is a family of sampling methods all based on the idea that if we can generate a mock dataset that is sufficiently similar to the observed data, then the parameters of that model are a plausible draw from the posterior PDF. These methods might be useful when the model is too complex to enable a likelihood to be evaluated (or even written down).
When might that be? The cosmic shear example, without dimensional reduction, is one scenario. The only way to construct a PDF over possible sheared skies would be by brute force: realizing initial conditions and simulating the growth of structure and galaxy evolution many times from the same set of cosmological/astrophysical parameters. If we're reduced to doing such forward simulations anyway, ABC provides a better alternative, analogously to the improvement of MCMC over exhaustive grid evaluations.
Suppose we have a dataset $d$, and a generative model, $H$, with parameters $\theta$. The simplest ABC algorithm is as follows:
Here the "distance", $\rho$, is a function we can choose. Typically it is not practical to compute the distance between datasets $\rho(d,d')$ directly. Instead, we first reduce the data into a set of summary statistics, $S(d)$, and then reject samples if $\rho(S(d),S(d')) = \rho(d,d') < \epsilon$. (Note that the difference in likelihoods would make an excellent distance metric, if we could compute it!)
Similarly, we get to choose $\epsilon$, although the goal is for it to be small enough that our approximation is good.
You may recognize a similarity to rejection sampling (or Metropolis-Hastings) above. What ABC gives us is samples from $p(\theta | \rho(d,d') < \epsilon, H)$. If $\epsilon \rightarrow 0$, this will be exactly the true posterior, $p(\theta | d, H)$. If $\epsilon \rightarrow \infty$, it will instead be the prior, $p(\theta|H)$. In practice, what we get is something in between, and the hope is that with a small enough $\epsilon$ it will be close to the posterior, while still not rejecting in too many iterations.
As you might guess, this method requires some care. If the summary statistics are not close to being sufficient statistics for $\theta$ (i.e. encoding all of the information the data have about $\theta$), the approximate posterior will be too broad. If we try to use too many summary statistics, it may be too difficult for mock data sets to match the real data. If $\epsilon$ is not set correctly, sampling can be either too inefficient or not meaningful.
In the end, we're left with an approximate posterior, and need to decide whether it's adequate for our purposes. Here posterior prediction and cross-validation can help.
It's worth noting again that the algorithm above is the simplest ABC implementation. As with MCMC, there are a number of modifications that can make it more efficient (in the sense of requiring fewer rejections), potentially allowing us to be more demanding with our choice of $\epsilon$.
Suppose that the sampling distribution is calculable, but so expensive that we are highly motivated to minimize the number of times we need to do so.
What would be the ideal modeling situation, given such an expensive likelihood and a large number of parameters? How about a model that maps onto the linear least squares problem, for which a purely algebraic calculation allows the posterior, which is a Gaussian, to be computed exactly.
Short of that, we can hope that the central limit theorem makes our real posterior distribution approximately Gaussian. We may not be able to find its mean and covariance algebraically, but finding them numerically would at least involve a limited number of evaluations compared with exploring a more complex PDF. In particular, a Gaussian has a single peak, so numerical optimization (e.g. by gradient descent) might well take us there.
(There is something of a leap of faith here since, in general, optimization is not more robust or less expensive than sampling. Hence, much rests on us being able to find the actual posterior mode, and not some local maximum, relatively quickly.)
Once the putative posterior mode, $\hat{\theta}$, has been found, we can approximate it as Gaussian (equivalently, the log-posterior as quadratic) by Taylor expanding around the mode:
$\log p(\theta|d) \approx \log p(\hat{\theta}|d) - \frac{1}{2} \frac{\partial^2 \log p}{\partial \theta^2} \bigg\rvert_{\theta=\hat{\theta}} (\theta - \hat{\theta})^2 + O[(\theta - \hat{\theta})^3]$.
Ignoring the higher order terms and exponentiating,
$p(\theta|d) \approx p(\hat{\theta}|d) \exp \left[ -\frac{1}{2} (\theta - \hat{\theta})^T H (\theta - \hat{\theta}) \right]$,
where $H$ is the "Hessian" matrix of second derivatives,
$H_{ij} = \frac{\partial^2 \log p}{\partial \theta_i \partial \theta_j} \bigg\rvert_{\theta=\hat{\theta}}$.
This is a multivariate Gaussian, with covariance matrix $H^{-1}$. Note that optimizers that use or numerically compute second derivatives will generally return $H^{-1}$ evaluated at the optimum for you.
Since this method approximates the posterior as a Gaussian, the resulting approximation is automatically normalized. That is, it provides an approximate evidence... if you're willing to believe it.