Containing an incomplete overview of additional methods for generating samples from a posterior PDF.
[Figure: CosmoMC, by Antony Lewis]

The Metropolis-Hastings algorithm has historically been the workhorse MCMC method, in that it is widely applicable, its inner workings and shortcomings are relatively simple to understand, and it's straightforward to assess how well it's working. But that doesn't mean it's the only or best solution for any particular problem.
I would caution against thinking of the methods below as more "advanced", in the sense of being automatically preferable, compared with those we have already seen. For example, several of them are "better" in the sense that the sequences they produce tend to be less correlated than Metropolis would yield. Yet, we've seen that correlated chains are not a bug, as such. In practice, what we care about is accumulating a certain number of effectively independent samples in a reasonable amount of human time, with the run time determined by the number of times the posterior needs to be evaluated. An algorithm that produces completely independent samples, but requires 10 posterior evaluations for each one, is no better according to this metric than a Metropolis kernel with a correlation length of 10.
All of which is to say that, in my limited experience, this is one of those cases where there just isn't a free lunch to be had. Compared to Metropolis with the most straightforward and common-sense proposal adaptation (see below), other methods that I've tried (which, admittedly, does not include everything below) haven't been any faster. Since many of those methods don't admit the same simple diagnostics that we saw previously, I tend to stubbornly stick with Metropolis (or conjugate Gibbs, when possible) in my own work. But that doesn't mean you shouldn't learn about and try other things. And there are certain tasks that MCMC methods simply don't do well, like calculation of the Bayesian evidence, for which we'll want to use other approaches (see Model Evaluation and Comparison II).
One thing to keep in mind, when faced with a very computationally expensive problem, is that we're allowed to use different samplers for different subsets of parameters, just as long as every parameter gets updated somehow.
Our first method is not actually a novel one, just our old friend Metropolis with extra bells and whistles. Recall the basic algorithm:
```
choose a starting point and a proposal scale
while we want more samples:
    propose a new point from a scaled distribution, centered on our current position
    accept/reject the new point
```
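For reference, a minimal version of this in Python might look like the sketch below; `log_post` stands in for whatever function evaluates the log-posterior, and is a placeholder rather than anything defined in these notes.

```python
import numpy as np

def metropolis(log_post, start, scales, nsteps, seed=None):
    """Minimal Metropolis sampler with an independent Gaussian proposal of
    width `scales` in each parameter. Returns an array of samples."""
    rng = np.random.default_rng(seed)
    theta = np.array(start, dtype=float)
    logp = log_post(theta)
    samples = np.empty((nsteps, theta.size))
    for i in range(nsteps):
        # propose a new point from a scaled distribution, centered on the current position
        trial = theta + scales * rng.standard_normal(theta.size)
        logp_trial = log_post(trial)
        # accept/reject the new point (Metropolis rule for a symmetric proposal)
        if np.log(rng.uniform()) < logp_trial - logp:
            theta, logp = trial, logp_trial
        samples[i] = theta
    return samples
```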
To speed things up in human time, we can also take advantage of the embarrassingly parallel nature of the algorithm. That is, nothing prevents us from running multiple, independent chains using as many CPUs as we have available, and pooling the chains at the end. In fact, we've seen that we'll want multiple chains in any case, to test for convergence.
In the algorithm itself, we mainly have freedom to play with the proposal distribution. We've already discussed using a proposal distribution that discourages very small steps (in scaled units). What else can we do? The better we can tune the proposal distribution to match the overall shape of the posterior, the more efficiently the chain will be able to move around in parameter space. Hence, the goal is to cleverly and automatically tune the proposal distribution.
Strictly speaking, tuning the proposal distribution on the fly breaks the rules of MCMC (specifically, detailed balance). Technically, we should stop tuning after some time, once things appear to be working well, and throw away all steps before that point. In practice, if we've actually managed to converge to the target distribution, subsequent updates to the proposal distribution will be negligible anyway, making this less of a concern.
One way of making the proposal distribution more efficient would be to customize the proposal scale for each parameter. If we believe the chain has begun (laboriously) to sample from the posterior, we could periodically compute the standard deviation of the samples so far in each direction, and use that as a guess of the appropriate proposal scale. More generally, we could look at the multidimensional covariance of the samples so far, and adjust the shape of our proposal distribution to match; this would let us learn about and compensate for parameter degeneracies. This change is equivalent to using a rotated coordinate basis, with appropriate scales in each new basis direction, for our proposal distribution. We could visualize it like this:
[Figure: three panels — Original; Proposal basis change; Parameter basis change]
In each panel, the ellipse shows the shape of the posterior distribution, and the vectors show the coordinate basis with respect to which the proposal distribution is defined. (We are not confined to proposing along these directions, but are imagining that we want the proposal distribution to be simple with respect to this basis, and have a proposal scale associated with each basis direction.) On the left, the basis is Cartesian, with the scales set to the standard deviation of the (marginal) posterior in each direction; we can see that this results in proposals that are too large on average to be accepted often. In the center, we have rotated the basis for proposals to align with the major and minor axes of the iso-probability surfaces, and tuned the associated scales appropriately. This can equivalently be accomplished by re-defining the parameters to reduce the correlation of the PDF, as illustrated on the right. (Note that, being a rotation and scaling, this parameter transformation is linear, so its Jacobian is a constant; in particular, uniform priors remain uniform under it. If we had used a nonlinear reparametrization, we'd need to worry about transforming the prior density accordingly.) For a nice, elliptical PDF like the one shown, a basis that diagonalizes the covariance matrix is optimal.
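To make this concrete, here is one way the covariance-based tuning might look in code (a sketch, not the scheme any particular package uses; `chain` is assumed to hold the samples accumulated so far):

```python
import numpy as np

def proposal_from_chain(chain, scale=None):
    """Build a Gaussian proposal whose shape matches the empirical covariance
    of the samples accumulated so far (`chain` has shape [nsamples, nparams]).
    Returns a function that draws a proposed point given the current one."""
    npar = chain.shape[1]
    cov = np.cov(chain, rowvar=False)
    # a common rule of thumb for roughly Gaussian targets scales the
    # covariance by ~2.4^2/nparams (Gelman, Roberts & Gilks 1996)
    if scale is None:
        scale = 2.4**2 / npar
    L = np.linalg.cholesky(scale * cov)  # equivalent to a rotated, rescaled proposal basis
    rng = np.random.default_rng()
    def propose(current):
        return current + L @ rng.standard_normal(npar)
    return propose
```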
If changes to some parameters require much more costly calculations than others, straightforward diagonalization of the covariance may not be optimal in practice. In this case, we might want to propose changes to fast parameters more often than slow parameters. More generally, given a hierarchy of speeds, we can also triangularize the proposal basis, such that the fastest parameter appears in every basis vector, the second-fastest appears in all but one, etc., with the slowest parameter being changed only in tandem with all other parameters. This is illustrated in the right panel, below. We might also consider sampling in faster directions more often than in slower directions.
[Figure: three panels — Original; Diagonalized covariance; Triangular fast-slow]
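One way to construct such a triangular basis, sketched below under the assumption that we already have an estimated covariance with the parameters ordered from slowest to fastest, is to take the columns of its Cholesky factor as the proposal directions: the first column moves every parameter (including the slowest), while the last moves only the fastest.

```python
import numpy as np

def fast_slow_directions(cov):
    """Proposal directions for a triangular fast-slow scheme. `cov` is an
    estimated covariance with parameters ordered slowest first. Column j of
    the lower-triangular Cholesky factor changes only parameters j..n-1, so
    the fastest parameter appears in every direction, while the slowest is
    changed only in tandem with all the others."""
    L = np.linalg.cholesky(cov)
    return [L[:, j] for j in range(L.shape[1])]
```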
We've already discussed running independent chains in parallel, and using information gathered so far in a chain to adapt the proposal distribution. Combining these ideas turns out to be extremely powerful. If we start multiple chains in different parts of parameter space as usual, and their proposal distributions are not great out of the box, then chances are they will make their way to different parts of the region where the posterior is non-tiny and then struggle to move around. Pooling the information about where multiple chains are sampling can thus tell us much more about the global shape of the posterior than looking only at one chain. With some imagination, we can see this in the example below, where the most degenerate direction becomes evident within the first few steps of each chain.
Pooling information in this way can vastly decrease the time to convergence compared with running separate chains, much more so than the speedup from embarrassingly parallel execution alone. Consider the comparison of independent-adaptive and communicating-adaptive chains below, where the starting positions and adaptation strategy (diagonalizing the covariance) are the same in each case.
It isn't so easy to tell what's happening as a function of time for each chain in these plots, but, looking at their traces, we see that the communicating chains provide more useful information more quickly than the alternative.
[Figure: Independent adaptation; Pooled adaptation]
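In code, the pooling step can be as simple as concatenating the chains' recent samples before estimating the shared covariance; a minimal sketch, assuming each chain is an array of shape [nsamples, nparams]:

```python
import numpy as np

def pooled_covariance(chains, discard=0.5):
    """Estimate a shared proposal covariance by pooling samples from several
    chains, dropping the first `discard` fraction of each as burn-in."""
    kept = [c[int(discard * len(c)):] for c in chains]
    return np.cov(np.concatenate(kept, axis=0), rowvar=False)
```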
As with doing on-the-fly adaptation to begin with, we are breaking a rule here - the chains are no longer independent, so we should think about the consequences for our ability to test for convergence by comparing them. Strictly speaking, we should use a procedure like this only to find a good proposal distribution, then throw away those chains and run new ones with the proposal distribution fixed. On the other hand, the argument that the proposal distribution will eventually stop changing still holds (normally), so comparisons of the post-convergence chains are still arguably reasonable. In addition, we still have the sanity check that chains with dispersed starting points found their way to the same part of parameter space, regardless of the adaptation method.
Most packaged implementations of Metropolis-Hastings will do at least some of the above; see links in the Monte Carlo Sampling notes.
Consider the function $[p(x)]^{1/T}$, where $p(x)$ is the target PDF. We say a chain sampling this function, instead of $p(x)$ itself, has temperature $T$. For $T>1$, $p^{1/T}$ is smoothed out compared with $p$. This smoothed-out distribution will be easier to move around in than the original, especially when there are strong/complex degeneracies, and in particular it should be more likely for a chain to move between well separated peaks of the PDF. Of course, we're ultimately only interested in samples obtained for $T=1$.
With parallel tempering, we run one chain with $T=1$ and several more chains with $T>1$. A modified Metropolis-Hastings update occasionally allows the chains to exchange positions, giving the $T=1$ chain a mechanism for sampling regions of parameter space it might otherwise have low probability of proposing. Samples from the $T=1$ chain can be used for inference.
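The details vary between implementations, but the exchange is itself a Metropolis-Hastings update. For example, if the chains at temperatures $T_i$ and $T_j$ are currently at positions $\theta_i$ and $\theta_j$, a proposed swap of their positions can be accepted with probability

$\min\left[1, \; \left(\frac{p(\theta_j)}{p(\theta_i)}\right)^{1/T_i} \left(\frac{p(\theta_i)}{p(\theta_j)}\right)^{1/T_j}\right]$,

so that a hot chain that has wandered into a high-probability region can hand that position down to the $T=1$ chain.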
While standard MCMC is analogous to the evolution of a thermodynamic system, Hamiltonian Monte Carlo (HMC) is (almost) analogous to the dynamical evolution of a single particle. Consider our free parameters as coordinates of a position, $\theta$, and minus the log-posterior as a potential energy, $U(\theta)$.
HMC introduces momentum parameters, $\phi$, corresponding to each position parameter, and an associated "kinetic energy",
$K(\phi) = \sum_i \frac{\phi_i^2}{2m_i}$,
where we are free to choose the "masses", $m_i$. Looking at the analogy the other way, the probability associated with $K(\phi)$,
$p(\phi) \propto e^{-K(\phi)}$,
is a multivariate Gaussian with mean zero and a diagonal covariance given by the masses.
The HMC algorithm alternates direct sampling of $\phi$ from this Gaussian with joint updates of $\phi$ and $\theta$, as follows:
1. Draw a new momentum vector $\phi$ from the Gaussian above.
2. Evolve $(\theta,\phi)$ for some length of fictitious time according to Hamilton's equations, $\dot\theta_i = \phi_i/m_i$ and $\dot\phi_i = -\partial U/\partial\theta_i$ (numerically in practice, e.g. with a leapfrog integrator).
3. Accept or reject the end point of the trajectory based on the change in the total energy, $U+K$, which would be exactly zero if the integration were perfect.
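As a minimal illustration (not tied to any particular package), one update with unit masses and a leapfrog integrator might look like this; `grad_U` is assumed to return the gradient of $U$, i.e. of minus the log-posterior.

```python
import numpy as np

def hmc_step(log_post, grad_U, theta, eps, n_leapfrog, rng=np.random.default_rng()):
    """One HMC update with unit masses. `grad_U` returns the gradient of
    U = -log_post. Returns the new position (which may equal the old one)."""
    phi = rng.standard_normal(theta.size)              # 1. draw momenta
    H0 = -log_post(theta) + 0.5 * np.dot(phi, phi)     # initial total energy
    th, p = np.array(theta, dtype=float), phi.copy()
    p -= 0.5 * eps * grad_U(th)                        # 2. leapfrog integration
    for _ in range(n_leapfrog):
        th += eps * p
        p -= eps * grad_U(th)
    p += 0.5 * eps * grad_U(th)                        # undo the surplus half-step
    H1 = -log_post(th) + 0.5 * np.dot(p, p)            # final total energy
    # 3. accept/reject based on the change in total energy
    if np.log(rng.uniform()) < H0 - H1:
        return th
    return theta
```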
There's a lot more literature on HMC and its optimization than we can cover - see e.g. this chapter. In a nutshell, the introduction of "momentum" into the evolution of the chain is supposed to reduce the random-walk behavior (autocorrelation) of traditional MCMC.
The catch is that, to evolve the $(\theta,\phi)$ state, we need to compute the gradient of the log-posterior (the "force"). If this can be computed analytically, we're in business. If we need to do it numerically, it isn't obvious that HMC will have any advantage over standard MCMC.
These days, there are HMC packages that can perform automatic (or symbolic) differentiation, provided you can express your model as equations in their own language. For complex models, doing this translation such that the resulting, automatically generated code is efficient can be something of an art. And sadly, those of us using complex, legacy likelihood codes that we'd rather not reinvent, or that involve fundamentally non-analytic components like ugly look-up tables, are not likely to move to HMC any time soon.
This is not really a new method, but it's worth knowing that there is software out there that can handle the exploitation of conjugacies for you. Similarly to HMC, you would need to code up your model symbolically; the software then determines what conjugacies exist and uses them, meaning that you don't need to implement those updates yourself (possibly introducing errors). For parameters that are not conjugate, these packages typically fall back on basic one-parameter-at-a-time Metropolis sampling, so it's only worth investigating them if you know (or suspect) your problem has conjugacies to take advantage of.
See links in the Monte Carlo Sampling notes.
Moving beyond single-chain methods, the method of Goodman and Weare (2010) works by evolving an ensemble of points in parameter space. After convergence, the ensemble can be regarded as a set of samples from the target distribution. This approach provides some of the benefits of running multiple chains, although we cannot interpret the evolution of the points in the ensemble as independent chains in any sense.
The procedure for moving each point (aka walker) in the ensemble is:
1. Randomly select one of the other walkers in the ensemble; call its position $\theta_j$.
2. Propose a move of the walker being updated, $\theta$, along the line connecting it to $\theta_j$: $\theta' = \theta_j + z(\theta - \theta_j)$, with $z$ drawn from the density $g(z) \propto \frac{1}{\sqrt z}; ~ \frac{1}{2}\leq z \leq 2$.
3. Accept or reject the move based on the ratio of posterior densities multiplied by $z^{N-1}$, where $N$ is the number of free parameters.
Note that there is some magic in the density $g$. We are not free to choose just any function.
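A bare-bones implementation of this stretch move might look like the following (a sketch only; packaged implementations such as emcee organize the updates differently, e.g. to allow parallel posterior evaluations):

```python
import numpy as np

def stretch_move(log_post, walkers, logp, a=2.0, rng=np.random.default_rng()):
    """Update each walker once using the Goodman-Weare stretch move.
    `walkers` has shape [nwalkers, nparams]; `logp` holds each walker's
    current log-posterior. Both are modified in place and returned."""
    nwalk, ndim = walkers.shape
    for k in range(nwalk):
        # 1. pick a different walker at random
        j = rng.integers(nwalk - 1)
        if j >= k:
            j += 1
        # 2. draw z from g(z) ~ 1/sqrt(z) on [1/a, a] and stretch along the connecting line
        z = (1.0 + (a - 1.0) * rng.uniform())**2 / a
        trial = walkers[j] + z * (walkers[k] - walkers[j])
        logp_trial = log_post(trial)
        # 3. accept with probability min(1, z^(N-1) * posterior ratio)
        if np.log(rng.uniform()) < (ndim - 1) * np.log(z) + logp_trial - logp[k]:
            walkers[k], logp[k] = trial, logp_trial
    return walkers, logp
```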
This algorithm is relatively easy to use - there is no tuning required and it's straightforward to parallelize. One just needs to provide an initial ensemble, which is not much more onerous than providing a single initial state for MCMC. For these reasons, we lean on a Python implementation called emcee
(feel free to groan) in some of the later tutorials. There are some important cautions to keep in mind, however.
In short, we'll use this method when we're already pretty sure that the posterior is simple (unimodal), that we can guess a decent starting point, and that we'll therefore be able to recognize convergence unambiguously.
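For orientation, using emcee tends to look something like the sketch below; the log-posterior function and the starting ball are placeholders, and details of the interface may differ between versions.

```python
import numpy as np
import emcee

def log_post(theta):
    # placeholder: a standard normal "posterior" in 3 parameters
    return -0.5 * np.sum(theta**2)

ndim, nwalkers = 3, 32
# start the walkers in a small ball around a guess at the posterior peak
p0 = 0.1 * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_post)
sampler.run_mcmc(p0, 5000)

# drop some burn-in and pool the walkers into a single set of samples
samples = sampler.get_chain(discard=1000, flat=True)
```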
Now we get to some fundamentally different approaches. Recall the essence of MC integration,
$\int f(x) \, dx = \int w(x) \, p(x) \, dx \approx \frac{1}{n}\sum_i w(x_i)$; $x_i \sim p$.
We used this (way back) to motivate Simple Monte Carlo (SMC), where $p$ is the prior and $w$ is the likelihood. The key feature that enabled SMC is that points drawn from any distribution we can sample provide an unbiased estimate of an integral, as long as each point is weighted correctly.
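In code, this is almost a one-liner; here is a sketch for estimating a posterior mean by Simple Monte Carlo, with `log_like` and `sample_prior` standing in for the problem-specific pieces:

```python
import numpy as np

def smc_posterior_mean(log_like, sample_prior, n):
    """Draw n points from the prior, weight each by its likelihood, and
    return the weighted average as an estimate of the posterior mean.
    `sample_prior(n)` returns an array of shape [n, nparams]."""
    x = sample_prior(n)
    logw = np.array([log_like(xi) for xi in x])
    w = np.exp(logw - logw.max())   # rescale before exponentiating, for numerical stability
    return np.sum(w[:, None] * x, axis=0) / np.sum(w)
```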
Population Monte Carlo (PMC) is a more refined version of this - instead of sampling from the prior, we iteratively build up an approximation to the posterior distribution, called the generating distribution, $q$, along with a number of samples and associated weights. Each iteration consists of
1. generating a set of points from the current $q$;
2. evaluating the posterior density, $\pi$, at each of them;
3. using this information to update $q$ for the next iteration.
These points can then be weighted by the ratio $\pi/q$ and treated as samples of the posterior. All of these points, that is; no matter which iteration they came from and how uninformed the generating distribution was at that point, they still come with the correct weight. Still, getting accuracy from this method on human timescales requires a clever way to refine $q$. With a suitably intelligent implementation (e.g. see this), the generating distribution can be evolved towards the posterior distribution, providing samples for a posteriori analysis as well as an estimate for the evidence.
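As a toy illustration of such an iteration, with a single multivariate Gaussian for $q$ (real implementations typically use adaptive mixtures and cleverer update rules), and `log_post` again a placeholder for the unnormalized log-posterior:

```python
import numpy as np
from scipy.stats import multivariate_normal

def pmc(log_post, mean, cov, n_per_iter=1000, n_iter=10, seed=None):
    """Toy Population Monte Carlo: repeatedly sample from a Gaussian
    generating distribution q, weight each point by posterior/q, and refit
    q to the weighted samples. Returns all points and their weights."""
    rng = np.random.default_rng(seed)
    all_x, all_logw = [], []
    for _ in range(n_iter):
        q = multivariate_normal(mean, cov)
        x = q.rvs(size=n_per_iter, random_state=rng)
        logw = np.array([log_post(xi) for xi in x]) - q.logpdf(x)
        all_x.append(x)
        all_logw.append(logw)
        # refit q from the current batch (weighted mean and covariance);
        # only the relative weights matter here
        w = np.exp(logw - logw.max())
        mean = np.average(x, axis=0, weights=w)
        cov = np.cov(x, rowvar=False, aweights=w)
    logw = np.concatenate(all_logw)
    return np.concatenate(all_x), np.exp(logw - logw.max())
```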
Nested sampling is a particular form of PMC that is primarily aimed at calculating the Bayesian evidence. We'll look at its utility more closely in the notes on model comparison. In brief, we begin with a large number of points sampled from the prior, and gradually evolve them towards higher-likelihood positions, keeping track of how the volume filled changes (marvel at the complicated math here). Like PMC generally, every time we evaluate the likelihood we get a sample and a weight that we can use to derive parameter constraints, regardless of whether we even care about the evidence.
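Very roughly (the details are deferred to the model comparison notes): with $N$ live points, the point discarded at iteration $i$, with likelihood $L_i$, is assigned a shell of prior volume $\Delta X_i = X_{i-1} - X_i$, where $X_i \approx e^{-i/N}$, so that the evidence accumulates as

$Z \approx \sum_i L_i \, \Delta X_i$,

and the weight attached to each discarded point for parameter inference is proportional to $L_i \, \Delta X_i$.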
By virtue of spamming the parameter space with points at the start, nested sampling is likely (though never guaranteed) to find multiple modes, if they exist. The same can be said of PMC, given a wide initial generating distribution. Since these methods maintain something like a global view of the parameter space, rather than building up information by tracing a sequence through the space, they are probably the best option in the situation that confounds MCMC, when there are widely separated peaks in the posterior distribution.
If nested sampling is so much more foolproof than MCMC, why wouldn't we always use it instead? In short, exactly those features that make it more robust to multimodal distributions make it less efficient than MCMC for unimodal distributions - if our goal is simply to find the peak and the usual credible intervals/regions, we're better off looking around the region of parameter space where the posterior is large, rather than carefully mapping out the tails.
Understanding the guts of every possible sampling algorithm is not really the point of this course, so the list above is necessarily incomplete and somewhat superficial. Hopefully, you've gotten at least a sense of what's out there and how it works.
Like many things in computing, finding the best way to sample a particular PDF when optimization really matters is an art. I'll just reiterate two pieces of advice from above:
As you've seen already, sampling packages tend to come in two flavors: those with a relatively "minimalist" interface that leaves the user responsible for the model implementation (e.g. emcee
, where the user provides initial parameter values and a log-posterior function that accepts a parameter array as an argument), and those where the user is expected to specify the entire model in the package's bespoke pseudocode. There's a trade-off here, in that the latter potentially helps one avoid any number of coding mistakes, but can also make the process of debugging more opaque and/or preclude the use of external codes within the posterior calculation. If you're comfortable with all that, then PyMC
is probably a good choice for most purposes, since it's actively developed and has a large user base. Otherwise (and we're open to feedback on this), our advice is to use emcee
to get something up and running quickly in Python, falling back to the Metropolis variant of your choice if there are any doubts about the G-W sampling working correctly, and specialized codes like nested/Gibbs sampling where appropriate.