Tutorial: MCMC Diagnostics

You should already have run one or two different MCMC algorithms to generate multiple chains from different starting points for the toy photometry problem. Here, we'll work through the process of diagnosing whether these Markov chains are usefully sampling the posterior distribution, in particular assessing whether they have converged to a common distribution, how much burn-in needs to be removed, and how correlated the samples are.

The diagnostics discussed below include both qualitative and quantitative checks. We don't think it's especially instructive to write the code that does the quantitative calculations yourself - though there is surely room for improvement or expansion if you're interested - so instead we will demonstrate how to use functions provided by the incredible and pandas packages.

Final note: if you have done both the Gibbs Sampling and Metropolis Sampling tutorials, then you can/should use your own chains below. If not, you can change the paths below to read in chains from the solutions/ directory instead, for the case you haven't worked through.

Gibbs samples

It's nice to start with the Gibbs sampled chains, since (for this problem) they almost certainly look nicer. First, read them in. The way we did things in the previous tutorial, each entry in the chains list will be an $N_\mathrm{steps} \times N_\mathrm{parameters}$ array.
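For concreteness, a minimal sketch of what reading the chains in might look like, assuming they were saved as plain-text tables by the previous notebook (the file names here are placeholders; substitute your own, or the corresponding files in solutions/):

```python
import numpy as np

# hypothetical paths; point these at wherever your Gibbs chains were saved
chains = [np.loadtxt('gibbs_chain_{}.txt'.format(i)) for i in range(4)]

# each entry should be an (N_steps, N_parameters) array
print(chains[0].shape)
```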

Visual inspection

You've already used the most important method of vetting chains: visual inspection. The key questions are:

  1. Do multiple, independent chains appear to be sampling the same distribution (have they converged to the same distribution)?
  2. Is there a clear "burn-in" period before convergence that should be eliminated?
  3. Are the chains highly autocorrelated (taking small steps compared with the width of the posterior)? This is not an issue per se, if the chains are long enough, although it means the sampler is not moving as efficiently as one might hope.

Plot the parameter traces below, and answer these questions (qualitatively) for the conjugate Gibbs sampling chains.
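A minimal trace-plot sketch, assuming chains is the list of arrays read in above (the parameter labels are generated generically here, since they depend on your setup):

```python
import matplotlib.pyplot as plt

npar = chains[0].shape[1]
param_names = ['parameter {}'.format(j) for j in range(npar)]  # placeholder labels

fig, ax = plt.subplots(npar, 1, figsize=(10, 2*npar))
for j in range(npar):
    for chain in chains:
        ax[j].plot(chain[:, j], alpha=0.5)   # overlay one trace per chain
    ax[j].set_ylabel(param_names[j])
ax[-1].set_xlabel('step')
plt.show()
```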

Is there a burn-in that should clearly be removed? If so, do so here by setting the value of burn to a number of steps $>0$. (You shouldn't need to remove very much; if burn=100, for example, doesn't seem incredibly generous in this case, then something went wrong in the previous tutorial.)
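For example, removing the burn-in might look like:

```python
burn = 0  # increase if the traces above suggest it's needed
chains = [chain[burn:, :] for chain in chains]
```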

Gelman-Rubin statistic

Recall from the notes that the Gelman-Rubin convergence statistic, $R$, quantitatively tests the similarity of independent chains intended to sample the same PDF. To be meaningful, the chains should start from different locations and have burn-in removed.

For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.

We'd like to see $R\approx 1$; for example, $R<1.1$ is often used.
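As noted above, the notebook relies on provided functions for the quantitative checks; purely for reference, a minimal sketch of the $R$ calculation for a single parameter might look like this (assuming chains is a list of equal-length arrays with burn-in already removed):

```python
import numpy as np

def gelman_rubin_R(chain_list):
    """Gelman-Rubin R for one parameter.
    chain_list: list of 1D arrays, one per chain, equal lengths."""
    x = np.array(chain_list)               # shape (m, n)
    m, n = x.shape
    W = x.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = n * x.mean(axis=1).var(ddof=1)     # n times the variance of the chain means
    V = (n - 1)/n * W + B/n                # pooled variance estimate
    return np.sqrt(V / W)

# e.g., for the first parameter:
# gelman_rubin_R([chain[:, 0] for chain in chains])
```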

Checkpoint: If your Gibbs sampler works properly, $R$ for the chains we ran should be very close to 1 (we have differences of order 0.0001).

Autocorrelation

Similarly, the autocorrelation of a sequence, as a function of lag, $k$, can be quantified:

$\rho_k = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$

The larger lag one needs to get a small autocorrelation, the less informative individual samples are.
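One simple estimator of the lag-$k$ autocorrelation, just to make the formula concrete (assuming x is a 1D array of samples of one parameter):

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation of 1D array x at lag k (k >= 1)."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]
```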

The pandas function plotting.autocorrelation_plot() is useful for this. Note that seemingly random oscillations basically tell us the level of noise due to the finite chain length. A coherent drop as a function of lag indicates a genuine autocorrelation, and the lag at which it drops to within the noise is an approximate autocorrelation length. If we needed to thin the chains to conserve disk space, this would be a reasonable factor to thin by.
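For example, looking at the first parameter of the first chain (again assuming chains as above):

```python
import pandas as pd
import matplotlib.pyplot as plt

pd.plotting.autocorrelation_plot(pd.Series(chains[0][:, 0]))
plt.xlim(0, 500)   # zoom in on short lags, where any real correlation would live
plt.show()
```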

Checkpoint: Again, for this problem, the Gibbs chains should be very well behaved. Our autocorrelation plots basically look like noise (almost all points within the horizontal, dashed lines that pandas provides as an estimate of the noise).

Effective number of independent samples

From $m$ chains of length $n$, we can also estimate the "effective number of independent samples" as

$n_\mathrm{eff} = \frac{mn}{1+2\sum_{t=1}^\infty \hat{\rho}_t}$, with

$\hat{\rho}_t = 1 - \frac{V_t}{2V}$ ($V$ as in the Gelman-Rubin calculation), and

$V_t = \frac{1}{m(n-t)} \sum_{j=1}^m \sum_{i=t+1}^n (\theta_{i,j} - \theta_{i-t,j})^2$.

In practice, the sum in $n_\mathrm{eff}$ is cut off when the estimates $\hat{\rho}_t$ become "too noisy", e.g. when the sum of two successive values $\hat{\rho}_t$ and $\hat{\rho}_{t+1}$ is negative. Roughly speaking, this should occur when the lag is of the order of the autocorrelation length.

The effective_samples function below allows you to pass a guess at this maximum lag, since doing the calculation out to arbitrarily long lags becomes very expensive. It will issue a warning if it thinks this maximum lag is too small, according to the criterion above.
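The effective_samples function itself is provided; purely to illustrate the formulas above, a rough sketch (not the provided implementation) might look like:

```python
import numpy as np

def n_eff_sketch(chain_list, maxlag=500):
    """Rough effective-sample estimate for one parameter.
    chain_list: list of equal-length 1D arrays, burn-in removed."""
    x = np.array(chain_list)                   # shape (m, n)
    m, n = x.shape
    # V as in the Gelman-Rubin calculation
    W = x.var(axis=1, ddof=1).mean()
    B = n * x.mean(axis=1).var(ddof=1)
    V = (n - 1)/n * W + B/n
    rho_sum = 0.0
    rho_prev = None
    for t in range(1, maxlag + 1):
        Vt = ((x[:, t:] - x[:, :-t])**2).sum() / (m * (n - t))
        rho_t = 1.0 - Vt / (2.0 * V)
        if rho_prev is not None and rho_prev + rho_t < 0.0:
            break                              # estimates have become too noisy
        rho_sum += rho_t
        rho_prev = rho_t
    else:
        print('Warning: maxlag may be too small for the truncation criterion.')
    return m * n / (1.0 + 2.0 * rho_sum)

# e.g. n_eff_sketch([chain[:, 0] for chain in chains], maxlag=500)
```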

Since we started with 4 chains of length 10,000 (possibly minus a little burn-in), $n_\mathrm{eff}$ values close to 40,000 are telling us that the autocorrelation is indeed quite small, in agreement with the plots above. Let's remember that this need not be true for every problem; Gibbs sampling is not independence sampling and produces a Markov chain that in principle could be highly correlated, depending on the model and data involved.

Note that, as with the Gelman-Rubin statistic, this is a case where one might be interested in seeing the effective number of samples for the most degenerate linear combinations of parameters, rather than the parameters themselves.

Something to do

By now you are probably bored. Don't worry. Here is some work for you to do.

Let's get a sense of how many samples are really needed to, e.g., determine 1D credible intervals (as opposed to making the whole posterior look nice). Remember that the effective number of samples is less than the total.

At this point, we're done comparing the individual chains, so we can lump them all together into one massive list of MCMC samples.

For reference, the total number of samples is:
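A minimal sketch of lumping the chains together, assuming chains is still the post-burn-in list from above:

```python
chain_all = np.concatenate(chains, axis=0)
print(chain_all.shape)
```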

We'll label results computed from this full chain "40k", in the expectation that its length is still close to 40,000 steps, even after removing burn-in.

Let's have a look at the credible interval calculation for $\mu_0$.

The PDF estimate should look pretty reliable with so many samples. The question is, if we're going to reduce this to a statement like $\mu=X^{+Y}_{-Z}$, keeping only up to the leading significant figure of $Y$ and $Z$, how many did we actually need to keep?

Thin the chain by factors of 4, 40, and 400 (to produce chains of length about 10,000, 1,000, and 100), and see how the endpoints of the 68.3% credible intervals compare. We're looking at the endpoints rather than the values of $Y$ and $Z$ above because the latter are more volatile (depending also on the estimate of $X$).

Remember that thinning by a factor of 4 means that we keep only every 4th entry in the chain, not that we simply select the first 25% of samples. So we're not answering how long we needed to bother running the chain to begin with - that's a slightly different question. We're finding out how redundant our samples are, not just in the "effective independence" sense, but for the specific purpose of quantifying this credible interval.
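The notebook uses its own histogram-based credible-interval machinery for this; purely to illustrate the mechanics, a percentile-based comparison of the 68.3% interval endpoints might look like the following (assuming $\mu_0$ is the first column of chain_all, which may not match your parameter ordering):

```python
for k in [1, 4, 40, 400]:
    mu0 = chain_all[::k, 0]                      # keep every k-th sample
    lo, hi = np.percentile(mu0, [15.85, 84.15])  # central 68.3% endpoints
    print('thin by {:>3}: {:>6} samples, interval ({:.2f}, {:.2f})'.format(k, len(mu0), lo, hi))
```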

Checkpoint: Your mileage may vary slightly, of course. But we got a difference of unity at most, for a CI that was nominally $\pm7$ or so.

... which is a little surprising, honestly, even though we knew the autocorrelation was quite low in this case. But here's a slightly different question: which of the possible results would you be confident enough to put in a paper, having looked at the marginalized PDFs above? (The cell below compares them on the same plot, along with the 1 and 2$\sigma$ CIs.) And, relatedly, which one would you actually want to show in a figure?

Just to belabor the point, here are the mode and -low +high CIs for the 4 cases, at the precision we would actually report. With the chains in solutions/, the only non-trivial difference is in the 100-sample case. (And this is with the default histogram smoothing, which we would possibly tweak in that case.)

To belabor it even more, if we looked at the $\geq 1000$ sample PDFs and reasonably decided that they warranted symmetric CIs, these are the center $\pm$ width CIs we would report:

The bottom line is that summarizing a posterior with 1D best values and CIs is not, in fact, all that demanding in terms of the number of samples... provided that those samples are close to independent. Making a visual of the PDF in 1D or 2D that looks like it isn't noise dominated is often significantly more demanding. The upshot is that, if we have enough samples for the posterior plots to look not-noise-dominated, we can usually be pretty confident about numbers we report from them.

Metropolis samples

Now, read in the Metropolis chains and perform the same checks. Again, use the chains in solutions/ if you didn't work through the Metropolis notebook yourself.

Below we plot the traces. Address the same 3 questions posed for the Gibbs samples.

Compare the two methods in these terms (again, 3 parts to this answer).

On the basis of the traces above, choose a burn-in length to remove from the beginning of each chain.

Here we compute the G-R criterion:

Checkpoint: Your results will depend on the proposal distribution you used, but in this problem it's possible to get $R$ values very close to 1 again, though a bit larger than in the Gibbs case.

Next, we'll look at the autocorrelation plot:

This should look qualitatively different from the Gibbs case, in that there is a clear sign of correlation at small lags.

Next, the effective number of samples.

Checkpoint: This will likely be much smaller than it was for the Gibbs chains.

That being the case, we won't bother repeating the thinning exercise above. However, it's worth looking at the unthinned posterior for $\mu_0$ from Metropolis, with only $\sim400$ effectively independent samples (in the solutions chains), in the context of that exercise as it was done for Gibbs.

Very likely, you can see this posterior histogram retains significant noise due to the correlation of the chains, comparable to a much shorter chain of independent samples. The good news is that, as ugly as it looks, our summary of the best value and CI is still essentially what it should be, comparing with the results above.

Parting thoughts

Hopefully this notebook has given you some practice with evaluating MCMC results and deciding how trustworthy they are. We do want to emphasize again that any conclusions about the relative performance of the two samplers involved are not generalizable. Conjugate Gibbs is often, but not always, a very nice option when the model permits it. On the other hand, we're not seeing Metropolis at its best; it can be enhanced in ways that improve its performance beyond what we've seen here (we'll cover some in later notes). Combined with the fact that it can be used with any likelihood and prior, Metropolis sampling is an essential tool.