Tutorial: MCMC Diagnostics¶
For this notebook, you should already have generated and saved Markov chains in the Redshift Distribution tutorial using two algorithms, conjugate Gibbs and Metropolis. Here, we'll work through the process of diagnosing whether these Markov chains are usefully sampling the posterior distribution, in particular assessing:
- their convergence
- their autocorrelation
- the effective number of independent samples, and how this impacts the usual reported values and credible intervals
The diagnostics discussed below include both qualitative and quantitative checks. We don't particularly think it's all that instructive to write the code that does the quantitative calculations - though there is surely room for improvement or expansion if you're interested - so instead we will demonstrate how to use functions provided by the incredible and pandas packages.
from os import getcwd
from os.path import exists as file_exists
from yaml import safe_load
from glob import glob
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
import incredible as cr
thisTutorial = 'clredshift' # NB we are reading data from a previous tutorial!
if getcwd() == '/content':
# assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
from google.colab import drive
drive.mount('/content/drive')
datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
# assume we are running locally somewhere and have the data under ./data/
datapath = 'data/'
Gibbs samples¶
It's nice to start with the Gibbs sampled chains, since (for this problem) they almost certainly look nicer. First, read them in. The way we did things in the previous tutorial, each entry in the chains list will be an $N_\mathrm{steps} \times N_\mathrm{parameters}$ array. The cell below assumes they are present in the location where the Redshift Distribution tutorial writes them.
chains = [np.loadtxt(f) for f in glob(datapath+'clredshift_gibbs_*.txt.gz')]
param_labels = [r'$z_\mathrm{cl}$', r'$\sigma^2$']
Visual inspection¶
You've already used the most important method of vetting chains: visual inspection. The key questions are:
- Do multiple, independent chains appear to be sampling the same distribution (have they converged to the same distribution)?
- Is there a clear "burn-in" period before convergence that should be eliminated?
- Are the chains highly autocorrelated (taking small steps compared with the width of the posterior)? This is not an issue per se, if the chains are long enough, although it means the sampler is not moving as efficiently as one might hope.
Plot the parameter traces below, and answer these questions (qualitatively) for the conjugate Gibbs sampling chains.
plt.rcParams['figure.figsize'] = (16.0, 6.0)
fig, ax = plt.subplots(len(param_labels), 1);
cr.plot_traces(chains, ax, labels=param_labels, Line2D_kwargs={'markersize':1.0})
Space to answer the 3 key questions
I_have_answered = False # change to True when true
assert I_have_answered
Is there a burn-in that should clearly be removed? If so, do so here by setting the value of burn to a number of steps $>0$. (You shouldn't need to remove very much - if burn=100, for example, doesn't seem incredibly generous in this case, then something went wrong in the previous tutorial.)
# burn =
for i in range(len(chains)):
chains[i] = chains[i][burn:,:]
Gelman-Rubin statistic¶
Recall from the notes that the Gelman-Rubin convergence statistic, $R$, quantitatively tests the similarity of independent chains intended to sample the same PDF. To be meaningful, they should start from different locations and burn-in should be removed.
For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.
We'd like to see $R\approx 1$; for example, $R<1.1$ is often used.
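If you're curious what this statistic actually computes, below is a minimal sketch of the classic (non-split) $R$ for a single parameter. This is not the incredible implementation, which may differ in detail, so treat it as illustrative only.
def gelman_rubin_R(chains_1d):
    # chains_1d: list of 1D arrays, one per chain, assumed to have equal length
    x = np.array(chains_1d)              # shape (m, n): m chains of length n
    m, n = x.shape
    W = x.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * x.mean(axis=1).var(ddof=1)   # n times the variance of the chain means
    var_hat = (n - 1) / n * W + B / n    # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

# e.g. for z_cl: gelman_rubin_R([ch[:,0] for ch in chains])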
cr.GelmanRubinR(chains)
array([0.99997119, 0.99996812])
If your Gibbs sampler works properly, $R$ for the chains we ran should be very close to 1 (we have differences of order $10^{-5}$).
assert np.allclose(cr.GelmanRubinR(chains), [1.0]*len(param_labels), atol=1e-3)
Autocorrelation¶
Similarly, the autocorrelation of a sequence, as a function of lag, $k$, can be quantified:
$\rho_k = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$
The larger lag one needs to get a small autocorrelation, the less informative individual samples are.
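For concreteness, here is a minimal numpy sketch of this estimator at a single lag (a hypothetical helper, not part of incredible or pandas):
def autocorr(theta, k):
    # theta: 1D array of samples; k: lag (k >= 1)
    d = np.asarray(theta) - np.mean(theta)
    return np.mean(d[:-k] * d[k:]) / np.var(theta)   # rho_k

# e.g. autocorr(chains[0][:,0], 10)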
The pandas function plotting.autocorrelation_plot() is useful for this. Note that seemingly random oscillations basically tell us the level of noise due to the finite chain length. A coherent drop as a function of lag indicates a genuine autocorrelation, and the lag at which it drops to within the noise (roughly indicated by the horizontal, gray lines) is an approximate autocorrelation length. If we needed to thin the chains to conserve disk space, this would be a reasonable factor to thin by.
plt.rcParams['figure.figsize'] = (16.0, 6.0)
fig, ax = plt.subplots(len(param_labels), 1);
for j,lab in enumerate(param_labels):
pd.plotting.autocorrelation_plot(chains[0][:,j], ax=ax[j]);
ax[j].set_ylabel(lab+' autocorrelation')
Again, for this problem, the Gibbs chains should be very well behaved. Our autocorrelation plots look like noise at almost all lags.
Yup_that_checks_out = False # change to True if, yup, that checks out
assert Yup_that_checks_out
Effective number of independent samples¶
From $m$ chains of length $n$, we can also estimate the "effective number of independent samples" as
$n_\mathrm{eff} = \frac{mn}{1+2\sum_{t=1}^\infty \hat{\rho}_t}$, with
$\hat{\rho}_t = 1 - \frac{V_t}{2V}$ ($V$ as in the Gelman-Rubin calculation), and
$V_t = \frac{1}{m(n-t)} \sum_{j=1}^m \sum_{i=t+1}^n (\theta_{i,j} - \theta_{i-t,j})^2$.
In practice, the sum in $n_\mathrm{eff}$ is cut off when the estimates $\hat{\rho}_t$ become "too noisy", e.g. when the sum of two successive values $\hat{\rho}_t$ and $\hat{\rho}_{t+1}$ is negative. Roughly speaking, this should occur when the lag is of the order of the autocorrelation length.
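To make the recipe concrete, here is a minimal sketch of this calculation for one parameter. It approximates $V$ by the overall sample variance rather than the full Gelman-Rubin estimate, so it will not exactly reproduce the incredible function used below.
def neff_sketch(chain_list, j=0, maxlag=500):
    # chain_list: list of (Nsteps, Nparams) arrays of equal length; j: parameter index
    x = np.array([ch[:,j] for ch in chain_list])    # shape (m, n)
    m, n = x.shape
    V = x.var()                                     # crude stand-in for V (see note above)
    rho_sum = 0.0
    for t in range(1, maxlag, 2):
        rho_t  = 1.0 - np.mean((x[:,t:]   - x[:,:-t])**2)     / (2.0*V)
        rho_t1 = 1.0 - np.mean((x[:,t+1:] - x[:,:-(t+1)])**2) / (2.0*V)
        if rho_t + rho_t1 < 0.0:                    # estimates too noisy; truncate the sum here
            break
        rho_sum += rho_t + rho_t1
    return m*n / (1.0 + 2.0*rho_sum)

# e.g. neff_sketch(chains, j=0), neff_sketch(chains, j=1)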
The effective_samples function below allows you to pass a guess at this maximum lag, since doing the calculation to arbitrarily long lags becomes very expensive. It will issue a warning if it thinks this maximum lag is too small, according to the criterion above.
maxlag = 500
cr.effective_samples(chains, maxlag=maxlag) # `maxlag` might be something you need to play with, in practice
array([39603.76123811, 36826.36833789])
Since we started with 4 chains of length 10,000 (possibly minus a little burn-in), $n_\mathrm{eff}$ values close to 40,000 are telling us that the autocorrelation is indeed quite small, in agreement with the plots above. Let's remember that this need not be true for every problem; Gibbs sampling is not independence sampling and produces a Markov chain that in principle could be highly correlated, depending on the model and data involved.
assert np.all(cr.effective_samples(chains, maxlag=maxlag) > 30e3)
Note that, as with the Gelman-Rubin statistic, this is a case where one might be interested in seeing the effective number of samples for the most degenerate linear combinations of parameters, rather than the parameters themselves.
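For instance, one way to do that (purely illustrative, not part of the exercise, and assuming cr.effective_samples is happy with any list of Nsamples $\times$ Nparams arrays, per the note at the end of this notebook) would be to project the chains onto the eigenvectors of the pooled parameter covariance and compute the effective samples along those directions:
pooled = np.concatenate(chains, axis=0)                            # all samples, just for the covariance
evals, evecs = np.linalg.eigh(np.cov(pooled, rowvar=False))        # principal axes of the parameter covariance
rotated = [(ch - pooled.mean(axis=0)) @ evecs for ch in chains]    # each chain in the rotated basis
cr.effective_samples(rotated, maxlag=maxlag)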
Something to do¶
By now you are probably bored. Don't worry. Here is some work for you to do.
Let's get a sense of how many samples are really needed to, e.g., determine 1D credible intervals (as opposed to making the whole posterior look nice). Remember that the effective number of samples is less than the total.
At this point, we're done comparing the individual chains, so we can lump them all together into one massive list of MCMC samples.
chain = np.concatenate(chains, axis=0)
For reference, the total number of samples is:
chain.shape[0]
39960
We'll tag the results from this full chain with 40k, in the expectation that its length is still close to 40,000 steps, even after removing burn-in.
Let's have a look at the credible interval calculation for $z_{\mathrm{cl}}$.
print(chain.shape[0], 'samples')
plt.rcParams['figure.figsize'] = (12.0, 4.0)
fig, ax = plt.subplots(1, 2);
h40k = cr.whist(chain[:,0], plot=ax[0])
ci40k = cr.whist_ci(h40k, plot=ax[1]);
ax[0].set_xlabel(param_labels[0]);
ax[1].set_xlabel(param_labels[0]);
ci40k
39960 samples
{'mode': np.float64(1.9781141777429876), 'level': array([0.68268949, 0.95449974]), 'prob': array([0.68446821, 0.95472974]), 'density': array([134.31573489, 27.90110838]), 'min': array([1.97622887, 1.97443782]), 'max': array([1.97968527, 1.98150774]), 'low': array([-0.00188531, -0.00367636]), 'high': array([0.00157109, 0.00339356]), 'center': array([1.97795707, 1.97797278]), 'width': array([0.0017282 , 0.00353496])}
The PDF estimate should look pretty reliable with so many samples. The question is, if we're going to reduce this to a statement like $z_{\mathrm{cl}}=X^{+Y}_{-Z}$, keeping only up to the leading significant figure of $Y$ and $Z$, how many did we actually need to keep?
Thin the chain by factors of 4, 40, and 400 (to produce chains of length about 10000, 1000 and 100), and see how the endpoints of the 68.3% credible intervals compare. We're looking at the endpoints rather than the values of $Y$ and $Z$ above because the latter are more volatile (depending also on the estimate of $X$).
Remember that thinning by a factor of 4 means that we keep only every 4th entry in the chain, not that we simply select the first 25% of samples. So we're not answering how long we needed to bother running the chain to begin with - that's a slightly different question. We're finding out how redundant our samples are, not just in the "effective independence" sense, but for the specific purpose of quantifying this credible interval.
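To be explicit about the slicing involved (variable names below are just for illustration):
thin4 = chain[::4, :]                       # keep every 4th sample: ~10k rows spread over the whole chain
first_quarter = chain[:chain.shape[0]//4]   # the first 25% of rows - NOT the same thing
print(thin4.shape, first_quarter.shape)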
# No clues here, but it's pretty much cut and paste.
# Analogous to the cell above, save the output of `whist` in h10k, h1k, h100, and the output of
# whist_ci in ci10k, ci1k and ci100. This is so we can plot them all together later.
print('40k:', ci40k['min'][0], ci40k['max'][0])
print('10k:', ci10k['min'][0], ci10k['max'][0])
print(' 1k:', ci1k['min'][0], ci1k['max'][0])
print('100:', ci100['min'][0], ci100['max'][0])
40k: 1.9762288650313484 1.9796852716693534
10k: 1.9762328043301312 1.9797232683318629
 1k: 1.9761655262325257 1.9797539892774076
100: 1.9765343165947464 1.980246564087866
Your mileage may vary slightly, of course, but we see differences of order $10^{-4}$, several times smaller than the width of the interval.
... which is a little surprising, honestly, even though we knew the autocorrelation was quite low in this case. But here's a slightly different question: which of the possible results would you be confident enough to put in a paper, having looked at the marginalized PDFs? (The cell below compares them on the same plot, along with the 1 and 2$\sigma$ CIs.) And, relatedly, which one would you actually want to show in a figure?
plt.rcParams['figure.figsize'] = (14.0, 5.0)
fig, ax = plt.subplots(1, 2);
ax[0].plot(h40k['x'], h40k['density'], '-', label='40k');
ax[0].plot(h10k['x'], h10k['density'], '-', label='10k');
ax[0].plot(h1k['x'], h1k['density'], '-', label='1k');
ax[0].plot(h100['x'], h100['density'], '-', label='100');
ax[0].legend();
ax[0].set_xlabel(param_labels[0]);
ax[1].plot(0.0, ci40k['mode'], 'o', color='C0', label='40k');
ax[1].plot([0.0]*2, [ci40k['min'][0],ci40k['max'][0]], '-', color='C0', linewidth=3);
ax[1].plot([0.0]*2, [ci40k['min'][1],ci40k['max'][1]], '--', color='C0');
ax[1].plot(1.0, ci10k['mode'], 'o', color='C1', label='10k');
ax[1].plot([1.0]*2, [ci10k['min'][0],ci10k['max'][0]], '-', color='C1', linewidth=3);
ax[1].plot([1.0]*2, [ci10k['min'][1],ci10k['max'][1]], '--', color='C1');
ax[1].plot(2.0, ci1k['mode'], 'o', color='C2', label='1k');
ax[1].plot([2.0]*2, [ci1k['min'][0],ci1k['max'][0]], '-', color='C2', linewidth=3);
ax[1].plot([2.0]*2, [ci1k['min'][1],ci1k['max'][1]], '--', color='C2');
ax[1].plot(3.0, ci100['mode'], 'o', color='C3', label='100');
ax[1].plot([3.0]*2, [ci100['min'][0],ci100['max'][0]], '-', color='C3', linewidth=3);
ax[1].plot([3.0]*2, [ci100['min'][1],ci100['max'][1]], '--', color='C3');
ax[1].legend();
ax[1].set_ylabel(param_labels[0]);
Just to belabor the point, here are the mode and -low +high CIs for the 4 cases, at the precision we would actually report (NB because everyone's data set is somewhat random, you may need to change the number of digits to round to). In our solutions, the only non-trivial difference is in the 100-sample case.
roundto = 4
print('~40k samples:', (np.round(ci40k['mode'], roundto)), (np.round(ci40k['low'][0], roundto)), '+'+str((np.round(ci40k['high'][0], roundto))))
print('~10k samples:', (np.round(ci10k['mode'], roundto)), (np.round(ci10k['low'][0], roundto)), '+'+str((np.round(ci10k['high'][0], roundto))))
print(' ~1k samples:', (np.round(ci1k['mode'], roundto)), (np.round(ci1k['low'][0], roundto)), '+'+str((np.round(ci1k['high'][0], roundto))))
print('~100 samples:', (np.round(ci100['mode'], roundto)), (np.round(ci100['low'][0], roundto)), '+'+str((np.round(ci100['high'][0], roundto))))
~40k samples: 1.9781 -0.0019 +0.0016
~10k samples: 1.9781 -0.0019 +0.0016
 ~1k samples: 1.9781 -0.0019 +0.0017
~100 samples: 1.9788 -0.0023 +0.0014
The bottom line is that summarizing a posterior with 1D best values and CIs is not, in fact, all that demanding in terms of the number of samples... provided that those samples are close to independent. Making a visual of the PDF in 1D or 2D that looks like it isn't noise dominated is often significantly more demanding. The upshot is that, if we have enough samples for the posterior plots to look not-noise-dominated, we can usually be pretty confident about numbers we report from them. Keep in mind also that we used the "easier" of the two parameters in this problem for the example; typically the posterior for $\sigma^2$ is less symmetric, and the samples more correlated.
Metropolis samples¶
Now, read in the Metropolis chains and perform the same checks. Use the chains in solutions/ if you didn't work through the Metropolis notebook yourself.
chains = [np.loadtxt(f) for f in glob(datapath+'clredshift_metro_*.txt.gz')]
Below we plot the traces.
plt.rcParams['figure.figsize'] = (16.0, 6.0)
fig, ax = plt.subplots(len(param_labels), 1);
cr.plot_traces(chains, ax, labels=param_labels, Line2D_kwargs={'markersize':1.0})
On the basis of the traces above, choose a burn-in length to remove from the beginning of each chain.
# mburn =
for i in range(len(chains)):
chains[i] = chains[i][mburn:,:]
Depending on how the burn-in period looked, the remaining traces might be clearer.
plt.rcParams['figure.figsize'] = (16.0, 6.0)
fig, ax = plt.subplots(len(param_labels), 1);
cr.plot_traces(chains, ax, labels=param_labels, Line2D_kwargs={'markersize':1.0})
Now that you have hopefully had a good look at these chains, answer the same key questions from the first "Visual inspection" section.
Space to answer the 3 key questions
Compare the chains from the two methods in these terms (again, 3 parts to this answer).
Space to answer the 3 key questions, comparing the MCMC algorithms
I_have_answered_everything = False # change to True when it is true
assert I_have_answered_everything
Here we compute the G-R criterion:
cr.GelmanRubinR(chains)
array([1.00055024, 1.00004532])
Your results will depend on the proposal distribution you used, but in this problem it's possible to get $R$ values very close to 1 again, though possibly a bit larger than in the Gibbs case.
assert np.allclose(cr.GelmanRubinR(chains), [1.0]*len(param_labels), atol=1e-2)
Next, we'll look at the autocorrelation plot:
plt.rcParams['figure.figsize'] = (16.0, 6.0)
fig, ax = plt.subplots(len(param_labels), 1);
for j,lab in enumerate(param_labels):
pd.plotting.autocorrelation_plot(chains[0][:,j], ax=ax[j]);
ax[j].set_ylabel(lab+' autocorrelation')
This will likely look qualitatively different from the Gibbs case, in that there is a clear sign of correlation at small lags.
Next, the effective number of samples, increasing maxlag if needed.
maxlag = 500
cr.effective_samples(chains, maxlag=maxlag)
array([3164.2126463 , 4620.66829709])
This will likely be much smaller than it was for the Gibbs chains, due to higher autocorrelation, but we should still aim to have $n_\mathrm{eff}>1000$.
assert np.all(cr.effective_samples(chains, maxlag=maxlag) > 1e3)
That being the case, we won't bother repeating the thinning exercise above. However, it's worth comparing the marginalized constraints we infer from this method to the other. Here they are for $z_\mathrm{cl}$:
chain = np.concatenate(chains, axis=0)
print('Total samples:', chain.shape[0])
plt.rcParams['figure.figsize'] = (12.0, 4.0)
fig, ax = plt.subplots(1, 2);
hmetro = cr.whist(chain[:,0], plot=ax[0])
cimetro = cr.whist_ci(hmetro, plot=ax[1]);
ax[0].set_xlabel(param_labels[0]);
ax[1].set_xlabel(param_labels[0]);
cimetro
Total samples: 38000
{'mode': np.float64(1.9781765699674), 'level': array([0.68268949, 0.95449974]), 'prob': array([0.68398875, 0.95523014]), 'density': array([132.29231941, 28.05091835]), 'min': array([1.97617317, 1.97428425]), 'max': array([1.97975067, 1.98149649]), 'low': array([-0.0020034 , -0.00389232]), 'high': array([0.0015741 , 0.00331992]), 'center': array([1.97796192, 1.97789037]), 'width': array([0.00178875, 0.00360612])}
You can/should compare them in detail (for both parameters). For our solutions, the 68.3% CI for $z_\mathrm{cl}$ is essentially identical:
print('Gibbs:', round(ci40k['min'][0], roundto), round(ci40k['max'][0], roundto))
print('Metro:', round(cimetro['min'][0], roundto), round(cimetro['max'][0], roundto))
Gibbs: 1.9762 1.9797
Metro: 1.9762 1.9798
In other words, all that extra correlation doesn't really matter, as long as we have enough effectively independent samples at the end of the day. Provided, of course, that the extra time spent producing them is not a barrier.
Parting thoughts¶
Hopefully this notebook has given you some practice with evaluating MCMC results and deciding how trustworthy they are. We do want to emphasize again that any conclusions about the relative performance of the two samplers involved are not generalizable. Conjugate Gibbs is often, but not always, a very nice option when the model permits it. On the other hand, we're not seeing Metropolis at its best; it can be enhanced in ways that improve its performance beyond what we've seen here (we'll cover some in later notes). Combined with the fact that it can be used with any likelihood and prior, Metropolis sampling is an essential tool.
Parting technical note¶
Should you be inclined to use incredible beyond the provided code in this course (and why wouldn't you?), note that all of its functions that take multiple chains as arguments require them to be in one of the following formats:
- a list of Nsamples $\times$ Nparams arrays, as in this notebook
- a 3D Nchains $\times$ Nsamples $\times$ Nparams array
In particular, passing a 3D array with some other axis order will probably not crash anything, but will produce nonsense.
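For concreteness, here is what the two accepted layouts look like for the chains in this notebook (a sketch, assuming all chains have equal length so that np.stack works):
chain_list = chains                     # a list of (Nsamples, Nparams) arrays, as used throughout
chain_3d = np.stack(chains, axis=0)     # a (Nchains, Nsamples, Nparams) array
# If an array has some other axis order, e.g. (Nsamples, Nchains, Nparams),
# rearrange it first, e.g. chain_3d = np.moveaxis(arr, 1, 0)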