Tutorial: Cluster redshift distribution¶
In this notebook, you will implement two of the simpler algorithms for Markov chain Monte Carlo. In practice, you will probably use a package provided by someone else in most circumstances, but it's useful to build these samplers from the ground up at least once to see how they work under the hood.
To be concrete, you will
- define a generative model for redshifts of the member galaxies of a cluster
- determine the conditional posterior of each parameter, and implement conjugate Gibbs updates
- implement proposal distributions and the Metropolis accept/reject rule
- run both algorithms, and compare the resulting samples from the posterior distribution
from os import getcwd
from os.path import exists as file_exists
from yaml import safe_load
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
import incredible as cr
thisTutorial = 'clredshift'
if getcwd() == '/content':
    # assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
    from google.colab import drive
    drive.mount('/content/drive')
    datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
    # assume we are running locally somewhere and have the data under ./data/
    datapath = 'data/'
Background¶
Since the emphasis here is on sampling methods, we'll use a relatively simple data set and model. The data are spectroscopically determined redshifts for galaxies in cluster XLSSC 122 (Willis et al. 2020). The cluster was discovered through its X-ray emission, and initially assigned a photometrically determined redshift of $1.9 \pm 0.2$. This is (still, at this writing) an extremely high redshift for a massive cluster, and as a result pinning it down more precisely (i.e. spectroscopically) required space-based data, hence the Hubble observations presented by Willis et al. (In fact, this is a rare case where the first spectroscopic redshift came from follow-up X-ray data... which is not relevant, since the Hubble redshift is 100 times more precise, but still fun.)
Here is the top of the data table from the paper, showing redshifts measured for individual galaxies. We'll be using these galaxy redshifts to estimate the redshift of the cluster itself, assuming they follow a known distribution.
ID | RA | Dec | Magnitude | Colour | Redshift | Notes |
---|---|---|---|---|---|---|
526 | 34.43422 | -3.75880 | 20.64 | 1.44 | 1.980 | G |
451 | 34.42228 | -3.76351 | 21.95 | 1.29 | 1.981 | G |
657 | 34.43410 | -3.75766 | 21.67 | 1.49 | 1.983 | G |
1032 | 34.43245 | -3.74992 | 22.38 | 1.33 | 1.982 | G |
295 | 34.43503 | -3.76795 | 22.50 | 1.56 | 1.987 | G |
... | ... | ... | ... | ... | ... | ... |
The data in the public data directory are the "G" and "GE" class redshifts from this table, which is a bit sketchy, as a redshift cut is already used in those classifications. We're keeping it simple in this notebook, and just using the data to demonstrate some sampling methods. One of the "Practice" tutorials involves fitting a cluster galaxy redshift distribution simultaneously with a background distribution of non-cluster galaxy redshifts, if you're interested.
In any case, as usual, the data you'll be using have been randomly generated just for you! Let's read them in:
z_gal = np.loadtxt(datapath+'redshifts.txt')
z_gal
array([1.98 , 1.981, 1.983, 1.982, 1.987, 1.963, 1.993, 1.977, 1.988, 1.966, 1.977, 1.979, 1.963, 1.996, 1.976, 1.991, 1.981, 1.962, 1.979, 1.963, 1.972, 1.993, 1.997, 1.969, 1.971, 1.981, 1.977, 1.963, 1.977, 1.979, 1.978, 1.972, 1.976])
This will show a histogram of the galaxy redshifts:
plt.rcParams['figure.figsize'] = (4.0, 3.0)
plt.hist(z_gal);
plt.xlabel('redshift');
Define the model and priors¶
Physically, we expect the distribution of cluster member velocities (hence redshifts) to be approximately Gaussian. Our model will therefore simply be that the redshifts above follow a Gaussian distribution, whose mean is the cluster redshift. The width of the distribution is related to the mass of the cluster, although we won't pursue that here other than obtaining constraints on it.
Because it's most natural for the conjugate Gibbs method we'll use, let's parameterize the model with
- the Gaussian's mean, $z_\mathrm{cl}$, and
- its variance, $\sigma^2$.
param_names = ['z_cl', 'sigma2']
param_labels = [r'$z_\mathrm{cl}$', r'$\sigma^2$']
For actual humans, it's usually more intuitive to think about the standard deviation rather than variance, so once we have our samples we'll transform $\sigma^2$ back to $\sigma$. This is a little awkward, but it's nice to take advantage of conjugacies when possible, so we'll deal with it this one time.
final_names = ['z_cl', 'sigma']
final_labels = [r'$z_\mathrm{cl}$', r'$\sigma$']
As always, you will need to choose priors for parameters. Again to take advantage of conjugacies, we'll want to use a Gaussian prior on $z_\mathrm{cl}$ and an inverse-gamma (!) prior on $\sigma^2$; these are both 2-parameter distributions and, while not infinitely flexible, both admit a wide range of behaviors, including common "uninformative" choices. This is probably the last tutorial where we will be so prescriptive about the form of priors.
Note: In class we will discuss and agree on a common set of priors that everyone should use.
Since there's no reason you should already be familiar with it, the inverse-gamma density is
$\mathrm{InvGamma}(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \, x^{-\alpha-1} e^{-\beta/x}$
for $x\geq0$, $\alpha>0$ and $\beta>0$.
If you stare at this for a while, you'll see that certain "uninformative" priors that we might want to use (e.g. uniform in $\sigma^2$, uniform in $\sigma$, or the Jeffreys prior) can be accommodated - but only by using values of $\alpha$ and/or $\beta$ that are not allowed by the distribution's definition. This is fine. All that actually matters is that the parameters of the conditional posterior are allowed, which is not the same as requiring the parameters of the prior distribution to be allowed. Math, but true. These improper priors will, however, cause scipy.stats.invgamma
to break if we try to evaluate the prior; so, just in case we choose to use one of them, it will be helpful to have the PDF and log-PDF coded in a form that retains the correct functional dependences but omits the problematic normalizing constants.
# These are lazy but functional - you may see runtime warnings about dividing by or taking the log of zero
def invgamma_unpdf(x, alpha, beta):
    'Inverse-gamma PDF without the normalization constants'
    return np.where(x<=0, 0.0, x**(-alpha-1) * np.exp(-beta/x))

def invgamma_unlogpdf(x, alpha, beta):
    'Inverse-gamma log-PDF without the normalization constants'
    return np.where(x<=0, -np.inf, (-alpha-1)*np.log(x) - beta/x)
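As a quick, optional check that these behave as intended, we can verify that, for proper values of $\alpha$ and $\beta$, invgamma_unpdf differs from scipy's normalized inverse-gamma PDF only by a constant factor (the specific numbers below are arbitrary; np and st are the modules imported at the top of the notebook):
alpha_test, beta_test = 2.0, 3.0     # arbitrary proper (allowed) values
xx = np.array([0.5, 1.0, 2.0])
ratio = invgamma_unpdf(xx, alpha_test, beta_test) / st.invgamma.pdf(xx, alpha_test, scale=beta_test)
assert np.allclose(ratio, ratio[0])  # constant ratio = the omitted normalization
For concreteness, matching exponents in the density above (and using $p(\sigma^2) \propto p(\sigma)/\sigma$) shows how a few of those "uninformative" choices map onto $(\alpha,\beta)$; none of these is a recommendation, just an illustration of the point about improper priors:
- uniform in $\sigma^2$: $p(\sigma^2) \propto 1$, corresponding to $\alpha=-1$, $\beta=0$;
- uniform in $\sigma$: $p(\sigma^2) \propto (\sigma^2)^{-1/2}$, corresponding to $\alpha=-\frac{1}{2}$, $\beta=0$;
- the Jeffreys prior: $p(\sigma^2) \propto \sigma^{-2}$, corresponding to $\alpha=0$, $\beta=0$.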
Before going on, draw the PGM and specify all the probabilistic relationships in the model.
Space for PGM and generative model
Next, define the hyperparameters corresponding to the priors you chose above:
mu0: mean of the Gaussian prior on z_cl
tau0: STANDARD DEVIATION of the Gaussian prior on z_cl
alpha0: alpha parameter of the inverse-gamma prior on sigma2
beta0: beta parameter of the inverse-gamma prior on sigma2
# hyperparams = {'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...}
Let's have a look at the prior PDFs:
print('Normal prior hyperparameters for z_cl:', hyperparams['mu0'], hyperparams['tau0'])
print('Inverse-gamma prior hyperparameters for sigma2:', hyperparams['alpha0'], hyperparams['beta0'])
plt.rcParams['figure.figsize'] = (10.0, 3.0)
fig, ax = plt.subplots(1,2)
xx = np.linspace(hyperparams['mu0']-5*hyperparams['tau0'], hyperparams['mu0']+5*hyperparams['tau0'], 1000)
ax[0].plot(xx, st.norm.pdf(xx, hyperparams['mu0'], hyperparams['tau0']));
ax[0].set_xlabel(param_labels[0]); ax[0].set_ylabel(r'$p($'+param_labels[0]+r'$)$');
xx = np.linspace(1e-4, 0.004, 1000) # limits are hacky
ax[1].plot(xx, invgamma_unpdf(xx, hyperparams['alpha0'], hyperparams['beta0']));
ax[1].set_xlabel(param_labels[1]); ax[1].set_ylabel(r'$p($'+param_labels[1]+r'$)$ (unnormalized)');
Normal prior hyperparameters for z_cl: 1.99 0.06
Inverse-gamma prior hyperparameters for sigma2: 0.0 0.0
Note that these may not match the priors we decide on in class.
Solution using conjugate Gibbs sampling¶
Conjugacy relations¶
Recall that the strength of conjugate Gibbs sampling is that, where applicable, it lets us tailor a very efficient sampler to a given problem. The cost of this is having to juggle a little math. Given that you can look up the solutions for all of the common scenarios on Wikipedia, we won't make you work out all of the algebra, but it's worth doing if you are so inclined (and it isn't terribly involved).
Relation for $z_\mathrm{cl}$¶
Thanks to our choice of a normal prior, the fully conditional posterior for $z_\mathrm{cl}$ is proportional to a product of exclusively Gaussians:
$p(z_\mathrm{cl}|z,\sigma^2) \propto p(z_\mathrm{cl}) \prod_i p(z_i|z_\mathrm{cl},\sigma^2) = \mathrm{Normal}(z_\mathrm{cl}|\mu_0,\tau_0) \prod_i \mathrm{Normal}(z_i|z_\mathrm{cl},\sigma)$,
where $z$ is the vector of galaxy redshifts. First, we'll swap $z_i$ and $z_\mathrm{cl}$ in the second term, using the symmetry of the Gaussian density with respect to these two parameters:
$p(z_\mathrm{cl}|z,\sigma^2) \propto \mathrm{Normal}(z_\mathrm{cl}|\mu_0,\tau_0) \prod_i \mathrm{Normal}(z_\mathrm{cl}|z_i,\sigma)$.
We can then take advantage of an extremely useful identity for the product of two Gaussians, which can be written
$\mathrm{Normal}(x|\mu_1,\sigma_1) \, \mathrm{Normal}(x|\mu_2,\sigma_2) = \mathrm{Normal}(x|\mu_a,\sigma_a) \, \mathrm{Normal}(0|\mu_b,\sigma_b)$,
where
- $\mu_a = \sigma_a^{2}\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right)$,
- $\sigma_a^2 = \left(\sigma_1^{-2} + \sigma_2^{-2}\right)^{-1}$,
- $\mu_b = \mu_1 - \mu_2$,
- $\sigma_b^2 = \sigma_1^2 + \sigma_2^2$.
That is, we end up with the product of a Gaussian PDF for $x$, whose mean is a weighted average of the original means and whose width is smaller than either original width, and a second Gaussian PDF that doesn't depend on $x$, whose mean is the difference between the original means and whose variance is the quadrature sum of the original variances. If the original expression represents two independent constraints on $x$, this has a natural interpretation: the first term on the right encodes the joint constraint, while the second (which could be rewritten $\mathrm{Normal}[\mu_1|\mu_2,\sigma_b]$) encodes the agreement between the two original constraints.
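Since we lean on this identity repeatedly, here is a quick numerical spot-check (the particular numbers are arbitrary; np and st are the modules imported at the top of the notebook):
x, mu1, s1, mu2, s2 = 0.3, 1.0, 0.5, -0.2, 1.5
lhs = st.norm.pdf(x, mu1, s1) * st.norm.pdf(x, mu2, s2)
var_a = 1.0 / (s1**-2 + s2**-2)                    # sigma_a^2
mu_a = var_a * (mu1/s1**2 + mu2/s2**2)
sigma_b = np.sqrt(s1**2 + s2**2)
rhs = st.norm.pdf(x, mu_a, np.sqrt(var_a)) * st.norm.pdf(0.0, mu1-mu2, sigma_b)
assert np.isclose(lhs, rhs)                        # the two sides agree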
In any case, repeated application of this identity simplifies our conditional posterior to
$p(z_\mathrm{cl}|z,\sigma^2) = \mathrm{Normal}\left(z_\mathrm{cl}\left|\left[\frac{1}{\tau_0^2}+\frac{N}{\sigma^2}\right]^{-1}\left[\frac{\mu_0}{\tau_0^2}+\frac{\sum_i z_i}{\sigma^2}\right], \left[\frac{1}{\tau_0^2}+\frac{N}{\sigma^2}\right]^{-1/2}\right.\right)$,
where $N$ is the length of $z$.
Here we have only kept the first term from the identity at each application, since we only need to track terms that depend on $z_\mathrm{cl}$. In this case, what remains is a correctly normalized PDF, so we can automatically promote the proportionality in the last line to an equality: had we kept track of those second terms, they would necessarily have ended up exactly canceling the evidence, leaving the conditional posterior normalized. This is known as a shortcut.
Relation for $\sigma^2$¶
The conjugate distribution for $\sigma^2$ in this problem is inverse-gamma, which is why we used it for the prior above. (You may also see the scaled-inverse-chisquare distribution mentioned; this is a special case of inverse-gamma, so it also works.) As noted above, the functional form of this distribution is
$\mathrm{InvGamma}(x|\alpha,\beta) \propto x^{-\alpha-1} e^{-\beta/x}$.
So, the conditional posterior for $\sigma^2$ looks like
$p(\sigma^2|z,z_\mathrm{cl}) \propto p(\sigma^2) \prod_i p(z_i|z_\mathrm{cl},\sigma^2) \propto \sigma^{-2(\alpha_0+1)} e^{-\beta_0/\sigma^2} \prod_i \frac{1}{\sigma} e^{-\frac{(z_i-z_\mathrm{cl})^2}{2\sigma^2}}$,
where we have expanded out the Gaussian PDF in the sampling distribution and thrown away terms that do not involve $\sigma$. For not much effort, you can simplify this to an expression with the same form as the inverse-gamma density above, with $\alpha=\alpha_0+N/2$ and $\beta = \beta_0 + \frac{1}{2}\sum_i (z_i - z_\mathrm{cl})^2$.
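Spelling out that simplification, collecting the powers of $\sigma$ and combining the exponentials gives
$p(\sigma^2|z,z_\mathrm{cl}) \propto (\sigma^2)^{-(\alpha_0+N/2)-1} \exp\left[-\frac{1}{\sigma^2}\left(\beta_0 + \frac{1}{2}\sum_i (z_i-z_\mathrm{cl})^2\right)\right]$,
which matches the inverse-gamma form with exactly those values of $\alpha$ and $\beta$.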
Implementation¶
So, now we have rules for drawing samples from the conditional posterior distributions of both parameters. Note that this is not the same as being able to sample both parameters in a single step; we still need to construct a Markov chain by updating the parameters in series.
That being the case, we will need an initial position in parameter space at which to start the chain. Any reasonable guess will do, given that this is a fairly simple problem. (An unreasonable guess will likely also work, although we don't recommend it.)
# guess = {'z_cl':..., 'sigma2':...}
guess
Now for the fun part: write a function that takes the data (z_gal
), the model parameter dictionary and the prior hyperparameter dictionary as input, and returns the parameters of the conditional posterior for $z_\mathrm{cl}$. Even though you'll have chosen specific hyperparameters already, you should write your code such that any valid values can be passed.
def conditional_post_z_cl(data, par, hypar):
    # return a tuple (mean, stdev) encoding the normal distribution from which z_cl should be drawn
Now do the same for $\sigma^2$.
def conditional_post_sigma2(data, par, hypar):
    # return a tuple (alpha, beta) encoding the inverse-gamma distribution from which sigma2 should be drawn
This will test that each function works correctly for a given, arbitrary input:
condpost_test = safe_load(open(datapath+'test_condpost.yaml', 'r').read())
assert np.allclose(conditional_post_z_cl(z_gal, condpost_test['testpar'], condpost_test['testhypar']), condpost_test['z_cl'])
assert np.allclose(conditional_post_sigma2(z_gal, condpost_test['testpar'], condpost_test['testhypar']), condpost_test['sigma2'])
If those conjugacy definitions are working, we can move on. The following functions, which are given, take the same arguments and will update the parameter dictionary in place.
def update_z_cl(data, par, hypar):
    mean, sd = conditional_post_z_cl(data, par, hypar)
    par['z_cl'] = st.norm.rvs(mean, sd)

def update_sigma2(data, par, hypar):
    alpha, beta = conditional_post_sigma2(data, par, hypar)
    par['sigma2'] = st.invgamma.rvs(alpha, scale=beta)
    #par['sigma2'] = 1 / st.gamma.rvs(alpha, scale=1/beta) # equivalent
Finally, let's test all of that by calling each function and verifying that all the parameters have changed (to finite, allowed values).
params = guess.copy()
print('Before:', params)
update_z_cl(z_gal, params, hyperparams)
update_sigma2(z_gal, params, hyperparams)
print('After:', params)
print('Difference:', {k:params[k]-guess[k] for k in params.keys()})
Before: {'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}
After: {'z_cl': np.float64(1.979476132505211), 'sigma2': np.float64(0.00011122940104082998)}
Difference: {'z_cl': np.float64(0.0015367385658169308), 'sigma2': np.float64(1.8081558983896787e-05)}
Results¶
We now have the necessary machinery to build a Markov chain, starting from your guessed parameter values above, and updating each parameter in turn. The cell below will do so, storing the chain in an $N_\mathrm{samples}\times2$ array. The order of the individual parameter updates is arbitrary, and could even be randomized if you particularly wanted to.
%%time
params = guess.copy()
nsamples = 10000
gchain = np.zeros((nsamples, len(params)))
for i in range(nsamples):
    update_z_cl(z_gal, params, hyperparams)
    update_sigma2(z_gal, params, hyperparams)
    gchain[i,:] = [params[k] for k in param_names]
CPU times: user 1.46 s, sys: 48.1 ms, total: 1.51 s Wall time: 1.48 s
Let's do the most basic (yet still extremely important) visual check to see how our sampler performed, looking at traces of the Markov chain for each parameter. (It's ok if you haven't read the notes on MCMC Diagnostics yet; we will go more in-depth later.) These trace plots show the value of each parameter as a function of iteration, and we'll add a line showing your initial guess.
fig, ax = plt.subplots(gchain.shape[1], 1, figsize=(20, gchain.shape[1]*3));
cr.plot_traces(gchain, ax, labels=param_labels, truths=[guess[k] for k in param_names]);
Let's go ahead and transform the samples from $\sigma^2$ to $\sigma$, since that's what we're more interested in. (This is not a trick question.)
# transform gchain[:,1] from sigma^2 to sigma
# gchain[:,1] = ...
The following cells will find the 1D and 2D credible regions, as you've seen before.
fig, ax = plt.subplots(gchain.shape[1], 2, figsize=(9, gchain.shape[1]*3));
gCIs = {}
for i,a in enumerate(ax):
    h = cr.whist(gchain[:,i], plot=a[0]); a[0].set_xlabel(final_labels[i]);
    gCIs[final_names[i]] = cr.whist_ci(h, plot=a[1]);
    a[1].set_xlabel(final_labels[i]);
gCIs
{'z_cl': {'mode': np.float64(1.9781814762624792), 'level': array([0.68268949, 0.95449974]), 'prob': array([0.68523941, 0.95529879]), 'density': array([135.52107032, 29.15794556]), 'min': array([1.97627713, 1.97437278]), 'max': array([1.97973958, 1.98144195]), 'low': array([-0.00190435, -0.0038087 ]), 'high': array([0.0015581 , 0.00326048]), 'center': array([1.97800835, 1.97790737]), 'width': array([0.00173123, 0.00353459])}, 'sigma': {'mode': np.float64(0.009760276883732646), 'level': array([0.68268949, 0.95449974]), 'prob': array([0.6860537 , 0.95454917]), 'density': array([191.74522563, 40.36988888]), 'min': array([0.00854658, 0.00764145]), 'max': array([0.01103569, 0.01272252]), 'low': array([-0.0012137 , -0.00211883]), 'high': array([0.00127541, 0.00296224]), 'center': array([0.00979113, 0.01018199]), 'width': array([0.00124455, 0.00254054])}}
gtri = cr.whist_triangle(gchain, bins=50, smooth2D=1);
cr.whist_triangle_plot(gtri, paramNames=final_labels);
Check goodness of fit¶
Just this once, since the whole point of the tutorial is to get familiar with sampling techniques, we will skip testing the goodness of fit.
Run multiple chains¶
We weren't overly concerned with the starting point for the test chain above. But, for later notebooks, we'll want to see how multiple, independent chains with different starting points behave when using this method. The cell below will take care of running 4 chains, started at random yet broadly reasonable positions.
%%time
nchains = 4
gchains = [np.zeros((nsamples,len(param_names))) for j in range(nchains)]
for samples in gchains:
    # randomly initializing from within the prior is reasonable, unless it's improper
    if np.isfinite(hyperparams['tau0']):
        params = {'z_cl':st.norm.rvs(hyperparams['mu0'], hyperparams['tau0'])}
    else:
        params = {'z_cl':st.uniform.rvs(1.85, 0.3)} # just something to fall back on
    if hyperparams['alpha0'] > 0 and hyperparams['beta0'] > 0:
        params['sigma2'] = st.invgamma.rvs(hyperparams['alpha0'], scale=hyperparams['beta0'])
    else:
        params['sigma2'] = st.chi2.rvs(22) * 5e-6 # just something to fall back on
    for i in range(nsamples):
        update_z_cl(z_gal, params, hyperparams)
        update_sigma2(z_gal, params, hyperparams)
        samples[i,:] = [params[k] for k in param_names]
CPU times: user 5.67 s, sys: 75.9 ms, total: 5.75 s Wall time: 5.71 s
Now we can look at a more colorful version of the trace plots, showing all of the chains simultaneously:
fig, ax = plt.subplots(gchain.shape[1], 1, figsize=(20, gchain.shape[1]*3));
cr.plot_traces(gchains, ax, labels=param_labels, Line2D_kwargs={'markersize':1.0});
Here you might be able to see some extreme values as the chains burn in, before they settle down to sampling the posterior distribution, depending on where the chains were initialized.
Uncomment and run the following cell to save the chains to your data folder. We will use them in a later tutorial.
#for i,samples in enumerate(gchains):
# np.savetxt(datapath+'clredshift_gibbs_'+str(i)+'.txt.gz', samples, header=' '.join(param_names))
I_have_saved_the_Gibbs_chains = False # you should not, in fact, need to change this to True
assert all([file_exists(datapath+'clredshift_gibbs_'+str(i)+'.txt.gz') for i in range(nchains)]) or I_have_saved_the_Gibbs_chains
Solution using Metropolis sampling¶
In the conjugate Gibbs solution, the sampling functions were explicitly tailored to our model definition. That won't be the case here, so the first thing to do is to implement log-prior and log-likelihood functions for this problem, just as you've done previously.
def log_prior(z_cl, sigma2, mu0, tau0, alpha0, beta0):
def log_likelihood(data, z_cl, sigma2, **unused_kwargs):
    # unused_kwargs is there so we can pass hyperparameters without crashing,
    # not that we would/could use them in the likelihood
The log-posterior function is given:
def log_posterior(data, **allparams):
    lnp = log_prior(**allparams)
    if np.isfinite(lnp):
        lnp += log_likelihood(data, **allparams)
    return lnp
As always, let's check that they return finite numbers for reasonable parameter values.
print(log_prior(**guess, **hyperparams))
print(log_likelihood(z_gal, **guess, **hyperparams))
print(log_posterior(z_gal, **guess, **hyperparams))
assert np.isfinite(log_posterior(z_gal, **guess, **hyperparams))
11.155592280710191
106.3168517581422
117.4724440388524
Moving on to the sampler, we next need a proposal distribution. While in principle you can do something fancier here, I'll suggest using a multivariate Gaussian centered on the current position in parameter space. This is translationally invariant, so later on you can use the simple Metropolis acceptance rule instead of the slightly more complex Metropolis-Hastings rule. A Gaussian isn't necessarily the best choice in general, since the most likely proposals are very small steps, but it will do for this demonstration.
Also in the name of keeping it simple, let's make the proposal independent in each parameter (a diagonal covariance matrix for the 2-dimensional Gaussian). Similarly to the grid method, you'll want to guess an appropriate order of magnitude for the step in each parameter, which should be comparable to the width of the posterior; you may need to return to this point and adjust these scales after seeing how the sampler performs.
Since we're assuming a diagonal covariance, let's go ahead and just represent the proposal distribution as 2 univariate Gaussians, as in the dictionary below.
Aside: You may not have seen it before in this class, but calling st.norm() as below produces what scipy calls a "frozen" probability distribution object, with fixed parameters. The standard deviation is whatever you specify via the scale argument, and the (unspecified) mean remains at the default value of 0.0. This means that each entry should be interpreted as a distribution for the displacement of a proposal from the current position of a chain. In other words, the proposal distribution density for a change to $x_0$ could be computed with proposal_distribution['x0'].pdf(x0_proposed - x0_current). Other methods of scipy probability distributions, in particular random number generation, are also available; the main difference from how we've used them before is that we can't/don't need to specify parameter values in the call, because they've already been frozen in.
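To make the aside concrete, here is a toy frozen distribution with an arbitrary width (the 0.1 is purely illustrative, not a suggested step size for this problem; st is the module imported at the top of the notebook):
toy = st.norm(scale=0.1)  # frozen normal: mean fixed at 0.0, standard deviation 0.1
print(toy.rvs())          # a random displacement drawn from this distribution
print(toy.pdf(0.05))      # the proposal density for a displacement of 0.05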
You are not required to follow the advice above, so long as your definitions of proposal_distribution
, propose
and step
are all mutually consistent and functional.
# proposal_distribution = {'z_cl':st.norm(scale=...) ,
# 'sigma2':st.norm(scale=...)}
Next, define a function that returns a proposed point in parameter space, given the current location and the above dictionary of proposal distributions.
Technical note: You might be tempted to begin this function with a command like proposal = current_params
. If so, remember that, in Python, b = a
does not make a copy of a
if a
is a dictionary (or a numpy
array for that matter). Both b
and a
would point to the same data in memory. The safest/quickest way to get a new dictionary with the same structure as a
whose values can then be safely overwritten is with b = a.copy()
.
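Here is a quick illustration of that pitfall, using made-up values:
a = {'z_cl': 1.98, 'sigma2': 1e-4}
b = a                  # b is just another name for the same dictionary object...
b['z_cl'] = 2.0
print(a['z_cl'])       # ...so a has been modified too (prints 2.0)
c = a.copy()           # a genuinely new dictionary with the same keys and values
c['z_cl'] = 1.5
print(a['z_cl'])       # a is unaffected this time (still prints 2.0)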
def propose(current_params, dists):
"""
current_params: dictionary holding current position in parameter space
dists: dictionary of proposal distributions
Return value: a new dictionary holding the proposed destination in parameter space
"""
Let's make sure it does, indeed, propose new parameter values:
print('Test starting position:', guess)
params = propose(guess, proposal_distribution)
print('Test proposal:', params)
print('Difference:', {k:params[k]-guess[k] for k in params.keys()})
Test starting position: {'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}
Test proposal: {'z_cl': np.float64(1.9762804772262919), 'sigma2': np.float64(0.00013992130603124304)}
Difference: {'z_cl': np.float64(-0.0016589167131022542), 'sigma2': np.float64(4.677346397430984e-05)}
Finally, the sampler itself. Write a function that takes the current parameter values and log-posterior value as input (along with the data, proposal distributions and prior hyperparameters), and returns the next set of parameter values and the corresponding log-posterior as a tuple. These should be identical to the inputs if the proposal is rejected.
def step(data, current_params, current_lnP, proposal_dists, hyperparameters):
"""
data: the data
current_params: dictionary of parameter values
current_lnP: log-posterior density corresponding to current_params
proposal_dists: dictionary of proposal distributions
hyperparameters: dictionary of prior hyperparameter values
Return value: a tuple holding the next parameter dictionary and corresponding log-posterior density
"""
# trial_params = ...
# trial_lnP = ...
# if [accept/reject condition]:
# return (trial_params, trial_lnP)
# else:
# return (current_params, current_lnP)
And, again, make sure it works without crashing. We'll essentially just run a short chain to verify that the final position is not the same as the initial one. If it is the same, either some of the functions above are buggy, or the proposal distribution is so poor for this problem that it should be tweaked before moving on. The cell will run 100 iterations, though ideally you would see some movement within the first 10 proposals (printed below).
guess_lnp = log_posterior(z_gal, **guess, **hyperparams)
state = (guess, guess_lnp)
for i in range(100):
    state = step(z_gal, state[0], state[1], proposal_distribution, hyperparams)
    if i < 10: print(state)
assert guess_lnp != state[1]
({'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}, np.float64(117.4724440388524))
({'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}, np.float64(117.4724440388524))
({'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}, np.float64(117.4724440388524))
({'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}, np.float64(117.4724440388524))
({'z_cl': np.float64(1.977939393939394), 'sigma2': np.float64(9.31478420569332e-05)}, np.float64(117.4724440388524))
({'z_cl': np.float64(1.978800373491487), 'sigma2': np.float64(9.673605914742313e-05)}, np.float64(117.29934840393263))
({'z_cl': np.float64(1.978800373491487), 'sigma2': np.float64(9.673605914742313e-05)}, np.float64(117.29934840393263))
({'z_cl': np.float64(1.9788200658260533), 'sigma2': np.float64(7.486598943471853e-05)}, np.float64(117.09868994161292))
({'z_cl': np.float64(1.9788200658260533), 'sigma2': np.float64(7.486598943471853e-05)}, np.float64(117.09868994161292))
({'z_cl': np.float64(1.9788200658260533), 'sigma2': np.float64(7.486598943471853e-05)}, np.float64(117.09868994161292))
/var/folders/94/15_9sbbs5d94bjspzjf0whj00000gn/T/ipykernel_15159/415494179.py:7: RuntimeWarning: invalid value encountered in log return np.where(x<=0, -np.inf, (-alpha-1)*np.log(x) -beta/x)
Assuming all this is working, let's run a chain and store the samples in an array, similar to above.
%%time
mchain = np.zeros((nsamples, len(param_names)))
current_lnP = guess_lnp
params = guess.copy()
for i in range(mchain.shape[0]):
    params,current_lnP = step(z_gal, params, current_lnP, proposal_distribution, hyperparams)
    mchain[i,:] = [params[k] for k in param_names]
/var/folders/94/15_9sbbs5d94bjspzjf0whj00000gn/T/ipykernel_15159/415494179.py:7: RuntimeWarning: invalid value encountered in log return np.where(x<=0, -np.inf, (-alpha-1)*np.log(x) -beta/x)
CPU times: user 4.66 s, sys: 218 ms, total: 4.87 s Wall time: 4.73 s
Once again, we'll look at the traces of each parameter...
fig, ax = plt.subplots(mchain.shape[1], 1, figsize=(20, mchain.shape[1]*3));
cr.plot_traces(mchain, ax, labels=param_labels, truths=[guess[k] for k in param_names]);
# transform sigma^2 to sigma again
# mchain[:,1] ...
... and the various credible intervals/regions.
fig, ax = plt.subplots(mchain.shape[1], 2, figsize=(9, mchain.shape[1]*3));
mCIs = {}
for i,a in enumerate(ax):
    h = cr.whist(mchain[:,i], plot=a[0]); a[0].set_xlabel(final_labels[i]);
    mCIs[final_names[i]] = cr.whist_ci(h, plot=a[1]);
    a[1].set_xlabel(final_labels[i]);
mCIs
{'z_cl': {'mode': np.float64(1.978089775773057), 'level': array([0.68268949, 0.95449974]), 'prob': array([0.68587572, 0.95488122]), 'density': array([137.73195903, 26.79855786]), 'min': array([1.97627197, 1.97433813]), 'max': array([1.97983023, 1.98145465]), 'low': array([-0.00181781, -0.00375164]), 'high': array([0.00174045, 0.00336488]), 'center': array([1.9780511 , 1.97789639]), 'width': array([0.00177913, 0.00355826])}, 'sigma': {'mode': np.float64(0.009728449813512485), 'level': array([0.68268949, 0.95449974]), 'prob': array([0.6872313 , 0.95518594]), 'density': array([187.37014011, 34.04475224]), 'min': array([0.00853257, 0.00768217]), 'max': array([0.0110572 , 0.01310348]), 'low': array([-0.00119588, -0.00204628]), 'high': array([0.00132875, 0.00337503]), 'center': array([0.00979489, 0.01039283]), 'width': array([0.00126232, 0.00271066])}}
mtri = cr.whist_triangle(mchain, bins=50, smooth2D=1);
cr.whist_triangle_plot(mtri, paramNames=final_labels);
Compare with Gibbs¶
You might have noticed that we didn't provide a check of the core functionality of your Metropolis sampler as we did with Gibbs. Don't worry! For this problem, the Gibbs sampler works extremely well, so we can just compare the results from both samplers to verify that the Metropolis code works.
One caveat is that the comparisons below assume that both chains are predominantly sampling from the posterior correctly instead of struggling to even find their way to a place where the posterior is large from the starting position. If you see large, coherent movements taking up most of the traces in either case, it would be a good idea to adjust guess
. Similarly, we need the proposal distribution for Metropolis to be good enough that the chain jumps around at least a bit instead of rejecting a large fraction of proposals. We'll discuss these issues more in MCMC Diagnostics; for now, we just need things to be good enough that the two chains can be reasonably compared.
The bottom line is that the Metropolis chain is likely to be less efficient than the Gibbs one for this problem, so (for the number of samples above) we can't insist that they resemble one another too closely. The cell below will check whether some basic statistics of the chains, namely the "centers" and "widths" of each 1D marginalized posterior as defined by incredible
, agree within 10%.
for p in final_names:
    print(p)
    for n,CIs in zip(['Gibbs:', 'Metro:'], [gCIs, mCIs]):
        print(" ", n, CIs[p]['center'][0], "+/-", CIs[p]['width'][0])
assert np.isclose(mCIs['z_cl']['center'][0], gCIs['z_cl']['center'][0], rtol=1e-1)
assert np.isclose(mCIs['z_cl']['width'][0], gCIs['z_cl']['width'][0], rtol=1e-1)
assert np.isclose(mCIs['sigma']['center'][0], gCIs['sigma']['center'][0], rtol=1e-1)
assert np.isclose(mCIs['sigma']['width'][0], gCIs['sigma']['width'][0], rtol=1e-1)
z_cl
  Gibbs: 1.9780083536215416 +/- 0.0017312264093769514
  Metro: 1.9780510990239304 +/- 0.0017791304598127056
sigma
  Gibbs: 0.009791133586821026 +/- 0.0012445536912313367
  Metro: 0.009794887498168387 +/- 0.0012623160084621306
That 10% tolerance does not require impressive agreement by any means; it's just a threshold we chose so that this automatic check doesn't fail when the Metropolis chain is less than great (compared with what we could do using less basic proposal distributions, etc.). Therefore, also compare the credible regions from the two methods below - ideally, the Metropolis results (red) should just look like a jankier version of the Gibbs results (blue) rather than fundamentally different.
fig,ax = cr.whist_triangle_plot(gtri, paramNames=final_labels, fill2D=False, linecolor1D='b', linecolor2D='b');
cr.whist_triangle_plot(mtri, paramNames=final_labels, axes=ax, fill2D=False, linecolor1D='r', linecolor2D='r', linestyle1D='--', linestyle2D='--');
They_appear_to_agree_well = False # change to True when true
assert They_appear_to_agree_well
Run multiple chains¶
As before, we'll run 4 independent chains for future use.
%%time
nchains = 4
mchains = [np.zeros((nsamples,len(param_names))) for j in range(nchains)]
for samples in mchains:
    # randomly initializing from within the prior is reasonable, unless it's improper
    if np.isfinite(hyperparams['tau0']):
        params = {'z_cl':st.norm.rvs(hyperparams['mu0'], hyperparams['tau0'])}
    else:
        params = {'z_cl':st.uniform.rvs(1.85, 0.3)} # just something to fall back on
    if hyperparams['alpha0'] > 0 and hyperparams['beta0'] > 0:
        params['sigma2'] = st.invgamma.rvs(hyperparams['alpha0'], scale=hyperparams['beta0'])
    else:
        params['sigma2'] = st.chi2.rvs(22) * 5e-6 # just something to fall back on
    current_lnP = log_posterior(z_gal, **params, **hyperparams)
    for i in range(samples.shape[0]):
        params,current_lnP = step(z_gal, params, current_lnP, proposal_distribution, hyperparams)
        samples[i,:] = [params[k] for k in param_names]
/var/folders/94/15_9sbbs5d94bjspzjf0whj00000gn/T/ipykernel_15159/415494179.py:7: RuntimeWarning: invalid value encountered in log return np.where(x<=0, -np.inf, (-alpha-1)*np.log(x) -beta/x)
CPU times: user 17.5 s, sys: 464 ms, total: 18 s Wall time: 17.7 s
Let's see what this version looks like:
fig, ax = plt.subplots(mchain.shape[1], 1, figsize=(20, mchain.shape[1]*3));
cr.plot_traces(mchains, ax, labels=param_labels, Line2D_kwargs={'markersize':1.0});
As before, depending on where the chains were initialized, you might be able to see some extreme values as the chains burn in, before they settle down to sampling the posterior distribution.
Uncomment and run the following cell to save the chains to your data folder. We will use them in a later tutorial.
#for i,samples in enumerate(mchains):
# np.savetxt(datapath+'clredshift_metro_'+str(i)+'.txt.gz', samples, header=' '.join(param_names))
I_have_saved_the_Metro_chains = False # again, you shouldn't actually need to change this
assert all([file_exists(datapath+'clredshift_metro_'+str(i)+'.txt.gz') for i in range(nchains)]) or I_have_saved_the_Metro_chains
Parting thoughts¶
In this notebook, you've hacked together two different MCMC algorithms:
- Conjugate Gibbs sampling is often, though not always, highly efficient when it's applicable. Problems that are fully conjugate are not all that common in astrophysics, although, as you'll see later on, it's sometimes possible to perform conjugate updates of a subset of parameters in a complex model, using other strategies for the remaining parameters.
- In contrast, Metropolis and Metropolis-Hastings are extremely widely applicable and therefore powerful, though one needs to think about how to provide an efficient proposal distribution. We'll see the consequences of not doing so in MCMC Diagnostics and some intelligent proposal strategies in More Sampling Methods.
In future notebooks, you won't need to write your own sampling code, and may not even use either of these specific algorithms. Still, this look under the hood of two sampling methods will hopefully provide some good intuition going forward.