Tutorial: Cluster Membership Fractions of redMaPPer Galaxies¶
Gaussian Mixtures and Mixed Sampling Methods¶
In this notebook, we will reproduce part of, and slightly expand on, the analysis of Myles et al. (2021), using data kindly provided by the first author. In the course of doing so, you will
- design and implement a Gaussian Mixture Model (GMM);
- fit the GMM to a subset of the data using brute-force MCMC;
- perform the same fit using conjugate Gibbs sampling, introducing latent group membership parameters along the way;
- fit the full data set, including hierarchical modeling aspects, using multiple sampling approaches within the same analysis.
# !pip install incredible lmc pygtc
from copy import deepcopy
from os import getcwd
from os.path import exists as file_exists
from yaml import safe_load
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
import incredible as cr
import lmc
from pygtc import plotGTC
thisTutorial = 'cl_membership'
if getcwd() == '/content':
# assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
from google.colab import drive
drive.mount('/content/drive')
datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
# assume we are running locally somewhere and have the data under ./data/
datapath = 'data/'
Background¶
Galaxies that fall into massive accumulations of dark matter and hot gas (these are, inaccurately, known as "clusters of galaxies") rapidly have their gas stripped and cease forming stars. Hence, if one has an optical photometric survey, one way of finding galaxy clusters is to look for... clusters of galaxies that have similar colors, corresponding to red/non-star-forming spectra at some common redshift.
The thing is, along any line of sight there are loads of galaxies of different spectral types with different redshifts, so color is not a reliable way of saying that a particular galaxy is in a cluster. Optical photometric cluster finders can only identify sight-lines where there is probably a cluster, and estimate what its redshift would be. They also provide a "richness", loosely defined as the number of intrinsically red galaxies brighter than some threshold that the cluster is supposed to contain. But, again, we can't really be sure whether any given galaxy is actually a member or not from photometric data alone. RedMaPPer, the main cluster-finder used by the Dark Energy Survey and (soon) Rubin/LSST, addresses this by assigning a membership probability to each galaxy near an identified cluster; the richness is then the sum of these probabilities, such that it should statistically mean something like the number of bright, red galaxies in the cluster.
So far so good. But this approach is only good to the extent that the membership probabilities are accurate - or, at least, similarly accurate as a function of cluster mass. This is worth checking, which we could do using redshifts measured for a large number of potential cluster members from spectroscopy. The redshifts (strictly speaking, the implied velocities) for cluster members should be distributed in a tight Gaussian distribution centered on the redshift of the bright central galaxy (BCG). We would expect to see a wider distribution of redshifts from galaxies that are not in the cluster, despite having similar colors to cluster members.
(Analogously to the photometric color problem, it's technically possible for a galaxy at a different cosmological redshift to have such a large peculiar velocity that its redshift is consistent with the cluster's, but redshifts can be measured so precisely, and the cluster distribution is so narrow, that this should be rare. The model we'll build below will account for it, in any case.)
Our goal, following the paper, will be to measure the fraction of galaxies that redMaPPer assigns to a cluster that are, in fact, not cluster members, as a function of richness. As described more below, we will go a little bit farther by modeling the width of the cluster velocity distribution to be a simple function of richness, which makes physical sense (as both correlate with mass).
Data¶
The data we'll be working with ultimately come from the Sloan Digital Sky Survey. Myles et al. have compiled redshifts from Sloan spectroscopy for potential members of redMaPPer clusters with estimated redshifts of 0.08-0.12 (this is relatively low, but it is the only sample for which archival spectroscopy would be reasonably complete). Various fascinating cuts, which you can read the paper to learn about if you want, have been applied, and the data have been grouped into 6 bins according to the reported richness of the cluster each galaxy is supposed to belong to, with bins 0, 1, ..., 5 corresponding to increasing richness ranges (nominally increasing cluster mass).
For each bin, we simply have a list of values: $x_i = \frac{z_i - z_\mathrm{BCG}}{1 + z_\mathrm{BCG}}$. This quantity is directly related to the velocity implied by the galaxy's redshift, assuming it does belong to the cluster. We will eschew the conversion to velocity units in the tutorial for simplicity, and because it gives us an excuse to use the word eschew.
Below, we read the data for each bin into a corresponding entry of the list `data`. Note that the public solutions use the real data, while, as usual, you have been given mock data produced from a model with potentially different parameters than what we find.
data = [np.loadtxt(datapath+'bin'+str(i)+'.txt.gz') for i in range(5)]
richness = np.array([9., 24., 33., 42., 61.])
To reiterate, the different entries in `data` refer to bins in cluster richness, each containing between a few and a few thousand clusters' worth of galaxies. This is already a bit of a compromise from the data we might like to have, with the galaxies for each cluster kept separate, and each cluster's specific richness known. But, without reproducing a bunch of catalog wrangling, this is what we have. The entries in the `richness` array above correspond to something like the median richness for clusters in a given bin, estimated from plots in the paper. In effect, we will just treat each bin as if it were a single cluster with the specified richness, for the purpose of measuring the corresponding velocity dispersion and fraction of non-members. This simplification is not ideal, but shouldn't be too bad as long as there are a reasonable number of clusters in each bin. (Those of you who do the outside reading might notice that we have removed the richest bin from the published analysis, on the basis that it contains only 7 clusters over a much wider richness range than the other bins.)
Let's have a look at histograms of $x$ for each bin:
plt.rcParams['figure.figsize'] = (18.0, 8.0)
fig, ax = plt.subplots(2, 3);
for i in range(len(data)):
ii = np.unravel_index(i, ax.shape)
ax[ii].hist(data[i], bins=100, label='bin '+str(i));
ax[ii].legend();
ax[ii].set_xlim(-0.1, 0.1);
ax[ii].set_xlabel('x');
ax[1,2].axis(False);
Already we can clearly see what might be a sharp spike consisting of cluster members. The wider distribution of non-members is harder to make out on a linear scale, however, so let's replot the histograms with a log-scaled y axis:
plt.rcParams['figure.figsize'] = (18.0, 8.0)
fig, ax = plt.subplots(2, 3);
for i in range(len(data)):
ii = np.unravel_index(i, ax.shape)
ax[ii].hist(data[i], bins=100, label='bin '+str(i), log=True);
ax[ii].legend();
ax[ii].set_xlim(-0.1, 0.1);
ax[ii].set_xlabel('x');
ax[1,2].axis(False);
Recall that the log of a Gaussian is an upside-down parabola - it already looks like a couple of these might describe the data reasonably.
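Explicitly, $\ln \mathrm{Normal}(x|\mu,\sigma) = -\frac{(x-\mu)^2}{2\sigma^2} - \ln\left(\sigma\sqrt{2\pi}\right)$, a downward-opening parabola in $x$ whose width is set by $\sigma$, so a narrow component plus a broad component is a plausible reading of the shapes above.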
Model¶
So, what model will we fit to these distributions of $x$ values? The distribution for cluster members is thermodynamically expected to be a Gaussian, but it isn't obvious what we should expect for the non-members - it would depend on the cosmic large scale structure as well as the redMaPPer selection of these potential members in the first place. On the other hand, the wider distributions above look approximately Gaussian, and we can always build a model for them out of multiple Gaussians if need be. In other words, this might be a good place to use a Gaussian mixture model, describing the data for a single bin as
$p(x) = (1-f_\mathrm{b})\cdot\mathrm{Normal}(x|0,\sigma_\mathrm{cl}) + f_\mathrm{b}\cdot\mathrm{Normal}(x|\mu_\mathrm{b},\sigma_\mathrm{b})$.
Here, we've written the simplest mixture we might try, with 1 Gaussian component for cluster members (cl) and just 1 Gaussian component for the background (b). We've also baked in the expectation that the cluster component has a mean of zero (i.e., is centered on the BCG's $x$ value). The parameter $f_\mathrm{b}$ is defined in the sense of being the a priori probability (i.e. the fraction) of redMaPPer galaxies that are actually in the background component. As mentioned above, this could be generalized to include multiple Gaussians for the background, but it doesn't really seem to be necessary in practice, and our lives will be simpler with a mixture of just 2.
If we (perhaps unfairly) neglect measurement uncertainty in the redshift measurements, then our sampling distribution for the data in a given bin will resemble the expression above. Each bin is allowed to have a different value of $f_\mathrm{b}$ and $\sigma_\mathrm{cl}$, but we will follow Myles et al. by assuming that a single background distribution applies to all bins. (If you want, you can imagine that we're doing a follow-up study, with the appropriateness of that assumption already being established.) To be explicit, then, we have
$p(x_{ij}) = (1-f_{\mathrm{b},i})\cdot\mathrm{Normal}(x_{ij}|0,\sigma_{\mathrm{cl},i}) + f_{\mathrm{b},i}\cdot\mathrm{Normal}(x_{ij}|\mu_\mathrm{b},\sigma_\mathrm{b})$,
where $i$ indexes bins and $j$ indexes galaxies within each bin.
To begin with, we will fit the data for a single bin, independent of the others, in order to get key parts of our analysis up and running. After that, we will fit all the bins together, hierarchically modeling the cluster velocity dispersions as a power law,
$\sigma_{\mathrm{cl},i} = A_\sigma \, \left(\frac{\lambda_i}{10}\right)^{\alpha_\sigma}$,
where $\lambda_i$ is the median richness for the $i$th bin. We chose to normalize the power law at richness 10 because it's (a) a round number, and (b) close to the richness of the lowest bin, which has the largest number of clusters (and galaxies) contributing to it. Based on the published analysis, this $\sigma_{\mathrm{cl}}$ will be much better constrained than the others, so normalizing the function here should make the degeneracy between $A_\sigma$ and $\alpha_\sigma$ relatively small.
The full list of model parameters is thus, for $i=0,1,\ldots,4$,
- $f_{\mathrm{b},i}$: the fraction of redMaPPer-selected galaxies that are not actually in the associated cluster, and
- $\sigma_{\mathrm{cl},i}$: the standard deviation of the cluster Gaussian over $x$ in bin $i$,
as well as global parameters
- $A_\sigma$: normalization of the scaling relation for $\sigma_{\mathrm{cl}}$,
- $\alpha_\sigma$: slope of the scaling relation for $\sigma_{\mathrm{cl}}$,
- $\mu_\mathrm{b}$: the mean of the background Gaussian over $x$, and
- $\sigma_\mathrm{b}$: the standard deviation of the background Gaussian over $x$.
As usual, we will collectively decide on priors in class; because we'll use conjugate Gibbs sampling below, some of these will need to be compatible with conjugate distributions. Also as usual, your next task is to draw the PGMs and write out the conditional probability expressions that define the models. Don't forget that there are two cases: a single-bin analysis, and a hierarchical analysis of all the bins.
Space for your generative model
In addition, we will discuss and decide on a strategy for randomly initializing the Markov chains that we run in this notebook (i.e. what distributions to draw each parameter from). This could be as simple as "from the prior", unless, of course, the prior distribution is improper.
Notes on what we decided
Visualizing the model¶
Before we launch into things, it will be useful to have a way to quickly compare a model with the data. We can directly plot $p(x)$ over the histograms above, but the tails, where there are 0 or 1 measured values in a histogram bin, are tricky to judge. Therefore, we will look at comparisons of both the model PDF and CDF, the latter comparing to the empirical CDF of the data.
Below, fill in functions to evaluate $p(x)$ and its CDF.
def p_of_x(x, fb, sc, mb, sb):
"""
Return the sampling distribution PDF, p(x), for a single bin, modeled by the given background fraction (fb), cluster standard
deviation (sc), and background mean and standard deviation (mb, sb).
Should work if `x` is an array.
"""
# YOUR CODE HERE
raise NotImplementedError()
def cdf_of_x(x, fb, sc, mb, sb):
"""
Return the sampling distribution CDF for a single bin, modeled by the given background fraction (fb), cluster standard
deviation (sc), and background mean and standard deviation (mb, sb).
Should work if `x` is an array.
"""
# YOUR CODE HERE
raise NotImplementedError()
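If you want a cross-check, here is a minimal sketch of the kind of thing these functions need to do, written directly from the mixture expression above using `scipy.stats`. The `_sketch` names are just illustrative, and your own implementations may of course differ in style.

def p_of_x_sketch(x, fb, sc, mb, sb):
    # fb-weighted mixture of the cluster (mean 0, width sc) and background (mean mb, width sb) Gaussians
    return (1-fb)*st.norm.pdf(x, loc=0.0, scale=sc) + fb*st.norm.pdf(x, loc=mb, scale=sb)

def cdf_of_x_sketch(x, fb, sc, mb, sb):
    # the CDF of a mixture is the same weighted combination of the component CDFs
    return (1-fb)*st.norm.cdf(x, loc=0.0, scale=sc) + fb*st.norm.cdf(x, loc=mb, scale=sb)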
Here's a function to make the described comparison plots:
def compare(x, *args, **kwargs):
sortx = np.sort(x)
xx = np.linspace(sortx[0], sortx[-1], 1000)
plt.rcParams['figure.figsize'] = (10.0, 3.0)
fig, ax = plt.subplots(1, 2);
ax[0].hist(sortx, bins=100, log=True, density=True, label='data');
ax[0].plot(xx, p_of_x(xx, *args, **kwargs), label='model');
ax[0].set_ylim(1e-1);
ax[0].legend(); ax[0].set_xlabel('x'); ax[0].set_ylabel('p(x)');
ax[1].plot(sortx, np.arange(len(x))/len(x), '.', label='data');
ax[1].plot(xx, cdf_of_x(xx, *args, **kwargs), label='model');
ax[1].legend(); ax[1].set_xlabel('x'); ax[1].set_ylabel('CDF(x)');
Next, iteratively fill in guessed parameter values below until you have something that looks like a broadly reasonable description of the bin 1 data. You could, of course, repeat this for all the bins if you wanted to.
# guess = {'fb':..., 'sc':..., 'mb':..., 'sb':...}
# YOUR CODE HERE
raise NotImplementedError()
compare(data[1], **guess)
Fitting a single bin¶
Before launching into anything too complex, we will first play with fitting the data from a single bin (specifically, $i=1$). We'll first use a standard MCMC approach, and then use conjugate Gibbs sampling. This is not just to demonstrate again that in certain circumstances conjugate Gibbs can be much more efficient, but because the more complex analysis you'll do later is one where using both sampling methods together is actually a nice option. Getting each of them working separately on a subset of the data will make that final leap more straightforward.
Below is the enumeration of parameters that we'll use in this section.
paramnames = ['fb1', 'sigma_cl1', 'mu_b', 'sigma_b']
param_labels = [r'$f_\mathrm{b,1}$', r'$\sigma_\mathrm{cl,1}$', r'$\mu_\mathrm{b}$', r'$\sigma_\mathrm{b}$']
Brute-force MCMC¶
With the above plan in mind, we will be using the `lmc` package in this part, as it can be relatively easily modified to accommodate mixed Gibbs and non-Gibbs updates later on. `lmc` has the peculiar feature that the log-posterior function we provide it takes a completely arbitrary argument of our choice. It could be a vector of parameter values, or the data, or a structure containing both, or `None`. In this case, let's make that argument the data vector for a single bin; we'll only work with bin 1 in this section, but in principle you could plug in any of them using the same code. This means that we will be getting the current parameter values within the function from global scope, which is simultaneously terrible coding style and entirely functional. We take full responsibility for this, and therefore provide the corresponding part of the function for you below.
def lnpost(x):
"""
`x` is the data vector for a single bin
Return the log-posterior (don't forget the prior!)
"""
f_b1,sigma_cl1,mu_b,sigma_b = [p() for p in parameter_space] # parameter_space is global, to be defined below
# YOUR CODE HERE
raise NotImplementedError()
Here we define the `lmc.Parameter` objects that the fitter will use; note that the "call" `()` syntax used above is the approved way of extracting the current value of one of these. The `ParameterSpace` is also an `lmc` thing, although the dictionary below is just for our convenience. (Recall that, since this is Python, exactly the same `Parameter` objects are referenced from both `parameter_space` and `parameter_dict`; one does not hold copies of the other.)
parameter_space = lmc.ParameterSpace([lmc.Parameter(name=s) for s in paramnames], lnpost)
parameter_dict = {k:p for k,p in zip(paramnames, parameter_space)}
This will set the value of each `Parameter` to your guess from above, and test that the `lnpost` function returns a finite number:
for k1,k2 in zip(['fb', 'sc', 'mb', 'sb'], paramnames):
parameter_dict[k2].set(guess[k1])
print(lnpost(data[1]))
assert np.isfinite(lnpost(data[1]))
The last thing that would be nice to have is a function that randomly initializes the starting point of our MCMC, following the decision we made in class together. The prototype below shows how to do this in practice. You will also need to set a `width` attribute for each `Parameter`, which serves as an initial guess for the scale of the proposal distribution (this is adapted as the chain runs, but it's good to start with something reasonable). Your function should never return parameters that have zero prior probability.
# def initialize_params():
# parameter_dict['fb1'].set(...)
# parameter_dict['sigma_cl1'].set(...)
# parameter_dict['mu_b'].set(...)
# parameter_dict['sigma_b'].set(...)
# parameter_dict['fb1'].width = ...
# parameter_dict['sigma_cl1'].width = ...
# parameter_dict['mu_b'].width = ...
# parameter_dict['sigma_b'].width = ...
# YOUR CODE HERE
raise NotImplementedError()
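In case it's helpful, here is a purely illustrative sketch of such an initializer. The distributions and numerical ranges below are hypothetical (chosen only to be broadly compatible with the $x$ ranges in the histograms above), not necessarily the priors we agreed on, and the `_sketch` name is just to distinguish it from your own function.

def initialize_params_sketch():
    # draw starting values from broad ranges (hypothetical; replace with draws consistent with our priors)
    parameter_dict['fb1'].set(st.uniform.rvs(0.1, 0.8))            # background fraction somewhere in (0.1, 0.9)
    parameter_dict['sigma_cl1'].set(st.uniform.rvs(0.001, 0.009))  # narrow cluster component
    parameter_dict['mu_b'].set(st.uniform.rvs(-0.01, 0.02))        # background mean near zero
    parameter_dict['sigma_b'].set(st.uniform.rvs(0.01, 0.04))      # background much wider than the cluster
    # rough proposal scales for each parameter
    parameter_dict['fb1'].width = 0.05
    parameter_dict['sigma_cl1'].width = 0.001
    parameter_dict['mu_b'].width = 0.005
    parameter_dict['sigma_b'].width = 0.005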
Let's just verify that this function works:
initialize_params()
print([p() for p in parameter_space])
assert np.isfinite( lnpost(data[1]) )
Hint: If the assertion fails, do make sure you're taking a sum of logs rather than a log of a product.
While it doesn't turn out to be an issue in this case, the hint above gets to a weakness of brute-force posterior calculations when using mixture models. Namely, because part of the posterior involves a sum of PDFs, we do actually have to evaluate those PDFs, rather than exclusively working with log-PDFs. This opens a door to numerical over/underflows that we might not need to worry about otherwise.
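If over/underflow ever does bite, one standard trick is to work with the component log-PDFs and combine them with `scipy.special.logsumexp`, which never forms the (possibly underflowing) PDF values explicitly. A sketch of the idea, not required for this notebook:

from scipy.special import logsumexp

def ln_mixture_pdf(x, fb, sc, mb, sb):
    # log[ (1-fb)*N(x|0,sc) + fb*N(x|mb,sb) ], evaluated entirely in log space
    lnp_cl = np.log(1.0-fb) + st.norm.logpdf(x, loc=0.0, scale=sc)
    lnp_b = np.log(fb) + st.norm.logpdf(x, loc=mb, scale=sb)
    return logsumexp([lnp_cl, lnp_b], axis=0)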
Ok, finally we are ready to run chains. We'll just run several in series, since `lmc`'s parallelization is not really compatible with being in a notebook. We strongly recommend that you test this out first with a single, short chain, before re-running with settings more like those given below. As always, there will be diagnostic tests to pass afterward.

If you're wondering, the `lmc` defaults here employ adaptive slice sampling, so you shouldn't see the perfect repetition of samples that Metropolis produces, although each sample takes slightly longer to generate.
%%time
nchains = 4
nsamples = 3000
chains = [lmc.dictBackend() for i in range(nchains)]
for chain in chains:
initialize_params()
vehicle = lmc.Vehicle(parameter_space, backend=chain, adapt_every=100, adapt_starting=200)
vehicle(1, nsamples, struct=data[1], scatter=False)
Next we just reformat the results so that `incredible` can digest them. Since `lmc` automatically kept track of the log-posterior for us (`_LOGP_`), we may as well show it along with the other traces.
chains = np.array([np.array([chain[p] for p in paramnames+['_LOGP_']]).T for chain in chains])
fig, ax = plt.subplots(len(paramnames)+1, 1, figsize=(16.0, 2.0*len(paramnames)+2));
cr.plot_traces(chains, ax, labels=param_labels+['_LOGP_']);
Choose a burn-in length that seems about right. This can be iterated with the following cell, which re-displays the traces without the burn-in.
# burn =
# YOUR CODE HERE
raise NotImplementedError()
fig, ax = plt.subplots(len(paramnames)+1, 1, figsize=(16.0, 2.0*len(paramnames)+2));
cr.plot_traces(chains[:,burn:,:], ax, labels=param_labels+['_LOGP_']);
Next we will throw away the burn-in portion defined above (for good!) and test for convergence and length.
chains = chains[:,burn:,:len(paramnames)] # ditch the _LOGP_ column, too
R = cr.GelmanRubinR(chains)
print("R:", R)
assert np.all(R < 1.1)
maxlag = 500 # may need to adjust this
neff = cr.effective_samples(chains, maxlag=maxlag, throw=True)
print("neff:", neff)
assert np.all(neff >= 100) # we'll get better chains from Gibbs, so these don't have to be great
For completeness, let us gaze upon a triangle.
plotGTC(np.concatenate(chains), paramNames=param_labels,
figureSize=6, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});
Conjugate Gibbs¶
Hopefully it won't come as a shock to learn that a model composed entirely of Gaussians (or any standard distribution, really) lends itself to conjugacy. In fact, you will be able to re-use functions from the MCMC Sampling tutorial to determine the conditional posterior distributions for the mean and variance of a Gaussian component, and the generalization from 1 Gaussian to a Gaussian mixture is not too much more involved.
On the other hand, exactly how one comes up with a conjugacy relation for a sum of Gaussians may not be obvious. The trick is... we don't need to! Instead, we can get around the question entirely by introducing a set of latent parameters, $g$, where $g_{ij}$ tells us whether galaxy $ij$ belongs to the cluster distribution for bin $i$ ($g_{ij}=0$) or the background distribution ($g_{ij}=1$). You could think of this as re-writing our model PDF as
$p(x_{ij}|g_{ij}) = (1-g_{ij})\cdot\mathrm{Normal}(x_{ij}|0,\sigma_{\mathrm{cl},i}) + g_{ij}\cdot\mathrm{Normal}(x_{ij}|\mu_\mathrm{b},\sigma_\mathrm{b})$.
The $g_{ij}$ parameters appear in exactly the same way as $f_{\mathrm{b},i}$ did, but since $g_{ij}$ is either 0 or 1 for a given galaxy $j$, $p(x_{ij}|g_{ij})$ is always just a single Gaussian. Of course, $f_{\mathrm{b},i}$ is still part of the model, and will play a role in determining the $g_{ij}$ - after all, $f_{\mathrm{b},i}$ is exactly the a priori probability that any $g_{ij}=1$.
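In other words, conditioning on the latent assignments makes the joint density of each datum and its flag a product of simple factors, $p(x_{ij}, g_{ij}|\ldots) = \left[f_{\mathrm{b},i}\,\mathrm{Normal}(x_{ij}|\mu_\mathrm{b},\sigma_\mathrm{b})\right]^{g_{ij}} \left[(1-f_{\mathrm{b},i})\,\mathrm{Normal}(x_{ij}|0,\sigma_{\mathrm{cl},i})\right]^{1-g_{ij}}$, which is exactly the kind of structure that conjugate updates can exploit.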
We'll fill in the details in a moment. But first, it will be very helpful to modify your PGM and expressions to explicitly show the latent $g_{ij}$ parameters. If you're confused about how to connect them to everything else, remember... think generatively! (And then re-read the final clause of the previous paragraph.)
Space for yet another PGM
Conjugate updates¶
Recall that, in conjugate Gibbs sampling, we update one parameter (or bunch of related parameters) at a time, with all the others fixed. The advantage of introducing the group membership parameters is that, at any given moment in our sampling, each galaxy's $x$ is drawn from exactly one Gaussian. Conversely, if we want to update the parameters of a Gaussian, we can use the update rules you worked out in the MCMC Sampling tutorial, as long as only the data for galaxies with the corresponding $g$ value are included in the calculation. Note that we will need to specify hyperparameters of the conjugate prior distributions that give us equivalent priors to what we decided above.
First, let's just paste in the functions you wrote for the MCMC Sampling tutorial that compute the parameters of the conditional posterior distributions for a Gaussian mean and variance. Please note one change of notation: we previously called the Gaussian mean `z_cl`, since it made sense in the context of the earlier notebook. Here, let's just call it `mu`, to be general. Also be careful to keep standard deviations vs variances straight.
# paste your functions from the "clredshift" notebook, but note the z_cl -> mu change of notation
def condpost_Gaussian_mean(data, par, hypar):
"""
`data` is a 1D array of all the data the Gaussian is supposed to describe
`par` is a dictionary {'mu', 'sigma2'} of the Gaussian's current parameters (sigma2 is the variance)
`hypar` is a dictionary {'mu0', 'tau0', 'alpha0', 'beta0'} of the hyperparameters (Gaussian mean and *STANDARD DEVIATION* for mu,
inverse-gamma shape and rate for sigma2)
return a tuple (mean, stdev) encoding the normal distribution from which the new mean (mu) should be drawn
"""
# YOUR CODE HERE
raise NotImplementedError()
def condpost_Gaussian_variance(data, par, hypar):
"""
`data` is a 1D array of all the data the Gaussian is supposed to describe
`par` is a dictionary {'mu', 'sigma2'} of the Gaussian's current parameters (sigma2 is the variance)
`hypar` is a dictionary {'mu0', 'tau0', 'alpha0', 'beta0'} of the hyperparameters (Gaussian mean and *STANDARD DEVIATION* for mu,
inverse-gamma shape and rate for sigma2)
return a tuple (alpha, beta) encoding the inverse-gamma distribution from which the new variance (sigma2) should be drawn
"""
# YOUR CODE HERE
raise NotImplementedError()
Even though these are cut/pasted, let's test them against known solutions just to be sure:
condpost_test = safe_load(open(datapath+'test_condpost.yaml', 'r').read())
assert np.allclose(condpost_Gaussian_mean(np.array(condpost_test['data']), condpost_test['testpar'], condpost_test['testhypar']), condpost_test['mu'])
assert np.allclose(condpost_Gaussian_variance(np.array(condpost_test['data']), condpost_test['testpar'], condpost_test['testhypar']), condpost_test['sigma2'])
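For reference, with the hyperparameter conventions in the docstrings above (a Normal prior on `mu` with mean `mu0` and standard deviation `tau0`, and an inverse-gamma prior on `sigma2` with shape `alpha0` and rate `beta0`), the standard conjugate updates look something like the sketch below; your pasted functions may be written differently, but should be mathematically equivalent. The `_sketch` names are just illustrative.

def condpost_Gaussian_mean_sketch(data, par, hypar):
    # Normal prior N(mu0, tau0^2) on mu, with the variance sigma2 held fixed
    n = len(data)
    post_precision = 1.0/hypar['tau0']**2 + n/par['sigma2']
    post_mean = (hypar['mu0']/hypar['tau0']**2 + np.sum(data)/par['sigma2']) / post_precision
    return post_mean, np.sqrt(1.0/post_precision)

def condpost_Gaussian_variance_sketch(data, par, hypar):
    # inverse-gamma prior (shape alpha0, rate beta0) on sigma2, with the mean mu held fixed
    alpha = hypar['alpha0'] + 0.5*len(data)
    beta = hypar['beta0'] + 0.5*np.sum((data - par['mu'])**2)
    return alpha, beta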
Of course, we will also need rules for updating all of the $g_{ij}$ and $f_{\mathrm{b},i}$ parameters.
Updating $g$¶
To see how to update each of the $g_{ij}$, write down the terms in the posterior where it appears, based on your PGM and equivalent expressions. (Remember, it's the fully conditional posterior that we need to work with. All other parameters are fixed, and so any term not involving $g_{ij}$ can be ignored, as long as we remember to normalize at the end.) Hopefully without giving away too much, you should be left with something like
$p(g_{ij}|\ldots) \propto p(g_{ij}|f_{\mathrm{b},i}) \, p(x_{ij}|g_{ij},\ldots)$.
Since $g_{ij}$ can only take the values 0 and 1, we simply need to evaluate the conditional posterior at both values, normalize them, and then randomly choose a new $g_{ij}$ based on those 2 probabilities.
Consistent with the functions above, complete the following function to return, for each galaxy in a particular bin, the parameter of its conditional posterior: the probability of being in the background, $p(g_{ij}=1|\ldots)$.
def condpost_group_membership(x, f1, par0, par1):
"""
`x` is a 1D array of all the data to be assigned to groups 0 or 1
`f1` is the a priori probability of belonging to group 1
`par0` is a dictionary {'mu', 'sigma2'} of parameters describing the group 0 Gaussian PDF
`par1` is a dictionary {'mu', 'sigma2'} of parameters describing the group 1 Gaussian PDF
return an array (same shape as data) holding the conditional posterior probability for each datum to belong to group 1
"""
# YOUR CODE HERE
raise NotImplementedError()
Here is a quick test that it's working:
assert np.allclose(condpost_group_membership(np.array(condpost_test['data']), condpost_test['testpar']['f'], condpost_test['testpar'], condpost_test['testpar2']), condpost_test['g'])
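In case the bookkeeping is unclear, the calculation is just Bayes' theorem over the two possible values of $g_{ij}$. A sketch using the same dictionary conventions (again with an illustrative `_sketch` name):

def condpost_group_membership_sketch(x, f1, par0, par1):
    # unnormalized weights for g=0 (cluster) and g=1 (background), then normalize
    w0 = (1.0-f1) * st.norm.pdf(x, loc=par0['mu'], scale=np.sqrt(par0['sigma2']))
    w1 = f1 * st.norm.pdf(x, loc=par1['mu'], scale=np.sqrt(par1['sigma2']))
    return w1 / (w0 + w1)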
Updating $f$¶
Following the same approach as above, the conditional posterior for $f_{\mathrm{b},i}$ looks something like
$p(f_{\mathrm{b},i}|\ldots) \propto p(f_{\mathrm{b},i}) \, \prod_j p(g_{ij}|f_{\mathrm{b},i}) = p(f_{\mathrm{b},i}) \, f_{\mathrm{b},i}^{n_{i,1}} (1-f_{\mathrm{b},i})^{n_{i,0}}$,
where $n_{i,k}$ is the number of $g_{ij}$ equal to $k$. These last 2 factors you might recognize as being proportional to the binomial distribution, $p(n|q;N)={N \choose n} q^n (1-q)^{N-n}$. You've already solved this particular conjugacy problem in the Bayes' Law tutorial, but consult Wikipedia or work it out again if in doubt (spoiler: the conjugate distribution is Beta).
Space to note down the Binomial-Beta conjugate update rule
Implement the corresponding calculation of the conditional posterior parameters ($\alpha$ and $\beta$ of the Beta distribution) for one of the $f_{\mathrm{b},i}$.
def condpost_fraction(groups, hypar):
"""
`groups` is a 1D array of group assignments for all data (0 or 1) that this fraction applies to
`hypar` is a dictionary {'alpha', 'beta'} of the Beta distribution hyperparameters
return a tuple (alpha, beta) encoding the Beta distribution from which the new fraction for group 1 should be drawn
"""
# YOUR CODE HERE
raise NotImplementedError()
And a test:
assert np.allclose(condpost_fraction(np.array(condpost_test['testpar']['g']), condpost_test['testhypar']), condpost_test['f'])
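For reference, the Beta-Binomial conjugacy boils down to adding the group counts to the prior hyperparameters; a sketch with the same conventions (illustrative `_sketch` name):

def condpost_fraction_sketch(groups, hypar):
    # Beta(alpha, beta) prior plus n1 galaxies with g=1 and n0 with g=0
    n1 = np.sum(groups == 1)
    n0 = np.sum(groups == 0)
    return hypar['alpha'] + n1, hypar['beta'] + n0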
Implementation¶
Now that the individual conditional posterior calculations are done, let's put together the infrastructure to update the model parameters. We'll follow the convention we've mostly used in this course and organize the parameters into a dictionary. However, in the context of a mixture model, it makes sense to complicate things slightly by including "sub-dictionaries" for the mixture components. We think the end result is not too confusing. Consider the following, which encodes the parameters you guessed for the fit to bin 1 in the previous section:
guess2 = {'fb1':guess['fb'], 'gauss1':{'mu':0.0, 'sigma2':guess['sc']**2}, 'gaussb':{'mu':guess['mb'], 'sigma2':guess['sb']**2}}
Note that the parameters for the cluster-member Gaussian (which we assume has a mean of zero) and the background galaxy Gaussian are stored in `guess2['gauss1']` and `guess2['gaussb']`, respectively, each holding its own `mu` and `sigma2`.
What about the latent $g$ parameters that we introduced in this section? Those will need to appear in the parameter dictionary also, but we don't actually need to guess values for them (or come up with starting values for an MCMC) if we decide that $g$ will be the first set of parameters that get updated in the first iteration of the fit. This is because its fully conditional posterior doesn't depend on the current value of $g$, just on the other parameters. (We could choose any of the parameters to be exempt from choosing a starting value in this way, but $g$ is easily the most convenient in this case because there are so many of them.)
It is, however, fun to see what mixture assignments follow from our guessed values of the other parameters. So, let's go ahead and draw $g$ from its conditional posterior
guess2['g'] = st.bernoulli.rvs( condpost_group_membership(data[1], guess2['fb1'], guess2['gauss1'], guess2['gaussb']) )
and have a look at the histograms of $x$ for the galaxies belonging to the 2 mixture components:
plt.rcParams['figure.figsize'] = (6., 4.)
plt.hist(data[1][guess2['g']==1], label='background', color='C1');
plt.hist(data[1][guess2['g']==0], label='clusters', color='C0');
plt.legend();
print('Fraction of g in background component:', (guess2['g']==1).sum()/len(guess2['g']))
If your guess is at all reasonable, and the meaning of the various parameters hasn't gotten confused along the way, you should see a narrow spike of cluster galaxies and a much broader distribution of background galaxies. The fraction of galaxies with $g_{1j}=1$ printed above should also be similar to your guess for $f_{\mathrm{b,1}}$, although they need not be identical.
Finally, it's time to encode the priors we agreed on with specific hyperparameter values. Fill in the dictionary below, referring back to the conditional posterior functions if you've forgotten the meaning of any entries:
# hyperparams = {'fb1':{'alpha':..., 'beta':...},
# 'gauss1':{'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...},
# 'gaussb':{'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...}}
# YOUR CODE HERE
raise NotImplementedError()
Ok! With that, we can write a function that updates a parameter dictionary in place, as we did in the MCMC Sampling notebook. This is given below, but make sure you understand what is happening and why.
def gibbs_iteration2(x, par, hypar):
# x is the data for a single bin
# We could randomize the order of these updates if we cared to, but it's convenient to not have to initialize g beforehand.
# Sample group assignments
p = condpost_group_membership(x, par['fb1'], par['gauss1'], par['gaussb'])
par['g'] = st.bernoulli.rvs(p)
# Sample the background fraction parameter
alpha,beta = condpost_fraction(par['g'], hypar['fb1'])
par['fb1'] = st.beta.rvs(alpha, beta)
# Sample the variance of the cluster-member Gaussian component (mean is assumed fixed)
xcl = x[par['g'] == 0]
if len(xcl) > 0: # because you never know
alpha,beta = condpost_Gaussian_variance(xcl, par['gauss1'], hypar['gauss1'])
par['gauss1']['sigma2'] = st.invgamma.rvs(alpha, scale=beta)
# Sample the mean and variance of the background Gaussian component
xb = x[par['g'] == 1]
if len(xb) > 0: # because, seriously, you never know
alpha,beta = condpost_Gaussian_variance(xb, par['gaussb'], hypar['gaussb'])
par['gaussb']['sigma2'] = st.invgamma.rvs(alpha, scale=beta)
mean,sd = condpost_Gaussian_mean(xb, par['gaussb'], hypar['gaussb'])
par['gaussb']['mu'] = st.norm.rvs(mean, sd)
Let's verify that this does, in fact, update the parameters to allowed values (compare with your guess to make sure they changed):
params = deepcopy(guess2)
gibbs_iteration2(data[1], params, hyperparams)
params
Running the MCMC¶
That should be everything we need to run some chains. Below we just borrow the function you wrote to randomly initialize a chain, and then call `gibbs_iteration2`, storing the parameter values in an array.
%%time
nchains = 4
nsamples = 3000
chains2 = np.empty((nchains, nsamples, len(paramnames)))
for i in range(chains2.shape[0]):
initialize_params() # puts initial values in parameter_dict
params = {'fb1':parameter_dict['fb1'](),
'gauss1':{'mu':0.0, 'sigma2':parameter_dict['sigma_cl1']()**2},
'gaussb':{'mu':parameter_dict['mu_b'](), 'sigma2':parameter_dict['sigma_b']()**2}
} # assumes params['g'] gets updated first, so we don't need to give it a starting value here
for j in range(chains2.shape[1]):
gibbs_iteration2(data[1], params, hyperparams)
chains2[i,j,:] = [params['fb1'], np.sqrt(params['gauss1']['sigma2']), params['gaussb']['mu'], np.sqrt(params['gaussb']['sigma2'])]
Go through the usual diagnostics below. The burn-in will likely be quite short, and you should see something like a factor of 10 more effectively independent samples compared with the brute-force MCMC of the same length.
fig, ax = plt.subplots(len(paramnames), 1, figsize=(16.0, 2.0*len(paramnames)));
cr.plot_traces(chains2, ax, labels=param_labels);
# burn2 = ...
# YOUR CODE HERE
raise NotImplementedError()
fig, ax = plt.subplots(len(paramnames), 1, figsize=(16.0, 2.0*len(paramnames)));
cr.plot_traces(chains2[:,burn2:,:], ax, labels=param_labels);
chains2 = chains2[:,burn2:,:]
R = cr.GelmanRubinR(chains2)
print("R =", R)
assert np.all(R < 1.1)
maxlag2 = 500 # change if needed
neff = cr.effective_samples(chains2, maxlag=maxlag2, throw=True)
print("neff =", neff)
assert np.all(neff > 1000) # should be easily exceeded
Let's see how these results compare with the brute-force MCMC. They should provide the same constraints, modulo the different number of effective samples.
plotGTC([np.concatenate(chains), np.concatenate(chains2)], paramNames=param_labels, chainLabels=['non-Gibbs','Gibbs'],
figureSize=6, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});
Just to make sure, here's a quick check that the means and standard deviations of each parameter are within 10%:
assert np.allclose(np.concatenate(chains).mean(axis=0), np.concatenate(chains2).mean(axis=0), rtol=0.1, atol=1e-4)
assert np.allclose(np.concatenate(chains).std(axis=0), np.concatenate(chains2).std(axis=0), rtol=0.1, atol=1e-4)
Fitting all bins with linked backgrounds¶
As foreshadowed, we will eventually fit all the data hierarchically, with a shared background Gaussian component and cluster Gaussian widths related by a power law with richness. Before that, it would be nice to do a fit that doesn't assume the power law, but instead lets each bin have its own cluster-member Gaussian. We can compare the final constraints on the power law with these individual $\sigma_\mathrm{cl}$ constraints to see whether the assumption of that model is justified (and/or you could do a more formal model comparison, if you were so inclined).
The fit in this section can be almost entirely accomplished by reusing the work you've already done and just adding a little bookkeeping. To reiterate, we now have an $f_\mathrm{b}$ and a $\sigma_\mathrm{cl}$ parameter for each bin (with no relationship between the $\sigma_\mathrm{cl}$'s assumed yet), plus a single background distribution parametrized by $\mu_\mathrm{b}$ and $\sigma_\mathrm{b}$.
paramnames3 = ['fb0', 'fb1', 'fb2', 'fb3', 'fb4', 'sigma_cl0', 'sigma_cl1', 'sigma_cl2', 'sigma_cl3', 'sigma_cl4',
'mu_b', 'sigma_b']
param_labels3 = [r'$f_\mathrm{b,0}$', r'$f_\mathrm{b,1}$', r'$f_\mathrm{b,2}$', r'$f_\mathrm{b,3}$', r'$f_\mathrm{b,4}$',
r'$\sigma_\mathrm{cl,0}$',r'$\sigma_\mathrm{cl,1}$',r'$\sigma_\mathrm{cl,2}$',r'$\sigma_\mathrm{cl,3}$',r'$\sigma_\mathrm{cl,4}$',
r'$\mu_\mathrm{b}$', r'$\sigma_\mathrm{b}$']
Fill in the hyperparameter dictionary for this part. The priors for each of the $\sigma_\mathrm{cl}$ parameters can be the same as we used for $\sigma_\mathrm{cl,1}$ in the previous sections, but for the others you should use whatever priors we decided on for the full hierarchical model.
# hyperparams3 = {'fb0':{'alpha':..., 'beta':...},
# 'fb1':{'alpha':..., 'beta':...},
# ...
# 'fb4':{'alpha':..., 'beta':...},
# 'gauss0':{'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...},
# 'gauss1':{'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...},
# ...
# 'gauss4':{'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...},
# 'gaussb':{'mu0':..., 'tau0':..., 'alpha0':..., 'beta0':...}}
# YOUR CODE HERE
raise NotImplementedError()
Next, we need to generalize the function that updates all of the parameters once. We've provided the bookkeeping to deal with the multiple cluster-member Gaussians; you just need to add the part that updates the background Gaussian parameters based on the data from all bins that are assigned to the background component. (Hint: `np.concatenate` is your friend.)
def gibbs_iteration3(xs, par, hypar):
# xs is a list of data vectors for each bin
nbins = len(xs)
for i in range(nbins):
si = str(i)
p = condpost_group_membership(xs[i], par['fb'+si], par['gauss'+si], par['gaussb'])
par['g'+si] = st.bernoulli.rvs(p)
alpha,beta = condpost_fraction(par['g'+si], hypar['fb'+si])
par['fb'+si] = st.beta.rvs(alpha, beta)
x = xs[i][par['g'+si] == 0]
if len(x) > 0:
alpha,beta = condpost_Gaussian_variance(x, par['gauss'+si], hypar['gauss'+si])
par['gauss'+si]['sigma2'] = st.invgamma.rvs(alpha, scale=beta)
# add the updates of par['gaussb']['mu'] and par['gaussb']['sigma2'] here
# YOUR CODE HERE
raise NotImplementedError()
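If the hint isn't enough, the missing piece mirrors the background update in `gibbs_iteration2`, except that the $g=1$ galaxies from every bin are pooled first. Here it is sketched as a stand-alone helper (it relies on the `condpost_*` functions defined earlier, and the `update_background_sketch` name is just illustrative); you could equally well inline the same lines at the end of `gibbs_iteration3`.

def update_background_sketch(xs, par, hypar):
    # pool the background-assigned (g==1) galaxies from all bins
    xb = np.concatenate([xs[i][par['g'+str(i)] == 1] for i in range(len(xs))])
    if len(xb) > 0: # because, as ever, you never know
        alpha,beta = condpost_Gaussian_variance(xb, par['gaussb'], hypar['gaussb'])
        par['gaussb']['sigma2'] = st.invgamma.rvs(alpha, scale=beta)
        mean,sd = condpost_Gaussian_mean(xb, par['gaussb'], hypar['gaussb'])
        par['gaussb']['mu'] = st.norm.rvs(mean, sd)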
The next function will further abuse your `initialize_params` function to provide a starting position.
def initialize_params3():
initialize_params() # puts values in parameter_dict
params = {'fb0':parameter_dict['fb1'](),
'gauss0':{'mu':0.0, 'sigma2':parameter_dict['sigma_cl1']()**2},
'gaussb':{'mu':parameter_dict['mu_b'](), 'sigma2':parameter_dict['sigma_b']()**2}
}
for i in range(1, len(data)):
initialize_params()
params['fb'+str(i)] = parameter_dict['fb1']()
params['gauss'+str(i)] = {'mu':0.0, 'sigma2':parameter_dict['sigma_cl1']()**2}
return params
Let's quickly check that we can both initialize a set of parameters, and change it using the updating function and hyperparameters:
params = initialize_params3()
params
gibbs_iteration3(data, params, hyperparams3)
params
If so, we should be able to run chains as usual. This should take about 5 times as long per iteration as the single-bin Gibbs fit, but similarly should converge quickly and have low autocorrelation. We'll reduce the default number of steps accordingly, since we don't really need that many samples.
%%time
nchains = 4
nsteps = 1000
chains3 = np.empty((nchains, nsteps, len(paramnames3)))
for i in range(chains3.shape[0]):
params = initialize_params3()
for j in range(chains3.shape[1]):
gibbs_iteration3(data, params, hyperparams3)
chains3[i,j,:] = [params['fb0'], params['fb1'], params['fb2'], params['fb3'], params['fb4'],
np.sqrt(params['gauss0']['sigma2']), np.sqrt(params['gauss1']['sigma2']), np.sqrt(params['gauss2']['sigma2']), np.sqrt(params['gauss3']['sigma2']), np.sqrt(params['gauss4']['sigma2']),
params['gaussb']['mu'], np.sqrt(params['gaussb']['sigma2'])]
fig, ax = plt.subplots(len(paramnames3), 1, figsize=(16.0, 2.0*len(paramnames3)));
cr.plot_traces(chains3, ax, labels=param_labels3);
# burn3 = ... # this can probably be very short again
# YOUR CODE HERE
raise NotImplementedError()
fig, ax = plt.subplots(len(paramnames3), 1, figsize=(16.0, 2.0*len(paramnames3)));
cr.plot_traces(chains3[:,burn3:,:], ax, labels=param_labels3);
chains3 = chains3[:,burn3:,:]
R = cr.GelmanRubinR(chains3)
print('R =', R)
assert np.all(R < 1.1)
maxlag3 = 500 # doubt this needs to be raised
neff = cr.effective_samples(chains3, maxlag=maxlag3, throw=True)
print('neff =', neff)
assert np.all(neff > 1000)
Here is the giant triangle plot, for posterity:
plotGTC(np.concatenate(chains3), paramNames=param_labels3,
figureSize=20, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});
Hopefully, this is a lot of uninteresting blobs. To condense it, let's have a look at the background fraction and cluster velocity dispersion as a function of richness, conventionally denoted $\lambda$. Below, we just summarize each of the 1D constraints using the median and percentiles.
fig, ax = plt.subplots(1, 2, figsize=(10.0, 3.0));
mid,low,high = np.quantile(chains3[:,:,np.arange(len(data))], [0.5,0.1585, 0.8415], axis=(0,1))
ax[0].errorbar(richness, mid, [mid-low, high-mid], fmt='.'); ax[0].set_xlabel(r'$\lambda$'), ax[0].set_ylabel(r'$f_\mathrm{b}$');
mid,low,high = np.quantile(chains3[:,:,np.arange(len(data))+len(data)], [0.5,0.1585, 0.8415], axis=(0,1))
ax[1].errorbar(richness, 1e3*mid, 1e3*np.array([mid-low, high-mid]), fmt='.'); ax[1].set_xlabel(r'$\lambda$'), ax[1].set_ylabel(r'$10^3\sigma_\mathrm{cl}$');
Have a look at the constraints for $f_{\mathrm{b},i}$ here in particular, and compare them to the solutions. Although the specific best values will differ, they should be similar in magnitude. Also similarly, the constraints should be tighter at low richness, but not infinitely tight (i.e., the sampler should actually be changing the values).
I_checked_and_everything_looks_good = False # make True when true
# YOUR CODE HERE
raise NotImplementedError()
assert I_checked_and_everything_looks_good
Fitting all bins with linked backgrounds and a power-law scaling relation¶
The final change we will make in this notebook is to model $\sigma_\mathrm{cl}$ as a power law in richness, rather than having independent parameters for each bin:
$\sigma_\mathrm{cl}(\lambda) = A_\sigma \left(\frac{\lambda}{10}\right)^{\alpha_\sigma}$.
This is both physically expected and broadly justified by the plot above. We will not similarly introduce a model for $f_\mathrm{b}(\lambda)$, since there is no equivalent theoretical expectation for it.
This has two immediate consequences. First, we will have 3 fewer parameters to juggle. Second, the new parameters, $A_\sigma$ and $\alpha_\sigma$, are not conjugate with anything else (that we noticed). However! This doesn't change the fact that 7 parameters can still be conjugately sampled, conditional on $\sigma_\mathrm{cl}(\lambda)$. Therefore, we will use this as an excuse to set up a mixed sampling MCMC, where most parameters are updated using conjugate relations, and 2 are sampled with a more brute-force method.
Based on the figure above, the data provide much more information about $\sigma_\mathrm{cl}(\lambda)$ at low richness than at high richness. This is why we decided to normalize the power law such that $A_\sigma$ parametrizes the curve near the lower end of the data. If it isn't obvious why this is convenient, try changing the "pivot" value of 10 in the plotting cell below to something else (e.g. 35) and iteratively guessing parameter values that make the resulting curve go more or less through the constraints on $\sigma_\mathrm{cl}$. It's much less annoying, for both a human and a sampler, to set the normalization where the data have things locked down and then adjust the slope to something reasonable than to try to juggle both (i.e., normalizing at low richness in this case will reduce the degeneracy between the two parameters).
Of course, we will be fitting the actual data and not the constraints from the last section. But go ahead and come up with some broadly reasonable parameter values anyway.
# guess_A = ...
# guess_alpha = ...
# YOUR CODE HERE
raise NotImplementedError()
xvals = np.arange(7,75)
yvals = guess_A * (xvals / 10.)**guess_alpha
fig, ax = plt.subplots(1, 1, figsize=(4.0, 3.0));
mid,low,high = np.quantile(chains3[:,:,np.arange(len(data))+len(data)], [0.5,0.1585, 0.8415], axis=(0,1))
ax.errorbar(richness, mid, [mid-low, high-mid], fmt='.'); ax.set_xlabel(r'$\lambda$'), ax.set_ylabel(r'$\sigma_\mathrm{cl}$');
ax.set_xscale('log'); ax.set_yscale('log');
ax.plot(xvals, yvals);
Anyhoo, here is the official enumeration of parameter names for this analysis:
paramnames4 = ['fb0', 'fb1', 'fb2', 'fb3', 'fb4', 'A_sigma', 'alpha_sigma', 'mu_b', 'sigma_b']
param_labels4 = [r'$f_\mathrm{b,0}$', r'$f_\mathrm{b,1}$', r'$f_\mathrm{b,2}$', r'$f_\mathrm{b,3}$', r'$f_\mathrm{b,4}$',
r'$A_\sigma$',r'$\alpha_\sigma$', r'$\mu_\mathrm{b}$', r'$\sigma_\mathrm{b}$']
Implementation¶
Let's first look at the conjugate sampling of the background fraction and background distribution parameters. Naturally, for these updates, we hold $A_\sigma$ and $\alpha_\sigma$ (and thus all the $\sigma_{\mathrm{cl},i}$ values that follow from them) fixed. Hence, this will be a slight simplification over the last section.
We will still need a `hyperparams` dictionary encoding the priors for the conjugate-sampled parameters, as before. Let's just copy what we need from the previous section:
hyperparams4 = {}
for k in ['fb0', 'fb1', 'fb2', 'fb3', 'fb4', 'gaussb']:
hyperparams4[k] = hyperparams3[k]
The updating function is identical to the last section, except that the $\sigma_{\mathrm{cl},i}$ updates are left out, and we assume that their current values (based on $A_\sigma$ and $\alpha_\sigma$) are provided in the usual way via `par`:
def gibbs_iteration4(xs, par, hypar):
# xs is a list of data vectors for each bin
nbins = len(xs)
for i in range(nbins):
si = str(i)
p = condpost_group_membership(xs[i], par['fb'+si], par['gauss'+si], par['gaussb'])
par['g'+si] = st.bernoulli.rvs(p)
alpha,beta = condpost_fraction(par['g'+si], hypar['fb'+si])
par['fb'+si] = st.beta.rvs(alpha, beta)
# add the updates of par['gaussb']['mu'] and par['gaussb']['sigma2'] here (identical to the last section)
# YOUR CODE HERE
raise NotImplementedError()
We will use `lmc` to sample the remaining parameters and to handle the juggling of the two parameter subsets into the production of a single chain. We'll provide most of the logistical nonsense, since it isn't especially educational. However, it will be helpful to understand how things are pasted together.

In brief, we anticipate using parameter and hyperparameter dictionaries, as above, for compatibility with all of the conjugate sampling code. The "official" parameters known to the chain runner will be `lmc.Parameter`s, though, so we will need to copy values back and forth at appropriate times. Finally, while priors on the conjugate parameters are handled by the functions above, you will need to provide a (conditional) log-posterior function for $A_\sigma$ and $\alpha_\sigma$ that incorporates their priors. This is not necessarily the most elegant way of accomplishing the goal, but it will work well enough with the separate analyses you've done above, with minimal re-writing of code.
With that in mind, we define an `lmc.ParameterSpace` (and equivalent dictionary) that organizes all of the parameters other than the group assignments. This is not the list of parameters that `lmc` will try to sample with its own methods, but rather the one that it will output to form the chain. (We could also keep track of the group assignment parameters in the chain, but... there are a lot of them.)
parameter_space4 = lmc.ParameterSpace([lmc.Parameter(name=s) for s in paramnames4])
parameter_dict4 = {k:p for k,p in zip(paramnames4, parameter_space4)}
The cell below defines a new type of `lmc.Updater` that will handle the conjugate parameters. It simply calls, in turn, the function you completed above, and then ensures that the conditional posterior for the non-conjugate parameters is re-computed at the next opportunity to reflect those updates.
class GibbsUpdater(lmc.Updater):
def __call__(self, par):
# update the parameter values in the par dictionary
gibbs_iteration4(data, par, hyperparams4)
# set lmc Parameter values from our dictionary, so current values are available that way
for k in ['fb0', 'fb1', 'fb2', 'fb3', 'fb4']:
parameter_dict4[k].set(par[k])
parameter_dict4['mu_b'].set(par['gaussb']['mu'])
parameter_dict4['sigma_b'].set(np.sqrt(par['gaussb']['sigma2']))
# force re-evaluation of the log-posterior with these new values the next time the non-Gibbs updater is called
self.engine.current_logP = None
gibbs_updater4 = GibbsUpdater(None, None, 0, 0, None, None) # all arguments are ultimately ignored, as they do not apply to this simple class
Next, define the conditional posterior function for $A_\sigma$ and $\alpha_\sigma$. Recall from way above that the argument passed to various `lmc` functions is somewhat arbitrary, being ultimately up to the user. In this case, we've written things such that this argument will be the parameter dictionary used in the conjugate updates. This means that you will need to get some other things from global scope.
- The `data` and `richness` arrays can/should be directly referenced from global scope.
- The values of $A_\sigma$ and $\alpha_\sigma$ to test are obtained from `parameter_dict4['A_sigma']()` and `parameter_dict4['alpha_sigma']()` (or equivalently from `parameter_space4`). They should not be extracted from `par`, as `lmc` doesn't change that dictionary when proposing a step.

In contrast, any other parameter values that you need can be accessed either from `par` or `parameter_dict4`/`parameter_space4`, since the `GibbsUpdater` synchronizes them. The exception to this is the group assignment arrays (`g0`, `g1`, ...), which are only in `par`. (But note that this function will be called after the conjugate updates, so they will be there even though we haven't represented them yet.)
def nongibbs_lnpost4(par):
# compute log priors
# sum up the log likelihood for each galaxy and bin
# return the log-posterior
# keep in mind that ONLY terms involving A_sigma and alpha_sigma need be included here
# YOUR CODE HERE
raise NotImplementedError()
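To make the requirements concrete, here is an illustrative sketch of such a function. It assumes, purely as a placeholder, flat priors with $A_\sigma>0$; substitute whatever priors we actually agreed on. Note that only the cluster-member ($g=0$) part of the likelihood involves $A_\sigma$ and $\alpha_\sigma$, so nothing else needs to be summed. The `_sketch` name is, again, just illustrative.

def nongibbs_lnpost4_sketch(par):
    # current values of the non-conjugate parameters come from the lmc Parameters, not `par`
    A = parameter_dict4['A_sigma']()
    alpha = parameter_dict4['alpha_sigma']()
    if A <= 0.0:
        return -np.inf        # zero prior probability (placeholder prior)
    lnp = 0.0                 # log of a flat prior, up to a constant
    for i in range(len(data)):
        sigma_cl = A * (richness[i]/10.)**alpha
        xcl = data[i][par['g'+str(i)] == 0]  # galaxies currently assigned to the cluster component
        lnp += np.sum(st.norm.logpdf(xcl, loc=0.0, scale=sigma_cl))
    return lnp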
We'll want a function that provides a random starting point for chains, as in the previous sections, so we may as well write that before testing out the code above. Below, you'll need to provide initializations for all the parameters in `parameter_dict4`, which we then copy to a conjugate parameter dictionary. You will also need to supply proposal scales for $A_\sigma$ and $\alpha_\sigma$ (the `width` attributes of the conjugate parameters will never be looked at).
def initialize_params4():
# Initialize global Parameter objects
# parameter_dict4['fb0'].set( ... )
# parameter_dict4['fb1'].set( ... )
# parameter_dict4['fb2'].set( ... )
# parameter_dict4['fb3'].set( ... )
# parameter_dict4['fb4'].set( ... )
# parameter_dict4['A_sigma'].set( ... )
# parameter_dict4['alpha_sigma'].set( ... )
# parameter_dict4['mu_b'].set( ... )
# parameter_dict4['sigma_b'].set( ... )
# parameter_dict4['A_sigma'].width = ...
# parameter_dict4['alpha_sigma'].width = ...
# YOUR CODE HERE
raise NotImplementedError()
# Copy everything to our conjugate-style parameter dictionary, and return it
params = {'gaussb':{'mu':parameter_dict4['mu_b'](), 'sigma2':parameter_dict4['sigma_b']()**2}}
for k in ['fb0', 'fb1', 'fb2', 'fb3', 'fb4', 'A_sigma', 'alpha_sigma']:
params[k] = parameter_dict4[k]()
for i in range(len(data)):
params['gauss'+str(i)] = {'mu':0.0, 'sigma2':(params['A_sigma']*(richness[i]/10.)**(params['alpha_sigma']))**2}
return params
Check that this does something reasonable, and that the values stored in the two places agree:
# YOUR CODE HERE
raise NotImplementedError()
params = initialize_params4()
for p in parameter_dict4.keys():
print(p, '=', parameter_dict4[p]())
params
Next, let's check that the conjugate updating function appears to work (i.e. changes values to something finite/not entirely unreasonable, and adds the `g` parameters)...
gibbs_iteration4(data, params, hyperparams4)
params
... and that the conditional (non-conjugate) log-posterior function for the other parameters behaves.
lnp = nongibbs_lnpost4(params)
print(lnp)
assert np.isfinite(lnp)
Just a couple final bits of setup. First we need to define the `ParameterSpace` that `lmc` will sample with its own methods, based on the log-posterior function you wrote...
nongibbs_space4 = lmc.ParameterSpace([parameter_dict4['A_sigma'], parameter_dict4['alpha_sigma']], nongibbs_lnpost4)
... and finally a function that is executed after each iteration (that is, after performing both the conjugate and non-conjugate updates), which we can use to synchronize the non-conjugate parameter values back into our own dictionary.
def onStep(par):
# set our dictionary from the lmc Parameters that the non-Gibbs updater just changed
for k in ['A_sigma', 'alpha_sigma']:
par[k] = parameter_dict4[k]()
# don't forget the cluster distribution widths for the individual bins!
for i in range(len(data)):
par['gauss'+str(i)]['sigma2'] = (par['A_sigma']*(richness[i]/10.)**(par['alpha_sigma']))**2
Running the MCMC¶
Ok! Hopefully, we are now just about ready to push go and see what happens. But, before that, a tiny bit more explanation of this mixed sampling approach that has been implemented.
In the chain-running cell below, you will see a familiar structure, and some unfamiliar code. For each chain, we use your `initialize_params4` function to provide a random starting position. What follows is some `lmc` setup that isn't needed for a bog-standard analysis. Namely, we explicitly select a proposal method for the non-conjugate parameters, in this case one that uses slice sampling to jointly update both parameters, adapting the proposal scale every so often (this is the default that was silently used in the first section). The `Engine` that runs the chain is created and told that each iteration consists of calling the conjugate parameter updater and the non-conjugate parameter updater (in order), followed by a call to the `onStep` function, and that it will be recording the values of all the parameters in `parameter_space4`. Finally, the call to the engine provides the data structure in which to store the parameter traces, and the `params` dictionary that is passed to the various updater functions above. You don't have to understand how all this fits together; the point is more that mixed sampling like this is possible and sometimes advantageous.
As always, you might want to run this first with a single, short chain just to verify that things are working.
%%time
nchains = 4
nsamples = 1500
chains4 = [lmc.dictBackend() for i in range(nchains)]
for chain in chains4:
params = initialize_params4()
nongibbs_updater4 = lmc.MultiDimRotationUpdater(nongibbs_space4, lmc.Slice(), 100, 100, None, None)
engine = lmc.Engine([gibbs_updater4, nongibbs_updater4], parameter_space4, onStep)
engine(nsamples, params, [chain])
chains4 = np.array([np.array([chain[p] for p in paramnames4]).T for chain in chains4])
fig, ax = plt.subplots(len(paramnames4), 1, figsize=(16.0, 2.0*len(paramnames4)));
cr.plot_traces(chains4, ax, labels=param_labels4);
Take us through the usual diagnostics.
# burn4 = ...
# YOUR CODE HERE
raise NotImplementedError()
fig, ax = plt.subplots(len(paramnames4), 1, figsize=(16.0, 2.0*len(paramnames4)));
cr.plot_traces(chains4[:,burn4:,:], ax, labels=param_labels4);
chains4 = chains4[:,burn4:,:]
R = cr.GelmanRubinR(chains4)
print('R =', R)
assert np.all(R < 1.1)
maxlag4 = 500 # may need to adjust
neff = cr.effective_samples(chains4, maxlag=maxlag4, throw=True)
print('neff =', neff)
assert np.all(neff > 400)
You will probably notice, above, that the effective number of samples for the non-conjugate parameters is much smaller than for the conjugate ones.
Below is our slightly smaller triangle for this section.
plotGTC(np.concatenate(chains4), paramNames=param_labels4,
figureSize=12, customLabelFont={'size':12}, customTickFont={'size':12}, customLegendFont={'size':16});
Finally, let's compare the posterior for $\sigma_\mathrm{cl}(\lambda)$ with the non-hierarchical constraints from the last section, through the time-honored method of overplotting curves from many posterior samples, with transparency.
xvals = np.arange(7,75)
fig, ax = plt.subplots(1, 1, figsize=(6.0, 4.0));
mid,low,high = np.quantile(chains3[:,:,np.arange(len(data))+len(data)], [0.5,0.1585, 0.8415], axis=(0,1))
ax.errorbar(richness, mid, [mid-low, high-mid], fmt='.'); ax.set_xlabel(r'$\lambda$'), ax.set_ylabel(r'$\sigma_\mathrm{cl}$');
ax.set_xscale('log'); ax.set_yscale('log');
for A,alpha in np.concatenate(chains4)[np.random.choice(np.prod(chains4.shape[0:2]), size=100), 5:7]:
yvals = A * (xvals / 10.)**alpha
ax.plot(xvals, yvals, color='C1', alpha=0.1, zorder=0);
You can imagine going on to test whether the power law is a sufficient description of this scaling relation, but does it visually look consistent with the independent constraints from earlier?
I_have_pondered_this = False # make True when true
# YOUR CODE HERE
raise NotImplementedError()
assert I_have_pondered_this
Parting thoughts¶
In this notebook, you got some experience with a small Gaussian mixture, implementing it both with traditional posterior evaluations and with conjugate Gibbs sampling. The latter case demonstrated the introduction of latent parameters in a hierarchical setting. Here the Gibbs approach turned out to be more efficient in terms of time per effectively independent sample, but you can imagine that juggling an additional vector of parameters that scales with the size of the data set isn't always the best decision. We'll also note that for models more complex than this (arguably even for this one), we should probably have used JAGS or a similar Gibbs sampling code, rather than implementing the updates ourselves. Too bad we are allergic to installing non-Python software in this class.
You also got some practice implementing a simple modeling hierarchy - the shared background distribution - and a slightly less simple one - the power-law scaling relation. The latter gave us an excuse to demonstrate the use of different sampling methods for different subsets of parameters. This is a nice option to know about for those times that you have intimate knowledge of the model and canned algorithms aren't efficient enough to be practical.
This was a slightly unusual case in that the cluster member velocities genuinely should be Gaussian based on physical considerations, while the background component was surprisingly well described by a single additional Gaussian. Perhaps a more common use of GMMs is to marginalize over a potentially complicated distribution which constitutes a "nuisance" part of the model. That is, where we don't care about the model being physically interpretable, and just throw enough Gaussians at the problem to reproduce the data. There are interesting problems that arise in so doing, such as deciding how many is "enough", and the sampling quirks that follow from the exchangeability of Gaussians in the mixture (i.e. the same posterior probability can be obtained by swapping parameters between components). These aren't reasons to avoid mixture models, but in general we should be careful not to overinterpret the GMM parameters as meaning that there actually are a certain number of distinct components making up a given distribution.