Tutorial: Cluster Membership Fractions of RedMaPPer Galaxies

Behold the power of conjugacy!

In this notebook, we will reproduce part of the analysis of Myles et al. (2021), using data kindly provided by the first author. In the course of doing so, you will fit a small Gaussian mixture model to spectroscopic data for candidate cluster member galaxies, first with a standard MCMC posterior evaluation and then with conjugate Gibbs sampling.

Note that having already done the Gibbs sampling tutorial will save you some time. If not, you'll want to at least review the exposition there.

Background

Galaxies that fall into massive accumulations of dark matter and hot gas (these are, inaccurately, known as "clusters of galaxies") rapidly have their gas stripped and cease forming stars. Hence, if one has an optical photometric survey, one way of finding galaxy clusters is to look for... clusters of galaxies that have similar colors (corresponding to red/non-star-forming spectra at some common redshift).

The thing is, along any line of sight there are loads of galaxies of different types at different redshifts, so color alone is not a reliable way of saying that a particular galaxy is in a cluster. Optical photometric cluster finders can only identify sight-lines where there is probably a cluster, and estimate what its redshift would be. They also provide a "richness", loosely defined as the number of intrinsically red galaxies brighter than some threshold that the cluster is supposed to contain. But, again, we can't really be sure whether any given galaxy is actually a member or not from photometric data alone. RedMaPPer, the main cluster-finder used by the Dark Energy Survey and (soon) Rubin/LSST, addresses this by assigning a membership probability to each galaxy near an identified cluster; the richness is then the sum of these probabilities, such that it should statistically mean something like the number of bright, red galaxies in the cluster.

So far so good. But this approach is only good to the extent that the membership probabilities are accurate - or, at least, similarly accurate as a function of cluster mass. This is worth checking, which we could do using redshifts measured from spectroscopy for a large number of potential cluster members. The redshifts (strictly speaking, the implied velocities) of cluster members should be distributed in a tight Gaussian centered on the redshift of the bright central galaxy (BCG). We would expect a wider distribution of redshifts from galaxies that are not in the cluster, despite their having colors similar to cluster members.

(Analogously to the photometric color problem, it's technically possible for a galaxy at a different cosmological redshift to have such a large peculiar velocity that its redshift is consistent with the cluster's, but redshifts can be measured so precisely, and the cluster distribution is so narrow, that this should be rare. The model we'll build below will account for it, in any case.)

Data

The data we'll be working with ultimately come from the Sloan Digital Sky Survey. Myles et al. have compiled redshifts from Sloan spectroscopy for potential members of RedMaPPer clusters with estimated redshifts of 0.08-0.12 (this is relatively low, but it is the only sample for which archival spectroscopy would be relatively complete). Various fascinating cuts, which you can read the paper to learn about if you want, have been applied, and the data have been grouped into 6 bins according to the reported richness of the cluster each galaxy is supposed to belong to, with bins 0, 1, ..., 5 corresponding to increasing richness ranges (nominally increasing cluster mass).

For each bin, we simply have a list of values: $x_i = \frac{z_i - z_\mathrm{BCG}}{1 + z_\mathrm{BCG}}$. This quantity is directly related to the velocity implied by the galaxy's redshift, assuming it does belong to the cluster. For simplicity, we'll eschew the conversion to velocity units in the tutorial.

Below, we read the data for each bin into a corresponding entry of the list data.
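In case you're recreating this step outside the notebook, a minimal sketch of the reading step might look like the following; the file names are hypothetical placeholders, not the actual paths distributed with the tutorial.

```python
import numpy as np

# Hypothetical file names -- substitute the paths provided with the tutorial data.
data = [np.loadtxt('redmapper_xvalues_bin{}.txt'.format(i)) for i in range(6)]
```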

Next, let's have a look at histograms of $x$ for each bin:

Already we can clearly see what might be a sharp spike consisting of cluster members, and a wider distribution of non-members. The latter aren't so easy to see, however, so we can instead plot the histograms with a log-scaled y axis:
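If you're working outside the notebook, the plotting might look something like this minimal sketch (the panel layout and binning are just illustrative choices, not the notebook's):

```python
import matplotlib.pyplot as plt

# One panel per richness bin; binning and figure size are illustrative.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flatten()):
    ax.hist(data[i], bins=100)
    ax.set_yscale('log')          # log y axis brings out the broad background
    ax.set_title('bin {}'.format(i))
    ax.set_xlabel('x')
plt.tight_layout()
plt.show()
```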

Recall that the log of a Gaussian is an upside-down parabola - it already looks like a couple of these might describe the data reasonably.

Model

So, what model will we fit to these distributions of $x$ values? The distribution for cluster members is thermodynamically expected to be a Gaussian, but it isn't obvious what we should expect for the non-members - it would depend on the cosmic large scale structure as well as the RedMaPPer selection of these potential members in the first place. On the other hand, the wider distributions above look approximately Gaussian, and we can always build a model for them out of multiple Gaussians if need be. In other words, this might be a good place to use a Gaussian mixture model,

$p(x) = f\cdot\mathrm{Normal}(x|0,\sigma_c) + (1-f)\cdot\mathrm{Normal}(x|\mu_b,\sigma_b)$.

Here, we've written the simplest mixture we might try, with 1 Gaussian component for the cluster ($c$) and just 1 Gaussian component for the background ($b$). We've also baked in the expectation that the cluster component have a mean of zero (i.e., be centered on the BCG's $x$ value).

To begin with, we'll fit the data for each richness bin with this model independently, so the expression above includes all the parameters we need:

As for priors, I propose we use generic "uninformative" ones:

As always, your next task is to draw the corresponding PGM and write out the conditional probability expressions that define the model.

Here are the canonical parameter names and order we'll use in code:

Visualize the model

Before going further, it will be useful to be able to quickly compare a model with the data. We could directly plot $p(x)$ over the histograms above, but the tails, where there are 0 or 1 measured values in a histogram bin, are tricky to judge. Therefore, we will look at comparisons of both the model PDF and CDF, the latter compared with the empirical CDF of the data.

Below, fill in functions to evaluate $p(x)$ and its CDF.
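If you get stuck, one way these functions might look is sketched below; the function names and the assumed parameter order (f, sigma_c, mu_b, sigma_b) are ours, so adapt them to the canonical names and order defined above.

```python
import numpy as np
from scipy.stats import norm

def model_pdf(x, f, sigma_c, mu_b, sigma_b):
    # mixture of the cluster Gaussian (mean fixed at 0) and the background Gaussian
    return f * norm.pdf(x, 0.0, sigma_c) + (1.0 - f) * norm.pdf(x, mu_b, sigma_b)

def model_cdf(x, f, sigma_c, mu_b, sigma_b):
    # the CDF of a mixture is the same weighted mixture of the component CDFs
    return f * norm.cdf(x, 0.0, sigma_c) + (1.0 - f) * norm.cdf(x, mu_b, sigma_b)
```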

Here's a function to make the described comparison plots:

Next, iteratively fill in the list below with guessed parameter values until you have something that looks like a broadly reasonable fit to the bin 1 data.

MCMC fit

Next, you'll do a standard MCMC fit to the bin 1 data, mainly to establish a baseline using a bog-standard method.

Below, complete the log-posterior function for our model. Don't forget to include the prior!
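As a reference point, a minimal sketch of such a function is shown here; the parameter order and the prior bounds are assumptions on our part, so use whatever ranges you adopted above.

```python
import numpy as np
from scipy.stats import norm

def log_posterior(params, x):
    # Parameter order assumed to be (f, sigma_c, mu_b, sigma_b).
    f, sigma_c, mu_b, sigma_b = params
    # Prior: the bounds below are placeholders -- use the ranges you chose above.
    if not (0.0 <= f <= 1.0 and sigma_c > 0.0 and sigma_b > 0.0):
        return -np.inf
    # Likelihood: sum over galaxies of the log of the mixture density.
    like = f * norm.pdf(x, 0.0, sigma_c) + (1.0 - f) * norm.pdf(x, mu_b, sigma_b)
    if np.any(like <= 0.0):
        return -np.inf
    return np.sum(np.log(like))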

Now, let's check the function on your guessed parameter values.

Hint: If you're looking at $\pm\infty$ above, do make sure you're taking a sum of logs rather than a log of a product.

While it doesn't turn out to be an issue in this case, the hint above gets to a weakness of brute-force posterior calculations when using mixture models. Namely, because part of the posterior involves a sum of PDFs, we do actually have to evaluate those PDFs, rather than exclusively working with log-PDFs. This opens a door to numerical over/underflows that we might not need to worry about otherwise.
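For what it's worth, one standard way to guard against such underflows (not needed in this tutorial, but good to know about) is the log-sum-exp trick, e.g. via `scipy.special.logsumexp`:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_mixture_likelihood(params, x):
    # Evaluate each component in log space and combine with log-sum-exp,
    # avoiding underflow of the Gaussian PDFs in the far tails.
    f, sigma_c, mu_b, sigma_b = params
    log_terms = np.stack([np.log(f) + norm.logpdf(x, 0.0, sigma_c),
                          np.log(1.0 - f) + norm.logpdf(x, mu_b, sigma_b)])
    return np.sum(logsumexp(log_terms, axis=0))
```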

Next, we'll do sampling with emcee. The number of steps and burn-in length supplied below worked well enough given my starting point, but you should change them if need be.
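In case it helps to see the shape of a typical setup, a minimal sketch follows; the walker count, step numbers, and the name `guess` (standing in for your guessed parameter list) are placeholders, not the values supplied in the notebook.

```python
import emcee
import numpy as np

nwalkers, ndim = 16, 4
# Scatter the walkers slightly around the guess (tiny absolute jitter in case
# any guessed value is exactly zero).
start = np.array(guess) * (1.0 + 0.01 * np.random.randn(nwalkers, ndim)) \
        + 1.0e-5 * np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[data[1]])
sampler.run_mcmc(start, 5000)
samples = sampler.get_chain(discard=500, flat=True)  # adjust burn-in as needed
```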

Here are the parameter traces...

... and the usual quantitative diagnostics.

Checkpoint: All looking good?

If so, let's compare the posterior mean to the data.

It should look at least as good as your guess. Finally, let's make a triangle plot.

Checkpoint: The "truth" values plotted over top show the posterior mean we obtained, so they should go through the center of your distributions.

Conjugate Gibbs fit

Hopefully it won't come as a shock to learn that a model composed entirely of Gaussians (or any standard distribution, really) lends itself to conjugacy. If you've done the Gibbs sampling tutorial, your work from there can be mostly (see warning below) re-used, and the generalization from 1 Gaussian to a Gaussian mixture is not too much more.

On the other hand, exactly how one comes up with a conjugacy relation for a sum of Gaussians may not be obvious. The trick is... we don't need to! Instead, we can get around the question entirely by introducing a set of latent parameters, $g$, where $g_i$ tells us whether galaxy $i$ belongs to the cluster distribution ($g_i=0$) or the background distribution ($g_i=1$). You could think of this as re-writing our model PDF as

$p(x_i) = (1-g_i)\cdot\mathrm{Normal}(x_i|0,\sigma_c) + g_i\cdot\mathrm{Normal}(x_i|\mu_b,\sigma_b)$.

So the $g_i$ parameters are playing a role similar to $f$ before, but, for any given galaxy, $p(x_i|g_i)$ is just a single Gaussian. Of course, $f$ is still part of the model, and will play a role in determining the $g_i$ - after all, $f$ is exactly the a priori probability that any $g_i=0$.

We'll fill in the details in a moment. But first, modify your PGM and expressions to explicitly show the latent $g_i$ parameters. If you're confused about how to connect them to everything else, remember... think generatively! (And then re-read the final clause of the previous paragraph.)

Conjugate updates

Recall that, in conjugate Gibbs sampling, we update one parameter (or bunch of related parameters) at a time, with all the others fixed. The advantage of introducing the group membership parameters is that, at any given moment in our sampling, each galaxy's $x$ is drawn from exactly one Gaussian. Conversely, if we want to update the parameters of a Gaussian, we can use the update rules you worked out in the Gibbs sampling tutorial, as long as only the data for galaxies with the corresponding $g$ value are included in the calculation. Note that we will need to choose hyperparameters of the conjugate prior distributions that give us equivalent priors to what we decided above (see the Gibbs sampling notebook).

Of course, we will also need rules for updating all of the $g_i$ values, as well as $f$.

Updating $g$

To see how to update each of the $g_i$, write down the terms in the posterior where it appears, based on your PGM and equivalent expressions. (Remember, it's the fully conditional posterior that we need to work with.) Hopefully without giving away too much, you should be left with something like

$p(g_i|\ldots) \propto p(g_i|f) \, p(x_i|g_i,\ldots)$.

Since $g_i$ can only take the values 0 and 1, we simply need to evaluate the conditional posterior at both values, normalize them, and then randomly choose a new $g_i$ based on those 2 probabilities.
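In code, that update might look something like the sketch below, written with explicit arguments for clarity; you'll want to adapt it to whatever parameter bookkeeping you adopt in the implementation section.

```python
import numpy as np
from scipy.stats import norm

def update_g(x, f, sigma_c, mu_b, sigma_b):
    # Unnormalized conditional probabilities of g_i = 0 (cluster) and g_i = 1 (background).
    p0 = f * norm.pdf(x, 0.0, sigma_c)
    p1 = (1.0 - f) * norm.pdf(x, mu_b, sigma_b)
    prob1 = p1 / (p0 + p1)                     # normalized probability that g_i = 1
    # Draw each g_i from a Bernoulli distribution with that probability.
    return (np.random.rand(len(x)) < prob1).astype(int)
```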

Updating $f$

Following the same approach as above, the conditional posterior for $f$ looks like

$p(f|\ldots) \propto p(f) \, \prod_i p(g_i|f) = p(f) \, f^{n_0} (1-f)^{n_1}$,

where $n_k$ is the number of $g_i$ equal to $k$. You might recognize these last 2 factors as being proportional to the binomial distribution, $p(n|q;N)={N \choose n} q^n (1-q)^{N-n}$. Consult Wikipedia (or work it out yourself) to find the conjugate form of the prior for $f$, and determine the values of the hyperparameters $\alpha_0$ and $\alpha_1$ that reproduce the uniform prior we assumed above. Let us know the answer below.

Implementation

Let's begin by defining a more complete dictionary of model parameters, including ones that aren't free:

Below, we fill in one corresponding to your guess. The g entry just gets a placeholder for now, which is fine. In fact, if we make $g$ our first update, we never need to bother coming up with initial values for it!
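For concreteness, such a dictionary might look something like the following; the keys and numerical values here are placeholders of ours, so substitute your own guess and whatever structure the notebook actually uses.

```python
params = {'f': 0.5,                               # cluster member fraction (your guess)
          'g': None,                              # placeholder; set by the first g update
          'p': [{'mu': 0.0, 'sigma': 0.002},      # cluster Gaussian (mean fixed at 0)
                {'mu': 0.0, 'sigma': 0.02}]}      # background Gaussian
```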

This is a convenience function to translate such a dictionary back to a list of values, so we can tabulate our samples.

We'll also want a similar dictionary holding the prior hyperparameters you chose above. We provide a template below; the p entry is again a list with dictionaries corresponding to (the hyperparameters of the priors for the parameters of) each Gaussian in the mixture. The names of the hyperparameters in p are chosen to simplify re-use of code from the Gibbs sampling tutorial. (Note that we don't need hyperparameters for the mean of the cluster Gaussian because we'll still be keeping it fixed to zero.)

With that, it's time to write some parameter updating functions.

Warning: Before you gleefully copy/paste your update_sigma function from the Gibbs tutorial, remember that in that notebook we were in the funny position of having 2 bits of information about $\sigma$ for every data point instead of 1. There is a small but crucial change to be made compared with the previous case, in addition to identifying the relevant data points for a given Gaussian.
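As a generic reference (not the notebook's own code), the standard Normal/inverse-gamma conjugate update restricted to one component might be sketched as below; the hyperparameter names 'alpha' and 'beta' and the dictionary keys are assumptions on our part, and you still need to make sure the details match what you derived in the Gibbs tutorial.

```python
import numpy as np
import scipy.stats as st

def update_sigma(params, hypers, x, j):
    # Conjugate update for the width of Gaussian component j, using only the
    # galaxies currently assigned to it. Assumes an inverse-gamma prior on the
    # variance with hyperparameters 'alpha' and 'beta' (placeholder names).
    xj = x[params['g'] == j]
    alpha = hypers['p'][j]['alpha'] + 0.5 * len(xj)
    beta = hypers['p'][j]['beta'] + 0.5 * np.sum((xj - params['p'][j]['mu'])**2)
    # Draw a new variance from the inverse-gamma conditional posterior.
    params['p'][j]['sigma'] = np.sqrt(st.invgamma.rvs(alpha, scale=beta))
    return params
```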

And that, dear friends, is why codes like JAGS that figure out and implement conjugacies for you exist. Fear not, we won't ask you to go through this again. Probably.

In any case, the cell below will test each one of these functions. If you get crashes, break it up and debug as needed.

If your code is successfully producing numbers, let's have a look at which galaxies it assigned to each group:

Checkpoint: Does the fraction of $g$ in component 0 approximately agree with the post-update value of $f$? It should.

Before we try to run a chain, let's do a more stringent check to make sure the parameters don't run away to silly values after repeated updates. First, we'll codify the updates in the same order as above, for brevity.

And now run 100 iterations, printing out the final parameter values.
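A generic driver along these lines may help, assuming each updater takes `(params, hypers, x)` (component-specific updaters can be bound with `functools.partial`) and the dictionary layout sketched earlier; all names here are placeholders, not the notebook's.

```python
import numpy as np

def run_gibbs(params, hypers, x, updaters, nsteps):
    # Apply each updater in turn (one full Gibbs scan) and record the free
    # parameters after every scan.
    samples = np.empty((nsteps, 4))
    for k in range(nsteps):
        for update in updaters:
            params = update(params, hypers, x)
        samples[k] = [params['f'], params['p'][0]['sigma'],
                      params['p'][1]['mu'], params['p'][1]['sigma']]
    return samples
```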

Checkpoint: Compare these to the posterior means and standard deviations from the emcee analysis (reproduced in the next cell). If anything is significantly off, one or more of your updaters needs more debugging. (As you'll see below, 100 steps is more than enough for this method to converge.)

If everything above is working, we're ready to try a longer chain. The class below will simplify the process of running and checking multiple chains, which we'll eventually use for the other richness bins as well.

Here we set up one of these things for each of the richness bins, in a list.

The cells below run a set of 4 chains for the bin 1 data, and go through the usual diagnostics and checks. As before, the length, burn-in and maximum lag are given values that worked well for us, given our starting guess, but you should change them if necessary.

(If you're worried about giving all 4 of the chains the same initial position, good for you! Experimentation has shown that this particular problem converges extremely well regardless, so we're being lazy.)

Checkpoint: The chains should converge very quickly, such that our chosen burn-in period of 50 is overkill. If not, something is buggy.

If things look ok, let's see how these samples compare with those from emcee:

Checkpoint: The two posteriors should agree very well, of course. If not, again, something is fishy.

Results for all bins

If everything appears to be working well for bin 1, let's go ahead and run for the other bins. Our experience was that the same initial guess, chain length, burn-in, etc. worked fine in all cases, but as always you should change anything that needs it. We've also added plots to compare the posterior modes with the data. Please do look at all of these rather than just plowing through... we will be very disappointed if we see tutorials with clear yet unidentified issues turned in.

Checkpoint: Do all of the diagnostics look ok? How about the comparisons to data? Anything stand out?

Because you've earned it, below we show the results for all bins on the same triangle plot. As a last checkpoint, the "truth" values show our posterior mean for bin 4.

Some features should stand out here. First, there are clearly trends in the cluster member fraction, $f$, and cluster velocity dispersion, $\sigma_c$, from low richness (bin 0) to high richness (bin 5). Finding the former was the point of the paper we're following, while the latter had better be the case, given that velocity dispersion is like a dynamical temperature, and should therefore increase with halo mass.

The other feature, which wasn't obvious a priori, is that the background distribution looks the same across bins. This led the authors to do a second fit with a common background distribution across all bins. We won't reproduce it here, since it's essentially an exercise in bookkeeping given what you've done above, and doesn't change the results in unexpected or interesting ways.

Parting thoughts

In this notebook, you got some experience with a small Gaussian mixture, implementing it with a traditional posterior-evaluation and with conjugate Gibbs sampling. The latter case demonstrated the introduction of latent parameters in a hierarchical setting. Here the Gibbs approach turned out to be more efficient in terms of time per effectively independent sample, but you can imagine that juggling an additional vector of parameters that scales with the size of the data set isn't always the best decision. We'll also note that for models more complex than this (arguably even for this one), we should probably have used JAGS or a similar Gibbs sampling code, rather than implementing the updates ourselves.

This was a slightly unusual case in that the cluster member velocities genuinely should be Gaussian based on physical considerations, while the background component was surprisingly well described by a single additional Gaussian. Perhaps a more common use of GMMs is to marginalize over a potentially complicated distribution which constitutes a "nuisance" part of the model. That is, where we don't care about the model being physically interpretable, and just throw enough Gaussians at the problem to reproduce the data. There are interesting problems that arise in so doing, such as deciding how many is "enough", and the sampling quirks that follow from the exchangeability of Gaussians in the mixture (i.e. the same posterior probability can be obtained by swapping parameters between components). These aren't reasons to avoid mixture models, but in general we should be careful not to overinterpret the GMM parameters as meaning that there actually are a certain number of distinct components making up a given distribution.