Tutorial: Evaluation of Galaxy Cluster Centering Models

Previously, we have mostly done simple, visual comparisons of fitted models or posterior predictions with data. In this notebook, we'll get more quantitative about saying whether a model could plausibly produce the observed data. Specifically, for a simple data set encoding the difference between two methods of defining a galaxy cluster's center, you will choose a test statistic, fit a simple model, evaluate that model with visual and quantitative posterior predictive checks, and then repeat the exercise for a more complex model of your choosing.

The Dataset

Our data is just a list of numbers. Each one represents a measured distance, $y$, between two different estimates of the center of a galaxy cluster: the location of the presumed central galaxy and a centroid of the diffuse, emissive gas. The context here is that automated algorithms sometimes fail to choose the central galaxy correctly (because of image artifacts or other problems), whereas the gas centroid is more reliable but also more expensive to measure. Therefore, we'd like to use this data set to characterize the distribution of mis-centerings, so that the galaxy-based centers can be used for a large sample, with the resulting errors propagated forward through future processing, e.g. weak lensing estimates.

Let's load up the data and have a look.

Load data into global variable y. Each entry is an offset in units of kpc. This particular data set is fictional, but intentionally similar to a real one that has actually been used for this purpose.
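For instance, if the offsets were stored one per line in a plain text file (the file name below is a placeholder, not the tutorial's actual file), loading could look like this:

```python
import numpy as np

# Hypothetical file name/format: adjust to match the file actually provided with the tutorial.
y = np.loadtxt('cluster_offsets.txt')  # one offset per line, in kpc
print(len(y), 'clusters')
```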

Check out a quick histogram of the data.
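Something as simple as the following would do:

```python
import matplotlib.pyplot as plt

plt.hist(y, bins=50)  # the number of bins is an arbitrary choice
plt.xlabel('offset $y$ (kpc)')
plt.ylabel('number of clusters')
plt.show()
```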

1. Choosing a Test Statistic

The model we will test in this tutorial is outlined in the next section - but how well that model fits the data is a question we will answer in part using a test statistic.

In this case, a feature we especially want to capture with our model is the fraction of clusters where $y$ is small (in the context of a galaxy cluster) vs not small. For concreteness, in the work you turn in, make $T$ the number of clusters with $y<100$ kpc. However, you are encouraged to play around with other possibilities to see how the simple model below succeeds or fails to reproduce the corresponding features of the data.

Below, define a function that computes this test statistic.
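A minimal implementation of the recommended statistic could be:

```python
import numpy as np

def test_statistic(yy):
    """Test statistic T: the number of clusters with offset y < 100 kpc."""
    return np.sum(np.asarray(yy) < 100.0)
```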

Compute the test statistic of the real data to verify that it works.
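For example:

```python
T_data = test_statistic(y)
print('T(data) =', T_data)
```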

Self-check: The recommended test statistic evaluates to 118 on the real data.

Setting up a Computational Framework

Once we define a model to work with (below), we'll want to fit that model to the data, and then evaluate it using the methods we saw in the notes. These include a visual check of the posterior predictions against the data in data space, and a quantitative posterior predictive check based on a test statistic.

In the next notebook, we'll compare this model with the second, alternative model that you'll choose and fit later on, in terms of the Bayesian evidence, among other quantitative measures from the notes.

Notice that each of these operations can be coded as a function of the model, $H$ (e.g. a visual check of the model, the evidence of the model, and so on). We will therefore be fancy and write a class that completely describes the model, and also contains a set of functions (methods) that act on the model. Since we anticipate looking at multiple models, we'll use inheritance. While this level of object-oriented programming may not be familiar, it's a good thing to learn, and most of the details are filled in for you below.

We start by defining a base class, which contains the functionality common to any model we care to define later. To make it clear what functionality we expect derived classes to provide, we'll include definitions of non-functional methods that the derived classes will need to override. (This is completely unnecessary in Python, but has the benefit of providing a complete list of the functions we'll eventually want to fill in.)

The functions that are filled in below (towards the bottom) are the ones that we won't need to modify for any particular model. Do take a moment to make sure you understand how they work; they are mostly doing things you've seen before. Wherever you see **params here, a derived class would have a specific list of arguments corresponding to the model parameters.
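To give a concrete (but non-authoritative) picture of the intended structure, a stripped-down version of such a base class might look like the sketch below. The method names are illustrative guesses, and the skeleton provided in the notebook is more complete.

```python
import numpy as np

class Model:
    """Base class: stores MCMC samples and declares the interface a specific model must provide."""
    def __init__(self):
        self.samples = None  # will eventually hold the concatenated, post-burn-in MCMC samples

    # --- methods that a derived class must override (stubs only) ---
    def log_prior(self, **params):
        raise NotImplementedError
    def draw_samples_from_prior(self, Ns):
        raise NotImplementedError
    def log_likelihood(self, **params):
        raise NotImplementedError
    def sampling_distribution(self, yy, **params):
        raise NotImplementedError
    def generate_replica_dataset(self, N, **params):
        raise NotImplementedError

    # --- functionality shared by all models ---
    def log_posterior(self, **params):
        """Unnormalized log-posterior: log-prior plus log-likelihood (where the prior is non-zero)."""
        lnp = self.log_prior(**params)
        if np.isfinite(lnp):
            lnp += self.log_likelihood(**params)
        return lnp
```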

2. Evaluating a Simple Model

First, let's assume the simple model $H_1$, that the sampling distribution is an exponential:

$p(\{y_i\}|a, H_1) = \prod_i \frac{1}{a}e^{-y_i/a}$; $y_i\geq0$

Our single parameter is $a$, the mean of the exponential distribution.

Note that this is the entirety of the sampling distribution. Don't be scared of the simplicity! Our model is that the data are realizations of some distribution, and we want (initially) to find the parameter(s) of that distribution. There's nothing tricky (but also nothing Gaussian) about this.

2a. Implementation in code

Complete the implementation of this model as a derived class of Model, below. An ExponentialModel object will still have all the methods defined for Model, in particular the ones that are not redefined (overridden) here.

Note that this includes choosing a reasonable prior for $a$. It should be a proper (normalizable) distribution. We don't want to deal with improper distributions when calculating the evidence later on. To compare with the self-check below, use a wide, uniform (but bounded!) prior. If you need to get some intuition for what constitutes a "wide" prior compared with what the posterior is likely to allow, try plotting some example curves over the histogram of the data. Below, state your prior for $a$ for the record.

Look for the magic word "TBC" to find functions that need to be completed. Again, even when a function below is completely given, you should make sure you understand how it works.
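For reference only, here is one way the completed class might look, assuming the interface sketched earlier and a uniform prior on $a$ between zero and an arbitrary (wide) upper bound; your own choices may differ.

```python
import numpy as np
import scipy.stats as st

class ExponentialModel(Model):
    """Exponential sampling distribution with mean (scale) parameter a."""
    def __init__(self):
        super().__init__()
        self.a_max = 1000.0  # kpc; arbitrary but "wide" compared with the data

    def log_prior(self, a):
        # uniform prior on [0, a_max]
        return st.uniform.logpdf(a, loc=0.0, scale=self.a_max)

    def draw_samples_from_prior(self, Ns):
        return st.uniform.rvs(loc=0.0, scale=self.a_max, size=Ns)

    def log_likelihood(self, a):
        # uses the global data vector y loaded earlier
        return np.sum(st.expon.logpdf(y, scale=a))

    def sampling_distribution(self, yy, a):
        return st.expon.pdf(yy, scale=a)

    def generate_replica_dataset(self, N, a):
        return st.expon.rvs(scale=a, size=N)
```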

Now try instantiating a model and drawing a dozen samples from its prior as a test:
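For example:

```python
Model1 = ExponentialModel()
print(Model1.draw_samples_from_prior(12))
```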

Test out the log-posterior function to make sure it's not obviously buggy.
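It should, for instance, return a finite number for a parameter value inside the prior range:

```python
print(Model1.log_posterior(a=500.0))  # should be finite, not NaN or -inf
```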

Similarly the mock-data producing function (arbitrarily with $a=500$).
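For example:

```python
mock = Model1.generate_replica_dataset(len(y), a=500.0)
plt.hist(mock, bins=50)
plt.xlabel('mock offset (kpc)')
plt.show()
```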

Finally, test the sampling distribution function.
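Again, with an arbitrary parameter value:

```python
ygrid = np.linspace(0.0, 1000.0, 200)
plt.plot(ygrid, Model1.sampling_distribution(ygrid, a=500.0))
plt.xlabel('offset $y$ (kpc)')
plt.ylabel('$p(y|a=500)$')
plt.show()
```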

2b. Fit the model to the data

The draw_samples_from_posterior method carries out a parameter inference with emcee and displays its Markov chains. remove_burnin removes burn-in and concatenates the chains. Since these steps aren't really the point of this problem, we'll just give you the code... for this model. Keep in mind that you're going to choose the alternative model to consider below, so we can't say what values of the keyword arguments used below will be appropriate for that case. In other words, best not to speed through this section without looking at and thinking about what's happening.

The concatenated MCMC samples are eventually stored in the Model1.samples array.
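As a rough idea of what the provided draw_samples_from_posterior method does under the hood (the walker, step and starting-point settings below are placeholders, not the values used in the provided code), an emcee fit of this one-parameter model could look like:

```python
import emcee

nwalkers, ndim, nsteps = 8, 1, 2000                    # placeholder settings
start = 100.0 + 10.0*np.random.randn(nwalkers, ndim)   # initial walker positions near a plausible a

def log_prob(theta):
    return Model1.log_posterior(a=theta[0])

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(start, nsteps)
raw_chains = sampler.get_chain()  # shape (nsteps, nwalkers, ndim); burn-in not yet removed
```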

This will compute the usual convergence criteria and effective number of samples:

Assuming that looks ok, remove the burn-in for good, and plot the concatenation of what's left.

It will be useful for later to know the mean of the posterior:
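For example:

```python
a_mean = Model1.samples.mean(axis=0)
print('posterior mean of a:', a_mean)
```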

Checkpoint: With wide, uniform priors, you should find a posterior mean $a$ of around 96.

2c. Visually compare the posterior predictions with the data.

First, let's just plot the posterior-mean model over the data.
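A sketch, re-using the posterior mean computed above (and normalizing the data histogram so it can be compared with a PDF):

```python
plt.hist(y, bins=50, density=True, label='data')
ygrid = np.linspace(0.0, y.max(), 200)
plt.plot(ygrid, Model1.sampling_distribution(ygrid, a=a_mean[0]), label='posterior-mean model')
plt.xlabel('offset $y$ (kpc)')
plt.ylabel('$p(y)$')
plt.legend()
plt.show()
```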

This kind of plot should be familiar: it's often a good idea to evaluate model adequacy in data space. You should already be able to see telling differences between the fitted model's sampling distribution and the data histogram. However, we don't know to what extent those differences could be explained simply by variability in the sampling distribution, let alone the posterior uncertainty in $a$.

So, next, let's compare a random predicted ("replica") data set, drawn from the posterior predictive distribution, with the data. To do this we first draw a parameter value from the posterior PDF, and then generate a dataset from the sampling distribution conditioned on that value. The result is a sample from $p(y_\mathrm{rep}|y,H_1)$.
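A sketch of a single replica draw:

```python
j = np.random.randint(Model1.samples.shape[0])  # pick a random posterior sample of a
yrep = Model1.generate_replica_dataset(len(y), a=Model1.samples[j, 0])

plt.hist(y, bins=50, histtype='step', color='black', label='data')
plt.hist(yrep, bins=50, histtype='step', color='red', label='replica')
plt.xlabel('offset $y$ (kpc)')
plt.legend()
plt.show()
```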

This plot is nice because it is comparing apples with apples, asking: do mock datasets drawn from our model sampling distribution with any plausible parameter value "look like" the real data?

However, to best evaluate this, we want to visualize the posterior predictive distribution of replica datasets instead of looking at just one. We can do this by plotting many replica datasets drawn from the posterior predictive PDF. Let's put this plot in a function, so we can re-use it later.
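One possible implementation, assuming (as in the sketches above) that each model's generate_replica_dataset accepts its parameters positionally, in the same order as the columns of its samples array:

```python
def plot_ppd(model, Nrep=100, bins=None):
    """Overlay many replica-dataset histograms, drawn from the PPD, on the data histogram."""
    if bins is None:
        bins = np.linspace(0.0, y.max(), 51)  # fixed bin edges so all histograms are comparable
    for i in range(Nrep):
        j = np.random.randint(model.samples.shape[0])
        yrep = model.generate_replica_dataset(len(y), *model.samples[j, :])
        plt.hist(yrep, bins=bins, histtype='step', color='red', alpha=0.1)
    plt.hist(y, bins=bins, histtype='step', color='black', label='data')
    plt.xlabel('offset $y$ (kpc)')
    plt.legend()
    plt.show()

plot_ppd(Model1)
```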

The use of transparency above means that the darkness of the red histogram lines below corresponds to the PPD at that offset.

Based on these visual checks, would you say the model does a good job of predicting the observed data?

2d. Quantitative posterior predictive model check

Now let's quantify the (in)adequacy of the fit with a posterior predictive model check, based on the test_statistic function you've already defined.

To sample the posterior predictive distribution of test statistics $p[T(y_\mathrm{rep})|y,H]$, we need to generate replica datasets from the model:
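For example (distribution_of_T is a name introduced here for illustration, not one provided by the notebook):

```python
def distribution_of_T(model, Nrep=1000):
    """Return test statistics of Nrep replica datasets drawn from the posterior predictive distribution."""
    TT = np.empty(Nrep)
    for i in range(Nrep):
        j = np.random.randint(model.samples.shape[0])
        yrep = model.generate_replica_dataset(len(y), *model.samples[j, :])
        TT[i] = test_statistic(yrep)
    return TT
```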

We can now compute the test statistic for each replica dataset, estimate the p-value as the fraction of replicas with $T(y_\mathrm{rep})$ exceeding $T(y)$, and visualize the distribution of $T(y_\mathrm{rep})$ compared with the observed $T(y)$.

And we want all of that packaged in functions of the model, so that we can re-use it later (on different models!).

First write a function to estimate the p-value, $P(T > T(y)|y,H)$, based on posterior predictions from the chains you ran above.
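A sketch, building on the distribution_of_T helper above:

```python
def pvalue(model, Nrep=1000):
    """Estimate P[T(y_rep) > T(y) | y, H] from posterior predictive replicas."""
    TT = distribution_of_T(model, Nrep)
    return np.mean(TT > test_statistic(y))
```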

Here's a function that plots the distribution of T, and reports the p-value:
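Since that function is not reproduced here, the sketch below shows one way it might be put together using the helpers above:

```python
def plot_T_distribution(model, Nrep=1000):
    """Histogram the posterior predictive distribution of T, mark T(data), and report the p-value."""
    TT = distribution_of_T(model, Nrep)
    plt.hist(TT, bins=50, label=r'$T(y_{\rm rep})$')
    plt.axvline(test_statistic(y), color='black', label=r'$T(y)$')
    plt.xlabel('T')
    plt.legend()
    plt.show()
    print('p-value:', np.mean(TT > test_statistic(y)))
```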

Let's see how it does:
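For example:

```python
plot_T_distribution(Model1)
```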

Checkpoint: If you implemented the test statistic and priors recommended above, your p-value should be about 0.001. The precise value will vary because of all the randomization involved in estimating it, and the fact that we are out in the tail of the posterior predictive distribution.

3. Evaluating a more complex model

Next, we'll repeat all of that (it should be much easier this time!) with a second model. The functional form is up to you, but to make things interesting, choose a model with at least two parameters. Since this is a PDF over distances, it should be defined over the non-negative real line. The test statistic will, of course, have to be the same as before.

Put some thought into the priors for this model. They will be important when we come to comparing the two models quantitatively in the next notebook.

Copy/paste your implementation of ExponentialModel and edit it to implement some other model (give the class a different name, naturally).
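Purely as an illustration of the structure (the choice of distribution is yours; a gamma distribution is just one possibility, and the priors are left as TBC), a skeleton might look like:

```python
class GammaModel(Model):
    """Example skeleton only: gamma sampling distribution with shape alpha and scale beta (both > 0)."""
    def log_prior(self, alpha, beta):
        raise NotImplementedError  # TBC: choose proper (normalizable) priors

    def draw_samples_from_prior(self, Ns):
        raise NotImplementedError  # TBC

    def log_likelihood(self, alpha, beta):
        return np.sum(st.gamma.logpdf(y, alpha, scale=beta))

    def sampling_distribution(self, yy, alpha, beta):
        return st.gamma.pdf(yy, alpha, scale=beta)

    def generate_replica_dataset(self, N, alpha, beta):
        return st.gamma.rvs(alpha, scale=beta, size=N)
```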

Test the prior sampling function:

Test the posterior function on a couple of these prior samples:

Test the mock data generator. Here and in the following test, you can replace the call to draw_samples_from_prior with specific parameter values if your prior is so wide that you're unlikely to see anything on the plot.

Test the sampling distribution.

Fit the model. Remember that a thoughtless copy/paste of the corresponding Model1 lines above will not work!

Check the convergence

Remove burn-in.

Since we have multiple parameters this time around, let's look at their covariances. (The size below is good for 2 parameters, but increase it as needed.)

Print the posterior mean.

Visually compare posterior predictions with the data, alongside those of the exponential model.

Finally, compute the p-value.

Does your model appear to reproduce the data better than the exponential one? Does it have a "better" p-value?

So is it a better model, in the sense of being justified by the data? To answer that, we need to keep in mind that a more flexible model (i.e. with more free parameters) will usually be able to fit the data more closely, even if it's only reproducing the particular realization of noise present in our specific data set. In other words, directly comparing the maximum likelihood achieved, or even the p-values, is not enough to say that the more complex model is statistically better at reproducing the features of the observed data. For that, we will need to turn from model evaluation to model comparison. We'll be re-using the chains for both models, so save them below.
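For example (the file names and the name of your second model object are placeholders):

```python
np.save('Model1_samples.npy', Model1.samples)
np.save('Model2_samples.npy', Model2.samples)  # substitute whatever you named your alternative model
```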