Tutorial: Comparison of Galaxy Cluster Centering Models

In the previous tutorial, we evaluated the "goodness of fit" of two different models on a data set representing the distribution of centering offsets in a galaxy cluster sample. Here we will continue by doing two styles of quantitative test to compare the two models, with the goal of deciding whether an improved goodness of fit actually justifies the additional complexity of the second model. Specifically, you will calculate the Deviance Information Criterion (DIC) for each model, and the Bayesian evidence for each model in two ways: by simple Monte Carlo integration and by nested sampling with dynesty.

Getting set up

First, we'll want to take advantage of your work in the previous tutorial. As irritating as it may be, paste the (completed) definitions of Model, ExponentialModel and your alternative model below.

Next, we'll read in the chains for each model that you produced. (You'll need to fill in the name of your model class.)
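For instance, something along these lines (the file names and the Model2 class name are placeholders for whatever you actually used):

```python
import numpy as np

# Placeholder file names - substitute whatever you saved your chains as.
# ndmin=2 keeps the shape (Nsamples, Nparams) even for a 1-parameter model.
chain1 = np.loadtxt('model1_chain.txt', ndmin=2)
chain2 = np.loadtxt('model2_chain.txt', ndmin=2)

# Instantiate the models defined above; Model2 stands in for your alternative class.
model1 = ExponentialModel()
model2 = Model2()
```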

If you're as lazy as I am, your implementation above relied on the data y being at global scope, so here they are again. If there were any other global-variable dependencies I haven't anticipated, you'll need to reproduce their definitions here as well.

1. Calculate the DIC for each model

Recall that the Deviance Information Criterion is given by:

$\mathrm{DIC} = \langle D(\theta) \rangle + p_D; \quad p_D = \langle D(\theta) \rangle - D(\langle\theta\rangle)$

where $\theta$ are the parameters of a model, the deviance $D(\theta)=-2\log P(\mathrm{data}|\theta)$, and averages $\langle\rangle$ are over the posterior distribution of $\theta$.

Write a function to compute this.
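Here is one possible sketch, assuming the chains are arrays of shape (Nsamples, Nparams) and that your Model class provides a log_likelihood(*params) method (adjust to whatever interface you actually defined):

```python
import numpy as np

def DIC(model, chain):
    """Estimate the DIC, p_D and mean deviance from a posterior chain."""
    # Deviance of each posterior sample: D(theta) = -2 log P(data|theta)
    D = np.array([-2.0 * model.log_likelihood(*params) for params in chain])
    D_avg = D.mean()                                              # <D(theta)>
    D_at_mean = -2.0 * model.log_likelihood(*chain.mean(axis=0))  # D(<theta>)
    pD = D_avg - D_at_mean
    return D_avg + pD, pD, D_avg
```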

Compute the DIC for each model.
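For example, using the hypothetical names above:

```python
DIC1, pD1, Dbar1 = DIC(model1, chain1)
DIC2, pD2, Dbar2 = DIC(model2, chain2)
print('Model 1: DIC =', DIC1, ' p_D =', pD1)
print('Model 2: DIC =', DIC2, ' p_D =', pD2)
```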

Checkpoint: For Model 1 (the exponential), I get $p_D \approx 1.0$ and DIC $\approx 1668$. As with anything else computed from chains, there will be some stochasticity to the values you compute.

Do your values of $p_D$ make intuitive sense?

Now, to interpret this, we can compare the (hopefully) reduced DIC of Model 2 with the DIC of Model 1, referring their difference to the Jeffreys scale (see the notes). By this metric, is your second model better at explaining the data than the exponential model?
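Concretely, that comparison is just the difference of the two numbers computed above:

```python
dDIC = DIC1 - DIC2   # positive if Model 2 has the lower (better) DIC
print('DIC(Model 1) - DIC(Model 2) =', dDIC)
# One common heuristic is to treat exp(dDIC/2) like an evidence ratio
# when placing the difference on the Jeffreys scale.
print('exp(dDIC/2) =', np.exp(0.5 * dDIC))
```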

2. Compute the evidence by Monte Carlo integration

To do this, note that

$p(\mathrm{data}|H)=\int d\theta \, p(\mathrm{data}|\theta,H) \, p(\theta|H)$

can be approximated by averaging the likelihood over samples from the prior:

$p(\mathrm{data}|H) \approx \frac{1}{m}\sum_{k=1}^m p(\mathrm{data}|\theta_k,H)$, with $\theta_k\sim p(\theta|H)$.

This estimate is much more straightforward than trying to use samples from the posterior to calculate the evidence (which would require us to be able to normalize the posterior, which would require an estimate of the evidence, ...). But in general, and especially for high-dimensional parameter spaces, it is very inefficient (because the likelihood is typically large in only a small fraction of the prior volume). Still, let's give it a try.

Write a function to draw a large number of samples from the prior and use them to calculate the evidence. To avoid numerical over/underflows, use the special scipy function logsumexp (which we imported directly, way at the top of the notebook) to do the sum. As the name implies, this function is equivalent to log(sum(exp(...))), but is more numerically stable.
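One possible sketch, assuming (as suggested in the previous notebook) that the Model object keeps a dictionary of scipy.stats prior distributions, in the same order that log_likelihood expects its arguments:

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence(model, N=1000):
    """Simple Monte Carlo estimate of log p(data|H) from N draws from the prior."""
    logL = np.empty(N)
    for k in range(N):
        params = [p.rvs() for p in model.priors.values()]  # theta_k ~ p(theta|H)
        logL[k] = model.log_likelihood(*params)
    # log[ (1/N) * sum_k exp(logL_k) ], computed stably
    return logsumexp(logL) - np.log(N)
```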

Do a quick test to check for NaNs:
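For example:

```python
print(log_evidence(model1, N=10), log_evidence(model2, N=10))  # should be finite, not NaN
```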

Roughly how precisely do we need to know the log-evidence in order to compare models? Run log_evidence with different values of N (the number of prior samples in the average) until you're satisfied that you're getting a usefully accurate result for each model.
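One way to gauge the scatter is to repeat the estimate a few times at each N, e.g.:

```python
for N in [1000, 10000, 100000]:
    estimates = [log_evidence(model1, N=N) for _ in range(5)]
    print(N, np.mean(estimates), np.std(estimates))
```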

So, we have log evidences computed for each model. Now what? We just compare their difference to the Jeffreys scale again:
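For instance, keeping the final estimates around for later:

```python
logE1 = log_evidence(model1, N=100000)  # adjust N based on the test above
logE2 = log_evidence(model2, N=100000)
print('log Bayes factor (Model 2 vs Model 1):', logE2 - logE1)
```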

Note that we might end up with a different conclusion about the strength of any preference for Model 2 than the DIC gave us! The reason is that the evidence explicitly accounts for the information in the prior (which, recall, counts as part of the model definition), while the DIC does so much less directly.

We could also be good Bayesians and admit that there should be a prior distribution in model space. For example, maybe I have a very compelling theoretical reason why the offset distribution should be exponential (I don't, but just for example). Then, I might need some extra convincing that an alternative model is required.

We would then compute the ratio of the posterior probabilities of the models as follows:
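A sketch, using an arbitrary illustrative prior ratio of 100:1 in favor of Model 1:

```python
prior_ratio = 100.0  # P(H1)/P(H2); purely illustrative
# P(H2|data) / P(H1|data) = [p(data|H2) / p(data|H1)] * [P(H2) / P(H1)]
log_posterior_ratio = (logE2 - logE1) - np.log(prior_ratio)
print('Posterior odds of Model 2 over Model 1:', np.exp(log_posterior_ratio))
```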

Depending on the fitness of the alternative model you chose, you may find that only an extremely lopsided prior in model space would influence your conclusion.

Comment on what you find from the evidence, compared with the DIC.

3. Compute the evidence with dynesty

To get some experience with a package that uses nested sampling to compute the evidence, let's repeat Section 2 using dynesty.

Looking at the docs, we first need a function that maps the unit cube onto our prior. That is, the code is doing a substitution like

$\int d\theta\,p(\theta|H)\,p(\mathrm{data}|\theta,H) = \int_0^1 dF \,p\left[\mathrm{data}|\theta(F),H\right]$,

where $F=\int_{-\infty}^\theta d\theta'\,p(\theta'|H)$ is the cumulative distribution function of the prior (hence the identity $dF/d\theta = p(\theta|H)$ makes the equation above work out).

If our priors are uniform, the transformation from $F$ to $\theta$ is a simple shift and rescaling, but for other priors it involves going through the prior distribution's quantile function (inverse CDF). Fortunately, we assumed uniform priors for Model 1! However, if you followed the advice in the previous notebook to keep a dictionary of univariate priors in the Model object, and those priors are scipy.stats distributions, it's relatively easy to handle this slightly more general case using functions provided by scipy. Regardless of how you choose to do it, implement a function that performs this transformation for the 1-parameter Exponential model (and priors) below.
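Here is one possible implementation, assuming a single parameter whose scipy.stats prior is stored in model1.priors (the quantile function ppf is exactly the inverse CDF we need):

```python
import numpy as np

def ptform1(u):
    """Map a point u in the (1D) unit cube to the exponential model's parameter space."""
    prior = list(model1.priors.values())[0]
    return np.array([prior.ppf(u[0])])
```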

Checkpoint: Do a sanity check - the cell below should return the quartiles of the uniform prior you chose.
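That is, something like:

```python
print(ptform1([0.25]), ptform1([0.5]), ptform1([0.75]))
```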

We'll also need a log-likelihood function that takes a vector of parameters as input, in contrast to how it's written in the Model class. This could be solved by using the ridiculous construction in the Model.log_posterior method, or we could just define:
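For example, a thin wrapper along these lines (the name loglike1 is just a placeholder):

```python
def loglike1(params):
    # dynesty passes a parameter vector; unpack it for the Model interface
    return model1.log_likelihood(*params)
```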

Then we can just go ahead and run.
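A minimal invocation, following the standard dynesty pattern (the exponential model has one free parameter):

```python
import dynesty

sampler1 = dynesty.NestedSampler(loglike1, ptform1, 1)  # log-likelihood, prior transform, ndim
sampler1.run_nested()
results1 = sampler1.results
```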

And look at some stuff (see the Dynesty documentation for an explanation of what all this is):
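For instance, the standard dynesty plotting helpers:

```python
from dynesty import plotting as dyplot

dyplot.runplot(results1)    # run summary, including the evidence panel
dyplot.traceplot(results1)  # parameter trace and 1D posterior
```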

The Evidence panel appears to be utterly unhelpful in this case (at least when I ran it), so below is a plot of the log-evidence as a function of iteration, discarding the beginning when the accumulated evidence is truly tiny. It should show us converging to a value similar to what we found earlier.
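One way to make that plot, using the accumulated log-evidence stored in results1.logz:

```python
import matplotlib.pyplot as plt

skip = len(results1.logz) // 10  # discard the early iterations, where the evidence is negligible
plt.plot(range(skip, len(results1.logz)), results1.logz[skip:])
plt.xlabel('iteration')
plt.ylabel('accumulated log-evidence');
```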

This extracts the log-evidence at the last iteration, which may or may not be the approved method.
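That is:

```python
logZ1_dynesty = results1.logz[-1]
print(logZ1_dynesty)
```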

Let's compare with what we got from simple Monte Carlo:
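For example:

```python
print('simple Monte Carlo:', logE1, '  dynesty:', logZ1_dynesty)
```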

That was so easy, let's go ahead and do the same for Model 2. Implement the transformation from the unit cube to your model's parameter space:
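Since the parameters here are your own, only a generic sketch is possible; this assumes the same convention as above (a dict of scipy.stats priors, in the same order as your log-likelihood's arguments):

```python
def ptform2(u):
    # Map each unit-cube coordinate through the corresponding prior's quantile function
    return np.array([prior.ppf(ui) for ui, prior in zip(u, model2.priors.values())])
```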

And run this elegant definition:
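For instance:

```python
def loglike2(params):
    return model2.log_likelihood(*params)
```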

Now we can go ahead and run the sampler.
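Again assuming the priors dict gives the dimensionality:

```python
ndim2 = len(model2.priors)  # number of free parameters in your alternative model

sampler2 = dynesty.NestedSampler(loglike2, ptform2, ndim2)
sampler2.run_nested()
results2 = sampler2.results
```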

... and produce the same set of plots.

Finally, extract the final evidence,

... and compare it to the SMC result.

Given the work you'd already done, that was (I hope) staggeringly easy to set up, even if it did take longer to compute the same answer (I find) in this case. In higher-dimensional parameter spaces, it's easy to imagine this method being far preferable to blindly sampling from the prior. Note that we ran dynesty out of the box here - there are many options (and some potential failure modes) that you should read about before using it in the real world.