Tutorial: Toy Photometry Model with Gibbs Sampling

In this notebook, we will use conjugate Gibbs sampling to solve the inference problem that we previously struggled to carry out on a grid. Have a look back at that tutorial if you need a refresher on the general setup. You will:

Setup

We'll begin by reading in the mock data set you worked with in the grid-based tutorial, and storing it in an Image object, as we did before.

We'll still want to refer to the truth model used to generate the above data:

Fitting for 2 parameters

We'll see below that the conjugate Gibbs approach works well for the full 4-parameter problem. To begin with, however, we'll use the simplified 2-parameter model (with the source position fixed) to introduce the method.

The likelihood

Time for some math. (Recall that conjugate Gibbs sampling sometimes lets us design a very efficient sampler, at the cost of doing some math ourselves.)

Our likelihood in this problem was Poisson, with the mean parameter in each pixel being given by the average number of photons expected from the source in our observation ($\mu_0$), the width of a symmetric Gaussian PSF ($\sigma$), the source position $(x_0,y_0)$, and the position of the pixel in question $(x,y)$.

We previously implicitly alluded to the fact that this likelihood can be factorized into the probability of detecting $N$ photons in total within the image, and the probability of a particular spatial distribution of those photons. Let's formalize this, since it will be helpful in a minute.

Namely, if $N_i$ is the number of photons measured in pixel $i$, with $N=\sum_i N_i$, then

$p(\{N_i\}|\mu_0,\sigma) = \mathrm{Poisson}(N|\mu_0) \prod_{j=1}^N \mathrm{PSF}(x_j-x_0,y_j-y_0)$.

Here the index $j$ runs over the detected photons; equivalently, it runs over the pixels where photons were detected, counting each such pixel once for every photon it contains.

Since we need to do math with this, let's explicitly plug in the form of the PSF. Note that (as we discussed in the previous tutorial) we are going to be naughty and pretend that the PSF evaluated at the center of a pixel is the same as its value integrated over the pixel. This means that the equality above becomes a proportionality (the sampling distribution is no longer normalized), but that detail won't impact us in the end.

$p(\{N_i\}|\mu_0,\sigma) \propto \mathrm{Poisson}(N|\mu_0) \prod_{j=1}^N \mathrm{Normal}(x_j|x_0,\sigma) \, \mathrm{Normal}(y_j|y_0,\sigma)$.

Finally, we can shorten this even more by concatenating $\mathbf{x}-x_0$ and $\mathbf{y}-y_0$ into $\mathbf{z}$, indexed by $k$, writing

$p(\{N_i\}|\mu_0,\sigma) \propto \mathrm{Poisson}(N|\mu_0) \prod_{k=1}^{2N} \mathrm{Normal}(z_k|0,\sigma)$.

This last step is just a notation trick to emphasize that all the displacements in both $x$ and $y$ from $(x_0,y_0)$ tell us about the PSF width, $\sigma$. The product indexed by $k$ thus has twice as many factors as the product over $j$ above.

For convenience, the improved Image class defined above provides direct access to $N$ and arrays of photon $x$ and $y$ positions. As in:
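
For instance, something along these lines (a minimal sketch; the object name `data` and the attribute names `N`, `x`, and `y` are assumptions about the interface, so adjust them to match the class as actually defined):

```python
# Hypothetical interface: adjust names to match the Image class defined above.
print(data.N)          # total number of detected photons in the image
print(data.x[:5])      # x positions of the first few detected photons
print(data.y[:5])      # y positions of the first few detected photons
```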

Conjugacy

The form of the likelihood is now about as simple as we could hope for. Next, we need to work out the conjugate relations for updating each parameter, and decide whether we can live with the form of the prior that this would require.

Relation for $\mu_0$

The dependence of the likelihood on $\mu_0$ is all in a Poisson term, and we see from the repository of all knowledge that the conjugate prior is the Gamma distribution. Since this conjugacy was worked out in the notes, we won't make you derive it here. Lucky you!

Recall that, for conjugate Gibbs sampling, we need to find the posterior for $\mu_0$ conditioned on all other parameters. This is why we can afford to only pay attention to terms in the posterior that contain $\mu_0$ - any term that doesn't must come out in the wash when the conditional posterior is normalized. Much simpler than dealing with marginalization or some other operation! So, using the Poisson-Gamma conjugacy relation, we have that

$\mathrm{Gamma}(\mu_0|\alpha_0,\beta_0) \, \mathrm{Poisson}(N|\mu_0,\ldots) \propto \mathrm{Gamma}(\mu_0|\alpha_0+N,\beta_0+1) = p(\mu_0|\{N_i\};\ldots)$,

where "$\ldots$" stands for all the other parameters: $x_0,y_0,\sigma$.

Again, remember that the posterior must be a normalized PDF. Hence, the RHS of the proportionality is the posterior for $\mu_0$. If we had carefully kept track of all the normalizing factors and computed the evidence, this is what it would work out to be. Aren't we lucky that we didn't need to do so explicitly?
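
For reference, the Gamma density in the shape-rate parametrization used here (the one consistent with the update rule above) is

$\mathrm{Gamma}(\mu_0|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\mu_0^{\alpha-1}\,e^{-\beta\mu_0}$, for $\mu_0 > 0$.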

So we could use conjugate Gibbs sampling for $\mu_0$, but should we? In other words, can we find a prior we can live with within the Gamma family? Yes, in this case. Here are some special cases of the distribution, some of which you can verify by inspection of its PDF:

Relation for $\sigma$

Specifically, we are looking for a conjugate relation for the standard deviation of a Normal likelihood with a known mean, since we are conditioning on $x_0$ and $y_0$. This conjugacy relation is much more annoying to work out, but Wikipedia tells us the update rule for the variance (not the standard deviation, but $\sigma^2$), and that the conjugate prior is the "scaled inverse chi-square" distribution:

$\mathrm{SclInv}\chi^2(\sigma^2|\nu_0,\sigma_0^2) \, \prod_{k=1}^{2N}\mathrm{Normal}(z_k|0,\sigma,\ldots) \propto \mathrm{SclInv}\chi^2\left(\sigma^2\left|\nu_0+2N, \frac{\nu_0\sigma_0^2 + \sum_{k=1}^{2N} z_k^2}{\nu_0+2N}\right.\right) = p(\sigma^2|\{N_i\};\ldots)$.

Again, the $2N$s here come about because, for each of the $N$ detected photons, both their $x$ and $y$ displacements from the source position tell us something about $\sigma$.
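
For reference, and to help with the inspection suggested below, the scaled inverse $\chi^2$ density can be written as

$\mathrm{SclInv}\chi^2(\sigma^2|\nu,\sigma_0^2) = \frac{(\nu\sigma_0^2/2)^{\nu/2}}{\Gamma(\nu/2)}\,\frac{\exp\left(-\frac{\nu\sigma_0^2}{2\sigma^2}\right)}{(\sigma^2)^{1+\nu/2}}$, for $\sigma^2 > 0$.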

The scaled inverse $\chi^2$ distribution is a little uglier than we're used to, but not all that hard to deal with due to its close relationship with the $\chi^2$ distribution. Again, we can find values or limits of the parameters that are compatible with some of the "uninformative" choices we might want to make. See if these make sense by inspection of the PDF (all of these have non-positive-integer degrees of freedom, which is admittedly bizarre, but the math works out):

One of these is equivalent to the prior we used before, $p(\sigma)\propto \sigma^{-1}$, but it may not be the one you think! Remember that we've made a non-linear change of variables here ($\sigma$ to $\sigma^2$), and that, in general, the prior density is not invariant under such a transformation. Use what you learned way back in Probability Transformations to work out the prior we want on $\sigma^2$, and hence which of the hyperparameter choices above we should use.

Implementation

So, now we have rules for drawing samples from the posterior distributions of both parameters. In fact, we even have the posterior distribution in closed form, since in this case it factors cleanly into $p(\mu_0,\sigma|\{N_i\})=p(\mu_0|\{N_i\})\,p(\sigma|\{N_i\})$. That is, we won't actually need to generate a Markov chain; we can sample directly from the posterior, without correlation between samples! This is called "independence sampling".

Later, we'll see that this is no longer the case when we introduce the source $x$ and $y$ positions as free parameters; then we will need to sample from the fully conditional posterior of each parameter in turn, producing a Markov chain of samples that approximate draws from the full posterior.

But, before tackling that, let's get some code together to produce samples from the simpler case we're starting with, with just the two free parameters. Note that sampling from the scaled inverse $\chi^2$ distribution is slightly involved, but not too complicated; it just follows from the definition of the distribution, so you ultimately end up drawing from a $\chi^2$ distribution and applying some transformations to those random draws.
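
As a concrete illustration of that recipe, here is one way it might be coded up with scipy (a sketch; the function name is ours): if $X \sim \chi^2(\nu)$, then $\nu\sigma_0^2/X$ follows the scaled inverse $\chi^2$ distribution with parameters $\nu$ and $\sigma_0^2$.

```python
import scipy.stats as st

def draw_scaled_inv_chisq(nu, sigsq0, size=None):
    """Draw sigma^2 ~ SclInv-chi^2(nu, sigsq0) by transforming a chi^2 draw."""
    x = st.chi2.rvs(df=nu, size=size)  # X ~ chi^2(nu)
    return nu * sigsq0 / x             # nu*sigsq0/X ~ SclInv-chi^2(nu, sigsq0)
```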

The cell below just defines dictionaries for the model parameters and prior hyperparameters as usual. Fill in the correct hyperparameters for the $\sigma^2$ prior based on your answer above. Note that, because we can do independence sampling here, we don't actually need starting guesses for $\mu_0$ and $\sigma$!

Now for the fun part: write a function that takes the data, model parameters and prior hyperparameters as input, and returns a new sample of $\mu_0$ and $\sigma$ based on the posteriors above. Note that you should return $\sigma$, not $\sigma^2$. (There is nothing to this other than taking a square root - post-facto transformations of parameters are simple; the only wrinkle is in choosing the appropriate prior density, as we did above.)

Even though we chose specific hyperparameter values above, you should write your code such that any valid values can be passed.

Note that the scipy implementation of the gamma distribution doesn't have $\alpha$ and $\beta$ as parameters; instead one needs to pass arguments a=alpha and scale=1/beta (for reasons).
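
For example, drawing from $\mathrm{Gamma}(\mu_0|\alpha,\beta)$ might look like this (a sketch with placeholder values):

```python
import scipy.stats as st

alpha, beta = 2.0, 0.5                            # placeholder shape and rate values
mu0_draw = st.gamma.rvs(a=alpha, scale=1.0/beta)  # scipy's `scale` is 1/rate
```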

Let's see what a few samples generated by this function look like.

Results

We could generate lots of samples with a python construction like the one above (because, again, this case is independence sampling). But, instead, let's belabor it as though we actually had to create a Markov chain, where each sample is dependent on the previous one. The cell below does this, updating the params dictionary within a loop (compare with the pseudocode in the notes), and collecting all the samples in an array.
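
As a sketch (assuming a hypothetical function named `independence_sample` with the signature described above, which updates and returns the `params` dictionary), the loop might look like:

```python
import numpy as np

nsamples = 10000
samples = np.zeros((nsamples, 2))   # columns: mu0, sigma
for i in range(nsamples):
    params = independence_sample(data, params, hyperparams)  # hypothetical helper
    samples[i, 0] = params['mu0']
    samples[i, 1] = params['sigma']
```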

We can use plotGTC to quickly visualize the posterior. This package shows us all the 1D marginalized posteriors and every pair of 2D marginalized posteriors (as a contour plot), after some smoothing, in a triangular grid. Dashed lines show the truth values the data were simulated from.
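
A call to plotGTC along these lines should do it (a sketch; the `truth` dictionary used for the dashed lines is hypothetical and would need to match however the truth model is stored above):

```python
import pygtc

fig = pygtc.plotGTC(chains=samples,
                    paramNames=[r'$\mu_0$', r'$\sigma$'],
                    truths=[truth['mu0'], truth['sigma']])  # hypothetical truth dict
```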

Checkpoint: The cell below will compare your samples with some we have generated. They won't be identical, but should be extremely close if you've used the priors and data specified above.

In order to compare to the corresponding grid analysis, let's compute credible intervals with the same code:

Compare the center and width summaries of each parameter with what you got from the grid analysis. Do you have the same results to the precision that one would normally report?

Fitting for 4 parameters

Let's now fit for the source position also ($x_0$ and $y_0$).

Doing the math

Let's explicitly write down the fully conditional posteriors for the case where the source position is free.

Nothing has changed for the total mean number of counts, which remains independent of the other parameters:

$p(\mu_0|\{N_i\};\ldots) = \mathrm{Gamma}(\mu_0|\alpha_0+N,\beta_0+1)$.

The fully conditional posterior for $\sigma^2$ is just as it was before, but we should explicitly admit that we are conditioning on $x_0$ and $y_0$, and show where they enter the expression. Make sure you're happy with this, comparing with our previous equations, before going on.

$p(\sigma^2|\{N_i\};x_0, y_0,\ldots) = \mathrm{SclInv}\chi^2\left(\sigma^2\left|\nu_0+2N, \frac{1}{\nu_0+2N}\left[\nu_0\sigma_0^2 + \sum_{j=1}^N (x_0-x_j)^2 + \sum_{j=1}^N (y_0-y_j)^2 \right]\right.\right)$.

Finally, we need the fully conditional posteriors for $x_0$ and $y_0$. Each of these parameters is, independently, the mean of a normal distribution with standard deviation $\sigma$ in our model. Looking it up, we see that the conjugate prior is also normal. This is a little bit fiddly to show, but relatively straightforward using the Gaussian identities you've worked with previously. Whether or not you take the time to work this out yourself, it's worth checking that the way the information from the prior and data are combined in the expressions below makes sense.

Denoting the hyperparameters of the conjugate priors $(m_x,s_x,m_y,s_y)$, we have

$p(x_0|\sigma, \{N_i\}) = \mathrm{Normal}\left(x_0\left|\left[\frac{1}{s_x^2}+\frac{N}{\sigma^2}\right]^{-1}\left[\frac{m_x}{s_x^2}+\frac{\sum_j x_j}{\sigma^2}\right], \left[\frac{1}{s_x^2}+\frac{N}{\sigma^2}\right]^{-1/2}\right.\right)$,

$p(y_0|\sigma, \{N_i\}) = \mathrm{Normal}\left(y_0\left|\left[\frac{1}{s_y^2}+\frac{N}{\sigma^2}\right]^{-1}\left[\frac{m_y}{s_y^2}+\frac{\sum_j y_j}{\sigma^2}\right], \left[\frac{1}{s_y^2}+\frac{N}{\sigma^2}\right]^{-1/2}\right.\right)$.
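
To see where these expressions come from, note that, as a function of $x_0$, the prior times the likelihood is a product of Gaussians, so collecting terms in the exponent (and dropping those that don't involve $x_0$) gives

$\mathrm{Normal}(x_0|m_x,s_x) \prod_{j=1}^N \mathrm{Normal}(x_j|x_0,\sigma) \propto \exp\left[-\frac{x_0^2}{2}\left(\frac{1}{s_x^2}+\frac{N}{\sigma^2}\right) + x_0\left(\frac{m_x}{s_x^2}+\frac{\sum_j x_j}{\sigma^2}\right)\right]$,

which, after completing the square, is a Gaussian in $x_0$ with exactly the mean and standard deviation quoted above (and similarly for $y_0$).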

In the grid solution, we used uniform priors for $x_0$ and $y_0$ confined to the image. This is not something that a Gaussian distribution can reproduce. On the other hand, we know from that analysis (and could reasonably suspect in this case) that $x_0$ and $y_0$ will be constrained well enough that they will not be consistent with the image edges, in which case the functional difference between a uniform prior over the image and an improper uniform prior over the whole $x_0$-$y_0$ plane is minimal. Let's therefore adopt the improper uniform prior, which corresponds to the limit $s_x,s_y\rightarrow\infty$ (at which point the values of $m_x$ and $m_y$ cease to be important). However, as before, you should write your code such that any valid hyperparameters can be used. Here we add them to the hyperparameters dictionary:
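
A sketch of what that might look like (the dictionary name `hyperparams` and its keys are assumptions; use whatever conventions you set up earlier):

```python
import numpy as np

# Improper uniform priors on x0 and y0 correspond to the limit s -> infinity,
# at which point the values of m_x and m_y are irrelevant (set to 0 for definiteness).
hyperparams['mx'] = 0.0
hyperparams['sx'] = np.inf
hyperparams['my'] = 0.0
hyperparams['sy'] = np.inf
```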

Implementation

Remember that we can no longer use independence sampling. Now each parameter must be updated in turn, meaning the new value of one parameter is used when updating the next parameter, and so on. So, let's explicitly write separate functions for updating different sets of parameters. The first two can, naturally, re-use code you've written above.

The updates for $x_0$ and $y_0$ will naturally be very similar, but let's keep them in different functions anyway.
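
For instance, the $x_0$ update might look like this (a sketch, reusing the hypothetical `data` attributes and dictionary keys from the earlier sketches):

```python
import numpy as np
import scipy.stats as st

def update_x0(data, params, hyperparams):
    """Draw x0 from its fully conditional posterior (normal-normal conjugacy)."""
    precision = 1.0/hyperparams['sx']**2 + data.N/params['sigma']**2
    mean = (hyperparams['mx']/hyperparams['sx']**2
            + np.sum(data.x)/params['sigma']**2) / precision
    params['x0'] = st.norm.rvs(loc=mean, scale=precision**-0.5)
    return params
```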

Let's test all of that by calling each function and verifying that all the parameters changed (to finite, allowed values).

Results

As before, we can fill in an array with samples generated with the functions above. Note that, this time, the for loop is necessary, since we can't fill in row $i$ without knowing the contents of row $(i-1)$. The order of the individual parameter updates is arbitrary, and could even be randomized if you particularly wanted to.
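
A sketch of the Gibbs loop, assuming update functions named as in the sketches above (each of which updates and returns the `params` dictionary):

```python
import numpy as np

nsamples = 10000
samples4 = np.zeros((nsamples, 4))   # columns: mu0, sigma, x0, y0
for i in range(nsamples):
    # Each update conditions on the most recent values of the other parameters.
    params = update_mu0(data, params, hyperparams)    # hypothetical function names
    params = update_sigma(data, params, hyperparams)
    params = update_x0(data, params, hyperparams)
    params = update_y0(data, params, hyperparams)
    samples4[i] = [params['mu0'], params['sigma'], params['x0'], params['y0']]
```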

Let's do the most basic (yet still extremely important) visual check to see how our sampler performed, looking at traces of the Markov chain for each parameter. (It's ok if you haven't read the notes on MCMC Diagnostics yet; we will go more in-depth later.) These trace plots show the value of each parameter as a function of iteration, and we'll add a line showing the value that was used to create the mock data.
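
A minimal version of such trace plots, assuming matplotlib and the `samples4` array and hypothetical `truth` dictionary from the sketches above:

```python
import matplotlib.pyplot as plt

param_names = ['mu0', 'sigma', 'x0', 'y0']
fig, axes = plt.subplots(len(param_names), 1, figsize=(8, 10), sharex=True)
for j, (ax, name) in enumerate(zip(axes, param_names)):
    ax.plot(samples4[:, j])                        # parameter value vs. iteration
    ax.axhline(truth[name], color='C1', ls='--')   # value used to create the mock data
    ax.set_ylabel(name)
axes[-1].set_xlabel('iteration')
```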

Note that if you started with pretty reasonable parameter values, which is likely if you've been following the notebook top to bottom, it's entirely possible that there isn't a clear burn-in phase that needs to be thrown out.

We can look at the triangle-plot summary of the posterior as before:

Checkpoint: Once again, you can compare these to our own solutions; the posteriors should look very similar.

Let's find the credible intervals, for completeness. We won't bother with the goodness of fit, since the grid analysis did address that.

Any qualitative observations on how the marginalized posteriors compare to their counterparts from the grid analysis?

We weren't overly concerned with the starting point for the test chain above. But, for later notebooks, we'll want to see how multiple, independent chains with different starting points behave when using this method. The cell below will take care of running 4 chains, started at random yet broadly reasonable positions.

Now we can look at a more colorful version of the trace plots, showing all of the chains simultaneously:

Here you might be able to see some extreme values as the chains burn in, before they settle down to sampling the posterior distribution.

Save them for later, and we're done!

Parting thoughts

There you have it - at the cost of engaging our brains, you have now fit a model with enough free parameters to make a grid-based solution uncomfortably slow. Conjugate Gibbs sampling is often, though not always, highly efficient when it's applicable. Problems that are fully conjugate are not all that common, although, as you'll see later on, it's sometimes possible to perform conjugate updates of a subset of parameters in a complex model, using other strategies for the remaining parameters.

Bonus

If you have already done the Metropolis Sampling tutorial, comment on the differences you see between the chains that the two methods produce.