Tutorial: Toy Photometry with Metropolis Sampling

In this notebook, we will use Metropolis sampling to solve the inference problem that we previously struggled to carry out on a grid. Have a look back at that tutorial if you need a refresher on the general setup. You will:

* implement the Metropolis algorithm for this problem;
* use it to fit the 2-parameter and 4-parameter models;
* visually inspect the resulting chains, and compare your constraints with those from the grid analysis.

Setup

We'll begin by reading in the mock data set you worked with in the grid-based tutorial, and storing it in an Image object, as we did before.

We'll still want to refer to the truth model used to generate the above data:

Model implementation

This time we're going to do a fit using the Metropolis algorithm. Unlike the case of Gibbs sampling, this doesn't require (conversely: doesn't get to directly exploit) any nice mathematical properties of the posterior, so you can recycle the log_prior and log_likelihood functions from your grid-based analysis.

And here is the free log-posterior function again:

As always, let's check that these functions run without obvious bugs and return finite values.

Sampler implementation

Next, we need a proposal distribution. I'll use a multivariate Gaussian centered on the current position. This is translationally invariant, so later on we can use the simple Metropolis acceptance rule instead of the slightly more complex Metropolis-Hastings rule. A Gaussian isn't necessarily the best choice in general, since the most likely proposals are very small steps, but it will do for the moment.

To further keep it simple, let's make the proposal independent in each parameter (a diagonal covariance matrix for the 4-dimensional Gaussian). Similarly to the grid method, you'll want to guess the appropriate order of magnitude for steps in each parameter, which is the same order as the width of the posterior, and you may need to return to this point to adjust them after seeing the performance.

Since we're assuming a diagonal covariance, let's go ahead and just represent the proposal distribution as 4 univariate Gaussians, as in the dictionary below.

Aside: you may not have seen it before in this class, but calling scipy.stats.norm() as below produces what scipy calls a "frozen" probability distribution object, with fixed parameters. The standard deviation is whatever you specify via the scale argument, and the (unspecified) mean remains at its default value of 0.0. This means that each entry should be interpreted as a distribution for the displacement of a proposal from the current position of a chain. In other words, the proposal distribution density for a change to $x_0$ could be computed with proposal_distribution['x0'].pdf(x0_proposed - x0_current). Other methods of scipy probability distributions, in particular random number generation, are also available; the main difference from how we've used them before is that we can't (and don't need to) specify parameter values in the call, because they've already been frozen in.
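As a concrete sketch, such a dictionary might look like the following. The parameter names and proposal scales here are assumptions for illustration; the scales should be of the same order as the posterior width in each parameter, and will likely need tuning.

```python
import scipy.stats as st

# Hypothetical proposal scales -- tune these after seeing the sampler's
# performance. Each entry is a frozen Gaussian (mean 0) for the
# *displacement* of a proposal from the current chain position.
proposal_distribution = {
    'x0':    st.norm(scale=0.1),
    'y0':    st.norm(scale=0.1),
    'mu0':   st.norm(scale=0.1),
    'sigma': st.norm(scale=0.05),
}
```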

Next, define a function that returns a proposed point in parameter space, given the current location and the above dictionary of proposal distributions.

Technical note: You might be tempted to begin this function with a command like proposal = current_params. If so, remember that, in Python, b = a does not make a copy of a if a is a dictionary (or a numpy array, for that matter); both b and a would point to the same data in memory. The safest/quickest way to get a new dictionary with the same structure as a, whose values can then be safely overwritten, is b = a.copy() (a shallow copy, which is fine here because the values are floats).
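A minimal implementation might look like this, assuming a dictionary of frozen scipy distributions as above (the function and argument names are illustrative, not prescribed):

```python
import scipy.stats as st

def propose(current_params, proposal_dists):
    # Shallow-copy so the current position is not modified in place
    proposal = current_params.copy()
    for name, dist in proposal_dists.items():
        # Each entry is a distribution for the displacement from the
        # current position, so add a random draw to the current value
        proposal[name] = current_params[name] + dist.rvs()
    return proposal
```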

See if it works:

Finally, the sampler itself. Write a function that takes the current parameter values and log-posterior value as input (along with the data and proposal distributions), and returns the next set of parameter values and corresponding log-posterior as a tuple. These can be identical to the inputs, if the proposal is rejected.
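One way such an update could be written is sketched below, using the symmetric-proposal Metropolis acceptance rule. The signature is an assumption: here the data are taken to be baked into a log_posterior function via a closure, rather than passed separately.

```python
import numpy as np
import scipy.stats as st

def metropolis_step(current_params, current_lnpost, log_posterior,
                    proposal_dists, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Propose a new point by perturbing each parameter independently
    trial = current_params.copy()
    for name, dist in proposal_dists.items():
        trial[name] = current_params[name] + dist.rvs(random_state=rng)
    trial_lnpost = log_posterior(trial)
    # Metropolis rule: accept with probability min(1, exp(delta ln P)).
    # Comparing logs avoids overflow when the difference is large.
    if np.log(rng.uniform()) < trial_lnpost - current_lnpost:
        return trial, trial_lnpost
    return current_params, current_lnpost
```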

And, again, make sure it works without crashing:

Fitting for 2 parameters

There's no fundamental difference between fitting the 2-parameter and 4-parameter models with this method. But, for symmetry with the other notebooks, let's start with the 2-parameter fit (with $x_0$ and $y_0$ fixed to the truth) anyway.

Given the work you've done above, a quick and dirty way to accomplish this is to use a delta-function proposal distribution for $x_0$ and $y_0$, ensuring that they never change in value. That way we don't need to change any of the existing code to explicitly account for some parameters being fixed.
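Since scipy.stats has no degenerate (delta-function) distribution, a tiny stand-in class is one way to do this; the class and parameter names below are illustrative assumptions.

```python
import scipy.stats as st

class DeltaProposal:
    """Degenerate 'distribution' that always proposes zero displacement."""
    def rvs(self, random_state=None):
        return 0.0

# Proposals for the 2-parameter fit: x0 and y0 never move.
# (Parameter names and scales here are assumptions.)
proposal_2param = {
    'x0':    DeltaProposal(),
    'y0':    DeltaProposal(),
    'mu0':   st.norm(scale=0.1),
    'sigma': st.norm(scale=0.05),
}
```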

Now let's run a chain, starting the free parameters, $\mu_0$ and $\sigma$, from random but broadly reasonable values.

Make sure you understand how the cell below is making use of the functions you defined above.

Let's do the most basic (yet still extremely important) visual check to see how our sampler performed, looking at traces of the Markov chain for each parameter. (It's ok if you haven't read the notes on MCMC Diagnostics yet; we will go more in-depth later.) These trace plots show the value of each parameter as a function of iteration, and we'll add a line showing the value that was used to create the mock data.
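A trace plot can be sketched as follows; the array layout (steps by parameters) and function name are assumptions, not the tutorial's own code.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend for scripting
import matplotlib.pyplot as plt

def plot_traces(chain, param_names, truths=None):
    # chain: array of shape (nsteps, nparams); one panel per parameter
    fig, axes = plt.subplots(len(param_names), 1,
                             figsize=(8, 2 * len(param_names)), squeeze=False)
    for j, name in enumerate(param_names):
        ax = axes[j, 0]
        ax.plot(chain[:, j], lw=0.5)
        if truths is not None:
            # value used to create the mock data
            ax.axhline(truths[j], color='red', ls='--')
        ax.set_ylabel(name)
    axes[-1, 0].set_xlabel('iteration')
    return fig
```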

Exactly what you see here (in particular, how correlated each sequence appears and how many repeated values there are) will depend on the width of your proposal distributions. But, hopefully, you can see the sequence finding its way to a part of parameter space that it likes, and then continuing to jump around there. If you want to, you can go back and adjust the proposal scales to be smaller (if there are many rejected proposals) or larger (if the accepted steps are tiny compared with the long-term variation).

We can use plotGTC to quickly visualize the posterior. This package shows us all the 1D marginalized posteriors and every pair of 2D marginalized posteriors (as a contour plot), after some smoothing, in a triangular grid. However, before doing so, we want to remove the "burn-in" period during which the chain is finding its way to the region of high posterior probability. I set this to be the initial 1000 steps below, but you might need to adjust it for your results (though if your burn-in phase is longer than this, it might be worth re-running with a smaller proposal scale).

Dashed lines show the truth values the data were simulated from.

Checkpoint: The cell below will compare your samples with some we have generated. They won't be identical, but should be extremely close if you've used the priors and data specified above.

In order to compare to the corresponding grid analysis, let's compute credible intervals with the same code:
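The tutorial's shared interval code isn't reproduced here; as a generic stand-in, a central credible interval can be read off directly from sample percentiles:

```python
import numpy as np

def central_credible_interval(samples, level=0.683):
    # Central interval enclosing `level` of the marginal posterior samples,
    # summarized as (median, lower edge, upper edge)
    lo, hi = np.percentile(samples, [50.0 * (1.0 - level),
                                     50.0 * (1.0 + level)])
    return np.median(samples), lo, hi
```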

Compare the center and width summaries of each parameter with what you got from the grid analysis. Do you have the same results to the precision that one would normally report?

Fitting for 4 parameters

Let's now fit for the source position also ($x_0$ and $y_0$). All we really need to change compared with the last section is to use the original proposal distribution dictionary, which contained non-delta-function proposals for all the parameters. We'll again start at a random but broadly reasonable point in parameter space.

Let's have a look at the traces:

Again, remove the clear burn-in phase and plot the triangle:

Checkpoint: Once again, you can compare these to our own solutions; the posteriors should look very similar.

Let's find the credible intervals, for completeness. We won't bother with the goodness of fit, since the grid analysis already addressed that.

Any qualitative observations on how the marginalized posteriors compare with their counterparts from the grid analysis?

For later notebooks, we'll want to see how multiple, independent chains with different starting points behave when using this method. The cell below will take care of running 4 chains, started at different positions.
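The pattern of running several independent chains from dispersed starting points can be sketched as below. This uses a toy 1D posterior so the sketch is self-contained; in the tutorial you would instead loop over your own sampler with the 4-parameter proposal dictionary.

```python
import numpy as np

def run_chain(log_post, start, prop_scale, nsteps, rng):
    # Minimal 1D Metropolis chain, for illustration only
    x, lnp = start, log_post(start)
    chain = np.empty(nsteps)
    for i in range(nsteps):
        trial = x + rng.normal(scale=prop_scale)
        trial_lnp = log_post(trial)
        if np.log(rng.uniform()) < trial_lnp - lnp:
            x, lnp = trial, trial_lnp
        chain[i] = x
    return chain

# Four chains started at different, over-dispersed positions
log_post = lambda x: -0.5 * x**2            # toy standard-normal posterior
rng = np.random.default_rng(123)
starts = rng.uniform(-5.0, 5.0, size=4)
chains = [run_chain(log_post, s, 1.0, 2000, rng) for s in starts]
```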

Now we can look at a more colorful version of the trace plots, showing all of the chains simultaneously:

We'll save them for later, without removing burn-in.

Parting thoughts

There you have it - at the cost of some random walking, you have now fit a model with enough free parameters to make a grid-based solution uncomfortably slow. Metropolis and Metropolis-Hastings are extremely widely applicable and therefore powerful, though one needs to think about how to provide an efficient proposal distribution. We'll see the consequences of not doing so in MCMC Diagnostics and some intelligent proposal strategies in More Sampling Methods.

Bonus

If you have already done the Gibbs Sampling tutorial, comment on the differences you see between the chains that the two methods produce.