Tutorial: Toy Photometry Model on a Grid

This will be your first fully (well, mostly) worked-out inference problem for the course. We're going to keep it unrealistically simple in order to focus on process; you'll solve the same problem using different methods later on. Later tutorials will be based on real-life (if still somewhat simplified) astrophysics inferences.

Specifically, you will

Problem definition

The problem we will address here is that of measuring the position and flux of a point-like source in sparse imaging data. By "sparse", we mean that we have relatively few detected photons in each pixel of our image, such that the sampling distribution is Poisson rather than approximately Gaussian. This is generally the case for observations at X-ray wavelengths and shorter.

"Point-like" in this context means that the angular size of the source is smaller than point-spread-function (PSF) of our telescope, which describes the Green's function response of the optical system. That is, a truly point-like source would appear in our image as a smeared out shape (the PSF), and our source is small enough that it also looks like the PSF, with its true extent being negligible by comparison. Normally the PSF is larger than the pixel size of an image, so we actually do see the shape of the PSF; this is preferable to undersampling the image, in which case a point-like source would show up only in 1 pixel, but we would have no idea where in that pixel it was located.

The most egregious simplifying assumption we will make is that we recieve photons only from the source in question, without there being any backgrounds or other sources to confuse things. If we have a very small PSF and a very bright source, this might actually not be a terrible assumption, though generally we do have to worry about such things. We will also assume that we know the shape of the PSF, an uncorrelated (i.e. symmetric) 2D Gaussian. That is, the image we see is related to the true, unsmeared image by convolution with the kernel

$\mathrm{PSF}(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{1}{2}\left(\frac{x^2+y^2}{\sigma^2}\right)}$.

However, we will not assume that we know the PSF width, $\sigma$, perfectly.

We'll see some features of real imaging data later in the course, but for now let's simply describe the source flux in terms of it's Poisson expectation value for the observation in question, rather than a real flux in units of energy per unit time and area.

To recap, our model is:

What about priors? As always, you are encouraged to experiment to see what different choices might produce. As a baseline, though, let's adopt

(Incidentally, the Jeffreys priors for $x$ and $y$ would be uniform on the real line, so we're pretty much at maximum fanciness for this problem.)

Before continuing with an implementation, go through our usual generative model sanity check: enumerate the model parameters and data, write down the probabilistic relationships among them, and visualize them with a PGM. Keep in mind which parameters will be free vs fixed in our initial analysis, as described above.

Generate data

Since this is a fictional inference problem, we'll need to generate some fake data for you to fit. Fortunately, your expertise with generative models will make this straightforward!

For consistency, use the true model parameter values below in the notebook you turn in:

We'll be simpleminded, and have the $x$ and $y$ coordinates correspond to the 2nd and 1st indices into the image. That is, a $32 \times 32$ image spans pixel-center coordinates $x_i=0,1,\ldots,31$, and similarly for $y$. Please note that "2nd and 1st" is not a typo above; the convention for 2D images is usually that the first index is row (vertical) and the second is column (horizontal). To keep our brains from exploding, please implement this switch so that the horizontal axis of plots corresponds to $x$ and the vertical to $y$.

Produce a mock image of the dimension given above, and we'll visualize it below. There are two approaches one might take here. One is to compute $\mu(x,y)$ over the image (remembering the subtlety regarding the normalization of the PSF noted above) and then draw a Poisson sample for each pixel. The other, equivalently, is to do a single Poisson draw for the total number of photons we'll see, and assign their positions in the image according to the source position and PSF.

Note that you should technically handle the case of photons that land outside of the defined image, since the PSF tails go to infinity. This is relatively unlikely to be an issue for the values above, however.

Have a look. The image should have a good amount of shot (Poisson) noise, but the center and width of the photon distribution should make sense in light of the truth values.

Well done! However, it will be more useful to have everyone working with the same mock data, so that you can compare results with others and with our solutions. So, overwrite your hard work with this data set that was also generated from the model above:

Checkpoint: It should look - broadly - similar to yours.

For convenience, it's nice to also carry around arrays that store the $x$ and $y$ value corresponding to each pixel. This is massively redundant and wasteful from a memory perspective, but it makes it simple to use array arithmetic to, for e.g., compute the distance of every pixel from $(x_0,y_0)$. The code below defines a lightweight class to hold these arrays, along with the actual image.

To be explicit, data.im now points to mock_image, and data.imx and data.imy contain the coordinate information described above:

Implementation

It's time to write functions to evaluate the prior, sampling and posterior distributions.

Normally, we try to evaluate the log of these distributions. This is because floating-point underflows can be an issue, especially when the sampling distribution is a product with many terms. Another benefit is that we don't have to worry about normalizing coefficients that don't depend on model parameters, unless we wanted to compute the absolute value of the evidence for some reason. Since we'll be evaluating over a grid in this notebook, we could always normalize the posterior later by dividing it by its numerical integral. If any of these functions should return zero probability, you can and should have them return a log-probability of $-\infty$.

Complete the function evaluating the log-prior. Remember that we want is to return a probability of 0 when any arguments are out of bounds or would cause numerical errors (e.g. $\sigma \leq 0$) rather than throwing an exception.

As always, we should make sure it returns some kind of value instead of crashing when fed an example parameter dictionary. We can just use truth for this type of check.

Next, implement the log-likelihood/sampling distribution. This is, of course, where all the fun of evaluating the model happens, although I'd suggest outsourcing the evaluation of $\mu(x,y)$ to a separate function; that way we can also easily check that it produces something reasonable. Keep in mind that our model before the PSF smooths things is just a delta function, so it isn't necessary to carry out the convolution in the model numerically. Hint: remember to normalize mean_img so that it contains the appropriate total expected counts.

For good measure, let's see what $\mu(x,y)$ looks like:

and look at its integral, which should not (exercise for the reader) be precisely $\mu_0$:

Finally, the log-posterior. As will often be the case, we will just return the sum of the log-prior and log-likelihood, i.e. we will neglect the normalizing constant (evidence), which is constant with repect to the model parameters.

The construction below is a good one to be in the habit of using. Usually, the likelihood is much more expensive to compute than the prior, and might even crash if passed prior-incompatible parameter values. So it's worth the extra check of a non-zero prior probability before attempting to evaluate the likelihood.

If everything appears to work so far, we should be good to move on.

Evaluating the posterior

We should now be able to define a grid and evaluate the posterior over it, as we've seen before. In practical terms, the extent of the grid defines the extent of our priors, regardless of whether they are uniform or not. As such, we need to ensure that the grid encompasses very close to 100% of the posterior, which means we might need to iteratively repeat this evalutaion.

A resonable approach is to use a numerical optimizer to find the maximum of the log-posterior (or the minimum of its additive inverse). That won't instantly tell us how wide to make the grid in order to contain nearly all the posterior probability, but it tells us a good place to center the grid to begin with.

We'll use scipy.optimize.minimize (imported above as minimize). Refer to its documentation if needed. Define the function that minimize will work on here.

Note that it will want a vector argument rather than a dictionary of parameter values. It's convenient therefore to define a standard ordering for the parameters that we can use any time that's relevant.

Since we're pretending to not know the true parameters, let's start the optimization from an arbitrary guess: $(x_0,y_0)$ at the middle of the image, $\mu_0$ at the sum of the image, and $\sigma$ at a number I made up.

The scipy.optimize.minimize function that we will use has a ton of options, and is probably worth reading the docs on as a matter of general preparation for life. For our purposes, the defaults are probably fine. However, it's a good idea to let the minimizer know that we aren't allowing all real values of all of the parameters, using the bounds option. This is because most of the methods in minimize will numerically approximate the derivatives of mlnpost; depending on our starting point and the typical order of magnitude of the parameters, it's entirely possible for many of the evaluations involved to be outside the allowed region, resulting in infinities returned from the prior, and terrible crashes. The bounds option should be a list of (lower_limit,upper_limit) tuples, where either can be None. Fill in the missing pieces of the list below, in agreement with the priors defined for this problem (we'll provide the last entry, for $\sigma$).

Now we'll do the optimization and print out the best fit:

Assuming this worked, the last line printed above (x) is the location of the optimum.

Checkpoint: while there will be variation due to different mock data sets, your optimum should probably be well within 1 of the truth for x0, y0 and sigma, and different by no more than a few in mu0.

Now, define the bounds and spacing of the grid to evaluate on. You may need to refine these values after seeing the results. You can get an initial guess at the appropriate size in each direction by treating bestfit['hess_inv'] as the covariance of the log-posterior and looking at the square root of its diagonal - this won't be wonderfully accurate, but will get you an order of magnitude.)

Let's say we want to have 100 grid points in each dimension to be able to resolve the posterior reasonably well. (This might be something else to tweak, if it looks necessary.) The code below defines the grids for each parameter.

Next we will define a 4D array to hold the posterior evaluations, and loop over all these grids. But before that, let's take a second to think about how many posterior evaluations we're talking about and how long they might take. In fact, let's begin by benchmarking how long it takes to call the posterior function 1000 times.

On my laptop that took about 0.6 seconds. Undoubtedly, my version of the functions above could have been written much more efficiently, but in the grand scheme of things this is not bad at all. Nevertheless, without thinking too hard, we just proposed to evaluate the posterior $100^4=10^8$ times, which naively translates to more than 16 hours! This is doable with parallelization and/or patience... but let's not.

Of course, with one fewer parameter this would take 100 times less time, which would be way more reasonable (though still longer than we like to make the calculations in these tutorials). And, on the other hand, a 5-parameter model puts us in the realm of a week of runtime.

Already we see the pitfall of working on a grid, which is why we will spend a good amount of time learning about more intelligent approaches later on. For now, let's just revise our sights downward a bit and content ourselves with a much more course grid of dimension 10.

Now the giant nested loop below should take only a few seconds to run.

In any case, we still need to check whether the grid extent is large enough to contain the bulk of the posterior, witout being so large that its course resolution prevents us from seeing anything. A reasonable albeit not foolproof way to do this is to check whether the credible regions of interest are comfortably contained within the grid.

Finding credible regions

Next, we'll go through the steps to find standard credible regions and intervals for the parameters. We can use the machinery you saw in the Credible Regions tutorial by reformating our grid evaluation in a form those functions will like.

Below, finish the calculations that integrate the posterior grid to find the 1D marginalized posteriors for each parameter, storing them in a dictionary that looks like the output of incredible.whist. (This is a single numpy function call in each case.)

Recall that we haven't properly normalized the log-posterior, so it's a good idea to subtract off the maximum log-posterior before exponentiating to avoid numerical under/overflows (that way the maximum posterior value is automatically 1.0 and not something like 1e-300 or 0.0). The given code does this and provides the un-logged posterior in post_grid.

At this point, you are probably looking at some plots where only one point in the marginalized posterior is signficantly non-zero, and maybe some where the posterior doesn't go to zero by the edge. This is a good time to go back (maybe with a few iterations) and redefine the grid extent such that the posterior goes to zero a little inside either edge of these plots. Don't forget to also re-run the cell defining *_values, and re-run the grid evaluation and the cells in this section. And remember that the grid is still very course, to the CI's will never look acceptably well sampled (so we're not even bothering to print them out). We'll wait.

It won't look pretty, but hopefully (checkpoint?) your posteriors are consistent with the truth values (orange vertical lines above).

Once you've done that, before you get too comfortable, recall that the 1D credible intervals for say 95.4% probability will be narrower than any 2D or nD credible regions for the same probability, so if we go on to look at those then this grid might be too tight. We won't... hopefully the point is made that working on a grid (generally in more than 2D) is painful. We would need to spend significant time figuring out what part of parameter space we need to explore, and then waiting to evaluate over a grid with higher resolution than we used here. Instead, we'll just use more sensible methods in future notebooks, and you'll get to see what the results should look like.

Curbing our ambition

Rather than give up on this tutorial entirely, let's simplify further to 2 free parameters, with the source position fixed to the truth. This might be the case if the source is already known and localized to much better than our current PSF by a higher-resolution observation in a different bandpass.

Using the power of arithmetic, I predict that we can evaluate a $100 \times 100$ grid over 2 parameters in the same time it took to do the $10 \times 10 \times 10 \times 10$ grid above. Here's an opportunity to change the grid extend, just in case you need to:

Let's define and run over this 2D grid:

Repeat your magic to find the 1D marginal posteriors

and let's have a look:

You can go back and refine the grid extent if this isn't beautiful enough for you, but it should look like a much better (plausibly smooth) sampling of the posterior.

We may as well also check out the 2D posterior. Here we'll find the credible regions corresponding to 1, 2, and $3\sigma$, since we want to make sure our grid contains nearly all of the posterior probability ($3\sigma$ corresponds to $\approx99.7$%).

Checkpoint: is your grid wide enough to avoid cutting off the posterior distribution before it becomes tiny, while still resolving its shape?

If so, then great. Let's print out the 1D credible intervals, for posterity.

Goodness of fit

We're not done yet! There's still the small matter of evaluating whether the model described the data well.

There are a few simple ways we might do this. Perhaps the most obvious is just to compare the image with the mean image, $\mu(x,y)$, for a representative set of parameters. For this purpose, let's use the posterior mode (the mean is also a common and reasonable choice).

This definitely tells us something, but it's tough to judge whether the width parameter $\sigma$ matches by eye given the Poisson noise in the data, not to mention it tells us nothing about $\mu_0$ directly.

Beyond this, we could compare the number of photons in the data to posterior predictions from the model, and, separately their spatial distribution. For this, it will be helpful to generate samples $(\mu_0,\sigma)$ from our gridded posterior distribution, which the cell below accomplishes.

Use the arrays mu0_ppd and sigma_ppd to produce an array N_ppd of posterior predictions for the total number of photons in the image. Note that both mu0_ppd and sigma_ppd are needed in order to account for the small probability of a photon landing outside of the image and not being counted. Hint: your mean_img function can be useful here, even if it isn't the most efficient way of accomplishing things.

Let's compare that to the data:

Next, generate posterior predictions for $x-x_0$ or $y-y_0$ offsets of photons from the source position (the two are equivalent, since we assumed the PSF was symmetric). Note that this is actually extremely simple by comparison (1 line), and does not require the code below that finds offsets in the data.

The code below laboriously finds the $x$ and $y$ photon offsets in the data, and returns them as a list of offsets and weights, the latter corresponding to the number of photons with that offset.

Let's compare that distribution with your prediction:

Checkpoint: Once again, the random nature of the mock data means we can't say exactly what this will look like for you, but the data and PPDs should appear compatible.

Finally, it's perhaps worth knowing that, for the Poisson likelihood, the expectation value and variance of $C=-2\ln\mathcal{L}$ (with $\mathcal{L}$ the likelihood), has been approximated to useful precision. These formulae are accurate in the limit of many data points, which in our case means many pixels (not photons necessarily). This means we can, if we want, do a goodness of fit test analogous to the frequentist $\chi^2$ test. There is nothing TBC below, but we go through the motions as an illustration.

The first of the functions above computes $C$ from the data, and its expectation value and variance from a model, $\mu(x,y)$. In each case, the calculations are pixel-wise, and the total $C$ (or expectation, or variance) is found by summing the result over the image. In the large-data limit, $C$ should be consistent with a normal distribution with the given expectation and variance.

Here we find the $C$ corresponding to the posterior mode (logically, we should probably have used the maximum likelihood if we are in frequentist mode, but it makes little difference here):

And the theoretical mean and variance, for comparison:

Parting thoughts

So, we've gamely applied our inference tools to an oversimplified toy problem and... run into a computational bottleneck that seems to make them useless in practice. Unsatisfying? Perhaps. But fear not! The next part of the course will cover computational methods that will make this problem quite straightforward.