hierarchical

Now that you've gained some experience carrying our Bayesian inference, it's time to face the fact that real life is often much more complicated than the problems we've worked so far. (The original title of these notes, years ago, was "Coping with Reality".)

In any case, real data sets tend to be complicated. Some common features:

Heteroscedastic measurement uncertainties (i.e. different for each measurement)
Measurement uncertainty for every measured quantity
Correlations between measurement uncertainties, amongst different quantities and/or data points
Intrinsic scatter (i.e. real differences that are not just due to measurement error)
Selection effects and systematically incomplete data
Presence of unwanted sources in the data set
Systematic uncertainties every which way

All of these things represent something that the Universe/atmosphere/telescope/etc. has done to the signal we measure before we get our hands on it, carve it in stone, and call it data. (Insert the spicy adjective of your choice before "Universe", if you'd like.) Recalling our notes on generative models, that means that any of these effects that apply deserve to be represented in our model. Very often, this involves introducing additional nuisance parameters and/or expanding the model such that it has a hierarchical structure. We've seen both of these features before, in fact, but it's time to give them a name and see how enabling they really are.

Nuisance parameters¶

A nuisance parameter is any model parameter that we are not particularly interested in learning about, at the end of the day. This is an entirely subjective label - nuisance parameters are not formally or functionally treated any differently from other parameters. We're coming back to them now because adding nuisance parameters to a model provides a way to explicitly account for and marginalize over systematic uncertainties. This can be as simple as assigning a prior distribution to some quantity that would otherwise remain fixed, or it can involve a more explicit expansion of the model. Nuisance parameters often represent latent variables - things that are logically or physically part of a model, but which cannot be directly measured.

Example: measuring flux with a background¶

Recall our simple measurement of a source's flux based on a number of detected counts from way back in the Bayes' Law notes. Let's explicitly include the constant conversion between flux and counts, so that our model might be described by

$F \sim \mathrm{Gamma}(\alpha,\beta)$
$\mu = C\,F$
$N \sim \mathrm{Poisson}(\mu)$

The form of the prior on $F$ isn't critical for our purposes here, so we'll stick with the Gamma distribution. The corresponding PGM looks like this:

PGM for the Poisson model, explicitly showing the expectation value corresponding to a given flux

Now let's expand the model to account for the fact that our measurement actually includes counts from both the source (a galaxy, say) and some background. Being good astronomers, we also remembered to take a second measurement of a source-free (background-only) patch of sky.

With subscripts $b$ for background, $g$ for galaxy, $s$ for the science observation that includes both, our model might be

$F_b \sim \mathrm{Gamma}(\alpha_b,\beta_b)$
$\mu_b = C_b\,F_b$
$N_b \sim \mathrm{Poisson}(\mu_b)$
$F_g \sim \mathrm{Gamma}(\alpha_g,\beta_g)$
$\mu_s = C_s(F_g+F_b)$
$N_s \sim \mathrm{Poisson}(\mu_s)$

PGM for the Poisson model, including flux from both a galaxy and background, and measured counts for both the science and background regions

We'd probably regard $F_b$ and $\mu_b$ as nuisance parameters here - we need to account for our uncertainty in $F_b$, but measuring it isn't the point of the analysis.

More interestingly, suppose we want to marginalize over systematic uncertainty in the instrument gain, assuming it's the same for both observations. Maybe the instrument team told us it was known to $\sim5\%$. We might introduce a new parameter, $\gamma$, representing the true (unknown) value of the gain relative to its published value (which would be included in $C$ already). Knowing only that $\gamma$ is uncertain at the "$\sim5\%$" level, we'd probably assign it a normal prior. Observe how the model changes:

$\gamma \sim \mathrm{Normal}(1.0, 0.05)$
$F_b \sim \mathrm{Gamma}(\alpha_b,\beta_b)$
$\mu_b = \gamma C_b\,F_b$
$N_b \sim \mathrm{Poisson}(\mu_b)$
$F_g \sim \mathrm{Gamma}(\alpha_g,\beta_g)$
$\mu_s = \gamma C_s(F_g+F_b)$
$N_s \sim \mathrm{Poisson}(\mu_s)$

In this case, nothing in the data allow us to constrain $\gamma$; its posterior will end up looking like its prior. At the same time, marginalizing over it allows us to propagate the prior uncertainty in $\gamma$ to the posterior distributions of the parameters we care about measureing. This is a common role of nuisance parameters.

As an aside, we'll note that there is a difference, albeit an arguable one, between being complete and being needlessly pedantic. For example, if the prior uncertainty on $\gamma$ were $\sim0.1\%$, and the Poisson uncertainty in the sampling distributions were $\sim10\%$, we would be quite justified in fixing $\gamma=1$, even though it isn't known perfectly.

Hierarchical models¶

Often, the model for a data set takes a hierarchical structure. This is naturally the case when we are measuring properties of multiple sources in order to learn about properties of a class of sources to which they belong. It's honestly difficult to think of astrophysical use cases where this doesn't apply, in principle.

In statistics, the applicability of hierarchical models is related to the concept of exchangeability - meaning that, as far as we know, individual sources of a given class are equivalent (until we measure them). In other words, exchangeability means that it's fair to apply a common prior distribution to properties of multiple subjects (be they astronomical sources, people, etc.). This prior parameters describe the statistical properties of the source class, and are often what we're most interested in measuring.

To be mind-numblingly formal for a moment, the general form for a hierarchical model goes like this:

$p(\mathrm{data}|\theta)$ describes the measurement process (what we've been calling the sampling distribution),
$p(\theta)$ decomposes as $p(\theta|\phi_1)\,p(\phi_1)$,
$p(\phi_1)$ decomposes as $p(\phi_1|\phi_2)\,p(\phi_2)$,
$\ldots$,
$p(\phi_n)$, usually taken to be "uninformative"

In the simplest hierarchical case, $\theta$ would represent the properties of individual sources that we've observed; $\phi_1$ would parametrize the statistical properties of the class of sources, $p(\theta|\phi_1)$; and $\phi_2$ would be the hyperparameters of some prior distribution for $\phi_1$, $p(\phi_1|\phi_2)$. Note that there is now some ambiguity as to what we call "the priors" of the model - $p(\theta|\phi_1)$ is the prior on $\theta$, but it is also an integral part of the model. Possibly this distribution represents prior uncertainty on $\theta$ in the "usual" way, i.e. we are uncertain about the values of $\theta$ and assign some PDF to account for this. Possibly, $p(\theta|\phi_1)$ might instead encode a physical model for a distribution of source properties, as in the example below. Possibly it's a bit of both. You might recall earlier comments that it's best to regard the choice of priors as part of the model definition; this fuzziness is one reason for it.

Example: galaxy luminosity function¶

Let's modify the previous example as follows

We're interested in the luminosity, $L$, rather than flux - if we know the distance to the target, this just means including another known factor in $C_g$, which now converts luminosity to counts.
We'll measure $m>1$ galaxies, and are interested in constraining the luminosity function, the number density of galaxies as a function of luminosity. Traditionally, it's modelled as

$n(x) = \phi^\ast x^\alpha e^{-x}; \quad x=\frac{L}{L^\ast}$.

Here's a typical curve.

The canonical luminosity function resembles a power-law decrease followed by an exponential cutoff

For simplicity, we'll assume that we've measured every galaxy above a given luminosity in some volume. This is not very realistic, but we'll leave the issues raised by incomplete data sets for later. In essence, this assumption allows us to treat the luminosity function as being proportional to a prior PDF for the luminosities of galaxy in our data set. For simplicity, let's also assume that the same background applies to each galaxy measurement, and that we have a galaxy-free observation of it, as before.

Compressing the $L\rightarrow N$ and $F\rightarrow N$ conversions once more, the PGM for our experiment looks like this:

PGM for luminosity measurements of multiple galaxies

You might now be seeing a resemblance to some of the problems we've done before. Whether they refer to astronomical sources or pixels in a camera, we generally end up with multiple measurements within a plate in the PGM, all connected to common parameters that live outside the plate, which in turn have their own prior distributions or connections farther upstream. In fact, one could easily imagine making this a doubly hierarchical problem by explicitly modeling the counts in each pixel involved in the measurement of each galaxy.

Example: linear regression with most of the trimmings¶

Consider the following common features in the context of fitting a line to $x$ and $y$ data.

Heteroscedastic measurement uncertainties (i.e. different for each measurement)
Measurement uncertainty for every measured quantity (both $x$ and $y$ for each data point, in this case)
Correlations between measurement uncertainties, amongst different quantities and/or data points
Intrinsic scatter (i.e. real differences that are not just due to measurement error)

In the generative sense, this translates to words like these:

The Universe has some distribution of the $x$ values of things. Call the true (unknowable) values of $x$ for all the things $\xi$.
For a given $\xi$, there is some variance in the true $y$ values of things (call these $\eta$). Their mean value as a function of $\xi$ follows a line, parametrized by $(\alpha,\beta)$, with the shape of the PDF given by some parameters $\Sigma$.
We don't know $\xi$ or $\eta$ for any source, but we make measurement of them and call the resulting data $x$ and $y$. The measurement uncertainties for a given data point are correlated, meaning that $(x,y)$ is drawn from some 2D PDF that depends on the true values $(\xi,\eta)$ and some other parameters, $M$, that we will mercifully assume we know. We'll assume that there are no measurement correlations between different data points, although that's also sadly plausible.

Stare at this long enough and you will see the classic least-squares line fitting scenario in there somewhere - but with a lot more fun added in. Here is a PGM based on those words:

Note that we are still assuming that our set of measured things is not systematically incomplete somehow. We'll deal with that noise later in the course.

Example: class membership¶

What if our set of observed sources contains members of multiple classes? We could imagine adding a discrete "group label" parameter, $g$, to the model for each source. The prior for $g$ would be multinomial, depending on the (possibly unknown) fractions of sources belonging to each class,

$g_i \sim \mathrm{Multinomial}(\{f_j\})$,

with $i$ indexing over sources and $j$ over classes. $g_i$ then determines which class model provides the sampling distribution for the measurements of the $ith$ source.

Alternatively, we could just implement the sampling distribution as a sum over possible classes, which is just marginalizing over $g_i$ by brute force,

$p(\mathrm{data}_i|\theta_1, \theta_2, \ldots) = \sum_j f_j\,p_j(\mathrm{data}|\theta_j)$.

Here $p_j(\mathrm{data}|\theta_j)$ would be the appropriate sampling distribution for the $j$th class.

Parting thoughts¶

While it might not appear that we've done anything new in these notes compared with earlier in the course, we've named and explicitly defined two features of Bayesian analysis that allow systematic uncertainties - that is, uncertainties about the generative model apart from the sampling distribution - to be faithfully propagated throughout our inferences. This is a key task if we want anyone (including ourselves) to believe our results, and something that other approaches generally struggle with. The fact that they weren't "new" as such just underscores how fundamental nuisance parameters and hierarchical structures are to principled inference.

An inevitable (and tautological) feature of complex models that use these features is that they contain more parameters than simpler models. Generically, this makes our job computationally more challenging, whether we're just trying to find the posterior maximum, explore the posterior to define confidence regions, or calculate the evidence. There are a number of strategies beyond brute force that can simplify things, and there's a certain amount of artistry involved in choosing and carrying them out (as with any difficult numerical calculation). One such strategy is described above in the "class membership" example, and we'll see more in the tutorials here in the latter part of the course. Indeed, getting you to this point, and then providing hands-on experience with realistically complex problems ("coping with reality", you might say) is one of the central goals of this course. Onward!