In which we consider strategies for learning from data when we lack a good first-principles model: resampling, mixture models, "non-parametric" methods, and stochastic processes.
Sometimes we simply don't have a good first-principles model for what's going on in our data, but we're also confident that making a simple assumption (e.g. Gaussian scatter) is dead wrong. Examples include
In these situations, we're motivated to avoid strong modeling assumptions and instead be more empirical. Common adjectives used in this sphere are
In reality, several of the approaches discussed below are models, and in fact have loads of parameters. Still, the terminology seems to be with us.
These methods try to compensate for "small sample" effects in the data, or otherwise not knowing the sampling distribution. Resampling is usually seen in frequentist estimation rather than Bayesian inference, but there are Bayesian adaptations and interpretations.
The classic example for resampling is robustifying the estimation of a sample mean when the sampling distribution appears to have heavy, non-Gaussian tails. But, in general, there is some quantity that we would like to infer from the data - we don't have a proper model to fit, but we can estimate the quantity from the data more directly.
The jackknife procedure is simple: remove one data point from the set, recompute the estimate of interest from the remaining points, and repeat, leaving out each point in turn.
The average (compared to the full-data-set calculation) and scatter of these estimates provide some idea of the small-sample bias of the estimator and its scatter.
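As a minimal sketch of the leave-one-out procedure (with made-up, heavy-tailed data), applied to the sample mean:

```python
# Jackknife sketch: illustrative only, using hypothetical data.
# Recompute the sample mean with each point left out in turn, then
# use the leave-one-out estimates to gauge bias and scatter.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_t(df=3, size=20)  # made-up heavy-tailed "data"
n = len(x)

full_estimate = np.mean(x)
# Leave-one-out estimates: the mean with point i removed
loo = np.array([np.mean(np.delete(x, i)) for i in range(n)])

# Standard jackknife bias and standard-error formulas
bias = (n - 1) * (np.mean(loo) - full_estimate)
std_err = np.sqrt((n - 1) / n * np.sum((loo - np.mean(loo)) ** 2))
```

For the sample mean specifically, the jackknife bias comes out to zero and the standard error reduces to the usual $s/\sqrt{n}$; the procedure becomes more interesting for less well-behaved estimators (e.g. the median or a correlation coefficient).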
The bootstrap is a little more sophisticated. The idea is that we have data that sample a distribution (what we normally aptly call the sampling distribution), so they can be used as a direct (if crude) estimate of that distribution without further assumptions. A key requirement is that the measured data are a fair representation of draws from that distribution. The procedure is to generate a new data set of the same size by drawing points from the original data with replacement, recompute the estimate of interest, and repeat many times; the spread of the resulting estimates stands in for the uncertainty of the original one.
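A minimal sketch of this resampling-with-replacement procedure, again with made-up data, here estimating the uncertainty of the sample median:

```python
# Bootstrap sketch: illustrative only, using hypothetical data.
# Resample the data with replacement many times, recomputing the
# estimate each time; the spread of the results estimates its
# uncertainty without assuming a sampling distribution.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=50)  # made-up heavy-tailed "data"

n_boot = 10000
boot_medians = np.array([
    np.median(rng.choice(x, size=len(x), replace=True))
    for _ in range(n_boot)
])

median_err = np.std(boot_medians)  # bootstrap uncertainty on the median
```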
This refers to the general practice of building a complicated distribution out of simpler components. Technically, a mixture model is one where a PDF is composed of a sum of simpler PDFs,
$p(x) = \sum_i \pi_i \, q_i(x)$,
where the coefficients satisfy $\sum_i \pi_i=1$, and the $q_i(x)$ are normalized.
We could generate from this PDF by drawing from $q_i$ with probability $\pi_i$. Equivalently, we could interpret this as saying that $x$ belongs to one of several populations, each described by $q_i$, with prior probability $\pi_i$. Or we could just use the mixture as a tool to allow significant (but obviously not complete) freedom in $p(x)$.
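The generative recipe above - choose a component with probability $\pi_i$, then draw from it - can be sketched as follows, for a two-component Gaussian mixture with made-up weights and component parameters:

```python
# Sketch of drawing from a mixture PDF (all numbers hypothetical).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.3])       # mixture weights; must sum to 1
means = np.array([0.0, 5.0])    # component means (made up)
sigmas = np.array([1.0, 0.5])   # component widths (made up)

# Step 1: choose a component for each draw, with probability pi_i
k = rng.choice(len(pi), size=100000, p=pi)
# Step 2: draw from the chosen Gaussian component q_k
x = rng.normal(means[k], sigmas[k])
```

Note that the component labels `k` are exactly the "which population does $x$ belong to?" interpretation: marginalizing over them recovers the mixture PDF itself.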
How would we decide on the number of mixture components? Depending on the application, we might:
Each of these, in one way or another, seeks to limit the mixture to the smallest size justified by the data.
The term "non-parametric" is used vaguely (and often inaccurately), so it's best explained by example:
In gravitational lensing, image shear (or stronger distortions) can be measured at the positions of background galaxies in the image plane. Often, the mass distribution of the lens is modeled as the sum of a small number of idealized structures with parametrized mass distributions.
Alternatively, Bradac et al. (2005) model the deflection potential on a regular grid (e.g. their Figure 5), interpolating to the positions of the measured galaxies and thereby avoiding explicit assumptions about the nature of the lens.
In other words, a "non-parametric" model is usually one with many more parameters than a standard, "parametric" model. What's different is that we don't make a global assumption about the form of some function in the model, and instead assume that it's piecewise linear, or piecewise constant, or otherwise simply interpolatable between various values that comprise the parameters of the model.
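As a concrete sketch of the "piecewise-linear" flavor of non-parametric model: the parameters are simply the function's values at a grid of nodes, and the model elsewhere is obtained by interpolation (the grid and node values below are made up for illustration):

```python
# Sketch: a "non-parametric" model as piecewise-linear interpolation.
# The model's parameters are just its values at fixed grid nodes; no
# global functional form is assumed.
import numpy as np

nodes = np.linspace(0.0, 10.0, 11)  # fixed grid in x (hypothetical)
values = np.sin(nodes)              # the model "parameters" (made up)

def model(x):
    # Linear interpolation between the node values
    return np.interp(x, nodes, values)
```

In a fit, the entries of `values` would be varied to match the data - so this "non-parametric" model has 11 parameters.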
A common feature of non-parametric models is that they bypass the usual business of defining a physically motivated model. Instead, they are usually "data-driven":
Thus, "non-parametric" models are in no sense assumption-free - they just involve different assumptions than more simply parametrized, physics-based models.
Stochastic processes are one way to define a non-parametric model, with a bit more sophistication and much more complexity than the "piecewise-linear" option. A stochastic process is a collection of variables drawn from a probability distribution over functions (as opposed to the familiar probability distributions over real numbers, integers, etc.). In other words, if our function of interest is $y(x)$, a stochastic process assigns probabilities $P\left[y(x)\right]$.
A Gaussian process is a particular stochastic process for which
$P\left[y(x) | y(x_1), y(x_2), \ldots\right]$
is a Gaussian PDF whose mean and variance depend on the $x_i$ and $y(x_i)$. The process is specified by a "mean function" $\mu(x)$ and a "covariance function" $C(x,x')$, or "kernel," which determines how quickly $y(x)$ can vary.
A nice feature of Gaussian processes is that all the calculations involved in the conditioning above are algebraic. In other words, if we know the value of the function at some set of $x_i$, it's relatively easy to compute the uncertainty on its value at some other $x$, conditioned on that knowledge.
More technically, a draw from $P[y(x^*)]$ would represent a prior prediction for the function value $y(x^*)$. But, typically, we are more interested in the posterior prediction, drawn from $P[y(x^*)\vert y_{\rm obs}(x_{\rm obs})]$, where $x_{\rm obs}$ are the locations where we know the function and $x^*$ is some other location. This posterior PDF for $y(x^*)$ is a Gaussian, whose mean and standard deviation can be computed algebraically, and which is constrained by all the previously observed $y(x)$. The formalism for all these calculations easily extends to the case where our measurements of $y_{\rm obs}$ come from a Gaussian sampling distribution.
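The conditioning calculation can be sketched directly with linear algebra. This is a minimal version assuming noise-free observations, a zero mean function, and a squared-exponential kernel with made-up hyperparameters and data (noisy measurements would add the noise variance to the diagonal of the observed-point covariance matrix):

```python
# Gaussian-process conditioning sketch (all numbers hypothetical).
# Given observations y_obs at x_obs, compute the posterior mean and
# standard deviation of y at new points x_star.
import numpy as np

def kernel(xa, xb, amp=1.0, scale=1.0):
    # Squared-exponential covariance function C(x, x')
    return amp**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / scale**2)

x_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.sin(x_obs)               # made-up "observations"
x_star = np.linspace(-3, 3, 50)

K = kernel(x_obs, x_obs) + 1e-10 * np.eye(len(x_obs))  # jitter for stability
K_s = kernel(x_star, x_obs)
K_ss = kernel(x_star, x_star)

# Posterior mean and covariance at x_star, conditioned on y_obs
alpha = np.linalg.solve(K, y_obs)
mean = K_s @ alpha
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The posterior mean passes through the observed points, and the posterior standard deviation shrinks toward zero there and grows back toward the prior amplitude far from the data.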
Gaussian processes provide a natural way to achieve high flexibility (and uncertainty) when interpolating data, provided we're willing to make the appropriate assumptions (e.g. Gaussian measurement errors). For a given kernel, the required computations are quite efficient. Marginalization over hyperparameters such as the width of the kernel is more computationally expensive (involving the determinants of the matrices), but reasonably fast methods have been developed.
Rasmussen & Williams, Gaussian Processes for Machine Learning