In which we consider strategies for learning from data when we lack a good first-principles model: resampling, mixture models, "non-parametric" methods, and stochastic processes.
Sometimes we simply don't have a good first-principles model for what's going on in our data, but we're also confident that making a simple assumption (e.g. Gaussian scatter) is dead wrong. Examples include
In these situations, we're motivated to avoid strong modeling assumptions and instead be more empirical. Common adjectives used in this sphere are
In reality, several of the approaches discussed below are models, and in fact have loads of parameters. Still, the terminology seems to be with us.
These methods try to compensate for "small sample" effects in the data, or otherwise not knowing the sampling distribution. Resampling is usually seen in frequentist estimation rather than Bayesian inference, but there are Bayesian adaptations and interpretations.
The classic example for resampling is robustifying the estimation of a sample mean when the sampling distribution appears to have heavy, non-Gaussian tails. But, in general, there is some quantity that we would like to infer from the data - we don't have a proper model to fit, but we can estimate the quantity from the data more directly.
The jackknife procedure is simple: remove one data point from the set, recompute the estimate of interest from the remaining points, and repeat, leaving out each point in turn.
The average (compared to the full-data-set calculation) and scatter of these estimates provide some idea of the small-sample bias of the estimator and its scatter.
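As a minimal sketch of the leave-one-out procedure (with made-up, heavy-tailed data), applied to the sample mean:

```python
# Jackknife sketch: illustrative only, using hypothetical data.
# Recompute the sample mean with each point left out in turn, then
# use the leave-one-out estimates to gauge bias and scatter.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_t(df=3, size=20)  # made-up heavy-tailed "data"
n = len(x)

full_estimate = np.mean(x)
# Leave-one-out estimates: the mean with point i removed
loo = np.array([np.mean(np.delete(x, i)) for i in range(n)])

# Standard jackknife bias and standard-error formulas
bias = (n - 1) * (np.mean(loo) - full_estimate)
std_err = np.sqrt((n - 1) / n * np.sum((loo - np.mean(loo)) ** 2))
```

For the sample mean specifically, the jackknife bias comes out to zero and the standard error reduces to the usual $s/\sqrt{n}$; the procedure becomes more interesting for less well-behaved estimators (e.g. the median or a correlation coefficient).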
The bootstrap is a little more sophisticated. The idea is that we have data that sample a distribution (what we normally aptly call the sampling distribution), so they can be used as a direct (if crude) estimate of that distribution without further assumptions. A key requirement is that the measured data are a fair representation of draws from that distribution. The procedure is to generate a new data set of the same size by drawing points from the original data with replacement, recompute the estimate of interest, and repeat many times; the spread of the resulting estimates stands in for the uncertainty of the original one.
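A minimal sketch of this resampling-with-replacement procedure, again with made-up data, here estimating the uncertainty of the sample median:

```python
# Bootstrap sketch: illustrative only, using hypothetical data.
# Resample the data with replacement many times, recomputing the
# estimate each time; the spread of the results estimates its
# uncertainty without assuming a sampling distribution.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_t(df=3, size=50)  # made-up heavy-tailed "data"

n_boot = 10000
boot_medians = np.array([
    np.median(rng.choice(x, size=len(x), replace=True))
    for _ in range(n_boot)
])

median_err = np.std(boot_medians)  # bootstrap uncertainty on the median
```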
This refers to the general practice of building a complicated distribution out of simpler components. Technically, a mixture model is one where a PDF is composed of a sum of simpler PDFs,
$p(x) = \sum_i \pi_i \, q_i(x)$,
where the coefficients satisfy $\sum_i \pi_i=1$, and the $q_i(x)$ are normalized.
We could generate from this PDF by drawing from $q_i$ with probability $\pi_i$. Equivalently, we could interpret this as saying that $x$ belongs to one of several populations, each described by $q_i$, with prior probability $\pi_i$. Or we could just use the mixture as a tool to allow significant (but obviously not complete) freedom in $p(x)$.
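The generative recipe above - choose a component with probability $\pi_i$, then draw from it - can be sketched as follows, for a two-component Gaussian mixture with made-up weights and component parameters:

```python
# Sketch of drawing from a mixture PDF (all numbers hypothetical).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.3])       # mixture weights; must sum to 1
means = np.array([0.0, 5.0])    # component means (made up)
sigmas = np.array([1.0, 0.5])   # component widths (made up)

# Step 1: choose a component for each draw, with probability pi_i
k = rng.choice(len(pi), size=100000, p=pi)
# Step 2: draw from the chosen Gaussian component q_k
x = rng.normal(means[k], sigmas[k])
```

Note that the component labels `k` are exactly the "which population does $x$ belong to?" interpretation: marginalizing over them recovers the mixture PDF itself.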
How would we decide on the number of mixture components? Depending on the application, we might:
Each of these, in one way or another, seeks to limit the mixture to the smallest size justified by the data.
The term "non-parametric" is used vaguely (and often inaccurately), so it's best explained by example:
In gravitational lensing, image shear (or stronger distortions) can be measured at the positions of background galaxies in the image plane. Often, the mass distribution of the lens is modeled as the sum of a small number of idealized structures with parametrized mass distributions.
Alternatively, Bradac et al. (2005) model the deflection potential on a regular grid (e.g. their Figure 5), interpolating to the positions of the measured galaxies and thereby avoiding explicit assumptions about the nature of the lens.
In other words, a "non-parametric" model is usually one with many more parameters than a standard, "parametric" model. What's different is that we don't make a global assumption about the form of some function in the model, and instead assume that it's piecewise linear, or piecewise constant, or otherwise simply interpolatable between various values that comprise the parameters of the model.
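As a concrete sketch of the "piecewise-linear" flavor of non-parametric model: the parameters are simply the function's values at a grid of nodes, and the model elsewhere is obtained by interpolation (the grid and node values below are made up for illustration):

```python
# Sketch: a "non-parametric" model as piecewise-linear interpolation.
# The model's parameters are just its values at fixed grid nodes; no
# global functional form is assumed.
import numpy as np

nodes = np.linspace(0.0, 10.0, 11)  # fixed grid in x (hypothetical)
values = np.sin(nodes)              # the model "parameters" (made up)

def model(x):
    # Linear interpolation between the node values
    return np.interp(x, nodes, values)
```

In a fit, the entries of `values` would be varied to match the data - so this "non-parametric" model has 11 parameters.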
A common feature of non-parametric models is that they bypass the usual business of defining a physically motivated model. Instead, they are usually "data-driven":
Thus, "non-parametric" models are in no sense assumption-free - they just involve different assumptions than more simply parametrized, physics-based models.
Stochastic processes are one way to define a non-parametric model, with a bit more sophistication and much more complexity than the "piecewise-linear" option. A stochastic process is a collection of variables drawn from a probability distribution over functions (as opposed to the familiar probability distributions over real numbers, integers, etc.). In other words, if our function of interest is $y(x)$, a stochastic process assigns probabilities $P\left[y(x)\right]$.
A Gaussian process is a particular stochastic process for which
$P\left[y(x) | y(x_1), y(x_2), \ldots\right]$
is a Gaussian PDF whose mean and variance depend on the $x_i$ and $y(x_i)$. The process is specified by a "mean function" $\mu(x)$ and a "covariance function" $C(x,x')$, or "kernel," which determines how quickly $y(x)$ can vary.
A nice feature of Gaussian processes is that all the calculations involved in the conditioning above are algebraic. In other words, if we know the value of the function at some set of $x_i$, it's relatively easy to compute the uncertainty on its value at some other $x$, conditioned on that knowledge.
More technically, a draw from $P[y(x^*)]$ would represent a prior prediction for the function value $y(x^*)$. But, typically, we are more interested in the posterior prediction, drawn from $P[y(x^*)\vert y_{\rm obs}(x_{\rm obs})]$, where $x_{\rm obs}$ are the locations where we know the function and $x^*$ is some other location. This posterior PDF for $y(x^*)$ is a Gaussian, whose mean and standard deviation can be computed algebraically, and which is constrained by all the previously observed $y(x)$. The formalism for all these calculations easily extends to the case where our measurements of $y_{\rm obs}$ come from a Gaussian sampling distribution.
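The conditioning calculation can be sketched directly with linear algebra. This is a minimal version assuming noise-free observations, a zero mean function, and a squared-exponential kernel with made-up hyperparameters and data (noisy measurements would add the noise variance to the diagonal of the observed-point covariance matrix):

```python
# Gaussian-process conditioning sketch (all numbers hypothetical).
# Given observations y_obs at x_obs, compute the posterior mean and
# standard deviation of y at new points x_star.
import numpy as np

def kernel(xa, xb, amp=1.0, scale=1.0):
    # Squared-exponential covariance function C(x, x')
    return amp**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / scale**2)

x_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.sin(x_obs)               # made-up "observations"
x_star = np.linspace(-3, 3, 50)

K = kernel(x_obs, x_obs) + 1e-10 * np.eye(len(x_obs))  # jitter for stability
K_s = kernel(x_star, x_obs)
K_ss = kernel(x_star, x_star)

# Posterior mean and covariance at x_star, conditioned on y_obs
alpha = np.linalg.solve(K, y_obs)
mean = K_s @ alpha
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The posterior mean passes through the observed points, and the posterior standard deviation shrinks toward zero there and grows back toward the prior amplitude far from the data.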
Gaussian processes provide a natural way to achieve high flexibility (and uncertainty) when interpolating data, provided we're willing to make the appropriate assumptions (e.g. Gaussian measurement errors). For a given kernel, the required computations are quite efficient. Marginalization over hyperparameters such as the width of the kernel is more computationally expensive (involving the determinants of the matrices), but reasonably fast methods have been developed.
Rasmussen & Williams, Gaussian Processes for Machine Learning