Tutorial: Frequentism and Bayesian Analysis

Here we will work through an example inference described by Jake Vanderplas here (pages 4-6). You will carry out a classical frequentist analysis, a maximum likelihood analysis, and a Bayesian analysis of the same data set, and compare what they tell us.

Jaynes' Truncated Exponential Model

Let's explore Jaynes' model, as relayed by Vanderplas:

Consider a device that "...will operate without failure for a time $\theta$ because of a protective chemical inhibitor injected into it; but at time $\theta$ the supply of the chemical is exhausted, and failures then commence, following an exponential failure law (with a characteristic decay time of 1 day). It is not feasible to observe the depletion of this inhibitor directly; one can observe only the resulting failures. From data on actual failure times, estimate the time $\theta$ of guaranteed safe operation..."

Suppose we observe $N=3$ device failures. We don't know the exhaustion time $\theta$, but we see the three devices fail after $\mathbf{x} = [10, 12, 15]$ days. What is the value of $\theta$?

Let's begin with the ingredients that are common to all the approaches we'll use. First, the data (failure times in days):
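A minimal setup might look like this (the names `x` and `N` are our choices, reused in the sketches below):

```python
import numpy as np

# Observed failure times, in days
x = np.array([10.0, 12.0, 15.0])
N = len(x)
```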

Traditionally, data like this would be visualized as a survival curve, the fraction of devices that have not yet failed as a function of time, with the times of failures highlighted.
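A sketch of such a plot, assuming matplotlib and the `x` array defined above:

```python
import matplotlib.pyplot as plt

# Empirical survival curve: fraction of devices still operating at each time
t = np.linspace(0.0, x.max() + 2.0, 500)
surviving = (x[None, :] > t[:, None]).mean(axis=1)

plt.plot(t, surviving, drawstyle='steps-post', label='surviving fraction')
plt.plot(x, np.zeros_like(x), 'kx', label='failure times')
plt.xlabel('time (days)')
plt.ylabel('fraction not yet failed')
plt.legend();
```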

Second, both frequentist and Bayesian analysis use the likelihood function, although we might refer to it as the sampling distribution in the Bayesian context. From the description, the sampling distribution for the $k$th failure time is

$p(x_k|\theta) = e^{\theta - x_k}$ for $x_k > \theta$,

$p(x_k|\theta) = 0$ for $x_k \leq \theta$.

Note that this looks extra-simple thanks to the fact that the decay time was given to be 1 day (i.e. this is the exponential distribution for $x_k-\theta$, with a rate parameter of 1).

Classic Frequentist Approach

Vanderplas points out that the mean of $x$ over its sampling distribution $p(x|\theta)$ is $\theta + 1$, and hence that a reasonable estimator for $\theta$ is

$\hat{\theta} = \bar{x} - 1$

where $\bar{x}$ is the sample mean of the data $\mathbf{x}$.

Vanderplas then derives an approximate form for the sampling distribution for this estimator:

The exponential distribution has a standard deviation of 1, so in the limit of large $N$, we can use the standard error of the mean to show that the sampling distribution of $\hat{\theta}$ will approach a normal distribution with standard deviation $\sigma = 1/\sqrt{N}$.

Under this Gaussian approximation, the $\pm 2\sigma$ range corresponds to the 95.4% frequentist confidence interval for the estimator $\hat{\theta}$:

$\mathrm{CI} \approx (\hat{\theta} - 2/\sqrt{N}, \hat{\theta} + 2/\sqrt{N})$

To warm up, write a function to compute the estimator $\hat{\theta}$ and its confidence interval, and calculate them for the given dataset.
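A minimal sketch of such a function (we adopt the name `classical_frequentist`, which is reused below when testing coverage):

```python
def classical_frequentist(data, nsigma=2.0):
    """Classical estimator of theta and its approximate +/- nsigma
    (Gaussian) confidence interval."""
    data = np.asarray(data)
    theta_hat = data.mean() - 1.0
    half_width = nsigma / np.sqrt(len(data))
    return theta_hat, (theta_hat - half_width, theta_hat + half_width)

theta_hat, ci = classical_frequentist(x)
print(theta_hat, ci)  # about 11.33 and (10.18, 12.49) for our data
```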

Because we can, let's visualize these results along with the data:
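One way to do this, reusing the survival-curve ingredients from above:

```python
plt.plot(t, surviving, drawstyle='steps-post')
plt.plot(x, np.zeros_like(x), 'kx')
plt.axvline(theta_hat, color='C1', label=r'$\hat{\theta}$')
plt.axvline(ci[0], color='C1', linestyle='--')
plt.axvline(ci[1], color='C1', linestyle='--', label='confidence interval')
plt.xlabel('time (days)')
plt.ylabel('fraction not yet failed')
plt.legend();
```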

Above, the solid vertical line indicates $\hat{\theta}$, and the dashed lines bound its confidence interval. It seems problematic that this confidence interval contains only values of the exhaustion time $\theta$ that are greater than the minimum observed failure time (checkpoint: this should be the case), values which by definition have zero likelihood. Recall that the meaning of the confidence interval is the following, in Vanderplas' words:

If this experiment is repeated many times, in 95% of these cases the computed confidence interval will contain the true value of $\theta$.

(Note that, more precisely, this should be 95.4 and change percent.)

Let's test this assertion. Write a function that generates an arbitrary number of equal-sized data sets from the model given a true $\theta$ of your choice, computes the $2\sigma$ frequentist confidence interval for each one, and reports the fraction of realizations for which the confidence interval contains the true value (known as the "coverage"). Because we'll do a similar test for the Maximum Likelihood method later, this function should take classical_frequentist (i.e. the function that computes the interval) as an argument.
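Here is one possible implementation (the name `coverage` and the keyword defaults are our choices):

```python
def coverage(interval_function, theta_true=10.0, ndata=3, nsets=10**4, seed=None):
    """Fraction of simulated data sets for which the interval returned by
    interval_function(data) contains theta_true."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(nsets):
        # Failure times are theta_true plus unit-rate exponential delays
        data = theta_true + rng.exponential(1.0, size=ndata)
        _, (lo, hi) = interval_function(data)
        hits += (lo <= theta_true <= hi)
    return hits / nsets
```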

Let's see how it does, for a default of $10^4$ realizations of length-3 data sets:
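For example:

```python
print(coverage(classical_frequentist, theta_true=10.0))
```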

Checkpoint: while there will be some random variation, you should indeed recover 0.95+change for any positive theta_true (not just the example value of 10). Feel free to play around with the size of the simulated data sets and/or number of simulations also.

Given that the confidence interval procedure appears to do its job in principle, does the strangeness of the interval we obtained for our particular data set tell us anything interesting about that data set? Does it tell us anything interesting about $\theta$?

Maximum Likelihood Approach

Next, we turn to the maximum likelihood approach within frequentism. To begin with, code up the likelihood function, $\mathcal{L}(\theta; \mathbf{x}) = p(\mathbf{x}|\theta)$.
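A vectorized sketch, following the sampling distribution given above:

```python
def likelihood(theta, data):
    """Likelihood of theta given the failure times in data,
    evaluated for an array of theta values."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    data = np.asarray(data)
    # exp of sum_k (theta - x_k), truncated to zero where theta > min(data)
    L = np.exp(len(data) * theta - data.sum())
    L[theta > data.min()] = 0.0
    return L
```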

Below we'll plot it as a function of $\theta$ for our data set, indicating the measured values of $x$ on the axis for context. The likelihood should naturally drop to zero for $\theta > \min \mathbf{x}$, per the definition of $\theta$.
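For instance:

```python
theta_grid = np.linspace(8.0, 16.0, 1000)
plt.plot(theta_grid, likelihood(theta_grid, x))
plt.plot(x, np.zeros_like(x), 'kx')  # measured failure times, for context
plt.xlabel(r'$\theta$')
plt.ylabel(r'$\mathcal{L}(\theta; \mathbf{x})$');
```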

Given the form of the likelihood function, derive an expression for the Maximum Likelihood estimator, $\hat{\theta}_\mathrm{ML}$.
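(To check your work: for $\theta \leq \min \mathbf{x}$, $\mathcal{L}(\theta; \mathbf{x}) = \prod_k e^{\theta - x_k} = e^{N\theta - \sum_k x_k}$, which increases monotonically with $\theta$, so the maximum sits at the cutoff: $\hat{\theta}_\mathrm{ML} = \min \mathbf{x}$.)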

We'll also want to find the 95.4% ML confidence interval. Recall that this is defined as the interval satisfying

$S(\theta) - S(\hat{\theta}_\mathrm{ML}) \leq 4$, where $S(\theta) = -2 \ln \mathcal{L}(\theta; \mathbf{x})$.

Derive an analytic expression for the bounds of this interval.
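(Again, to check your work: for $\theta \leq \hat{\theta}_\mathrm{ML}$, $S(\theta) - S(\hat{\theta}_\mathrm{ML}) = 2N(\min \mathbf{x} - \theta)$, so the condition above becomes $\theta \geq \min \mathbf{x} - 2/N$, giving the interval $(\min \mathbf{x} - 2/N, \; \min \mathbf{x})$.)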

Write a function analogous to the one for the classical estimator to compute the ML estimator and 95.4% confidence interval, and run it on the data.
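A sketch with the same calling convention as `classical_frequentist`, so it can be passed straight to the coverage test:

```python
def ml_frequentist(data):
    """Maximum likelihood estimator of theta and its 95.4% confidence interval."""
    data = np.asarray(data)
    theta_ml = data.min()
    return theta_ml, (theta_ml - 2.0 / len(data), theta_ml)

print(ml_frequentist(x))  # 10.0 and (9.33, 10.0) for our data
```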

Let's see how that compares with the classical estimator:
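For example, a simple side-by-side print:

```python
print('classical:', classical_frequentist(x))
print('ML:       ', ml_frequentist(x))
```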

As above, we'll see what the coverage of the confidence interval is by testing on simulated data sets.
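For example:

```python
print(coverage(ml_frequentist, theta_true=10.0))
```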

Checkpoint: if you are scratching your head at the result of the cell above, that's a good thing. It turns out that the coverage of the ML interval, which was intended to be 95.4%, is more like ~0.86.

This is a disaster from a frequentist perspective, since the whole point of the interval is to have well-defined coverage (but, on the other hand, at least the ML interval includes allowed values of $\theta$). Can you find the key feature of this problem that makes ML misbehave in this case, as foreshadowed in the notes?

Bayesian Approach

Assuming a uniform prior PDF for $\theta \geq 0$, Vanderplas gives the posterior PDF for $\theta$ as:

$p(\theta|\mathbf{x}) = N e^{N(\theta-\min \mathbf{x})}$, for $\theta \leq \min \mathbf{x}$; 0 for $\theta > \min \mathbf{x}$.

(This should be proportional to your likelihood function above.)

The maximum density credible interval containing probability $q$ extends from $\theta_1$ to $\theta_2 = \min \mathbf{x}$, where $\theta_1$ is determined by the integral

$\int_{\theta_1}^{\theta_2} p(\theta|\mathbf{x}) \, d\theta = q$.

Vanderplas even does the integral for us:

$\theta_1 = \theta_2 + \frac{1}{N} \ln \left[ 1 - q ( 1 - e^{-N \theta_2} ) \right] $.

Write a function that returns the posterior mode and 95.4% credible interval.
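A sketch implementing the analytic results above (the function name `bayesian` is our choice; note that the posterior in this problem increases monotonically up to its cutoff, so the mode coincides with $\min \mathbf{x}$):

```python
def bayesian(data, q=0.954):
    """Posterior mode and maximum density credible interval containing
    probability q, using Vanderplas' analytic result."""
    data = np.asarray(data)
    n = len(data)
    theta_2 = data.min()  # posterior mode; upper end of the interval
    theta_1 = theta_2 + np.log(1.0 - q * (1.0 - np.exp(-n * theta_2))) / n
    return theta_2, (theta_1, theta_2)

print(bayesian(x))  # mode 10.0, interval roughly (8.97, 10.0)
```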

Here's how this result compares with the previous sections:
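For example:

```python
print('classical:', classical_frequentist(x))
print('ML:       ', ml_frequentist(x))
print('Bayesian: ', bayesian(x))
```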

Checkpoint: the credible interval should be quite similar to the ML confidence interval, though slightly wider at the low end.

The Bayesian credible region has the following meaning, again articulated by Vanderplas:

Given our observed data, there is a 95% probability that the true value of $\theta$ lies within the credible region.

Weirdly enough, the veracity of this statement can be tested using exactly the same procedure as we used to test the coverage of confidence intervals. That is, the credible interval should contain the true value the prescribed fraction of the time, given a large number of independent data sets. Let's try it out.
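Reusing the `coverage` function from before:

```python
print(coverage(bayesian, theta_true=10.0))
```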

Checkpoint: lo and behold, it does! (And, to boot, it contains only allowed values of $\theta$.)

Of course, the above test isn't really necessary except to the extent that it might catch a programming error. By definition and by math, the credible interval must do what it claims to; we made no creative decisions and relied on no approximations to define the procedure for finding the credible interval (apart from defining the generative model, which all the approaches depend on).

Parting thoughts

What have we done here? You have, for an admittedly contrived yet entirely plausible scenario, shown that the classical confidence interval can contain only values of $\theta$ that are impossible given the data, that the maximum likelihood confidence interval can fail to achieve its nominal coverage, and that the Bayesian credible interval does exactly what it claims to.

The point here is less to bash frequentism than to point out some of its complexities and the apparent contradictions that it can sometimes produce, as well as to get a little practice using those methods. We note, however, that in general the process of finding a maximum-likelihood confidence interval would involve numerical optimization rather than pencil-and-paper work.