You've seen how Bayes' Law follows simply from the insight that probability distributions are the natural mathematical tool for describing quantities that are uncertain. Yet it's extremely likely that you've done some kind of data analysis before, and even taken multiple statistics courses, without ever having heard of it. Why is that?
The answer to that inevitably takes us into the uncomfortable realm of history and culture. Yes, really.
Let's be clear that I am not a historian of science or mathematics and have only a passing interest in these things. What you're getting here is my understanding, based on having read some stuff (by actual historians of science and mathematics) in pursuit of the question above. Do not consider this account to be complete or free of editorializing.
First, you might be under the impression that the Bayesian approach is somehow new. This is false. Its extensive use in our field is relatively new, but Bayes and (mostly) Laplace were doing their thing in the 1700's contemporaneously with the establishment of probability as a branch of mathematics. Laplace and Gauss were using these methods to constrain models of planetary motion. Bayes is not a new invention, nor is its application to astrophysics.
However, things changed as "statistics" was being established as an academic field in the 19th century, largely with the purpose of providing a quantitative basis for biology and the also-new field of sociology. At least partly, this was due to discomfort with priors and the fact that they can't be made perfectly uninformative. A different approach was proposed, in which probabilities were strictly defined as the long-run frequencies of independent observations, and where (by fiat) they may not be used to describe uncertainty. This was known as frequentism, which we describe below.
For whatever reason, the Big Names who built up frequentist statistics were vociferously anti-Bayes. Perhaps they were genuinely if misguidedly disturbed by the existence of priors or the less strict definition of probability used by Laplace (which has since been placed on a firm theoretical basis). Or maybe there was some other reason.$^1$
Somehow, despite the fact that Bayes' method was pioneered by revered physicists, the frequentist approach is the one that became embedded in the culture of experimental physics as it developed in the 20th century (along with every other field). This is some kind of coup, given the physicists' usual preference for clear and elegant solutions that follow simply from first principles (see Bayes' Law); rather than thinking about things that way, the average practitioner was convinced to accept that the statisticians knew what they were doing, and that choosing from the menu of ad hoc, statistician-approved methods was the thing to do. Presumably, the false claim to objectivity that frequentism makes was a convincing factor.$^2$ Either way, frequentism became firmly entrenched in the hard and soft sciences as well as in the statistics establishment.
Eventually, and far from universally, things have started to shift back. In astrophysics, this is probably due to the fact that complex problems of the kind that we now have data to address simply can't feasibly be tackled within frequentism (apart from Maximum Likelihood, which is essentially Bayes under another name). It doesn't hurt that we are, for the most part, only answerable to ourselves when it comes to what analysis approaches are acceptable. This is in contrast to medical studies, for example, where until quite recently specific types of frequentist analysis were the only evidence the FDA would accept for evaluating drugs, procedures, etc.
Nevertheless, it is still the case that most scientists do not use Bayesian analysis, have never been taught it, and do not understand what it is (if they're even aware of it). There is no particular movement (as far as I can see) in the statistics establishment to change their standard curriculum, partly because what people in certain fields are expected to learn is now entrenched, and, one assumes, because their field has been self-selecting for those who enjoy proving things about estimators for the past century. And that, dear reader, is why you may be first hearing about this stuff in an upper-level graduate physics course, despite the fact that the content is not physics, and not even all that advanced in the grand scheme of things.
With that bumbling and not complimentary introduction, we're now going to see what frequentism is all about. Why? Because it's good to understand what people who aren't on the Bayes train are talking about when they write their own papers. It's also useful to know when frequentist results can be interpreted as Bayesian in certain limits.
To begin with, let's remember what "Bayesianism", which frequentism is very much defined in opposition to, says. At its core, Bayes' Law provides a principled mathematical framework for drawing conclusions from data. To accomplish this, it is necessary to adopt a mathematical language for "conclusions" that are less certain than pure logical deductions; probability provides that language. We haven't (and won't) dive into what might be called the philosophical underpinnings of probability theory, but they boil down to the propositions that probabilities must be well behaved mathematically and comport with common sense in the limits where they should do so (see this if interested in equations; also note that our definition includes the frequentist definition below as a consequence, while being broader). Once we have adopted mathematical probability as the language of uncertainty, Bayes' Law follows immediately. In practice, once we have enumerated the state of our knowledge a priori and constructed a generative model for the data being considered, the conclusions also follow simply and transparently.
The defining proposition of frequentism is that probabilities used in data analysis may only describe the limiting frequencies of different possible measurements under a counterfactual infinite repetition of equivalent but independent experiments. That is, of the various ingredients we have used in Bayesian analysis, only the sampling distribution is allowed. The central conceit of frequentism is that this "lets the data speak for themselves", independent of assumptions. However, this comes at the cost of the simple and transparent connection between measurement and conclusion encoded in Bayes' Law.
Instead, the frequentist approach is to define estimators - functions of the data - of the things we would like to learn. Because the estimators are functions of the data, it's permitted to make probabilistic statements about them. The art of frequentism is in finding estimators that can be proven to lie near the true value of what's being estimated, in the sense that the distribution of estimators computed from many independent but otherwise equivalent data sets is peaked near the true value. This naturally leads to a proliferation of estimators with different descriptors (unbiased, minimum-variance, efficient, ...) for different quantities estimated from different types of data. One example:
And this is just for one prototypical analysis (fitting a line to $x,y$ data). The nested epicycles of Ptolemaic astronomy come to mind.
For all of these estimators, we need to not only prove something about the distribution of estimators peaking near the truth, but also the width of that distribution. On that basis, we would provide confidence (as opposed to credible) intervals, saying that, over many hypothetical data sets, the confidence interval contains the true value X% of the time (we'll recap the principal differences between the formalisms below in "Frequentism redux", so don't worry if the credible/confidence distinction isn't clear yet). And, as you might guess, many of the things that get proved are in the form of inequalities, or hold only in the limit of large data sets, or come with other caveats and assumptions.
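To make that picture concrete, here's a minimal simulation sketch (all numbers invented: Gaussian data with a known width, using the sample mean as the estimator) showing what "the confidence interval contains the true value X% of the time" means in practice:

import numpy as np

# Simulate many independent, equivalent data sets; from each, compute an estimator
# (the sample mean) and a 95% confidence interval, then count how often the
# interval contains the true value. All numbers here are made up for illustration.
rng = np.random.default_rng(1)
truth, sigma, n_data, n_trials = 5.0, 2.0, 25, 10000

covered = 0
for _ in range(n_trials):
    data = rng.normal(truth, sigma, n_data)
    estimate = data.mean()                      # the estimator
    halfwidth = 1.96 * sigma / np.sqrt(n_data)  # 95% interval half-width (known sigma)
    covered += (estimate - halfwidth < truth < estimate + halfwidth)

print(covered / n_trials)   # should come out close to 0.95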
To be sure, each of the bullets above corresponds to a different generative model in Bayesian analysis, but that is the only thing that distinguishes them; each generalization does not require the method to be reinvented. We will see all of these complicating factors again later; they're in no way contrived.$^3$
More fundamentally, the frequentist approach sets up the user to be misled. Unlike the Bayesian formalism, frequentism doesn't say what conclusion follows from the data - that would be a forbidden probabilistic statement about something other than data. In the case of constraining a continuous parameter, what we get is a confidence interval, from which we can, strictly speaking, conclude nothing about the parameter's true value. All we can say is that if we independently ran the experiment infinitely many times, and the assumed model were correct, then X% of the resulting confidence intervals would contain the truth. For discrete parameters or model comparisons, we would do a hypothesis test, computing the fraction of the time an estimator encoding the difference between the models would be at least as extreme as it was in practice (called a p-value). Again, one is not technically allowed to draw a conclusion from this, only to say what the outcome of the hypothesis test was.
The upshot is that the only way to say we've actually learned anything from a frequentist analysis is to break frequentism's own rules and interpret the confidence interval as saying it contains the truth with some probability, or the p-value as the probability that one model vs another is right. Is that so terrible? It can be, yes. Let's see how.
Recall the very first example of Bayesian inference we looked at:
Imagine you're being screened for an uncommon disease or condition using a test that is known to produce false positives only 1% of the time and false negatives only 0.01% of the time, and your test is positive. A false positive rate of 1% means that 99% of the test's positive results are correct and only 1% of them are incorrect, so you're probably sick, right?
We sketched out a model with parameters $\theta$ (whether or not we are sick), $r_\mathrm{fp}$ (the false positive rate), and $r_\mathrm{fn}$ (the false negative rate),
and wrote out the sampling distribution/likelihood function for the real data, $T=1$,
$P(T=1|\theta) = \theta(1-r_\mathrm{fn}) + (1-\theta)r_\mathrm{fp}$.
Let's spell this out for the two possible values of $\theta$, where we can think of $\theta=0$ as the null hypothesis (not sick) and $\theta=1$ as the alternative hypothesis (sick).
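$P(T=1|\theta=0) = r_\mathrm{fp} = 0.01$, while $P(T=1|\theta=1) = 1-r_\mathrm{fn} = 0.9999$.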
In this simple problem, it would make sense for $T$ to be our estimator for $\theta$ (it isn't clear what other function of data we could choose). In that case, $P(T=1|\theta=0) = 0.01$ is the p-value for comparing the two hypotheses, the probability of getting our data (no "more extreme" values of $T$ are possible here) under the null hypothesis. Since the p-value is small (a typical threshold for considering a hypothesis test to be significant is 0.05), we would reject the null hypothesis of not being sick. We could claim that we're only dispassionately accepting or rejecting a hypothesis based on an estimator, not concluding anything about the probability that $\theta=1$. But real life defies this kind of pedantry: we would probably need to make some kind of decision based on whether we were sick or not, so for all intents and purposes we are drawing a conclusion about the likely value of $\theta$. Unlike the Bayesian case, we are left to draw our conclusion on an ad hoc basis, unsupported by the formalism we're using.
What's missing from all this is the rest of the generative model that we naturally took into consideration in the Bayesian analysis. Recall that we found the posterior probability of $\theta$ to be highly dependent on the prior, which might reflect how widespread the illness or condition was overall, and/or other information about the patient. In particular, in a "standard screening" type situation where the condition is rare to begin with, it was entirely possible for $\theta=1$ to be very unlikely even in the case of a positive test. On the other hand, if we were showing symptoms that were very specific to the condition in question, the appropriate prior might lead to a higher probability of $\theta=1$, leading us to prefer it over $\theta=0$.
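As a minimal sketch of that prior dependence (the 0.1% prevalence below is a number assumed here purely for illustration, not one given above):

r_fp = 0.01      # false positive rate, from the setup above
r_fn = 1e-4      # false negative rate (0.01%)
prior = 1e-3     # assumed prior probability of being sick (hypothetical prevalence)

# Sampling distribution P(T=1|theta), as written above
p_pos_sick = 1.0 - r_fn   # P(T=1|theta=1)
p_pos_well = r_fp         # P(T=1|theta=0)

# Bayes' Law: P(theta=1|T=1)
posterior = prior * p_pos_sick / (prior * p_pos_sick + (1.0 - prior) * p_pos_well)
print(posterior)   # about 0.09: a positive test is far from conclusive with this prior

Changing prior to something larger (say 0.5, if very specific symptoms were already present) pushes the posterior close to 1, which is the point: the conclusion legitimately depends on the prior information.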
At this point someone familiar with frequentist analysis might be complaining that this is a problem where we should have accounted for the "base rate" of illness in the population. (Yay epicycles!) We used the wrong test for this scenario, so of course one can contrive a situation where the estimator would perform badly. To which the only possible answer is: Yes, exactly! That base rate is what we would call the prior, and we didn't need to recognize this as a special class of analysis problem to know that it should be accounted for. This is a case where the obvious frequentist estimator fails, and so another one needs to be invented; to actually work, the approach needs to be modified to be equivalent to using Bayes' Law - not incidentally - even if this is obscured by the "base rate" terminology.
This is a common theme: in cases where we could have gotten away with using weak priors, frequentism works well and is approximately equivalent to Bayes with weak priors. In cases where one would need informative priors to get something sensible, frequentism needs to be specially adjusted until it's... approximately equivalent to Bayes with informative priors.
So why all the fuss if things will wind up being equivalent? Well, recall that we began the course not with Bayes' Law but with generative models. This is not typical in frequentism, where you get what you're given: a menu of estimators and techniques for particular parameters and particular models. In other words, with Bayes' Law we can, in principle, work things out for any model we care to specify. Within frequentism, we can work things out for any model where someone has put in the effort to define good estimators and prove things about them, so long as we don't stray from the situations where they're valid. In real life, at least in my experience, we are rarely in territory where specific, well established estimators apply. Which brings us to the uber-example of frequentism approaching Bayes in the search for correctness...
Within frequentism, we can choose our estimator(s) to be the value(s) of the parameter(s) that (jointly) maximize the likelihood function given the data. This has a number of advantages. It's intuitive. It can be applied in principle no matter what the likelihood function is, so we don't need to go searching for specifically derived estimators for a given problem, though we do need to be willing to deal with numerical optimization of an arbitrary function. It provides a prescription for jointly estimating multiple parameters of a model, which is otherwise generally more challenging than inventing estimators for individual parameters. And, via math, the maximum likelihood estimate (MLE) is proven to have various nice properties (though these do not include being unbiased in general), which imply that, for sufficiently large data sets, the distribution of the MLE approaches a multivariate Gaussian centered near the true parameter values.
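As a minimal sketch (with made-up Gaussian data; any likelihood we can evaluate would do), finding the MLE numerically just means handing $-\ln\mathcal{L}$ to an off-the-shelf optimizer:

import numpy as np
from scipy import optimize, stats

# Made-up data: 100 draws from a Gaussian with "true" mean 3.0 and width 1.5
rng = np.random.default_rng(42)
data = rng.normal(loc=3.0, scale=1.5, size=100)

# -ln(likelihood) for a 2-parameter Gaussian model; minimizing this maximizes L
def minus_lnlike(params):
    mean, sigma = params
    if sigma <= 0.0:
        return np.inf   # keep the optimizer away from unphysical widths
    return -np.sum(stats.norm.logpdf(data, loc=mean, scale=sigma))

mle = optimize.minimize(minus_lnlike, x0=[0.0, 1.0], method='Nelder-Mead').x
print(mle)   # should land near (3.0, 1.5), up to the noise in this particular data set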
The asymptotic Gaussianity of the MLE provides a neat and general way to estimate confidence intervals and regions in maximum likelihood. The math is not entirely terrible, so let's sketch it out. I'll specialize to the case where the likelihood is uncorrelated (i.e. we have a sum below instead of vectors multiplying a covariance matrix), which I think is easier to follow. We have a likelihood function
$\mathcal{L}(\theta) \propto \exp\left[ \sum_i \frac{-(\theta_i-\mu_i)^2}{2\sigma_i^2} \right]$,
where $\mu_i$ is the MLE of the $i$th parameter, and the $\sigma_i$ set the width of the Gaussian. (We don't know what the $\sigma_i$ are a priori, but that isn't important for this sketch. We could always estimate them numerically by evaluating the likelihood in the neighborhood of its maximum, which we have probably done as part of finding the maximum to begin with.) Equivalently, the function
$S(\theta) = -2 \ln \mathcal{L}(\theta) = \sum_i \frac{(\theta_i-\mu_i)^2}{\sigma_i^2} +$ const
contains a sum that looks a lot like a chi-square. We could make this exact, to the extent that the likelihood was exactly Gaussian, by referencing $S$ to its value at the MLE,
$S(\theta) - S(\mu) = \sum_i \frac{(\theta_i-\mu_i)^2}{\sigma_i^2}$.
Being in frequentist mode, the picture here is that $\mu$ (our estimator) is displaced from the true parameter values by a Gaussian PDF. It isn't instantly obvious, but via fancy math it turns out that this PDF has the same width as the likelihood function. Since the MLE is our best estimate for the true parameters, our joint X% confidence region over $\nu$ parameters will be defined as the region where the $\nu$-dimensional projection of the likelihood integrates to X%.
We could calculate this by using the curvature of the likelihood near the MLE to estimate the $\sigma_i$. It turns out that we can get more accurate confidence regions (in the sense that they hold up better when the likelihood is not perfectly Gaussian) by following a slightly different procedure. Specifically, the X% confidence region is defined as the set of $\theta$ satisfying
$S(\theta) - S(\mu) < F^{-1}_{\chi^2}(X\%|\nu)$,
where $S(\mu)$ by definition is the minimum achievable value of $S$ and $F^{-1}_{\chi^2}(\cdot|\nu)$ is the quantile function of the $\chi^2$ distribution with $\nu$ degrees of freedom (Essential Probability notes!).$^4$
If we want to define a 1-parameter X% confidence interval over $\theta_0$, we need to find the range of $\theta_0$ satisfying
$S_0(\theta_0) - S(\mu) = \min_{\theta_1,\theta_2,\ldots} S(\theta_0,\theta_1,\theta_2,\ldots) - S(\mu) < F^{-1}_{\chi^2}(X\%|1)$,
where $\nu$ has become 1 (for a single parameter), and $S_0(\theta_0)$ is the minimum value of $S(\theta)$ achieved for the parameter value $\theta_0$ (i.e. $S_0$ is the function of $\theta_0$ formed by minimizing $S$ over all other parameters for the given $\theta_0$). You can think of each parameter being minimized in $S_0$ as taking away 1 degree of freedom from $\nu$, until just 1 is left in the case of a 1-parameter confidence interval.
Analogously, for a 2D confidence region for $\theta_0$ and $\theta_1$, we find the region in the $\theta_0$-$\theta_1$ plane satisfying
$S_{0,1}(\theta_0,\theta_1) - S(\mu) = \min_{\theta_2,\ldots} S(\theta_0,\theta_1,\theta_2,\ldots) - S(\mu) < F^{-1}_{\chi^2}(X\%|2)$,
where now $S_{0,1}$ is minimized over the remaining parameters for fixed $\theta_0$ and $\theta_1$, as a function of those 2 parameters.
Notice that the quantile function on the right hand side depends on the dimensionality of the confidence region we're looking for, even if the confidence level (X%) is the same. The quantiles corresponding to the canonical confidence levels may be familiar if you've dabbled in maximum likelihood before:
import numpy as np
import scipy.stats as st

# "number of sigma" confidence levels: 1, 2 and 3 sigma
n = np.arange(1.0, 4.0)
# the corresponding confidence levels, from the 1-parameter Gaussian case
P = st.chi2.cdf(n**2, 1)
# thresholds on S(theta)-S(mu) for 1D intervals and 2D regions at those levels
q1 = st.chi2.ppf(P, 1)
q2 = st.chi2.ppf(P, 2)
print('For "confidence levels"', P)
print('corresponding to "number of sigma" confidence', n)
print('the S-Smin thresholds for 1D confidence intervals are:', q1)
print('and the S-Smin thresholds for 2D confidence regions are:', q2)
For "confidence levels" [0.68268949 0.95449974 0.9973002 ] corresponding to "number of sigma" confidence [1. 2. 3.] the S-Smin thresholds for 1D confidence intervals are: [1. 4. 9.] and the S-Smin thresholds for 2D confidence regions are: [ 2.29574893 6.18007431 11.82915808]
This method for finding confidence regions isn't valid when the MLE is squashed up against the upper or lower limit of the values it's allowed to take$^5$ (nor is the other method mentioned above), but otherwise works pretty well. Along with the MLE itself, it gives us a way to avoid inventing new estimators for every model we might want to fit, within frequentism. The downside is computational complexity. For models with more than a couple parameters, if we want to find confidence intervals and regions, we are going to end up doing a lot of minimizations - and when the likelihood is not, in fact, all that Gaussian, getting these to converge can be numerically troublesome.
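To make the recipe above concrete, here's a minimal sketch using a toy 2-parameter Gaussian likelihood to stand in for $S(\theta)$; the numbers are invented, and in a real problem $S$ would come from an actual likelihood function:

import numpy as np
from scipy import optimize, stats

# Toy stand-in for S(theta)-S(mu): a correlated 2-parameter Gaussian "likelihood"
# with an invented MLE (mu) and parameter covariance (cov)
mu = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
icov = np.linalg.inv(cov)

def deltaS(theta):
    d = np.asarray(theta) - mu
    return d @ icov @ d

# Profile over theta_1 for fixed theta_0, i.e. S_0(theta_0) - S(mu)
def deltaS0(theta0):
    return optimize.minimize_scalar(lambda t1: deltaS([theta0, t1])).fun

# 68.3% ("1 sigma") threshold for a 1-parameter confidence interval
threshold = stats.chi2.ppf(0.683, 1)

# Interval endpoints: where the profiled S crosses the threshold on either side of the MLE
lo = optimize.brentq(lambda t0: deltaS0(t0) - threshold, mu[0] - 5.0, mu[0])
hi = optimize.brentq(lambda t0: deltaS0(t0) - threshold, mu[0], mu[0] + 5.0)
print(lo, hi)   # for this Gaussian toy, mu[0] +/- sqrt(cov[0,0]), i.e. about (0, 2)

Even in this tiny example, every point explored along $\theta_0$ costs a minimization over the remaining parameter(s), which is the computational overhead referred to above.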
To recap, the essential frequentist methodologies are:
- Classic frequentism (not recommended)
- Maximum likelihood
It probably wouldn't be accurate to say that there's a healthy debate between frequentist and Bayesian adherents in physics or astrophysics. In my limited experience, the objections to Bayes from within our field are overwhelmingly based on a misunderstanding of how it works, or a fundamental disagreement on the job of an experimental scientist. (I always thought it was to draw conclusions from data, which is why this course is what it is.)
However, if you'd like a recap of some of the arguments that normally get made on either side, I recommend looking over these pro- and anti-Bayes arguments by Gelman and a response by Skilling. (Warning: Gelman is a social scientist, so his writing contains some jargon and some particular concerns that are not so relevant to us. On the other hand, it's a good reminder that not everyone lives in a world where models are physically motivated.)
Why did we go into all that detail about a method we don't recommend? Two reasons.
First, maximum likelihood is so widely used that you must understand it. In your field, if someone isn't using Bayes, they are (hopefully) using maximum likelihood.
Second, step back and gaze upon the previous section. Our MLE is the maximum of the likelihood function in parameter space. If we put our Bayes hats on and are willing to live with improper uniform priors, the mode of the posterior distribution will be the MLE. Furthermore, again with uniform priors, the method above for finding confidence regions would also be an approximate (to the extent that it requires approximate normality of the likelihood) way to find credible regions. In practice (caveat: my limited practice), for reasonably large data sets, it produces regions and intervals that agree with credible regions and intervals quite well, even when the likelihood is clearly not Gaussian.
Having said that, this equivalence doesn't often present much benefit to someone with the fortitude to do Bayes, even just in terms of computational efficiency. In the general case, maximum likelihood involves doing a large number of partial minimizations of the likelihood, which means a large number of evaluations of the likelihood. In the same general case, using Bayes' Law will mean generating Monte Carlo samples from the posterior, which means a large number of evaluations of the likelihood. However, once we have our samples, we get to use them all to produce any credible intervals/regions we want to; on the other hand, likelihood evaluations contributing to finding a confidence interval for one parameter aren't generally useful for finding a confidence interval for a different parameter. Not to mention the potential complexity (and bookkeeping) involved in trying to efficiently map out the boundary of an arbitrarily shaped confidence region. The two approaches become equivalent in complexity if we go to the extreme of evaluating the likelihood over a many-dimensional grid, and using this grid to do our frequentist minimizations or Bayesian integrals, but this is usually the least computationally efficient method available.
Of course, there are specific problems where the MLE and confidence intervals can be found algebraically rather than by numerical minimization - the ordinary least squares method for fitting linear models is a good example of this. Similarly, there are cases where the MLE may not be algebraic, but can still be generated rapidly by some highly specialized and optimized code written to do the frequentist analysis. Every so often, a properly complex generative model turns out to factorize in such a way that we can take advantage of these tools for part of the Bayesian computation, if we choose to (we may see an example in later tutorials).
$^1$ The role of several Names in establishing and legitimizing certain abhorrent theories of the late 19th and early 20th centuries should probably be mentioned somewhere. Not necessarily vis-à-vis any notions about prior assumptions or the availability of post-facto reasoning in their methodology. Just in general.
$^2$ People sometimes get the impression that computational tractability explains the early adoption of frequentist methods (i.e. before personal computers were common). I personally don't think this holds up. Lack of computing power means that approximations need to be made regardless of whether one subscribes to frequentism or Bayesianism. We've already seen simple cases where Bayes' Law provides an analytic posterior, and the same can be said for the usual simple examples of frequentist analysis (i.e. those not requiring digital computing). The frequentist solution for complex models (Maximum Likelihood, discussed later) is no less computationally intensive than the Bayesian approach.
$^3$ I can't resist sharing this tidbit from my grad school days (i.e., once upon a time), when I took a few statistics courses in the actual Statistics department. In the first or second lecture of a Ph.D.-level course on the generalized linear model, having just finished a theoretical overview, the professor made remarks along the lines of the following (paraphrased from memory, with emphasis added):
So we have this very elegant framework that works for a model which is linear in the parameters, where the covariates are fixed, and where the responses are normally distributed about the curve. That's not to say that there aren't other approaches; some people would say that we should put all this aside and do Bayesian analysis all the time. But then we'd have to make lots of assumptions.
I don't believe he was intentionally being ironic. (Or, if he was, he managed to be perfectly deadpan. And then, I presume, went on to teach a whole class on the GLM anyway. I didn't hang around, having developed an allergy to problem sets by this time.)
Not for nothing did a T-shirt worn by a different statistics professor, teaching a course on ANOVA no less, read as follows:
$^4$ This last statement may not be intuitive, but it's easy(ish?) to see in the case where the likelihood really is Gaussian. Consider the 1-parameter case, where the likelihood is a Gaussian with mean $\mu$ and width $\sigma$, or, equivalently, a standard normal PDF in $Z=(\theta-\mu)/\sigma$. Our 68.3% confidence interval will be from $Z=-1$ to $Z=+1$, since
$P(-1<Z<1) = F_\mathrm{Normal}(+1|0,1)-F_\mathrm{Normal}(-1|0,1) \approx 0.683$,
where $F_\mathrm{Normal}(Z|\mu,\sigma)$ is the Gaussian cumulative distribution function (CDF). Remembering that capital-$P$ probability transforms straightforwardly (unlike probability density), we can write
$P(-1<Z<1) = P(Z^2<1)$;
more generally, we could say
$P(-y<Z<y) = P(Z^2<y^2)$.
Because our MLE was normally distributed to begin with, $Z^2$ will be the square of a standard normal variable, which is by definition a $\chi^2$-distributed variable with 1 degree of freedom. In the more general case, $Z^2$ is replaced by $S(\theta) - S(\mu)$ (cf. the definition of $S$), so one way of identifying a region where our likelihood integrates to a given value (lhs of the equation above) is to find the region where $S(\theta) - S(\mu)$ is less than an appropriately chosen threshold, $y^2$ (rhs above).
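A quick numerical check of this equivalence, using nothing beyond scipy:

from scipy import stats

# P(-1 < Z < 1) for a standard normal variable...
print(stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))   # ~0.6827
# ...equals P(Z^2 < 1) for a chi-square variable with 1 degree of freedom
print(stats.chi2.cdf(1.0, 1))                       # ~0.6827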
$^5$ That is, the right-hand case illustrated in the Credible Regions notes. You can see this failure mode in action in the tutorial associated with these notes.