Tutorial: Bayes' Law¶
In this problem, we'll work through a simple application of Bayes' Law, both analytically and numerically (on a grid).
You will learn to:
- use conjugacy relations to find an analytic form for a posterior
- compute a posterior on a grid (brute force)
- apply Bayes' Law to incorporate new data
from os import getcwd
from yaml import safe_load
import numpy as np
import scipy.stats as st
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
thisTutorial = 'bayes_law'
if getcwd() == '/content':
    # assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
    from google.colab import drive
    drive.mount('/content/drive')
    datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
    # assume we are running locally somewhere and have the data under ./data/
    datapath = 'data/'
Background¶
We'll re-examine the first example in the Generative Models tutorial. We're interested in knowing what fraction, $f$, of galaxy clusters in the Universe are morphologically relaxed (hence relatively close to equilibrium). We can think of $f$ equivalently as the a priori probability that a randomly chosen, or newly discovered, cluster is relaxed.
From X-ray imaging of 361 clusters, which we will assume are representative, 57 were found to be morphologically relaxed according to our metric. We will use this information to constrain $f$. Our data could be laboriously written as
$X_1,X_2,\ldots,X_{361}$,
where 57 $X$'s are 1 (relaxed) and 304 are 0 (unrelaxed). Equivalently we can just speak of $s=57$ "successes" out of $n=361$ trials.
But that was real life. To make things more fun, you will be using a randomly generated data set with different values of $n$ and $s$, as read in below.
data = safe_load(open(datapath+'data.yaml', 'r').read())
data
{'n': 341, 's': 158}
We write Bayes' Law for this application as
$p(f|s) = \frac{p(f)P(s|f)}{P(s)}$.
First, as always, we choose a prior. Clearly, by definition, $0\leq f \leq 1$. Without thinking too hard about it, let's take $p(f)$ to be uniform over this interval.
As the problem is defined, the sampling distribution, $P(s|f)$, is binomial:
$P(s|f) = {n \choose s} f^s (1-f)^{n-s}$.
Solution and implementation¶
It's always nice to have a simple, analytic way of expressing our solution, and this problem happens to provide one. Namely, the binomial distribution is conjugate to the beta distribution. Not only that, but the uniform prior chosen above happens to be reproduced for a particular choice of the beta distribution's parameters, $\alpha$ and $\beta$. You can consult Wikipedia to find the conjugacy relation itself, and follow links from there to see the form of the beta distribution.
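For reference (you can also read this off the Wikipedia page, and you'll prove it in a moment), the conjugacy relation says that a beta prior combined with a binomial sampling distribution yields a beta posterior with updated parameters:

$p(f|s) = \mathrm{Beta}(f \,|\, \alpha+s, \, \beta+n-s)$.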
To be submitted on Canvas
Before going on, prove the conjugacy relation. Seriously, it's good practice. Multiply $p(f)$ and $P(s|f)$, and simplify until you obtain a normalized posterior distribution expressed in a standard form. Note that you are allowed to multiply by any constants you want - by definition, whatever constant is needed to produce a normalized PDF must be $1/P(s)$. Put another way, you can ignore terms that don't depend on $f$, and just determine which distribution the functional form of the posterior corresponds to, and what its parameters are.
Now that you're extremely familiar with the beta distribution, fill in the values of the prior hyperparameters that reproduce the uniform distribution on [0,1].
# prior_params = {'alpha':..., 'beta':...}
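If you get stuck, here is one possible assignment (a sketch; use whatever you derived above). It relies on the fact that a beta distribution with $\alpha=\beta=1$ reduces to the uniform distribution on [0,1].

prior_params = {'alpha': 1.0, 'beta': 1.0}  # Beta(1,1) is uniform on [0,1]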
Prior¶
Fill in the cells below to provide the function that evaluates the prior (you can call the scipy.stats implementation of the distribution, or spell out the density) and define prior hyperparameter values that make the prior uniform.
def prior(f, alpha, beta):
    # return the prior density evaluated at f, p(f)
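One possible implementation is sketched below, using the scipy.stats beta distribution (you could equally well spell out the density by hand):

def prior(f, alpha, beta):
    # beta density evaluated at f; with alpha=beta=1 this is uniform on [0,1]
    return st.beta.pdf(f, alpha, beta)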
Excellent! Let's now plot the prior with your chosen parameters, to verify that it's the same as a uniform distribution.
fgrid = np.linspace(0.0, 1.0, 500)
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(fgrid, prior(fgrid, **prior_params), 'k-');
plt.xlabel('f');
plt.ylabel('p(f)');
Likelihood¶
Next, code up the likelihood function. Recall that this is just the sampling distribution evaluated as a function of the model parameters, with the data fixed.
# here f would be our model parameter, and s and n would be from our data dictionary
def likelihood(f, s, n):
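A minimal sketch using scipy.stats.binom (note scipy's argument order: the PMF is evaluated at s, with n and f as the distribution's parameters):

def likelihood(f, s, n):
    # binomial sampling distribution P(s|f), viewed as a function of f with s and n fixed
    return st.binom.pmf(s, n, f)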
Again, let's take a look.
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(fgrid, prior(fgrid, **prior_params), 'k-', label='p(f)');
plt.plot(fgrid, likelihood(fgrid, **data), 'b-', label='P(s|f)');
plt.xlabel('f');
plt.legend();
Hold on! You didn't (necessarily) make a mistake. Remember that the likelihood function is not a PDF - it is not normalized over $f$ for a given $s$. So there is no reason the normalizations of the two curves above should be comparable.
In fact, run the cell below to numerically verify that $P(s|f)$ is not normalized over $f$.
print('This does NOT need to be 1.0:', np.trapezoid(likelihood(fgrid, **data), x=fgrid))
This does NOT need to be 1.0: 0.002923976608187133
On the other hand, in its guise as the sampling distribution, it is normalized over $s$ for a given $f$. Verify this for the arbitrary choice of $f$ below, or some other value(s) of your choice.
sgrid = np.arange(0.0, data['n']+1) # all possible values of `s`: 0, 1, ..., n
test_f = np.pi/10.0 # an arbitrary choice of `f`
test_integral = likelihood(test_f, sgrid, data['n']).sum() # sum of P(s|f) over all s
print('This had better be exactly 1.0 (to within numerical error):', test_integral)
This had better be exactly 1.0 (to within numerical error): 1.0000000000000004
# test the above claim
assert np.isclose(test_integral, 1.0)
print('Yay!')
Yay!
Just for fun, here is what $P(s|f)$ looks like as a function of $s$ (i.e. as a sampling distribution) for a few different $f$'s.
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(sgrid, likelihood(0.01, sgrid, data['n']), 'b.', label='f=0.01');
plt.plot(sgrid, likelihood(0.1, sgrid, data['n']), 'r.', label='f=0.1');
plt.plot(sgrid, likelihood(0.7, sgrid, data['n']), 'k.', label='f=0.7');
plt.xlabel('s');
plt.ylabel('P(s|f)');
plt.legend();
Posterior¶
Moving on, code up the posterior distribution. Since its functional form is the same as that of the prior, we can go ahead and use the same Python function.
posterior = prior # you can have this one
Then all we need is a function that determines the posterior distribution's parameters given the prior and data.
def get_post_params(alpha, beta, n, s):
    # Note: the first arguments are the prior's parameters.
    # Return the posterior parameters as a dictionary.
    # Since the posterior distribution (function) is the same as the prior distribution
    # (function), this dictionary's keys should be the same as "prior_params" has.
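A sketch based on the conjugacy relation quoted earlier (posterior parameters $\alpha+s$ and $\beta+n-s$):

def get_post_params(alpha, beta, n, s):
    # conjugate update: Beta(alpha, beta) prior + binomial data -> Beta(alpha+s, beta+n-s)
    return {'alpha': alpha + s, 'beta': beta + n - s}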
See what this looks like:
post_params = get_post_params(**prior_params, **data)
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(fgrid, prior(fgrid, **prior_params), 'k-', label='p(f)');
plt.plot(fgrid, likelihood(fgrid, **data), 'b-', label='p(s|f)');
plt.plot(fgrid, posterior(fgrid, **post_params), 'r-', label='p(f|s)');
plt.xlabel('f');
plt.legend();
# hidden test of your posterior parameter values
Comparison with brute force¶
Just to drive the point home, let's imagine we didn't recognize this problem as conjugate. Evaluate the posterior over fgrid by brute force (i.e. multiplying the likelihood and prior), and we'll compare. Don't forget to normalize.
# post_fgrid = ...posterior evaluated at f = "fgrid"
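One possible sketch: evaluate prior times likelihood on the grid, then normalize numerically with the trapezoid rule.

post_fgrid = prior(fgrid, **prior_params) * likelihood(fgrid, **data)
post_fgrid /= np.trapezoid(post_fgrid, x=fgrid)  # normalize so it integrates to 1 over f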
Plot it along with the conjugate solution:
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(fgrid, posterior(fgrid, **post_params), 'r-', label='analytic');
plt.plot(fgrid, post_fgrid, 'b.', label='grid');
plt.xlabel('f');
plt.ylabel('p(f|s)');
plt.legend();
# test whether these agree
assert np.allclose(posterior(fgrid, **post_params), post_fgrid)
print('Yay!')
Yay!
You might be thinking that all this messing around with conjugacy relations is silly, and in this simple case, which takes very little time to evaluate numerically, that's fair enough. While conjugacy doesn't apply to every problem, there are a few reasons it's worth knowing about for those times that it is a viable strategy:
- Since the posterior has a well known functional form, we instantly know its mean, median, mode, variance, skewness, kurtosis, etc., etc., etc. to arbitrary precision - things that we might be interested in that would be more annoying to estimate numerically. They are simple functions of the distribution's parameters that we can look up.
- When dealing with multi-parameter distributions and/or large amounts of data, leaping straight to the final answer (after, at most, some linear algebra) can sometimes represent a significant speed-up over more brute-force methods.
Summarizing the constraint on $f$¶
While the posterior for $f$, evaluated one of the ways above, in principle is the entire answer, we normally want to summarize the constraint in terms of a best value and an interval. You'll see more about this in the Credible Region notes and tutorial. For now, we'll go ahead and summarize the constraint according to one convention for doing so.
Write a function that finds the median, 15.85th percentile and 84.15th percentile of the posterior distribution for $f$. Do this based on the parameters of the conjugate posterior, which will let you take advantage of scipy functions (e.g. scipy.stats.<distribution>.median).
def summarize_posterior(post_params):
    # Find the 50th, 15.85th and 84.15th percentiles of the posterior distribution.
    # Return these as a numpy array with shape (3,), in that order.
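A sketch using the beta distribution's percent point function (inverse CDF) in scipy.stats:

def summarize_posterior(post_params):
    # 50th (median), 15.85th and 84.15th percentiles of the beta posterior
    return st.beta.ppf([0.50, 0.1585, 0.8415], post_params['alpha'], post_params['beta'])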
print('My best fit and interval:', summarize_posterior(post_params))
print('**Checkpoint:** ', np.array(safe_load(open(datapath+'interval_checkpoint.yaml', 'r').read())))
My best fit and interval: [0.46348594 0.43661446 0.49049967]
**Checkpoint:**  [0.46348594 0.43661446 0.49049967]
Updating with new data¶
Suppose we get more data:
data2 = safe_load(open(datapath+'data2.yaml', 'r').read())
data2
{'n': 249, 's': 117}
Use your get_post_params function to compute the posterior parameters for
- the second data set combined with the original prior (not using the first data set), and
- the combination of both data sets and the original prior, by using the posterior from the first data set as the prior distribution for an analysis of the second data set.
# post_params_2 = posterior from just the second data set
# post_params_both = posterior from both data sets
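One possible sketch, reusing get_post_params for both updates (the second call uses the posterior from the first data set as its prior):

post_params_2 = get_post_params(**prior_params, **data2)    # original prior + second data set only
post_params_both = get_post_params(**post_params, **data2)  # posterior from first data set, updated with the second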
print('posterior parameters from just the second data set', post_params_2)
print('posterior parameters from both data sets', post_params_both)
posterior parameters from just the second data set {'alpha': 118.0, 'beta': 133.0}
posterior parameters from both data sets {'alpha': 276.0, 'beta': 316.0}
Note that you should get the same result for the second case as you would by analyzing the original prior with the combined data set, i.e. with $s_1+s_2$ successes in $n_1+n_2$ trials. This should be clear from the way the posterior parameters depend on $n$ and $s$ in this particular conjugacy relation, and is hopefully intuitive for a binomial experiment.
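A quick check of that claim (a sketch; data_combined is just a hypothetical name for the pooled data, and the result should equal post_params_both):

data_combined = {'n': data['n'] + data2['n'], 's': data['s'] + data2['s']}  # pooled data
print(get_post_params(**prior_params, **data_combined))  # should match post_params_both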
Here's how the posteriors from the individual data sets compare with the prior:
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(fgrid, prior(fgrid, **prior_params), 'k-', label='p(f)');
plt.plot(fgrid, posterior(fgrid, **post_params), 'r-', label='p(f|s_1)');
plt.plot(fgrid, posterior(fgrid, **post_params_2), 'b-', label='p(f|s_2)');
plt.xlabel('f');
plt.legend();
This visualizes the accumulation of data, as we add the first and then the second data set to the prior information:
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.plot(fgrid, prior(fgrid, **prior_params), 'k-', label=r'$p(f)$');
plt.plot(fgrid, posterior(fgrid, **post_params), 'r-', label=r'$p(f|s_1)$');
plt.plot(fgrid, posterior(fgrid, **post_params_both), 'b-', label=r'$p(f|s_1,s_2)$');
plt.xlabel('f');
plt.legend();
Finally, compare the constraints:
print('Just data1:', summarize_posterior(post_params))
print('Just data2:', summarize_posterior(post_params_2))
print('Both data1 and data2', summarize_posterior(post_params_both))
Just data1: [0.46348594 0.43661446 0.49049967]
Just data2: [0.47004003 0.43859668 0.50164294]
Both data1 and data2 [0.46617815 0.44570036 0.48673225]
Check the goodness of fit¶
Because we have absorbed the pointed lessons in the notes so well, we will finish by comparing posterior predictions of the fitted model with the data. That is, we want to make sure that the model, for the posterior distribution we've ended up with, could plausibly have produced the observed data. We'll use the posterior as updated by data2 above, i.e. parametrized by post_params_both. Because the binomial distribution is additive (for the same $f$ parameter), we can compare this with the combined data:
data_both = {'n':data['n']+data2['n'], 's':data['s']+data2['s']}
What we want to do in this case is predict the distribution of $s'$ marginalized over the posterior distribution of $f$, where $s'$ represents an equivalent measurement to $s$ (i.e. from the same total number $n$, but obviously not actually the same clusters). In other words, we should compute
$P(s'|s) = \int df \, p(f|s) \, P(s'|f)$
and see how plausible $s$ is compared with this PDF. (What the expression above is and why we care is spelled out more in the Model Evaluation and Comparison I notes, though you don't need to read ahead to complete the rest of this notebook.)
In conjugate problems, this posterior predictive distribution is often also a standard PDF that can be arrived at analytically (it's on the Wikipedia page above, if you want to add a comparison below), but let's instead go through a procedure that will work more generally. Namely, just like in the Bayes' Law notes, we will use the generative model as, well, a generative model, and produce a list of potential data $s'$ from an equally long list of values of $f$ drawn from the posterior distribution. Fill in the code below (you can use scipy.stats.<distribution>.rvs functions to generate random numbers from a PDF).
Npredict = 1000000
# f_from_posterior = ...
# s_post_predicted = ...
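One possible sketch using scipy.stats rvs functions: draw values of $f$ from the beta posterior, then draw one binomial realization of $s'$ for each of them.

f_from_posterior = st.beta.rvs(post_params_both['alpha'], post_params_both['beta'], size=Npredict)
s_post_predicted = st.binom.rvs(data_both['n'], f_from_posterior)  # one s' draw per posterior sample of f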
Below we compare this distribution to the measured $s$:
plt.hist(s_post_predicted, density=True, label='your samples', bins=20);
plt.axvline(x=data_both['s'], label='measured s', color='C1')
plt.xlabel(r'$s\prime$', fontsize='x-large');
plt.ylabel(r'$P(s\prime|s)$', fontsize='x-large');
plt.legend(fontsize='x-large');
It should look very consistent. Of course, this was an extremely simple model - is it even possible for us to get a "bad fit", where the distribution of $s'$ does not comfortably include $s$? The only circumstance I can think of where that would happen is if data and data2 had not come from the same model, that is, if they had been generated by significantly different values of $f$ somehow. Then we might see the distribution above being less consistent or inconsistent with the value of $s$ in data_both, and conclude that our model with a single $f$ doesn't adequately describe all the data. We would hopefully also be able to identify the issue when comparing the two individual posteriors above. We'll cover more quantitative ways to define "consistency" later on, but for now we have a simple, visual check of how acceptable the fit is.
As it happens, the distribution of $s'$ is also a standard one in this case, so we can do a quick quantitative check to ensure that everything holds together above. The cell below compares the mean of your predictions to the mean of the known distribution. (Due to the finite number of samples, we're allowing a healthy amount of error.)
assert np.isclose(s_post_predicted.mean(), data_both['n']*post_params_both['alpha']/(post_params_both['alpha']+post_params_both['beta']), atol=0.1)
print('Yay!')
Yay!