Tutorial: Approximate Methods
In this notebook, you'll practice using the approximate methods covered in the notes. To make it as straightforward as possible, we'll start with the contrived problem of fitting a line to data under conditions where Ordinary Least Squares (OLS) provides an exact solution to compare to. Then we'll move on to the contrived problem of fitting a line and scatter, where suddenly that simple exact solution no longer works.
You will
- implement the simplest possible version of Approximate Bayes,
- see how it works (or doesn't) for different choices of summary statistics and distance metrics,
- perform the Laplace Approximation for comparison,
- perform a frequentist bootstrap with the same model and data
from os import getcwd
from os.path import exists as file_exists
from yaml import safe_load
import numpy as np
from scipy.optimize import minimize
import scipy.stats as st
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
import incredible as cr
import lrgs
thisTutorial = 'approximate_methods'
if getcwd() == '/content':
# assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
from google.colab import drive
drive.mount('/content/drive')
datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
# assume we are running locally somewhere and have the data under ./data/
datapath = 'data/'
Case: ordinary least squares
The classic OLS problem corresponds to the following scenario and assumptions:
- We have data in the form of a list of $(x,y)$ pairs.
- The $x$ values are fixed (we assume no uncertainty in their generation).
- Each $y$ value is independently generated from a linear model: $y_i = a + b x_i + \varepsilon_i$, where $\varepsilon_i$ follows the standard normal distribution (zero mean and unit variance).
- Priors on $a$ and $b$ are both uniform over the real line.
You've probably already done it in a previous exercise, but draw the PGM and write down the generative model for this setup.
Space for your PGM and expressions
As usual, read in your very own data set to work with, stored below as 1D x
and y
arrays.
table = np.loadtxt(datapath+'data.txt')
x = table[:,0]
y = table[:,1]
They look like this:
plt.rcParams['figure.figsize'] = (4.0, 3.0)
plt.plot(x, y, '.');
plt.xlabel("x"); plt.ylabel("y");
As you might know, the strong assumptions behind OLS mean that the posterior can be computed algebraically. This is convenient, as we can easily compute the exact posterior and compare our approximate results to it. The calculation is not especially hard to implement manually, but we will nevertheless let the statsmodels
package do them for us.
model = sm.OLS(y, sm.add_constant(x))
ols = model.fit()
The posterior, $p(a,b|x,y)$, is a 2D Gaussian with mean
ols.params
array([-2.75994154, 0.69626281])
and covariance matrix
ols.normalized_cov_params
array([[ 0.11651851, -0.01751249],
       [-0.01751249,  0.0038586 ]])
np.sqrt(ols.normalized_cov_params.diagonal())
array([0.34134808, 0.06211765])
Below we codify this by storing frozen st.norm
objects representing the 1D marginalized posteriors, and ellipses delineating the 2D credible regions, for ease of comparison later.
param_names = ['a', 'b']
true_post = {'1D':{'a':st.norm(ols.params[0], np.sqrt(ols.normalized_cov_params[0,0])),
'b':st.norm(ols.params[1], np.sqrt(ols.normalized_cov_params[1,1]))},
'2D':{'68.3':cr.cov_ellipse(ols.normalized_cov_params, center=ols.params, level=0.68268949),
'95.4':cr.cov_ellipse(ols.normalized_cov_params, center=ols.params, level=0.95449974)}
}
Laplace Approximation
In this excruciatingly simple example, the true posterior is Gaussian, and so the Laplace Approximation should give us the exact posterior! Let's see.
Recall that to implement the LA, you will need to
- find the maximum of the posterior, and
- estimate the matrix of second derivatives of the log-posterior at that point.
For OLS, statsmodels
effectively did this for us algebraically. Since that isn't an option in general, below you will use scipy.optimize.minimize
to accomplish the same thing. First, we will need a function returning $-\ln p(a,b|x,y)$; minimizing it accomplishes the first step above.
def minus_lnp(params, x, y):
# return -ln(posterior) given params, a 1D np.array [a,b]
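If you get stuck, here is one possible implementation (a sketch): with the improper uniform priors assumed above and unit-variance Gaussian errors, the negative log-posterior is just the negative Gaussian log-likelihood, up to an additive constant that doesn't affect the fit.
def minus_lnp(params, x, y):
    a, b = params
    # unit-variance Gaussian errors; the uniform priors contribute only a constant
    return -np.sum(st.norm.logpdf(y, loc=a + b*x, scale=1.0))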
We also need a guess at the parameter values, for the minimizer to use as a starting point. You could choose something by inspection of the plot above. Or, being confident that the fitter will manage this problem well enough, we could just start at $(0,0)$, which is not a million miles away.
guess = [0.0, 0.0]
Now we turn things over to scipy
. If you're not familiar, there are a handful of algorithms that minimize
can use. We'd like one that internally estimates and uses the second derivatives of the function being minimized, since it means we can just get those derivatives about the minimum from the fitter instead of estimating them ourselves. The default algorithm does this, though note that this is not true of all the options, and the specific method for accessing the derivatives varies (see the documentation for minimize
for details).
LA = minimize(minus_lnp, guess, args=(x,y))
LA
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 43.40431004851482
        x: [-2.760e+00  6.963e-01]
      nit: 3
      jac: [ 0.000e+00  4.768e-07]
 hess_inv: [[ 1.165e-01 -1.751e-02]
            [-1.751e-02  3.859e-03]]
     nfev: 15
     njev: 5
The posterior mode is stored in LA.x
, and should be extremely close to the algebraic solution.
assert np.allclose(LA.x, ols.params)
The inverse-Hessian, which is the posterior covariance matrix of the parameters, is helpfully provided in LA.hess_inv
, and shown above. It, too, should be extremely close to the OLS calculation.
assert np.allclose(LA.hess_inv, ols.normalized_cov_params)
Let's go ahead and visualize the usual 2D credible regions.
plt.rcParams['figure.figsize'] = (4.0, 3.0)
plt.plot(true_post['2D']['68.3'][0], true_post['2D']['68.3'][1], '-', color='b', label='OLS');
plt.plot(true_post['2D']['95.4'][0], true_post['2D']['95.4'][1], '-', color='b');
cr.cov_ellipse(LA.hess_inv, center=LA.x, level=0.68268949, plot=plt, fmt='--', color='r', label="LA");
cr.cov_ellipse(LA.hess_inv, center=LA.x, level=0.95449974, plot=plt, fmt='--', color='r');
plt.xlabel("a"); plt.ylabel("b");
plt.legend();
Approximate Bayesian Computation
Next, we turn to ABC. Below, you'll implement the simplest version of this algorithm:
- Choose a set of summary statistics that encode important features of the data.
- Choose a distance function providing a metric for comparing summary statistics computed from different data, and a distance threshold.
- For many iterations,
- sample a set of parameters from the prior;
- simulate a mock data set from those parameters using the generative model;
- compute the summary statistics from those mock data;
- if the distance between that summary and the summary statistics computed on the real data is smaller than the threshold, accept the parameter sample.
Needless to say, this is not the most efficient version of ABC, but it will serve. However, there is an immediate, practical issue, namely that we chose infinite, uniform priors above in order to make the problem equivalent to OLS. Clearly we can't sample from the full interval $(-\infty,\infty)$. In any case, the credible regions above are tiny compared with the range of finite numbers that we can, in principle, draw from. So a naive implementation can be expected to take an exceptionally long time to produce even a single acceptable sample. (This is a good argument for choosing sensible priors, if you needed another one.)
We will inelegantly work around this issue by remembering that, for parameter estimation, we only really need to consider the prior over the region of parameter space where the posterior will be non-tiny. Seeing how we have an approximate posterior from LA, let's use it to sneakily redefine the functional bounds of the priors for this exercise. Truncating $p(a)$ and $p(b)$ at the 0.0001 and 0.9999 quantiles of their respective marginalized posteriors seems to work well in this case.
As we've done before, store the priors for $a$ and $b$ as frozen scipy.stats
distributions in a dictionary.
# priors = {'a':..., 'b':...}
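For example, one way to build such truncated priors (a sketch, using the LA marginal posteriors from above and uniform distributions between the quoted quantiles):
priors = {}
for i, p in enumerate(param_names):
    marg = st.norm(LA.x[i], np.sqrt(LA.hess_inv[i, i]))  # LA marginal for this parameter
    lo, hi = marg.ppf([0.0001, 0.9999])                  # truncation limits
    priors[p] = st.uniform(loc=lo, scale=hi - lo)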
To make it simpler to try out different summary statistics or distance metrics, we'll use classes and inheritance to package up reusable code. The cell below has just one method for you to complete, namely the one that performs the loop in the pseudocode above. Do not include an accept/reject step here (that is, accept all samples); instead, record the distance for each sample. That way, we can easily look at how the choice of distance threshold affects our results later. (A completed sketch of the loop appears just after the class definition, if you want to check your version against it.)
Also, try to make your code general enough that we can reuse it with another generative model and data set later on; that is, don't reference variables at global scope if they're stored in the ABC
object already by its constructor.
class ABC:
def __init__(self, x, ytrue, priors):
self.x = x
self.y = ytrue
self.s_true = self.summaries(x, ytrue)
self.priors = priors
self.param_names = [p for p in priors.keys()]
# derived classes must define the following
def summaries(self, x, y):
# return summary statistics as a 1D np.array
raise Exception('ABC::summaries should be defined by a deriving class')
def distance(self, s1, s2):
# return scalar distance between 2 sets of summary statistics
raise Exception('ABC::distance should be defined by a deriving class')
def simulate(self):
# return a simulated 1D array y, given self.x and additional parameter arguments (with default values)
raise Exception('ABC::simulate should be defined by a deriving class')
# the important part
def run(self, Nsim):
self.samples = np.full((Nsim, len(self.param_names)), np.nan)
self.distances = np.full(Nsim, np.nan)
for i in range(Nsim):
params = {p:self.priors[p].rvs() for p in self.param_names}
ysim = self.simulate(**params)
# compute summaries and distance from self.s_true
# store params in the ith row of self.samples
# store distance in the ith entry of self.distances
# useful visualizations
def distance_hist(self):
plt.rcParams['figure.figsize'] = (4.0, 3.0)
plt.hist(self.distances);
plt.xlabel("d");
def select_distances(self, eps):
self.selection = np.flatnonzero(self.distances < eps)
self.samples2 = self.samples[self.selection,:]
print('Acceptance rate:', self.samples2.shape[0]/self.samples.shape[0], '(', self.samples2.shape[0], 'accepted samples )')
def plot_acceptances(self, truth=None, truth_tri=None, size=8.0, show_rejects=True):
plt.rcParams['figure.figsize'] = (size, size)
n = len(self.param_names)
fig = plt.figure()
axes = []
for i,p in enumerate(self.param_names):
axes.append([])
ax = fig.add_subplot(n, n, (i*n)+i+1)
ax0 = ax
ax.hist(self.samples2[:,i], density=True, label='ABC');
if truth is not None:
aa = np.linspace(*self.priors[p].ppf([0.0001,0.9999]), 100)
ax.plot(aa, truth['1D'][p].pdf(aa), label='truth');
if i==n-1:
ax.set_xlabel(p);
else:
ax.get_xaxis().set_ticklabels([])
ax.get_yaxis().set_ticklabels([])
for j,q in enumerate(self.param_names):
if j == i:
break
ax = fig.add_subplot(n, n, (i*n)+j+1)
axes[-1].append(ax)
if show_rejects:
ax.plot(self.samples[:,j], self.samples[:,i], ',', color='C2');
ax.plot(self.samples2[:,j], self.samples2[:,i], '.');
if truth is not None:
for k in truth['2D'].keys():
ax.plot(truth['2D'][k][0], truth['2D'][k][1], color='C1', label='truth');
if i==n-1:
ax.set_xlabel(q);
else:
ax.get_xaxis().set_ticklabels([])
if j==0:
ax.set_ylabel(p);
else:
ax.get_yaxis().set_ticklabels([])
axes[-1].append(ax0)
if truth_tri is not None:
cr.whist_triangle_plot(truth_tri, axes=axes, linecolor1D='C1', linecolor2D='C1', fill2D=False);
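For reference, here is a sketch of how the missing steps in run
could be completed (one possibility; the name _abc_run is a hypothetical stand-in so it doesn't clobber your own version):
def _abc_run(self, Nsim):
    # mirrors ABC.run above, with the loop body filled in
    self.samples = np.full((Nsim, len(self.param_names)), np.nan)
    self.distances = np.full(Nsim, np.nan)
    for i in range(Nsim):
        params = {p:self.priors[p].rvs() for p in self.param_names}
        ysim = self.simulate(**params)
        s_sim = self.summaries(self.x, ysim)                        # summaries of the mock data
        self.samples[i,:] = [params[p] for p in self.param_names]   # record the prior draw
        self.distances[i] = self.distance(s_sim, self.s_true)       # distance to the real-data summaries
# optionally, patch it in with: ABC.run = _abc_run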
Arbitrary summary statistics
Next, we need to choose summary statistics and a distance function. Given that this problem has an algebraic solution, it's very tempting to use that solution to inform both of these choices, and we will do so below. First, however, pretend that we didn't have an exact solution to this problem to consult. In class, we will jointly decide on summary statistics and a distance function to use in this part. (The public solutions use Euclidean distance throughout this notebook.)
# Any notes about these decisions can go here
We can define a derived class of ABC
that implements these choices by specifying just the 3 functions that were left out above:
# class ABC_ours(ABC):
# def summaries(self, x, y):
# # return summary statistics as a 1D np.array
# def distance(self, s1, s2):
# # return scalar distance between 2 sets of summary statistics
# def simulate(self, a=0.0, b=0.0):
# # return a simulated 1D array y, given self.x and additional parameters arguments (with default values)
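If you want something concrete to compare against, here is one illustrative (and deliberately simple, not necessarily good) choice, just to show the structure: the mean of y and the x-y correlation coefficient as summaries, with Euclidean distance. Your class may well settle on something different.
class ABC_ours(ABC):
    def summaries(self, x, y):
        # two simple summaries: the mean of y and the x-y correlation coefficient
        return np.array([np.mean(y), np.corrcoef(x, y)[0, 1]])
    def distance(self, s1, s2):
        # Euclidean distance between summary vectors
        return np.sqrt(np.sum((s1 - s2)**2))
    def simulate(self, a=0.0, b=0.0):
        # linear model with unit Gaussian scatter, evaluated at self.x
        return a + b*self.x + st.norm.rvs(size=len(self.x))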
abc = ABC_ours(x, y, priors)
Check that it doesn't crash when trying to run:
abc.run(1)
print(abc.samples)
print(abc.distances)
[[-2.7244331 0.50411893]] [0.66515504]
If so, let's produce a good number of samples and associated distances (this takes <10s on my aging laptop):
%time abc.run(20000)
CPU times: user 7.41 s, sys: 40.4 ms, total: 7.45 s Wall time: 7.52 s
You might be wondering why we haven't said anything about the distance threshold for accepting samples yet. The reason is, simply, that the absolute magnitude of the threshold that makes sense depends on both the choice of summary statistics (and their dimensionality) and the choice of distance function. For most sensible distance definitions, zero is the ideal value, but it makes sense to first take a look at the distribution of distances corresponding to our simulated data sets so far (using this handy provided function!).
abc.distance_hist();
The cells below visualize how the accepted samples change as we reduce the threshold from the median of the distribution above to the first percentile. Each one shows a "triangle" plot comparing the exact posterior (orange) to the distribution of accepted samples (blue). The rejected samples are shown in green in the off-diagonal panel, just to verify that they fill the space uniformly.
eps = np.quantile(abc.distances, 0.5)
print("eps =", eps)
abc.select_distances(eps)
abc.plot_acceptances(truth=true_post, size=4.0)
eps = 0.6959394234143433 Acceptance rate: 0.5 ( 10000 accepted samples )
eps = np.quantile(abc.distances, 0.1)
print("eps =", eps)
abc.select_distances(eps)
abc.plot_acceptances(truth=true_post, size=4.0)
eps = 0.14008944453157685 Acceptance rate: 0.1 ( 2000 accepted samples )
eps = np.quantile(abc.distances, 0.01)
print("eps =", eps)
abc.select_distances(eps)
abc.plot_acceptances(truth=true_post, size=4.0)
eps = 0.038449764295854 Acceptance rate: 0.01 ( 200 accepted samples )
Depending on the choices of summary statistic (especially) and distance function, you may or may not see the ABC samples converging to the truth as the threshold is reduced. Comment on the outcome, and how it might motivate different choices.
Space to respond
I_have_recorded_my_thoughts = False # change to True when it is true
assert I_have_recorded_my_thoughts
Optimal summary statistics
Depending on what summaries we decided to use above, the previous section's ABC may have worked extremely poorly. If so, this hopefully impresses on you the importance of thinking these decisions through. If not, have a look at the web version of this solved notebook to see how a deliberately poor choice works out.
Either way, we will now run an ABC using the optimal summary statistics for this particular problem, which are the OLS estimators of $a$ and $b$ (surprise!). We will not change the distance function initially.
In code, we can derive a new class from ABC_ours
(which has appropriate distance
and simulate
methods already), and overload (i.e. replace) only the summaries
method.
# class ABC_OLS(ABC_ours):
# def summaries(self, x, y):
# # return summary statistics, the OLS estimators of a and b, as a 1D np.array
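A minimal sketch of this class (assuming the same statsmodels call used at the top of the notebook):
class ABC_OLS(ABC_ours):
    def summaries(self, x, y):
        # OLS estimators of the intercept and slope
        return sm.OLS(y, sm.add_constant(x)).fit().params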
abc1 = ABC_OLS(x, y, priors)
As a quick check, this object should give us back the OLS estimators we got at the beginning of the notebook if we run summaries
on the real data.
assert all(ols.params == abc1.summaries(x,y))
Let's run and see how these results converge as the threshold is reduced.
%time abc1.run(20000)
CPU times: user 11.7 s, sys: 44.4 ms, total: 11.7 s Wall time: 11.8 s
abc1.distance_hist();
eps = np.quantile(abc1.distances, 0.5)
print("eps =", eps)
abc1.select_distances(eps)
abc1.plot_acceptances(truth=true_post, size=4.0)
eps = 0.6545126265864643 Acceptance rate: 0.5 ( 10000 accepted samples )
eps = np.quantile(abc1.distances, 0.1)
print("eps =", eps)
abc1.select_distances(eps)
abc1.plot_acceptances(truth=true_post, size=4.0)
eps = 0.19439434899714933 Acceptance rate: 0.1 ( 2000 accepted samples )
eps = np.quantile(abc1.distances, 0.05)
print("eps =", eps)
abc1.select_distances(eps)
abc1.plot_acceptances(truth=true_post, size=4.0)
eps = 0.1347688101494705 Acceptance rate: 0.05 ( 1000 accepted samples )
eps = np.quantile(abc1.distances, 0.01)
print("eps =", eps)
abc1.select_distances(eps)
abc1.plot_acceptances(truth=true_post, size=4.0)
eps = 0.05867312084143261 Acceptance rate: 0.01 ( 200 accepted samples )
Unless we've chosen a particularly strange distance function, you should see the ABC samples converging to the truth, but not equally quickly in both parameters. We can do better than this! If we have a guess of the width of the posterior in each parameter, it would make sense to incorporate those scales into the distance function. That is, we could redefine the distance such that
$d(\Delta \hat{a}, \Delta \hat{b}) \rightarrow d\left(\frac{\Delta\hat{a}}{\sigma_a}, \frac{\Delta\hat{b}}{\sigma_b}\right)$.
Here $\hat{a}$ and $\hat{b}$ are our summary statistics, and $\sigma_a$ and $\sigma_b$ are estimates of the width of the posterior in each parameter (the marginalized standard deviations of our posterior estimate). Even better would be to use the 2D covariance estimate, but this simpler modification will suffice (not to mention, we're already using a lot of information we might not have in general).
Put this modification into code by deriving a new class inheriting from ABC_OLS
and overriding the distance
method.
# class ABC_OLS2(ABC_OLS):
# def distance(self, s1, s2):
# # return scalar distance between 2 sets of summary statistics
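One way to implement the rescaled distance (a sketch, taking the $\sigma_a$ and $\sigma_b$ estimates from the OLS covariance computed earlier):
class ABC_OLS2(ABC_OLS):
    def distance(self, s1, s2):
        # scale each summary difference by its estimated posterior standard deviation
        sigmas = np.sqrt(ols.normalized_cov_params.diagonal())
        return np.sqrt(np.sum(((s1 - s2) / sigmas)**2))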
abc2 = ABC_OLS2(x, y, priors)
Quick check for crashes:
abc2.run(1)
Again, let's produce a bunch of samples and see how the convergence looks as the distance threshold is reduced.
%time abc2.run(20000)
CPU times: user 11.5 s, sys: 45.5 ms, total: 11.5 s Wall time: 11.5 s
abc2.distance_hist();
eps = np.quantile(abc2.distances, 0.5)
print("eps =", eps)
abc2.select_distances(eps)
abc2.plot_acceptances(truth=true_post, size=4.0)
eps = 3.057937865615461 Acceptance rate: 0.5 ( 10000 accepted samples )
eps = np.quantile(abc2.distances, 0.1)
print("eps =", eps)
abc2.select_distances(eps)
abc2.plot_acceptances(truth=true_post, size=4.0)
eps = 1.3243872565570287 Acceptance rate: 0.1 ( 2000 accepted samples )
eps = np.quantile(abc2.distances, 0.05)
print("eps =", eps)
abc2.select_distances(eps)
abc2.plot_acceptances(truth=true_post, size=4.0)
eps = 0.9501546177624834 Acceptance rate: 0.05 ( 1000 accepted samples )
eps = np.quantile(abc2.distances, 0.01)
print("eps =", eps)
abc2.select_distances(eps)
abc2.plot_acceptances(truth=true_post, size=4.0)
eps = 0.4210760671844682 Acceptance rate: 0.01 ( 200 accepted samples )
You should now see a progression that plausibly looks like it's smoothly approaching the true posterior.
Yes_I_totally_see_that = False # change to True when true
assert Yes_I_totally_see_that
Case: linear model with unknown scatter
Now that you have some experience with the nuts and bolts of LA and ABC, we'll apply them to a situation that's only slightly more complex, but doesn't have a pretty exact solution. Namely,
- We have data in the form of a list of $(x,y)$ pairs.
- The $x$ values are fixed (we assume no uncertainty in their generation).
- Each $y$ value is independently generated from a linear model: $y_i = a + b x_i + \varepsilon_i$, where $\varepsilon_i$ follows a normal distribution with zero mean and unknown variance, $\sigma^2$, that we will fit for.
- Priors on $a$ and $b$ are both uniform over the real line; the prior on $\sigma^2$ is uniform over non-negative values.
Comparing with above, the key difference here is that the scatter of $y$ about the linear model is a free parameter that we will fit for, in addition to the slope and intercept.
param_names2 = param_names + ['s2']
As always, draw and write down the generative model.
Space for the generative model
To keep things from being too boring, we will read in a second data set for which $\sigma^2 \neq 1$.
table2 = np.loadtxt(datapath+'data2.txt')
x2 = table2[:,0]
y2 = table2[:,1]
plt.rcParams['figure.figsize'] = (4.0, 3.0)
plt.plot(x2, y2, '.');
plt.xlabel("x"); plt.ylabel("y");
While there isn't a simple exact solution for this scenario, it would still be nice to compare to a solution we're confident in. Fortunately, this model, for data like those we're using, is easily handled with conjugate Gibbs sampling. So we'll do that quickly before moving on.
We are extremely confident that LRGS
can handle this problem without oversight, but since your data are randomly generated there is no 100% guarantee. If things look weird below and you want to look closer, the commented cell below will show the traces. Without further commentary,
par = lrgs.Parameters(np.matrix([x2]).T, np.matrix([y2]).T, M_inv=[np.asmatrix(np.zeros(2)) for i in range(len(x2))], Sigma_prior=(-2., np.matrix(np.zeros((1,1)))))
chain = lrgs.Chain(par, 15000)
%time chain.run(fix='y') # <10s on an old laptop
CPU times: user 9.26 s, sys: 80 ms, total: 9.34 s Wall time: 9.47 s
chdict = chain.to_dict(['B','Sigma'])
charr = np.array([chdict[p] for p in ['B_0_0','B_1_0','Sigma_0_0']]).T
#plt.rcParams['figure.figsize'] = (12.0, 2.0*len(param_names2))
#fig, ax = plt.subplots(len(param_names2), 1);
#cr.plot_traces(charr, ax, labels=param_names2);
charr = charr[10:,:]
tri = cr.whist_triangle(charr, bins=50, smooth2D=1);
cr.whist_triangle_plot(tri, paramNames=param_names2);
Laplace Approximation
Now that we hopefully have a reliable posterior from MCMC to compare to, let's try the LA for this problem. As before, write a function to return minus the log-posterior:
def minus_lnp2(params, x, y):
# return -ln(posterior) given params, a 1D np.array [a,b,s2]
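One possible implementation (a sketch, assuming the uniform priors above; the guard against non-positive s2 keeps the optimizer from evaluating an undefined likelihood):
def minus_lnp2(params, x, y):
    a, b, s2 = params
    if s2 <= 0.0:
        return np.inf  # outside the prior support
    return -np.sum(st.norm.logpdf(y, loc=a + b*x, scale=np.sqrt(s2)))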
Again, a very crude guess should work for us.
guess2 = [0.0, 0.0, 1.0]
Find the posterior maximum! Note that we need to use the bounds
argument so that minimize
knows that the s2
parameter may not be negative. This causes it to internally choose a different algorithm than it would have used by default otherwise, with the upshot that we need to do something slightly different to access the inverse-Hessian (see below).
LA2 = minimize(minus_lnp2, guess2, bounds=[(None,None), (None,None), (0.0,None)], args=(x2, y2))
LA2
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 59.8988630297158
        x: [-2.718e+00  5.945e-01  2.791e+00]
      nit: 15
      jac: [ 7.105e-07  0.000e+00 -7.105e-07]
     nfev: 72
     njev: 18
 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
LA2_mean = LA2.x
LA2_cov = LA2.hess_inv.todense()
print("mean:", LA2_mean)
print("cov:", LA2_cov)
mean: [-2.71795238  0.59450388  2.7914622 ]
cov: [[ 0.19321949 -0.0404238  -0.05339119]
 [-0.0404238   0.02055936  0.05755932]
 [-0.05339119  0.05755932  0.80633199]]
The cell below will compare the LA (red/dashed curves) with our MCMC posterior (blue/solid curves):
fig,ax = cr.whist_triangle_plot(tri, paramNames=param_names2, linecolor1D='b', linecolor2D='b', fill2D=False);
for i,p in enumerate(param_names2):
aa = np.linspace(LA2_mean[i]-4*np.sqrt(LA2_cov[i,i]), LA2_mean[i]+4*np.sqrt(LA2_cov[i,i]), 100)
ax[i][i].plot(aa, st.norm.pdf(aa, LA2_mean[i], np.sqrt(LA2_cov[i,i])), '--', color='r');
ax[i][i].set_xlim(min(ax[i][i].get_xlim()[0], LA2_mean[i]-4*np.sqrt(LA2_cov[i,i])), max(ax[i][i].get_xlim()[1], LA2_mean[i]+4*np.sqrt(LA2_cov[i,i])))
ax[i][i].set_ylim(0.0, max(ax[i][i].get_ylim()[1], st.norm.pdf(aa, LA2_mean[i], np.sqrt(LA2_cov[i,i])).max()*1.1))
for j,q in enumerate(param_names2):
if i==j:
break
cr.cov_ellipse(LA2_cov[np.ix_([j,i],[j,i])], center=LA2_mean[[j,i]], level=0.68268949, plot=ax[i][j], fmt='--', color='r')
cr.cov_ellipse(LA2_cov[np.ix_([j,i],[j,i])], center=LA2_mean[[j,i]], level=0.95449974, plot=ax[i][j], fmt='--', color='r')
ax[i][j].set_xlim(min(ax[i][j].get_xlim()[0], LA2_mean[j]-4*np.sqrt(LA2_cov[j,j])), max(ax[i][j].get_xlim()[1], LA2_mean[j]+4*np.sqrt(LA2_cov[j,j])))
ax[i][j].set_ylim(min(ax[i][j].get_ylim()[0], LA2_mean[i]-4*np.sqrt(LA2_cov[i,i])), max(ax[i][j].get_ylim()[1], LA2_mean[i]+4*np.sqrt(LA2_cov[i,i])))
For our data, the LA gets close to the posterior mode, but its estimate of the covariance leaves something to be desired. Note that, for this setup, we have found something of a phase change when the data set gets small enough that the scatter is tough to distinguish from zero: the LA covariance in that case suddenly becomes much larger. For bigger data sets than the one here, the approximation, including the covariance, gets better. We've deliberately aimed in the middle for this demonstration, which is to say that you might see any of this behaviour in your own randomly generated data set. (It's also possible that some of this is due to the specific way the Hessian is estimated within minimize
, and a more careful estimate of the second derivatives would do better; we haven't investigated.)
Approximate Bayes
As before, let's functionally redefine the range of our priors based on what you see above. In real life, you can imagine doing this based on the LA in the first place, and iterating if you find that the resulting approximate posterior doesn't become tiny before hitting the edge of the prior. We won't mind if you cheat a little and use the MCMC result to motivate the refined priors in this case, but keep in mind that they should include at least the entire volume of parameter space where the chain has samples.
# priors2 = {'a':..., 'b':..., 's2':...}
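For example (a sketch of one possible choice): uniform priors spanning the range of the MCMC samples, padded by half that range on each side, with the scatter variance kept non-negative.
priors2 = {}
for i, p in enumerate(param_names2):
    lo, hi = charr[:, i].min(), charr[:, i].max()
    pad = 0.5 * (hi - lo)
    lo, hi = lo - pad, hi + pad
    if p == 's2':
        lo = max(lo, 0.0)  # the scatter variance cannot be negative
    priors2[p] = st.uniform(loc=lo, scale=hi - lo)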
What summary statistics and distance function would you use for this problem? Note: in class we will reach a consensus for what to implement below.
# class ABC_scat(ABC):
# def summaries(self, x, y):
# # return summary statistics as a 1D np.array
# def distance(self, s1, s2):
# # return scalar distance between 2 sets of summary statistics
# def simulate(self, a=0.0, b=0.0, s2=1.0):
# # return a simulated 1D array y, given self.x and additional parameters arguments (with default values)
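Again, here is one illustrative choice to compare against (a sketch, not necessarily the class consensus): the OLS intercept and slope plus the residual variance as summaries, with each summary difference scaled by a rough width taken from the LA covariance above before computing the Euclidean distance.
class ABC_scat(ABC):
    def summaries(self, x, y):
        fit = sm.OLS(y, sm.add_constant(x)).fit()
        # intercept, slope, and residual variance
        return np.append(fit.params, np.var(fit.resid))
    def distance(self, s1, s2):
        # scale by rough parameter widths from the Laplace Approximation above
        scales = np.sqrt(LA2_cov.diagonal())
        return np.sqrt(np.sum(((s1 - s2) / scales)**2))
    def simulate(self, a=0.0, b=0.0, s2=1.0):
        # linear model with Gaussian scatter of variance s2
        return a + b*self.x + st.norm.rvs(scale=np.sqrt(s2), size=len(self.x))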
abc3 = ABC_scat(x2, y2, priors2)
As always, we next check for obvious code bugs:
abc3.run(1)
print(abc3.samples)
print(abc3.distances)
[[-2.94627433 0.86675959 12.19002481]] [9.7587713]
Given the additional free parameter, we expect to need to run for longer to get a useful number of accepted samples.
%time abc3.run(20000)
CPU times: user 15.1 s, sys: 95.3 ms, total: 15.2 s Wall time: 15.5 s
Let's see what that gave us:
abc3.distance_hist();
eps = np.quantile(abc3.distances, 0.5)
print("eps =", eps)
abc3.select_distances(eps)
abc3.plot_acceptances(truth_tri=tri, size=6.0, show_rejects=False)
eps = 5.974735324540064 Acceptance rate: 0.5 ( 10000 accepted samples )
eps = np.quantile(abc3.distances, 0.1)
print("eps =", eps)
abc3.select_distances(eps)
abc3.plot_acceptances(truth_tri=tri, size=6.0, show_rejects=False)
eps = 2.8341680811343344 Acceptance rate: 0.1 ( 2000 accepted samples )
eps = np.quantile(abc3.distances, 0.05)
print("eps =", eps)
abc3.select_distances(eps)
abc3.plot_acceptances(truth_tri=tri, size=6.0, show_rejects=False)
eps = 2.183490628701815 Acceptance rate: 0.05 ( 1000 accepted samples )
eps = np.quantile(abc3.distances, 0.01)
print("eps =", eps)
abc3.select_distances(eps)
abc3.plot_acceptances(truth_tri=tri, size=6.0, show_rejects=False)
eps = 1.2939580058803228 Acceptance rate: 0.01 ( 200 accepted samples )
Hopefully you can see indications that ABC is working, although chances are you would also need to run longer and decrease the acceptance threshold even more before the comparison to the MCMC results really starts to look good. We stress that there are more intelligent, adaptive implementations of ABC than our simplistic version, which one would turn to in practice.
Yes_this_all_checks_out = False # change to True when true
assert Yes_this_all_checks_out
Bootstrap
We conclude this notebook with a quick implementation of the bootstrap. Recall that this is a frequentist method for "robustly" estimating uncertainties in estimates when the sampling distribution is not well known enough to do maximum likelihood (for example). The procedure is
- randomly sample from the data with replacement,
- compute the estimator(s) from the sampled data,
- repeat many times, and look at the distribution of estimates.
Let's go ahead and do this using the summary statistics from the above exercise (with free scatter) as our estimators. In this context, to be clear, resampling the data means resampling the pairs $(x,y)$, not sampling $x$s and $y$s independently.
Below, complete a function that resamples the given data and returns it in a tuple. (This hardly seems worth making an exercise, which says something about the simplicity of implementing the bootstrap.)
def bootstrap(x, y):
# i = list of indices into x and y encoding the resampling; see np.random.choice
return x[i],y[i]
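In case it's useful, a minimal sketch of the completed function:
def bootstrap(x, y):
    # draw len(x) indices uniformly, with replacement
    i = np.random.choice(len(x), size=len(x), replace=True)
    return x[i], y[i]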
Let's run it...
%%time
boot = np.full((20000,3), np.nan)
for i in range(boot.shape[0]):
xb,yb = bootstrap(x2, y2)
boot[i,:] = abc3.summaries(xb, yb)
CPU times: user 9.47 s, sys: 56.3 ms, total: 9.53 s Wall time: 9.77 s
... and compare to the MCMC posterior for this problem. Below, the bootstrap distribution is red/dashed curves, and the MCMC posterior is blue/solid.
boot_tri = cr.whist_triangle(boot, bins=50, smooth2D=1);
fig,ax = cr.whist_triangle_plot(boot_tri, paramNames=param_names2, linecolor1D='r', linecolor2D='r', linestyle1D='--', linestyle2D='--', fill2D=False);
cr.whist_triangle_plot(tri, linecolor1D='b', linecolor2D='b', fill2D=False, axes=ax);
With our data, the bootstrap was a little too optimistic for $a$ and $b$, but not bad, while producing a much tighter distribution for $\sigma^2$. In principle, it should do better for larger data sets, where more information is available from resampling, but we haven't explicitly checked (exercise for the reader!). On the other hand, for our particular data set, the bootstrap actually looks to be doing a slightly better job than the Laplace Approximation for the intercept and slope parameters.
Parting thoughts
We hope you have gained an appreciation for and understanding of some of the approximate methods covered in the notes. Possibly, you're now more motivated to avoid using them whenever possible. Still, while a little ingenuity can go a long way in making MCMC and related methods more tractable, it's nice to know that there are approximate methods that still live in the realm of principled inference.