Tutorial: Essential Probability¶
In this notebook, we will get some practice with simple probabilistic manipulations and calculations. None of it is terribly complicated in the grand scheme of things, either conceptually or numerically. On the other hand, we will be speaking the language of probability and doing calculations similar to these throughout the course, so it's worth spending a little time to make sure we have the fundamentals down. In particular, if you are already proficient with Python and NumPy, the mechanical aspects may seem very easy. Conversely, if you feel at sea working through the steps below, this is an indication that the programming learning curve later on will be a real challenge.
Below, you will
- implement functions that evaluate the marginal and conditional distributions of a multivariate Gaussian
- numerically approximate those distributions on a grid, and compare
- approximate the same distributions using random samples, and compare
from os import getcwd
from yaml import safe_load
import sys
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
thisTutorial = 'essential_probability'
if getcwd() == '/content':
    # assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
    from google.colab import drive
    drive.mount('/content/drive')
    datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
    # assume we are running locally somewhere and have the data under ./data/
    datapath = 'data/'
Background¶
As you have seen in the notes, the univariate Gaussian or normal PDF is
$\mathrm{Normal}(x|\mu,\sigma) \equiv p_\mathrm{Normal}(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp \left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$.
The multivariate version of this distribution is
$\mathrm{Normal}_n({\bf x}|{\bf \mu},{\bf \Sigma}) = \frac{1}{\sqrt{(2\pi)^n|{\bf \Sigma}|}}\exp \left[-\frac{1}{2}({\bf x}-{\bf \mu})^{\mathrm{T}}{\bf \Sigma}^{-1}({\bf x}-{\bf \mu})\right]$,$^1$
where ${\bf x}$ and ${\bf \mu}$ are now vectors of length $n$, and ${\bf \Sigma}$ is an $n \times n$ positive definite matrix. In the $n=2$ case, we have ${\bf x}=(x,y)$ and ${\bf \mu}=(\mu_x,\mu_y)$, and ${\bf \Sigma}$ is usually decomposed as
${\bf \Sigma} = \left( \begin{array}{cc} \sigma^2_x & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma^2_y \end{array} \right)$.
With a little patience, you can expand the PDF into this form
$\mathrm{Normal}_2({\bf x}|{\bf \mu},{\bf \Sigma}) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - 2\rho \left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right) \right]\right\}$.
With slightly more patience, and some algebra, it's also possible$^2$ to rearrange the PDF into a factored form
$\mathrm{Normal}_2({\bf x}|{\bf \mu},{\bf \Sigma}) = \mathrm{Normal}\left(x|\mu_x,\sigma_x\right) \, \mathrm{Normal}\left[y\left| \mu_y+\frac{\rho\sigma_y}{\sigma_x}(x-\mu_x), \sigma_y\sqrt{1-\rho^2} \right.\right] = \mathrm{Normal}\left(y|\mu_y,\sigma_y\right) \, \mathrm{Normal}\left[x\left| \mu_x+\frac{\rho\sigma_x}{\sigma_y}(y-\mu_y), \sigma_x\sqrt{1-\rho^2} \right.\right]$.$^3$
Implementing analytically derived PDFs¶
Although we won't need it until the next section, our first task is to implement a function evaluating the bivariate Gaussian PDF, as a function of the parameters in the last equation.
In this notebook, please implement everything straightforwardly from (any of) the equations above rather than calling pre-packaged functions from scipy
or elsewhere.$^4$
def p_xy(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
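For reference, here is one minimal sketch that transcribes the expanded $\mathrm{Normal}_2$ expression above directly into NumPy; yours may be organized differently.
def p_xy(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    # standardized offsets from the means
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    # prefactor and exponent of the expanded bivariate normal PDF
    norm = 2.0 * np.pi * sigma_x * sigma_y * np.sqrt(1.0 - rho**2)
    return np.exp(-0.5 * (u**2 + v**2 - 2.0 * rho * u * v) / (1.0 - rho**2)) / norm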
Next, let's determine the marginal distribution of $x$, $p(x)$, and the conditional distribution of $y$, $p(y|x)$. We won't bother with $p(y)$ and $p(x|y)$, given the evident symmetry of the expressions.$^3$ Recall that, by definition,
$p(x) = \int dy \, p(x,y)$
(where $p(x,y)$ is shorthand for the bivariate normal density as a function of $\mu_x$, $\mu_y$, etc.), and the marginal and conditional distributions jointly satisfy
$p(x,y) = p(x) \, p(y|x)$.
Based on the previous section, write down $p(x)$ and $p(y|x)$. Both of these are (spoiler) Gaussians, and should therefore respectively be put in the simplified forms $\mathrm{Normal}(x|\ldots)$ and $\mathrm{Normal}(y|\ldots)$. This can (and should!) be accomplished by inspection,$^3$ although you can work things through if you have any doubts. Refer to the "most convenient mathematical shortcuts ever" section in the Essential Probability notes if this seems weird.
Space for your analytical derivations/arguments
Note: Equations in markdown cells are coded in LaTeX, between $'s. Examine the cells above in edit mode (double click them) to see. If you're not yet a LaTeX magician, don't worry; you'll probably see everything you need for the moment as we go. When in doubt, the appendices of this guide list many helpful math commands.
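For reference, once you've had a go yourself: reading the first factored form in the Background section against the definition $p(x,y) = p(x)\,p(y|x)$ gives $p(x) = \mathrm{Normal}(x|\mu_x,\sigma_x)$ and $p(y|x) = \mathrm{Normal}\left[y\left|\,\mu_y+\frac{\rho\sigma_y}{\sigma_x}(x-\mu_x), \sigma_y\sqrt{1-\rho^2}\right.\right]$.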
Implement Python functions for $p(x)$ and $p(y|x)$, analogous to the one for $p(x,y)$ above.
def p_x(x, mu_x, mu_y, sigma_x, sigma_y, rho):
# note that `y' is not an argument of this function!
def p_y_given_x(y, x, mu_x, mu_y, sigma_x, sigma_y, rho):
# note the order of `y' and `x' above!
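Again for reference, minimal sketches consistent with the univariate Gaussian definition at the top of the notebook and the marginal/conditional parameters read off above (not the only sensible way to organize the code):
def p_x(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    # marginal of x: univariate Gaussian with mean mu_x and standard deviation sigma_x
    return np.exp(-0.5 * ((x - mu_x) / sigma_x)**2) / (np.sqrt(2.0 * np.pi) * sigma_x)

def p_y_given_x(y, x, mu_x, mu_y, sigma_x, sigma_y, rho):
    # conditional of y given x: Gaussian with a shifted mean and reduced width
    mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
    sd = sigma_y * np.sqrt(1.0 - rho**2)
    return np.exp(-0.5 * ((y - mean) / sd)**2) / (np.sqrt(2.0 * np.pi) * sd)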
In the remaining sections, we'll compare these functions to numerical approximations for a particular choice of ${\bf \mu}$ and ${\bf \Sigma}$, and a value of $x$ to condition on. Let's find out what they are!
test_values = safe_load(open(datapath+'data.yaml', 'r').read())
params = test_values['params']
params
{'mu_x': -0.6622828704260115, 'mu_y': 2.3888101527152896, 'rho': -0.7217394324343991, 'sigma_x': 0.9267964821658732, 'sigma_y': 1.0079054376556293}
fixed_x = test_values['fixed_x']
fixed_x
-1.86
While we're here, we may as well check that your functions satisfy the joint definition of marginal and conditional probabilities for these parameter values and some arbitrary choices of $x$ and $y$.
assert np.isclose(p_xy(0.0, 0.0, **params), p_x(0.0, **params) * p_y_given_x(0.0, 0.0, **params))
assert np.isclose(p_xy(1.0, -1.0, **params), p_x(1.0, **params) * p_y_given_x(-1.0, 1.0, **params))
Python trivia: The **params
in the function calls above passes the entries of the params
dictionary as named arguments of the function, i.e. it is equivalent to the very tedious mu_x=params['mu_x'], mu_y=params['mu_y'], sigma_x=params['sigma_x'], ...
. This is a nice convenience, allowing us to write functions with meaningfully named arguments, while also keeping and passing sets of parameter values in organized structures. We'll use this functionality often in these notebooks.
Numerical comparison: on a grid¶
In this section we'll check your analytical functions against numerical calculations on a grid, the latter being a reasonable way to proceed in some other 2-dimensional case that didn't happen to have simple analytic solutions.
Of course, a grid-based calculation will always be somewhat approximate, since
- the grid can't extend to infinity in $x$ and $y$, and therefore can't account for 100% of the probability in $p(x,y)$, and
- the grid's finite spacing can't perfectly reproduce the shape of $p(x,y)$.
Nevertheless, it's possible for grid calculations to be extremely accurate, for sensible choices of bounds and spacing. Go ahead and make some choices below, with the aim of
- containing most of the probability within the grid bounds, and
- resolving the shape of $p(x,y)$ reasonably with the grid spacing.
You will be able to come back and adjust these if they don't seem to be doing the job, of course.
Tip: We recommend defining things such that the number of grid points in $x$ will be different from the number of grid points in $y$. There is no deep reason for this, but it makes it much harder to get confused about which array axis corresponds to which variable.
# x bounds and spacing
# xmin = ...
# xmax = ...
# dx = ...
# y bounds and spacing
# ymin = ...
# ymax = ...
# dy = ...
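For concreteness, here is one plausible choice. It is a sketch only: the specific multipliers below are assumptions, and any bounds/spacing that comfortably contain the bright part of the PDF and resolve its shape will do. The slightly different spacings in $x$ and $y$ give the two grid axes different lengths, per the tip above.
# x bounds and spacing
xmin = params['mu_x'] - 5.0 * params['sigma_x']
xmax = params['mu_x'] + 5.0 * params['sigma_x']
dx = params['sigma_x'] / 25.0
# y bounds and spacing
ymin = params['mu_y'] - 5.0 * params['sigma_y']
ymax = params['mu_y'] + 5.0 * params['sigma_y']
dy = params['sigma_y'] / 20.0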
The cell below defines vectors that hold the $x$ and $y$ values that will be gridded, as well as arrays (using the handy numpy.meshgrid
function) where each entry holds the $x$ and $y$ value corresponding to that grid point. (If that was confusing, the plot below should clear it up.)
xvalues = np.arange(xmin, xmax+dx, dx)
yvalues = np.arange(ymin, ymax+dy, dy)
grid_x, grid_y = np.meshgrid(xvalues, yvalues, indexing='ij')
Aside: Note the indexing='ij'
option to meshgrid
. This makes the function return arrays where the first index corresponds to $x$ and the second to $y$ instead of vice versa. The result is a much better convention in terms of generalizing to higher dimensions (and it corresponds to our usual mathematical notation, $x,y$), but it means we will need to transpose arrays before plotting them if we want $x$ to appear on the horizontal axis and $y$ to appear on the vertical axis.
The cell below illustrates the contents of grid_x
and grid_y
.
plt.rcParams['figure.figsize'] = (10.0, 8.0)
fig, ax = plt.subplots(1,2);
ax[0].imshow(grid_x.T, cmap='gray', origin='lower', extent=[xmin, xmax, ymin, ymax]);
ax[0].set_xlabel('x'); ax[0].set_ylabel('y'); ax[0].set_title('grid_x');
ax[1].imshow(grid_y.T, cmap='gray', origin='lower', extent=[xmin, xmax, ymin, ymax]);
ax[1].set_xlabel('x'); ax[1].set_ylabel('y'); ax[1].set_title('grid_y');
Why would we want 2D arrays that just repeat the information we already have in xvalues
and yvalues
? Just for convenience. Because numpy
can do most mathematical operations on arrays transparently, and assuming the functions you wrote above straightforwardly implement the relevant equations, we should be able to pass grid_x
and grid_y
in place of the arguments x
and y
and get sensible results, without explicitly writing a loop over every array element.
Let's see how that works by evaluating $p(x,y)$ on this grid and visualizing it.
grid_p_xy = p_xy(grid_x, grid_y, **params)
plt.rcParams['figure.figsize'] = (7.0, 5.0)
plt.imshow(grid_p_xy.T, cmap='gray', origin='lower', extent=[xmin, xmax, ymin, ymax]);
plt.xlabel('x'); plt.ylabel('y');
If the grid you defined was sensible, the bright part of grid_p_xy
(containing most of the probability) should be comfortably within the array, and should be well resolved (not look too pixelated).
Marginalization¶
Next, we'll get more quantitative and check your function for $p(x)$ against the equivalent numerical calculation using grid_p_xy
. This means numerically carrying out the integral $p(x) = \int dy \, p(x,y)$, so the result (grid_p_x
) should be a 1D array whose values are our numerical approximation of $p(x)$ at the xvalues
. Numerical integration techniques could easily be a course on their own, but in situations like this it frankly isn't worth doing anything more complex than a Riemann sum. After all, at least in this case, it's practically painless for us to re-create the grid with different bounds and/or spacings if we decide that's needed to make the sum accurate. That said, you can perform the integral however you want, provided it only relies on the grid_p_xy
array and the variables defining its bounds/spacing.
# grid_p_x = ...
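A minimal Riemann-sum sketch, assuming the indexing='ij' convention above (so axis 0 of grid_p_xy runs over $x$ and axis 1 runs over $y$):
grid_p_x = grid_p_xy.sum(axis=1) * dy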
Before moving on, check that your numerical approximation to $p(x)$ on a grid is normalized, as it should be; that is, check that $\int dx\, p(x) = 1$. This can be accomplished with essentially the same technique you used to integrate over $y$ above. If the assertion fails, either your integration is suspect, the grid bounds/spacing are insufficient, or something is fishy in your implementation of p_xy
.
# grid_p_x_integral = ...
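# e.g., the same Riemann-sum approach applied over x (a sketch):
grid_p_x_integral = grid_p_x.sum() * dx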
print(grid_p_x_integral)
assert np.isclose(grid_p_x_integral, 1.0)
0.9999992350574406
Good! Now let's compare this with your analytical function for $p(x)$.
plt.rcParams['figure.figsize'] = (6.0, 4.0)
plt.plot(xvalues, grid_p_x, 'b.', label='grid');
plt.plot(xvalues, p_x(xvalues, **params), 'r-', label='analytic');
plt.xlabel('x'); plt.ylabel('p(x)'); plt.legend();
If the plot above looks ok, but the following numerical check fails, your grid calculation might not be quite accurate enough to satisfy the atol
tolerance below. It should be possible to correct this by changing the grid bounds/spacing appropriately. Alternatively, double-check your p_x
function. (Also note that, occasionally, the points can look a little bit offset just because matplotlib
is silly.)
assert np.allclose(grid_p_x, p_x(xvalues, **params), atol=1e-6)
Conditioning¶
Numerically checking $p(y|x)$ is more straightforward, since there is no integration involved. We don't even need a 2D grid, but can instead directly evaluate $p(x,y)$ as a function of $y$, with $x$ fixed at some value. Recall that, for our tests, this value is
fixed_x
-1.86
As a test of your analytical p_y_given_x
, let's numerically evaluate $p(x=$ fixed_x
$,y)$ at $y=$yvalues
. Don't forget that you will also need to divide by $p(x)$ to normalize!
# grid_p_y_given_x = ...
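One possible sketch, normalizing by the analytic p_x evaluated at fixed_x (you could instead normalize by a Riemann sum over $y$):
grid_p_y_given_x = p_xy(fixed_x, yvalues, **params) / p_x(fixed_x, **params)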
Again, let's explicitly verify the normalization by directly integrating grid_p_y_given_x
over $y$.
# grid_p_y_given_x_integral = ...
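# analogous to the marginal normalization check above (a sketch):
grid_p_y_given_x_integral = grid_p_y_given_x.sum() * dy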
print(grid_p_y_given_x_integral)
assert np.isclose(grid_p_y_given_x_integral, 1.0)
0.9999999994211752
We'll also do visual and quantitative comparison with p_y_given_x
, analogously to above.
plt.rcParams['figure.figsize'] = (6.0, 4.0)
plt.plot(yvalues, grid_p_y_given_x, 'b.', label='grid');
plt.plot(yvalues, p_y_given_x(yvalues, fixed_x, **params), 'r-', label='analytic');
plt.xlabel('y'); plt.ylabel('p(y|x='+str(fixed_x)+')'); plt.legend();
assert np.allclose(grid_p_y_given_x, p_y_given_x(yvalues, fixed_x, **params))
Numerical comparison: Monte Carlo¶
To finish this notebook, we'll look at how to perform conditioning and marginalization when we have samples from a PDF rather than a simple analytical expression or an evaluation of the PDF over a grid. This will be our most common mode of operation in the latter part of this course, since analytical and grid-based methods tend to become intractable once we're working in more than a few dimensions.
Randomly sampling from general PDFs is a topic that we will cover later on. For the moment, we're using a simple PDF, so we'll use simple methods. In fact, we can take advantage of the marginal and conditional distributions that you've already worked out to produce samples from our 2D PDF using 2 calls to a univariate random number generator. Namely, below,
- Generate samples of $x$ from its marginal PDF, $p(x)$. Then,
- Generate corresponding samples of $y$ from its conditional PDF, $p(y|x)$.
Since each of these steps samples a 1D Gaussian, you can just use the numpy.random.randn
function; np.random.randn(Nsamples)
will produce an Nsamples
-length array distributed normally with a mean of zero and a standard deviation of 1. To get samples from some other normal distribution, just multiply these by the desired standard deviation, and add the desired mean.
For organization's sake, store these samples in a dictionary with x
and y
keys, i.e. samples={'x':..., 'y':...}
. Corresponding elements of samples['x']
and samples['y']
should then be valid draws from $p(x,y)$.
Nsamples = 100000 # this is more than we will typically have in practice, but should be good for illustrating how things work here
# samples = ... (a dictionary as described above)
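A sketch of the two-step recipe above, using only np.random.randn and the marginal/conditional parameters read off earlier:
samples = {}
# step 1: draw x from its marginal, Normal(mu_x, sigma_x)
samples['x'] = params['mu_x'] + params['sigma_x'] * np.random.randn(Nsamples)
# step 2: draw y from its conditional given each x, with a shifted mean and reduced width
cond_mean = params['mu_y'] + params['rho'] * params['sigma_y'] / params['sigma_x'] * (samples['x'] - params['mu_x'])
cond_sd = params['sigma_y'] * np.sqrt(1.0 - params['rho']**2)
samples['y'] = cond_mean + cond_sd * np.random.randn(Nsamples)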
Let's check visually whether these samples have the correct 2D distribution by plotting some of them over the gridded evaluation of $p(x,y)$.
plt.rcParams['figure.figsize'] = (5.0, 5.0)
plt.imshow(grid_p_xy.T, cmap='gray', origin='lower', extent=[xmin, xmax, ymin, ymax]);
plt.plot(samples['x'][:5000], samples['y'][:5000], 'r,');
plt.xlabel('x'); plt.ylabel('y');
The red points should be densest where the background image is brightest, and their distribution should conspicuously have the same overall orientation. The following sections will give us additional (if still not comprehensive) double-checks that the samples come from the intended distribution.
Marginalization¶
When dealing with samples, marginalization is the simplest operation imaginable. To estimate the distribution $p(x)$, we just... look at the distribution of samples['x']
, paying no attention at all to samples['y']
. Although there are other methods, the simplest way to estimate a distribution from samples is by making a histogram, as we do below. The density
argument to plt.hist
divides the number of samples in each histogram bin by Nsamples
times the bin width, making it the fractional density of samples in the bin (our estimate of the PDF).
plt.rcParams['figure.figsize'] = (6.0, 4.0)
plt.hist(samples['x'], density=True, bins=50);
plt.plot(xvalues, p_x(xvalues, **params), 'r-');
plt.xlabel('x'); plt.ylabel('p(x)');
Although the binning and the finite number of samples can make the comparison less than perfect, the histogram should agree with your analytic $p(x)$ very well in this case. If we use a smaller number of samples to make the histogram while keeping the bin width similar, as below, you'll see the estimate become increasingly noisy. The solution to this, if it matters, is to do some smoothing of the histogram or use wider bins (which is the same thing)... or ideally to get more samples.
plt.rcParams['figure.figsize'] = (12.0, 4.0)
fig, ax = plt.subplots(1,2);
ax[0].hist(samples['x'][:10000], density=True, bins=50);
ax[0].plot(xvalues, p_x(xvalues, **params), 'r-');
ax[0].set_xlabel('x'); ax[0].set_ylabel('p(x)'); ax[0].set_title('10000 samples');
ax[1].hist(samples['x'][:1000], density=True, bins=50);
ax[1].plot(xvalues, p_x(xvalues, **params), 'r-');
ax[1].set_xlabel('x'); ax[1].set_ylabel('p(x)'); ax[1].set_title('1000 samples');
Conditioning¶
In principle, conditioning with samples is also completely straightforward. We simply select the subset of samples that satisfy the condition, in our case $x=$ fixed_x
, and make a histogram of the corresponding $y$ values. In practice, however, the number of samples that satisfy the condition exactly is probably zero. We will need to compromise, and instead, for example, select those samples where $x$ is within some distance of fixed_x
(i.e., we will have to relax the condition). The numpy.flatnonzero
function can accomplish this; below we use it to find the indices for which |samples['x']-fixed_x|
is less than a tenth of the standard deviation of samples['x']
. This is not necessarily a smart definition of "close to fixed_x
" in general, but at least it means "close relative to the overall variation in $x$". You are encouraged to play around with this selection to see how it changes things.
j = np.flatnonzero(np.abs(samples['x'] - fixed_x)<0.1*samples['x'].std())
An obvious downside to this approach is that we are throwing away many, probably most, of the original samples, which in general might have been very difficult to produce in the first place. Let's see how many are left.
len(j)
3435
Presuming that this is a small fraction of Nsamples
, we can also expect the resulting histogram to be noisy, although the resemblance to the analytic result should still be clear.
plt.rcParams['figure.figsize'] = (6.0, 4.0)
plt.hist(samples['y'][j], density=True, bins=50);
plt.plot(yvalues, p_y_given_x(yvalues, fixed_x, **params), 'r-');
plt.xlabel('y'); plt.ylabel('p(y|x='+str(fixed_x)+')');
So, while it's indeed possible to condition samples in principle, in practice it's often better to simply generate samples that satisfy the desired condition by construction. Again, this is straightforward in this case using your analytic results. Below, generate Nsamples
samples of $y$ for $x=$ fixed_x
using numpy.random.randn
modulated by the appropriate mean and standard deviation.
# samples_y_given_x = ...
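For example (a sketch; this reuses the conditional mean and width from before, now evaluated at fixed_x):
cond_mean = params['mu_y'] + params['rho'] * params['sigma_y'] / params['sigma_x'] * (fixed_x - params['mu_x'])
cond_sd = params['sigma_y'] * np.sqrt(1.0 - params['rho']**2)
samples_y_given_x = cond_mean + cond_sd * np.random.randn(Nsamples)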
The histogram of these samples should look much better compared with the analytic curve, mainly because there are more of them, and partly because they satisfy the desired condition exactly rather than only approximately.
plt.rcParams['figure.figsize'] = (6.0, 4.0)
plt.hist(samples_y_given_x, density=True, bins=50);
plt.plot(yvalues, p_y_given_x(yvalues, fixed_x, **params), 'r-');
plt.xlabel('y'); plt.ylabel('p(y|x='+str(fixed_x)+')');
Parting thoughts¶
In this tutorial, you've practiced a little bit of probabilistic reasoning, and worked through some of the common mathematical operations in a simplified setting. While we may not see precisely these implementations again, we'll be building on the concepts used here for the remainder of the course.
Endnotes¶
Note 1¶
About the notation: It's a little weird that in the multivariate case we write $p(x|\mu,\Sigma)$, with the covariance $\Sigma$ as a parameter, while in the univariate case we write $p(x|\mu,\sigma)$, with the standard deviation $\sigma$ rather than the variance $\sigma^2$. In statistics literature, it's more common to be consistent, i.e. to write the univariate PDF as $p(x|\mu,\sigma^2)$. However, most code implementations of the univariate normal distribution (including scipy
) take the standard deviation as an argument rather than the variance, so we'll stick with our odd, inconsistent notation.
Note 2¶
And highly recommended practice! But also a bit tedious.
Note 3¶
You're welcome.
Note 4¶
numpy
functions are fine. We mean it about scipy
, though.
assert('scipy' not in sys.modules)