Tutorial: Working with Samples

In practice, we almost always work with samples from probability distributions rather than analytic expressions or evaluations on a grid. Here we'll see how to do all the fun probability manipulations you previously did (analytically and with grids) in the Essential Probability tutorial, this time with Monte Carlo samples.

Specifically, you will learn to:

  1. generate samples from a joint distribution by sampling a marginal and then a conditional distribution;
  2. estimate marginal and conditional distributions from samples;
  3. use importance weighting to fold in an additional PDF without generating new samples.

Warning: This notebook comes a little out of order. We will be doing things that are covered conceptually in the Monte Carlo Sampling notes, so you may want to jump ahead and read their short Motivation section. Or you could just take our word for a lot of things. This isn't ideal, but we think it's important to get you working with samples relatively early, even if the reasons why aren't fully apparent until later.

1. Produce some samples

Recalling our Essential Probability tutorial, let's jump right into the case of 2 correlated Gaussian variables.

First we need some samples to work with. In this bivariate Gaussian case, scipy has a function that could generate samples for us, but instead let's test your understanding of what we've already covered.

First, let's specify the parameters of the distribution, as before:

Now, make an $N\times2$ table with each row being an $(x,y)$ pair, and $N$ being some large number of samples. Instead of using some fancy function to directly obtain samples of $(x,y)$, do this:

  1. Fill the first column of the table ($x$) with samples from the marginal distribution, $p(x)$.
  2. Fill the second column ($y$) with samples from the conditional distribution, $p(y|x)$.

You should have expressions for these two distributions already from the previous notebook.
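For concreteness, here is one possible sketch, assuming the parameters above are stored as mu_x, mu_y, sigma_x, sigma_y and rho (hypothetical names; adapt them to whatever you actually defined). For a bivariate Gaussian, $p(x)$ is Gaussian with mean $\mu_x$ and standard deviation $\sigma_x$, and $p(y|x)$ is Gaussian with mean $\mu_y + \rho\,(\sigma_y/\sigma_x)(x-\mu_x)$ and standard deviation $\sigma_y\sqrt{1-\rho^2}$.

```python
import numpy as np

# Assumes mu_x, mu_y, sigma_x, sigma_y, rho were defined in the parameter
# cell above (hypothetical names - substitute your own).
N = 100000
rng = np.random.default_rng()
samples = np.empty((N, 2))

# 1. x ~ p(x): Gaussian with mean mu_x and standard deviation sigma_x
samples[:, 0] = rng.normal(mu_x, sigma_x, size=N)

# 2. y|x ~ p(y|x): Gaussian with mean mu_y + rho*(sigma_y/sigma_x)*(x - mu_x)
#    and standard deviation sigma_y*sqrt(1 - rho^2)
cond_mean = mu_y + rho * (sigma_y / sigma_x) * (samples[:, 0] - mu_x)
cond_std = sigma_y * np.sqrt(1.0 - rho**2)
samples[:, 1] = rng.normal(cond_mean, cond_std)
```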

Let's have a look. Qualitatively compare the plot below with the heatmap of $p(x,y)$ you made in the previous tutorial, for the case of correlated variables.
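A plain scatter plot of the samples (reusing the names from the sketch above) is enough for this eyeball comparison; the density of points plays the role of the heatmap.

```python
import matplotlib.pyplot as plt

# Scatter plot of the joint samples; denser regions correspond to higher p(x,y)
plt.figure(figsize=(5, 4))
plt.plot(samples[:, 0], samples[:, 1], '.', markersize=1, alpha=0.2)
plt.xlabel('x')
plt.ylabel('y');
```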

2. Marginalization

Next, we'll look at estimates of the 1D marginal distributions from our samples. First, copy your implementations of the analytic solutions for these distributions from the previous notebook, so we have a known answer to compare to.

If you read ahead to the Monte Carlo Sampling notes, you've seen that the way we estimate a PDF from samples is by making a histogram (which simply records the density of the samples). Furthermore, the way we marginalize over a variable is incredibly simple - we just ignore it. So, estimating the marginal distribution of $x$ (or $y$) is as simple as making a histogram of the first (or second) column of samples.

To get a normalized histogram, we use the density=True option below - this simply divides the number of samples in each histogram bin by the bin width and the total number of samples. Notice that we don't have one of the sanity checks here that we did previously, namely the ability to numerically check that our expression for the marginal PDF was normalized. (We can, however, plot it over our normalized histogram to check that it looks ok.)

Below we plot the density of samples generated above, using hist, and compare with your analytic solution from the previous notebook.
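A sketch of what that comparison might look like, reusing the assumed names above and the fact that the marginals of a bivariate Gaussian are themselves Gaussian with the corresponding means and standard deviations:

```python
import scipy.stats as st

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Marginal of x: histogram the first column, ignoring y entirely
ax[0].hist(samples[:, 0], bins=50, density=True, label='samples')
xgrid = np.linspace(samples[:, 0].min(), samples[:, 0].max(), 200)
ax[0].plot(xgrid, st.norm.pdf(xgrid, mu_x, sigma_x), label='analytic $p(x)$')
ax[0].set_xlabel('x')
ax[0].legend()

# Marginal of y: the same trick with the second column
ax[1].hist(samples[:, 1], bins=50, density=True, label='samples')
ygrid = np.linspace(samples[:, 1].min(), samples[:, 1].max(), 200)
ax[1].plot(ygrid, st.norm.pdf(ygrid, mu_y, sigma_y), label='analytic $p(y)$')
ax[1].set_xlabel('y')
ax[1].legend()
```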

3. Conditioning

Again, pull in your analytic solutions from the previous notebook.

Conditioning is a little less straightforward than marginalization. In principle, if we want to condition on $x=x_0$, we would want to make a histogram of $y$ values for samples that have $x=x_0$. But we'd have to be incredibly lucky for any of our samples of $p(x,y)$ to satisfy $x=x_0$ exactly!

One natural (and necessarily approximate) solution is to work with samples that are close to $x=x_0$, within some window. To that end, store in j_fixed_x a list of row indices into samples for which $|y-y_0|<\epsilon$, where $y_0$ is fixed_y, above, and $\epsilon$ is a threshold of your choice; the $x$ values of those rows, samples[j_fixed_x,0], then approximately sample $p(x|y=y_0)$. Do the equivalent for $x$ and fixed_x also, as sketched below.
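One possible implementation, using np.where (eps is arbitrary, and the name j_fixed_y for the second index list is just our guess):

```python
eps = 0.1  # window half-width; a choice to experiment with

# Rows whose y lies within eps of fixed_y; samples[j_fixed_x, 0] then
# approximately samples p(x|y=fixed_y)
j_fixed_x = np.where(np.abs(samples[:, 1] - fixed_y) < eps)[0]

# The equivalent selection for conditioning on x=fixed_x; samples[j_fixed_y, 1]
# then approximately samples p(y|x=fixed_x). (Variable name is our guess.)
j_fixed_y = np.where(np.abs(samples[:, 0] - fixed_x) < eps)[0]
```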

Now let's see how histograms of the samples you selected compare with the analytic solution in each case. Feel free to fiddle with the value of $\epsilon$ (and also the bins option to hist, below). How does it look?
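For example, for the $x|y$ case (the conditional of a bivariate Gaussian is again Gaussian, with the mean and width written out below; names follow the sketches above):

```python
# Analytic conditional p(x|y=fixed_y) for a bivariate Gaussian
cmean = mu_x + rho * (sigma_x / sigma_y) * (fixed_y - mu_y)
cstd = sigma_x * np.sqrt(1.0 - rho**2)

plt.hist(samples[j_fixed_x, 0], bins=20, density=True, label='selected samples')
xg = np.linspace(cmean - 4.0*cstd, cmean + 4.0*cstd, 200)
plt.plot(xg, st.norm.pdf(xg, cmean, cstd), label='analytic $p(x|y)$')
plt.xlabel('x')
plt.legend();
```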

One obvious issue with this approach is that we end up with potentially many fewer samples in our estimate of the conditional distribution than we started with, so these conditional histograms will not look as smooth as the marginal ones above. However, in this case, there are more than enough samples that the agreement with the analytic curve should be clear. Checkpoint: if the agreement is not clear, go back and check your work.

Let's see what fraction of the samples are actually used in the conditional histograms:
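Something like this, using the index lists from the sketch above:

```python
# Fraction of all samples that fall within the conditioning windows
print('fraction used for p(x|y):', len(j_fixed_x) / samples.shape[0])
print('fraction used for p(y|x):', len(j_fixed_y) / samples.shape[0])
```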

There's not much one can do about this "waste" of samples, other than to take the time to generate samples directly from the conditional distribution, if we care that much.

However, we might get slightly better (or smoother) results by changing the nature of the window used to select samples. For example, instead of completely throwing away samples that are farther than $\epsilon$ from a value we want to condition on, we could use all the samples, weighted in such a way that the samples far from $x_0$ or $y_0$ contribute much less than those that are nearby. To that end, compute Gaussian weights for each sample based on their distance from $y_0$ (or $x_0$), with a standard deviation that you again get to pick.
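One way to compute such weights with scipy.stats (smooth_std and the weight-array names are our choices):

```python
smooth_std = 0.1  # width of the weighting kernel; another knob to turn

# Weight for each sample when conditioning on y=fixed_y ...
w_fixed_x = st.norm.pdf(samples[:, 1], fixed_y, smooth_std)
# ... and when conditioning on x=fixed_x
w_fixed_y = st.norm.pdf(samples[:, 0], fixed_x, smooth_std)
```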

Again, fiddle with the standard deviation and/or display binning to see what you can do. (Note the use of the weights option.)
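For instance, for the $x|y$ case again, reusing the analytic curve from the earlier sketch:

```python
# All samples contribute, but those far from fixed_y carry tiny weights
plt.hist(samples[:, 0], bins=50, density=True, weights=w_fixed_x,
         label='weighted samples')
plt.plot(xg, st.norm.pdf(xg, cmean, cstd), label='analytic $p(x|y)$')
plt.xlabel('x')
plt.legend();
```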

Chances are that neither of these options looks great, so if we really cared about having the conditional PDF mapped well we would either want more samples, or we would need to sample from the conditional PDF directly instead of dealing with the conditioning this way.

On the other hand, if we just wanted to estimate something simple about the conditional PDF, say its mean, this might be good enough. Using the Gaussian-weighting method, the estimated mean of $x|y=3.8$ is:
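One way to compute this estimate (the exact number you get will depend on your samples and on the smoothing width):

```python
# Weighted mean of x, with weights concentrating on samples near y=fixed_y
np.average(samples[:, 0], weights=w_fixed_x)
```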

... compared with the exact mean of $-0.5$.

4. Importance weighting

Let's go a little farther and think more generally about the marginal distribution of $y$ from the product of $p(x,y)$ and some other PDF, $q(x)$. Imagine that we have samples of $p(x,y)$ that were expensive to get, while $q(x)$ is straightforward to evaluate for any $x$. Then, instead of investing a lot of time in generating new samples from $p(x,y)\,q(x)$, we might want to do something like the weighting procedure above, which is called importance weighting.

We can think of conditioning as importance weighting with a PDF that says that $x$ must be really close to $x_0$, for example. But now, let's consider a different case. To keep it simple, let's say that $q(x) = \mathrm{Normal}(x|4,1)$. The weights for each sample are:
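That is, something along these lines, where w_q is just our name for the weight array:

```python
# q(x) = Normal(x|4,1) evaluated at each sample's x value
w_q = st.norm.pdf(samples[:, 0], 4.0, 1.0)
```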

Looking at the marginal distributions below, you can see how the marginal distribution of $x$ is, naturally, pulled to larger $x$, but also the PDF of $y$ is pulled to lower $y$ due to the negative correlation of $p(x,y)$.

The plots above also include analytic versions of the weighted distributions. We won't derive them here, but you can work it out with a little patience (see the Gaussians and Least Squares notebook). If you do, note that we've thrown away a constant, since we want to compare to the weighted histogram of samples, which is normalized when plotted above, whereas e.g. $\int dy \, p(x,y)\,q(x)$ is not automatically normalized the way that $\int dy \, p(x,y)$ is.

It looks like the weighting of the samples worked just about perfectly, to the extent that we can tell by eye. This is because the original set of $p(x,y)$ samples has good coverage of the region where the product $p(x,y) \, q(x)$ is large (i.e. there are a lot of samples there, thanks to producing a much larger number of samples initially). If we changed things so that the $q(x)$ pulled us farther into the tails of $p(x,y)$, the PDF estimate would get noisier, and eventually there would be no samples in the relevant region left to re-weight! On the other hand, if $q(x)$ had been skinnier, or if its mean had been closer to the mean of $p(x)$, we would still see the weighted samples matching the analytic solution well.

With this in mind, contrive a $q_2(x)$ such that the procedure above fails... badly.
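One possible choice (ours, not the only one) is a narrow $q_2$ centered far out in the tail of $p(x)$, where the samples run out:

```python
# A q2(x) sitting ~10 sigma out in the tail of p(x): almost no samples carry
# appreciable weight, so the re-weighted histogram becomes extremely noisy
w_q2 = st.norm.pdf(samples[:, 0], mu_x + 10.0*sigma_x, 0.5*sigma_x)
plt.hist(samples[:, 1], bins=50, density=True, weights=w_q2)
plt.xlabel('y');
```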

The lesson: Importance weighting should only be used if the samples you have are sufficient to cover the region where the resulting PDF is significantly non-zero!