Tutorial: Generative Models¶
In which you will:
- identify the quantities comprising the measured data and the parameters of the model;
- organize the data and parameters in terms of their conditional dependences into a probabilistic graphical model;
- express the dependences as specific probabilistic or deterministic relationships among variables;
- implement a model in code, and generate mock data.
Preliminaries¶
This exercise is mostly to practice going from a real-world(ish) problem described in words to an actionable model. It will also get you working a little with scipy.stats, if it's unfamiliar.
To be explicit, by model, we mean
- a list of quantities comprising your data and parameters from which predicted data can be produced;
- a PGM representing the conditional dependences of the parameters and data;
- a list of expressions containing the same information as the PGM, with the added specification of what probability distributions are involved.
"Expressions" in this context are of the form you saw in the reading, and need not be fully spelled-out equations, for example: translates to
- $a \Leftarrow b,c$
- $x \sim \mathrm{Normal}(\mu, \sigma)$
- $\mu \sim \mathrm{Uniform}(0, 10)$
- $a$ is a deterministic function of $b$ and $c$,
- $x$ is normally distributed with mean $\mu$ and standard deviation $\sigma$,
- $\mu$ is uniformly distributed between 0 and 10.
Every parameter and datum in the model must have such a rule for how it depends on other quantities. The result is a recipe for generating mock data, and also contains all the information needed to do inference given real data that we've collected.
There is no set rule saying that it's better to draw the PGM first and write the expressions second, or vice versa; different people find each approach more or less natural.
I strongly suggest inserting PGMs into your copy of the notebook for ease of reference, although it isn't strictly required. You can either make them digitally to begin with or draw them (neatly) on paper and take a photo. These days, I use Google Drawings or Keynote.
import numpy as np
import scipy.stats as st
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
1. Relaxed cluster fraction¶
Scenario: X-ray imaging data for $n=361$ galaxy clusters were analyzed, and $s=57$ of them were found to be morphologically "relaxed" according to some metric. We want to constrain the fraction of relaxed clusters in the Universe (the probability that a randomly chosen cluster is relaxed), $f$, assuming that the data set is representative.
Model specification¶
Enumerate the model parameters, draw a PGM and write down the corresponding probability expressions for this problem. Be explicit about the form of the sampling distribution (see Essential Probability), and choose some prior distribution for $f$ (it need not be defensible to an expert on galaxy clusters, but should be consistent with the interpretation of $f$ as a probability). You can assume that the total number of clusters, $N$, is given by fiat and doesn't need to be generated by the model. You can also assume that the label "relaxed or not" for a given cluster counts as data; that is, we are not extending the model all the way down to the imaging data on which this decision is based.
Space for PGM and expressions
Remember that, even though a real measurement was given above, the generative model may not depend on the measured values; it's a conceptualization of how those values did or could come to exist. That is, the data are the end point of the PGM flowchart, not the beginning.
Note: even though your answer will be checked on Canvas rather than here, we recommend you add it to the notebook for your future reference. The same goes for future "open ended" questions throughout the course.
Generate mock data¶
Next we'll go through the process of generating mock data from the model. But first...
Aside: using scipy.stats¶
When dealing with standard probability distributions (that is, those that have names like "Poisson"), scipy.stats (imported as st above) is often the best option. Code you write yourself for a specific calculation can sometimes be faster, but tends to be less clear and flexible, so we recommend using the built-in scipy.stats functionality unless speed is absolutely essential. Here is a very brief introduction, which you should supplement by reading the documentation as needed.
Each built-in distribution has its own list of "shape" parameters, which you can find out from its docstring, e.g. st.poisson? reveals that the Poisson distribution has one "shape" parameter, which is the mean. In the scipy implementation, every distribution also has loc and scale parameters, which are needlessly confusing and usually meaningless. But, for example, the normal distribution (st.norm) doesn't have "shape" parameters for its mean and standard deviation; instead the mean is loc and the standard deviation is scale. For the most part this isn't an issue, but it does mean that you should take care to check the definition of the "shape" parameters and the intended meaning of loc and scale before using a distribution for the first time. (For example, the two parameter arguments to st.uniform are not the lower and upper limits of a uniform distribution, as any sensible person would assume.)
Most of the time, one should directly call functions associated with each distribution, as in
- st.norm.pdf(0.5, 3.14, 2.72) - evaluate the PDF of a normal distribution with mean of 3.14 and standard deviation of 2.72 at $x=0.5$;
- st.poisson.cdf(3, 5.0) - evaluate the CDF of a Poisson distribution with mean 5.0 at $x=3$;
- st.uniform.rvs(10.0, 90.0) - generate 1 random draw from a uniform distribution between 10 and 100 (yes, really 100). Add the size= keyword argument to generate more than one at a time, returned as an array.
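As a quick sanity check of those calls (using the same arbitrary numbers as above):

print(st.norm.pdf(0.5, 3.14, 2.72))        # normal PDF with mean 3.14 and standard deviation 2.72, at x = 0.5
print(st.poisson.cdf(3, 5.0))              # Poisson CDF with mean 5.0, at x = 3
print(st.uniform.rvs(10.0, 90.0))          # one draw from a uniform distribution between 10 and 100
print(st.uniform.rvs(10.0, 90.0, size=5))  # five draws at once, returned as an array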
If (and only if) we are going to be making many calls to a distribution without changing the values of its parameters, it can make sense to store a "frozen" (that is, with fixed parameter values) distribution object. Depending on the distribution, this might gain us some efficiency, or it might simply be a convenience. For example,
some_pdf = st.uniform(10.0, 90.0)  # "frozen" uniform distribution between 10 and 100
for i in range(bignumber):
    u = some_pdf.rvs()
    # ... some calculations involving u ...
Ok, now on to generating mock data¶
Produce an array of mock data from your model. Note that there are multiple ways to do this, depending on whether your model represents the data as a list of "relaxed/not" labels or simply as the number that were relaxed. Either way, provide your answer below in mocks as a 1D array of size nsim, with each entry corresponding to the number of relaxed clusters from a given simulation.
nsim = 10000
# some definitions/calculations, perhaps...?
# mocks =
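Purely as an illustration (your model and prior may well differ), here is one possible approach, assuming a Uniform(0,1) prior on $f$ and a Binomial sampling distribution for the number of relaxed clusters; the prior choice is an assumption made only for this sketch.

N = 361  # total number of clusters, as given in the scenario
# draw a value of f from the assumed Uniform(0,1) prior for each simulation
f_draws = st.uniform.rvs(0.0, 1.0, size=nsim)
# for each f, draw the number of relaxed clusters from a Binomial(N, f) sampling distribution
mocks = st.binom.rvs(N, f_draws)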
Below we compare the distribution of mock data with the single real measurement (they may or may not look anything alike, depending on what prior you chose).
plt.hist(mocks, label='mock data'); # show histogram of simulations
plt.axvline(x=57, color='C1', label='real data'); # compare with data
plt.xlabel('relaxed number');
plt.ylabel('frequency');
plt.legend();
2. Linear regression¶
In this second exercise, we'll look at the more generic scenario of fitting a line to $(x,y)$ data with "measurement errors" on $y$. Later on, you will be subjected to a long rant about the misleading and meaningless nature of the phrase "measurement error", but nevertheless this is a very common task with a long history.
More concretely, we have a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is the estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data.
In the absence of any better information, it's common to assume that $\vec{x}$ and $\vec{\sigma}$ are somehow known precisely, and that the "error" on $y_k$ is Gaussian (with mean $a+b x_k$ and standard deviation $\sigma_k$). This is the very common set of assumptions underlying the method of weighted least squares, which you've likely seen before.
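In the expression notation from earlier, that Gaussian assumption (just the sampling distribution; the priors on $a$ and $b$ are left to you) reads
- $y_k \sim \mathrm{Normal}(a + b x_k, \sigma_k)$ for each $k$.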
As before, enumerate the model parameters, draw a PGM and write down the corresponding probability expressions for this problem. You will need to come up with priors for the model parameters; these can be anything reasonable, given the vague problem definition.
Space for PGM and expressions
Now we'll generate some mock data, as before. The magically known values of $x$ and $\sigma$ are provided below.
x = np.array([3.806556, -4.417116, -2.834024, -2.402881, -1.96018, 6.492918, 2.198745, 7.444938, -6.675794, 3.92799])
sigma = np.array([0.7740595, 0.3079592, 2.211754, 0.9071412, 1.129049, 1.762186, 1.275478, 0.7902194, 1.255513, 1.201411])
Generate nsims mock data sets, $y$. The result, stored in the creatively named mocks2, should be an nsims by len(x) array.
# mocks2 = ...
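Again, one possible sketch only, assuming (for illustration) broad Normal(0, 10) priors on the intercept $a$ and slope $b$: draw $(a,b)$ from the priors, compute $a + b x_k$ for each point, and add Gaussian scatter with the given $\sigma_k$.

nsims = 10000  # number of mock data sets (value chosen here for illustration)
# draw intercepts and slopes from the assumed Normal(0, 10) priors
a_draws = st.norm.rvs(0.0, 10.0, size=nsims)
b_draws = st.norm.rvs(0.0, 10.0, size=nsims)
# model prediction for each mock data set at each x; shape (nsims, len(x))
y_model = a_draws[:, None] + b_draws[:, None] * x[None, :]
# add Gaussian "measurement errors" with the given standard deviations
mocks2 = st.norm.rvs(y_model, sigma[None, :])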
Below we visualize 3 of them, with tiny $x$ offsets to keep things from overlapping too much.
plt.errorbar(x, mocks2[0], sigma, fmt='.');
plt.errorbar(x-0.1, mocks2[1], sigma, fmt='.');
plt.errorbar(x+0.1, mocks2[2], sigma, fmt='.');
plt.xlabel('x');
plt.ylabel('y');
Parting thoughts¶
Getting used to thinking generatively - from the model to the data rather than backwards from the data to estimates of model parameters - is a fairly common stumbling block. So, while it might have seemed very simple, it's worth spending some time on these simple examples to get us on the same page. The connection between these models and actually doing inference from data may still be murky, and that's ok. We'll get to that soon, and it will be much easier if you've gotten used to thinking about data in generative terms.