In which you will:
* translate real-world(ish) problems described in words into generative models: lists of parameters, PGMs, and the corresponding probability expressions
* generate and visualize mock data from such a model
This exercise is mostly to practice going from a real-world(ish) problem described in words to an actionable model.
To be explicit, by model, we mean a list of the quantities involved (parameters and data), a PGM showing how they are related, and the corresponding probability expressions.
"Expressions" in this context are of the form you saw in the reading, and need not be fully spelled-out equations, for example:
translates to
Every parameter and datum in the model must have such a rule for how it depends on other quantities (or priors). The result is a recipe for generating mock data, and also contains all the information needed to do inference given real data that we've collected.
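To make the "recipe" aspect concrete, a shorthand expression like the one above can be executed directly as a mock-data generator. The sketch below is purely illustrative; the assumed Normal prior, its parameters, and the sample size are all made up.

```python
import numpy as np

# Illustrative sketch only: the prior, its parameters and the sample size are made-up assumptions.
rng = np.random.default_rng(42)
mu = rng.normal(0.0, 10.0)       # draw the parameter from an assumed Normal(0, 10) prior
x = rng.normal(mu, 1.0, size=5)  # then draw mock data x_k ~ Normal(mu, sigma=1)
print(mu, x)
```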
There is no set rule saying that it's better to draw the PGM first and write the expressions second, or vice versa; different people find each approach more or less natural.
To turn in a PGM, you could, for example, use the `daft` package to produce a PGM graphic directly in Python. (Personally, I find the current version of `daft` extremely ugly and have taken to using Google Drawings or old-fashioned scribbling, but whatever works.)
Finally, note that some of the situations described below are intentionally ambiguous. Expect to have to make some assumptions in order to fully specify the model, and note what they are.
TutorialName = 'genmod'
exec(open('tbc.py').read()) # define TBC and TBC_above
import numpy as np
import scipy.stats as st
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
%matplotlib inline
X-ray imaging data for 361 galaxy clusters were analyzed, and 57 of them were found to be morphologically "relaxed" according to some metric. We want to constrain the fraction of relaxed clusters in the Universe (the probability that a randomly chosen cluster is relaxed), $f$, assuming that the data set is representative.
Enumerate the model parameters, draw a PGM and write down the corresponding probability expressions for this problem. Be explicit about the form of the sampling distribution (see Essential Probability), and arbitrarily choose some prior distribution for $f$ for the second part, below. You can assume that the total number of clusters, 361, is given by fiat and doesn't need to be generated by the model. You can also assume that the decision "relaxed or not" for a given cluster comes from an algorithm that's given to you, without your having to spell out how it might work or specifically what data might be fed into it.
Remember that, even though a real measurement was given above, the generative model must not depend on the measured values; it's a conceptualization of how those values did or could come to exist.
Note: you will want to change the mode of the cell below, and in subsequent non-coding responses, to Markdown to use LaTeX mode and/or insert images.
TBC()
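If it helps to see the shape such an answer might take, one possible (and by no means unique) sketch is below; the Beta prior and its hyperparameters $(\alpha_0, \beta_0)$ are arbitrary assumptions rather than the required choice.

\begin{align}
  f &\sim \mathrm{Beta}(\alpha_0, \beta_0) && \text{(assumed prior on the relaxed fraction; an arbitrary choice)} \\
  N_\mathrm{relaxed} \mid f &\sim \mathrm{Binomial}(N_\mathrm{tot}{=}361,\ f) && \text{(sampling distribution for the number of relaxed clusters)}
\end{align}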
Go through the process of generating mock data from the model. Produce a visualization that compares an ensemble of mock data sets (say 1000) for
(1) model parameters fixed at some fiducial value(s)
TBC()
(2) model parameters varying according to the PGM/expressions you write down above
TBC()
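For instance, a minimal sketch covering both cases might look like the following; the fiducial value of $f$ and the Beta prior parameters are assumptions made purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch only: the fiducial f and the Beta prior parameters are arbitrary assumptions.
rng = np.random.default_rng(1)
n_total, n_mocks = 361, 1000

# (1) model parameters fixed at a fiducial value
f_fid = 0.16
mocks_fixed = rng.binomial(n_total, f_fid, size=n_mocks)

# (2) model parameters varying according to the prior (here an assumed Beta prior)
f_draws = rng.beta(1.0, 1.0, size=n_mocks)
mocks_varying = rng.binomial(n_total, f_draws)

plt.hist(mocks_fixed, bins=30, density=True, alpha=0.5, label='fixed $f$')
plt.hist(mocks_varying, bins=30, density=True, alpha=0.5, label='$f$ drawn from prior')
plt.xlabel('number of relaxed clusters per mock data set')
plt.ylabel('density')
plt.legend();
```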
Your data is a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is some estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data.
In the absence of any better information, assume that $\vec{x}$ and $\vec{\sigma}$ are (somehow) known precisely, and that $y_k$ is Gaussian-distributed with mean $a+bx_k$ and standard deviation $\sigma_k$. This is the very common set of assumptions underlying the method of weighted least squares, which you've likely seen before.
Enumerate the model parameters, draw a PGM and write down the corresponding probability expressions. Optionally, generate and visualize mock data for this problem.
TBC()
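If you choose to generate mock data, a sketch along these lines might serve as a starting point; the true $a$ and $b$, the $x_k$ grid, and the $\sigma_k$ values below are all made-up assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch only: a, b, the x values and the sigma_k are made-up assumptions.
rng = np.random.default_rng(2)
a_true, b_true = 1.0, 0.5
x = np.linspace(0.0, 10.0, 20)               # assumed known precisely
sigma = 0.3 * np.ones_like(x)                # assumed known precisely
y = rng.normal(a_true + b_true * x, sigma)   # y_k ~ Normal(a + b*x_k, sigma_k)

plt.errorbar(x, y, yerr=sigma, fmt='o', label='mock data')
plt.plot(x, a_true + b_true * x, label='true line')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend();
```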
You've taken several images of a particular field, in order to record the transit of an exoplanet in front of a star (resulting in a temporary decrease in its brightness). Some kind of model, parametrized by $\theta$, describes the time series of the resulting flux. Before we get to measure a number of counts, however, each image is affected by time-specific variables, e.g. related to changing weather. To account for these, you've also measured 10 other stars in the same field in every exposure (you can assume that the weather affects everything in the same image equally). The assumption is that the intrinsic flux of each of these stars should be constant in time, so that they can effectively be used to correct for photometric variations, putting the multiple measurements of the target star on the same scale.$^1$
Enumerate the model parameters, draw a PGM and write down the corresponding probability expressions.
Please note that this scenario is (intentionally) much more complex than the others. Based on the information here, it would be impossible to specify everything fully, including the equations linking different parameters. Instead, focus on the relationships among parts of the model: what depends on what? And remember that we're thinking generatively, as in the model generates the data. It may be helpful to produce your own narrative, more detailed than the one above, describing how your model encodes the physical picture of various stars eventually producing our measurements. In the interest of retaining some degree of simplicity, there is no need to drill down to the level of pixels in an image; let's say instead that the number of counts received from a given source in a given image is something we can officially call data.
TBC()
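Purely as an illustration of the kind of dependency structure one might write down (not a complete or unique specification), the sketch below assumes a multiplicative per-image "weather" factor $T_i$ and Poisson-distributed counts; none of these choices are prescribed by the problem.

\begin{align}
  F^\mathrm{targ}_i &= F(\theta, t_i) && \text{(transit-model flux of the target at time } t_i\text{)} \\
  N^\mathrm{targ}_i \mid T_i &\sim \mathrm{Poisson}\!\left(T_i\, F^\mathrm{targ}_i\right) && \text{(counts from the target in image } i\text{; Poisson is an assumption)} \\
  N^\mathrm{ref}_{ij} \mid T_i, F^\mathrm{ref}_j &\sim \mathrm{Poisson}\!\left(T_i\, F^\mathrm{ref}_j\right) && \text{(counts from constant reference star } j \text{ in image } i\text{)}
\end{align}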
$^1$ Thanks to Anja von der Linden for inspiring (and then correcting) this problem.