Tutorial: Demo¶
This notebook introduces the tutorial notebook format we use in this class.
The first thing to say is that you cannot complete one of these tutorials if you are viewing it as HTML in a web browser! You will need to be able to make changes and run the notebook in Jupyter, either in Google Colab, on your own computer, or on some remote system (see Getting Started).
It's worth knowing that there are "solved" notebooks on the course website. "Solved" means that the outputs from a correct solution are shown, though not the solution code itself. This can be extremely useful for checking that you are getting something reasonable at each step (though note that you will be given different data, so the correct answers may be numerically different). It's also a good idea to look at the solution if you suspect a given cell will take a while to run - in these cases we put a %%time command in the code, and the solution will show how long that cell took on the instructor's 10-year-old laptop. (We aim to keep total run time for these notebooks at less than a minute, although some of the later ones take a few minutes.)
The typical tutorial will contain a short description of the problem and a list of specific tasks you will be practicing. For example, in this demo, you will
- be introduced to the general format of a Physics 267 tutorial;
- read about the ways that you might need to provide code, and how that code can be checked; and
- learn more about how the evaluation process works.
This will be followed immediately by a code cell where we import all the packages that will be needed. In Colab, you will need to uncomment the !pip install lines at the top. We don't need any of this in the demo, but for example:
# !pip install bcgs
from os import getcwd
from yaml import safe_load
from numba import njit
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
import bcgs
Next will be a cell that crudely figures out whether the notebook is being run in Colab. If so, it attempts to mount your Google drive, and sets datapath to the location where the data for this tutorial is expected to be. If not, we assume any data is in data/ relative to the notebook's location. You can change this if need be, but otherwise you should not normally need to change anything here.
thisTutorial = 'demo'
if getcwd() == '/content':
    # assume we are in Colab, and the user's data directory is linked to their drive/Physics267_data
    from google.colab import drive
    drive.mount('/content/drive')
    datapath = '/content/drive/MyDrive/Physics267_data/' + thisTutorial + '/'
else:
    # assume we are running locally somewhere and have the data under ./data/
    datapath = 'data/'
Background¶
Normally, this setup will be followed by a more detailed exposition of the problem to be solved, including reading in the data and looking at it in some way. We generally provide different data sets to each student, though a complete notebook will work with any of them. Most often, these are simulated data, so the "correct" answer can differ from student to student. This is just for fun. If you end up really struggling, you can also download the "public" data, from which the solutions are generated, to more closely compare what you're getting with the solutions.
Our usual mode will involve reviewing the model under consideration in class and jointly deciding on what priors to adopt for a given analysis. Since we don't know what those will be ahead of time, this is another reason the solutions on the web might differ from what you end up with.
There is generally also a Markdown cell reserved for you to specify the model in pseudocode and as a PGM. This is not required, but it's a very good idea to keep that information right in the notebook. Keep in mind that if you ask for help, it is literally the first thing you will be asked for. (In Colab and classic Jupyter Notebook, importing an image that is saved in the notebook file itself is straightforward. Unfortunately, this seems to not be a thing in Jupyter Lab yet.)
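For example (a purely hypothetical model, invented here just to illustrate the format, not one from an actual tutorial), the pseudocode in that Markdown cell might read:

mu ~ Normal(0, 10)          # prior on the mean
sigma ~ Uniform(0, 5)       # prior on the scatter
x_i ~ Normal(mu, sigma)     # sampling distribution for each measurement i

with the corresponding PGM sketched on paper or attached as an image.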
Code completion¶
Most of the code that doesn't particularly reflect what this class is meant to teach will be provided, but there will be places where you need to provide code, usually the definition of important variables or the guts of a function. The former case might look like this:
# prior_params = {'alpha':..., 'beta':...}
# YOUR CODE HERE
raise NotImplementedError()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[3], line 4
      1 # prior_params = {'alpha':..., 'beta':...}
      2 
      3 # YOUR CODE HERE
----> 4 raise NotImplementedError()

NotImplementedError: 
Where, in context, it should be clear that you are asked to define a dictionary prior_params with keys alpha and beta, and are expected to insert the correct values of those entries (whatever they are). The raise statement should be deleted when you're ready to test. So after you've done your thing, in the original cell rather than a new one, it would look something like this:
prior_params = {'alpha':0.125, 'beta':0.561}
The case of being asked to provide a function might look like:
def prior(f, alpha, beta):
    # return the prior density evaluated at f, p(f)
    # YOUR CODE HERE
    raise NotImplementedError()
and, when completed:
def prior(f, alpha, beta):
    return f * alpha / beta # To be clear, I made this up out of nowhere.
Checkpoints¶
You are expected to check your own results for reasonableness: on your own, by comparing with the solutions, and through peer feedback in class. In addition, where it's straightforward, we will sometimes provide built-in checks, in the form of assert statements like the following:
# some pointless code
x = 3.14159
assert np.isclose(x, np.pi)
The above statement would throw an exception if the value of x is not $\pi$ to within some tolerance defined by the numpy.isclose function. In real life, such a test would be intended to identify a bug before moving on, or to ensure that a Monte Carlo process has converged sufficiently.
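For reference, numpy.isclose compares two numbers to within a relative and an absolute tolerance (rtol=1e-05 and atol=1e-08 by default), and these can be passed explicitly if a check legitimately needs to be looser. A quick illustration:

np.isclose(3.14159, np.pi)           # True: the difference (~2.7e-6) is within the default tolerance
np.isclose(3.14, np.pi)              # False: the difference (~1.6e-3) is not
np.isclose(3.14, np.pi, atol=0.01)   # True again, with a looser absolute tolerance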
The automatic checks can't foresee all possible circumstances, and, as noted above, are not the only mechanism by which you should check your results. Between our extensive use of Monte Carlo methods in this class and the different data sets distributed to each student, it is possible that an assertion may end up being too stringent as coded. Obviously, if this happens, you will not be penalized for failing the check. But you should not presume that this is the case, in general.
Please keep in mind that these checks are intended to help you by identifying possible bugs in real time, without having to turn in the notebook, wait for feedback, revise, etc. You might be tempted to simply circumvent them, but please note that this is not actually possible in the grading phase (see below).
Open-ended questions¶
Sometimes you will find an open-ended question in a notebook. You do not need to answer these in place (although, like the model specification, it is encouraged). We will discuss these in class, so you are discouraged from simply jumping over them without thinking. To assist in this discouragement, you may see assertions used as a speed bump in this context, as in:
What do you think would happen if we changed this or that?
I_have_thought_deeply_about_this = False # change to True when true
# YOUR CODE HERE
raise NotImplementedError()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[9], line 4
      1 I_have_thought_deeply_about_this = False # change to True when true
      3 # YOUR CODE HERE
----> 4 raise NotImplementedError()

NotImplementedError: 
assert I_have_thought_deeply_about_this
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 assert I_have_thought_deeply_about_this

AssertionError: 
You should, naturally, change the value to True
after having thought about the question. Deeply.
Before turning in your notebook¶
- Delete outputs,
- restart the Python kernel, and
- run the notebook from top to bottom.
As detailed below, if your notebook cannot run completely and successfully under these circumstances, it will be returned to you for revision, so it pays to check yourself. We recommend saving the outputs from this run in the file you turn in; this isn't required, but might help identify the issue if our independent run of your notebook fails for some reason.
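If you like, this clean-run check can also be scripted rather than done through the Jupyter menus (Kernel > Restart & Run All, or the equivalent). Below is a minimal sketch using the nbformat and nbconvert packages, assuming they are installed and using a hypothetical file name demo.ipynb:

import nbformat
from nbconvert.preprocessors import ClearOutputPreprocessor, ExecutePreprocessor

nb = nbformat.read('demo.ipynb', as_version=4)        # load the notebook
ClearOutputPreprocessor().preprocess(nb, {})          # delete all existing outputs
ExecutePreprocessor(timeout=600).preprocess(nb, {'metadata': {'path': '.'}})  # run top to bottom in a fresh kernel
nbformat.write(nb, 'demo.ipynb')                      # save, with the new outputs included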
What happens after you turn in a notebook¶
The first thing that will happen is a hands-off process in which
- a copy of your notebook is made,
- existing outputs are deleted,
- cells containing checkpoints are reverted to their original content, if they had been changed,
- additional secret checkpoints are added (of the kind that could not be revealed without giving too much away),
- the copy is run from top to bottom.
Note that this run will use the data that were assigned to you. You are always free to use different data (whether the "public" data, or some other student's) when working, but our expectation is that your notebook will work when run using your own data.
The upshot of all this is an automatic check that your notebook functions, in the sense of not crashing due to a bug or the failure of an assert
statement. If the notebook does crash, it will be returned to you with a note about what the issue appeared to be. We will try to be prompt with this, but cannot guarantee it. It is therefore to your advantage to seek help more directly if you already know there is a problem, rather than submitting the assignment and waiting for feedback. As noted above, we will obviously make allowances if a checkpoint assertion turns out to be too stringent for your particular data set for some reason, but please don't just assume that this is the case.
A consequence of this system is that it may be possible to alter a notebook so much that it cannot be successfully reconciled with the original, and might fail for that reason. We have not seen this, but also haven't really tried to break things. If this happens, you will be congratulated on thinking outside the box, and asked to resubmit the assignment with a notebook that follows the tutorial structure more closely. We're pretty sure that adding new code cells in places where you are asked to complete things is ok. It goes without saying that deliberate attempts to circumvent these checks will be considered cheating and dealt with under the applicable University policies.
Please also keep in mind that "not crashing" is necessary but not sufficient for your notebook to be complete. The hope is that all of these code checks will catch many or most mistakes that might be made, ideally with that feedback being immediate, but they aren't sophisticated enough on their own to say whether everything has been done correctly. The purpose of the automatic code-checking framework is not to replace human grading, but to make it possible for the human grader to focus more on the substance of the work than on debugging.
On computational efficiency and parallelization¶
Here we share a few words on computing details, which are also touched on in the Generative Models tutorial, and illustrate them (which is not done elsewhere).
On the subject of parallelization, the message is straightforward: parallelization is an excellent way to get things done in less human time, and many of the methods we will use are straightforwardly parallelizable. Unfortunately, taking advantage of this in a way that will work on anyone's machine plus Colab, in a notebook, is not really feasible. So in this class we will not really use parallelization, but you should know that it's there, often easily, for your own work. With rare exceptions, run time will not be much of a concern in the assignments for this class regardless (i.e., once complete, a notebook will usually take a couple of minutes to run, at most, even without parallelization).
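To give a flavor of what "often easily" can mean (emphatically not something we will use in these tutorials), here is a minimal sketch of how a set of independent calculations could be spread over several processes in a stand-alone script, using Python's built-in multiprocessing module; the function and the number of tasks are made up for illustration:

import multiprocessing
import numpy as np

def expensive_calculation(seed):
    # stand-in for some costly, independent piece of work (e.g. analyzing one simulated data set)
    rng = np.random.default_rng(seed)
    x = rng.normal(size=100000)
    return np.sum(-0.5 * x**2)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(expensive_calculation, range(8))  # 8 independent tasks, 4 worker processes

Note that this exact pattern tends to misbehave inside notebooks on platforms where worker processes are started by "spawn" (functions defined in the notebook itself may not be importable by the workers), which is part of why we don't rely on it here.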
More generally, writing efficient code in Python is a helpful thing to know how to do. Like any interpreted (as opposed to compiled) language, Python suffers a bit from the fact that exactly the same calculation coded in different ways can take a vastly different amount of time to run - like orders of magnitude, in some cases. This stuff may be familiar to you already; if not, read on.
Let's consider the task of computing the log of a product of Gaussian terms with known means and variances. That is,
$R = \ln\left( \prod_i \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left[ -\frac{(x_i-\mu_i)^2}{2\sigma_i^2} \right] \right)$.
Below is the slowest way we can (quickly) think of implementing this. We'll repeat the calculation a number of times and time it to see how long it takes.
x = np.arange(100.) # just some x values that we will use in the example
mu = np.zeros(x.shape) # similarly, just some values of mu and sigma
sigma = np.ones(x.shape) # (they could differ across the array, but this doesn't really matter for the example)
%%time
for j in range(100):
    R = 1.0
    for i in range(len(x)):
        R *= st.norm(mu[i], sigma[i]).pdf(x[i])
    R = np.log(R)
<timed exec>:5: RuntimeWarning: divide by zero encountered in log
CPU times: user 9.29 s, sys: 197 ms, total: 9.49 s Wall time: 9.4 s
Ok, so something like 100ms per evaluation on my machine. We can do much better than this.
However, our first change will be more about accuracy than efficiency. Namely, taking the product of many terms like this is an excellent way to end up with either exactly zero or infinity, given limited numerical precision (note the warning above). If we're going to take the log anyway, it's much wiser to compute the sum of logs rather than the log of a product. In fact, the scipy distribution objects that we used above have a method logpdf in addition to pdf, anticipating just this situation.
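Written out explicitly, the sum-of-logs version of the same quantity is

$R = \sum_i \left[ -\frac{(x_i-\mu_i)^2}{2\sigma_i^2} - \ln\sigma_i \right] - \frac{n}{2}\ln(2\pi)$,

where $n$ is the number of data points; each term stays a manageable size even when the product above would underflow to zero or overflow to infinity.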
%%time
for j in range(100):
    R = 0.0
    for i in range(len(x)):
        R += st.norm(mu[i], sigma[i]).logpdf(x[i])
CPU times: user 10.7 s, sys: 173 ms, total: 10.9 s Wall time: 11.6 s
In theory we should have a more accurate answer now, but it didn't save any time. So here's an important thing to note that may not be obvious: within the loop, we are creating a new object, st.norm(mu[i], sigma[i]), with every iteration, and never using it again. There is a lot of overhead associated with this, which has no benefit to us in this particular situation. Usefully, the scipy distribution classes provide class methods (which do not require an object to be created) that can accomplish the same thing.
%%time
for j in range(100):
    R = 0.0
    for i in range(len(x)):
        R += st.norm.logpdf(x[i], mu[i], sigma[i]) # subtly different syntax, big difference in efficiency
CPU times: user 1.43 s, sys: 101 ms, total: 1.53 s Wall time: 1.44 s
That bought us a factor of ~10 in speed, to ~10ms per evaluation, but we can definitely do better. Here's the cardinal rule of efficient code in Python: you almost never have to write a for loop to iterate over an array. Most numpy/scipy functions can operate over arrays transparently, and usually this will be much faster (at least, it should be no slower). In this case, it's such a big deal that we will need to increase the number of times we evaluate the calculation in order to keep the runtime macroscopic.
%%time
for j in range(10000):
    R = np.sum(st.norm.logpdf(x, mu, sigma))
CPU times: user 1 s, sys: 61.1 ms, total: 1.06 s Wall time: 1.02 s
We are now about 100x faster again, at ~100$\mu$s per evaluation. This is usually efficient enough for us in this class, and using scipy functions is a nice way to avoid accidental bugs, so we'll usually stick with code like this. However, if need be, it is sometimes even faster to directly code the math we are trying to do (this isn't obvious, and depends on what the scipy and numpy functions are doing internally).
%%time
for j in range(100000):
    R = np.sum( -0.5*((x-mu)/sigma)**2 - np.log(sigma) ) - 0.5*len(x)*np.log(2*np.pi)
CPU times: user 1.42 s, sys: 4.91 ms, total: 1.43 s Wall time: 1.43 s
Above, we have cleverly moved the constant term outside the sum, which makes a small difference. In practice, it will often (not always) be the case that we only actually need the terms in the sum above that involve $x$, with the $\ln(\sigma)$ part being extraneous. If we go ahead and make that change...
%%time
for j in range(100000):
    R = -0.5 * np.sum( ((x-mu)/sigma)**2 )
CPU times: user 844 ms, sys: 3.3 ms, total: 847 ms Wall time: 850 ms
... we have sped up a little more, to ~10$\mu$s per evaluation, although comparing the last two cells you can see that we're starting to get diminishing returns from trying to optimize the Python code. If we were going to repeat this calculation so many times that we still needed more efficiency, we would turn to numba, which streamlines the process of providing compiled code for Python to run. This requires more explanation than we are inclined to delve into at the moment, but you can see the implementation for our example below.
@njit
def f(x, mu, sigma):
    R = 0.0
    for i in range(len(x)):
        R += ((x[i]-mu[i])/sigma[i])**2
    return -0.5 * R
# numba compiles the function the first time it's called, so it takes more time than subsequent calls.
# therefore, let's do this outside of the timing loop.
f(x, mu, sigma);
%%time
for j in range(1000000):
    R = f(x, mu, sigma)
CPU times: user 954 ms, sys: 3.53 ms, total: 958 ms Wall time: 957 ms
You can see that this has bought us another order of magnitude, bringing us down to ~1$\mu$s per evaluation. Again, we will usually be content to use scipy functionality for the readability it provides, despite the factor of 100 slowdown it apparently implies. (We are now deliberately not rethinking the wisdom of this...) But it's good to know how to get more efficiency if it's needed.
Parting thoughts¶
We will typically try to summarize what the tutorial covered and its context in the course at the end of the notebook. You will read these notes and be fascinated.
In this notebook, we outlined the typical structure of a tutorial, introduced the mechanics of completing one, and described the automatic verification that is part of their evaluation. We also digressed into the realm of code optimization, which will hopefully be helpful as we go forward.