Dr. Adam Mantz is a Research Scientist in the Kavli Institute for Particle Astrophysics and Cosmology (KIPAC). He has a background in observational astrophysics and cosmology, specializing in studies of clusters of galaxies and the intracluster medium. For much of the past two years, he was on "sabbatical", working part-time on medical imaging research in the Division of Oncology of the Stanford School of Medicine. He has been involved in the development and teaching of this course since its inception, and has adapted parts of it for the LSSTC Data Science Fellowship Program. Any time the first-person "I" appears in these notes, you may blame him. Any time the first-person "we" appears, it's probably still him being presumptuous.$^1$
Our goal is that students taking this course will:
- Understand and apply the principles underlying statistical inference from data, and the role of probabilistic modeling in inference
- Gain familiarity with various numerical algorithms used in statistical inference
- Apply the above principles and skills to realistic astrophysical inference problems
In a nutshell, this class is about how data are turned into conclusions. The examples and tutorials are taken from astrophysics, but otherwise the content is extremely general.
To be a little more concrete, consider this cartoon of the scientific process:
"Turning data into conclusions" broadly refers to the bold items. In particular, we will not spend time on "data reduction", by which we mean the initial processing steps applied to raw data recorded by some instrument. These details tend to be highly specific to a given detector (or detector type), wavelength range, telescope system, and so on. On the other hand, we will be introducing and working with real astronomical data, so you'll get some experience with it if this is new to you. Even so, there's far too much variety for us to touch on every type of data you might encounter!
This course will also be focused on a particular approach to drawing conclusions from data, known as Bayesian analysis. This is because, as we'll see, Bayes' Law provides a rigorous basis for formalizing what the phrase "turning data into conclusions" (more properly called inference) actually means. The upshot is that we will not be marching through a survey of ad hoc methods and the problems they might be applied to. Instead, the goal will be to carefully define the question we are asking of the data, and then decide what techniques and/or (justifiable) approximations will get us to an answer. This, we hope, will be appealing to students with a physics background.
Finally, this course aims to be practical. So, lest the previous paragraph give you the wrong idea, we will not have problem sets where you are asked to prove theorems or show that such-and-such is true. (No more than one, anyway.) With the exception of a bit of review at the start, the assigned problems will be tutorials that walk through a real-life analysis, albeit usually a simplified one, so that you gain experience putting the methods we will discuss into practice. The goal is that, by the end, you will be able to apply the reasoning and methods of Bayesian analysis in your own work.
Strictly speaking, any procedure applied to data could be called a "statistical method". However, not all such methods allow us to rigorously draw conclusions of interest - in full generality, all we can conclude from the application of some method is what the output of that method is. In contrast, our goal with inference is usually to learn something about a model from the data, not just the number(s) that an algorithm turns the data into. This requires quantifying the probability of various possibilities in light of the data, and this in turn means applying a principled, mathematical framework - or, at least, starting from one, even if we end up making approximations. That isn't to say that there's no place for other methods, but we think it's important and generally useful to understand Bayesian reasoning, regardless. Not to mention, it's enough to fill a quarter on its own.
So, to be perfectly upfront, this class will not significantly delve into
Again, these methods have their uses (and can be interpreted in the Bayesian framework in the right limits), but will not be a major focus.
We use the Python programming language extensively, and it is not generally feasible to substitute a different language. Some previous experience with Python, or the capacity and desire to pick it up really, really fast, is therefore required. You do not need to have any particular experience beyond the basics, or with any specific packages like numpy or scipy, though it doesn't hurt.
Previous exposure to the concepts and calculus of probability would be very helpful, though we will review them briefly. In this context, undergraduate quantum mechanics or statistical mechanics will do, as would (naturally) an undergraduate statistics/probability course. Again, this is a case where the amount of effort it takes to get going at the beginning of the course will depend on how much of the learning curve you've already climbed.
Notes: What were once lecture materials for this class have been expanded and turned into (we hope) readable notes. If you're reading this, then you know where to find them already. There is no textbook associated with the class, so the notes (in combination with the tutorials) are intended to be comprehensive.
Tutorials refer to Python-language Jupyter notebooks where you will put the material covered in the notes into practice. Often these will consist of a partially worked inference problem, where the code required to illustrate an understanding of the subject at hand is left out, and most other "background noise" code (the kind that doesn't demonstrate an understanding of statistics, yet is annoying to write) is provided. Normally, a tutorial notebook will have several "checkpoints" where you can compare your results to a known solution to make sure that you're on the right track.
Things this course doesn't have:
Here is a very abbreviated summary of some of the critical ideas in this course. If they don't make sense now, don't worry; we'll be unpacking them much more later on.
i) All data we collect include some degree of randomness
This can be intrinsic to the source, a result of the measurement process, or effective randomness due to the involvement of some process we don't know or understand perfectly.
ii) Any conclusions we draw must therefore incorporate uncertainty
This means we should describe both the data and conclusions in the language of mathematical probability.
Our conclusion will take the form: the probability that something is true in light of (given) the data we collected,
$P(\mathrm{something}|\mathrm{data})$.
By the basic laws of probability, this can be written
$P(\mathrm{something}|\mathrm{data}) = \frac{P(\mathrm{data}|\mathrm{something}) \, P(\mathrm{something})}{P(\mathrm{data})}$.
We'll spend much more time understanding this later, but, importantly, it means that
iii) There is a correct answer
As in physics, the theory tells us the solution. The challenge is in evaluating it.
In other words, we aren't free to pick or dream up a procedure that kinda, sorta feels like it should be about right. The equation above is indisputably correct, but how to use it, and when/how to approximate it, is something to be learned.
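To make the equation above concrete, here is a minimal numerical sketch of Bayes' Law for an invented toy problem with just two competing hypotheses, H1 and H2 (all of the probabilities below are made up for illustration, not taken from any real analysis):

```python
# Prior probabilities, P(something): what we believed before seeing the data.
# These are invented numbers for a hypothetical two-hypothesis problem.
prior = {"H1": 0.5, "H2": 0.5}

# Likelihoods, P(data|something): how probable the observed data are
# under each hypothesis (again, invented for illustration).
likelihood = {"H1": 0.8, "H2": 0.3}

# Evidence, P(data): a sum over hypotheses (the law of total probability).
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior, P(something|data), via Bayes' Law.
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(posterior)  # the posterior probabilities sum to 1
```

The hard part in real problems is not this arithmetic, but writing down likelihoods and priors that honestly describe the experiment; that is what most of the course is about.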
Within this framework,
iv) Data are constants
Even though they are generated randomly by the Universe, data that we have already collected are fixed numbers.
Much of our job boils down to building a model that predicts, probabilistically, what data we might have measured instead.
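As a sketch of what "predicting, probabilistically, what data we might have measured instead" can look like, suppose (hypothetically) our model says a detector records photon counts drawn from a Poisson distribution with some mean rate; the observed count is a fixed number, but the model lets us generate the ensemble of data sets we might have seen (the rate, seed, and observed value below are all invented):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = 5.0       # model parameter: mean expected counts (invented)
observed = 7   # the data we actually collected: a fixed number

# The model predicts the distribution of data we *might* have measured:
mock_data = rng.poisson(mu, size=10000)

# e.g., the model-predicted probability of measuring 7 or more counts
frac = np.mean(mock_data >= observed)
print(f"P(counts >= {observed}) is roughly {frac:.3f}")
```

The observed value never changes here; all the randomness lives in the model's predictions, which is exactly the sense in which data are constants.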
v) Things we don't know with perfect precision can be mathematically described as "random"
That is, we use probabilities to model things that are uncertain, even if they are not "truly" random.
As the quarter unfolds, we will cover