A Bayesian modeling cheat sheet

Please pick up Statistical Rethinking by Richard McElreath for a full introduction to this topic.

Bayesian modeling consists of specifying a model in terms of prior distributions and a likelihood in order to calculate (or approximate) the posterior distribution of the model's parameters given some set of data.

Does that sound like gibberish? Here's an example, straight out of Statistical Rethinking:

What percent of the earth's surface is water?

Say you want to know how much of the earth's surface is water. You have a globe handy, and you decide to toss it up and down (chaotically) and keep track of how often your right index finger lands on water. In this way, you're getting a random sample of whether or not specific points on the globe are surface water.

Say you get: land water water water land water land water water

Given a hypothesis about the percent of the earth's surface that is water, and the appropriate likelihood function for your data generation process, you can calculate the likelihood of obtaining the data you did under the hypothesis you have.

The appropriate likelihood for our Bernoulli process is the binomial, calculated here the long way as well as with the indispensable scipy library. The likelihood calculations assume the world is half land, half water, which seems like a safe initial guess.

First, here are the libraries we'll use:

import emcee
import corner
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, basinhopping
from scipy import stats
from scipy.special import factorial

plt.rcParams['figure.figsize'] = (14, 3)

And here's our first likelihood:

# Binomial likelihood, computed the long way. The name mirrors R's dbinom,
# which Statistical Rethinking uses.
def dbinom(w, n, p):
    f = factorial
    return (f(n) / (f(w)*f(n-w))) \
            * p**w * (1 - p)**(n-w)

# Probability of six times on water out of nine tosses, assuming 50% 
# of the surface is water.
dbinom(6, 9, .5), stats.binom.pmf(6, 9, .5)
(0.1640625, 0.16406250000000006)

The posterior

The posterior tells us how plausible each of a range of possible parameter values is. To calculate the posterior given a prior and a likelihood, we have a few options, which basically boil down to different ways of evaluating or approximating the product of the prior and the likelihood. Here's the brute force "grid approximation" way: we just calculate the likelihood of seeing our data for each possible value of p.

You can see the likelihood peaks at around .7, which makes sense given our data of 6/9 observations being water. We would conclude that a world that is 70% water is most consistent with our data.

In:
# Grid approximation

p = np.linspace(0, 1, 50)
plt.plot(p, dbinom(6, 9, p))
plt.title('Computing posterior distribution by grid approximation')
plt.show()
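
With a flat prior, the curve above is proportional to the posterior. For completeness, here is a minimal sketch of the full grid recipe with an explicit prior and a normalization step (the variable names are my own, not the book's code):

# Full grid approximation: prior * likelihood, normalized to sum to 1
p_grid = np.linspace(0, 1, 50)
prior = np.ones_like(p_grid)                # flat prior over p
likelihood = stats.binom.pmf(6, 9, p_grid)  # 6 waters out of 9 tosses
posterior = prior * likelihood
posterior = posterior / posterior.sum()

plt.plot(p_grid, posterior)
plt.title('Normalized posterior over $p$ via grid approximation')
plt.show()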

To do more complicated models, we'll have to encode our uncertainty about the true proportion of water on earth as a distribution all its own. Above we're assuming each p between 0 and 1 is equally likely. As code, that might look like:

def log_prior(p, w):
    # Flat prior: every p between 0 and 1 is equally plausible
    if 0 < p < 1: return 0.0
    return -np.inf

We return 0.0 (the log of 1) for any allowed value of p. Any other constant would work just as well; it doesn't really matter, because we only need a function which is proportional to the true probability.

Note that we're using the logarithm of the prior and the likelihood here. Likelihoods multiply, so the likelihood of seeing two independent events each with likelihood .5 is .25. By taking logs, we can add instead of multiply, which is easier on the computer and avoids numerical underflow when there are many observations.
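
As a quick numerical check of that claim:

# Likelihoods multiply: 0.5 * 0.5 = 0.25. Their logs simply add.
print(np.log(0.5) + np.log(0.5))  # approximately -1.386
print(np.log(0.25))               # approximately -1.386, the same value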

In the code below, we have separate functions representing our prior (our existing beliefs about what portion of the earth is water) and our likelihood (binomial, as above). We combine them as described in the last paragraph: by adding the log prior and the log likelihood, we obtain the (unnormalized) log posterior.

This is a Bayesian model! The only thing left to do is optimize it to discover which value of the unknown parameter p is most probable given our data and our prior.

In:
## Model:
# w ~ binomial(9, p)
# p ~ uniform(0, 1)
# data: counts of water out of 9 tosses, e.g. [6, 7, 5, 6, 5, ...]

def log_prior(p, w):
    # Flat prior: every p between 0 and 1 is equally plausible
    if 0 < p < 1: return 0.0
    return -np.inf

def log_likelihood(p, w):
    if (0 < p < 1): 
        return sum(stats.binom.logpmf(w, 9, p))
    return -np.inf
    

def log_probability(p, w):
    lp = log_prior(p, w)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(p, w)

nll = lambda *args: -log_likelihood(*args)
# Simulate 700 observations: counts of water out of 9 tosses, centered near 7
w = np.random.normal(7, 2, 700).astype(int).clip(0, 9)

# Note we minimize the negative log likelihood here
soln = minimize(nll, .5, args=(w,), method='Nelder-Mead')
print('p: {0:.3f}'.format(soln.x[0]))
p: 0.707

Above, we obtain a point estimate for the most likely value of p, the proportion of water on earth. As we saw above visually, it's very close to 70%.

Sampling from the posterior

The true power of Bayesian modeling is in our ability to quantify our uncertainty. Sure, the most likely proportion of water is 70%, but how unlikely is 80%? Or 50%? In other words, how sure are we?

To explore this question, we take advantage of a class of algorithms designed to sample from the posterior distribution of our model. Once we have a large enough sample, we can rely upon the law of large numbers to answer questions about our posterior, such as generating credible intervals for individual parameters like p.

Here, we use the fantastic emcee library, which provides a fast sampling algorithm in pure python.

Note: Statistical Rethinking relies on R bindings to Stan, which can be a pain to install and familiarize yourself with, while the popular pymc3 relies on somewhat slow symbolic computation libraries, so I recommend emcee. It may not scale as well as pymc3 to more complicated models.

As you'll see if you're following along, this sampling can take some time, particularly if you're analyzing a lot of data. Fortunately emcee ships with a progress bar.

In:
pos = soln.x + 1e-4 * np.random.randn(32, 1)
nwalkers, ndim = pos.shape
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability, args=(w,))
sampler.run_mcmc(pos, 5000, progress=True);

flat_samples = sampler.get_chain(discard=100, thin=15, flat=True)

plt.hist(flat_samples, bins=100)
plt.title("Posterior distribution of $p$")
plt.show()
100%|██████████| 5000/5000 [00:52<00:00, 95.23it/s]

Using the samples generated by emcee, we generate a histogram of values, showing the distribution of our posterior. We could answer our earlier question of how unlikely a proportion of 80% or more might be by simply calculating the proportion of samples that exceed .8:

(flat_samples > .8).mean()
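
In the same spirit, a credible interval for p is just a pair of percentiles of the posterior samples. Here's a quick sketch using an 89% interval (the level is an arbitrary choice; it happens to be the one Statistical Rethinking likes):

# Central 89% credible interval for p from the posterior samples
lo, hi = np.percentile(flat_samples, [5.5, 94.5])
print('89% credible interval for p: ({:.3f}, {:.3f})'.format(lo, hi))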

Regression

Above, we explored a toy example of a Bayesian model with very simple data. A more common statistical need is to regress one variable onto another. This is extremely intuitive as a Bayesian model. We write

y ~ Normal(mu, sigma)
mu = m * x + b
m ~ Normal(0, 20)
b ~ Normal(0, 10)
sigma ~ Uniform(0, 50)

to represent our regression. Here the tildes mean that the variable on the left is distributed according to the distribution on the right, while the = for mu marks a purely deterministic relationship. Note that some distributions can rely on other parameters!

The above model says that a variable y will be distributed normally with mean mu and standard deviation sigma, where sigma is itself uniformly distributed between 0 and 50. mu is a deterministic combination of the slope m and intercept b, each distributed normally around 0 with a wide spread. Together, these distributions fully specify a univariate regression.
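
One way to convince yourself that these distributions fully specify the model is to run it forward as a simulator: draw the parameters from their priors, then draw y given x. This is just an illustrative sketch of that generative reading of the model, not part of the fitting code below:

# Simulate one fake dataset directly from the priors
x_sim = np.linspace(0, 100, 50)
m_sim = np.random.normal(0, 20)              # m ~ Normal(0, 20)
b_sim = np.random.normal(0, 10)              # b ~ Normal(0, 10)
sigma_sim = np.random.uniform(0, 50)         # sigma ~ Uniform(0, 50)
mu_sim = m_sim * x_sim + b_sim               # mu = m * x + b
y_sim = np.random.normal(mu_sim, sigma_sim)  # y ~ Normal(mu, sigma)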

In:
# Regression time!

# y ~ Normal(mu, sigma)
# mu = m * x + b
# m ~ Normal(0, 20)
# b ~ Normal(0, 10)
# sigma ~ Uniform(0, 50)

# Generate some fake data
x = np.linspace(0, 100, 50)
m = 1.05
b = 3
y = m*x + b + np.random.normal(0,15,len(x))

def log_likelihood(theta, x, y):
    # Unpack array-like
    m, b, s = theta

    # Priors
    ll_m = stats.norm.logpdf(m, 0, 20)
    ll_b = stats.norm.logpdf(b, 0, 10)

    # Flat prior on s: a constant (0.0 in log space) inside the bounds
    if not 0 < s < 50: return -np.inf
    ll_s = 0.0

    # Likelihood of the observed y values
    mu = m * x + b
    ll_y = np.sum(stats.norm.logpdf(y, mu, s))

    # Joint (unnormalized) log posterior: priors plus likelihood
    return ll_m + ll_b + ll_s + ll_y

nll = lambda *args: -log_likelihood(*args)

soln = minimize(nll, (1, 1, 1), args=(x, y), method='Nelder-Mead')
print("m: {:.2f}, b: {:.2f}, s: {:.2f}".format(*soln.x))
m: 1.06, b: 2.03, s: 15.20

Note that above we've condensed everything into a single function, but it still contains the same logic: we calculate the log densities of each of our priors and the log likelihood of the data we've observed, and add them together to get a single joint log probability. When we optimize over this, we've performed a regression!
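
Because the priors here are wide, the result should land very close to an ordinary least-squares fit. As a quick sanity check of my own (not part of the original analysis), we can compare against numpy's polynomial fit:

# With weak priors, the optimized estimates should be close to least squares
m_ols, b_ols = np.polyfit(x, y, 1)
print('OLS -> m: {:.2f}, b: {:.2f}'.format(m_ols, b_ols))
print('MAP -> m: {:.2f}, b: {:.2f}'.format(soln.x[0], soln.x[1]))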

In:
plt.plot(x, y, lw=0, marker='.')

mhat, bhat, shat = soln.x
fit = mhat*x + bhat
plt.plot(x, fit)
plt.fill_between(x, fit - 2*shat, fit + 2*shat, alpha=.1)
plt.title("Linear regression fit with a ±2 sigma band")
plt.show()

Above, because we've got an estimate for sigma, we've shaded a rough interval for where we expect data to fall (the fitted line plus or minus two sigma). However, we don't have any insight into our uncertainty around the individual parameters. For that matter, we don't have any insight into how our parameters covary. What is the relationship between m and b, for example?

A corner plot, generated with the handy corner library, gives us an easy way to visualize answers to these questions. For example, we can see that m and b are negatively correlated.

In:
pos = soln.x + 1e-4 * np.random.randn(10, 3)
nwalkers, ndim = pos.shape
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_likelihood, args=(x,y))
sampler.run_mcmc(pos, 5000, progress=True);

flat_samples = sampler.get_chain(discard=100, thin=15, flat=True)
100%|██████████| 5000/5000 [00:21<00:00, 231.29it/s]
In:
labels = ["m", "b", "sigma"]

fig = corner.corner(flat_samples, labels=labels)
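
That negative correlation is also easy to read straight off the samples (a quick check of my own; the first two columns of flat_samples are m and b, matching the labels above):

# Correlation between the posterior samples of m and b
print(np.corrcoef(flat_samples[:, 0], flat_samples[:, 1]))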

That's it for Part 1! In a subsequent post, we'll dig into meatier topics like multilevel modeling that will enable more practical data analysis.