In this post, I introduce the emerging area of **probabilistic programming**, showing how probabilistic programs will hopefully make it easier to perform Bayesian-style machine learning, among other applications. Probabilistic programming is an exciting, and growing, area of research, with fantastic people in both AI/ML and PL working together and making big strides. PL methods — including formal semantics, optimization techniques, and forms of static analysis — have proven very useful in advancing this area forward.

## Bayesian-style Machine Learning

Let’s start by introducing Bayesian-style machine learning. It works by reducing uncertainty about a particular hypothesis.

For example, a spam filter often works in two steps. First, it extracts various features X1, X2, … of a message (total length, the frequency of particular words or symbols, whether the message is HTML, etc). Then, it applies a function F(X1, X2, …), rejecting the message if F returns true.

How can we learn, or *infer*, what F should be? We can imagine starting with a joint *probability distribution *over feature values X1, X2, … which expresses our uncertain hypothesis about the importance of each feature in determining whether a message is spam. Then, we can provide examples of feature vectors for messages that we know *are*, and *are not*, spam, and use Bayes’ rule to revise this distribution according to the added information. With enough examples, we can hopefully reduce the uncertainty about the feature values to the point that we define some F that can distinguish spam from non-spam.

While the concept for such a filter is relatively straightforward, it’s not so easy to code up. The hope is that this situation will change thanks to the development of probabilistic programming languages. These languages treat *normal-looking programs as if they were distributions*, and in the process simplify the expression of Bayesian-style learning tasks.

In the next section, I’ll make the previous sentence more precise, and then in the following section we’ll look at how probabilistic programming can do machine learning.

## Probabilistic programs *are* probability distributions

If you’ve ever written a function that calls a random number generator, you’ve written a probabilistic program. When run by a normal programming language, such a function can be viewed as sampling from a probability distribution. A *probabilistic* programming language, however, will interpret the function *as the distribution itself*, and will give the programmer various tools for asking questions about the distribution. Let’s make this idea precise by first representing a distribution as a data structure, and then showing a correspondence between that data structure and the probabilistic program.

### Distributions, directly

A way we might normally think to code up a distribution is as a function that takes a value and returns its probability; those values whose probability is non-zero make up the sample space of the distribution. For example, the distribution for a fair coin could be represented as (in Ruby syntax):

def faircoindist(x) = if x == “heads” then 0.5 elsif x == “tails” then 0.5 else 0.0 end end

Here, the sample space of the distribution is **{ “heads”, “tails” }**. Passing either of these values to **faircoindist** will return their probability, which in both cases is 0.5 because the coin is unbiased. We can generalize this code to also support biased coins, where the probability of heads can vary, as follows

class CoinDist def initialize(p) @p = p end def dist(x) if x == "heads" then @p elsif x == "tails" then 1.0-@p else 0.0 end end end

A distribution for a fair coin would be created using **CoinDist.new(0.5)**, but a coin that nearly always came up heads would be **CoinDist.new(0.95)**.

### Distributions as probabilistic programs

A different way to represent a distribution is as a probabilistic, or *sampling*, function.

class CoinSample def initialize(p) @p = p end def run s = rand if s < @p then "heads" else "tails" end end end p = CoinSample.new(0.4) puts p.run # prints either heads or tails puts p.run # prints either heads or tails puts p.run # prints either heads or tails

Here, we can view **p** as the sampling function for a biased coin: each time I call **p.run** ^{1} I have a 4/10 chance of getting “heads”.

The key idea in probabilistic programming is to *view the sampling function CoinSample.run as being synonymous with the distribution*. Why does this make sense? Because, intuitively, if we ran the function many times, we could recover the distribution function. The following code does exactly this:

def makecoindist(s,n) c = 0.0 for i in 1..n x = s.run if x == "heads" then c = c + 1.0 end end p = c / n CoinDist.new(p) end

This function takes a parameter **s**, which is the sampling function, and a number **n** to indicate how many times to call the sampling function. It then counts the number of times that “heads” comes up, and from that information estimates the probability of getting heads. It uses that probability to initialize a distribution function.

d = makecoindist(p,1000) puts d.dist("heads") # prints something close to 0.4, like 0.38

The **makecoindist** function is only producing an approximation of the actual distribution, but we approach the real distribution as we increase the value of **n** toward infinity. Mathematically, we *define* the distribution induced by a probabilistic function in this way, even if we end up using sampling to compute an approximation of that definition.

However, in some cases we can compute the distribution exactly, using a form of static analysis. A *value flow analysis* is a kind of static analysis that attempts to determine all possible values a function could return, for all possible runs of the function. A value flow analysis of the **CoinSample.run** function would say that it will always produce either “heads” or “tails”. We can extend value flow analysis to also compute frequency information by understanding the semantics of the underlying probabilistic operators, like **rand**. For our running **CoinSample.run** example, such an analysis can easily compute the precise distribution.

## Probabilistic Programming for Machine Learning

In a normal programming language, a probabilistic program is like a sampling function for a distribution, and all you can do with it is run it, i.e., to get a sample. As we have just shown, in a probabilistic programming language *the meaning of the program **is** the distribution it samples from*. So, while you can still generate samples, if you like, you can also do more interesting things. In particular you can perform *inference *on unknown variables, which is tantamount to *machine learning*.

### Machine learning by Bayesian Inference

Rather than go back to the spam filter example, we’ll use a simpler example: learning the bias of a coin. Suppose we knew that we had a biased coin, but didn’t know how biased it was. In a sense, we can suppose we have our hands on a **CoinSample** object, but don’t know the bias **b** it was created with. This is like knowing the message features used by a spam filter, but not knowing which are most important. We express the situation as follows.

b = rand p = CoinSample.new(b)

This program expresses the coin’s bias **b** as just a random floating point number chosen uniformly, indicating we are uncertain about what **b** might be. In general, probabilistic languages often support sampling from other sorts of distributions, like Gaussian distributions or Beta distributions, and not just the uniform distribution [0,1].

The meaning of this program is the joint distribution of **b** and **p**. We can reduce our uncertainty about **b** by conditioning the distribution based on observations of the actual coin flip, using Bayesian inference. Here is how we might express this in a probabilistic programming language (extending Ruby with some new notation, *observe*):

b = rand p = CoinSample.new(b) r1 = p.run # run it once r2 = p.run # run it again r3 = p.run # run it again observe(r1 == “heads”) observe(r2 == “heads”) observe(r3 == “heads”)

After constructing the sampling function, we observe that each of three runs produce “heads”. These observations serve to condition the distribution of **b**, reducing our uncertainty about it. Since we see “heads” three times, we have some reason to believe the coin is biased. In fact, the expected value of **b** is about 0.8. More runs, some of which might produce “tails”, will push our estimate the other way. A spam filter would work the same way, where instead of observing coin flips, we observe that certain spam messages are, or are not, spam, and as such revise our uncertainty about the importance of the features of those messages.

Making the analogy to static analysis again, the *observe*() notation is similar to the notation *assume*(). It allows us to ignore certain possibilities “after the fact” rather than have to anticipate them when defining the distribution. In short, the semantics of *observe* is like running the program normally, and if you reach *observe*(**p**) and **p** is false, you re-run the program to try again until all observations are valid. (Roughly speaking, this is rejection sampling.)

### Applications

There are many possible applications of probabilistic programming. Primarily, probabilistic programming makes applications of Bayesian machine learning more convenient, especially in the hands of non-experts. As one final example in this space, consider Borgström et al’s encoding (first presented in their ESOP’11 paper) of a Bayesian formulation of TrueSkill in on-line gaming. Each contest between participants shifts our estimate of each one’s “true skill.” Here is a snippet of an encoding with two players (among many others, not shown):

skillA = Gaussian.new(100,10).run skillB = Gaussian.new(100,10).run perfA = Gaussian.new(skillA,15).run perfB = Gaussian.new(skillB,15).run observe(perfA > perfB) ## A wins

Each player’s true skill is represented as drawn from a Gaussian distribution (**skillA** and **skillB**). A more skilled player should beat a less skilled player. Simplistically, one player will defeat another when the former’s skill is greater than the latter’s. However, sometimes a strong player will have a bad day or a weak player will get lucky, and so we encode a particular performance (**perfA** and **perfB**) as drawn from a Gaussian with the player’s skill as the mean. In the program above, we observe that player A beats player B, which should serve to improve A’s skill rating. If player C later beat A, then the transitive update to the model will suggest that C is a stronger player than B, too.

Many other encodings have been done too, e.g., to perform computer vision, distinguish seismic anomalies (earthquakes as opposed to nuclear tests), and to determine more reliable medical tests.

Probabilistic programming has also been used to model the change in an adversary’s knowledge as a result of an interaction between that adversary and a system hiding a secret (e.g., a login system which requires a password). And it has been used as a kind of verification technique for programs that operate on uncertain data. Basically, whenever we are reasoning about noisy data, or data about which we are uncertain, probabilistic programming can play a role.

### Algorithms

We have hinted at two ways to compute distributions from probabilistic programs (a process termed *inference*): sampling, and using a kind of static analysis. Implementations of probabilistic programming languages may use either, or both. For example, Church is a probabilistic programming language based on Scheme that uses Metropolis-Hastings sampling to compute distributions. DrBayes (which is the work of Toronto and McCarthy) is also based on Scheme and uses a combination of sampling and abstract interpretation to compute distributions. My own prior work with Piotr Mardziel and Stephen Magill relies entirely on abstract interpretation, augmenting the standard polyhedral and powerset domains with information for tracking probabilities. In general, sampling is very expressive but has poor performance and few guarantees of precision, while analysis is less expressive but may perform better and provide stronger guarantees.

Languages also work by compiling programs to other representations for which good algorithms already exist. For example, Fun is a probabilistic language based on F# and it works by compiling to a program in Infer.NET, which computes distributions by converting programs to what are called factor graphs and using inference algorithms developed for them. A system recently developed by Adrian Sampson, Todd Mitkowicz (linked further below), and others, performs inference by first compiling to Bayesian networks.

Many other interesting algorithms and techniques have been, or are being, explored.

## Beyond the basics

At this point there are also many probabilistic programming languages, at varying levels of maturity. Rather than list them all here, I’ll refer you to probabilistic-programming.org, which collects most of them, and is a good repository of information in general. One interesting system not listed is Tabular, whose input language is based on spreadsheets. This language makes machine learning available even to non-programmers; there is now an Excel extension to support it.

Despite tremendous progress, there is still a ways to go. Most notably, systems must improve their expressiveness (what sorts of probabilistic models can be expressed) and performance (e.g., how many random variables can be supported efficiently). Nevertheless, the state of the art was deemed promising enough a year or so back that DARPA began a research program called PPAML (“Probabilistic Programming for Machine Learning”) aiming to bring the vision to democratize machine learning closer to reality.

We can all hope, looking ahead, that this vision will be successful in the same way that high-level programming languages have been successful — programmers used to be very concerned with low-level, non-portable details that compilers now routinely handle, and we can hope the many details that concern machine learning users today will soon be handled automatically too.

Fulfilling this vision opens up lots of exciting research challenges. The programming languages community has much to offer here. As two recent examples, papers presented at PLDI’14 — Expressing and Verifying Probabilistic Assertions, and Slicing Probabilistic Programs — show how PL techniques like symbolic execution and program slicing can be used to support efficient inference. Additional interesting connections remain to be uncovered and exploited.

### Learning more

A fantastic overview of probabilistic programming appeared earlier this year, titled Probabilistic Programming, by Gordon, Henzinger, Nori, and Rajamani. Since you made it this far in my post, I am confident you’d enjoy reading this paper. It covers the similar ground, but goes into greater depth in every way. I’d also recommend the Probabilistic Models of Cognition Tutorial, by Goodman and Tenenbaum, which uses Church to walk you through the development of probabilistic models, interactively. Finally, check out this short interview with Daniel Roy, one of the leaders of the field. Probabilistic programming is a fascinating area with great potential for impact; look into it!

Thanks to Neil Toronto who provided feedback and organizational suggestions on an early draft of this post (and still needs to install his UMD page!).

Notes:

- In Ruby, a no-argument function
**p.run**can be called with syntax**p.run**, as opposed to**p.run()**. ↩

Thanks, Mike, for the great intro and round-up. I’d add that there are slides available here on some cutting-edge progress in both approximate and probabilistic programming from the APPROX 2014 workshop (co-located with PLDI 2014 this year) – http://approx2014.cs.umass.edu/

I am interested in doing a postdoc related to probabilistic programming (don’t have experience in the field so far, but do have experience in data analysis and applied machine learning). Are there any suggestions you can please provide regarding how to find such a position? Thanks

I don’t have any particular insight, other than to find people working in the area to see if they need to hire someone. You could look at the bibliography of the Gordon paper cited in my post. You could also look at the attendee list of the PPAML DARPA program’s proposer’s day, and look for them on-line, http://www.darpa.mil/WorkArea/DownloadAsset.aspx?id=2147486442

Pingback: The State of Probabilistic Programming « Some Thoughts on a Mysterious Universe

Pingback: BibSonomy :: url :: What is probabilistic programming? - The PL Enthusiast

Pingback: Software Security Ideas Ahead of Their Time - The PL EnthusiastThe Programming Languages Enthusiast

Pingback: Interview with Matt Might, Part 2 - The PL Enthusiast

Pingback: Confluences in Programming Languages Research

Thank you for nice explanation. It would have good to include how it different from other language and Statistical packages (R / SAS)