How do we know what we know? That question is the subject of study for the field of epistemology. Per Wikipedia, “Epistemology studies the nature of knowledge, justification, and the rationality of belief.” Science is one powerful means to knowledge. Per the linked Wikipedia article, “Science is viewed as a refined, formalized, systematic, or institutionalized form of the pursuit and acquisition of empirical knowledge.”
Most people are familiar with the basic scientific method: Pose a hypothesis about the world and then carry out an experiment whose empirical results can either support or falsify the hypothesis (and, inevitably, suggest additional hypotheses). In PL research we frequently rely on empirical evidence. Compiler optimizations, static and dynamic analyses, program synthesizers, testing tools, memory management algorithms, new language features, and other research developments each depend on some empirical evidence to demonstrate their effectiveness.
A key question for any experiment is: What is the standard of empirical evidence needed to adequately support the hypothesis?
This post is a summary of a paper co-authored by George T. Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and me that will appear at the Conference on Computer and Communications Security this fall, titled “Evaluating Fuzz Testing.”[ref]I also presented our work at the ISSISP’18 summer school, though the work has matured a bit since then.[/ref] It starts to answer the above question for research on fuzz testing (or simply, fuzzing), a process whose goal is to discover inputs that cause a program to crash. We studied the empirical evaluations carried out by 32 research papers on fuzz testing. Looking critically at the evidence gathered by these papers, we find that no paper adheres to a sufficiently high standard of evidence to justify general claims of effectiveness (though some papers get close). We carry out our own experiments to illustrate why failure to meet the standard can produce misleading or incorrect conclusions.
Why have researchers systematically missed the mark here? I think the answer owes in part to the lack of an explicit standard of evidence. Our paper can be a starting point for such a standard for fuzz testing. More generally, several colleagues and I have been working on a checklist for empirical evaluations in PL/SE-style research. We welcome your feedback and participation!
What is Fuzz Testing?
Fuzz testing is a process of repeatedly generating random inputs in an attempt to find one or more that cause a target program to fail, usually via a crash. Crashes due to segmentation faults and illegal instructions are signs of memory corruption, which usually arises from bugs that are potential vectors of attack. As such, finding and fixing crashing bugs is important for security.
Modern fuzz testing tools (“fuzzers”) generate new inputs by repeatedly and randomly mutating prior inputs, starting with one or more initial seeds. Which inputs to retain, and which mutations to apply, may depend on measurements gathered during the target program’s execution. A common measurement is the set of basic blocks that were executed. Inputs that execute new basic blocks are “interesting,” and so used as the start of future mutations.
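To make this concrete, here is a toy coverage-guided mutation loop in Python. Everything here is invented for illustration: the "target" simulates block coverage by counting how many bytes of a magic value match, whereas real fuzzers instrument compiled binaries and use far richer mutation and scheduling strategies.

```python
import random

MAGIC = b"FUZZ"  # hypothetical crash trigger; real bugs are not this tidy

def target(data: bytes):
    """Toy target: report which "basic blocks" ran, and whether it crashed.
    Each matched byte of MAGIC stands in for a newly covered block."""
    blocks = {0}
    for i, b in enumerate(MAGIC):
        if i < len(data) and data[i] == b:
            blocks.add(i + 1)
        else:
            return blocks, False
    return blocks, True  # full match: the "crash"

def fuzz(seed: bytes, iterations: int = 200_000,
         rng: random.Random = random.Random(0)):
    corpus = [seed]
    seen, _ = target(seed)
    for _ in range(iterations):
        # Mutate the most recently "interesting" input (a crude stand-in
        # for AFL's queue scheduling), flipping one random byte.
        data = bytearray(corpus[-1])
        data[rng.randrange(len(data))] = rng.randrange(256)
        data = bytes(data)
        blocks, crashed = target(data)
        if crashed:
            return data
        if not blocks <= seen:  # new coverage: keep as a future seed
            seen |= blocks
            corpus.append(data)
    return None

crash = fuzz(b"\x00\x00\x00\x00")
```

The coverage feedback is what makes this tractable: without the `blocks <= seen` check, the loop would be guessing a four-byte value blindly (roughly one chance in four billion per attempt); with it, the magic value is discovered one byte at a time.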
Fuzz testing is effective, and its use in industry is popular. Two well-known fuzzers used at Google are AFL and libFuzzer. Fuzz testing is also a popular topic for research. Several research tools exist, such as (in no particular order) AFLFast, T-Fuzz, VUzzer, and Driller. Over the last several years, fuzzing papers have regularly appeared in top security conferences.
Evaluating Fuzz Testing Evaluations
A good fuzz tester will find bugs in many programs, quickly. A research paper on a new fuzz testing algorithm (call it A) will attempt to show this by carrying out an empirical evaluation with the following elements:
- a compelling baseline fuzzer B to compare against;
- a sample of target programs—the benchmark suite;
- a performance metric to measure when A and B are run on the benchmark suite; ideally, this is the number of (possibly exploitable) bugs identified by crashing inputs;
- a meaningful set of configuration parameters, e.g., the seed file (or files) to start fuzzing with, and the timeout (i.e., the duration) of a fuzzing run.
An evaluation should also account for the fundamentally random nature of input generation, which means different runs could produce different results. As such, an evaluation should measure sufficiently many trials to sample the overall distribution that represents the fuzzer’s performance, using a statistical test to compare the performance of A and B.
We looked at 32 research papers written during the last five years, located by perusing top-conference proceedings and other quality venues. We studied their experimental evaluations and found that no fuzz testing evaluation carries out all of the above steps properly (though some get close).
This is bad news in theory, and after carrying out more than 50,000 CPU hours of experiments, we believe it is bad news in practice, too. Using AFLFast (as A) and AFL (as baseline B), we carried out a variety of tests of their performance. We chose AFLFast as it was a recent advance over the state of the art; it is based on AFL which is a common baseline in fuzzing research (14 of 32 papers we studied used it); AFLFast’s code was publicly available; and we were confident in our ability to rerun the experiments described by the authors in their own evaluation and expand these experiments by varying parameters that the original experimenters did not. We found that experiments that deviate from the above recipe could easily lead one to draw incorrect conclusions.
Multiple Trials and Statistical Significance
Because fuzz testing is a random process, we can view each fuzzing run as a sample from a probability distribution. Simply put, each fuzzing run is like rolling a (very complicated, many-sided, possibly biased) die. Just as we cannot roll a die once to understand its tendencies (is it biased toward particular numbers/sides, or is it fair?), we cannot run a fuzzer just once to assess its effectiveness at finding bugs. Running a fuzzer one time might involve a series of random choices that uncover 5 crashing inputs. Running it another time will involve different random choices that could uncover 10 crashes, or none.
And yet, what we found was that nearly 2/3 of the 32 papers we looked at did not report running fuzzers A and B more than once when assessing whether A was better than B. Of the papers that did report more than one run, the number of runs varied a fair bit. Only three papers described the measured variance of those multiple runs (e.g., as a standard deviation).
Perhaps variance in fuzzing is low, and multiple runs are not necessary? According to our experiments with AFL and AFLFast, that need not be the case. The graph to the left plots the performance of AFL and AFLFast over time while fuzzing the program cxxfilt. The X-axis is seconds (totaling up to 24 hours) and the Y-axis counts the number of “unique crashes” found, which are crash-inducing inputs that AFL deems to be different from one another (more on this later). We ran each fuzzer 30 times. The solid lines represent the median, the inner dashed lines represent 95% confidence intervals, and the outermost dashed lines indicate the maximum and minimum.
Looking at the graph, we can see significant variance in performance. AFLFast finds anywhere from 800 to 1700 unique crashes and AFL finds between 350 and 800. Moreover, we see that the lines overlap: The minimum AFLFast line is below the maximum AFL line at the 12-hour mark. As such, even though the overall trend is in favor of AFLFast, an unlucky pair of measurements (stopping at 12 hours) might have led us to the opposite conclusion.
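Reporting this kind of variance need not be burdensome. Given per-trial crash counts, a few lines compute the median along with min, max, and a bootstrapped 95% confidence interval for the median (the counts below are invented for illustration; our actual per-run data is in the paper):

```python
import random
import statistics

def summarize(counts, boots=1000, rng=random.Random(1)):
    """Median, min/max, and a bootstrap 95% CI for the median."""
    meds = sorted(
        statistics.median(rng.choices(counts, k=len(counts)))
        for _ in range(boots)
    )
    return {
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
        "ci95": (meds[int(boots * 0.025)], meds[int(boots * 0.975)]),
    }

# Hypothetical unique-crash counts from 10 trials of one fuzzer:
trials = [820, 910, 1340, 1500, 990, 1710, 1120, 1480, 860, 1600]
stats = summarize(trials)
```

A single number (or a single run) hides everything the min, max, and interval reveal; plotting these across time, as in the graph, makes overlap between two fuzzers' distributions immediately visible.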
While it is good to look at the medians when assessing the overall trend, a higher median could still be due to luck. A statistical test produces a p-value: the probability that a difference at least as large as the one observed would arise even if there were no real performance difference. A p-value of less than 0.05 is traditionally taken to be low enough to conclude that the trend is not due to chance. Following the advice of Arcuri and Briand, we use the Mann-Whitney U test and find that for cxxfilt, sure enough, the p-value is 10⁻¹⁰, and thus the difference is very unlikely to be due to chance. However, other experiments (on other targets) we carried out yielded p-values greater than 0.05, indicating that an apparent difference in median performance may well be due to chance. No fuzzing paper of the 32 we looked at used a statistical test.
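For concreteness, here is a from-scratch Mann-Whitney U test using the normal approximation (no tie correction). The crash counts are made up; a real evaluation should use a vetted implementation such as scipy.stats.mannwhitneyu rather than this sketch:

```python
import math

def _ranks(values):
    """Ranks (1-based), with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test via the normal approximation."""
    ranks = _ranks(list(xs) + list(ys))
    n1, n2 = len(xs), len(ys)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma if sigma else 0.0
    # Two-sided p-value from the standard normal CDF:
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# Hypothetical per-trial crash counts for fuzzers A and B (5 trials each):
u, p = mann_whitney_u([1500, 1550, 1600, 1650, 1700],
                      [400, 500, 600, 700, 800])
```

Mann-Whitney is a good fit here because it compares whole samples without assuming crash counts are normally distributed, which they generally are not.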
I have necessarily kept this discussion brief; please see the paper for more nuanced discussion and detail. The key takeaway is: We need to be carrying out multiple trials when assessing fuzzer performance, and we should be using statistical tests to compare the performance of a proposed fuzzing algorithm against that of the baseline.
Performance and Seeds
An important starting point for mutation-based fuzzers is the seed input. Conventional wisdom says a fuzzer performs better with small, valid seeds. These drive the program to execute more of its intended logic quickly, rather than cause it to terminate at its parsing/well-formedness tests. For example, if we are fuzzing a PDF viewer, conventional wisdom would indicate valid PDF files be used as seeds. Only 2 of 32 papers we looked at used an empty file as the seed. Of the rest, 10 said they used non-empty seeds, and 9 more specified the use of valid (non-empty) seeds, but none of these 19 papers said anything about the seeds themselves. It is as if they are assuming the particular seed does not matter.
To test this assumption, we ran several experiments with varying seeds. The adjacent three graphs show the performance of AFL and AFLFast on FFmpeg, a program for working with streaming media (e.g., video and audio) files. The three graphs show the performance when setting the seed to an empty file, a sample video present on the FFmpeg web site, and a video constructed by stitching together random frames (respectively). As another point of comparison, we ran AFL in a mode that does not gather code coverage information while running, effectively making it a “black box” fuzzer. We call this version AFLnaive.
The clear takeaway is that the performance of the fuzzer can change dramatically when we change the seed. When using the empty seed, AFL and AFLFast find hundreds of crashes, but when using the sample and random seed, they find far fewer. Conversely, AFLnaive finds no crashes with the empty seed, but thousands of them with the randomly constructed one! The paper has additional data and experiments that show the same thing: Changing the seed can dramatically change performance.
Assuming that an evaluation is meant to show that fuzzer A is superior to fuzzer B in general, our results suggest that it is prudent to consider a variety of seeds when evaluating an algorithm. Papers should be specific about how the seeds were collected and, better still, make the actual seeds used available. We also feel that the empty seed should be considered, despite its use contravening conventional wisdom. In a sense, it is the most general choice, since an empty file can serve as the input of any file-processing program. If a fuzzer does well with the empty seed across a variety of programs, perhaps it will also do well with the empty seed on programs not yet tested. And it takes a significant variable (i.e., which file to use as the seed) out of the vast configuration space.
Bugs vs. Crashes
The ultimate measure of a fuzzer’s success is the set of bugs (i.e., code defects) that it finds. But fuzzing papers often don’t measure “bugs found” directly. Rather, they measure “unique crashes” which we hope will correlate with bug counts, and are based on heuristics. Two common heuristics are AFL’s notion of “coverage profile” (used in 7 papers we looked at) and stack hashes (used in 7 others). A key question is: How good are these heuristics at approximating ground truth? If they dramatically over- or under-approximate the truth, then our empirical measurements could lead us astray.
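To illustrate how such a heuristic works, here is a sketch of stack-hash deduplication: hash the top N frames of a crash's call stack and treat equal hashes as the same bug. The frame names below are invented, and real tools differ in the choice of N and in how frames are normalized:

```python
import hashlib

def stack_hash(frames, depth=3):
    """Bucket a crash by hashing its top `depth` stack frames."""
    return hashlib.sha256("|".join(frames[:depth]).encode()).hexdigest()[:12]

# Two crashes from the same bug, reached via different callers: with
# depth=3 they collapse into one bucket, as desired...
a = stack_hash(["memcpy", "demangle_arg", "demangle", "main"])
b = stack_hash(["memcpy", "demangle_arg", "demangle", "parse_opts"])

# ...but a distinct bug is kept separate only if its top frames differ;
# one bug reached through many paths can also be split into many
# buckets, so both over- and under-counting are possible.
c = stack_hash(["memcpy", "copy_template", "demangle", "main"])
```

The depth parameter is the crux: too shallow and distinct bugs that crash in the same utility function merge; too deep and one bug with many call paths fans out into many "unique" crashes.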
To answer this question, we triaged the crashing inputs that were produced by AFL and AFLFast for cxxfilt. Using AFL’s notion of coverage profile, 57,142 “unique” inputs were generated across the 60 runs. To triage them, we systematically applied the code updates that have occurred since the two-year-old version of cxxfilt that we fuzzed. Many of these updates fix bugs that were found by fuzzing. We know this because, after applying an update and re-running the program, some fuzzer-discovered inputs no longer cause cxxfilt to crash. All the inputs that no longer crash after a particular update correspond to the bug fixed by that update. (We took steps to confirm that each fix is deterministic and that each update corresponds to a single bug, though this is not actually so straightforward.)
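The triage procedure can be sketched as follows: walk the ordered list of upstream updates and attribute each crashing input to the first update after which it no longer crashes. The `still_crashes` predicates below are stand-ins for actually rebuilding each patched version and re-running it on the input:

```python
def triage(inputs, versions):
    """Group crashing inputs by the first patched version that fixes them.

    `versions` is an ordered list of (update_name, still_crashes) pairs,
    where still_crashes(inp) re-runs the patched program on inp and
    reports whether it still crashes. Inputs fixed by the same update
    are attributed to the same underlying bug.
    """
    bugs = {}
    for inp in inputs:
        for name, still_crashes in versions:
            if not still_crashes(inp):
                bugs.setdefault(name, []).append(inp)
                break
        else:
            bugs.setdefault("unfixed", []).append(inp)
    return bugs

# Simulated ground truth (purely illustrative): update 1 fixes crashes
# on short inputs; update 2 fixes crashes on inputs containing "~".
versions = [
    ("update-1", lambda s: len(s) >= 3),
    ("update-2", lambda s: "~" not in s),
]
found = triage(["_Z", "x", "_Z1f~v", "_Z9unfixable"], versions)
```

Because updates are applied cumulatively, the earliest fixing update is a reasonable proxy for the bug's identity; the hard part in practice, as noted above, is verifying that each update really fixes exactly one bug.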
This process identified only 9 distinct bug-fixing updates! Treating crashing inputs as bugs thus massively overestimates the effectiveness of the fuzzer. The figure above organizes these results. Each bar in the graph represents a 24-hour fuzzing trial carried out by either AFL or AFLFast. For each of these, the height of the bar on the y-axis is the total number of “unique” (according to coverage profile) crash-inducing inputs, while the bar is segmented by the bug fix with which our ground-truth analysis grouped each input. Above each bar is the total number of bugs discovered by that run (i.e., the number of segments in the bar). The runs are ordered by the number of unique bugs found in the run.
We can see that there is at best a weak correlation between the number of bugs found during a run and the number of crashing inputs found in a run. A strong correlation would imply a clearer upward trend of crash counts when moving left to right. We also see that some bugs were found repeatedly (red, orange, and green bars), while others were much harder to find. Indeed, no single run found all 9 bugs. So: Not all crashing inputs are “equal.” We can also see that AFLFast generally found many more “unique” crashing inputs than AFL, but the number of bugs found per run is only slightly higher. Mann-Whitney finds that the difference in crashes is statistically significant, with a p-value of 10⁻¹⁰, but the difference in bugs is not (though it is close): the p-value is 0.066. The paper contains much more discussion, and an assessment of stack hashes.
While this is just one experiment, it does call into question the use of bug heuristics when assessing fuzzing performance. A better practice is to evaluate using ground truth, either determined as we have done or by using a benchmark suite for which bugs are known. Examples of the latter are the Cyber Grand Challenge suite or LAVA-M (though these have their own problems). In general, the fuzzing community could really use a solid benchmark suite.
There is a lot more in the paper. There is more experimental data on the questions considered above. There is data and analysis on the questions of the efficacy of stack hashes and how long to run the fuzzer. We also pose open questions, and sketch steps toward their resolution. Most importantly, we look carefully at the choice of target programs used by fuzzing researchers and identify the path forward toward a more robust benchmark suite.
Why did this happen?
One question we don’t answer in our paper is, “Why do so many papers fail to follow the best practices?”
One possible answer is ignorance: The authors just didn’t know better. This may be true in some cases, e.g., that “unique crashes” are a pretty bad stand-in for “bugs”. But in other cases it rings false. It’s “science 101” to know that you need multiple samples of a non-deterministic/randomized process to understand expected behavior.
Another explanation is that there simply aren’t better alternatives than what was done. For example, there does not exist a great benchmark suite for fuzzing — that’s a work in progress.
I also think part of the issue is a combination of uncertain standards and incentives. Papers are accepted by the peer review system. This means that if three (or so) reviewers approve of the paper it’s published. Authors are evaluated by publishing, so they are incentivized to meet the bar set by the reviewers. The more (top) papers you publish, the more “successful” you are. So you are not incentivized to get far over the bar; just the other side of it suffices.
But the bar set by reviewers is not particularly sharp. Reviewers rate a paper based on whatever criteria they have in mind when they review it. If there is no clear, written, organized standard, reviewers may miss or underrate important flaws. For example, in terms of the presentation, reporting on an experiment with a single trial (bad!) is not much different than reporting on multiple trials. The latter may simply have an extra sentence saying, “We ran 11 trials and here report the median measurement.” It might also have graphs with error bars. These things are easy to miss. And yet, they are potentially crucial for believing the evidence supports the hypothesis. A reviewer, when asked explicitly, will agree that multiple trials are important. But that reviewer may forget to confirm that multiple trials were actually carried out, when reviewing the paper.
As such, I hope our paper can serve as a start toward that organized, written standard. The conclusions of the paper give explicit advice that can be adopted now. They also enumerate problems that require attention so as to improve the quality of evidence of future papers.
Guidelines for Empirical Evaluations
Motivated by what I was seeing in fuzzing, last year (while still SIGPLAN Chair) I formed a committee whose goal was to develop guidelines for carrying out empirical evaluations in PL research generally. This committee, comprising Steve Blackburn (Chair), Emery Berger, Matthias Hauswirth, and me, examined a sample of papers published in top PL venues. Based on what we saw, we developed an empirical evaluation checklist that organizes and categorizes the elements of a good empirical evaluation. The hope is to assist researchers and reviewers by making a standard that is clear and actionable, as per the checklist manifesto. The advice we provide in Evaluating Fuzz Testing is an instance of the broader advice encapsulated in this checklist.
Science is one of the best sources of knowledge that we have, but only if we do it right. As scientists, we need to be vigilant to use the most appropriate methods, and to insist that others do the same. Only then can we start to feel confident that what we think we know is the truth.
Foremost, I must thank George Klees, Andrew Ruef, Benji Cooper, and Shiyi Wei for their contributions to the work reported here. I also thank Marcel Böhme and Abhik Roychoudhury for their help with AFLFast. I thank Michelle Mazurek, Cornelius Aschermann, Luis Pina, Jeff Foster, Ian Sweet, the participants of the ISSISP’18 summer school, and Mathias Payer for helpful comments and suggestions on the work, and on drafts of our paper.