How do we know what we know? That question is the subject of study for the field of epistemology. Per Wikipedia, “Epistemology studies the nature of knowledge, justification, and the rationality of belief.” Science is one powerful means to knowledge. Per the linked Wikipedia article, “Science is viewed as a refined, formalized, systematic, or institutionalized form of the pursuit and acquisition of empirical knowledge.”
Most people are familiar with the basic scientific method: Pose a hypothesis about the world and then carry out an experiment whose empirical results can either support or falsify the hypothesis (and, inevitably, suggest additional hypotheses). In PL research we frequently rely on empirical evidence. Compiler optimizations, static and dynamic analyses, program synthesizers, testing tools, memory management algorithms, new language features, and other research developments each depend on some empirical evidence to demonstrate their effectiveness.
A key question for any experiment is: What is the standard of empirical evidence needed to adequately support the hypothesis?
This post is a summary of a paper co-authored by George T. Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and me that will appear in the Conference on Computer and Communications Security, this Fall, titled “Evaluating Fuzz Testing.”[ref]I also presented our work at the ISSISP’18 summer school, though the work has matured a bit since then.[/ref] It starts to answer the above question for research on fuzz testing (or simply, fuzzing), a process whose goal is to discover inputs that cause a program to crash. We studied empirical evaluations carried out by 32 research papers on fuzz testing. Looking critically at the evidence gathered by these papers, we find that no paper adheres to a sufficiently high standard of evidence to justify general claims of effectiveness (though some papers get close). We carry out our own experiments to illustrate why failure to meet the standard can produce misleading or incorrect conclusions.
Why have researchers systematically missed the mark here? I think the answer owes in part to the lack of an explicit standard of evidence. Our paper can be a starting point for such a standard for fuzz testing. More generally, several colleagues and I have been working on a checklist for empirical evaluations in PL/SE-style research. We welcome your feedback and participation!