Buggy software doesn’t work. According to wikipedia:
A software bug is an error … in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways. Most bugs arise from mistakes and errors made by people in either a program’s source code or its design ...
When something is wrong with a program, we rarely hear of it having one bug — we hear of it having many bugs. I’m wondering: Where does one bug end and the next bug begin?
To answer this question, we need an operational definition of a bug, not the indirect notion present in the Wikipedia quote. 1
This post starts to explore such a definition, but I’m not satisfied with it yet — I’m hoping you will provide your thoughts in the comments to move it forward.
My interest in distinguishing one bug from the next started with our build-it, break-it, fix-it (BIBIFI) programming contest. In the contest’s first round, builder teams write what they hope will be high quality software. In the second round, breaker teams look for bugs, including security bugs, in builders’ software. If they find a bug, they submit a test case that demonstrates it: The test case should succeed when run on our canonical “oracle” implementation, but fail on the buggy implementation. Builders lose points for each bug found, and while breakers gain points. 2
There’s one problem here: Failing test cases are evidence of bugs, but not on a one-to-one basis, since several test cases may tickle the same bug. For example, if I wrote a sorting function that fails to sort the last element of a provided list (e.g., perhaps due to an off-by-one error in the iterator), then the test that takes in=[2,1] produces out=[2,1] (when the output should be [1,2]). Another test revealing the same bug takes in=[4,3,2] produces out=[3,4,2] (when the output should be [2,3,4]). Many others tests, too, can serve as evidence of the bug. Therefore we cannot fairly equate a test case with a bug as that would over-penalize builders and over-reward breakers.
To solve this problem, I’d love some clear-cut method to determine that the two test cases are “the same” in that they are evidence of the same bug. Even if I can’t get a computer to do it, clear guidelines for humans would also be great.
But to do this I need to come up with a definition of “bug” that allows me to distinguish one bug from another.
The BIBIFI contest is not the only context in which you’d like to identify equivalent test cases. A fuzz testing tool (like AFL or CSmith) will generate many tests — we might like to weed out duplicate tests to minimize the number of bug reports, to increase the chances they are acted upon. 3 A bug bounty program (like Google’s) must determine when two submitted reports/tests identify the same bug, to reward only one submitter.
Towards a formal definition of bug
Here’s an attempt to state formally what a bug is, for the purposes of providing some insight into the question of how to distinguish bugs.
A bug is a code fragment (or lack thereof) that contributes to one or more failures. By “contributes to”, we mean that the buggy code fragment is executed (or should have been, but was missing) when the failure happens. A failure is an incorrect input/output pair (e.g., generated by a test case).
Example: Recall the sorting function that fails to sort the last element of the provided list because the iterator has an off-by-one error (on the looping condition); this bug induces the failures listed above (in=[2,1], out=[2,1], and in=[4,3,2], out=[3,4,2]).
A failure could be due to multiple bugs. This basically gets back to the main point of the post: An incorrect program might have many bugs, not just one.
Example: Suppose our sorting program has a printing routine that drops the first element of the list, when printed. Combined with the sorting off-by-one bug, we would have the failures in=[3,1,2], out=[3,2] and in=[1,3,2], out=[3,2]. (That is, we sort the first two elements of the list, leaving the third alone, and the print only the last two.)
Sometimes, the presence of one bug affects the failures or another, in which case we say the one bug interacts with the other bug.
Example: The printing bug interacts with the sorting bug, because its presence affects all failures due to (only) the sorting bug. For example, in=[4,3,2], out=[3,4,2] is masked; rather, in=[4,3,2], out=[4,2] is present, which is also wrong, but different. Conversely, the presence of the sorting bug hides some of the failures of the printing bug. For example, in=[4,4], out= is a failure that is still present, but in=[5,4], out= is not.
We can visualize this interaction with a Venn diagram depicting sets of failures owing to a particular bug. When bugs’ sets of failures overlap, then those bugs interact.
On the other hand, a bug does not interact with another bug when the failures due to the one bug are not affected by the other bug, and vice versa. In this case, the bugs’ sets of failures are non-overlapping.
A fix for a bug is a change to the problematic fragment (or is the addition of the missing fragment) that corrects the bug’s failures. However, when there are multiple bugs that interact, it may be hard to see that a fix to a single bug is efficacious, depending on the order that fixes are applied.
Example: If we first fixed the printing bug, then the failure in=[4,4], out= would go away (yielding correct behavior in=[4,4], out=[4,4]), but we would still have the failure in=[4,3,2], out=[3,4,2] until we fixed the sorting bug. On the other hand, if we fixed the sorting bug first, the existence of the printing bug would still mask the improved behavior, i.e., all of our sorting test cases would still fail. That said, the failures would be different, and more illuminating about the reason for the remaining bug.
Criteria for separating bugs
Supposing that a bug is defined by wrong or missing code that results in a set of failures (i.e., test cases involving wrong input/output pairs), we still must determine which code/failures belong to one bug and which belong to another. This is where things get murky.
Fixes reveal bugs?
We might imagine that a fix could serve as evidence that multiple failures, produced by different test cases, are due to the same bug. In particular, if a single fix makes many failing test cases pass, perhaps that means these test cases should be viewed as due to the same bug?
Unfortunately, this argument is circular. It assumes that a fix applies to only one bug; but why should we assume it does? The code change could have fixed multiple bugs at once, meaning the test cases are not, in fact, the same. So this just kicks the can down the road: To determine if a fix covers multiple bugs, we need a definition of a bug that allows us to distinguish one bug from another.
Bug distinctions by code or specification granularity
I believe one important idea for defining bugs is to identify a level of granularity that serves as a frame of reference. Since bugs are in code, the level might be a code unit, like a procedure, or group of units, like a procedure and all of the procedures it calls. Another kind of granularity is not at the level of code units, but rather at the level of fragments of the specification.
Example: I could have implemented sorting as an insertion sort, with a sort routine and a insert subroutine. A bug in sorting might be viewed as a bug of insert, or a bug in the sort function; the choice might matter if there are several fragments of incorrect code in sort, in which case we might lump them all together as a single bug, or not. On the other hand, the specification of sorting might be the output array is a permutation of the input array, and the output array is ordered. It’s possible that a bug could fall into one half of this spec, or both. If the latter, then even if all of the code is in one unit (e.g., a single sort function) we might say there are two bugs. (Note that our sorting bug example — missing the last element — falls into the latter half of the above spec.)
Of course, code and spec units are somewhat arbitrary: We could write a program as one big function, perhaps with substantial redundancy, or we could write it as many smaller functions. Logical specifications could also be tight or redundant. Nevertheless, I think there is some “essence” of functionality that defines the boundaries of a bug, even if these boundaries might be a fuzzy, and depend on the beholder’s perspective.
Bug interactions complicate the situation because the visible outcome is due to multiple bugs, not one. Therefore it’s hard to tease apart the contributions of each bug. On the other hand, we can confidently say we have two different bugs if they do not interact. Even if the failures do interact, it may be that a potential bug has some non-interacting failures — i.e., its set of failures overlaps with those of another bug, but the set does not contain those failures and is not contained by it. As such, if we fix just that bug, at least these failures would go away, while the dependent failures would go away after fixing the other bug(s).
Bugs could be identified by some other element of their character, as opposed to related code or specification units. For example, there are different categories of coding mistakes, such as race conditions, off-by-one errors, div-by-zero, infinite loops, buffer overflows, etc. Even if something like a race condition involves code fragments throughout the program, we probably would consider the race a single bug, and therefore fix all of those fragments at once.
Conclusions and Questions
At this point in my thinking, what constitutes a single bug seems to be a subjective judgment whose basis roughly corresponds to program “features” as defined by distinct portions of code or the specification, or features of the coding error.
If you disagree, or see important subtleties that I have missed, please comment!
If I am right, then one question is how to nevertheless develop automated techniques for bug localization or bug triage given sets of failing and passing tests. For example, we might like to develop techniques that
- Cluster failing tests into groups that correspond to distinct bugs.
- Given a fix and a set of failing tests, determine whether the fix covers a single bug, or multiple bugs (which is another way of clustering the test cases).
What semantic information would be required for such computations? Traces of execution? AST-level diffs between the buggy and fixed versions?
Another question is how empirical analysis might help us refine the definition of a bug. For example, if we analyzed issue reports and resolutions (bugfixes) on Github, what would we find? Would most fixes be localized into a single code unit, supporting the idea that a bug tends to be local to a particular code unit? How often would we see duplicate or overlapping bug reports, suggesting that an initial report really involved multiple bugs, and not a single bug?
This exploration has been very interesting to me — I never would have thought that an idea as pervasive and well-known as “software bug” would be so hard to pin down!
- Andreas Zeller, in his book Why Programs Fail, prefers the term defect to bug since the latter term is sometimes used to refer to erroneous behavior, rather than erroneous code. I stick with the term bug, in this post, and use it to mean the problematic code (only). ↩
- Read more about the contest in this post. ↩
- One way to identify test cases that are evidence of the same bug is to minimize those tests, using something like delta-debugging or C-reduce, and then see if any are structurally identical. I posit that this might work for simple tests but will not work in general. ↩