It was recently reported that the Heartbleed OpenSSL bug has still not been patched on over 300,000 servers. This is down from 600,000 that were discovered two months ago, when the bug was first publicized. Add to this delay the nearly two years that the bug went undetected, and that’s a lot of exposure.
In this blog post, I consider how programming languages and tools might have helped find Heartbleed before it reached production code. Then I explore possible approaches to improving this state of affairs. I welcome your thoughts and ideas!
The Heartbleed bug, when exploited, allows an attacker to illicitly access certain contents in the memory of a buggy server. (How the bug allows this is explained well by Matthew Green.) The leaked memory may contain things like secret keys and previously entered passwords, and so administrators of vulnerable servers were advised to change these keys (which creates its own problems) and users were advised to change their passwords. The memory disclosed may also reveal the locations in memory of control-flow related data, which are useful in remote exploits, and would negate the protection conferred by ASLR. As such, while the Heartbleed bug itself cannot be exploited to cause remote code execution, it could enable the exploitation of another bug in the same server.
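To make the flaw concrete, here is a minimal sketch of the vulnerable pattern in C. This is not the actual OpenSSL code: the function and `send_response` are illustrative stand-ins, but the shape of the bug, an attacker-controlled length driving a `memcpy` with no bounds check, is the same.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical transport call, standing in for OpenSSL's real
 * response-writing machinery. */
void send_response(unsigned char *resp, size_t len);

/* Simplified sketch of the vulnerable pattern, not the actual OpenSSL code. */
void process_heartbeat(unsigned char *msg, size_t msg_len) {
    /* Read the attacker-supplied 16-bit payload length from the packet.
     * Note that msg_len, the number of bytes actually received, is
     * available but never consulted. */
    unsigned int payload_len = ((unsigned int)msg[0] << 8) | msg[1];
    unsigned char *payload = msg + 2;

    unsigned char *response = malloc(payload_len);
    if (response == NULL)
        return;
    /* BUG: if payload_len exceeds msg_len - 2, this copy reads past the
     * end of the received message, echoing back whatever happens to sit
     * in adjacent heap memory. */
    memcpy(response, payload, payload_len);
    send_response(response, payload_len);
    free(response);
}
```

The official fix added precisely the missing check, silently discarding any heartbeat whose claimed payload length does not fit within the record actually received.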
It is particularly disturbing that so many hosts remain unpatched, because it would be hard to tell if Heartbleed is being exploited to acquire sensitive information. Part of the difficulty is that data can be exfiltrated back to the attacking host as part of an encrypted connection, making it invisible to packet-monitoring software looking for evidence of exploitation. System administrators who assume the bug can only be exploited using packets transmitted “in the clear,” rather than as part of an established secure connection, have a false sense of security.
Finding Heartbleed with static analysis
When this bug came out, colleagues asked me whether state-of-the-art static analysis tools (which analyze code for problems before that code is run) should have found Heartbleed sooner. To paraphrase: You guys in programming languages have been working on tools to find bugs like this for a long time. I presume that companies/developers should have been using these tools, and if so, they would have found the bug, right?
Unfortunately, I suspect the answer is ‘no’. In particular, at least four commercial tools that perform static analysis on code (Coverity’s Code Advisor, GrammaTech’s CodeSonar, Klocwork’s Insight, and Veracode’s code scanning services) would not have found the bug. In fact, Coverity didn’t find the bug: OpenSSL has been part of Coverity’s Scan project since 2006. The Scan project uses Coverity tools to analyze open source projects for free, and would have scanned OpenSSL after the Heartbleed bug was introduced in 2012. Only after the bug was publicly announced did Andy Chou of Coverity suggest, in a blog post, a (clever) way that Coverity’s tool could be made to find the bug. GrammaTech and Klocwork quickly showed that their tools could play the same trick (here and here). My student, Andrew Ruef, wrote a Clang static analyzer plugin that can find the bug in the same way. Veracode simply hunts for the offending SSL code, not for the root cause of the bug.
Why couldn’t tools find the bug?
You would hope/expect that these tools would have found the bug: A significant motivator for developing static analysis tools is to find critical bugs that testing, code reviews, and other methods would miss. This bug seems to be a perfect storm of challenging features, and a recent whitepaper by James A. Kupsch and Barton P. Miller covers the details well. The basic explanation is that this bug involves a lot of complicated code and indirection through pointers, and as such confounds the reasoning of most tools. (The fix that Andy Chou suggested was to circumvent these complexities with a heuristic that identifies likely “tainted”, or attacker-originating, input with an idiomatic code pattern.)
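To make Chou’s heuristic concrete: OpenSSL reads the payload length off the wire with its `n2s` (“network to short”) macro, and the observation, as I understand his post, is that this shift-and-OR byte-swapping idiom is a strong signal that a value arrived from the network. A checker can treat any value produced this way as tainted and warn when it flows, unchecked, into a copy length. The sketch below is illustrative (`echo_payload` is a made-up function, though the macro closely follows OpenSSL’s real `n2s`):

```c
#include <string.h>

/* Closely follows OpenSSL's n2s ("network to short") macro: assemble a
 * big-endian 16-bit value from buffer c into s, then advance c past it.
 * The shift-and-OR byte swap is the tell-tale idiom a checker can use
 * to mark s as tainted, i.e., likely attacker-controlled. */
#define n2s(c, s) ((s = (((unsigned int)((c)[0])) << 8) | \
                         ((unsigned int)((c)[1]))), (c) += 2)

/* Hypothetical function illustrating the taint flow the checker hunts for. */
void echo_payload(unsigned char *p, unsigned char *out) {
    unsigned int payload;
    n2s(p, payload);          /* heuristic: payload becomes tainted here */
    memcpy(out, p, payload);  /* tainted value used as an unchecked copy
                                 length: report a potential overrun */
}
```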
A higher-level explanation of why tools couldn’t find the bug is that commercially viable static analysis is unsound, with the goal of being more complete. This is a technical way of saying that commercial tools deliberately choose to miss bugs so as to reduce the rate of false alarms.
A sound analysis is one that, if there exists an execution that manifests a bug at run-time, then the analysis will report the bug. Unfortunately, a sound analysis may also claim to find bugs that do not actually manifest at run-time; these are called false alarms. On the flip side, a complete analysis is one that, if it reports a bug, then that bug will surely manifest at run-time. Ideally, we would have an analysis that is both sound and complete, so that it reports all true bugs, and nothing else. Unfortunately, such an analysis is impossible for most properties of interest, such as whether a buffer is overrun (the root issue of Heartbleed). This impossibility is a consequence of Rice’s theorem, which states that proving nontrivial properties of programs in Turing-complete languages is undecidable. So we will always be stuck dealing with either unsoundness or incompleteness.
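A toy example may help fix these definitions. In the C fragment below, `complex_check` is a hypothetical predicate: suppose it returns nonzero only when its argument is in bounds, but establishes that fact through logic too involved for an analyzer to follow.

```c
char buf[16];

/* Hypothetical: returns nonzero only when 0 <= i && i < 16, but via
 * pointer indirection and arithmetic no analyzer can see through. */
int complex_check(int i);

char read_element(int i) {
    if (complex_check(i))
        return buf[i];  /* safe exactly when complex_check bounds i */
    return 0;
}
```

A sound tool, unable to prove the bound, must warn about `buf[i]` even though no execution overruns it (a false alarm). A complete tool reports only overruns it can actually demonstrate, so it stays quiet here; but it would stay just as quiet if `complex_check` were subtly wrong and the overrun were real (a missed bug).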
Having a high rate of false alarms is a tool killer. A rule of thumb I have heard is that users are willing to tolerate no more than 50% of the alarms being false. Bessey et al. from Coverity wrote a great piece for Communications of the ACM on their experience constructing a commercially viable static analysis tool. They report the toleration limit is 30%, not 50%, and in fact they aim for 20% for their “stable” checkers. They achieve this rate by looking for the simplest bugs, and avoiding sophisticated analyses that, while they find more bugs, yield more false alarms. One amazing (to me) line in the piece is
the commercial Coverity product, despite its improvements, lags behind the research system in some ways because it had to drop checkers or techniques that demand too much sophistication on the part of the user.
In particular, users were labeling true bugs as false alarms, and had a harder time diagnosing false alarms, because the bugs being uncovered were very complicated and the error reports were hard to understand.
Where does this leave research, and practitioners?
How can we balance the need for better security against the expectations of software developers and their employers? Answers to this question are important for getting things done today, and for setting an agenda for impactful research.
Of course we can and should develop better analysis algorithms, i.e., those that are closer to being sound while not having too many false alarms (and not being too slow). An important question is how we evaluate these algorithms: beautiful theorems and simple benchmarks are not enough. We need realistic empirical assessments, both in terms of the power and precision of the tool in the hands of an expert, and in the hands of more ordinary developers. Assessing the latter may require studying real users, something the static analysis community rarely does.
Fully sound tools and processes may be appropriate for code that is security-critical, like OpenSSL, despite the greater sophistication they demand of developers. For example, we could imagine applying full formal verification, the end result of which is a proof that the code will always behave as it should. Such an approach is increasingly viewed as viable. DARPA’s HACMS program, for instance, has been pushing research to improve the scalability, applicability, and usability of formal verification, and a recent research project has looked at formally verifying SSL.
Another approach is simply to use type-safe languages, like Java and Haskell, which rule out pernicious bugs like buffer overruns by construction. Rust is a promising new language: it aims for high performance by giving programmers the kind of low-level control they have in C, while retaining type safety. (I note that several ideas in Rust were inspired by the research language Cyclone, which was co-developed by Trevor Jim, Greg Morrisett, Dan Grossman, Nikhil Swamy, and myself, among others.) I believe Google’s Go is also type-safe. That said, type safety does not rule out other important security-critical bugs, like information disclosures.
Rather than try to (only) reduce the number of false alarms, we might imagine trying to reduce the time to triage an alarm, e.g., by making it easier to understand. Improved user interfaces, e.g., those that support a directed code review in response to an alarm, might provide some help, as one prior study showed.
But such tools require more sophisticated users who may need to understand better how a tool works, as the Coverity paper suggested. One approach is the business model of “static analysis as a service” as provided by Veracode. In this model, a user submits their code for analysis, and engineers very familiar with the analysis triage its warnings, providing a report back to the user. The issue here is that the static analysis company’s engineers don’t understand the code. So the question is, what is more important for triaging: knowing the analyzed codebase, or knowing the analyzer itself?
There must be a role for education, too, with the aim of fostering more sophisticated users. By better understanding what’s going on, such users can use static analysis or a sophisticated language as a tool, rather than as a magic trick. My intuition is that PL and static analysis courses that build this understanding are becoming more common, even at the undergraduate level. This is a good thing.
Stepping back: Heartbleed has raised our awareness that software free of errors is not merely nice to have, or a value-add for a business, but necessary for maintaining a free and open society in which privacy and integrity are respected. We in the PL community must continue to work hard, building on many past successes, to develop better tools, techniques, languages, and methodologies that improve the quality of software. Where do you think we should go next?