How did Heartbleed remain undiscovered, and what should we do about it?

It was recently reported that the Heartbleed OpenSSL bug has still not been patched on over 300,000 servers. This is down from 600,000 that were discovered two months ago, when the bug was first publicized. Add to this delay the nearly two years that the bug went undetected, and that’s a lot of exposure.

In this blog post, I wonder about how programming languages and tools might have played a role in finding Heartbleed before it got into production code. Then I explore possible approaches to improving this state of affairs. I welcome your thoughts and ideas!

Heartbleed

The Heartbleed bug, when exploited, allows an attacker to illicitly access certain contents in the memory of a buggy server. (How the bug allows this is explained well by Matthew Green.) The leaked memory may contain things like secret keys and previously entered passwords, and so administrators of vulnerable servers were advised to change these keys (which creates its own problems) and users were advised to change their passwords. The memory disclosed may also reveal the locations in memory of control-flow related data, which are useful in remote exploits, and would negate the protection conferred by ASLR. As such, while the Heartbleed bug itself cannot be exploited to cause remote code execution, it could enable the exploitation of another bug in the same server.
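
To make the defect concrete, here is a heavily simplified C sketch of the heartbeat-processing logic. It is a paraphrase rather than the actual OpenSSL code, and the function and variable names are mine, but the missing bounds check is the essence of the bug.

    #include <stdlib.h>
    #include <string.h>

    /* Heavily simplified sketch of the Heartbleed defect; a paraphrase, not
       the verbatim OpenSSL code, and the names here are mine.              */
    unsigned char *build_heartbeat_reply(const unsigned char *record,
                                         size_t record_len)
    {
        /* Record layout: 1-byte type, 2-byte payload length, payload, padding. */
        unsigned int payload = (record[1] << 8) | record[2]; /* attacker-chosen */
        const unsigned char *pl = record + 3;                /* payload bytes   */

        (void)record_len;  /* the buggy code never consults the record length */

        unsigned char *reply = malloc(3 + payload + 16);     /* 16 = padding */
        if (reply == NULL)
            return NULL;
        reply[0] = 2;            /* heartbeat response type */
        reply[1] = record[1];    /* echo the claimed length */
        reply[2] = record[2];

        /* BUG: payload is never compared with record_len, so this memcpy can
           read up to ~64 KB beyond the end of the received record, copying
           adjacent heap memory into the reply that goes back to the attacker.
           The fix adds a check along the lines of:
           if (3 + payload + 16 > record_len) discard the message.          */
        memcpy(reply + 3, pl, payload);
        return reply;
    }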

It is particularly disturbing that so many hosts remain unpatched, because it would be hard to tell if Heartbleed is being exploited to acquire sensitive information. Part of the difficulty is that data can be exfiltrated back to the attacking host as part of an encrypted connection, and is therefore invisible to packet monitoring software looking for evidence of exploitation. System administrators who assume that the bug is only being exploited using packets transmitted “in the clear”, rather than as part of an established secure connection, have a false sense of security.

Finding Heartbleed with static analysis

When this bug came out, colleagues asked me whether state-of-the-art static analysis tools (which analyze code for problems before that code is run) should have found Heartbleed sooner. To paraphrase: You guys in programming languages have been working on tools to find bugs like this for a long time. I presume that companies/developers should have been using these tools, and if so, they would have found the bug, right?

Unfortunately, I suspect the answer is ‘no’. In particular, at least four commercial tools that perform static analysis on code (Coverity’s Code Advisor, GrammaTech’s CodeSonar, Klocwork’s Insight, and Veracode’s code scanning services) would not have found the bug. In fact, Coverity didn’t find the bug: OpenSSL has been part of Coverity’s Scan project since 2006. The Scan project uses Coverity tools to analyze open source projects for free, and it would have scanned OpenSSL after the Heartbleed bug was introduced in 2012. Only after the bug was publicly announced did Andy Chou of Coverity suggest, in a blog post, a (clever) way that Coverity’s tool could be made to find the bug. GrammaTech and Klocwork quickly showed that their tools could play the same trick (here and here). My student, Andrew Ruef, wrote a Clang static analyzer plugin that can find the bug in the same way. Veracode simply hunts for the offending SSL code, not for the root cause of the bug.

Why couldn’t tools find the bug?

You would hope/expect that these tools would have found the bug: A significant motivator for developing static analysis tools is to find critical bugs that testing, code reviews, and other methods would miss. This bug seems to be a perfect storm of challenging features, and a recent whitepaper by James A. Kupsch and Barton P. Miller covers the details well. The basic explanation is that this bug involves a lot of complicated code and indirection through pointers, and as such confounds the reasoning of most tools. (The fix that Andy Chou suggested was to circumvent these complexities with a heuristic that identifies likely “tainted”, or attacker-originating, input by an idiomatic code pattern.)
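
Concretely, the heartbeat length in the buggy code is read with OpenSSL’s n2s byte-swapping macro, and the heuristic, as I understand it, treats the result of such network-byte-order conversions as likely tainted. Here is a minimal sketch of the idea, with my own macro standing in for n2s; it illustrates the heuristic, not Coverity’s actual checker.

    #include <string.h>

    /* A stand-in for OpenSSL's n2s: read two bytes in network byte order.
       The heuristic treats values produced this way as tainted, since the
       byte swap strongly suggests the data came off the wire.             */
    #define N2S(p, s) ((s) = ((unsigned int)(p)[0] << 8) | (unsigned int)(p)[1], \
                       (p) += 2)

    void process(unsigned char *p, unsigned char *out)
    {
        unsigned int len;
        N2S(p, len);          /* len is now considered tainted                 */
        memcpy(out, p, len);  /* a tainted value reaches a memcpy length with
                                 no intervening bounds check: report a defect  */
    }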

A higher-level explanation of why tools couldn’t find the bug is that commercially viable static analysis is unsound, with the goal of being more complete. This is a technical way of saying that commercial tools deliberately choose to miss bugs so as to reduce the rate of false alarms.

A sound analysis is one that reports every bug: if there exists an execution that manifests a bug at run-time, the analysis will report it. Unfortunately, a sound analysis may also claim to find bugs that do not actually manifest at run-time; these are called false alarms. On the flip side, a complete analysis reports only real bugs: if it reports a bug, then that bug will surely manifest at run-time. Ideally, we would have an analysis that is both sound and complete, so that it reports all true bugs, and nothing else. Unfortunately, such an analysis is impossible for most properties of interest, such as whether a buffer is overrun (the root issue of Heartbleed). This impossibility is a consequence of Rice’s theorem, which states that proving nontrivial semantic properties of programs in Turing-complete languages is undecidable. So we will always be stuck dealing with either unsoundness or incompleteness.
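
To make the trade-off concrete, here is a small, purely illustrative example (it has nothing to do with OpenSSL). The dereference below is safe because the two conditionals are correlated, but a sound analysis that does not track that correlation will warn about a possible NULL dereference, producing a false alarm.

    #include <stdio.h>

    /* Illustrative only: a sound but imprecise analysis may report that p
       could be NULL at the dereference, because it does not track the
       correlation between the two `if (ok)` branches. That report would be
       a false alarm.                                                       */
    void report(int ok)
    {
        int value = 42;
        int *p = NULL;

        if (ok)
            p = &value;

        /* ... other, unrelated work ... */

        if (ok)
            printf("%d\n", *p);   /* safe: only reached when p was set above */
    }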

Having a high rate of false alarms is a tool killer. A rule of thumb I have heard is that users are willing to tolerate no more than 50% of the alarms being false. Bessey et al. from Coverity wrote a great piece for Communications of the ACM on their experience constructing a commercially viable static analysis tool. They report that the toleration limit is 30%, not 50%, and in fact they aim for 20% for their “stable” checkers. They achieve this rate by looking for the simplest bugs and avoiding sophisticated analysis that, while it finds more bugs, yields more false alarms. One amazing (to me) line in the piece is:

the commercial Coverity product, despite its improvements, lags behind the research system in some ways because it had to drop checkers or techniques that demand too much sophistication on the part of the user.

In particular, users were labeling true bugs as false alarms, and had a harder time diagnosing false alarms, because the bugs being uncovered were very complicated and the error reports were hard to understand.

Where does this leave research, and practitioners?

How can we balance the need for better security against the expectations of software developers and their employers? Answers to this question are important for getting things done today, and for setting an agenda for impactful research.

Of course we can and should develop better analysis algorithms, i.e., those that are closer to being sound while not having too many false alarms (and not being too slow). An important question is how we evaluate these algorithms: beautiful theorems and simple benchmarks are not enough. We need realistic empirical assessments, both in terms of the power and precision of the tool in the hands of an expert, and in the hands of more ordinary developers. Assessing the latter may require studying real users, something the static analysis community rarely does.

Fully sound tools and processes may be appropriate for code that is security-critical, like OpenSSL, despite the greater demand and sophistication required of developers. For example, we could imagine applying full formal verification, the end result of which is a proof that the code will always behave as it should. Such an approach is being increasingly viewed as viable. For example, DARPA’s HACMS program has been pushing research to improve the scalability, applicability and usability of formal verification. A recent research project has looked at formally verifying SSL.

Another approach is simply to use type-safe languages, like Java and Haskell, which rule out pernicious bugs like buffer overruns by construction. Rust is a promising new language, since it aims to provide high performance by giving programmers the kind of low-level control they have in C while retaining type safety. (I note that several ideas in Rust were inspired by the research language Cyclone, which was co-developed by Trevor Jim, Greg Morrisett, Dan Grossman, Nikhil Swamy, and myself, among others.) I believe Google’s Go is also type-safe. That said, type safety does not rule out other important security-critical bugs, like information disclosures.

Rather than (only) trying to reduce the number of false alarms, we might imagine trying to reduce the time needed to triage an alarm, e.g., by making it easier to understand. Improved user interfaces, e.g., ones that support a directed code review in response to an alarm, might provide some help, as one prior study showed.

But such tools require more sophisticated users who may need to understand better how a tool works, as the Coverity paper suggested. One approach is the business model of “static analysis as a service”, as provided by Veracode. In this model, a user submits their code for analysis, and engineers very familiar with the analysis triage its warnings, providing a report back to the user. The issue here is that the static analysis company’s engineers don’t understand the code being analyzed. So the question is, what is more important for triaging: knowing the analyzed codebase, or knowing the analyzer itself?

There must be a role for education, too, with the aim of fostering more sophisticated users. By better understanding what’s going on, such users can use static analysis or a sophisticated language as a tool, rather than as a magic trick. It is my intuition that PL and static analysis courses that would help this understanding are becoming more common, even at the undergraduate level. This is a good thing.

Stepping back: Heartbleed has raised our awareness that software free of errors is not just something that is nice to have, or a value-add for a business, but is necessary for maintaining a free and open society in which privacy and integrity are respected and maintained. We in the PL community must continue to work hard, building on many past successes, to create better tools, techniques, languages, and methodologies that can improve the quality of software. Where do you think we should go next?


Filed under Formal verification, Software Security

Responses to How did Heartbleed remain undiscovered, and what should we do about it?

  1. Pascal Cuoq

    Hello, Michael.

    Formal methods are not restricted to the “full formal verification” of
    newly written SSL libraries. You mention the unsound, best-effort
    static analysis of existing C code. You note the possibility of
    verifying functional properties for new implementations,
    in a formalism intended for expressing and checking these properties.
    In between these two approaches, sound static analyzers also exist
    that can both be applied to the source code of the widely deployed
    components that matter and report all Heartbleed-like memory errors.

    The fundamental reason why this intermediate way is necessary is that
    in most of the contexts where OpenSSL is used, it is not currently
    possible to use miTLS. Microsoft’s fully formally verified SSL stack,
    miTLS, is developed in a formalism whose underlying implementation
    language is F#, and it drags the .NET platform along with it. You
    mention the 300,000 IP addresses that are still vulnerable to
    Heartbleed. Many of these addresses belong to NASes and other devices
    with minimal computing power and passive cooling (source:
    https://twitter.com/ErrataRob/status/481246897164652544 ).

    One simply cannot run miTLS on these devices. (Also it is
    not clear how one can be certain to be protected against timing
    attacks in a high-level, garbage-collected language. It is tricky
    enough in C because preservation of execution time is not a property
    the compiler or the processor usually have to ensure. The situation
    can only be worse in a higher-level language.)

    Can these automatic sound static analyzers be applied to a
    full-fledged SSL implementation and report only the undefined
    behaviors that are there, without any false positive? I do not know of
    any that can. However, I have spent time as one of the researchers
    working on the static analysis framework Frama-C.
    In Frama-C, program properties that are out of reach for
    one analysis technique can be verified with
    another, with the full continuum being available from robust,
    inexpensive abstract interpretation all the way to SMT solvers and Coq
    verification. And I am a founder at a company the first product of which
    is the verification report for PolarSSL, a pre-existing SSL
    implementation written in C, in a specific configuration. Anyone who
    wants to be rid of Heartbleed-like security vulnerabilities in SSL
    only needs to pick PolarSSL and to contract with TrustInSoft for the
    verification perimeter to be extended to cover their usage of the
    library: http://trust-in-soft.com/polarssl-verification-kit/ .

    Unsound static analysis is not the answer to Heartbleed, because it is,
    and has to remain, optimized for scenarios other than the
    security-critical foundations of the Internet. Despite great advances
    in recent years, full formal verification still involves trade-offs
    that are incompatible with the current usage of low-level software
    components. Programming languages that bring memory safety while
    remaining lightweight, such as Cyclone, Go and Rust, are one of the
    practical answers, and the sound static analysis of the programming
    languages currently in use is another.

    • Thanks for the helpful comment, Pascal! I guess what you are proposing is semi-automatic sound analysis, where the manual part is tuning the analysis parameters to work on the software at hand? This is what you did with PolarSSL? I think manual tuning is pretty common for unsound analyzers too, so it’s not really something you can escape. I’m also guessing that PolarSSL is written in such a way that it’s more amenable to analysis, overcoming some of the issues mentioned in Kupsch and Miller’s report?

      In addition to Frama-C, on the commercial side I believe that Astrée is also sound. I don’t know if it would have caught the Heartbleed bug or not; I saw no explicit report from the company confirming it would. I believe that it does not consider all of C, and that may be the reason why not.

      • Pascal Cuoq

        Frama-C’s Abstract Interpretation plug-in emits alarms, like Polyspace or Astrée. The difference with Polyspace and Astrée is that these alarms are expressed in a formal language as a property that must hold for the program to be safe.

        Example:

        void f(short x) {
            int z = x * 23 - 17; /* x * 23 == 17 has no integer solution, so z is never 0 */
            int y = 100 / z;     /* alarm emitted here: assert z != 0; */
        }

        An alarm would be emitted that reads “assert z != 0;” immediately before the division. Although (in this case) it looks like C, this formula is expressed in the ACSL annotation language. Frama-C’s weakest-precondition plug-ins can then be used to verify these properties just as if they were functional properties that the user had written with the intention of verifying that the code satisfies them. In the example above, either an automatic prover or Coq would have a chance to formally prove that the weakest precondition of f() ensuring z != 0 is true (because 17 is prime and a 16-bit short cannot overflow the multiplication x * 23).

        In TrustInSoft’s verification of PolarSSL, beyond the tuning of the Abstract Interpretation plug-in (which is a sizeable part of the work indeed), a lot of the work consisted in subdividing the program into sub-components that were specified individually (in ACSL), so that each component would only be verified against its specification and could assume that the sub-components it uses satisfy their specifications.
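
        To give an idea of what such a specification looks like, here is the general shape of an ACSL contract for a hypothetical, simplified helper; it is not taken from the actual PolarSSL verification, whose contracts are of course more involved.

        /* Hypothetical, simplified example of an ACSL contract; not from the
           PolarSSL verification. The body is verified once to satisfy the
           contract, and every caller is verified against the contract alone. */
        /*@ requires \valid(dst + (0 .. n-1));
          @ requires \valid_read(src + (0 .. n-1));
          @ assigns dst[0 .. n-1];
          @ ensures \forall integer i; 0 <= i < n ==> dst[i] == src[i];
          @*/
        void copy_bytes(unsigned char *dst, const unsigned char *src, unsigned int n);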

        Also, during that verification, as many of the emitted alarms as possible were verified using the weakest precondition computation plug-in. Some of them indeed are only justified by very careful review, assisted by the analyzer. This is exactly what you call “directed code review” in your post.

        ______

        From section 2 of https://continuousassurance.org/swamp/SWAMP-Heartbleed.pdf , Frama-C is designed to handle the difficulties a-c for all C programs, not just PolarSSL. Frama-C detects use of uninitialized memory (which may be the only symptom for Heartbleed depending on the memory model used for the analysis). We documented the most serious vulnerability that we found in PolarSSL at http://blog.frama-c.com/index.php?post/2014/02/23/CVE-2013-5914 . This example contains the difficulties a (pointer use) and b (distance between the validation and the actual undefined behavior), and it illustrates our version of “directed code review”.

        The reason we focused on PolarSSL first is one of size: PolarSSL is smaller than OpenSSL and it has a very modular design, making it easier to remove entire files from the verification perimeter. I don’t see any fundamental reason why the same kind of verification would not be applicable to OpenSSL. It would only be a lot of work.

        ______

        Astrée was initially designed for the verification of safety-critical embedded code (as was Frama-C). It will be useful for verifying security properties on low-level server code if its developers do the additional work necessary. One difference is that safety-critical embedded code is written to strict coding standards that limit the number of difficult constructs encountered, whereas the open-source components that make the Internet work are not written to any coding standard in particular.

        • Fantastic explanation, thanks for taking the time to provide it! This is exciting stuff.

        • Pascal,

          I read your comments here and the text about the verification of polarssl with great hope and joy. I genuinely believe that this kind of verification is a way forward for computing and could avert great disaster.

          However, today a bug was identified through fuzzing in PolarSSL: https://polarssl.org/tech-updates/security-advisories/polarssl-security-advisory-2014-02 This bug seems like it could be categorized under one of the CWEs that the verification effort claims to have proven absent from PolarSSL. I’m really curious about the ACSL annotations for ssl_decrypt_buf; looking at the code and the patch, I can theorize about how this condition could have escaped verification, but I would really like to know first hand what the root cause is.

          • Pascal Cuoq

            Hello, Andrew.

            Yes, the vulnerability in the PolarSSL advisory 2014-02 is exactly the kind of defect that TrustInSoft Analyzer and its open-source counterpart Frama-C are designed to find.

            I said that TrustInSoft’s first product was “the verification report for PolarSSL, a pre-existing SSL implementation written in C, in a specific configuration.”

            If you are using PolarSSL in this configuration, then you don’t need to upgrade to PolarSSL 1.3.8 because of the vulnerability described in this advisory. This vulnerability does not concern you, because it only exists if the GCM cyphersuite is enabled, and you disabled it when you followed the deployment instructions in the PolarSSL verification kit.

            HOWEVER, if you are running PolarSSL in this configuration, you should upgrade, because 1.3.8 also fixes another buffer overflow that can affect you in the configuration you are using: https://github.com/polarssl/polarssl/commit/0f651c7422f288dd560901480457756443e73f8e
            This one was introduced in 1.3.7 and we found it when we replayed the verification in order to update the verification kit.

            HOWEVER, if you bought the PolarSSL verification kit, then you shouldn’t upgrade, because who knows what new vulnerability the 1.3.8 version might introduce? You should wait until we have replayed the verification, which shouldn’t take more than until the end of next week (July 14 is France’s national holiday). But then again, you are in no hurry, because since you bought the PolarSSL verification kit, we warned you about the vulnerability fixed by https://github.com/polarssl/polarssl/commit/0f651c7422f288dd560901480457756443e73f8e at the same time as the PolarSSL maintainers.

          • I understand now and that is awesome to hear, thank you!

  2. Google’s Go is designed to be memory-safe as long as your code doesn’t have data races. https://en.wikipedia.org/wiki/Go_%28programming_language%29#Race_condition_safety

    Rust prevents you from creating data races by restricting concurrency APIs to types that fulfill the Send or Share traits, as needed. Linear types and immutability make these possible.

    • To elaborate slightly on Jesse’s remarks, Go supports multi-word values, to reduce the cost of memory indirections. However, assignments to these values are not atomic, which makes memory-corrupting races possible.

  3. Just found a longer essay by David Wheeler that covers many of the same points as my article. For those wanting to learn more technical details (while making the same general points), I recommend checking this out. http://www.dwheeler.com/essays/heartbleed.html

