DARPA STAC: Challenge-driven Cybersecurity Research

Last week I attended a multi-day meeting for the DARPA STAC program; I am the PI of a UMD-led team. STAC supports research to develop “Space/time Analysis for Cybersecurity.” More precisely, the goal is to develop tools that can analyze software to find exploitable side channels or denial-of-service attacks involving space usage or running time.

In general, DARPA programs focus on a very specific problem, and so are different from the NSF style of funded research that I’m used to, in which the problem, solution, and evaluation approach are proposed by each investigator. One of STAC’s noteworthy features is its use of engagements, during which research teams use their tools to find vulnerabilities in challenge problems produced by an independent red team. Our first engagement was last week, and I found the experience very compelling. I think that both the NSF style and the DARPA style have benefits, and it’s great that both styles are available.

This post talks about my experience with STAC so far. I discuss the interesting PL research challenges the program presents, the use of engagements, and the opportunities STAC’s organizational structure offers, when done right.

What is STAC?

The premise of the STAC program is that low-level vulnerabilities synonymous with C and C++ — like buffer overruns, uses-after-free, integer overflows, etc.— are on their way to being eradicated. As such, adversaries are turning to new vulnerabilities not prevented by the use of state-of-the-art languages, tools, and development techniques. In particular, the status quo does not help programmers prevent so-called complexity attacks or side channel attacks.

Complexity attacks

As an example complexity attack, consider the following program:

public int search(String text, String delims) {
  for (int i = 0; i < text.length(); i++) 
    for (int j = 0; j < delims.length(); j++)
      if (text.charAt(i) == delims.charAt(j)) 
        return i;
  return text.length();
}

The ostensible purpose of the search method is to find the first character in text that appears among the delimiters in delims, returning its index (or the length of text if there is none). Suppose the attacker provides both delims and text (e.g., through web form fields), with lengths M and N, respectively, such that M + N = L for some constant L. In the expected use case, M is small (say, 10 characters), while N is larger, say 990 characters (making L = 1000). For those lengths, the longest execution occurs when the strings share no characters, resulting in 990 * 10 = 9,900 comparisons. But the attacker can make the running time far worse by choosing lengths M = N = 500, resulting in 500 * 500 = 250,000 comparisons. By the STAC definitions, a complexity attack is possible if a legal input (e.g., a pair of strings of combined length at most L) can break an expected cost budget (e.g., 10,000 operations). There have been several real-world complexity attacks in which the expected case and the worst case are extremely divergent.
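
To make the arithmetic concrete, here is a small driver. It is a hypothetical sketch (not part of STAC or the challenge programs) that reuses search's loop structure but counts comparisons for the two length splits above.

public class ComplexityDemo {
  // Same nested loop as search(), but returns the number of character
  // comparisons performed instead of the index of the first delimiter.
  static long countComparisons(String text, String delims) {
    long comparisons = 0;
    for (int i = 0; i < text.length(); i++)
      for (int j = 0; j < delims.length(); j++) {
        comparisons++;
        if (text.charAt(i) == delims.charAt(j))
          return comparisons;
      }
    return comparisons;
  }

  static String repeat(char c, int n) {
    char[] buf = new char[n];
    java.util.Arrays.fill(buf, c);
    return new String(buf);
  }

  public static void main(String[] args) {
    // Expected use: N = 990 characters of text, M = 10 delimiters, no delimiter present.
    System.out.println(countComparisons(repeat('a', 990), repeat(',', 10)));  // 9900
    // Adversarial split: M = N = 500, still no match, so the full nested loop runs.
    System.out.println(countComparisons(repeat('a', 500), repeat(',', 500))); // 250000
  }
}

Note that both inputs respect the same combined length bound L = 1000; only the split between M and N changes.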

Side channel attacks

As an example of a side-channel attack, consider the following program:

private String password;
public boolean verifyPassword(String guess) {
  for (int i = 0; i < guess.length(); i++) 
    if (i >= password.length() || guess.charAt(i) != password.charAt(i)) 
      return false;
  if (guess.length() < password.length())
    return false;
  return true;
}

Here the adversary controls the guess input, and the password is intended to be secret. Can an attacker who does not know the password figure it out efficiently? The answer is ‘yes’ because of a side channel in verifyPassword: its running time depends on the length of the prefix that guess and password have in common. If the first character does not match, verifyPassword returns after a single comparison. If only the first character matches, it returns after two comparisons. If the first two match, it returns after three. And so on. If the adversary can observe these differences in running time, then he can recover the password by varying the character at each position until the running time gives away that there is a match, and then moving on to the next position to do the same thing. This process takes at most N * L tries, where N is the number of possibilities for each character and L is the maximum password length. That is far faster than brute force, i.e., trying each of the N^L possible passwords. Real-world side channels have also been discovered, e.g., in cryptography implementations and web applications.
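
To see how quickly the password leaks, here is a minimal sketch of the guessing procedure. It is hypothetical (the alphabet, the length bound, and the class and method names are illustrative assumptions, not from any challenge program), and it counts character comparisons as a deterministic stand-in for the timing measurements a real attacker would take.

public class SideChannelDemo {
  private final String password;
  SideChannelDemo(String password) { this.password = password; }

  // Mirrors verifyPassword's loop, but reports how many characters were
  // examined; this count is our proxy for running time.
  int comparisonsFor(String guess) {
    int i = 0;
    for (; i < guess.length(); i++)
      if (i >= password.length() || guess.charAt(i) != password.charAt(i))
        return i + 1;
    return i;
  }

  static String pad(char c, int n) {
    char[] buf = new char[n];
    java.util.Arrays.fill(buf, c);
    return new String(buf);
  }

  public static void main(String[] args) {
    String alphabet = "abcdefghijklmnopqrstuvwxyz"; // N = 26 possibilities per position
    int maxLen = 8;                                 // L = assumed maximum password length
    SideChannelDemo oracle = new SideChannelDemo("secret");

    // Recover the password one position at a time (assumes it is shorter than maxLen).
    StringBuilder recovered = new StringBuilder();
    for (int pos = 0; pos < maxLen; pos++) {
      char best = 0;
      int bestCount = -1;
      for (char c : alphabet.toCharArray()) {
        // Pad the guess so a correct character forces comparisons past position pos.
        String guess = recovered.toString() + c + pad('x', maxLen - pos - 1);
        int count = oracle.comparisonsFor(guess);
        if (count > bestCount) { bestCount = count; best = c; }
      }
      if (bestCount <= pos + 1) break; // no character extended the matching prefix: done
      recovered.append(best);
    }
    System.out.println("Recovered: " + recovered); // prints "secret" after at most N * L guesses
  }
}

Under these assumptions the attack makes at most 26 * 8 = 208 calls to the oracle, versus the 26^8 (over 200 billion) guesses that brute force would require.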

There are eight research teams (each involving several institutions) funded by the STAC program to develop tools and techniques to find these vulnerabilities. The focus is on JVM-based programs, which allows researchers to ignore the otherwise up-front issue of C/C++’s lack of memory safety.

Engagements

A very interesting element of the program is its use of engagements to assess the progress of individual teams and the program overall. At each engagement, a red team presents a set of programs. The research teams’ job is to determine whether these programs exhibit side channel or complexity attacks, in space or time. Engagements take place over a limited time period. Teams use their tools, and their own ingenuity, to analyze the programs and make a determination. A “control team” also participates, and differs from the other teams in only being allowed to use off-the-shelf tools — like decompilers, IDEs, and fuzzers — to make its determinations. All teams are scored by the number of determinations they make, and how many they make correctly (a wrong answer and a non-answer are not the same).

The use of engagements is not new to STAC. The program manager, Tim Fraser, also used them for his APAC and VET programs. Other DARPA initiatives have similarly focused on competitions, including the Grand Challenges for building autonomous vehicles; organizing decentralized, human communication networks; and building systems to automate software exploits and defenses.

The First STAC Engagement

The first STAC engagement took place last week. It began on a Tuesday afternoon, proceeded through Wednesday, and ended on Thursday afternoon. We were given 19 jar files containing the challenge programs and 45 specific questions about vulnerabilities they might have. A team from Apogee Research organized the engagement and set its operational parameters, and teams from Cyberpoint LLC and Raytheon BBN produced the challenge programs. These teams did a fantastic job: the engagement was well organized, and the challenge programs were both interesting and relevant.

The first thing we did was decompile the programs and put them in a shared repository. We also included a spreadsheet to coordinate among team members examining the programs. The next thing we did was try to run our tools on those programs. Our taint analysis — used to figure out how secret and/or attacker-controlled data could influence other data values in the program — largely worked. But just about everything else — our abstract interpretation, bounds analysis, and fuzzing tools — fell over. So the rest of the engagement involved us using the taint analysis results to help us manually explore whether any vulnerabilities were present. In the end we answered about 15 questions of the 45, and we were pretty confident in our answers for more than half of the 15, since we found what we believe are working exploits.
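For readers unfamiliar with the idea, here is a toy sketch of the kind of label propagation a taint analysis computes: values are labeled by the inputs they derive from, and operations propagate those labels to their results. This is purely illustrative, not our actual tool, and all names here are made up for the example.

import java.util.EnumSet;

public class TaintDemo {
  enum Label { ATTACKER, SECRET }

  // A value paired with the labels of the inputs it (transitively) depends on.
  static class Tainted<T> {
    final T value;
    final EnumSet<Label> labels;
    Tainted(T value, EnumSet<Label> labels) { this.value = value; this.labels = labels; }
  }

  // Operations propagate labels from their operands to their results.
  static Tainted<Integer> length(Tainted<String> s) {
    return new Tainted<>(s.value.length(), EnumSet.copyOf(s.labels));
  }

  public static void main(String[] args) {
    Tainted<String> guess = new Tainted<>("aaaa", EnumSet.of(Label.ATTACKER));
    Tainted<String> password = new Tainted<>("secret", EnumSet.of(Label.SECRET));

    // In verifyPassword, the number of loop iterations depends on both the
    // attacker-controlled guess and the secret password; flows like this are
    // what the taint results helped us explore by hand.
    EnumSet<Label> loopBound = EnumSet.copyOf(length(guess).labels);
    loopBound.addAll(length(password).labels);
    System.out.println("Loop bound depends on: " + loopBound); // [ATTACKER, SECRET]
  }
}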

We were slightly depressed about our performance until we started talking to the other teams. Basically, the challenge programs were just large or complicated enough that few teams’ tools handled all the necessary program features. Fortunately, work toward the next engagement begins immediately: we have six months to answer the remaining questions, which gives us an opportunity to develop our tools to address the problems the challenge programs presented.

Programmatic vs. open-ended research

Before the engagement I had very mixed feelings about the organization of the STAC project. Is it really the best use of researcher time and taxpayer resources?

Excessive overlap, or sufficient breadth?

One possible concern is that asking many teams (comprising dozens of people) to solve the same problem could lead to them pursuing very similar ideas. This situation would stunt overall progress and harm individual team members academically because the first group to publish a paper would be scooping everyone else. The NSF style avoids these problems because programs are broader; ultimately only 1 or 2 projects working on a very similar problem will be funded, and the rest of the budget will go to support projects working on different problems.

We will see how things play out, but at the moment I am optimistic that these possible downsides will not materialize. Why? Because the participating teams are using minimally overlapping techniques. It was clear from the post-engagement technical talks that different teams are using different technologies, e.g., involving fuzz testing, symbolic execution, abstract interpretation, and model checking. And teams that are using the same technologies are varying how they are used, e.g., bottom-up vs. top-down static analysis, or constraint-based analysis involving integer transition systems vs. bit-vector constraints. Ideally, new, basic research will be done by each team.

Hyper-focus, or useful guidance and assessment?

Another concern might be about the value of competition-based research. DARPA claims that one’s performance in the engagement does not influence continued funding in the program (there is no “downselect”), but even if that’s true, researchers will want to do well, which requires focusing on the test. As a result, their research may end up overlapping more than it would have otherwise. They may also be driven to spend a lot of time on things that are of practical rather than scientific interest, which could slow intellectual progress.

The counterargument is that a well-conceived competition or challenge can accelerate progress, rather than stunt it. STAC engagements employ a well-defined, representative set of benchmark programs, developed independently of any particular analysis approach. As my colleague David Van Horn put it when speaking about the similarly organized APAC program, such benchmarks force researchers to “scale up from core models to realistic applications, which was really useful in validating our original ideas. We might not have done this work otherwise, and it exposed a bunch of basic research problems we had to solve.” Without engagements driving progress, researchers may cherry pick problems to evaluate, choosing only those that make their approach look good. When such a myopic evaluation is published, it may suggest promise that isn’t really justified.

The key to a good series of engagements is a ready supply of (well-constructed) challenge programs; if these challenges are ill-defined (e.g., because the red teams and competition organizers don’t do a good job), the whole affair can fall apart. Fortunately, it seems like the DARPA STAC team chose well when they picked Apogee, Cyberpoint, and Raytheon BBN. I also like the use of the control team. A team of experts with limited tool support should do well up to a point, but as the research teams’ automated tools improve, their performance should outpace the control. It will be interesting to see the trends over time.

Tools helping humans

The STAC challenge format makes clear up front that tools should assist humans rather than decide the problem fully automatically. This makes sense for any undecidable problem: human judgment will always be required, something researchers often fail to acknowledge. I also like that the competition format is humane: teams that don’t do as well are not called out or shamed, since results are reported publicly only as aggregate statistics.

Conclusion

The DARPA STAC program presents an interesting model for cybersecurity research: Fund many teams to solve the same problem; drive progress through carefully defined, independently conceived engagements; and ensure that teams’ approaches are distinct enough that the progress is measured both in breadth and in depth. It’s been a positive experience so far; I commend the PM, Tim Fraser, for a job well done!

David Van Horn, Ian Sweet, and Andrew Ruef provided helpful comments on drafts of this post.
