Sunday, June 10, 2012

J. Edgar and Big Data

I finally got around to watching the movie J. Edgar last night, and I think there were some interesting parallels between crime solving in 1919 and solving digital crimes now.  When J. Edgar Hoover founded the modern FBI, he implemented a few novel ideas for law enforcement that resonated with me:
  1. Preserve all evidence, no matter how small.
  2. Evidence provides attribution.
  3. You need to collect Big Data to leverage evidence.

Preserve All Evidence
Computer forensics has come a long way since it was first needed in the 1980s.  But despite the great strides in host-based analysis, network analysis has only come into vogue in the last few years.  Aside from a few companies like NetWitness, there are very few commercial network forensics options.  Most network forensic work is still done in the old-school mode of gathering IP addresses, because IPs, like phone numbers, are easy to describe and collect.

Increasingly, IP addresses tell only half of the story, largely due to virtual hosts at large hosting providers.  This makes IP-based intel less reliable as a solid indicator, and incident responders and law enforcement investigators increasingly have to turn to network content as a supplement.  Application content (often encrypted) is fast becoming the only indicator left that's solid enough to launch an investigation.  This is similar to the shift to disposable cell phones, which has made phone numbers less helpful for tracking criminals.  Just as the local cops in the movie contaminate the crime scene of the Lindbergh baby kidnapping by moving a ladder, ignoring footprints, and so on, many organizations do not deem it necessary to collect small bits of evidence like DNS queries and SSL certificate use.

A great example of this is the ever-growing number of banking Trojans, like Zeus, which use a fully encrypted command-and-control (C2) communications protocol.  We can use Bro and ELSA to find this C2 based on SSL certificate information, leveraging the last solid indicator left on the wire.
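To make that concrete, here is a minimal sketch in Python (not the Bro/ELSA workflow itself) that walks a Bro ssl.log in its tab-separated ASCII format and flags certificate subjects containing suspicious keywords.  The keyword list and the column names it relies on (ts, id.orig_h, id.resp_h, subject) are illustrative assumptions; adjust them to whatever your sensor actually logs.

#!/usr/bin/env python
# Sketch: flag Bro ssl.log entries whose certificate subject contains a
# watchlisted keyword.  Column names are assumptions; check your log header.
import sys

BAD_SUBJECT_KEYWORDS = ["reaserch wweb", "faraway"]  # hypothetical watchlist

def scan_ssl_log(path):
    fields = []
    with open(path) as log:
        for line in log:
            line = line.rstrip("\n")
            if line.startswith("#fields"):
                fields = line.split("\t")[1:]  # column names follow the tag
                continue
            if line.startswith("#") or not fields:
                continue
            row = dict(zip(fields, line.split("\t")))
            subject = row.get("subject", "").lower()
            if any(keyword in subject for keyword in BAD_SUBJECT_KEYWORDS):
                yield row["ts"], row["id.orig_h"], row["id.resp_h"], row["subject"]

if __name__ == "__main__":
    for ts, src, dst, subject in scan_ssl_log(sys.argv[1]):
        print("{} {} -> {}  {}".format(ts, src, dst, subject))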


Collecting every SSL certificate observed might seem like overkill, but in cases like these, it is the only indicator that a host is compromised.  Without this small piece of evidence, we'd have only an IP address to go on, and in a co-location situation, that's not good enough to pull the trigger on physical action, like an arrest or re-image.

Evidence Provides Attribution
This is self-evident, but I want to focus on the importance of using all available evidence toward the goal of attribution in the context of network intrusions.  Too often, responders are willing to stop at the "what happened" phase of an investigation once they consider the intrusion contained and eradicated.  Without knowing the "why" of the situation, you cannot establish intent, and without intent, you cannot establish attribution, the "who."  If you don't understand the underlying motives, you won't be able to predict the next attack.

The level of maturity required of an incident response program increases dramatically from figuring out the basics of an intrusion to understanding it within the context of a campaign, which could go on for years.  Specifically, the IR program needs a comprehensive collection of data, and then a way to tie specific pieces of data together in permanent incident reports.  All of this needs to be intuitive enough that analysts can programmatically link incidents together based on very small and varying indicators, such as a single word in an SSL certificate subject.
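As a toy illustration of what that linking could look like, here is a sketch that indexes incident reports by the individual tokens in their indicators, so that two incidents sharing a single odd word in a certificate subject surface together.  The incident structure (a dict with "id" and "indicators") is hypothetical; in practice this would be a query against your incident-report store rather than an in-memory index.

from collections import defaultdict

def build_indicator_index(incidents):
    # Map each lower-cased indicator token (e.g. a word pulled from an SSL
    # certificate subject) to the IDs of incidents that mention it.
    index = defaultdict(set)
    for incident in incidents:
        for indicator in incident["indicators"]:
            for token in indicator.lower().split():
                index[token].add(incident["id"])
    return index

# Example: incidents whose indicators share the (hypothetical) token
# "reaserch" become candidates for linking into a single campaign.
# related = build_indicator_index(all_incidents)["reaserch"]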

For example, a recent incident had no signs of compromise other than traffic to a known-bad IP address.  The traffic was encrypted using an SSL certificate:

CN=John Doe,OU=Office_1,O=Security Reaserch WWEB Group\, LLC,L=FarAway,ST=NoState,C=SI

We were able to link another IP address, one not previously known to be bad, to the same SSL certificate, which in turn revealed further intel on who registered it, for how long, what else it had done, and so on.

In addition to the breadth of information collected, like SSL certificates, depth is also important.  Knowing which internal host was the first to visit a site using that SSL certificate let us point to the initial infection vector for what ended up being an incident affecting multiple hosts.
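Here is a rough sketch of that kind of pivot, assuming you've already pulled the SSL connection records tied to the suspect certificate (say, as (timestamp, client, server, subject) tuples from the scan above) and that internal hosts sit in a known address range:

def earliest_internal_client(ssl_records, internal_prefix="10."):
    # ssl_records: iterable of (ts, client_ip, server_ip, cert_subject) tuples
    # for connections that presented the suspect certificate.
    linked_servers = set()   # every server IP tied to the certificate
    first_seen = None        # (timestamp, client_ip) of the earliest internal hit
    for ts, client, server, _subject in ssl_records:
        linked_servers.add(server)  # pivots not-yet-known-bad IPs onto the cert
        if client.startswith(internal_prefix):
            if first_seen is None or float(ts) < first_seen[0]:
                first_seen = (float(ts), client)
    return linked_servers, first_seen

The set of linked servers is what lets you pivot onto IP addresses that weren't previously known-bad, and the earliest internal client is your candidate for the initial infection.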

Collect Big Data
Hoover was also a pioneer in the collection of fingerprints for forensic purposes, a point which the movie made many times over.  It portrayed one of his proudest moments as the day the last cart full of fingerprint files was wheeled into the Bureau's collection from remote branches.  It reminded me of when we got our last Windows server forwarding its logs to our central repository.  One of the first scenes shows his passion for setting up a system for indexing the Library of Congress, and I couldn't help but relate as he excitedly showed off the speed with which he could retrieve any book given a random query for a subject.

The concept that his agents needed to have a wealth of information at their disposal was a central theme.  There was even a line, "information is power," that resonates as a truth in both business and public safety today.  With the increasingly complex mix of adversaries involved in cybercrime, it takes far more dots, and far more horsepower to connect those dots, than in simpler times.  Bruce Schneier provided an excellent example with his piece on the dissection of a click fraud campaign.  The sheer number of moving parts in a fraud scheme like that, involving botnets and unknowing accomplices (ad networks), is daunting.  What makes it difficult from a network perspective is that many of the indicators of compromise are in fact legitimate web requests to known-good hosts.  It is only in a specific context that they become indicators.  Specifically, a request with a spoofed referrer will appear completely valid in every way except that the client never actually visited the prior page.  It takes the next level of network forensics to validate that referrers aren't spoofed, especially because subsequent requests will use redirects, so only the very first request will have a spoofed referrer.

What does it take to track this kind of activity?  You must be able to:
  1. Find all hosts which used a known-bad SSL certificate.
  2. For each host, retrieve every web request made.
  3. For each result, get the referrer and find out if a corresponding request was made. 
That requires a very scalable and malleable framework capable of incredibly fast data retrieval as well as a flexible result processing system to deal with all of the recursion.
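Here is a minimal sketch of step 3, the referrer check, assuming steps 1 and 2 have already produced a per-client list of web requests from whatever back end you query (ELSA or otherwise).  The request layout (dicts with ts, host, uri, and referrer keys) and the naive URL reconstruction are simplifications for illustration:

def find_spoofed_referrers(requests_by_client):
    # requests_by_client: {client_ip: [{"ts": ..., "host": ..., "uri": ...,
    # "referrer": ...}, ...]}, built from steps 1 and 2.  Returns, per client,
    # the requests whose referrer the client never actually visited earlier.
    suspects = {}
    for client, requests in requests_by_client.items():
        visited = set()
        flagged = []
        for req in sorted(requests, key=lambda r: r["ts"]):
            referrer = req.get("referrer", "")
            # Step 3: a referrer is suspect if this client never requested
            # that URL earlier in the capture.
            if referrer and referrer not in visited:
                flagged.append(req)
            visited.add("http://" + req["host"] + req["uri"])
        if flagged:
            suspects[client] = flagged
    return suspects

A production version would also bound the lookup in time and account for the redirect chains mentioned above, but the shape of the problem, fast retrieval plus recursive post-processing, is the same.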

In addition to this request validation, you need the breadth of data to map out how far the intrusion spreads.  We located scores of hosts using the malicious SSL certificate, all performing various activities: some were committing click fraud, others were keylogging and form grabbing, and others were sending spam.  The only thing they had in common was the SSL certificate used for their C2.  Without tying them together, attribution would be impossible, rather than merely the great challenge it is now.