Monday, March 28, 2011

Fighting APT with Open-source Software, Part 1: Logging

Just because Advanced Persistent Threat (APT) is a marketing buzzword doesn't mean that it isn't a problem. The Cisco Security Blog had a fantastic post detailing what APT is, what it is not, and what it takes to defend against it. From the post: “The state of the art in response to APT does not involve new magic software hardware solution divining for APT, but relies more on asking the right questions and being able to effectively use the existing detection tools (logging, netflow, IDS and DPI).”

The article then goes on to detail exactly what you need to combat APT. As they stated, it is in fact NOT a product. It is a collection of information and tools which provides a capability utilized by a team in a perpetual investigatory process. They are dead-on as they describe what you need. Here is my paraphrased reproduction:

  1. A comprehensive collection of logs and the ability to search and alert on them.

  2. Network intrusion detection.

  3. A comprehensive collection of network connection records.

  4. Information sharing and collaboration with other orgs.

  5. The ability to understand the malware you collect.

I'm going to add another requirement of my own:

  1. The ability to quickly view prior network traffic to gain context for a network event and collect network forensic data.

These items shouldn't be a huge shock to anyone, and are probably already on a to-do list somewhere in your organization. It's like asking a doctor what you should do to be healthy: she'll say to exercise and eat right. She will certainly not prescribe diet pills. But much like some people find a workout schedule that works for them, I'm going to detail the implementations and techniques that work for us and will probably work for you.

There is a lot of ground to cover here, so I am going to address solutions to these tasks in a series of posts which detail what is needed to fulfill the above requirements and how it can be done with 100% open-source software.

In this introductory post, I'll tackle the biggest, most important, and perhaps most familiar topic: logs.

Enterprise Log Management (Omniscience Through Logging)

Producing and collecting logs is a crucial part of fighting APT because it allows individual assets and applications to record events that by themselves may be insignificant, but may be an indicator of a malicious activity. There is no way to know ahead of time what logs are important, so all logs must be generated and collected. APT will not generate “error” logs unless you are very lucky—it's the “info” (and sometimes “debug”) logs that have the good stuff.

The first major hurdle for most organizations is collecting all of the relevant information and putting it in one place (at least from a query standpoint). There are a lot of reasons why this task is so difficult. The first is that, historically, log collection is just not sexy. It's hard for people to get excited about it, and it takes a herculean effort to do it effectively. Unless you have a passion for it, it's not going to get done. Sure, it's easy enough to get a few logs collected, but for effective defense, you're going to need comprehensive logging. This is generally accomplished by enabling logging on every asset and sending it all to a SIEM or log management solution. This is a daunting task, and it is one of the biggest reasons why fighting APT is so hard. Omniscience does not come easily or cheaply.

If you have the money, there are a lot of commercial SIEM and log management solutions that can do the job out there. Balabit makes a log collection product with a solid framework, ArcSight has an excellent reputation with its SIEM, and I can personally vouch for Splunk as being a terrific log search and reporting product. However, large orgs will have massive amounts of logs, and large-scale commercial deployments are going to be extremely expensive (easily six figures or more). There are a number of free, open-source solutions out there which will provide a means for log collection, searching, and alerting, but they are not designed to scale to collecting all events from a large organization, while still making that data full-text searchable with millisecond response times. That kind of functionality costs a lot of money.

Building Big

Almost two years ago, I set out to create a log collection solution that would allow Google-fast searching on a massively large set of logs. The problem was two-fold: find a syslog server that could receive, normalize, and index logs at a very high rate, and find a database that could query those logs at Google speeds, all with massive scalability. I have to say that when I first started, I believed that this task was impossible, but I was glad to prove myself wrong.

The first breakthrough was finding Syslog-NG version 3, which includes support for the pattern-db parser. It allows Syslog-NG to be given an XML file specifying log patterns to normalize into fields which can be inserted into a database. It does this with a high-speed pattern matching algorithm (Aho-Corasick) instead of a traditional regular expression. This allows it to parse logs at over 100k logs/second on commodity hardware. Combined with MySQL's ability to bulk load data at very high rates (over 100k rows/second), I had an extremely efficient mechanism for getting the logs from the network, parsed, and stored in a database.
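To make the parse-then-bulk-load flow concrete, here is a minimal Python sketch. It is not Syslog-NG's pattern-db and not ELSA's code: the pattern table, field names, and sample log lines are all hypothetical, and a plain regular expression stands in for the faster pattern matcher. It only shows the idea of turning a raw message into named fields and writing tab-separated rows of the kind MySQL's LOAD DATA INFILE can ingest.

```python
import csv
import io
import re

# Hypothetical patterns: program name -> (log class, regex with named fields).
# A real pattern-db is an XML file compiled into a faster matcher; regexes
# are used here only to keep the sketch short.
PATTERNS = {
    "sshd": (
        "SSH_LOGIN",
        re.compile(
            r"Accepted (?P<method>\S+) for (?P<user>\S+) "
            r"from (?P<srcip>\S+) port (?P<srcport>\d+)"
        ),
    ),
}

# Column order for the bulk-load file (assumed, not ELSA's schema).
FIELD_ORDER = ["host", "program", "class", "msg", "user", "srcip", "srcport", "method"]


def normalize(host, program, msg):
    """Return a dict of fields; logs with no matching pattern stay UNCLASSIFIED."""
    row = {"host": host, "program": program, "class": "UNCLASSIFIED", "msg": msg}
    entry = PATTERNS.get(program)
    if entry:
        log_class, regex = entry
        match = regex.search(msg)
        if match:
            row["class"] = log_class
            row.update(match.groupdict())
    return row


def to_tsv(rows):
    """Serialize rows as tab-separated text, one log per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row.get(name, "") for name in FIELD_ORDER])
    return buf.getvalue()


rows = [
    normalize("web01", "sshd", "Accepted publickey for alice from 10.0.0.5 port 50022 ssh2"),
    normalize("web01", "cron", "job started"),
]
tsv = to_tsv(rows)
print(tsv, end="")
```

Writing batches to a file and loading them in bulk, rather than inserting row by row, is the design choice that makes the high ingest rates described above plausible.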

The second task, finding an efficient indexing system, was much more challenging. After trying about a half-dozen different indexing techniques and technologies, including native MySQL full-text, MongoDB, TokuDB, HBase, Lucene, and CouchDB, I found that none of them were even close to being fast enough to keep up with a sustained log stream of more than a few thousand logs per second when indexing each word in the message. I was especially surprised when HBase proved too slow, as it's the open-source version of what Google uses.

Then I found Sphinx (sphinxsearch.com), an open-source full-text search engine that works with MySQL. Sphinx was able to index log tables at rates of 50k logs/second, and it provided a huge added feature: distributed group-by functionality. So, armed with Syslog-NG, MySQL, and Sphinx, I was able to put together a formal Perl framework to manage bulk loading log files written by Syslog-NG into MySQL and indexing the new rows.
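The two capabilities that matter here, indexing every word of a message and grouping matches by a field, can be illustrated with a toy in-memory index. This is a sketch of the concept only and makes no claim about how Sphinx works internally; the class, method, and field names are invented for illustration.

```python
from collections import Counter, defaultdict


class TinyLogIndex:
    """Toy inverted index: maps each word to the ids of logs containing it."""

    def __init__(self):
        self.logs = []                    # log id -> dict of fields
        self.postings = defaultdict(set)  # word -> set of log ids

    def add(self, host, program, msg):
        log_id = len(self.logs)
        self.logs.append({"host": host, "program": program, "msg": msg})
        for word in msg.lower().split():
            self.postings[word].add(log_id)
        return log_id

    def search(self, query):
        """Return ids of logs containing every word in the query."""
        words = query.lower().split()
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result

    def group_by(self, query, field):
        """Count matching logs per value of the given field."""
        return Counter(self.logs[i][field] for i in self.search(query))


index = TinyLogIndex()
index.add("web01", "sshd", "Failed password for root")
index.add("web02", "sshd", "Failed password for admin")
index.add("web01", "sshd", "Accepted password for alice")
index.add("web01", "sshd", "Failed password for guest")

counts = index.group_by("failed password", "host")
print(counts.most_common())
```

A search only touches the posting sets for the query words, never the full log table, which is why word-level indexing keeps queries fast as the log volume grows.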

That all proved to be the easy part. Writing a web frontend and middleware server around the whole distributed system proved to be the tougher challenge. Many thousands of lines of Perl and JavaScript later, I had a web app that used the Yahoo User Interface (YUI) library to manage searching, reporting, and alerting on the vast store of logs available.

Introducing Enterprise Log Search and Archive (ELSA)


ELSA collects and indexes syslog data as described above, archives and deletes old logs when they reach configured ages, and sends out alerts when preset searches match new logs. It is 100% web-based for both using and administering.
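As a rough illustration of the retention and alerting behavior just described, the Python sketch below decides what to do with logs of a given age and checks a batch of new logs against saved searches. The thresholds, the alert name, and the function names are hypothetical and are not ELSA's actual configuration or code; the archive-then-delete ordering is one plausible reading of the description.

```python
# Hypothetical settings; ELSA's real configuration keys are not shown in this post.
ARCHIVE_AFTER_DAYS = 30
DELETE_AFTER_DAYS = 365

# Saved searches that should raise an alert: name -> words that must all appear.
SAVED_ALERTS = {
    "root ssh failures": ["sshd", "failed", "root"],
}


def retention_action(age_days):
    """Decide what to do with a log batch of the given age."""
    if age_days >= DELETE_AFTER_DAYS:
        return "delete"
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep indexed"


def check_alerts(new_logs):
    """Return (alert name, message) pairs for new logs matching a saved search."""
    hits = []
    for msg in new_logs:
        words = set(msg.lower().split())
        for name, terms in SAVED_ALERTS.items():
            if all(term in words for term in terms):
                hits.append((name, msg))
    return hits


batch = ["sshd Failed password for root from 10.0.0.9", "cron job started"]
hits = check_alerts(batch)
print(retention_action(45), hits)
```

Running the saved searches only against each newly loaded batch, rather than the whole store, keeps alerting cost proportional to the incoming log rate.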

Main features:

  • Full-text search on any word in a message or parsed field.

  • Group by any field and produce reports based on results.

  • Schedule searches.

  • Alert on search hits on new logs.

  • Save searches, email saved search results.

  • Create incident tickets based on search results (with plugin).

  • Complete plugin system for results.

  • Export results as permalink or in Excel, PDF, CSV, and HTML.

  • Full LDAP integration for permissions.

  • Statistics for queries by user and log size and count.

  • Fully distributed architecture, can handle n nodes with all queries executing in parallel.

  • Compressed archive with better than 10:1 ratio.
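The distributed feature in the list above, where every node is queried in parallel, can be sketched in Python as a fan-out and merge. The node names and in-memory data are stand-ins invented for this example; in a real deployment each call would go over the network to a node's search daemon, and this is not how ELSA or Sphinx implement it.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for remote nodes: each "node" is just a list of (timestamp, message).
NODES = {
    "node1": [(100, "sshd failed password for root"), (130, "cron job started")],
    "node2": [(110, "sshd failed password for admin"), (120, "sshd accepted key")],
}


def search_node(name, term):
    """Search one node and return its matching (timestamp, node, message) rows."""
    return [(ts, name, msg) for ts, msg in NODES[name] if term in msg]


def search_all(term, limit=100):
    """Query every node concurrently, then merge results newest-first."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        per_node = pool.map(lambda name: search_node(name, term), NODES)
        merged = [row for rows in per_node for row in rows]
    merged.sort(reverse=True)
    return merged[:limit]


results = search_all("failed password")
for row in results:
    print(row)
```

Because the nodes are searched at the same time, total query time is roughly that of the slowest node plus the merge, rather than the sum over all nodes.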



One of the biggest differences in requirements between a large-scale, enterprise logging solution and your average log collector is assigning permissions to the logs so that users receive only the logs they are authorized for. ELSA accomplishes this by assigning logs a class when they are parsed and allowing administrators to assign permissions based on a combination of log class, sending host, and generating program. The permissions can be assigned either to local database users for small implementations, or to LDAP group names if an LDAP directory is configured.
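A permission model built on log class, sending host, and generating program could look something like the following Python sketch. The Grant type, the group names, and the sample logs are assumptions made for illustration and do not reflect ELSA's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """What one group may read; None means 'any value' for that attribute."""
    log_class: Optional[str] = None
    host: Optional[str] = None
    program: Optional[str] = None

    def allows(self, log):
        return all(
            wanted is None or log[key] == wanted
            for key, wanted in (
                ("class", self.log_class),
                ("host", self.host),
                ("program", self.program),
            )
        )


# Hypothetical group-to-grant mapping; group names could come from LDAP.
GRANTS = {
    "webdev": [Grant(log_class="WEB_ACCESS", host="web01")],
    "security": [Grant()],  # an empty grant matches every log
}


def visible_logs(groups, logs):
    """Return only the logs that at least one of the user's groups may read."""
    grants = [g for group in groups for g in GRANTS.get(group, [])]
    return [log for log in logs if any(g.allows(log) for g in grants)]


LOGS = [
    {"class": "WEB_ACCESS", "host": "web01", "program": "apache", "msg": "GET /index.html 200"},
    {"class": "WEB_ACCESS", "host": "web02", "program": "apache", "msg": "GET /admin 403"},
    {"class": "SSH_LOGIN", "host": "web01", "program": "sshd", "msg": "Accepted publickey for alice"},
]

webdev_view = visible_logs(["webdev"], LOGS)
security_view = visible_logs(["security"], LOGS)
print(len(webdev_view), len(security_view))
```

A user with no matching group sees nothing, so the default is deny, which is the safer failure mode for sensitive logs.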

Permissions are a crucial and powerful component of any comprehensive logging solution. They give security departments the power to let web developers have access to the logs specific to their web site to look for problems without allowing them access to sensitive logs. The site authors may be the most qualified to notice suspicious activity because they will have the most knowledge of what is normal. The same goes for administrators and developers in other areas of the enterprise.

However, the biggest win for the security department is that log queries finish quickly. Ad-hoc searches on billions of logs finish in about 500-2000 milliseconds. This is critical, because it allows security analysts to explore hunches and build context for the incident they are analyzing without having to decide if the query is worthwhile before running it. That is, they are free to guess, hypothesize, and explore without being penalized by having to wait around for results. This means that the data from a seed incident may quickly blossom into supporting data for several other, tangentially related incidents because of a common piece of data. It means that the full extent and context of an incident becomes apparent quickly, allowing the analyst to decide if the incident warrants further action or can be set aside as a false-positive.

Getting ELSA

ELSA is available under GPLv2 licensing at http://code.google.com/p/enterprise-log-search-and-archive/ . Please see the INSTALL doc for specifics, but the basic components, as mentioned above, are Linux (untested on *BSD), Syslog-NG 3.1, MySQL 5.1, Sphinx search, Apache, and Perl. It is a complex system and will require a fair amount of initial configuration, but once it is up and running, it will not need much maintenance or tuning. If you run into issues, let me know and I will try to help you get up and running.

8 comments:

  1. Any sizing/scaling advice?

    You mentioned searches on billions of lines returning in 500-2000 milliseconds. Is that actual performance or a design goal?

  2. For spec'ing a system, in order of importance: disk size, RAM, disk speed, number of CPUs. The overriding performance factor is Sphinx's indexer and search daemon, so refer to sphinxsearch.com for docs. My given stats are taken from large systems (16 CPU, 144 GB RAM, 12 TB HD), but you will get the same performance on a system with 4 CPU, 8 GB RAM, and any sized HD as things scale linearly. The system first ran on IBM blades with 4 GB RAM and slow SAN drives and performed at about the same rate, but 4 GB is cutting it a bit close.

    Those search times are our average query time in a distributed (two box) search, but the search time will always be the same no matter how many boxes you scale out to because Sphinx searches all nodes in parallel. We have 6 billion indexed logs and 30 billion archived (non-indexed, compressed) logs on each of our two nodes. Each node can handle about 50,000 logs/second incoming.

    If you're preparing to do an install, I recommend waiting until the end of the week. I've made a lot of code improvements which will make the install much easier, and I'm planning on putting together much better install docs.

  3. Successfully installed on Debian 6.0
    Needed to add libio-string-perl (apt-get).
    Could not get local authentication to work and did not know which logs to look at to troubleshoot.
    Changed to "none" auth and am able to get to the interface. However, when I put anything in the query criteria my search times out. It does detect logs because I see it as unclassified reference to program name ossec which is sending logs. How do I troubleshoot?
    Also, how do I keep /data/elsa/tmp/buffers from consuming all the inodes available?

  4. Nice work. I would be interested to know if you did any performance comparison to Splunk on the same hardware used by ELSA?

  5. I've only been able to use Splunk personal edition, which is only different in its limit of 500 MB of logs per day. Our search times for items through several gigabytes of logs in Splunk take longer than our search through 20 TB of logs in ELSA. I have not been able to run Splunk on large hardware like ELSA (we have to run it on a VM), but it does not matter as the performance is almost entirely disk-bound for large datasets. If you have your VM set up correctly (vSphere VM disk set to "high"), then you get essentially native host speeds. So while it's far from a scientific test because of the machine spec differences, the tiny workload we have given Splunk should mean fast searches, and we have experienced equal or slower search times compared with ELSA and its huge workload. However, I readily admit that Splunk's interface and visualization capabilities are far superior to ELSA's.

  6. Have you considered making ELSA available as a VM appliance?

  7. Yes, and in fact I had one. I found it was much easier to focus on the install/update script because that's needed whether you are updating a VM appliance or your own instance. In the end, the updater works well enough that all that's needed is to install any Linux or FreeBSD on a VM and then install ELSA on top of that. This also makes org-specific customizations easier to handle.

  8. just wonder if u gonna do the search form smoother ?
