Open-Source Security Tools: Why ELSA Doesn't Use Hadoop for Big Data

Of late I've been reading and hearing a lot about Apache Hadoop whenever the topic is Big Data. Hadoop solves the Big Data problem: How do I store and analyze data that is of an arbitrary size?

Apache's answer:

Hadoop is incredibly complicated, and though it used to be a pain to setup and manage, things have improved quite a bit. However, it is still computationally inefficient when compared to non-clustered operations. A famous example lately is the RIPE 100GB pcap analysis on Hadoop. The article brags about being able to analyzie 100GB pcap on 100 Amazon EC2 instances in 180 seconds. This performance is atrocious. A simple bash script which breaks a pcap into smaller parts by time using tcpslice in parallel and pipes that to tcpdump will analyze an 80GB pcap in 120 seconds on a single server with four processors. By those numbers, you pay a 100x penalty for using Hadoop.

But you don't use Hadoop because you want something done quickly, you use it because you don't have a way to analyze data at the scale you require. Cisco has a good overview of why orgs choose Hadoop. The main points from the article, Hadoop:

Moves the compute to the data
Is designed to scale massively and predictably
Is designed to support partial failure and recoverability
Provides a powerful and flexible data analytics framework

Let's examine these reasons one-by-one:

Move the Compute to the Data

If your solution uses servers that have the data they need to do the job on local disk, you've just moved the compute to the data. From the article:

"this is a distributed filesystem that leverages local processing for application execution and minimizes data movement"

Minimizing the data movement is key. If your app is grepping data on a remote file server in such a way that it's bringing the data back to look at, then you're going to have a bottleneck somewhere.

However, there are a lot of ways to do this. The principle is a foundation of ELSA, but ELSA doesn't use Hadoop. Logs are delivered to ELSA and, once stored, never migrate. Queries are run against the individual nodes, and the results of the query are delivered to the client. The data never moves and so the network is never a bottleneck. In fact, most systems do things this way. It's what normal databases do.

Scale Massively and Predictably
"Hadoop was built with the assumption that many relatively small and inexpensive computers with local storage could be clustered together to provide a single system with massive aggregated throughput to handle the big data growth problem."

Amen. Again, note the importance of local disk (not SAN/NAS disk) and how not moving the data allows each node to be self-sufficient.

In ELSA, each node is ignorant of the other nodes. This guarantees that when you add a node, it will provide exactly the same amount of benefit as the other nodes you added provide. That is, its returns will not be diminished by increased cluster synchronization overhead or inter-node communications.

This is quite different than traditional RDBMS clustering which require a lot of complexity. Hadoop and ELSA both solve this, but they do it in different ways. Hadoop tries to make the synchronization as lightweight as possible, but it still requires a fair amount of overhead to make sure that all data is replicated where it should be. Conversely, ELSA provides a framework for distributing data and queries across nodes in such a way that no synchronization is done whatsoever. In fact, one ELSA log node may participate in any number of ELSA frontends. It acts as a simple data repository and search engine, leaving all metadata, query overhead, etc. up to the frontend. This is what makes it scalable, but also what makes it so simple.

Support Partial Failure and Recoverability
"Data is (by default) replicated three times in the cluster"

This is where the Hadoop inefficiencies start to show. Now, this is obviously a design decision to use 3x the amount of disk you actually need in favor of resiliency, but I'd argue that most of the data you're using is "best-effort." If it's not, that's fine, but know up-front that you're going to be paying a massive premium for the redundancy. The premium is two-fold: the raw disk needed plus the overhead of having to replicate and synchronize all of the data.

In our ELSA setup, we replicate our logs from production down to development through extremely basic syslog forwarding. That is a 2x redundancy that gives us the utility of a dev environment and the resiliency of having a completely separate environment ready if production fails. I will grant, however, that we don't have any fault tolerance on the query side, so if a node dies during a query, the query will indeed fail or have partial results. We do, however, have a load balancer in front of our log receivers which detects if a node goes down and reroutes logs accordingly, giving us full resiliency for log reception. I think most orgs are willing to sacrifice guaranteed query success for the massive cost savings, as long as they can guarantee that logs aren't being lost.

Powerful and Flexible Data Analytics Framework
Hadoop provides a great general purpose framework in Java, and there are plenty of extensions to other languages. This is a huge win and probably the overall reason for Hadoop's existence. However, I want to stress a key point: It's a general framework and not optimized for whatever task you are giving it. Unless you're performing very basic arithmetic, operations you are doing will be slower than a native program. It also means that your map and reduce functions will be generic. For instance, in log parsing on Hadoop, you've distributed the work to many nodes, but each node is only doing basic regular expressions, and you will have to custom code all of the field and attribute parsers yourself. ELSA uses advanced pattern matchers (Syslog-NG's PatternDB) to be incredibly efficient at parsing logs without using regular expressions. This allows one ELSA node to do the work of dozens of standard syslog receivers.

One could certainly write an Aho-Corasick-based pattern matcher that could be run in Hadoop, but that is not a trivial task, and provides no more benefit than the already-distributed workload of ELSA. So, if what you're doing is very generic, Hadoop may be a good fit. Very often, however, the capabilities you gain from distributing the workload will be eclipsed by the natural performance of custom-built, existing apps.

ELSA Will Always Be Faster Than Hadoop
ELSA is not a generic data framework like Hadoop, so it benefits from not having the overhead of:

Versioning
3x replication
Synchronization
Java virtual machine
Hadoop file system

Here's what it does have:

Unparalleled Indexing Speed
ELSA uses Sphinx, and Sphinx has the absolute fastest full-text indexing engine on the planet. Desktop-grade hardware can see as many as 100k records/second indexed from standard MySQL databases with data rates above 30 MB/sec of data indexed. It does this while still storing attributes to go along with each record. It is this unparalleled indexing speed which is the single largest factor for why ELSA is the fastest log collection and searching solution.

Efficient Deletes
Any logging solution is dependent on the efficiency of deletes once the dataset has grown to the final retention size. (This is often overlooked during testing because a full dataset is not yet present.) Old logs must be dropped to make room for the new ones. HBase (the noSQL database for Hadoop) does not delete data! Rather, data is marked for later deletion which happens during compaction. Now, this may be ok for small or sporadically large workloads, but ELSA is designed for speed and write-heavy workloads. HBase must suffer the overhead of deleted data (slower queries, more disk utilization) until it gets around to doing its costly compaction. ELSA has extremely efficient deletes by simply marking an entire index (which encompasses a time range) as expired and issuing a re-index, which effectively truncates the file. Not having to check each record in a giant index to see if it should be deleted is critical for quickly dumping old data.

Unparalleled Index Consolidation Speed
It is the speed of compaction (termed "consolidation" in ELSA or "index merge" in Sphinx) which is the actual overriding bottleneck for the entire system during sustained input. Almost any database or noSQL solution can scale to tens of thousands of writes per second per server for bursts, but as those records are flushed to disk periodically, it becomes this flushing and subsequent consolidation of disk buffers that dictates the overall sustainable writes per second. ELSA consolidates its indexes at rates of around 30k records/second, which establishes its sustained receiving limit.

Purpose-built Specialized Features
Sphinx provides critical features for full-text indexing such as stopwords (to boost performance when certain words are very common), advanced search techniques including phrase proximity matching (as in when quoting a search phrase), character set translation features, and many, many more.

When to Use Hadoop
This is a description of why Hadoop isn't always the right solution to Big Data problems, but that certainly doesn't mean that it's not a valuable project or that it isn't the best solution for a lot challenges. It's important to use the right tool for the job, and thinking critically about what features each tool provides is paramount to a project's success. In general, you should use Hadoop when:

Data access patterns will be very basic but analytics will be very complicated.
Your data needs absolutely guaranteed availability for both reading and writing.
There are inadequate traditional database-oriented tools which currently exist for your problem.

Do not use Hadoop if:

You're don't know exactly why you're using it.
You want to maximize hardware efficiency.
Your data fits on a single "beefy" server.
You don't have full-time staff to dedicate to it.

The easiest alternative to using Hadoop for Big Data is to use multiple traditional databases and architect your read and write patterns such that the data in one database does not rely on the data in another. Once that is established, it is much easier than you'd think to write basic aggregation routines in languages you're already invested in and familiar with. This means you need to think very critically about your app architecture before you throw more hardware at it.

Saturday, March 31, 2012

Why ELSA Doesn't Use Hadoop for Big Data