Monday, April 11, 2011

Network Intrusion Detection Systems


Incident response starts with a seed event that triggers an investigation. In my organization, this usually starts with a network intrusion detection system (NIDS) generating an alert based on predefined signatures. There are two major open-source NIDS: Snort and Suricata. There are lots of FAQs and descriptions on the respective project web sites, so I will not cover the basics here. Instead, I will discuss some of the differences between them as well as performance guidelines.

Architecture Overview
The general flow of information is the same for both Snort and Suricata, though their internals differ substantially. At a high level, the network inspection looks like this:
Network packets → Packet header parsing → Payload normalization → Payload inspection → Alert generation
Generally, the processing effort for each phase is distributed as 10% for parsing, 10-20% for normalization, and 70-80% for payload inspection. Alert generation is usually negligible. If the sum total of effort it takes to complete these phases for a given packet exceeds the resources available, the packet will be “dropped,” meaning it goes unnoticed. Therefore, performance tuning is critical to reliably detecting intrusions.

Capacity Planning
Clearly, the vast majority of the processing cost comes from payload inspection. This means that the reliability of a sensor is ultimately decided by whether it has enough resources to inspect the payloads of the traffic it is assigned. (Some IDS events are generated solely based on IP addresses, but they are an exception.) When sizing hardware to create an IDS, here is my rule of thumb:
1 CPU per (1000 signatures) * (500 megabits of network traffic)
That is, you need one CPU for every thousand signatures inspecting 500 megabits of network traffic; put another way, CPUs required = (signatures / 1000) * (megabits of traffic / 500). So if your rule set has 4000 signatures and your Internet gateway carries 300 megabits of traffic, you will need at least (4000 / 1000) * (300 / 500) = 4 * 0.6 = 2.4 CPUs, meaning you'll need to spread the traffic across three CPUs. I should take a moment to point out that this formula applies to the standard traffic mix at most organizations, in which web makes up 80-90% of the traffic, followed by email. In a server hosting environment, you will need to establish your own benchmarks.
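If you want to script that check (for instance, to re-run it as your rule set grows), a quick shell one-liner does the math; the 4000 and 300 below are just the example numbers from above:
# CPUs = (signatures / 1000) * (Mbps / 500)
echo "4000 300" | awk '{ printf "CPUs needed: %.1f\n", ($1 / 1000) * ($2 / 500) }'
This prints 2.4, which you would round up to three CPUs.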

Sizing Preprocessor Overhead
After you've acquired a box but before you start figuring out how many signatures your resources can support, you need a baseline of the performance when running only the payload normalization, usually referred to as the preprocessors. Doing so is very simple: run the sensor with no rules loaded during a peak traffic period (around noon in an office environment). If you run the NIDS without daemonizing, under the time command, you can get a pretty accurate reading on how much CPU time it takes. Here's an example:
time snort -c /etc/snort/snort-norules.conf
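Here, snort-norules.conf is assumed to be a copy of your production snort.conf with every rule include commented out so that only the preprocessors run, e.g.:
# include $RULE_PATH/exploit.rules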
Let it run for around five minutes, then kill it with Ctrl+C. The time command will print out stats for the run that look something like this:
real 4m33.143s
user 0m36.218s
sys 0m14.937s
“User” and “sys” refer to how much time was spent performing the inspection and how much time was spent moving packets from the network card into RAM and then into the NIDS, respectively. Add up “user” and “sys” and divide by “real” to get the percentage of CPU required. In this run, it would be (36.2 + 14.9)/273.1 ≈ .19, or roughly 19%. That is the percentage of a CPU it takes to normalize the payloads. Keep in mind that during off-peak hours most packets will not carry large payloads, so a baseline measured then will not reflect what you will see during peak periods; measure at peak.
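If you'd rather not do the division by hand, the same one-liner approach works; the three numbers are the user, sys, and real seconds from the sample run above:
echo "36.218 14.937 273.143" | awk '{ printf "CPU fraction: %.2f\n", ($1 + $2) / $3 }'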

Detecting Packet Drops
Any libpcap-based IDS like Suricata and Snort will give you packet drop statistics. However, these numbers are unreliable, especially in high-drop situations. One way you can quickly find out if you are dropping packets is to run the above test but with your rules loaded. If your total CPU percentage is over 75%, there is a very good chance that you are occasionally dropping packets. If it's 95% or more, you are frequently dropping packets.

However, despite the above math, detecting packet loss when it really counts is still something of an art, as it's very difficult to account for any given packet. The best way to determine whether your sensor is generally catching what it is supposed to catch is to set up a “heartbeat” signature. This consists of two parts: a script that makes a web request to a test site at a regular interval, and a signature designed to alert on that request. Here's an example Perl command:
perl -MLWP::UserAgent -e 'LWP::UserAgent->new()->get("http://example.com/testheartbeat123");'
That will make a web request to example.com. You should replace example.com with a site you have permission to make requests against.

The second part is to write a signature that detects the heartbeat. Here's a corresponding Snort/Suricata rule:
alert tcp any any -> any 80 (msg:"Heartbeat"; content:"/testheartbeat123"; http_uri; classtype:not-suspicious; sid:1; rev:1;)
Put an entry in your sensor's crontab (this is assuming you're using Linux) to make the request every minute:
* * * * * perl -MLWP::UserAgent -e 'LWP::UserAgent->new()->get("http://example.com/testheartbeat123");' > /dev/null 2>&1

Now you should get an alert every minute, on the minute, for your heartbeat signature. If there is ever a missing entry in your alert log, you know the sensor had a lapse in coverage at that moment. If you log your alerts to a database, you can create graphs using a spreadsheet to plot the times when your sensor was overloaded. You may also consider setting up a monitoring script that feeds a program like Nagios to detect when an entry is missing. The really nice thing about this setup is that it is a true check of the entire chain: traffic sourcing, detection, and alert output. If any part of the chain breaks, you'll be able to tell.
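Here is a minimal sketch of such a check, meant to be run from cron every few minutes and consumed by Nagios or a similar monitor. The alert log path and the message text are assumptions; adjust them to whatever output plugin and signature you actually use, and note that log rotation will reset the running count:
#!/bin/bash
# Hypothetical heartbeat watchdog: exit critical (Nagios convention) if no
# new "Heartbeat" alerts have been written since the last run.
ALERT_LOG=/var/log/snort/alert      # assumed flat-file alert output
STATE_FILE=/var/tmp/heartbeat.count

CURRENT=$(grep -c "Heartbeat" "$ALERT_LOG" 2>/dev/null)
CURRENT=${CURRENT:-0}
PREVIOUS=$(cat "$STATE_FILE" 2>/dev/null)
PREVIOUS=${PREVIOUS:-0}
echo "$CURRENT" > "$STATE_FILE"

if [ "$CURRENT" -gt "$PREVIOUS" ]; then
    echo "OK: $((CURRENT - PREVIOUS)) new heartbeat alert(s)"
    exit 0
else
    echo "CRITICAL: no new heartbeat alerts since last check"
    exit 2
fi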

The caveat, of course, is that it is very possible to get a heartbeat alert in the same minute that the sensor is overloaded, so this is better for establishing trends, e.g. “the sensor got all 60 heartbeats last hour” versus “there's no way we could've missed a packet that minute, because we got a heartbeat.” The absolute, definitive test is to record all traffic to pcap for a given amount of time and replay it as a readfile through the NIDS. If the alerts from the readfile match the alerts you got live, then you're all set.
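If you want to try that comparison, something like the following works with stock tcpdump and Snort; the interface name and file paths are just placeholders:
# Record a full-payload sample of live traffic (Ctrl+C when you have enough)
tcpdump -i eth1 -s 0 -w /tmp/sample.pcap
# Later, replay the capture through the same configuration as a readfile
snort -c /etc/snort/snort.conf -r /tmp/sample.pcap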

Multi-CPU Setups
Snort, unlike Suricata, is single-threaded. This means that any time a single CPU cannot handle the load, packets will be lost. Suricata, by contrast, will attempt to use all of the CPUs on the sensor and will load-balance the traffic across them, so there is little tuning needed in this regard. In order to inspect more traffic than a single CPU can handle with Snort, you will need to run multiple Snort instances. In addition to the extra management overhead, this means that you will have to find a way to split the traffic evenly across those instances. The easiest way to do this is to use Luca Deri's PF_RING module to create a pfring DAQ for Snort. Details can be seen on Luca's blog.
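As a rough sketch of what you end up with, each instance is started with the pfring DAQ and a shared cluster ID so that PF_RING can balance flows between them. The DAQ variable names and paths vary by PF_RING/DAQ build, so treat these as placeholders and consult Luca's documentation:
# Two instances sharing PF_RING cluster 10, each with its own log directory
snort -D -c /etc/snort/snort.conf -i eth1 -l /var/log/snort1 --daq pfring --daq-var clusterid=10
snort -D -c /etc/snort/snort.conf -i eth1 -l /var/log/snort2 --daq pfring --daq-var clusterid=10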

It should be noted that when running Snort with multiple instances, each instance has to normalize the traffic on its own. Suricata improves on this by normalizing the traffic once (depending on configuration), then pushing the normalized traffic to worker threads which perform the payload inspection. With multiple Snort instances, therefore, you will incur the 10-20% normalization CPU penalty once per instance.

Advanced Performance Enhancement
I stated above that about 10% of the CPU is devoted to the initial packet header parsing. This can be reduced or even eliminated through several means. With PF_RING, as mentioned above, that overhead drops drastically, to more like 1-2%. This can be very important if the link carries gigabit or faster traffic but you only want to inspect some of it. Filtering high-speed networks can be CPU-intensive unless you use something like PF_RING to offload the packet filtering.

Another alternative is to purchase a pcap acceleration card, like the DAG cards manufactured by Endace. These cards range in price from a few thousand to tens of thousands of dollars, depending on whether they are built for 1 or 10 gigabit links. They offload the entire burden of packet header parsing and filtering from the CPU onto the card's built-in packet processor.

Lastly, you can purchase a pcap load balancer, such as those made by Datasys and Gigamon. The feature sets range from basic replication to appliance-based pcap filtering. This option is the most effective but also the most costly.

Conclusion
If at all possible, I recommend diversifying your security portfolio by running both Snort and Suricata. Rapid development continues on Suricata (as well as improvements to Snort), so I hesitate to make any claims right now regarding which performs better overall. It is clear that Suricata has some major advantages when it comes to IP-only rules as well as detecting protocols on non-standard ports, but Snort has a solid track record and a mature codebase.

Saturday, April 2, 2011

Lizamoon: Knowing is Half the Battle

In my last post, I showed the different log sources that are readily available for collection, including a custom one: httpry_logger. Having a detailed record of every URL visited, instantly accessible via ELSA, makes incident response for well-publicized events very simple. This week, many news sites reported on a new attack made famous by Websense. Across the globe, IT security analysts were asking themselves (and, in many cases, being asked by management) whether anyone had been affected by the malicious links the attack had scattered across hundreds of thousands of web sites. Depending on the tools available to the analyst, this can be a time-consuming task. With ELSA being fed by httpry_logger, however, this question can be answered while your boss is still standing at your desk!

The question: Did anyone visit the malicious web page, lizamoon.com (now offline), and if so, what happened? In ELSA query language, which is intentionally very similar to Google query language, the query is:
site:lizamoon.com

There are lots of hits, and your boss looks worried! Looks like a lot of cross-site scripting (XSS) in there. However, Websense reported that the site had already been taken offline. As you can see, many of these requests did not receive a response. So, let's drill down a bit and ask: who visited the malicious site and got data back? ELSA:
site:lizamoon.com content_length>=1

That looks better, but there are still a few hits. We need to drill down to see what the malicious content was. For this, we'll need to use a different tool, StreamDB, which I will describe in detail in a later post. For now, let's have a look at the StreamDB output for our query:

Returning 2 of 2 at offset 0 from Mon Mar 28 17:57:17 2011 to Mon Mar 28 17:57:17 2011

2011-03-28 17:57:17 x.x.x2.4:57986 -> 95.64.9.18:80 0s 594 bytes RST GET lizamoon.com/ur.php
oid=12447-2986004740-594-0
GET /ur.php
Connection: Keep-Alive
Accept: */*
Accept-Encoding: gzip, deflate
Accept-Language: en-us
Host: lizamoon.com
Referer: http://www.designbasics.com/search/cdl_template/cdl-photo.asp?sPlanNo=24059&Exposure=33a-400&Path=http://www.designbasics.com/designs/24000/&ViewType=UPPER&IsPhoto=False&lPath=http://www.designbasics.com/designs/24000/&HomeTour=False&PlanName=Chatham&SquareFeet=1593&MaxWidth=&MaxDepth=
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 1.0.3705; InfoPath.1; .NET CLR 2.0.50727; MS-RTC LM 8)
X-HTTP-Version: 1.1

2011-03-28 17:57:17 x.x.x2.4:57986 <- 95.64.9.18:80 0s 975 bytes RST 200 ASCII text, with very long lines, with no line terminators
oid=12447-2986005334-975-0
200 OK
Connection: Keep-Alive
Date: Mon, 28 Mar 2011 22:56:18 GMT
Server: Apache/2.2.17 (FreeBSD) mod_ssl/2.2.17 OpenSSL/0.9.8n DAV/2 PHP/5.3.3
Content-Length: 650
Content-Type: text/html
Keep-Alive: timeout=5, max=100
Set-Cookie: click888=1; expires=Wed, 27-Apr-2011 22:56:18 GMT
X-HTTP-Version: 1.1
X-Powered-By: PHP/5.3.3

document.location = 'http://defender-nrpr.in/scan1b/237?sessionId=0500(snip)

So, content did indeed come down, and obviously, the goal is to set the browser's location to the next malicious site, defender-nrpr.in. We can then use ELSA to see if the client did follow to that site:
site:defender-nrpr.in
No results found, so our client is ok! So what did our client do? We do an ELSA query for that client IP for the few seconds around the request and see that there were no other suspicious requests. Looks like everything is fine, and management can rest easy.
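For reference, that drill-down looks something like the query below, narrowed to the relevant time window in the ELSA interface. The srcip field matches the groupby:srcip field used later, but the exact field names depend on your ELSA class definitions, and x.x.x2.4 is the sanitized client address from the StreamDB output above:
srcip:x.x.x2.4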

Let's dig a little deeper in ELSA to see what other information we can find out about lizamoon. What are the hacked sites that are funneling traffic to the malicious drop site? ELSA:
site:lizamoon.com groupby:referer

This shows us the top unique page URIs that are linking to the malicious site. What IP addresses are serving lizamoon.com? ELSA:
site:lizamoon.com groupby:dstip
Who went there?
site:lizamoon.com groupby:srcip

Because each one of these queries finishes in a second or two, we can answer all of these questions in under thirty seconds, which is short enough to do while your boss is right there! This process of broad queries followed by drill-downs is the staple of efficient investigation. A lot of tools (Websense, for instance) will give you the ability to run the broad queries and, to some extent, the drill-downs, but the difference is that the follow-up queries take far more time in large organizations, which makes analysts less likely to follow every lead, since they have to constantly prioritize their time. Queries that take almost no time are much more likely to be made. So if knowing is half the battle, then asking the questions is the other half.