CS244 ‘16: An Accurate Sampling Scheme for Detecting SYN Flooding Attacks and Portscans


Introduction

In this paper, authors Korczyński, Janowski, and Duda develop a sampling scheme for detecting SYN flooding attacks and portscans. Protection from DDoS attacks is a particularly important part of Internet security, and being able to distinguish malicious packets from ordinary ones is incredibly valuable. There has been previous work on SYN flood detection mechanisms, but checking only a small, random fraction of packets makes it hard to draw conclusions about the behavior of the system as a whole. The authors propose a new sampling-based detection scheme so that routers can effectively detect SYN floods at low sampling rates, and they report a very high true positive rate and a low false positive rate.
Accurately detecting SYN flooding is a relevant security problem with one main goal: detecting malicious packets and blacklisting future packets from the same sources, without accidentally treating legitimate packets as malicious. After all, companies and organizations want their clients to be able to reach their servers without being blocked, while still being protected from attacks. We found the focus on improving the true positive rate while decreasing the false positive rate to be very important in this field. The other challenge with SYN flooding is detecting it without inspecting every single packet, and we were curious to see whether their sampling scheme actually provides reliable results without sacrificing accuracy.

Reproducing the Results

While this paper is about detecting SYN floods, for this reproduction we follow the authors' definition of a SYN flood: a client SYN is malicious if the final client ACK of the three-way handshake is never sent. Given this definition, our goal is to see whether their sampling methods detect these specific instances at the same true positive and false positive rates. In other words, the focus is on the effectiveness of their sampling techniques rather than on detecting all possible types of SYN floods.
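As a concrete sketch of how we applied this ground truth (the function and variable names below are our own, not taken from the paper): a source IP is labeled malicious if it sends a SYN for which the completing ACK of the handshake never appears in the trace.

def label_sources(packets):
    """packets: iterable of (src_ip, dst_ip, src_port, dst_port, tcp_flags).

    Returns (malicious_ips, benign_ips) under the paper's heuristic:
    a source is malicious if it leaves at least one handshake unfinished.
    """
    SYN, ACK = 0x02, 0x10
    pending = set()   # client-to-server connections with an outstanding SYN
    sources = set()   # every source IP that sent a SYN
    for src, dst, sport, dport, flags in packets:
        conn = (src, dst, sport, dport)
        if flags & SYN and not flags & ACK:      # initial client SYN
            pending.add(conn)
            sources.add(src)
        elif flags & ACK and not flags & SYN:    # possibly the final ACK
            pending.discard(conn)
    malicious = set(src for (src, dst, sport, dport) in pending)
    benign = sources - malicious
    return malicious, benign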
We aim to reproduce the three sampling schemes that they explain in the paper: systematic, random 1-out-of-N, and uniform probabilistic sampling. We focus on SYN flood detection and not on portscan detection.
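For reference, the three schemes can be sketched in a few lines of Python. This is our paraphrase of the paper's definitions rather than the authors' code, and it assumes the packets fit in a list:

import random

def systematic(packets, n):
    """Deterministically keep every n-th packet."""
    return [p for i, p in enumerate(packets) if i % n == 0]

def random_1_out_of_n(packets, n):
    """Keep one packet, chosen at random, from every window of n packets."""
    sampled = []
    for start in range(0, len(packets), n):
        sampled.append(random.choice(packets[start:start + n]))
    return sampled

def uniform_probabilistic(packets, p):
    """Keep each packet independently with probability p (roughly 1/n)."""
    return [pkt for pkt in packets if random.random() < p]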

Datasets

We used two main datasets: a CAIDA 2007 DDoS attack trace and an NTUA 2003 DDoS attack trace. The NTUA trace was the original dataset used in the paper. We got access to two more datasets, but unfortunately one of them did not work for our purposes (it was not a SYN flood DDoS) and the other had technical problems during download; we were unable to get anyone to respond to our emails about fixing the issue. We considered generating our own dataset by creating a SYN flood attack, but ultimately decided that the CAIDA dataset was large enough (several gigabytes).

Challenges

One of the biggest challenges for this project was acquiring SYN flooding datasets. After contacting the paper's authors for weeks, we got access to their original datasets, but only two days before the deadline. So although we found the paper's description of the sampling methods very clear, we did not have much time to debug our implementation of their detection and sampling methods to see why our graphs did not match the original.
We also tried to find datasets available online, but SYN flood pcap files were surprisingly difficult to find, and the ones we did find usually contained only a few packets. When we proposed this paper, we assumed that there would be plenty of SYN flood datasets available online, but there were actually only a few, and all of them required applications or prior approval. We reached out to several places to apply for their academic datasets and fortunately acquired one very good dataset (CAIDA) containing 16 GB of a DDoS SYN flood attack. This dataset was already preprocessed into pcap files, with the packets sent to and from the victim separated.
We also faced some unexpected implementation problems. We chose to use dpkt, a Python library for processing pcap files. Unfortunately, after writing most of our code and testing on smaller datasets, we ran it on a larger file and discovered a memory leak in the library. This made it tricky to run our code on very large pcap files, and ended up capping the maximum number of packets we could process and analyze at 27 million (on a machine with 32 GB of RAM).
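As a rough illustration of the workaround, here is a minimal sketch of streaming packets out of a pcap file with dpkt while capping the number of packets read; the helper name and the exact fields we extract are our own choices, not code from our repo.

import socket
import dpkt

def read_packets(path, max_packets=27000000):
    """Stream TCP packets from a pcap file, stopping after max_packets.

    Yields (src_ip, dst_ip, src_port, dst_port, tcp_flags) tuples.
    Note: if the trace stores raw IP packets rather than Ethernet frames,
    replace dpkt.ethernet.Ethernet(buf) with dpkt.ip.IP(buf).
    """
    with open(path, 'rb') as f:
        for count, (ts, buf) in enumerate(dpkt.pcap.Reader(f)):
            if count >= max_packets:
                break
            try:
                ip = dpkt.ethernet.Ethernet(buf).data
            except dpkt.dpkt.UnpackError:
                continue
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            yield (socket.inet_ntoa(ip.src), socket.inet_ntoa(ip.dst),
                   tcp.sport, tcp.dport, tcp.flags)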

Original Results

Here are the original graphs from the paper. They plot the true positive and false positive rates for detecting malicious IPs as a function of the sampling rate. Both graphs were calculated from packets corresponding to a high-volume SYN flood attack. You can see that after a certain sampling rate, the true positive rate approaches 100%, and the false positive rate is extremely low regardless of the sampling rate (with some small spikes around sampling rates of 0.04%-0.125%).
Our goal was to reproduce these graphs and see whether we could get similar results with a different dataset. We did not include points at sampling rates so low that fewer than one packet would be sampled, and instead of keeping all three sampling methods in one graph, we plotted each one individually.
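To be explicit about what we plot: the rates are computed over source IPs, comparing the IPs flagged as malicious from the sampled packets against the ground-truth labels from the full trace. A minimal sketch, with our own helper name:

def rates(detected, truth_malicious, all_sources):
    """True/false positive rates over source IPs.

    detected:        IPs flagged as malicious from the sampled packets
    truth_malicious: IPs that never completed a handshake in the full trace
    all_sources:     every source IP seen in the full trace
    """
    benign = all_sources - truth_malicious
    tp = len(detected & truth_malicious)
    fp = len(detected & benign)
    tpr = float(tp) / len(truth_malicious) if truth_malicious else 0.0
    fpr = float(fp) / len(benign) if benign else 0.0
    return tpr, fpr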

Original Paper's Dataset – True Positive Rate

Original Paper's Dataset – False Positive Rate

CAIDA Dataset – True Positive Rate

CAIDA Dataset – False Positive Rate

Interpreting Our Results

Ultimately, we ended up with the bizarre result that our CAIDA dataset graphs came close to the original graphs, while the graphs from the original dataset were completely different.
We have a few explanations for the results on the CAIDA dataset. We only used part of the CAIDA traffic, since we were capped at roughly 27 million packets due to the dpkt memory leak, but this is still over 50% larger than the original dataset's attack, which had 17 million packets. As a result, our true positive rate did not show the ramp-up seen in the paper, most likely because even a very small sampling rate still examined a large number of packets (for example, a 0.01% sampling rate over 27 million packets still examines roughly 2,700 packets), pushing our true positive rate close to 100% for almost all rates. The original graph did not reach that level until a sampling rate of about 0.03%, but stayed close to 100% afterwards, which suggests the sampling methods are very accurate as long as enough packets are examined.
Our CAIDA false positive rate was close to the original's as well, but not quite as small. One sampling rate produced a false positive rate of 1%, and the rest of the points were at or below 0.5%. The original graph's false positive rate never exceeds 0.015%, but the shape is relatively similar. This could also be explained by the size of the dataset: more traffic means more opportunities to accidentally blacklist legitimate IPs. Still, the sampling methods achieve a very high true positive rate and a very low false positive rate on this dataset, which matches the thesis of the paper.
As for our true positive rate graphs below a 0.1% sampling rate on the original dataset, these may simply demonstrate how inconsistent the true positive rate becomes when fewer than a certain number of packets are sampled. While the authors obtained a smoother graph showing steady improvement in SYN flood detection as the sampling rate increased, that may have been partly a lucky run; sampling below a certain number of packets may or may not produce good results.
However, for the true positive graphs at sampling rates above 0.1% and for the false positive graphs, we got unexpected results that differed substantially from the original ones. We are at a loss to explain the differences. We feel relatively confident in our implementation of the sampling methods based on the successful CAIDA graphs, so the problem most likely lies elsewhere. We suspect that the original authors may have done some processing of the pcap files (whereas the CAIDA dataset was already filtered down to packets going to the victim machine), or had a specific environment set up to read the traces. However, given that we only got the dataset two days before the deadline, we were unable to pinpoint the exact problem despite many attempts.

Critique

While the sampling methods seem to hold potential, we believe that their heuristic for detecting a SYN flood is narrow and most likely outdated. Labeling a source as malicious based on the specific heuristic that the final ACK is never sent makes it relatively easy for attackers to change their scheme to avoid detection. There are also plenty of other DDoS attacks, so basing a detection scheme on only one very specific type of attack may be successful in theory but not in practice. So while the sampling methods are a good way to detect DDoS attacks without inspecting every single packet, they are less useful without an adequate heuristic that covers a multitude of possible attacks.

Future Work

We only applied the sampling methods to SYN floods using the paper's ground-truth definition of what makes a packet malicious. However, if we had more time, we would have liked to compare different detection schemes using one of the sampling methods, to see whether the conclusions still hold with alternate DDoS detection heuristics. It would be interesting to find out whether the sampling methods work with alternate attacks, like ACK flooding.

Platform

We chose to implement this in Python. The original paper used TracesPlay and then imported the results into MATLAB, but we decided a Python script would be the easiest to reproduce. While the setup is a bit time-consuming (provisioning AWS instances, downloading the datasets, etc.), running the script is very simple and automatically generates the graphs corresponding to a pcap file.
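As a rough sketch of the kind of plotting the script does (the real code in our repo differs in detail), assuming matplotlib is installed:

import matplotlib.pyplot as plt

def plot_rates(sampling_rates, tpr, fpr, title):
    """Plot true and false positive rates against sampling rate (log x-axis)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.semilogx(sampling_rates, tpr, marker='o')
    ax1.set_xlabel('sampling rate')
    ax1.set_ylabel('true positive rate')
    ax2.semilogx(sampling_rates, fpr, marker='o')
    ax2.set_xlabel('sampling rate')
    ax2.set_ylabel('false positive rate')
    fig.suptitle(title)
    plt.show()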

README

First, request access to the CAIDA dataset. We cannot share these files without permission, but you should be able to get access in a few days. 
Next, in order to use dpkt, we recommend spinning up a standard Ubuntu instance on AWS with at least 50 GB of disk space and 32 GB of memory (this may be overkill, but it will definitely work).
SSH into the machine using the -X flag, so that the graphs can be displayed over X forwarding.
When you have the machine set up and your CAIDA username and password, download the attack files with the following command:
wget --user=<username> --password=<pass> -r -nH -nd -np -R index.html* https://data.caida.org/datasets/security/ddos-20070804/to-victim/
Then, clone our GitHub repo.
Run our setup script to install any dependencies that might be needed.
./installer.sh
In order to reproduce the results, run:
python detector.py <pcap_file>
The graph will pop up once the pcap file is fully processed and all the sampling methods have been run.
If you find that you hit a segmentation fault after processing some number of packets, pass the -m flag to cap the maximum number of packets read (in millions) from a file.
python detector.py -m 10 <pcap_file>
This will stop processing after 10 million packets. The AWS instance with 32 GB of RAM should allow you to read roughly 27 million packets, though.
We ran the CAIDA graphs with the pcap file ddostrace.to-victim.20070804_141436.pcap, and the original dataset with exp31.pcap. Both captured the incoming packets in the middle of the SYN Flood.

Feedback

The main issue was the memory leak in dpkt. We were able to work around it by running our code on machines with a lot of memory (Felipe's laptop has 16 GB of RAM, and Keziah spun up an AWS instance with 32 GB of RAM, after which we stopped having problems). However, if you want to run locally on machines with less memory, we would avoid this library. We would also advise future students against choosing a project that requires hard-to-acquire datasets: obtaining them was quite a hurdle, and when we finally did, we did not have much time left to get our implementation to succeed with them.

Citations

Original Paper
The CAIDA UCSD “DDoS Attack 2007” Dataset

 

One response to "CS244 '16: An Accurate Sampling Scheme for Detecting SYN Flooding Attacks and Portscans"

Reproducibility: 4

The original data set was inaccessible. The CAIDA data set results varied slightly in the false positive Uniform and Random 1-in-N graphs. We had more spikes around the 10^-2 sampling rate.

We were a bit confused as to how the actual sampling was done. It would be interesting to know what you were looking for in the packets. Also, we were wondering why all of the packets need to be loaded into memory. Would it be possible to process them as a stream?

It would have been nice to have more context on prior state-of-the-art techniques for detecting these attacks. The ones in this experiment do not seem that sophisticated.

– Ryan Hermstein and Andrew Lim
