In this paper, Korczyński, Janowski, and Duda develop a sampling scheme to detect SYN flooding attacks and port scans. Protection from DDoS attacks is a particularly important part of internet security, and being able to distinguish malicious packets from ordinary ones is incredibly valuable. There is prior literature on SYN flooding detection mechanisms, but randomly inspecting only a few packets makes it hard to draw conclusions about the whole system's behavior. The authors propose a new sampling detection scheme that lets routers detect SYN floods effectively at low sampling rates, and their proposal achieves a very high true positive rate and a low false positive rate.
Accurately detecting SYN flooding is a relevant security problem with one main goal: detecting a malicious packet and blacklisting future packets from its source, without accidentally treating a legitimate packet as malicious. After all, organizations want their clients to be able to reach their servers without being blocked, while still being protected against attacks. We found the focus on raising the true positive rate while lowering the false positive rate to be very important in this field. The other challenge with SYN flooding is detecting it without inspecting every single packet, and we were curious whether the paper's sampling scheme actually provides reliable results without sacrificing accuracy.
Reproducing the Results
While this paper is about detecting SYN floods, for this reproduction we follow the paper's definition of a SYN flood: a client SYN is malicious if the final client ACK of the handshake is never sent. Given this definition, our goal in reproducing the results is to see whether their sampling methods detect these specific instances at the same true positive/false positive rates. In other words, the focus is on the effectiveness of their sampling techniques rather than on detecting all possible types of SYN floods.
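To make that ground-truth definition concrete, here is a minimal sketch of how a trace can be labeled under it. The function name, the tuple representation, and the example IPs are ours, not the paper's; a real implementation would also match connections on ports and sequence numbers.

```python
# Sketch of the ground-truth labeling we follow: a source is malicious
# if it sent a SYN but never completed the handshake with a final ACK.
# (Representation and names are ours; real code would track full 4-tuples.)

def label_sources(packets):
    """packets: iterable of (src_ip, is_syn, is_ack) tuples.

    Returns the set of source IPs that sent a SYN but never
    followed up with an ACK (labeled malicious).
    """
    pending = set()    # sources with an outstanding SYN
    completed = set()  # sources that later sent an ACK
    for src, is_syn, is_ack in packets:
        if is_syn and not is_ack:
            pending.add(src)
        elif is_ack and src in pending:
            completed.add(src)
    return pending - completed

# Example: 10.0.0.1 completes the handshake, 10.0.0.2 never ACKs.
trace = [("10.0.0.1", True, False), ("10.0.0.1", False, True),
         ("10.0.0.2", True, False)]
# label_sources(trace) -> {"10.0.0.2"}
```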
We aim to reproduce the three sampling schemes explained in the paper: systematic, random 1-out-of-N, and uniform probabilistic sampling. We focus on SYN flood sampling and not on port-scanning behavior.
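The three schemes can be sketched in a few lines each. This is our own list-based illustration of the standard definitions, not the paper's code; a real detector would sample from a packet stream rather than a materialized list.

```python
import random

def systematic(packets, n):
    """Systematic sampling: keep every n-th packet, deterministically."""
    return [p for i, p in enumerate(packets) if i % n == 0]

def one_out_of_n(packets, n, rng=random):
    """Random 1-out-of-N: pick one packet uniformly from each block of n."""
    out = []
    for start in range(0, len(packets), n):
        out.append(rng.choice(packets[start:start + n]))
    return out

def uniform_probabilistic(packets, p, rng=random):
    """Uniform probabilistic: keep each packet independently with probability p."""
    return [pkt for pkt in packets if rng.random() < p]
```

All three examine roughly the same fraction of traffic (1/n or p), but differ in how predictable the sampled positions are.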
We used two main datasets: a CAIDA 2007 DDoS attack and an NTUA 2003 DDoS attack. The NTUA trace was the original dataset used in the paper. We got access to two more datasets, but unfortunately one turned out not to be a SYN flood DDoS and so did not suit our purposes, and the other had technical problems during download that we could not resolve; our emails about the problem went unanswered. We considered generating our own dataset by launching a SYN flood ourselves, but ultimately decided the CAIDA dataset was large enough (several gigabytes).
One of the biggest challenges of this project was acquiring SYN flooding datasets. After contacting the paper's authors for weeks, we got access to their original datasets, but only two days before the deadline. So although we found the paper's description of the sampling methods very clear, we did not have much time to debug our implementation of their detection/sampling methods to figure out why our graphs did not match the originals.
We also tried to find datasets available online, but SYN flood pcap files were surprisingly difficult to find, and those we did find usually contained only a few packets. When we proposed this paper, we assumed there would be plenty of SYN flood datasets available online, but there were actually only a few, and all of them required applications or prior approval. We reached out to several places to apply for their academic datasets and fortunately acquired one very good dataset (CAIDA) containing 16 GB of a SYN flood DDoS attack. This dataset was already preprocessed into pcap files separating the packets sent to and from the victim.
We also faced some unexpected implementation problems. We chose dpkt, a Python library for processing pcap files. Unfortunately, after writing most of our code and testing it on smaller datasets, we ran it on a larger file and discovered a memory leak in the library. This made it tricky to run our code on very large pcap files and ultimately capped the number of packets we could process and analyze at 27 million (on a machine with 32 GB of RAM).
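The cap itself is simple to express. Below is a sketch of the wrapper idea, assuming any packet iterator (such as one produced by a pcap reader); the function name and the millions-based unit mirror our script's -m flag, but this standalone version is illustrative rather than our exact code.

```python
from itertools import islice

def capped(packet_iter, max_millions=None):
    """Yield packets from any iterator, stopping after
    max_millions * 1e6 packets to bound memory use.

    With max_millions=None, the iterator is passed through unchanged.
    """
    if max_millions is None:
        yield from packet_iter
    else:
        yield from islice(packet_iter, int(max_millions * 1_000_000))
```

In practice we wrapped the pcap iteration this way, so hitting the cap stops reading cleanly instead of crashing partway through a large file.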
Here is the original graph from the paper. It plots the true positive/false positive rates of detecting malicious IPs against the sampling rate. Both graphs were computed from packets in a high-volume SYN flood attack. After a certain sampling rate, the true positive rate approaches 100%, and the false positive rate is extremely low regardless of sampling rate (with some small spikes around sampling rates of 0.04%–0.125%).
Our goal was to reproduce this graph and see whether we could get similar results with a different dataset. We did not include points at sampling rates so low that fewer than one packet would be sampled, and instead of combining all three sampling methods into one graph, we plotted each one individually.
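The quantities plotted on these graphs reduce to simple set arithmetic over source IPs. This is our own sketch of the computation (names are ours): given the IPs a sampled run blacklisted, the ground-truth malicious IPs, and all IPs seen in the trace, the rates fall out directly.

```python
def rates(flagged, malicious, all_sources):
    """True/false positive rates for a set of flagged source IPs.

    flagged: IPs the sampled detector blacklisted
    malicious: ground-truth malicious IPs
    all_sources: every source IP seen in the trace
    """
    legit = all_sources - malicious
    tp = len(flagged & malicious)   # malicious IPs correctly caught
    fp = len(flagged & legit)       # legitimate IPs wrongly blacklisted
    tpr = tp / len(malicious) if malicious else 0.0
    fpr = fp / len(legit) if legit else 0.0
    return tpr, fpr
```

Repeating this for each sampling rate gives one point per rate on each curve.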
Original Paper's Dataset – True Positive Rate
Original Paper's Dataset – False Positive Rate
CAIDA Dataset – True Positive Rate
CAIDA Dataset – False Positive Rate
Interpreting Our Results
Ultimately, we ended up with the odd result that our CAIDA graphs came close to the originals, while our graphs for the original dataset were completely different.
We have a few explanations for the CAIDA results. We only used part of the CAIDA traffic, since we were capped at roughly 27 million packets by the dpkt memory leak, but that is still over 50% larger than the original dataset's attack of 17 million packets. Our true positive rate therefore did not show the ramp-up seen in the paper, most likely because even a very small sampling rate still yielded a large number of examined packets, keeping our true positive rate close to 100% at almost all rates. The original graph did not reach that level until a sampling rate of about 0.03%, but stayed close to 100% afterwards, suggesting the sampling methods are very accurate as long as enough packets are examined.
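The "enough packets examined" point can be made concrete with a quick back-of-the-envelope calculation using the dataset sizes above (the exact sampled counts depend on the scheme, so these are rough expected values):

```python
# Approximate number of packets examined at a given sampling rate.
original_pkts = 17_000_000   # NTUA attack trace (from the paper)
caida_pkts = 27_000_000      # our dpkt-limited CAIDA slice

rate = 0.0003  # 0.03%, where the paper's true positive rate first nears 100%
print(round(original_pkts * rate))  # ~5100 packets examined
print(round(caida_pkts * rate))     # ~8100 packets examined
```

So even at rates where the original trace yields only a few thousand sampled packets, the larger CAIDA slice yields noticeably more, which is consistent with our flatter true positive curve.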
Our CAIDA false positive rate was close to the original's as well, but not quite as small. One sampling rate gave a false positive rate of 1%, and the remaining points were at or below 0.5%. The original graph's false positive rate never exceeds 0.015%, but the shape is relatively similar. This too may be explained by the dataset's size: more traffic means more opportunities to accidentally blacklist legitimate IPs. Still, the sampling methods achieve a very high true positive rate and a very low false positive rate on this dataset, which matches the thesis of the paper.
As for our true positive rate graphs below a 0.1% sampling rate on the original dataset, they may simply demonstrate how inconsistent the true positive rate becomes when fewer than a certain number of packets are sampled. While the authors obtained a smoother curve showing steady improvement in SYN flood detection as the sampling rate increased, that may have been partly a lucky run; sampling below a certain number of packets may or may not produce good results.
However, for the true positive graphs at sampling rates above 0.1% and for the false positive graphs, we got unexpected results that differed substantially from the originals. We are at a loss to explain the differences. We feel relatively confident in our implementation of the sampling methods, given the successful CAIDA graphs, so the problem most likely lies elsewhere. We suspect the original authors may have done additional preprocessing of the pcap files (the CAIDA dataset was already filtered down to packets headed to the victim machine) or had a specific environment set up to read the traces. However, since we only received the dataset two days before the deadline, we were unable to pinpoint the exact problem despite many attempts.
While the sampling methods show potential, we believe the paper's heuristic for detecting a SYN flood is narrow and most likely outdated. Flagging packets based on the specific heuristic of a missing final ACK makes it relatively easy for attackers to adapt their scheme to avoid detection. There are also plenty of other DDoS attacks, so building a detection scheme around one very specific type of attack may work in theory but not in practice. So while the sampling methods are a good way to detect DDoS attacks without inspecting every single packet, they are less useful without an adequate heuristic scheme that covers a multitude of possible attacks.
We only applied the sampling methods to SYN floods using the paper's ground truth for what makes a packet malicious. With more time, we would have liked to compare different detection schemes under one of the sampling methods to see whether the results hold for alternate DDoS detection heuristics. It would be interesting to find out whether the sampling methods work with other attacks, such as ACK flooding.
We chose to implement everything in Python. The original paper used TracesPlay and then imported the results into MATLAB, but we decided a Python script would be easiest to reproduce. While the setup is somewhat time-consuming (specific AWS instances, downloading the datasets, etc.), running the script is very simple and automatically generates the graphs corresponding to a pcap file.
First, request access to the CAIDA dataset. We cannot share these files without permission, but you should be able to get access in a few days.
Next, in order to use dpkt, we recommend spinning up a standard Ubuntu instance on AWS with at least 50 GB of disk space and 32 GB of memory (this may be overkill, but it will definitely work).
SSH into the machine using the -X flag so that the graphs can be displayed over X forwarding.
When you have the machine set up and your CAIDA username and password, download the attack files with the following command:
wget --user=<username> --password=<pass> -r -nH -nd -np -R index.html* https://data.caida.org/datasets/security/ddos-20070804/to-victim/
Then, clone our github repo.
Run our setup script to install any dependencies that might be needed.
In order to reproduce the results, run:
python detector.py <pcap_file>
The graph will pop up once the pcap file has been fully processed and all the sampling methods have run.
If you find that you're hitting a segmentation fault after processing some number of packets, pass the -m flag to cap the maximum number of packets read (in millions) from the file.
python detector.py -m 10 <pcap_file>
This will stop reading once 10 million packets have been processed. The AWS instance with 32 GB of RAM should allow you to read roughly 27 million packets, though.
We ran the CAIDA graphs with the pcap file ddostrace.to-victim.20070804_141436.pcap, and the original dataset with exp31.pcap. Both captured the incoming packets in the middle of the SYN Flood.
The main issue was the memory leak in dpkt. We worked around it by running our code on machines with plenty of memory (Felipe's laptop has 16 GB of RAM, and Keziah spun up an AWS instance with 32 GB of RAM, after which we stopped having problems). However, if you want to run locally on a machine with less memory, we would avoid this library. We would also advise future students against choosing a project that requires difficult-to-acquire datasets; obtaining them was quite a hurdle, and once we finally did, we had little time left to get our implementation working with them.
The CAIDA UCSD “DDoS Attack 2007” Dataset