Machine learning for malware and intrusion detection: dataset design, cost-aware models, and research pitfalls
Abstract
Information technology has reduced the constraints of physical distance and the delays associated with traditional methods in areas such as medicine, the economy, and industry. However, it also exposes users to threats from hackers and cybercriminals. As information technology advances, these threats become smarter and more complex, producing a cat-and-mouse game that continually escalates. Machine learning improves security tools such as malware and intrusion detection systems by learning from past experience, but it requires high-quality datasets to create effective models. The first paper in this thesis, eBPF-Powered Dynamic Analysis for Linux Malware Detection: A Dataset and Experimental Study, explores the application of machine learning to malware detection. It introduces an automated eBPF-based data collection pipeline that uses Docker containers to generate labeled traces from malware-infected and clean environments. We construct a dataset of clean and infected Linux operating systems and use various machine learning techniques to identify patterns in Linux system calls that indicate whether the operating system is infected, achieving a detection F1-score of up to 99% with Random Forest models. Machine learning can also be used to develop intrusion detection systems. Two critical components of such systems are the dataset and the models. However, popular network attack datasets suffer from class imbalance, with large disparities in the number of instances between classes (e.g., benign traffic can have thousands of samples, while rare attack types may have fewer than 50). This imbalance can severely affect model performance; for example, rare attack classes may be outnumbered 40:1 by benign traffic, which can significantly reduce recall for these classes. To address this issue, over- and undersampling methods are used to balance datasets before they are fed into the learning algorithms.
However, undersampling may discard important data, whereas oversampling can introduce redundancy, and either effect can ultimately weaken the model's performance. Furthermore, the speed with which an intrusion detection tool makes decisions plays a vital role in its effectiveness. The second paper in this thesis, Cost-Aware Machine Learning for Intrusion Detection: A Performance Trade-Off Study, demonstrates that by sacrificing a negligible amount of accuracy, it is possible to obtain models that are tens of times faster and use significantly less memory, making them practical for real-time deployment. This is accomplished by exploring combinations of different deep learning and machine learning models with various over- and undersampling methods. Furthermore, the paper proposes twelve prediction cost functions that integrate these trade-offs alongside traditional performance measures. A slow intrusion detection tool can otherwise become a bottleneck in a network, highlighting the need for models that balance accuracy and efficiency. The third paper, Power and Pitfalls of ML-Based Intrusion Detection Systems, examines key challenges in developing machine learning-based intrusion detection systems, with a focus on both dataset generation and model design. It highlights issues such as the lack of representative datasets and the limited generalizability of models. The paper identifies ten significant research barriers and their interconnections, showing how one barrier may lead to one or more others. The study includes a statistical analysis of dozens of research papers, revealing the current state of the field. Two best-practice checklists are proposed to guide future work in dataset creation and IDS research, with the aim of improving the quality and reliability of publications in this domain.
Together, these three studies provide a comprehensive framework for designing more accurate, efficient, and trustworthy machine learning-based security tools such as malware and intrusion detection systems. By combining practical data generation, cost-aware modeling, and a critical analysis of research pitfalls, this thesis contributes to more robust and realistic security research and practice.
