TY - GEN
T1 - BinAdapter
T2 - 19th ACM Asia Conference on Computer and Communications Security, AsiaCCS 2024
AU - Murodova, Nozima
AU - Koo, Hyungjoon
N1 - Publisher Copyright:
© 2024 Association for Computing Machinery.
PY - 2024/7/1
Y1 - 2024/7/1
N2 - Binary reverse engineering is crucial to gaining insights into the inner workings of a stripped binary. Yet, it is challenging to read the original semantics from a binary code snippet because high-level information in the source, such as function names, variable names, and types, is unavailable. Recent advancements in deep learning show the possibility of recovering such vanished information with a model well trained on a pre-defined dataset. Despite a static model’s notable performance, it can hardly cope with an ever-increasing data stream (e.g., compiled binaries) by nature. The two viable approaches for ceaseless learning are retraining the model from scratch on the whole dataset and fine-tuning a pre-trained model; however, retraining suffers from large computational overheads, and fine-tuning suffers from performance degradation (i.e., catastrophic forgetting). Lately, continual learning (CL) has tackled the problem of handling incremental data in security domains (e.g., network intrusion detection, malware detection) using reasonable resources while maintaining performance in practice. In this paper, we focus on how CL assists in the improvement of a generative model that predicts a function symbol name from a series of machine instructions. To this end, we introduce BinAdapter, a system that can infer function names from an incremental dataset without performance degradation on the original dataset by leveraging CL techniques. Our major finding shows that incremental tokens in the source (i.e., machine instructions) or the target (i.e., function names) largely affect the overall performance of a CL-enabled model. Accordingly, BinAdapter adopts three built-in approaches: (1) inserting adapters when there are no incremental tokens in either the source or the target, (2) harnessing multilingual neural machine translation (M-NMT) and fine-tuning the source embeddings on top of (1) when there are incremental tokens in the source, and (3) fine-tuning the target embeddings on top of (2) when there are incremental tokens in both. To demonstrate the effectiveness of BinAdapter, we evaluate the above three scenarios using incremental datasets with or without a set of new tokens (e.g., unseen machine instructions or function names), spanning different architectures and optimization levels. Our empirical results show that BinAdapter outperforms state-of-the-art CL techniques by up to 24.3% in F1 or 21.5% in ROUGE-L.
KW - Binary analysis
KW - Continual learning
KW - Reverse engineering
KW - Software security
UR - https://www.scopus.com/pages/publications/85199250056
U2 - 10.1145/3634737.3645006
DO - 10.1145/3634737.3645006
M3 - Conference contribution
AN - SCOPUS:85199250056
T3 - ACM AsiaCCS 2024 - Proceedings of the 19th ACM Asia Conference on Computer and Communications Security
SP - 852
EP - 865
BT - ACM AsiaCCS 2024 - Proceedings of the 19th ACM Asia Conference on Computer and Communications Security
PB - Association for Computing Machinery, Inc
Y2 - 1 July 2024 through 5 July 2024
ER -