BinAdapter: Leveraging Continual Learning for Inferring Function Symbol Names in a Binary

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Binary reverse engineering is crucial to gaining insights into the inner workings of a stripped binary. Yet, it is challenging to read the original semantics from a binary code snippet because high-level information from the source, such as function names, variable names, and types, is unavailable. Recent advancements in deep learning show the possibility of recovering such vanished information with a model trained on a pre-defined dataset. Despite a static model’s notable performance, it can hardly cope with an ever-increasing data stream (e.g., compiled binaries) by nature. The two viable approaches for ceaseless learning are retraining on the whole dataset from scratch and fine-tuning a pre-trained model; however, retraining suffers from large computational overheads and fine-tuning from performance degradation (i.e., catastrophic forgetting). Lately, continual learning (CL) tackles the problem of handling incremental data in security domains (e.g., network intrusion detection, malware detection) using reasonable resources while maintaining performance in practice. In this paper, we focus on how CL assists in the improvement of a generative model that predicts a function symbol name from a series of machine instructions. To this end, we introduce BinAdapter, a system that can infer function names from an incremental dataset without degrading performance on the original dataset by leveraging CL techniques. Our major finding shows that incremental tokens in the source (i.e., machine instructions) or the target (i.e., function names) largely affect the overall performance of a CL-enabled model.
Accordingly, BinAdapter adopts three built-in approaches: (1) inserting adapters when neither the source nor the target contains incremental tokens, (2) harnessing multilingual neural machine translation (M-NMT) and fine-tuning the source embeddings together with (1) when the source contains incremental tokens, and (3) fine-tuning the target embeddings together with (2) when both contain incremental tokens. To demonstrate the effectiveness of BinAdapter, we evaluate the above three scenarios using incremental datasets with or without a set of new tokens (e.g., unseen machine instructions or function names), spanning different architectures and optimization levels. Our empirical results show that BinAdapter outperforms state-of-the-art CL techniques by up to 24.3% in F1 or 21.5% in Rouge-L.
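To illustrate approach (1), the snippet below is a minimal sketch of the general bottleneck-adapter idea: a small residual down-projection/up-projection module attached to a frozen base model, so that only the adapter weights are trained on incremental data. All names, dimensions, and initialization choices here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Adapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.

    In a CL setting, only these small matrices would be trained on the
    incremental dataset; the frozen base model's weights stay untouched,
    which is what avoids catastrophic forgetting.
    """
    def __init__(self, d_model, d_bottleneck, rng=None):
        rng = rng or np.random.default_rng(0)
        # Small trainable matrices; the up-projection starts at zero so the
        # adapter initially behaves exactly like the identity function.
        self.W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.W_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, h):
        # Residual connection: h + up(relu(down(h)))
        return h + relu(h @ self.W_down) @ self.W_up

# A hidden state from a (hypothetical) frozen transformer layer:
h = np.ones((4, 512))                       # 4 token states, d_model = 512
adapter = Adapter(d_model=512, d_bottleneck=64)
out = adapter(h)
print(out.shape)                            # (4, 512) -- shape is preserved
```

With the zero-initialized up-projection, the adapter starts as an identity mapping, so inserting it does not perturb the pre-trained model before any incremental training takes place.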

Original language: English
Title of host publication: ACM AsiaCCS 2024 - Proceedings of the 19th ACM Asia Conference on Computer and Communications Security
Publisher: Association for Computing Machinery, Inc
Pages: 852-865
Number of pages: 14
ISBN (Electronic): 9798400704826
DOIs
State: Published - 1 Jul 2024
Event: 19th ACM Asia Conference on Computer and Communications Security, AsiaCCS 2024 - Singapore, Singapore
Duration: 1 Jul 2024 – 5 Jul 2024

Publication series

Name: ACM AsiaCCS 2024 - Proceedings of the 19th ACM Asia Conference on Computer and Communications Security

Conference

Conference: 19th ACM Asia Conference on Computer and Communications Security, AsiaCCS 2024
Country/Territory: Singapore
City: Singapore
Period: 1/07/24 – 5/07/24

Keywords

  • Binary analysis
  • Continual learning
  • Reverse engineering
  • Software security

