TY - JOUR
T1 - Binary Code Representation With Well-Balanced Instruction Normalization
AU - Koo, Hyungjoon
AU - Park, Soyeon
AU - Choi, Daejin
AU - Kim, Taesoo
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
N2 - The recovery of contextual meanings on a machine code is required by a wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection. To accomplish this, advancements on binary code analysis borrow the techniques from natural language processing to automatically infer the underlying semantics of a binary, rather than replying on manual analysis. One of crucial pipelines in this process is instruction normalization, which helps to reduce the number of tokens and to avoid an out-of-vocabulary (OOV) problem. However, existing approaches often substitutes the operand(s) of an instruction with a common token (e. g., callee target → FOO), inevitably resulting in the loss of important information. In this paper, we introduce well-balanced instruction normalization (WIN), a novel approach that retains rich code information while minimizing the downsides of code normalization. With large swaths of binary code, our finding shows that the instruction distribution follows Zipf's Law like a natural language, a function conveys contextually meaningful information, and the same instruction at different positions may require diverse code representations. To show the effectiveness of WIN, we present DEEP SEMANTIC that harnesses the BERT architecture with two training phases: pre-training for generic assembly code representation, and fine-tuning for building a model tailored to a specialized task. We define a downstream task of binary code similarity detection, which requires underlying code semantics. Our experimental results show that our binary similarity model with WIN outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, with an average improvement of 49.8% and 15.8%, respectively.
AB - The recovery of contextual meanings on a machine code is required by a wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection. To accomplish this, advancements on binary code analysis borrow the techniques from natural language processing to automatically infer the underlying semantics of a binary, rather than replying on manual analysis. One of crucial pipelines in this process is instruction normalization, which helps to reduce the number of tokens and to avoid an out-of-vocabulary (OOV) problem. However, existing approaches often substitutes the operand(s) of an instruction with a common token (e. g., callee target → FOO), inevitably resulting in the loss of important information. In this paper, we introduce well-balanced instruction normalization (WIN), a novel approach that retains rich code information while minimizing the downsides of code normalization. With large swaths of binary code, our finding shows that the instruction distribution follows Zipf's Law like a natural language, a function conveys contextually meaningful information, and the same instruction at different positions may require diverse code representations. To show the effectiveness of WIN, we present DEEP SEMANTIC that harnesses the BERT architecture with two training phases: pre-training for generic assembly code representation, and fine-tuning for building a model tailored to a specialized task. We define a downstream task of binary code similarity detection, which requires underlying code semantics. Our experimental results show that our binary similarity model with WIN outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, with an average improvement of 49.8% and 15.8%, respectively.
KW - BERT
KW - Binary code
KW - binary code similarity detection
KW - code representation
KW - well-balanced instruction normalization
UR - https://www.scopus.com/pages/publications/85151570711
U2 - 10.1109/ACCESS.2023.3259481
DO - 10.1109/ACCESS.2023.3259481
M3 - Article
AN - SCOPUS:85151570711
SN - 2169-3536
VL - 11
SP - 29183
EP - 29198
JO - IEEE Access
JF - IEEE Access
ER -