TY - GEN
T1 - FIGNA
T2 - 30th IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
AU - Jang, Jaeyong
AU - Kim, Yulhwa
AU - Lee, Juheun
AU - Kim, Jae Joon
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Weight-only quantization has emerged as a promising technique for alleviating the computational burden of large language models (LLMs) by employing low-precision integer (INT) weights, while retaining full-precision floating-point (FP) activations to ensure inference quality. Despite the memory footprint reduction achieved through decreased bit-precision of weight parameters, the actual computing performance is often not improved significantly because FP-INT multiply-accumulation (MAC) operations are performed on the floating-point unit (FPU) after dequantizing the INT weight values to FP values, owing to the lack of dedicated FP-INT arithmetic units. In this study, we investigate the impact of introducing a dedicated FP-INT unit on overall performance and find that such specialization does not yield substantial improvements. As an alternative approach, we propose FIGNA, an accelerator based on INT units designed specifically for FP-INT MAC operations. A key feature of FIGNA is its ability to achieve the same numerical accuracy as the FPU while relying solely on integer units, a departure from prior methods that relied on integer units with numerical approximations of FP arithmetic results, claiming similar inference accuracy only through dedicated network training. Through comprehensive experiments on FP-INT quantized networks for LLMs, including OPT and BLOOM, we demonstrate the superior performance of FIGNA compared to conventional FPUs in terms of performance per area (TOPS/mm^{2}) and energy efficiency (TOPS/W) across various input and weight precision combinations. For instance, in the FP16-INT4 case, FIGNA shows 6.34x higher TOPS/mm^{2} and 2.19x higher TOPS/W compared to the baseline.
AB - Weight-only quantization has emerged as a promising technique for alleviating the computational burden of large language models (LLMs) by employing low-precision integer (INT) weights, while retaining full-precision floating-point (FP) activations to ensure inference quality. Despite the memory footprint reduction achieved through decreased bit-precision of weight parameters, the actual computing performance is often not improved significantly because FP-INT multiply-accumulation (MAC) operations are performed on the floating-point unit (FPU) after dequantizing the INT weight values to FP values, owing to the lack of dedicated FP-INT arithmetic units. In this study, we investigate the impact of introducing a dedicated FP-INT unit on overall performance and find that such specialization does not yield substantial improvements. As an alternative approach, we propose FIGNA, an accelerator based on INT units designed specifically for FP-INT MAC operations. A key feature of FIGNA is its ability to achieve the same numerical accuracy as the FPU while relying solely on integer units, a departure from prior methods that relied on integer units with numerical approximations of FP arithmetic results, claiming similar inference accuracy only through dedicated network training. Through comprehensive experiments on FP-INT quantized networks for LLMs, including OPT and BLOOM, we demonstrate the superior performance of FIGNA compared to conventional FPUs in terms of performance per area (TOPS/mm^{2}) and energy efficiency (TOPS/W) across various input and weight precision combinations. For instance, in the FP16-INT4 case, FIGNA shows 6.34x higher TOPS/mm^{2} and 2.19x higher TOPS/W compared to the baseline.
KW - FP-INT GEMM
KW - NPU
KW - Numerical Accuracy
KW - Weight-Only Quantization
UR - https://www.scopus.com/pages/publications/85190248502
U2 - 10.1109/HPCA57654.2024.00064
DO - 10.1109/HPCA57654.2024.00064
M3 - Conference contribution
AN - SCOPUS:85190248502
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 760
EP - 773
BT - Proceedings - 2024 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
PB - IEEE Computer Society
Y2 - 2 March 2024 through 6 March 2024
ER -