FIGNA: Integer Unit-Based Accelerator Design for FP-INT GEMM Preserving Numerical Accuracy

  • Jaeyong Jang
  • Yulhwa Kim
  • Juheun Lee
  • Jae Joon Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

23 Scopus citations

Abstract

Weight-only quantization has emerged as a promising technique for alleviating the computational burden of large language models (LLMs): it employs low-precision integer (INT) weights while retaining full-precision floating-point (FP) activations to preserve inference quality. Despite the memory-footprint reduction achieved through the lower bit-precision of the weight parameters, actual computing performance often does not improve significantly, because, in the absence of dedicated FP-INT arithmetic units, FP-INT multiply-accumulate (MAC) operations are performed on the floating-point unit (FPU) after dequantizing the INT weights back to FP values. In this study, we investigate the impact of introducing a dedicated FP-INT unit on overall performance and find that such specialization does not yield substantial improvements. As an alternative, we propose FIGNA, an accelerator based on INT units designed specifically for FP-INT MAC operations. A key feature of FIGNA is that it achieves the same numerical accuracy as an FPU while relying solely on integer units, in contrast to prior integer-unit approaches that approximate FP arithmetic results and claim comparable inference accuracy only through dedicated network training. Through comprehensive experiments on FP-INT quantized LLMs, including OPT and BLOOM, we demonstrate the superior performance of FIGNA over conventional FPUs in performance per area (TOPS/mm^{2}) and energy efficiency (TOPS/W) across various input and weight precision combinations. For instance, in the FP16-INT4 case, FIGNA achieves 6.34x higher TOPS/mm^{2} and 2.19x higher TOPS/W than the baseline.

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
Publisher: IEEE Computer Society
Pages: 760-773
Number of pages: 14
ISBN (Electronic): 9798350393132
State: Published - 2024
Event: 30th IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024 - Edinburgh, United Kingdom
Duration: 2 Mar 2024 - 6 Mar 2024

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
ISSN (Print): 1530-0897

Conference

Conference: 30th IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
Country/Territory: United Kingdom
City: Edinburgh
Period: 2/03/24 - 6/03/24

Keywords

  • FP-INT GEMM
  • NPU
  • Numerical Accuracy
  • Weight Only Quantization
