Quantization Techniques Demystified: Boosting Efficiency in Large Language Models (LLMs)
In recent years, deep-learning models have evolved significantly, leading to the development of Large Language Models (LLMs) with billions of parameters. These models have transformed various fields, such as natural language processing. However, their large size presents challenges for deployment in real-world applications due to the substantial computational power and memory they require. Therefore, optimizing LLMs for practical deployment is crucial.
Understanding Tensors & Precision
Deep-learning models operate on tensors: multi-dimensional arrays that represent a network's inputs, outputs, and intermediate transformations. Because tensors hold data across multiple dimensions, they let neural networks work with complex data. These tensors are typically stored as 32-bit or 16-bit floating-point values (FP32 and FP16, respectively), alongside hardware-specific formats like NVIDIA's TF32, Google's bfloat16, and AMD's FP24.
Lower precision matters because full 32-bit (let alone 64-bit) floating-point precision isn't always necessary for effective performance. For both training and inference, lower precision can significantly reduce memory and computational requirements.
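As a quick illustration (a minimal sketch, assuming PyTorch is installed), the same tensor occupies half the memory in FP16 or bfloat16 that it does in FP32:

```python
# Compare the memory footprint of one tensor stored at different precisions.
import torch

x32 = torch.randn(1024, 1024, dtype=torch.float32)
x16 = x32.to(torch.float16)
xbf = x32.to(torch.bfloat16)

for name, t in [("fp32", x32), ("fp16", x16), ("bf16", xbf)]:
    size_mb = t.element_size() * t.nelement() / 1e6
    print(f"{name}: {size_mb:.1f} MB")  # fp32: ~4.2 MB, fp16/bf16: ~2.1 MB
```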
What is Model Quantization, and Why is it Required?
Quantization is a technique that reduces models' memory and computational demands without significantly affecting their effectiveness. This process converts models from a higher precision format, like 32-bit floating point, to a lower precision format, such as 8-bit integers.
For LLMs, quantization is essential for reducing the memory footprint and computational demands, enabling deployment across various hardware platforms, especially those with limited computational resources. However, quantization involves a trade-off between efficiency and accuracy. While it reduces resource requirements, it may also slightly decrease accuracy due to the reduced numerical precision, which can affect the model's ability to capture subtle data differences. Therefore, optimizing quantization is crucial to minimize its impact on model performance.
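To make the idea concrete, here is a toy sketch of symmetric (absmax) INT8 quantization; it is illustrative only, since production quantizers add per-channel scales, zero points, and calibration:

```python
# Toy symmetric (absmax) INT8 quantization of a weight tensor.
import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0              # map the largest |w| to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```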
Quantization Methods for LLMs
There are several quantization methods available for Large Language Models (LLMs), each with unique advantages and challenges. Below, we outline some popular options, including details on their performance, memory requirements, adaptability, and adoption. Additionally, many pre-quantized models using GPTQ and AWQ can be found on the Hugging Face Hub.
TL;DR
This table provides a quick overview of how each quantization method compares in terms of speed, video RAM (VRAM) usage, adaptability across different LLMs, and their adoption in the community or research.
1. GPTQ
GPTQ is a cutting-edge weight quantization technique designed to decrease the computational and storage demands associated with large language models, optimizing them primarily for GPU inference and enhanced performance. This method has achieved significant speed improvements in inference, boasting a 3.25x acceleration with high-end GPUs like NVIDIA's A100 and up to a 4.5x increase with more affordable GPUs such as the NVIDIA A6000.
Quantizing models with 175 billion parameters takes only four GPU hours, reducing the bit width to 3 or 4 bits per weight. For smaller models, this process requires less than a minute.
GPTQ prioritizes reducing memory movement over reducing computation, which yields notable speedups, especially in generative tasks. Note, however, that it quantizes weights only; activations are not quantized.
For example - This method is illustrated using the model TheBloke/Mistral-7B-v0.1-GPTQ on Hugging Face. Without compression, the GPU memory requirement stands at 27.99 GB, whereas GPTQ with vLLM needs 66.44 GB, because vLLM by default reserves nearly all available GPU memory for its KV cache. This reservation can be adjusted through the --gpu-memory-utilization parameter, which ranges from 0 to 1. It is also worth noting that vLLM achieves higher throughput than other libraries.
Here's how you can use GPTQ with vLLM for inference:
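The snippet below is a minimal sketch, assuming vLLM is installed and can download the checkpoint from the Hugging Face Hub; the prompt and sampling settings are illustrative:

```python
# Serve a pre-quantized GPTQ checkpoint with vLLM for offline inference.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-v0.1-GPTQ",
    quantization="gptq",              # tell vLLM the weights are GPTQ-quantized
    gpu_memory_utilization=0.9,       # fraction of VRAM vLLM may reserve
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```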
2. AWQ
AWQ (Activation-aware Weight Quantization) offers a hardware-friendly solution for quantizing the weights of Large Language Models (LLMs) to low-bit formats without quantizing activations. This method focuses on preserving a small fraction of critical weights by analyzing the distribution of activations rather than weights.
AWQ achieves notable speed improvements over GPTQ and consistently outperforms both round-to-nearest (RTN) quantization and GPTQ in perplexity evaluations across model sizes ranging from 7B to 70B parameters. A key advantage of AWQ is that it maintains the generalization capabilities of LLMs across different domains and modalities without overfitting to the calibration dataset, because it does not rely on backpropagation or reconstruction methods.
For example - Using the AWQ-quantized model TheBloke/Mistral-7B-v0.1-AWQ on vLLM requires 48.44 GB of GPU memory, which is less than the memory required for GPTQ with vLLM. This efficiency in memory usage underscores AWQ's effectiveness for deploying LLMs in hardware-constrained environments.
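The call mirrors the GPTQ sketch above, swapping in the AWQ checkpoint and quantization flag (settings are again illustrative):

```python
# Run a pre-quantized AWQ checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-v0.1-AWQ",
    quantization="awq",               # weights are stored in AWQ 4-bit format
    gpu_memory_utilization=0.9,       # illustrative cap on reserved VRAM
)

outputs = llm.generate(
    ["What does activation-aware quantization preserve?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```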
3. 4-bit NormalFloat (NF4)
The NormalFloat (NF) data type, built on the Quantile Quantization technique, introduces 4-bit NormalFloat (NF4) quantization to compress Large Language Models (LLMs) by storing model weights as 4-bit values whose quantization levels follow the quantiles of a normal distribution.
NF4 quantization is a component of the QLoRA (Quantized Low-Rank Adaptation) framework, designed to decrease memory consumption and boost LLM efficiency with minimal impact on performance.
An innovative aspect of NF4 is its use of Double Quantization (DQ), which quantizes the quantization constants themselves for additional savings, trimming the memory footprint by roughly 0.373 bits per parameter. This makes NF4 highly efficient in terms of memory usage. NF4 and DQ are implemented in the bitsandbytes library and integrate seamlessly with the Hugging Face transformers library, making it straightforward to quantize LLMs.
For example - Implementing a model with NF4 quantization through bitsandbytes significantly lowers memory requirements to just 4.35 GB, offering a stark contrast to the memory usage of GPTQ and AWQ with vLLM. However, this efficiency comes at the cost of reduced accuracy and lower throughput, which is an important consideration for applications requiring high performance.
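A minimal sketch of loading a model in NF4 with bitsandbytes via transformers follows; the base Mistral-7B model from the earlier examples and the generation settings are illustrative assumptions:

```python
# Load a model in 4-bit NF4 with Double Quantization via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 storage data type
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```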
4. HQQ
Half-Quadratic Quantization (HQQ) is a compression method for Large Language Models (LLMs) that eliminates the need for calibration data, significantly accelerating the quantization process while maintaining competitive compression quality. HQQ employs advanced optimization techniques such as Half-Quadratic splitting, reducing the quantization time to just a few minutes. For example, processing Llama-2-70B takes less than 5 minutes, making it over 50x faster than GPTQ.
HQQ is versatile, supporting both language and vision models, and effectively minimizes GPU memory usage without compromising on compression quality. For vision models, HQQ outperforms 4-bit bitsandbytes quantization and, in extreme low-bit settings, even surpasses the quality of full-precision models.
For example - The code snippet below demonstrates the use of a 4-bit quantized model, Mixtral-8x7B-Instruct-v0.1, with the HQQ library. Despite requiring 24.70 GB of GPU memory, this approach achieves excellent accuracy, showcasing HQQ's efficiency and effectiveness in model compression.
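The sketch shows the typical hqq workflow: load a full-precision Hugging Face model, then quantize it in place with a 4-bit configuration. The class and argument names follow hqq's HQQModelForCausalLM / BaseQuantizeConfig interface, but they are assumptions that may need adjusting for the hqq version you have installed:

```python
# Sketch of calibration-free 4-bit quantization with the hqq library.
# API details (class and argument names) are assumptions and may differ
# between hqq releases.
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the full-precision weights, then quantize them in place to 4 bits.
model = HQQModelForCausalLM.from_pretrained(model_id)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)
```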
Conclusion
In this blog post, we've explored the crucial role of quantization in enhancing the efficiency and accessibility of Large Language Models (LLMs) across various computing environments. Quantization stands as a pivotal development in making complex models more deployable, even on devices with limited computational resources.
As the field of quantization advances, it promises to bridge the gap between computational efficiency and model accuracy more effectively. This evolution paves the way for deploying sophisticated models in more diverse scenarios, expanding their usability and accessibility. With ongoing improvements in quantization techniques, we anticipate even greater advancements that will broaden the scope of LLM applications, making these powerful tools accessible to a wider audience.
For those interested in practical applications, including one-click deployment tutorials for quantized models like Llama2, Mistral, and CodeLlama, you can check out our tutorials. Additionally, for support in deploying these models on Inferless in a truly serverless manner, we invite you to request access.