
How to serve Multi-LoRA Adapters

February 19, 2025
2 mins read
Rajdeep Borgohain
DevRel Engineer

Introduction

LoRA (Low-Rank Adaptation) adapters are a recent innovation in fine-tuning large language models (LLMs). Instead of updating all model parameters during fine-tuning, a process that is both computationally and memory intensive, LoRA injects additional, trainable low-rank matrices into selected layers of the pre-trained network. This approach enables rapid adaptation to new tasks while keeping most of the model (the “base” weights) frozen.

Key Advantages of Deploying Multiple LoRA Adapters:

  • Modularity and Scalability: A single base model can be augmented with numerous task-specific adapters. This enables a flexible architecture where different users or tasks can trigger the loading of specific adapters without the overhead of maintaining several full copies of the model.
  • Cost and Resource Efficiency: By sharing the base model across many adapters, organizations can significantly reduce both storage and computational costs. Only a small fraction of parameters (those in the adapters) need to be updated or stored separately.
  • Rapid Specialization: Whether for domain-specific language, personalized user interactions, or adapting to specialized data distributions, multiple LoRA adapters allow one system to switch quickly between various specialized tasks without re-training or redeploying the base model.

Understanding LoRA Adapters

What Are LoRA Adapters?

LoRA adapters are an instance of parameter-efficient fine-tuning (PEFT) methods. In a standard Transformer-based LLM, most parameters reside in large weight matrices (e.g., the query, key, and value projection matrices in self-attention layers). Instead of fine-tuning these full matrices, LoRA proposes to update them by adding a low-rank perturbation.

Formally, if a given weight matrix is denoted as W, then after applying a LoRA adapter the effective weight becomes:

W′ = W + ΔW, with ΔW = AB

Here A and B are small trainable matrices whose shared inner dimension (the rank r) is much smaller than the dimensions of W, so ΔW is a low-rank update.

How They Modify an LLM While Keeping the Base Model Intact

The base model's weights remain frozen during fine-tuning, which means the extensive knowledge acquired during pre-training is preserved. Instead of modifying these weights directly, LoRA augments the network's behaviour by introducing a task-specific corrective term.

Instead of updating the full-weight matrix W, LoRA learns two small matrices, A and B, whose product approximates the necessary update. In a typical Transformer layer, the standard forward pass computes the output as:

h = XW

(with X representing the input). With LoRA, the computation becomes:

h = XW + X(AB)

Here, the term X(AB) serves as a low-rank adjustment that "nudges" the network's outputs toward the target task without altering the bulk of the pre-trained parameters.

This design not only enables efficient fine-tuning with fewer trainable parameters but also helps prevent catastrophic forgetting by maintaining the integrity of the original model's knowledge.
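
To make the parameter savings concrete, here is a minimal NumPy sketch of the forward pass h = XW + X(AB) and the resulting parameter counts. The hidden sizes and rank below are arbitrary values chosen only for illustration.

import numpy as np

d, k, r = 4096, 4096, 8           # hypothetical hidden sizes and LoRA rank
X = np.random.randn(2, d)         # a small batch of inputs
W = np.random.randn(d, k)         # frozen base weight (never updated)

A = np.random.randn(d, r) * 0.01  # trainable low-rank factor
B = np.zeros((r, k))              # B starts at zero, so initially ΔW = AB = 0

h = X @ W + X @ (A @ B)           # LoRA forward pass: h = XW + X(AB)

full_update_params = d * k          # 16,777,216 parameters for a full update of W
lora_update_params = d * r + r * k  # 65,536 parameters (~0.4% of the full update)
print(full_update_params, lora_update_params)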

Benefits of Using LoRA Instead of Full Fine-Tuning

  • Parameter Efficiency: LoRA significantly reduces the number of parameters that need to be updated during fine-tuning.
  • Reduced Memory and Compute Requirements: Since only the adapter parameters (i.e., A and B) need to be stored in GPU memory for gradient computation, fine-tuning can be performed on hardware with limited memory. This efficiency also leads to faster training iterations.
  • Preservation of Pre-trained Knowledge: By freezing the base model's parameters, LoRA preserves the broad general knowledge acquired during large-scale pre-training, while the adapters specialize the model for downstream tasks.
  • Zero Inference Latency Overhead: With careful implementation (for instance, by merging the learned low-rank update into W before inference), LoRA typically introduces no extra latency compared to a fully fine-tuned model.

The seminal paper by Hu et al. (2021) demonstrated that LoRA could achieve comparable or even superior performance relative to full fine-tuning while using a fraction of the parameters.

Why Serve Multiple LoRA Adapters?

In many real-world applications, a single base model may need to cater to a wide variety of tasks or domains. Some common scenarios include:

  • Domain-Specific Specializations: An enterprise might deploy one general-purpose LLM that is augmented with multiple LoRA adapters, each fine-tuned on data from different domains (e.g., legal, medical, finance). This enables the same base model to generate outputs that are tailored to specific terminologies and regulatory requirements.
  • Personalized AI Applications: In customer-facing systems (such as chatbots or recommendation engines), personalization is key. Different users or user segments may require responses that reflect their unique preferences or language styles. Multiple LoRA adapters can be maintained to address these varied needs dynamically.
  • Multi-Task Inference: In systems where multiple tasks must be handled concurrently (such as language translation, summarization, and sentiment analysis), a single deployment that serves different adapters can switch between these tasks on the fly without the overhead of reloading or deploying new models.

Challenges in Serving Multiple Adapters Dynamically

While the benefits of LoRA adapters make them ideal for specialization, deploying and dynamically serving multiple adapters brings its own set of challenges:

  • Dynamic Loading and Unloading: In many applications, the system must select and load the appropriate adapter for each incoming request. This process must be highly efficient to avoid latency spikes.
  • Memory Fragmentation and Management: Serving many adapters simultaneously can lead to GPU memory fragmentation. Since each adapter has its own set of parameters (even if they are small), careful memory management is required to fetch only the necessary adapters for a given batch of requests.
  • Scalability and Throughput: In scenarios with high request volume, the serving system must support batching across different adapters.

Approaches to Serving LoRA Adapters

There are two primary methods for deploying fine-tuned LoRA adapters with large language models:

  1. Merging with the Base Model: After fine-tuning your adapter, you can integrate it into the base model, resulting in a standalone model that can be served using various inference libraries with no adapter-related latency overhead. However, this limits the model to the specific task for which it was fine-tuned, reducing flexibility. Here’s the code to merge the LoRA adapter using the PEFT library.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model and attach the fine-tuned LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
model = PeftModel.from_pretrained(base_model, peft_model_id)

# Fold the low-rank update into the base weights and get back a standalone model
merged_model = model.merge_and_unload()
  2. Dynamically Loading Adapters: If you possess multiple LoRA adapters tailored for different tasks but based on the same base model, you can dynamically load the appropriate adapter during inference. This method allows a single base model to serve multiple tasks by loading the relevant adapter as needed, enhancing flexibility and resource efficiency. Here’s the code to dynamically load a LoRA adapter using the vLLM library.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from huggingface_hub import snapshot_download

lora_path = snapshot_download(repo_id="alignment-handbook/zephyr-7b-sft-lora")
llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256
)

prompts = "Write a poem."

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("adapter", 1, lora_path)
)

When choosing between these methods, consider the trade-offs between simplicity and flexibility. Merging adapters into the base model simplifies deployment but restricts the model to a single task. Dynamic loading offers greater versatility, enabling the base model to handle various tasks by applying different adapters as required.

LoRA Adapters Latency Overview

When working with LoRA adapters, inference latency is a critical performance metric. The two primary approaches to using these adapters are dynamic LoRA adapters and merged LoRA adapters.

Performance Comparison

  1. Using vLLM:

In vLLM, merging the adapter directly with the base model significantly reduces latency.

  2. Using Hugging Face Transformers:

Similarly, with Hugging Face Transformers, the merged adapter approach nearly halves the latency compared to the dynamic method, underscoring the benefits of pre-integrating adapter parameters.

For production systems that demand fast response times, using merged LoRA adapters greatly enhances performance by reducing computational overhead and streamlining the inference process, making them the preferred option for latency-sensitive applications.
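
The exact gap depends on hardware, batch size, and sequence length, but the comparison is easy to reproduce with a small timing harness. The sketch below is illustrative only: it reuses the base model and adapter from the earlier examples, assumes a CUDA GPU with enough memory, and simply averages wall-clock generation time.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
adapter_id = "alignment-handbook/zephyr-7b-sft-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
inputs = tokenizer("Write a poem.", return_tensors="pt").to("cuda")

def time_generate(model, n=5):
    # Average wall-clock time over n fixed-length generations.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n

# Dynamic: the adapter is applied on top of the frozen base model at runtime.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16).to("cuda")
dynamic_model = PeftModel.from_pretrained(base, adapter_id)
print("dynamic adapter:", time_generate(dynamic_model), "s")

# Merged: the low-rank update is folded into the base weights ahead of time.
merged_model = dynamic_model.merge_and_unload()
print("merged adapter:", time_generate(merged_model), "s")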

Libraries for Serving Multi-LoRA Adapters

We have explored several open-source approaches to serving multi-LoRA adapters and discuss how libraries such as Hugging Face Transformers, vLLM, LoRAX, and TGI allow users to build scalable, flexible inference systems with LoRA adapters.

Using Hugging Face Transformers:

Hugging Face Transformers is one of the most widely used libraries for working with state-of-the-art LLMs, and its flexibility makes it an ideal platform for integrating LoRA adapters. Thanks to extensions such as the PEFT (Parameter-Efficient Fine-Tuning) library, users can easily load a base model and attach one or more adapters without modifying the core weights.

Users can also dynamically switch adapters based on the task and context at inference time.

Here’s the sample code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

class InferlessPythonModel:
    def initialize(self):
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
        base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").to("cuda")
        self.model = PeftModel.from_pretrained(base_model,"CATIE-AQ/mistral7B-FR-InstructNLP-LoRA", adapter_name="french")
        self.model.load_adapter("Liu-Xiang/mistral7bit-lora-sql", adapter_name="sql")
        self.model.load_adapter("alignment-handbook/zephyr-7b-dpo-lora", adapter_name="dpo")
        self.model.load_adapter("uukuguy/Mistral-7B-OpenOrca-lora", adapter_name="orca")

    def infer(self, inputs):
        prompt = inputs["prompt"]
        adapter_name = inputs.pop("adapter_name")
        temperature = inputs.get("temperature",0.7)
        repetition_penalty = float(inputs.get("repetition_penalty",1.18))
        max_new_tokens = inputs.get("max_new_tokens",128)
        
        if (self.model.active_adapter) != adapter_name:
            self.model.set_adapter(adapter_name)

        model_input = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        output_ids = self.model.generate(
            **model_input,
            do_sample=True,  # sampling must be enabled for temperature to take effect
            temperature=temperature,
            max_new_tokens=max_new_tokens,
            repetition_penalty=repetition_penalty
        )
        result = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

        return {'generated_result': result}
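
For context, a deployment harness would call this class roughly as follows; the prompt and adapter choice here are only examples.

model = InferlessPythonModel()
model.initialize()

# Route a request to the SQL adapter loaded above.
response = model.infer({
    "prompt": "Generate a SQL query that lists all customers with open orders.",
    "adapter_name": "sql",
    "max_new_tokens": 128
})
print(response["generated_result"])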

Using vLLM

vLLM is an open-source inference engine built to address LLM serving performance bottlenecks through continuous batching, efficient GPU memory management (PagedAttention), and low-latency execution.

By integrating vLLM, users can serve models enhanced with multiple LoRA adapters in production settings where both speed and cost-efficiency are paramount.

Here’s the sample code:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from transformers import AutoTokenizer  # vLLM uses Hugging Face tokenizers
from huggingface_hub import snapshot_download  # used to fetch adapter weights locally

class InferlessPythonModel:
    def initialize(self):
        # Initialize the tokenizer (using Hugging Face AutoTokenizer)
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
        
        # Instantiate the base model with LoRA enabled
        # Ensure that vLLM is built with LoRA support (enable_lora flag)
        self.llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
        
        # Download each adapter once and assign it a unique integer ID.
        # vLLM uses the (name, id, local path) triple in LoRARequest to cache adapters across requests.
        adapter_repos = {
            "french": "CATIE-AQ/mistral7B-FR-InstructNLP-LoRA",
            "sql": "Liu-Xiang/mistral7bit-lora-sql",
            "orca": "uukuguy/Mistral-7B-OpenOrca-lora"
        }
        self.adapters = {
            name: (idx, snapshot_download(repo_id=repo))
            for idx, (name, repo) in enumerate(adapter_repos.items(), start=1)
        }

    def infer(self, inputs):
        prompt = inputs["prompt"]
        adapter_name = inputs.pop("adapter_name")
        temperature = float(inputs.get("temperature", 0.7))
        repetition_penalty = float(inputs.get("repetition_penalty", 1.18))
        max_new_tokens = int(inputs.get("max_new_tokens", 128))
        
        # Build a LoRARequest for the requested adapter on every call;
        # without it, vLLM would fall back to the plain base model.
        adapter_id, adapter_path = self.adapters[adapter_name]
        lora_req = LoRARequest(adapter_name, adapter_id, adapter_path)

        # Create sampling parameters for generation.
        sampling_params = SamplingParams(
            temperature=temperature,
            max_tokens=max_new_tokens,
            repetition_penalty=repetition_penalty
        )

        # Generate output using vLLM.
        # Note: vLLM accepts a list of prompt strings.
        outputs = self.llm.generate(prompt, sampling_params, lora_request=lora_req)
        
        result_output = [output.outputs[0].text for output in outputs][0]
        return {'generated_result': result_output}

    def finalize(self):
        self.llm = None

Using LoRAX

LoRAX allows users to load any LoRA adapter dynamically at runtime per request and batch many different LoRAs together at once for high throughput.

Here are the steps for using LoRAX and how you can leverage its ability to dynamically load and manage multiple LoRA adapters:

1. Setting Up the LoRAX Server:

First, ensure you have Docker installed, as LoRAX provides a pre-built Docker image for easy deployment.

docker pull ghcr.io/predibase/lorax:latest

Run the LoRAX server with your chosen base model (e.g., Mistral-7B):

export HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>

docker run --gpus all --shm-size 1g -p 8080:80 -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN ghcr.io/predibase/lorax --model-id mistralai/Mistral-7B-Instruct-v0.1

This command starts the LoRAX server on port 8080 with the specified base model.

2. Interacting with the LoRAX Server:

With the server running, you can send HTTP requests to perform inference with different LoRA adapters. Below is a Python example using the requests library:

import requests
import json

# LoRAX API endpoint
LORAX_API_URL = "http://localhost:8080/generate"

# Define generation parameters
generation_params = {
    "max_new_tokens": 128
}

# Define your prompt and adapter (if needed)
prompt = "Explique-moi en français comment fonctionne l'apprentissage automatique."
adapter = "CATIE-AQ/mistral7B-FR-InstructNLP-LoRA"  # Set to None if you want to use the base model

# Build the payload for the request
payload = {
    "inputs": prompt,
    "parameters": generation_params
}

# Include the adapter_id if an adapter is specified
if adapter:
    payload["parameters"]["adapter_id"] = adapter

headers = {"Content-Type": "application/json"}

# Send the inference request to the LoRAX API
response = requests.post(LORAX_API_URL, headers=headers, data=json.dumps(payload))

# Process and display the response
if response.status_code == 200:
    output = response.json()
    generated_text = output.get("generated_text", "")
    print("Generated Text:")
    print(generated_text)
else:
    print(f"Error: {response.status_code} - {response.text}")

Using TGI

TGI (Text Generation Inference) is designed to efficiently handle large volumes of text generation requests. Integrating LoRA adapters with TGI offers a resource-efficient approach to deploying fine-tuned models.

To use TGI for serving multiple LoRA adapters efficiently, follow the steps below:

1. Setting Up the TGI Server:

First, ensure you have Docker installed, as TGI provides a Docker image for easy deployment.

docker pull ghcr.io/huggingface/text-generation-inference:latest

Run the TGI server with your chosen base model (e.g., mistralai/Mistral-7B-v0.1) and specify the LoRA adapters you intend to use:

export HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>

docker run --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
				--gpus all --shm-size 1g -p 8080:80 \
				ghcr.io/huggingface/text-generation-inference \
				--model-id mistralai/Mistral-7B-v0.1 \
				--lora-adapters=CATIE-AQ/mistral7B-FR-InstructNLP-LoRA,Liu-Xiang/mistral7bit-lora-sql

This command starts the TGI server on port 8080 with the specified base model and loads the listed LoRA adapters. The --lora-adapters flag accepts a comma-separated list of adapter identifiers.

2. Interacting with the TGI Server:

With the server running, you can send HTTP requests to perform inference using different LoRA adapters. Below is a Python example utilizing the requests library:

import requests
import json

# TGI API endpoint
TGI_API_URL = "http://localhost:8080/generate"

# Define generation parameters
generation_params = {
    "max_new_tokens": 128
}

# Define your prompt and adapter (if needed)
prompt = "Explique-moi en français comment fonctionne l'apprentissage automatique."
adapter = "CATIE-AQ/mistral7B-FR-InstructNLP-LoRA"  # Set to None if you want to use the base model

# Build the payload for the request
payload = {
    "inputs": prompt,
    "parameters": generation_params
}

# Include the adapter_id if an adapter is specified
if adapter:
    payload["parameters"]["adapter_id"] = adapter

headers = {"Content-Type": "application/json"}

# Send the inference request to the TGI API
response = requests.post(TGI_API_URL, headers=headers, data=json.dumps(payload))

# Process and display the response
if response.status_code == 200:
    output = response.json()
    generated_text = output.get("generated_text", "")
    print("Generated Text:")
    print(generated_text)
else:
    print(f"Error: {response.status_code} - {response.text}")

Performance and Feature Comparison:

To provide a comprehensive view of the capabilities of popular LoRA serving libraries (vLLM, TGI, and LoRAX), we conducted an in-depth evaluation focusing on both quantitative performance metrics and features.

Performance Metrics

We assessed each library based on two key performance indicators: throughput (measured in tokens per second or TPS) and latency (measured in seconds). These metrics were evaluated under three different adapter configurations: French-LoRA, SQL-LoRA, and DPO-LoRA.

From these graphs, we observed that vLLM and TGI deliver good performance with high TPS and low latency, whereas LoRAX exhibits more varied performance.

Feature Comparison

In addition to raw performance, the libraries differ in design and deployment aspects. The table below summarizes their core features:

To ensure our comparisons across LoRA serving libraries (vLLM, TGI, and LoRAX) were both consistent and reliable, we followed this process:

  • Platform: All tests were executed on NVIDIA A100 GPUs on Azure, ensuring a uniform, high-performance hardware baseline.
  • Setup: We deployed each library (TGI, vLLM, and LoRAX) using its latest Docker container, each exposing a standardized API endpoint.
  • Standardized Settings: We maintained consistent generation parameters: temperature 0.5, top_p 1, and max_tokens 128.
  • Adapter Configurations: Three distinct adapter setups (French-LoRA, SQL-LoRA, and DPO-LoRA) were tested.

By following these procedures, our evaluation delivers a consistent and reproducible performance benchmark.
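
As an illustration of how such numbers can be collected, here is a minimal, hypothetical measurement loop against a TGI- or LoRAX-style /generate endpoint as started in the sections above. The endpoint URL, prompt, and request count are placeholders, the token count is approximated from whitespace-separated words, and vLLM's OpenAI-compatible server would need a slightly different payload.

import time
import requests

ENDPOINT = "http://localhost:8080/generate"   # server started in the sections above
PROMPT = "Explain how machine learning works."
N_REQUESTS = 20

payload = {
    "inputs": PROMPT,
    # top_p is left at its default here; some servers reject an explicit 1.0
    "parameters": {"temperature": 0.5, "max_new_tokens": 128},
}

latencies, total_tokens = [], 0
for _ in range(N_REQUESTS):
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    latencies.append(time.perf_counter() - start)
    # Rough proxy for generated tokens: whitespace-separated words in the output.
    total_tokens += len(resp.json().get("generated_text", "").split())

avg_latency = sum(latencies) / len(latencies)
throughput = total_tokens / sum(latencies)    # rough tokens-per-second estimate
print(f"avg latency: {avg_latency:.2f}s, approx throughput: {throughput:.1f} TPS")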

Conclusion

LoRA adapters represent a transformative approach to fine-tuning LLMs, enabling rapid and efficient adaptation for diverse tasks without the heavy computational burden of full-model fine-tuning. By isolating task-specific adjustments into lightweight, low-rank matrices while keeping the pre-trained base model intact, LoRA not only preserves the model’s general knowledge but also dramatically reduces resource requirements.

Whether through dynamic loading or adapter merging, modern serving frameworks are making it increasingly practical to deploy multiple adapters within a single system, delivering scalability, cost efficiency, and low inference latency.

As research and engineering innovations continue to refine memory management and parallel processing strategies, LoRA-based systems are set to become a cornerstone for personalized and multi-task AI applications in real-world environments.

Resources:

  1. https://arxiv.org/pdf/2106.09685
  2. https://predibase.com/blog/5-reasons-why-lora-adapters-are-the-future-of-fine-tuning
  3. https://huggingface.co/blog/multi-lora-serving
  4. https://towardsdatascience.com/dive-into-lora-adapters-38f4da488ede/
  5. https://docs.vllm.ai/en/latest/features/lora.html
  6. https://huggingface.co/docs/transformers/en/peft
