
Exploring LLMs Speed Benchmarks: Independent Analysis - Part 3

August 30, 2024
5 mins read
Aishwarya Goel
CoFounder & CEO
Rajdeep Borgohain
DevRel Engineer

Introduction

In our ongoing quest to help developers find the right libraries and LLMs for their use cases, we've turned our attention this month to benchmarking the latest five models:

  1. Qwen2-7B-Instruct, launched by Alibaba's AI research arm, stands out for its exceptional performance in coding and mathematics tasks, gaining popularity among developers for its versatility
  2. Gemma-2-9B-it, released by Google, offers impressive efficiency and performance for its size, attracting researchers and developers with its open-source accessibility and integration with popular AI platforms
  3. Llama-3.1-8B-Instruct, developed by Meta, boasts enhanced multilingual capabilities and a massive 128k token context window, making it increasingly popular for diverse language tasks
  4. Mistral-7B-Instruct-v0.3, created by Mistral AI, features an extended vocabulary and improved tokenizer, gaining traction in the AI community for its strong performance across various benchmarks
  5. Phi-3-medium-128k-instruct, introduced by Microsoft, excels in reasoning tasks and offers competitive performance against much larger models, attracting attention for its efficiency and long context handling

Each model brings unique strengths, from Qwen2's rapid token generation to Llama's impressive efficiency under various token loads.

We tested them across six different inference engines (vLLM, TGI, TensorRT-LLM, Triton with vLLM backend, DeepSpeed-MII, and CTranslate2) on A100 GPUs hosted on Azure, ensuring a neutral playing field separate from our Inferless platform.

The goal? To help developers, researchers, and AI enthusiasts pinpoint the best LLMs for their needs, whether for development or production. If you missed our previous deep dive into 10B to 34B parameter models, covering which library to choose and why, you can catch up here.

Tokens/sec Performance Ranking and Recommendations

1. Qwen2-7B-Instruct with TensorRT-LLM

This combination consistently ranks highest in tokens/sec performance, particularly with medium to high input and output token counts. It’s ideal for developers focused on maximizing raw throughput, making it the top choice for high-performance applications.

2. Llama-3.1-8B-Instruct with TensorRT-LLM

This model-library pair is a close contender, especially strong in scenarios with lower input tokens. It strikes an excellent balance between speed and efficiency, making it a go-to option for applications where both high throughput and low latency are critical.

3. Mistral-7B-Instruct-v0.3 with vLLM

Mistral-7B-Instruct-v0.3 shows consistent performance across various token configurations, making it a versatile choice. It’s particularly suitable for developers needing scalability and reliability in their AI workloads, handling different token loads efficiently.

4. Gemma-2-9B-it with TensorRT-LLM

This model exhibits strong tokens/sec performance, especially in high-throughput applications. While it’s not the fastest, it offers robust performance across different token configurations, making it a solid choice for developers who want reliability with decent speed.

5. Phi-3-medium-128k-instruct with TensorRT-LLM

Though slightly behind in raw speed, Phi-3-medium-128k-instruct offers steady performance with consistent tokens/sec across a range of token loads. It’s an excellent choice for tasks requiring stable performance and quick initialization, particularly when large context lengths are involved.

Recommendation: For developers prioritizing tokens/sec performance, Qwen2-7B-Instruct with TensorRT-LLM is the top pick, especially for heavy workloads. If you need slightly better performance with smaller token counts, Llama-3.1-8B-Instruct with TensorRT-LLM is your best bet. Mistral-7B-Instruct-v0.3 with vLLM is the most versatile, handling a variety of tasks efficiently across different token configurations.
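
To make the recommendation concrete, here is a minimal sketch of offline generation with vLLM using the same sampling settings as our runs (temperature 0.5, top_p 1). The model ID, prompt, and max_tokens value are illustrative, and this is a simplified example rather than our full benchmark harness.

```python
# Minimal vLLM offline-generation sketch (not our full benchmark harness).
# Assumes vLLM is installed and the model weights are reachable from the
# Hugging Face Hub; the model ID, prompt, and max_tokens are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")

sampling_params = SamplingParams(
    temperature=0.5,   # same sampling settings as in our benchmark runs
    top_p=1.0,
    max_tokens=200,    # one of the generation lengths we tested (100/200/500)
)

prompts = ["Explain the difference between latency and throughput in one paragraph."]
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.outputs[0].text)
```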

Impact on TTFT

TTFT (time to first token) is a critical metric indicating the responsiveness of a model. A lower TTFT means faster interaction, which is crucial for user-facing applications.
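
In practice, TTFT can be measured by timing the arrival of the first streamed token. Below is a minimal sketch against an OpenAI-compatible endpoint (such as the servers vLLM and TGI expose); the URL, model name, and prompt are assumptions, and this is a simplified illustration rather than our exact measurement code.

```python
# Illustrative TTFT measurement against an OpenAI-compatible endpoint
# (e.g. a locally running vLLM or TGI server). The base_url, model name,
# and prompt are assumptions; this is not our exact benchmark code.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize the theory of relativity."}],
    max_tokens=100,
    temperature=0.5,
    top_p=1.0,
    stream=True,
)

ttft = None
n_chunks = 0
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_chunks += 1  # rough count: one streamed chunk is roughly one token

latency = time.perf_counter() - start
print(f"TTFT: {ttft:.4f}s, total latency: {latency:.2f}s, chunks: {n_chunks}")
```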

Key TTFT Insights Across Libraries and Input Sizes:

  • Library Performance: TGI, vLLM, and Triton vLLM backend generally offer lower TTFTs with smaller inputs but see increases as input size grows. CTranslate2, however, struggles with scalability, showing higher TTFTs as token counts rise.
  • Scalability: Most libraries saw performance drops with larger input tokens. Triton vLLM backend with Llama-3.1-8B-Instruct stood out by maintaining stable TTFT even with larger inputs.
  • Average TTFT: As input tokens increase, so does TTFT. It stays low at smaller input sizes but rises sharply beyond 500 input tokens. The lowest TTFTs occurred at around 20 input tokens, ranging from 0.0168 to 0.0277 seconds depending on the library.

How We Tested

Here’s how we ensured consistent, reliable benchmarks:

  • Platform: All tests ran on A100 GPUs from Azure, providing a level playing field.
  • Setup: Docker containers for each library ensured a consistent environment.
  • Configuration: Standard settings (temperature 0.5, top_p 1) kept the focus on performance, not external variables.
  • Prompts & Token Ranges: We used six distinct prompts with input lengths from 20 to 2,000 tokens and tested generation lengths of 100, 200, and 500 tokens to evaluate each library’s flexibility (a prompt-trimming sketch follows this list).
  • Models & Libraries Tested: We evaluated Phi-3-medium-128k-instruct, Meta-Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct, and Gemma-2-9b-it using Text Generation Inference (TGI), vLLM, DeepSpeed Mii, CTranslate2, Triton with vLLM Backend, and TensorRT-LLM.
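
For readers who want to reproduce the input-length buckets, the sketch below shows one way to trim a source text to a target number of input tokens with a Hugging Face tokenizer. The model ID, source text, and exact bucket sizes are illustrative, and this is not our exact preprocessing code.

```python
# Illustrative prompt trimming to a target input-token count using a
# Hugging Face tokenizer. The model ID, source text, and bucket sizes are
# assumptions; this is not our exact preprocessing code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def trim_to_tokens(text: str, target_tokens: int) -> str:
    """Return a prefix of `text` that is at most `target_tokens` tokens long."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"][:target_tokens]
    return tokenizer.decode(ids)

long_document = "..."  # any sufficiently long source text
for n in (20, 100, 500, 1000, 2000):  # input-length buckets similar to ours
    prompt = trim_to_tokens(long_document, n)
    # Re-encode to verify the prompt is close to the target length
    print(n, len(tokenizer(prompt, add_special_tokens=False)["input_ids"]))
```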

Detailed Benchmarks

We now present a detailed look at the benchmarks. The terminology used in our data tables is explained as follows:

  • Model Name: The specific model tested.
  • Library: The inference engine used for benchmarking.
  • TTFT: Time taken to generate and deliver the first token after receiving input.
  • Token_count: The total number of tokens generated by the model.
  • Latency (seconds): Total time taken to receive the complete response.
  • Tokens/second: The rate at which the model generates output tokens.
  • Output_tokens: The requested maximum number of tokens in the response.
  • Input_tokens: The number of input tokens provided in the prompt.
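
As an illustration of how these fields fit together, the sketch below models a single benchmark record. The example values are invented, and the tokens/second derivation shown (generated tokens divided by end-to-end latency) is a common convention stated here as an assumption, not necessarily the exact formula behind every figure in the tables.

```python
# Illustrative record structure for one benchmark run. The example values are
# made up, and tokens_per_second = token_count / latency is a common convention
# given as an assumption, not necessarily the exact formula behind every table.
from dataclasses import dataclass

@dataclass
class BenchmarkRecord:
    model_name: str      # the specific model tested
    library: str         # the inference engine used
    ttft: float          # time to first token, in seconds
    token_count: int     # total tokens generated by the model
    latency: float       # total time to receive the response, in seconds
    output_tokens: int   # requested maximum tokens in the response
    input_tokens: int    # tokens provided in the prompt

    @property
    def tokens_per_second(self) -> float:
        return self.token_count / self.latency

record = BenchmarkRecord(
    model_name="Qwen2-7B-Instruct", library="TensorRT-LLM",
    ttft=0.02, token_count=200, latency=2.0,
    output_tokens=200, input_tokens=100,
)
print(f"{record.tokens_per_second:.1f} tokens/sec")  # 100.0 with these made-up numbers
```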

1. Qwen2-7B-Instruct:

2. Gemma-2-9B-it:

3. Llama-3.1-8B-Instruct:

4. Mistral-7B-Instruct-v0.3:

5. Phi-3-medium-128k-instruct:

Note: We welcome your thoughts and insights to help us refine these benchmarks further. Our objective is to empower decisions with data, not to discount any service.
