The Ultimate Guide to Qwen Model

January 9, 2025
5 mins read
Aishwarya Goel
CoFounder & CEO
Rajdeep Borgohain
DevRel Engineer

Introduction

The Qwen model series, developed by Alibaba Group's Qwen Team, represents a significant advancement in the field of generative models. Since its initial beta release in April 2023, Qwen has evolved into a comprehensive suite of models, each tailored to address specific challenges in natural language processing and understanding.

The Qwen series, also known as Tongyi Qianwen, has undergone several iterations, with notable releases including the Qwen2.5 models, QwQ-32B-Preview, and QVQ-72B-Preview. These models have been pre-trained on extensive multilingual and multimodal datasets, ensuring their applicability across various languages and modalities. They have been further refined with high-quality data to align with human preferences, enhancing their effectiveness in real-world applications.

Types of Qwen Models

The Qwen family encompasses a diverse array of models, each designed to cater to specific domains and tasks:

  • Qwen: The foundational large language model, proficient in natural language understanding and text generation.
    • Qwen Chat & Instruct: Variants trained with techniques such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), and optimized for conversational interactions.
  • Qwen-VL: A large vision-language model capable of processing and generating content that combines visual and textual information.
  • Qwen-Audio: Designed for audio-related tasks, excelling in processing and understanding various audio inputs—including human speech, natural sounds, music, and songs—and generating text as output.
  • Qwen-Coder: A specialized model aimed at coding assistance, offering support in programming and software development tasks.
  • Qwen-Math: Focused on mathematical problem-solving, this model aids in understanding and generating mathematical content.
  • Qwen-QwQ Model: The Qwen Team introduced QwQ-32B-Preview, an experimental research model designed to advance AI reasoning capabilities.
  • Qwen-QVQ Model: QVQ-72B-Preview is an experimental research model focusing on enhancing visual reasoning capabilities.

Each model variant is built upon the base Qwen architecture, with specialized training to enhance performance in its respective domain, reflecting the principles of generative models in AI. This modular approach allows for targeted applications while maintaining the robust capabilities of the core model.

Qwen's Key Milestones


The Qwen series began with the Qwen-7B model built with a transformer-based architecture, trained on up to 3 trillion tokens of diverse content, including text and code. Key innovations like rotary positional embeddings and flash attention helped improve training efficiency and performance, as discussed in an in-depth explanation of generative AI.

Over time, the Qwen family expanded with both larger and more specialized models:

  • Qwen-14B and Qwen-72B: Larger versions with more parameters, offering deeper language understanding and more accurate text generation.
  • Qwen-VL: A vision-language model that understood and generated responses from both text and visual inputs.
  • Qwen-1.8B: A small model with about 1.8 billion parameters, aiming for a balance between performance and efficiency.
  • Qwen-Audio: A model focused on audio processing, allowing it to handle speech and other audio inputs.

Then, the Qwen1.5 series introduced improved long-context handling, making conversations more natural and engaging. Ranging from 0.5 billion to 110 billion parameters, these models could maintain context over longer interactions, improving how AI systems communicate with users.

In June 2024, the Qwen Team released the Qwen2 series, featuring improved data quality and training methods. Trained on 7 trillion tokens, these models underwent supervised fine-tuning and Direct Preference Optimization (DPO) for better human alignment.

The Qwen2 series included:

  • Qwen2-0.5B to Qwen2-72B: Models of various sizes for different computing needs.
  • Qwen2-57B-A14B (MoE): A mixture-of-experts model with 57 billion total parameters, of which about 14 billion are active per token, for efficient parameter usage.
  • Specialized Models: Including Qwen2-VL, Qwen2-Audio, and Qwen2-Math.

The subsequent Qwen2.5 series, released in September 2024, expanded the training dataset to 18 trillion tokens and introduced cost-effective models like Qwen2.5-14B and Qwen2.5-32B. A mobile-friendly Qwen2.5-3B was also released, along with Qwen2.5-Math and Qwen2.5-Coder.

Qwen2.5 showed improved performance in coding, math, and instruction following. Three notable additions include:

  • Qwen2.5-Turbo: A closed-source model handling up to 1 million tokens.
  • QwQ-32B-Preview: An experimental language model focused on complex reasoning tasks.
  • QVQ-72B-Preview: An experimental visual model designed to enhance visual reasoning capabilities.

These advances expand AI capabilities across information processing, reasoning, and practical applications.

Addressing Initial Challenges

1. Context Length: A Critical Challenge in LLMs

According to Reddit discussions, developers often face challenges with context length limitations in LLMs.

The context length of Large Language Models (LLMs) significantly affects their performance. A model's context window determines how much information it can process at once. For example, a 4K token window can handle about six pages of text, while a 32K token window can manage around 49 pages. A shorter context window limits how well the model recalls previous content, often leading to irrelevant or incoherent responses when referencing earlier parts of a conversation.
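The page figures above follow from simple arithmetic. The sketch below assumes roughly 670 tokens per page, a value back-derived from the numbers quoted in this article (about six pages for 4K tokens, about 49 for 32K); real documents vary with language, layout, and tokenizer. `TOKENS_PER_PAGE` and `estimate_pages` are illustrative names, not part of any library.

```python
# Assumption: ~670 tokens per page, back-derived from the figures quoted above.
TOKENS_PER_PAGE = 670


def estimate_pages(context_tokens: int, tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """Return a rough number of pages that fit in a context window."""
    if context_tokens < 0 or tokens_per_page <= 0:
        raise ValueError("context_tokens must be >= 0 and tokens_per_page must be > 0")
    return context_tokens / tokens_per_page


# Compare a few common window sizes (K treated as 1,024 tokens).
for window in (4 * 1024, 32 * 1024, 128 * 1024):
    print(f"{window:>7} tokens ~ {estimate_pages(window):.0f} pages")
```

Treat the result as an order-of-magnitude guide only; counting tokens with the model's own tokenizer is the reliable way to know whether a document fits.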

This constraint also reduces a model's effectiveness in summarizing long documents or maintaining complex dialogues. Essential details that fall outside the current context window may be overlooked, resulting in incomplete or inaccurate outputs.

The Qwen series of models has made substantial progress in overcoming these limitations. Each new version, from Qwen through Qwen2.5, introduced features designed to handle progressively longer contexts.

  • Qwen: Qwen-1.8B, Qwen-7B, and Qwen-72B support up to 32K tokens, and Qwen-14B supports up to 8K tokens, laying the groundwork for longer inputs.
  • Qwen1.5: Offered stable support for 32K tokens across all model sizes, improving its ability to handle larger inputs.
  • Qwen2: Expanded context length to 128K tokens, greatly improving its capacity for large datasets and complex queries. Techniques like YaRN helped maintain performance over long texts.
  • Qwen2.5: The Qwen2.5 models support up to 128K tokens of context, just like Qwen2.
  • Qwen2.5-Turbo: Qwen2.5-Turbo pushed context length to 1 million tokens. This version employs advanced strategies such as Dual Chunk Attention (DCA), Sparse Attention Mechanisms, and Dynamic Sparse Attention with Context Memory.

Notably, Qwen2.5 retains strong performance in both long and short contexts, ensuring that extending context length does not compromise its capabilities with shorter text.
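As a rough planning aid, the limits listed above can be put into a lookup table to check whether a prompt plus its expected output fits a given model. This is a sketch under stated assumptions: the numbers are copied from the list in this article, "K" is treated as 1,024 tokens, and the dictionary keys and function names are illustrative only.

```python
from typing import List

# Context limits in tokens, as stated in the list above.
# Assumption: "K" means 1,024 tokens; 1 million is taken as 1,000,000.
CONTEXT_LIMITS = {
    "Qwen-7B": 32 * 1024,
    "Qwen-14B": 8 * 1024,
    "Qwen1.5": 32 * 1024,
    "Qwen2": 128 * 1024,
    "Qwen2.5": 128 * 1024,
    "Qwen2.5-Turbo": 1_000_000,
}


def fits_in_context(model: str, prompt_tokens: int, max_new_tokens: int = 0) -> bool:
    """True if the prompt plus the requested output fits in the model's window."""
    return prompt_tokens + max_new_tokens <= CONTEXT_LIMITS[model]


def models_that_fit(prompt_tokens: int, max_new_tokens: int = 0) -> List[str]:
    """List every model in the table whose window can hold the request."""
    return [m for m in CONTEXT_LIMITS if fits_in_context(m, prompt_tokens, max_new_tokens)]


print(models_that_fit(prompt_tokens=100_000, max_new_tokens=512))
```

Always confirm the limit on the model card for the exact checkpoint you deploy, since supported context length can differ between sizes within a series.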

2. Multilingual Support and Challenges

Developers on Reddit have also raised questions about multilingual LLM capabilities.

Multilingual support poses its own challenges, including data imbalances, model architecture limitations, and the complexity of language representation.

Qwen addresses these challenges through extensive pretraining on large-scale multilingual data, encompassing up to 18 trillion tokens. This dataset includes a variety of languages, which helps improve the model's understanding and generation capabilities across different linguistic contexts.

Other multilingual models include LLaMA and Aya-Expanse, which support 8 and 23 languages, respectively. In contrast, Gemma and Mistral are trained primarily on English-language data and offer limited support for other languages. Together with Qwen, these multilingual models outshine predominantly monolingual models by addressing the growing demand for robust language generation and understanding in multilingual contexts.

Use Cases and Applications Across Industries

The Qwen models have found diverse applications across multiple industries, leveraging their capabilities in natural language processing, coding assistance, audio understanding, mathematical problem-solving, and visual understanding, as highlighted in the applications and key concepts of generative models.

In the healthcare sector, models like Qwen and Qwen-Math are instrumental in analyzing medical documents. For instance, Med-Qwen2-7B, a fine-tuned version of Qwen, demonstrates improved accuracy in diagnosing medical conditions, generating specialized medical texts, and responding to medical queries with contextually relevant information.

The financial services industry benefits from Qwen models through automated report generation and data analysis. Qwen's natural language processing capabilities enable the summarization of financial documents and extraction of key insights, enhancing decision-making processes.

In customer service, Qwen's instruction-tuned models facilitate the development of intelligent chatbots capable of understanding and responding to customer inquiries in multiple languages, thereby improving user experience and operational efficiency. Qwen-Audio enhances this by analyzing vocal tones to assess customer emotions, allowing for more empathetic and tailored responses.

The technology sector leverages Qwen-Coder for code generation and debugging, aiding developers in automating repetitive tasks and enhancing productivity. Qwen-Coder's proficiency in understanding and generating code snippets makes it a valuable tool for software development and maintenance.

In the media and entertainment industry, Qwen-VL is utilized for image and video analysis, enabling applications such as automated captioning and content moderation. For example, a SaaS application that creates captions for images would employ Qwen-VL to interpret visual content and generate accurate descriptions, enhancing accessibility and user engagement.

Overall, the Qwen series offers versatile solutions across various sectors, driving innovation and efficiency through advanced AI capabilities.

Performance Benchmarks

Language Models

Compared with larger open-source models such as Llama-3.1-405B-Instruct and the more recent, similarly sized Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct delivers strong performance and surpasses them on several benchmarks.

  1. General Task: Qwen2.5-72B-Instruct achieved a score of 86.8, surpassing Llama-3.1-405B's score of 86.2. This indicates that Qwen2.5 is not only competitive but excels in general understanding and reasoning tasks.
  2. Mathematics & Science Task: Qwen2.5-72B-Instruct achieved a score of 83.1 on the MATH benchmark, a significant advance in mathematical reasoning relative to larger and more recent models. QwQ-32B-Preview has also posted competitive scores on the GPQA benchmark, reflecting its proficiency in both math and science-related queries.
  3. Coding Task: The performance of Qwen2.5-72B-Instruct and QwQ-32B-Preview on various coding benchmarks demonstrates their superior capabilities compared to other models.

Qwen2.5-72B-Instruct has shown significant advancements in coding tasks, achieving a score of 86.6 and 78.4 on HumanEval and HumanEval+, and excelling in benchmarks such as MBPP with a score of 88.2. This model has outperformed its predecessor and other models in the same category, showcasing its prowess in generating and understanding code effectively.

Vision Models

Compared with other leading vision-language models, QVQ-72B-Preview shows strong overall performance, demonstrated in particular by its score of 70.3 on the MMMU benchmark. It also surpasses its predecessor (Qwen2-VL-72B-Instruct) and performs well on mathematics and science problems in test sets such as MathVista, MathVision, and OlympiadBench, bringing it closer to state-of-the-art models such as OpenAI's o1.

Open-Source Community Engagement

Qwen maintains a strong open-source presence by providing models on platforms like Hugging Face and ModelScope. The project also runs an active Discord community where developers can get real-time support, share experiences, and collaborate on solutions. Additionally, you can connect with Junyang Lin, a key member of the Qwen Team who actively engages with the community on X (formerly Twitter) to address queries and share updates. Below are some essential resources provided by the Qwen Team:

  • Example Code & Documentation: GitHub repositories with integration examples and detailed documentation.
  • Developer Tools: Comprehensive tools and guides for both quantization and fine-tuning help developers optimize model performance and customize models for specific applications.
  • Deployment Support: Comprehensive deployment guides cover various scenarios, from Docker containers to cloud platforms. These resources include performance optimization tips and troubleshooting guides.

The Qwen Agent framework further simplifies development by providing tools for building intelligent agents.

Inference Options for Qwen Models

As organizations increasingly look to deploy these Qwen models for various applications, understanding the available inference options is crucial.

Qwen models can be deployed using several inference methods that cater to different needs and environments, and you can compare Machine Learning Libraries to find the best fit for your project. Here are some of the prominent options:

  • vLLM: This framework is designed for efficient large language model (LLM) inference. It optimizes memory usage and speed, making it suitable for high-performance applications.
  • SGLang: SGLang focuses on scalability, allowing users to deploy models that can handle varying loads without compromising performance.
  • SkyPilot: This platform simplifies the deployment of machine learning models on cloud infrastructure, providing a user-friendly interface and automated scaling features.
  • TensorRT-LLM: NVIDIA's TensorRT is optimized for deep learning inference. TensorRT-LLM enhances Qwen model performance on NVIDIA GPUs through techniques like layer fusion and precision calibration. Check out our step-by-step guide on using TensorRT-LLM.
  • OpenVINO: Intel's OpenVINO toolkit enables the optimization of deep learning models for Intel hardware, ensuring efficient execution on CPUs and VPUs.
  • TGI: TGI provides a streamlined approach to deploying text generation models, focusing on low-latency inference suitable for real-time applications.
  • Xinference: This framework emphasizes cross-platform compatibility and efficient resource utilization, making it an attractive option for diverse deployment scenarios.
  • MLX: Apple's machine learning framework for Apple silicon, which lets users run Qwen models on local Macs with minimal setup. It supports various model sizes and configurations.
  • Llama.cpp: A plain C/C++ implementation without any dependencies that focuses on efficiency and ease of use, allowing developers to integrate Qwen models into existing applications seamlessly.
  • Ollama: Ollama provides a straightforward interface for running Qwen models locally, making it accessible even for those with limited technical expertise.
  • LM Studio: A desktop application for downloading, running, and chatting with LLMs locally, offering built-in support for Qwen model deployment and testing.
  • Jan: A lightweight option that allows for quick local testing of Qwen models without extensive configuration requirements.
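Several of the tools above (vLLM, Ollama, and LM Studio among them) can expose an OpenAI-compatible HTTP endpoint, so one client can often talk to different backends. The sketch below only builds the request with the standard library; the server address, the model name, and the helper `build_chat_request` are assumptions chosen for illustration, and sending the request requires a server you have started yourself.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local address and example model name; adjust both to your setup.
request = build_chat_request(
    "http://localhost:8000",
    "Qwen/Qwen2.5-7B-Instruct",
    "Tell me something about large language models.",
)
print(request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Keeping the client independent of the backend makes it easier to compare inference options without rewriting application code.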

Step-by-Step Guide for Inference of Qwen Models

This guide provides a comprehensive approach to deploying the Qwen model using the vLLM framework. Follow these steps to set up and utilize the Qwen model effectively.

Prerequisites

  1. Python Environment: Ensure you have Python installed (preferably Python 3.8 or later).
  2. Install Required Packages: Install the required libraries using pip:
pip install vllm==0.6.2 transformers==4.45.2
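Before moving on, it can help to confirm that the environment matches these prerequisites. The following is a minimal sketch using only the standard library; the helper names `meets_minimum` and `installed_version` are illustrative, and the 3.8 threshold is simply the one stated above.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # threshold taken from the prerequisite above


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Compare the (major, minor) part of a version tuple with the minimum."""
    return tuple(version_info[:2]) >= minimum


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python OK:", meets_minimum())
for name in ("vllm", "transformers"):
    print(name, installed_version(name) or "not installed")
```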

Step 1: Initialize the Tokenizer and Model

Begin by importing necessary libraries and initializing the tokenizer and model.

from vllm import LLM
from vllm.sampling_params import SamplingParams
from transformers import AutoTokenizer

# Define the Model name
model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
llm = LLM(model=model_id)

# Set up sampling parameters for generation
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    max_tokens=512,
    top_k=40
)
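To build intuition for what `temperature`, `top_k`, and `top_p` do, here is a simplified, pure-Python illustration on a toy distribution. It is a teaching sketch, not vLLM's implementation: the function `filter_distribution` and the toy logits are made up for this example, and `repetition_penalty` is not modeled.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.8):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((tok, e / total) for tok, e in exp.items()),
                   key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"the": 2.0, "a": 1.0, "model": 0.5, "banana": -3.0}
print(filter_distribution(toy_logits))
```

Lower `temperature`, `top_k`, or `top_p` values narrow the set of candidate tokens and make output more deterministic; higher values allow more variety.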

Step 2: Prepare Your Prompt

Define the prompt that you want to use for generating responses from the model.

# Prepare your prompt
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
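Qwen chat models expect these messages to be rendered in a ChatML-style layout before generation, which is what the tokenizer's chat template does in Step 3. The sketch below approximates that layout in plain Python so the structure is visible; `render_chatml` is a hypothetical helper, and the tokenizer's own template remains the source of truth since it may add details (such as default system prompts or tool handling) that are omitted here.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in a simplified ChatML-style layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."},
]
print(render_chatml(demo_messages))
```

In practice, prefer the tokenizer's `apply_chat_template` over hand-built strings so the prompt always matches what the model was trained on.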

Step 3: Generate Responses

Use the LLM instance to generate responses based on your prepared messages.

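# Apply the chat template and append the assistant turn marker
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate a response from the templated prompt string
response = llm.generate(input_text, sampling_params=sampling_params)

# Collect the first completion for each prompt
result_output = [output.outputs[0].text for output in response]
print(result_output)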

Step 4: If You are Using Quantized Models

If you're working with quantized models for efficiency, you can specify quantization parameters when initializing your model.

For example, to use an AWQ quantized model:

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ", quantization="awq")

You can then follow the same steps to run the quantized model.

We have also published deployment guides for several Qwen models, which you can check out here:
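To see why quantization matters for a model of this size, a back-of-the-envelope estimate of weight memory is useful. The sketch below assumes a nominal 32 billion parameters, 16 bits per weight for FP16/BF16, and 4 bits per weight for AWQ; it ignores the KV cache, activations, and the small overhead of quantization scales, so real usage will be higher. `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare full-precision-style weights with 4-bit quantized weights.
for label, bits in (("FP16/BF16", 16), ("AWQ 4-bit", 4)):
    print(f"32B model, {label}: ~{weight_memory_gb(32, bits):.0f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the memory of the 16-bit weights, which is often the difference between needing several GPUs and fitting on one.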

  1. Qwen 2.5-Coder-32B-Instruct
  2. Qwen-2-VL-7B
  3. Qwen2-72B
  4. Qwen's QwQ-32B-Preview

Conclusion

The Qwen model series represents a significant milestone in AI development, combining powerful capabilities with practical accessibility. From its initial release to the latest Qwen2.5 series, these models have demonstrated impressive performance across various benchmarks, particularly in coding and mathematical reasoning tasks.

What sets Qwen apart is its strong commitment to open-source development and community engagement. Through comprehensive documentation, development tools, and resources available on platforms like Hugging Face and ModelScope, Alibaba has created an ecosystem that enables developers worldwide to leverage and build upon this technology.

The variety of inference options available, from vLLM to TensorRT-LLM, ensures that organizations can deploy Qwen models in ways that best suit their specific needs and infrastructure requirements. With detailed deployment guides and support for various platforms, Qwen has positioned itself as a versatile and accessible solution for both individual developers and enterprise applications.

Introduction

The Qwen model series, developed by Alibaba Group's Qwen Team, represents a significant advancement in the field of generative models.Since its initial beta release in April 2023, Qwen has evolved into a comprehensive suite of models, each tailored to address specific challenges in natural language processing and understanding.

Qwen, also known as Tongyi Qianwen, series have undergone several iterations, with notable releases including Qwen2.5 models, QwQ-32B-Preview and QvQ-72B-Preview model. These models have been pre-trained on extensive multilingual and multimodal datasets, ensuring their applicability across various languages and modalities. They have been further refined with high-quality data to align with human preferences, enhancing their effectiveness in real-world applications.

Types of Qwen Models

The Qwen family encompasses a diverse array of models, each designed to cater to specific domains and tasks:

  • Qwen: The foundational large language model, proficient in natural language understanding and text generation
    • Qwen Chat & Instruct: Chat & Instruct models are trained with techniques like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), optimized for conversational interactions.
  • Qwen-VL: A large vision-language model capable of processing and generating content that combines visual and textual information.
  • Qwen-Audio: Designed for audio-related tasks, excelling in processing and understanding various audio inputs—including human speech, natural sounds, music, and songs—and generating text as output.
  • Qwen-Coder: A specialized model aimed at coding assistance, offering support in programming and software development tasks.
  • Qwen-Math: Focused on mathematical problem-solving, this model aids in understanding and generating mathematical content.
  • Qwen-QwQ Model: The Qwen Team introduced QwQ-32B-Preview, an experimental research model designed to advance AI reasoning capabilities.
  • Qwen-QVQ Model: QvQ-72B-Preview is an experimental research model focusing on enhancing visual reasoning capabilities.

Each model variant is built upon the base Qwen architecture, with specialized training to enhance performance in their respective domains, reflecting the principles of generative models in AI. This modular approach allows for targeted applications while maintaining the robust capabilities of the core model.

Qwen ‘s Key Milestones


The Qwen series began with the Qwen-7B model built with a transformer-based architecture, trained on up to 3 trillion tokens of diverse content, including text and code. Key innovations like rotary positional embeddings and flash attention helped improve training efficiency and performance, as discussed in an in-depth explanation of generative AI.

Over time, the Qwen family expanded with both larger and more specialized models:

  • Qwen-14B and Qwen-72B: Larger versions with more parameters, offering deeper language understanding and more accurate text generation.
  • Qwen-VL: A vision-language model that understood and generated responses from both text and visual inputs.
  • Qwen-1.8: A small-sized model with about 1.8 billion parameters, aiming for a balance between performance and efficiency.
  • Qwen Audio: A model focused on audio processing, allowing it to handle speech.

Then, the Qwen-1.5 series introduced improved long-context handling, making conversations more natural and engaging. Ranging from 0.5 billion to 110 billion parameters, these models could maintain context over longer interactions, improving how AI systems communicate with users.

In September 2024, Qwen released their Qwen-2 series, featuring improved data quality and training methods. Trained on 7 trillion tokens, these models underwent supervised fine-tuning and Direct Preference Optimization (DPO) for better human alignment.

The Qwen-2 series included:

  • Qwen2-0.5B to Qwen2-72B: Models of various sizes for different computing needs.
  • Qwen2-57B-A14B (MoE): A mixture-of-experts model for efficient parameter usage.
  • Specialized Models: Including Qwen2-VL, Qwen2-Audio, and Qwen2-Math.


The subsequent Qwen2.5 series expanded the training dataset to 18 trillion tokens and introduced cost-effective models like Qwen2.5-14B and Qwen2.5-32B. A mobile-friendly Qwen2.5-3B was also released. They have also released Qwen2.5-Math and Qwen2.5-Coder.

Qwen2.5 showed improved performance in coding, math, and instruction following. Two notable additions include:

  • Qwen2.5-Turbo: A closed-source model handling up to 1 million tokens
  • QwQ-32B-Preview: An experimental language model focused on complex reasoning tasks
  • QVQ-72B-Preview: An experimental visual model designed to enhance visual reasoning capabilities.

These advances expand AI capabilities across information processing, reasoning, and practical applications.

Addressing Initial Challenges

1. Context Length a Critical Challenge in LLMs

According to Reddit discussions, developers often face challenges with context length limitations in LLMs. Two key questions frequently arise:

The context length of Large Language Models (LLMs) significantly affects their performance. A model's context window determines how much information it can process at once. For example, a 4K token window can handle about six pages of text, while a 32K token window can manage around 49 pages. A shorter context window limits how well the model recalls previous content, often leading to irrelevant or incoherent responses when referencing earlier parts of a conversation.

This constraint also reduces a model's effectiveness in summarizing long documents or maintaining complex dialogues. Essential details that fall outside the current context window may be overlooked, resulting in incomplete or inaccurate outputs.

The Qwen series of models has made substantial progress in overcoming these limitations. Each new version from Qwen through Qwen-2.5, introduced features designed to handle increasingly longer contexts.

  • Qwen: Qwen 1.8B, 7B and 72B supports upto 32k and  Qwen-14B supports upto 8k tokens, laying the groundwork for longer inputs.
  • Qwen1.5: All models offered stable support for 32K tokens across all model sizes, boosting its ability to handle larger inputs.
  • Qwen2: Expanded context length to 128K tokens, greatly improving its capacity for large datasets and complex queries. Techniques like YARN helped maintain performance over long texts.
  • Qwen2.5: The Qwen2.5 models support upto 128k context length just like Qwen2.
  • Qwen2.5-Turbo: Qwen2.5-Trubo pushed context length to an unprecedented 1 million tokens. This version employs advanced strategies such as Dual Chunk Attention (DCA), Sparse Attention Mechanisms, and Dynamic Sparse Attention with Context Memory.

Notably, Qwen2.5 retains strong performance in both long and short contexts, ensuring that extending context length does not compromise its capabilities with shorter text.

2. Multilingual Support and Challenges

Developers on Reddit have highlighted key questions about multilingual LLM capabilities:

Another challenge is the multilingual support in LLMs which have significant challenges like data imbalances, model architecture limitations, and the complexity of language representation.

Qwen addresses the challenges of multilingual support in LLMs through a combination of extensive pretrained on large-scale multilingual data, encompassing up to 18 trillion tokens. This extensive dataset includes a variety of languages, which helps improve the model's understanding and generation capabilities across different linguistic contexts.

Other multilingual models include LLaMA and Aya-Expanse, which support 8 and 23 languages, respectively. In contrast, Gemma and Mistral are trained on english language, offering no support for other languages. These three models outshine monolingual models  by addressing the growing demand for robust language generation and understanding in multilingual contexts.

Use Cases and Applications Across Industries

The Qwen models, have found diverse applications across multiple industries, leveraging their capabilities in natural language processing, coding assistance, audio generation, mathematical problem-solving, and visual understanding, as highlighted in the applications and key concepts of generative models.

In the healthcare sector, models like Qwen and Qwen-Math are instrumental in analyzing medical documents. For instance, Med-Qwen2-7B, a fine-tuned version of Qwen, demonstrates improved  accuracy in diagnosing medical conditions, generating specialized medical texts, and responding to medical queries with contextually relevant information.

The financial services industry benefits from Qwen models through automated report generation and data analysis. Qwen's natural language processing capabilities enable the summarization of financial documents and extraction of key insights, enhancing decision-making processes.

In customer service, Qwen’s instruction tuned models facilitates the development of intelligent chatbots capable of understanding and responding to customer inquiries in multiple languages, thereby improving user experience and operational efficiency. Qwen-Audio enhances this by analyzing vocal tones to assess customer emotions, allowing for more empathetic and tailored responses.

The technology sector leverages Qwen-Coder for code generation and debugging, aiding developers in automating repetitive tasks and enhancing productivity. Qwen-Coder's proficiency in understanding and generating code snippets makes it a valuable tool for software development and maintenance.

In the media and entertainment industry, Qwen-VL is utilized for image and video analysis, enabling applications such as automated captioning and content moderation. For example, a SaaS application that creates captions for images would employ Qwen-VL to interpret visual content and generate accurate descriptions, enhancing accessibility and user engagement.

Overall, the Qwen series offers versatile solutions across various sectors, driving innovation and efficiency through advanced AI capabilities.

Performance Benchmarks

Language Models

When compared to larger open-source models like  Llama3.1-405B Instruct and the latest released similar size model like Llama-3.3-70B-Instruct, it delivers exceptional performance, even surpassing many benchmarks.

  1. General Task: Qwen2.5-72B-Instruct achieved a score of 86.8, surpassing Llama-3.1-405B's score of 86.2. This indicates that Qwen2.5 is not only competitive but excels in general understanding and reasoning tasks.
  2. Mathematics & Science Task: The Qwen2.5-72B-Instruct model achieved a score of 83.1 on the MATH benchmark, showcasing significant advancements in mathematical reasoning capabilities compared to larger and latest models. The QwQ-32B-preview model has also demonstrated strong capabilities in the GPQA score, achieving competitive scores that reflect its proficiency in both math and science-related queries.
  3. Coding Task: The performance of Qwen2.5-72B-Instruct and QwQ-32B-Preview on various coding benchmarks demonstrates their superior capabilities compared to other models.

Qwen2.5-72B-Instruct has shown significant advancements in coding tasks, achieving a score of 86.6 and 78.4 on HumanEval and HumanEval+, and excelling in benchmarks such as MBPP with a score of 88.2. This model has outperformed its predecessor and other models in the same category, showcasing its prowess in generating and understanding code effectively.

Vision Models

Compared to other leading vision language, QvQ-72B-Preview shows strong overall performance particularly demonstrated by its 70.3 score on the MMMU benchmark. It also surpasses its predecessor (Qwen2-VL-72B-Instruct) and excels at mathematics and science problems on test sets like MathVista, MathVision, and OlympiadBench, bringing it closer to top SOTA o1 model.

Open-Source Community Engagement

Qwen maintains a strong open-source presence by providing models on platforms like Hugging Face and ModelScope. The project maintains an active Discord community (join here) where developers can get real-time support, share experiences, and collaborate on solutions. Additionally, you can connect with Junyang Lin, a key member of the Qwen Team who actively engages with the community on X (formerly Twitter) to address queries and share updates. Below are some essential resources provided by the Qwen Team:

  • Example Code & Documentation: GitHub repositories with integration examples and detailed documentation.
  • Developer Tools: Tools and guides for both quantization and fine-tuning, which help developers optimize model performance and customize models for specific applications.
  • Deployment Support: Comprehensive deployment guides cover various scenarios, from Docker containers to cloud platforms. These resources include performance optimization tips and troubleshooting guides.

The Qwen Agent framework further simplifies development by providing tools for building intelligent agents.

Inference Options for Qwen Models

As organizations increasingly look to deploy these Qwen models for various applications, understanding the available inference options is crucial.

Qwen models can be deployed using several inference methods that cater to different needs and environments, and you can compare Machine Learning Libraries to find the best fit for your project. Here are some of the prominent options:

  • vLLM: This framework is designed for efficient large language model (LLM) inference. It optimizes memory usage and speed, making it suitable for high-performance applications.
  • SGLang: SGLang focuses on scalability, allowing users to deploy models that can handle varying loads without compromising performance.
  • SkyPilot: This platform simplifies the deployment of machine learning models on cloud infrastructure, providing a user-friendly interface and automated scaling features.
  • TensorRT-LLM: NVIDIA's TensorRT is optimized for deep learning inference. TensorRT-LLM enhances Qwen model performance on NVIDIA GPUs through techniques like layer fusion and precision calibration. Check out our step-by-step guide on using TensorRT-LLM.
  • OpenVINO: Intel's OpenVINO toolkit optimizes deep learning models for Intel hardware, enabling efficient execution on CPUs and VPUs.
  • TGI: Text Generation Inference (TGI), from Hugging Face, provides a streamlined approach to deploying text generation models, focusing on low-latency inference suitable for real-time applications.
  • Xinference: This framework emphasizes cross-platform compatibility and efficient resource utilization, making it an attractive option for diverse deployment scenarios.
  • MLX: Apple's machine learning framework for Apple silicon, which lets users run Qwen models on local machines with minimal setup. It supports various model sizes and configurations.
  • llama.cpp: A plain C/C++ implementation without external dependencies that focuses on efficiency and ease of use, allowing developers to integrate Qwen models into existing applications.
  • Ollama: Ollama provides a straightforward interface for running Qwen models locally, making it accessible even for those with limited technical expertise.
  • LM Studio: A desktop application for downloading and running language models locally, offering built-in support for running and testing Qwen models.
  • Jan: A lightweight option that allows for quick local testing of Qwen models without extensive configuration requirements.
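
Several of the servers above, such as vLLM and Ollama, can expose an OpenAI-compatible HTTP endpoint, so the same client code can often be reused across backends. The sketch below is a minimal illustration using only the Python standard library. The helper names (build_chat_request, send_chat_request) are hypothetical, and the base URL assumes a server listening locally on port 8000; adjust both the URL and the model name to match your own deployment.

```python
import json
import urllib.request


def build_chat_request(model, user_prompt,
                       base_url="http://localhost:8000/v1", max_tokens=512):
    """Build (but do not send) a chat-completions request.

    base_url is an assumption: point it at whichever server you are running.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; sending it does.
request = build_chat_request("Qwen/Qwen2.5-Coder-32B-Instruct", "Hello")
print(request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a live server.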

Step-by-Step Guide for Inference of Qwen Models

This guide provides a comprehensive approach to deploying the Qwen model using the vLLM framework. Follow these steps to set up and utilize the Qwen model effectively.

Prerequisites

  1. Python Environment: Ensure you have Python installed (preferably Python 3.8 or later).
  2. Install Required Packages: Install the required libraries using pip:
pip install vllm==0.6.2 transformers==4.45.2
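
Before moving on, it can help to confirm that the pinned versions are the ones actually installed. The snippet below is a small sketch using the standard library's importlib.metadata (available from Python 3.8); the installed_version helper is a hypothetical name introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Compare what is installed against the versions pinned in the pip command.
for name, wanted in {"vllm": "0.6.2", "transformers": "4.45.2"}.items():
    found = installed_version(name)
    status = "ok" if found == wanted else f"expected {wanted}"
    print(f"{name}: {found} ({status})")
```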

Step 1: Initialize the Tokenizer and Model

Begin by importing necessary libraries and initializing the tokenizer and model.

from vllm import LLM
from vllm.sampling_params import SamplingParams
from transformers import AutoTokenizer

# Define the Model name
model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
llm = LLM(model=model_id)

# Set up sampling parameters for generation
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    max_tokens=512,
    top_k=40
)
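
To build some intuition for what temperature, top_k, and top_p do, here is a simplified, pure-Python sketch of how these settings narrow the set of candidate tokens. It is an illustration only, not vLLM's actual implementation; it leaves out repetition_penalty, and the filter_probs helper is a hypothetical name.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=40, top_p=0.8):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]
    k_total = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


# Toy vocabulary of four tokens.
print(filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8))
```

Lower values of any of the three settings make generation more deterministic; higher values allow more variety.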

Step 2: Prepare Your Prompt

Define the prompt that you want to use for generating responses from the model.

# Prepare your prompt
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

Step 3: Generate Responses

Use the LLM instance to generate responses based on your prepared messages.

# Apply the chat template to turn the messages into a single prompt string
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate a response from the formatted prompt
response = llm.generate([input_text], sampling_params=sampling_params)
result_output = [output.outputs[0].text for output in response]
print(result_output)
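
For reference, Qwen chat models use a ChatML-style prompt format, and the chat template converts the list of messages into that format. The sketch below approximates the result for simple system and user messages. The to_chatml helper is hypothetical and ignores features such as tool calls, so treat the tokenizer's own template as the authoritative version.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt for a list of role/content messages."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."},
]
print(to_chatml(example_messages))
```

Printing the formatted prompt is a quick way to check what the model actually receives before you start tuning sampling parameters.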

Step 4: If You Are Using Quantized Models

If you're working with quantized models for efficiency, you can specify quantization parameters when initializing your model.

For example, to use an AWQ quantized model:

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ", quantization="awq")

Now you can follow the same process to use the quantized model.
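
To get a rough sense of why quantization matters, the arithmetic below estimates the memory needed for the model weights alone. It is a back-of-the-envelope sketch under stated assumptions: a nominal 32 billion parameters, 16-bit weights for the unquantized model, and 4-bit weights for AWQ. It ignores the KV cache, activations, and the small overhead of quantization scales, so real usage will be higher. The weight_memory_gib helper is a hypothetical name.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for model weights in GiB (weights only)."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 32e9  # assumption: nominal parameter count for a 32B model

fp16 = weight_memory_gib(NUM_PARAMS, 16)
awq4 = weight_memory_gib(NUM_PARAMS, 4)
print(f"16-bit weights: ~{fp16:.1f} GiB")
print(f"4-bit weights:  ~{awq4:.1f} GiB")
```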

We have also published deployment guides for several Qwen models, which you can check out here:

  1. Qwen 2.5-Coder-32B-Instruct
  2. Qwen-2-VL-7B
  3. Qwen2-72B
  4. Qwen's QwQ-32B-Preview

Conclusion

The Qwen model series represents a significant milestone in AI development, combining powerful capabilities with practical accessibility. From its initial release to the latest Qwen2.5 series, these models have demonstrated impressive performance across various benchmarks, particularly in coding and mathematical reasoning tasks.

What sets Qwen apart is its strong commitment to open-source development and community engagement. Through comprehensive documentation, development tools, and resources available on platforms like Hugging Face and ModelScope, Alibaba has created an ecosystem that enables developers worldwide to leverage and build upon this technology.

The variety of inference options available, from vLLM to TensorRT-LLM, ensures that organizations can deploy Qwen models in ways that best suit their specific needs and infrastructure requirements. With detailed deployment guides and support for various platforms, Qwen has positioned itself as a versatile and accessible solution for both individual developers and enterprise applications.
