The Ultimate Guide to DeepSeek Models

January 31, 2025
3 mins read
Aishwarya Goel
CoFounder & CEO
Rajdeep Borgohain
DevRel Engineer

Introduction

In the rapid development of open-source large language models (LLMs), DeepSeek's models represent a significant advance. Since the first releases of DeepSeek-Coder, they have garnered attention for their innovative approaches, particularly their attention mechanisms and Mixture-of-Experts (MoE) architecture. These innovations have not only improved model efficiency but have also challenged existing paradigms within the AI community.

Most recently, they introduced DeepSeek-V3 and the DeepSeek-R1 models, both built on a 671-billion-parameter MoE architecture. DeepSeek-R1 achieves performance comparable to OpenAI's o1.

Types of DeepSeek Models

DeepSeek has developed a diverse range of models tailored to various applications in natural language processing, coding, and mathematical reasoning. Below is an overview of the different types of DeepSeek models:

  1. DeepSeek-R1: DeepSeek-R1 is their latest first-generation reasoning model, which matches OpenAI's o1 on several benchmarks. They also released DeepSeek-R1-Zero, trained solely through large-scale reinforcement learning without supervised fine-tuning; it naturally developed reasoning behaviors such as self-verification and reflection.
  2. DeepSeekMoE: DeepSeekMoE is an innovative architecture within the DeepSeek model family, specifically designed to enhance the performance and specialization of LLMs through an MoE approach. This architecture has evolved through multiple iterations, including DeepSeek-V2, DeepSeek-V2.5, and the latest DeepSeek-V3.
  3. DeepSeek LLM: The DeepSeek LLM is a language model for text generation. With versions like DeepSeek LLM 7B and DeepSeek LLM 67B, these models are trained on extensive datasets comprising 2 trillion tokens in both English and Chinese.
  4. DeepSeek-Coder: DeepSeek-Coder models represent significant progress on coding-specific tasks, comparable to closed-source models. The first version of DeepSeek-Coder was engineered to assist programmers by providing code generation capabilities in over 80 programming languages. The latest version, DeepSeek-Coder-V2, marks a significant leap forward: with 236 billion parameters, it has been pre-trained on an extensive dataset of 6 trillion tokens, enhancing its coding and mathematical reasoning abilities.
  5. DeepSeek-VL: DeepSeek-VL models are designed to enhance multimodal understanding. The original DeepSeek-VL model is built upon the DeepSeek-LLM-1.3B-base model and has been trained on approximately 500 billion text tokens and 400 billion vision-language tokens. DeepSeek-VL2 introduces significant improvements in performance and efficiency; this series consists of multiple variants, DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with activated parameters ranging from 1.0 billion to 4.5 billion.
  6. DeepSeek-Math: DeepSeek-Math is specifically designed to tackle complex mathematical reasoning tasks. Building upon the DeepSeek-Coder-Base-v1.5 7B model, it undergoes continuous pre-training on a substantial dataset comprising 120 billion math-related tokens sourced from Common Crawl, along with natural language and code data.
  7. DeepSeek-Prover: DeepSeek-Prover is an open-source language model developed to advance automated theorem proving within the Lean 4 proof assistant framework. By leveraging large-scale synthetic data and innovative training techniques, it aims to enhance the efficiency and accuracy of formal mathematical proofs. It has undergone significant advancements from its initial version (V1) to the enhanced V1.5, resulting in improved performance in formal theorem proving tasks.
  8. Janus: Janus is a novel autoregressive framework that unifies multimodal understanding and generation by decoupling visual encoding into separate pathways for understanding and generation tasks. The series includes multiple variants, such as JanusFlow and Janus-Pro.

DeepSeek's Key Milestones

DeepSeek has consistently pushed AI research boundaries. Below are some major releases:

  • DeepSeek-Coder (Nov 2, 2023) – A commercial-grade coding model (1.3B–33B parameters) based on the Llama architecture.
  • DeepSeek LLM (Nov 29, 2023) – A 67B model outperforming LLaMA-2 70B in reasoning, coding, math, and Chinese comprehension.
  • DeepSeekMoE 16B (Jan 11, 2024) – First MoE model with 2.8B active parameters, boosting efficiency.
  • DeepSeek-Math (Feb 6, 2024) – A 7B model scoring 51.7% on MATH benchmarks, approaching Gemini-Ultra/GPT-4 performance.
  • DeepSeek-VL (Mar 11, 2024) – A vision-language model handling 1024×1024 images with low computational cost.
  • DeepSeek-V2 (May 6, 2024) – A 236B MoE model ranking top 3 on AlignBench, competing with GPT-4-Turbo.
  • DeepSeek-Coder-V2 (June 17, 2024) – A coding MoE model surpassing GPT-4 Turbo, supporting 338 languages with 128K context length.
  • DeepSeek-Prover-V1.5 (Aug 15, 2024) – Achieved SOTA results in theorem proving via RLPAF and RMaxTS algorithms.
  • DeepSeek-V2.5 (Sep 6, 2024) – Combined strengths of DeepSeek-V2-0628 and DeepSeek-Coder-V2-0724, outperforming both.
  • DeepSeek R1-Lite-Preview (Nov 20, 2024) – A reasoning model excelling in logical inference, math, and problem-solving.
  • DeepSeek-VL2 (Dec 13, 2024) – A multimodal MoE model with competitive performance at lower computational cost.
  • DeepSeek-V3 (Dec 27, 2024) – A 671B MoE model (37B active parameters), outperforming LLaMA 3.1 and Qwen 2.5 while rivaling GPT-4o.
  • DeepSeek-R1 & DeepSeek-R1-Zero (Jan 20, 2025) – R1 rivals OpenAI’s o1, while R1-Zero explores reinforcement learning-only training.
  • Janus-Pro (Jan 27, 2025) – A multimodal model excelling in text-to-image generation, outperforming DALL-E 3 and Stable Diffusion.

More about DeepSeek-R1

DeepSeek-R1 is a flagship reasoning model that excels at solving math and reasoning problems. The training process incorporates multi-stage training and cold-start data before RL.

Source: DeepSeek-R1-Report

Here’s an overview of their training process:

  1. First-Stage Reinforcement Learning: They trained a model with large-scale RL and no supervised fine-tuning, which resulted in DeepSeek-R1-Zero. DeepSeek-R1-Zero showed steady performance improvement during RL training, with its AIME 2024 average pass@1 score rising from 15.6% to 71.0%, matching OpenAI-o1-0912 levels. They then used this model to create a synthetic dataset for supervised fine-tuning (SFT), consisting of reasoning problems generated by DeepSeek-R1-Zero itself and providing a strong initial foundation for the next stage.
  2. SFT on Synthetic Data: Using the synthetic dataset from DeepSeek-R1-Zero, the base model, DeepSeek-V3-Base, undergoes supervised fine-tuning. This cold-start step improves readability before reinforcement learning is applied.
  3. Large-Scale Reinforcement Learning on Reasoning Tasks: After fine-tuning DeepSeek-V3-Base on cold-start data, reinforcement learning is applied, following the same large-scale training process as DeepSeek-R1-Zero. This phase targets reasoning-intensive tasks like coding, mathematics, science, and logical reasoning.
  4. Rejection Sampling for Further Optimization: After reasoning-oriented RL converges, the resulting model checkpoint is used to collect supervised fine-tuning (SFT) data for the next training phase. This stage integrates data beyond reasoning, including writing, role-playing, and other general tasks (a small illustrative sketch of this curation step follows the list).
    - Reasoning Data: Rejection sampling from the RL-trained model is used to curate high-quality reasoning samples (~600k). Additional data is evaluated via a generative reward model, and unclear outputs (e.g., mixed languages, long paragraphs, code blocks) are filtered out.
    - Non-Reasoning Data: Includes tasks like writing, factual QA, self-cognition, and translation (~200k). Some responses use chain-of-thought (CoT) via DeepSeek-V3, while simpler ones don't. The DeepSeek-V3-Base model is fine-tuned on ~800k samples over two epochs to improve its general capabilities.
  5. Final RL Training: In the last stage, to better align the model with human preferences, they introduced a second RL stage to enhance helpfulness, harmlessness, and reasoning. Helpfulness is judged by the final summary, ensuring relevance, while harmlessness is assessed across the entire response to mitigate risks and biases. This approach refines reasoning while prioritizing safety and user benefit.
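
The rejection-sampling step in stage 4 can be pictured with a short sketch. This is purely illustrative: generate_candidates, reward_model, and looks_clean are hypothetical stand-ins for components DeepSeek has not released as code.

def curate_sft_samples(prompts, generate_candidates, reward_model,
                       looks_clean, n_samples=16, min_score=0.8):
    """For each prompt, sample several completions from the RL-trained
    checkpoint, score them, and keep only clean, high-reward answers."""
    curated = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, n=n_samples)
        # Drop outputs with mixed languages, very long paragraphs, stray code, etc.
        candidates = [c for c in candidates if looks_clean(c)]
        if not candidates:
            continue
        # Keep the best-scoring completion as an SFT training pair.
        best = max(candidates, key=lambda c: reward_model(prompt, c))
        if reward_model(prompt, best) >= min_score:
            curated.append({"prompt": prompt, "response": best})
    return curated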

Dataset used for DeepSeek Models Training

DeepSeek's dataset strategy centers on building a highly diverse and expansive training corpus for its large language models. DeepSeek models have been trained on datasets ranging from 2 trillion to 14.8 trillion tokens, which broadens their multilingual coverage. The DeepSeek-V3 dataset was built to ensure a rich mix of text types, languages, and informational content, focusing not just on quantity but also on quality and variety, including a significant portion of high-quality multilingual data to foster a comprehensive understanding of diverse linguistic nuances. For DeepSeek-R1, they curated about 600k reasoning-related training samples and 200k samples unrelated to reasoning.

A crucial part of their strategy was to eliminate redundancy and maximize the information density within the training dataset. DeepSeek employed advanced deduplication techniques to remove duplicate instances of data across multiple data dumps, achieving an effective reduction in data repetition.

In addition to deduplication, DeepSeek implemented robust filtering criteria to ensure data quality. This involved linguistic and semantic evaluations to maintain a high standard of dataset integrity. The remixing stage of their dataset creation involved adjusting the dataset composition to address any imbalances, ensuring a broad representation across different domains.
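
As a concrete illustration of the deduplication idea (not DeepSeek's published pipeline), exact-match deduplication across dumps can be as simple as hashing normalized documents and keeping the first occurrence; large-scale corpora typically layer fuzzy methods such as MinHash on top.

import hashlib

def deduplicate(documents):
    # Keep the first occurrence of each document, keyed by a hash of the
    # whitespace- and case-normalized text. Illustrative only.
    seen = set()
    unique_docs = []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

docs = ["Hello   world", "hello world", "A different document"]
print(deduplicate(docs))  # ['Hello   world', 'A different document']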

Contributions of DeepSeek

DeepSeek AI has made significant contributions through its research, particularly DeepSeek-R1 and DeepSeek-V3.

  1. Development Pipeline of DeepSeek-R1: The development pipeline for DeepSeek-R1 incorporates two reinforcement learning (RL) stages aimed at discovering improved reasoning patterns and aligning with human preferences. Additionally, it includes two supervised fine-tuning (SFT) stages that serve as the seed for the model’s reasoning and non-reasoning capabilities. This comprehensive pipeline is designed to create better models by combining RL and SFT approaches.
  2. Distilling Reasoning Capabilities: DeepSeek has developed an innovative methodology to distill reasoning capabilities from the DeepSeek-R1 series models into standard large language models (LLMs), particularly DeepSeek-V3. This process involves integrating the verification and reflection patterns of R1 into DeepSeek-V3, resulting in improved reasoning performance.
  3. FP8 Mixed Precision Training Framework: DeepSeek-V3 is notable for its implementation of an FP8 mixed precision training framework. This approach involves using 8-bit floating-point (FP8) precision during training, which reduces memory usage and accelerates computation. The adoption of FP8 mixed precision in training large-scale models like DeepSeek-V3 represents a pioneering effort in the field, demonstrating both feasibility and effectiveness.
  4. Auxiliary-Loss-Free Strategy: In traditional MoE models, load balancing is often achieved by incorporating auxiliary loss functions, which can inadvertently degrade model performance. DeepSeek-V3 addresses this challenge by introducing an innovative auxiliary-loss-free strategy for load balancing. This approach eliminates the need for additional loss functions, thereby minimizing potential performance degradation. By optimizing load distribution among experts without relying on auxiliary losses, DeepSeek-V3 maintains high efficiency and effectiveness in processing tasks.
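
The auxiliary-loss-free idea can be sketched as follows: instead of adding a balancing loss, a per-expert bias is added to the routing scores for expert selection only and is nudged up or down depending on whether an expert is under- or over-loaded. The NumPy snippet below is a simplified sketch of that mechanism under assumed toy shapes, not DeepSeek-V3's actual implementation.

import numpy as np

def route_tokens(scores, bias, top_k=2, update_rate=0.001):
    # scores: (num_tokens, num_experts) router affinities.
    # bias:   (num_experts,) balancing bias used only for expert selection;
    #         the unbiased scores still weight the expert outputs downstream.
    num_tokens, num_experts = scores.shape
    biased = scores + bias
    top_experts = np.argsort(-biased, axis=1)[:, :top_k]
    # Measure the realized load and nudge the bias: overloaded experts get a
    # lower bias, underloaded experts a higher one, steering future tokens
    # toward a balanced assignment without any auxiliary loss term.
    load = np.bincount(top_experts.ravel(), minlength=num_experts)
    target = num_tokens * top_k / num_experts
    bias = bias - update_rate * np.sign(load - target)
    return top_experts, bias

scores = np.random.rand(8, 4)  # 8 tokens, 4 experts (toy sizes)
bias = np.zeros(4)
experts, bias = route_tokens(scores, bias)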

Performance Benchmarks

1. Reasoning-Related Benchmark

This table compares the performance of DeepSeek models with other models on reasoning-related benchmarks, focusing on mathematical problem-solving and coding capabilities. Below is a summary of the key findings:

  1. Maths: DeepSeek-R1 leads on AIME (79.8%) and MATH-500 (97.3%), followed closely by OpenAI-o1 at 79.2% and 96.4%, respectively.
  2. General Knowledge Reasoning: OpenAI-o1 excels in GPQA Diamond with 75.7%, while DeepSeek-R1-Zero follows at 73.3%. Other models show lower performance in this domain.
  3. Coding and Algorithmic Reasoning: On LiveCodeBench, DeepSeek-R1 achieves the highest score at 65.9%, while on Codeforces, OpenAI-o1 leads with a rating of 2061, outperforming other models on this platform.

2. Standard Benchmark

This table compares the performance of DeepSeek models against other models on standard benchmarks. Below is a summary of the key findings:

  1. General Knowledge: DeepSeek-R1 leads on English benchmarks, achieving top scores in MMLU-Redux (92.9), MMLU-Pro (84.0), DROP (92.2), and FRAMES (82.5). It also outperforms competitors in user-centric evaluations such as AlpacaEval2.0 (87.6) and ArenaHard (92.3), showcasing strong capabilities in writing tasks and open-domain question answering. Additionally, DeepSeek-R1 delivers notable results on IF-Eval, demonstrating solid adherence to format instructions.
  2. Coding: In coding tasks, the two leaders are close: DeepSeek-R1 posts the higher LiveCodeBench score (65.9 vs. 63.4), while OpenAI o1 leads on Codeforces (96.6 percentile, rating 2061, versus DeepSeek-R1's 96.3 percentile, rating 2029). OpenAI-o1 also surpasses DeepSeek-R1 on Aider, and the two exhibit similar performance on SWE Verified.
  3. Maths: DeepSeek-R1 is the top performer in math-focused evaluations, particularly excelling in MATH-500 (97.3), CNMO (78.8), and AIME 2024 (79.8). OpenAI o1-1217 also performs strongly on MATH-500 (96.4) and AIME 2024 (79.2), placing it second in overall mathematical capabilities.

The Challenge of Advancing AI Reasoning

For a long time, the AI community faced a major challenge: no open-source model could match OpenAI’s o1 in reasoning, particularly in complex tasks like mathematics and coding. Developers eagerly awaited a model that could compete, but no viable alternative emerged.

In December 2024, Qwen attempted to bridge this gap with Qwen-QwQ, an experimental reasoning model that showed promise, especially in mathematical and coding benchmarks. However, as a preview release, it had limitations and wasn't a complete solution.

How DeepSeek Solved It

Recognizing this gap, DeepSeek introduced DeepSeek-R1 in January 2025—a model designed to rival o1 in reasoning performance. They tackled the challenge by optimizing reinforcement learning techniques, enabling the model to develop advanced reasoning behaviors like self-verification and reflection. Unlike its predecessors, DeepSeek-R1 was also engineered for efficiency, requiring fewer computational resources without compromising performance.

The result? An open-source reasoning model that finally matched o1, making high-level AI reasoning more accessible and cost-effective. By prioritizing efficiency and openness, DeepSeek-R1 has significantly impacted the AI landscape, challenging closed-source dominance and advancing the field of reasoning models.

Use Cases and Applications

DeepSeek models can be integrated into a variety of applications across multiple domains, enhancing functionality and user experience. Here are some notable use cases:

  1. Chat Applications: DeepSeek's models are integrated into various chat platforms, enhancing user interactions. For instance, Chatbox offers a desktop client compatible with multiple large language models, including DeepSeek, across Windows, Mac, and Linux systems.
  2. Productivity Tools: In the realm of productivity, applications like LibreChat and Enconvo leverage DeepSeek's AI to enhance user efficiency. Additionally, Cherry Studio utilizes DeepSeek's capabilities to support producers in their creative workflows.
  3. Translation and Language Support: DeepSeek's models are employed in translation tools to make information more accessible globally. RSS Translator uses DeepSeek to translate RSS feeds into multiple languages, broadening the reach of content.
  4. Developer Tools: For developers, DeepSeek enhances coding efficiency through tools like Continue, an open-source coding autopilot that integrates into IDEs and leverages DeepSeek's advanced coding capabilities.
  5. Browser Extensions: DeepSeek's models also power browser extensions such as Lulu Translate, which offers mouse-selection translation, paragraph-by-paragraph comparison translation, and PDF document translation.

These use cases illustrate the diverse applications of DeepSeek models across various platforms, significantly enhancing functionality and user experience in multiple sectors.

Open-Source Community Engagement

DeepSeek maintains an open-source presence by offering its models on platforms like Hugging Face. Their Discord server fosters an active community where developers can access resources, share experiences, and collaborate on solutions.

The DeepSeek team provides several essential resources for developers:

  • GitHub repositories containing integration examples and detailed documentation.
  • Fine-tuning guides for optimizing model performance and customizing models for specific applications.
  • Comprehensive deployment guides covering various inference libraries, including performance-optimization tips.

For more detailed information, you can refer to DeepSeek's official website, which offers an overview of their models and resources.

Inference Options for DeepSeek Models

  1. SGLang: SGLang delivers cutting-edge latency and throughput by incorporating MLA optimizations, DP Attention, FP8 (W8A8), FP8 KV Cache, and Torch Compile. It supports both NVIDIA and AMD GPUs and enables multi-node tensor parallelism for scalability, with Multi-Token Prediction (MTP) support under development.
  2. LMDeploy: A flexible, high-performance inference framework tailored for large language models. LMDeploy supports offline pipeline processing and online deployment, seamlessly integrating with PyTorch-based workflows for streamlined model serving.
  3. TensorRT-LLM: NVIDIA’s TensorRT-LLM offers precision options like BF16 and INT4/INT8 weight-only, with FP8 support coming soon. It provides optimized inference for DeepSeek-V3 on NVIDIA GPUs, leveraging advanced techniques such as layer fusion and precision calibration.
  4. vLLM: A framework optimized for memory-efficient and high-speed inference, vLLM supports FP8 and BF16 precision modes. It offers pipeline parallelism for multi-machine deployments, making it a strong choice for large-scale applications.
  5. TGI: Hugging Face's Text Generation Inference (TGI) is an open-source inference library that facilitates the deployment and serving of large language models (LLMs) in production environments.
  6. Ollama: Ollama simplifies the deployment and inference of DeepSeek models on local setups, making it accessible even for those with limited technical expertise (see the quick local test below).

These frameworks and hardware options cater to diverse deployment needs, offering scalable and efficient inference for DeepSeek models.
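
For a quick local test, the snippet below uses the community ollama Python client with one of the distilled R1 tags. Both the client API and the deepseek-r1:7b tag are assumptions here, so check the Ollama library for the exact model names before running it.

# Assumes `pip install ollama`, a running Ollama server, and that the model
# has already been pulled, e.g. with `ollama pull deepseek-r1:7b`.
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # assumed tag; substitute the variant you pulled
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
)
print(response["message"]["content"])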

Step-by-Step Guide for Inference of DeepSeek Models

This guide provides a comprehensive approach to deploying a DeepSeek model using the vLLM framework. Follow these steps to set up and utilize the DeepSeek model effectively.

Prerequisites

  1. Python Environment: Ensure you have Python installed (preferably Python 3.8 or later).
  2. Install Required Packages: Install the required libraries using pip:
pip install vllm==0.6.6.post1

Step 1: Initialize the Tokenizer and Model

Begin by importing necessary libraries and initializing the tokenizer and model.

from vllm import LLM
from vllm.sampling_params import SamplingParams
from transformers import AutoTokenizer

# Define the Model name
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
llm = LLM(model=model_id)

# Set up sampling parameters for generation
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    max_tokens=512,
    top_k=40
)

Step 2: Prepare Your Prompt

Define the prompt that you want to use for generating responses from the model.

# Prepare your prompt
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

Step 3: Generate Responses

Use the LLM instance to generate responses based on your prepared messages.

# Apply the chat template; add_generation_prompt appends the assistant turn marker
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate response
response = llm.generate(input_text, sampling_params=sampling_params)
result_output = [output.outputs[0].text for output in response]
print(result_output)
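
DeepSeek-R1-style models usually emit their chain of thought between <think> and </think> tags before the final answer. Assuming that output format, a small post-processing step can separate the reasoning trace from the answer:

import re

def split_reasoning(text):
    # Split an R1-style completion into (reasoning, answer).
    # Assumes the <think>...</think> convention; returns ("", text) otherwise.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), text[match.end():].strip()
    return "", text.strip()

reasoning, answer = split_reasoning(result_output[0])
print("Reasoning:", reasoning[:200])
print("Answer:", answer)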

Step 4: If You Are Using Quantized Models

If you're working with quantized models for efficiency, you can specify quantization parameters when initializing your model.

For example, to use a GPTQ-quantized model:

llm = LLM(model="kaitchup/DeepSeek-R1-Distill-Qwen-7B-AutoRound-GPTQ-4bit", quantization="gptq")

You can then follow the same process as above with the quantized model.

Conclusion

DeepSeek's research and development spans cutting-edge MoE architectures, advanced RL training techniques, and extensive community support. Together, these efforts have brought DeepSeek models to near state-of-the-art performance across an impressive range of tasks.

With open-source releases like DeepSeek-R1 and DeepSeek-V3, they continue to close the gap between proprietary closed-source models and open-source models, fostering broad adoption and research. From coding assistance to formal theorem proving and multilingual comprehension, DeepSeek's suite of models demonstrates both technological ambition and community-driven development, marking a pivotal moment in the evolution of LLMs.

DeepSeek also prioritizes robust deployment support, even for its massive architectures such as the 671B-parameter MoE models, through frameworks like LMDeploy, TensorRT-LLM, vLLM, and others. This ensures that anyone, from individuals on consumer-grade GPUs to enterprises using high-performance clusters, can harness DeepSeek's capabilities for cutting-edge ML applications.

Resources:

  1. https://www.turingpost.com/p/deepseek
  2. https://planetbanatt.net/articles/deepseek.html
  3. https://huggingface.co/deepseek-ai/DeepSeek-V3
  4. https://www.datacamp.com/tutorial/deepseek-v3
  5. https://api-docs.deepseek.com/news/news0905
  6. https://dataloop.ai/library/model/deepseek-ai_deepseek-llm-7b-chat/
  7. https://huggingface.co/deepseek-ai/deepseek-llm-67b-base
  8. https://arxiv.org/html/2401.02954v1
  9. https://arxiv.org/abs/2401.02954
  10. https://huggingface.co/deepseek-ai/DeepSeek-R1
  11. https://www.deepseek.com
  12. https://www.datacamp.com/blog/deepseek-r1
