Open Source LLM Architectures Compared: GPT OSS vs. DeepSeek vs. Qwen 3
Key Insight:
Despite similar benchmark performance, OpenAI's GPT OSS, Alibaba's Qwen 3, and DeepSeek V3 take fundamentally different architectural approaches to achieving efficiency, long-context handling, and reasoning capabilities.
GPT OSS: OpenAI's Return to Open Weights
OpenAI's first open-weights release since GPT-2 features a Mixture of Experts (MoE) architecture available in two variants:
- Scale: 120B-parameter model (4 active experts per token) and a 20B-parameter model; top-k routing is sketched after this list
- Core Tech: Grouped Query Attention, SwiGLU activations, Rotary Positional Embeddings (RoPE), RMS norm
- Breakthrough: 131K token context via YARN scaling during pre-training (not post-hoc)
- Tokenizer: Open-source o200k Harmony tokenizer (200K+ token vocabulary)
- Deployment: Quantized by default for consumer hardware
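To make the MoE structure concrete, below is a minimal PyTorch sketch of top-k expert routing: a router scores every expert for each token and only the k highest-scoring experts run. The dimensions, expert count, and the plain SiLU feed-forward here are illustrative placeholders, not the released GPT OSS configuration (which uses SwiGLU experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: a router scores every expert
    for each token, and only the k highest-scoring experts are evaluated."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # naive dispatch loop; real kernels batch this
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512])
```

Only the selected experts' weights are touched for a given token, which is how a 120B-parameter model can spend just a few billion parameters of compute per token.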
Qwen 3: Alibaba's Multi-Stage Innovator
Alibaba's April 2025 release offers both dense and MoE architectures across seven model sizes:
Architectural Highlights
- QK Norm replaces QKV bias for attention stability (sketched after this list)
- Shared byte-level tokenizer handles any text or symbol without language-specific pre-processing
- MoE models match dense performance with 1/5 active parameters
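The QK Norm change is easiest to see in code. The hedged sketch below applies RMSNorm to the query and key vectors of each head before the dot product, with no bias on the QKV projection; head counts and dimensions are illustrative, not Qwen 3's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class QKNormAttention(nn.Module):
    """Attention with QK Norm: q and k are RMS-normalized per head before the
    dot product, stabilizing attention logits without any QKV bias terms."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # no QKV bias
        self.q_norm = RMSNorm(self.d)
        self.k_norm = RMSNorm(self.d)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.h, self.d)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.h, self.d)).transpose(1, 2)
        v = v.view(b, t, self.h, self.d).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 32, 512)
print(QKNormAttention()(x).shape)  # torch.Size([2, 32, 512])
```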
Three-Stage Training
- General stage: 30T tokens across 119 languages
- Reasoning stage: 5T high-quality STEM/coding tokens
- Long-context stage: ABF + YARN optimizations to reach 32K tokens (see the RoPE sketch after this list)
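The ideas behind ABF and YARN can be shown with a few lines of rotary-embedding math. In the hedged sketch below, ABF appears as raising the RoPE base frequency so rotations advance more slowly with position, and the YARN step is reduced to simple position scaling; real YARN interpolates per frequency band and rescales attention temperature, and all constants here are illustrative rather than Qwen 3's or GPT OSS's published values.

```python
import torch

def rope_inv_freq(head_dim, base=10_000.0):
    """Standard RoPE inverse frequencies for one attention head."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# ABF (adjusted base frequency): raise the RoPE base so angles grow more
# slowly with position and distant positions stay distinguishable.
inv_freq_short = rope_inv_freq(128, base=10_000.0)    # typical short-context base
inv_freq_long = rope_inv_freq(128, base=1_000_000.0)  # larger base for long context

# YARN, heavily simplified: rescale positions so a longer sequence reuses the
# angle range the model saw in training. (The real method only interpolates
# the low-frequency bands and adjusts attention temperature.)
def rope_angles(positions, inv_freq, scale=1.0):
    return torch.outer(positions / scale, inv_freq)    # (seq_len, head_dim // 2)

positions = torch.arange(131_072).float()              # target context length
angles = rope_angles(positions, inv_freq_long, scale=4.0)
print(angles.shape)                                    # torch.Size([131072, 64])
```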
Post-Training Breakthroughs
- Thinking Mode Fusion: Single-model toggle between reasoning/non-reasoning modes
- Minimal Data RL: Achieved complex reasoning with only 4,000 query-verifier pairs
- Strong-to-Weak Distillation: Smaller models inherit larger-model capabilities (a generic distillation objective is sketched below)
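Qwen's exact strong-to-weak recipe is not spelled out here, but the core mechanism such pipelines build on is logit distillation: train the small model to match the large model's token distribution. The sketch below shows only that generic objective; the temperature, vocabulary size, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic logit-distillation objective: KL divergence from the (frozen)
    teacher's softened token distribution to the student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # batchmean KL, rescaled by t^2 as is conventional for distillation
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

teacher = torch.randn(2, 16, 32_000)                  # frozen large-model logits
student = torch.randn(2, 16, 32_000, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```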
DeepSeek V3: The Efficiency Pioneer
DeepSeek's 671B-parameter MoE model (37B active parameters per token) focuses on hardware-aware optimizations:
Core Innovations
- Native FP8 (8-bit) training (vs. the 16-/32-bit standard)
- Multi-head Latent Attention (MLA) compresses the KV cache by ~90% (sketched after this list)
- V3.1 update adds hybrid thinking mode and enhanced tool use
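The sketch below shows the central MLA idea under simplifying assumptions: keys and values are down-projected to a small shared latent, that latent is what gets cached, and per-head K and V are re-expanded from it at attention time. DeepSeek's real MLA also routes positional information through a separate decoupled RoPE path and compresses queries, both omitted here; all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Simplified latent-KV attention: cache a small per-token latent instead
    of full per-head keys and values, and re-expand K/V from it on the fly."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # output is what gets cached
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):                     # x: (batch, seq, d_model)
        b, t, _ = x.shape
        c_kv = self.kv_down(x)                                   # (b, t, d_latent)
        if latent_cache is not None:                             # append to cached latents
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.q_proj(x).view(b, t, self.h, self.d).transpose(1, 2)
        k = self.k_up(c_kv).view(b, -1, self.h, self.d).transpose(1, 2)
        v = self.v_up(c_kv).view(b, -1, self.h, self.d).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v)
        return self.out(y.transpose(1, 2).reshape(b, t, -1)), c_kv  # cache c_kv, not K/V

x = torch.randn(1, 16, 512)
y, cache = LatentKVAttention()(x)
print(y.shape, cache.shape)  # torch.Size([1, 16, 512]) torch.Size([1, 16, 64])
```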
Long-Context Strategy
- Staged fine-tuning: First to 32K tokens, then to 128K
- MLA outperforms Grouped Query Attention in memory efficiency (see the back-of-the-envelope comparison below)
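To see why caching a latent is cheaper, here is a back-of-the-envelope comparison of per-token KV-cache size. The layer count, KV-head count, and latent width are illustrative placeholders, not the published DeepSeek or baseline configurations, so the exact percentage will differ from the figures quoted above.

```python
# Per-token KV-cache size: GQA stores K and V for every KV head in every
# layer; a latent-attention scheme stores one small latent per layer.
layers, kv_heads, head_dim, d_latent, bytes_per = 60, 8, 128, 512, 2  # fp16/bf16

gqa_bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per
mla_bytes_per_token = layers * d_latent * bytes_per

print(f"GQA: {gqa_bytes_per_token / 1024:.0f} KiB/token")   # 240 KiB
print(f"MLA: {mla_bytes_per_token / 1024:.0f} KiB/token")   # 60 KiB
print(f"reduction: {1 - mla_bytes_per_token / gqa_bytes_per_token:.0%}")  # 75%
```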
Critical Architectural Comparisons
| Feature | GPT OSS | Qwen 3 | DeepSeek V3 |
| --- | --- | --- | --- |
| Model Type | MoE only | Dense + MoE | MoE only |
| Active Params/Token | 3.6B-5.1B | ~1/5 of total params (MoE) | 37B |
| Attention Mechanism | Grouped Query Attention | Grouped Query Attention | Multi-head Latent Attention |
| Long-Context Approach | Native YARN in pre-training (131K) | Inference-time YARN scaling (128K) | Staged fine-tuning (128K) |
| RLHF Efficiency | Substantial alignment layers | Effective with 4K query-verifier pairs | Advanced agent tuning |
Industry-Wide Trends & Implications
The Empirical Nature of LLM Development
Despite similar benchmark results, labs combine architectural components differently, often without first-principles justification. For example:
- DeepSeek's MLA vs. mainstream GQA attention
- Qwen's QK Norm vs. traditional QKV bias
- Divergent YARN implementation strategies
The Hidden Moat: Data Engineering
While architecture is public, data strategies remain proprietary:
- Qwen 3 used 36T tokens plus synthetic data generation
- GPT OSS trained on "trillions of tokens" with a STEM focus
- All models implement sophisticated content filtering
Conclusion: Beyond Benchmarks
These models demonstrate that architectural diversity persists even as performance converges. Key differentiators include:
- Context extension techniques (pre-training vs. fine-tuning vs. inference scaling)
- Hardware-aware optimizations (8-bit training, KV cache compression)
- Post-training innovations (thinking mode toggles, minimal-data RL)
The open-source LLM field remains a laboratory of empirical experimentation where similar results emerge from fundamentally different engineering approaches.