Open Source LLM Architectures Compared: GPT OSS vs. DeepSeek vs. Qwen 3
Key Insight:
Despite similar benchmark performance, OpenAI's GPT OSS, Alibaba's Qwen 3, and DeepSeek V3 take fundamentally different architectural approaches to achieving efficiency, long-context handling, and reasoning capabilities.
GPT OSS: OpenAI's Return to Open Weights
OpenAI's first open-weights release since GPT-2 features a Mixture of Experts (MoE) architecture available in two variants:
- Scale: 120B-parameter model (4 active experts per token) and a 20B-parameter model; top-k routing is sketched after this list
- Core Tech: Grouped Query Attention, SwiGLU activations, Rotary Positional Embeddings (RoPE), RMS norm
- Breakthrough: 131K token context via YARN scaling during pre-training (not post-hoc)
- Tokenizer: Open-source o200k Harmony tokenizer (200K+ token vocabulary)
- Deployment: Quantized by default for consumer hardware
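To make the MoE structure concrete, below is a minimal PyTorch sketch of top-k expert routing: a router scores every expert for each token and only the k highest-scoring experts run. The dimensions, expert count, and the plain SiLU feed-forward here are illustrative placeholders, not the released GPT OSS configuration (which uses SwiGLU experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: a router scores every expert
    for each token, and only the k highest-scoring experts are evaluated."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # naive dispatch loop; real kernels batch this
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512])
```

Only the selected experts' weights are touched for a given token, which is how a 120B-parameter model can spend just a few billion parameters of compute per token.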
Qwen 3: Alibaba's Multi-Stage Innovator
Alibaba's April 2025 release offers both dense and MoE architectures across seven model sizes:
Architectural Highlights
- QK Norm replaces QKV bias for attention stability (sketched after this list)
- Shared byte-level tokenizer handles any text or symbol without language-specific pre-processing
- MoE models match dense performance with 1/5 active parameters
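The QK Norm change is easiest to see in code. The hedged sketch below applies RMSNorm to the query and key vectors of each head before the dot product, with no bias on the QKV projection; head counts and dimensions are illustrative, not Qwen 3's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class QKNormAttention(nn.Module):
    """Attention with QK Norm: q and k are RMS-normalized per head before the
    dot product, stabilizing attention logits without any QKV bias terms."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # no QKV bias
        self.q_norm = RMSNorm(self.d)
        self.k_norm = RMSNorm(self.d)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.h, self.d)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.h, self.d)).transpose(1, 2)
        v = v.view(b, t, self.h, self.d).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 32, 512)
print(QKNormAttention()(x).shape)  # torch.Size([2, 32, 512])
```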
Three-Stage Training
- General stage: 30T tokens across 119 languages
- Reasoning stage: 5T high-quality STEM/coding tokens
- Long-context stage: ABF + YARN optimizations to reach 32K tokens (see the RoPE sketch after this list)
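The ideas behind ABF and YARN can be shown with a few lines of rotary-embedding math. In the hedged sketch below, ABF appears as raising the RoPE base frequency so rotations advance more slowly with position, and the YARN step is reduced to simple position scaling; real YARN interpolates per frequency band and rescales attention temperature, and all constants here are illustrative rather than Qwen 3's or GPT OSS's published values.

```python
import torch

def rope_inv_freq(head_dim, base=10_000.0):
    """Standard RoPE inverse frequencies for one attention head."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# ABF (adjusted base frequency): raise the RoPE base so angles grow more
# slowly with position and distant positions stay distinguishable.
inv_freq_short = rope_inv_freq(128, base=10_000.0)    # typical short-context base
inv_freq_long = rope_inv_freq(128, base=1_000_000.0)  # larger base for long context

# YARN, heavily simplified: rescale positions so a longer sequence reuses the
# angle range the model saw in training. (The real method only interpolates
# the low-frequency bands and adjusts attention temperature.)
def rope_angles(positions, inv_freq, scale=1.0):
    return torch.outer(positions / scale, inv_freq)    # (seq_len, head_dim // 2)

positions = torch.arange(131_072).float()              # target context length
angles = rope_angles(positions, inv_freq_long, scale=4.0)
print(angles.shape)                                    # torch.Size([131072, 64])
```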
Post-Training Breakthroughs
- Thinking Mode Fusion: Single-model toggle between reasoning/non-reasoning modes
- Minimal Data RL: Achieved complex reasoning with only 4,000 query-verifier pairs
- Strong-to-Weak Distillation: Smaller models inherit larger-model capabilities (a generic distillation objective is sketched below)
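Qwen's exact strong-to-weak recipe is not spelled out here, but the core mechanism such pipelines build on is logit distillation: train the small model to match the large model's token distribution. The sketch below shows only that generic objective; the temperature, vocabulary size, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic logit-distillation objective: KL divergence from the (frozen)
    teacher's softened token distribution to the student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # batchmean KL, rescaled by t^2 as is conventional for distillation
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

teacher = torch.randn(2, 16, 32_000)                  # frozen large-model logits
student = torch.randn(2, 16, 32_000, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```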
DeepSeek V3: The Efficiency Pioneer
DeepSeek's 671B-parameter MoE model (37B active parameters per token) focuses on hardware-aware optimizations:
Core Innovations
- Native FP8 (8-bit) training (vs. the 16-/32-bit standard)
- Multi-head Latent Attention (MLA) compresses the KV cache by ~90% (sketched after this list)
- V3.1 update adds hybrid thinking mode and enhanced tool use
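The sketch below shows the central MLA idea under simplifying assumptions: keys and values are down-projected to a small shared latent, that latent is what gets cached, and per-head K and V are re-expanded from it at attention time. DeepSeek's real MLA also routes positional information through a separate decoupled RoPE path and compresses queries, both omitted here; all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Simplified latent-KV attention: cache a small per-token latent instead
    of full per-head keys and values, and re-expand K/V from it on the fly."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # output is what gets cached
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):                     # x: (batch, seq, d_model)
        b, t, _ = x.shape
        c_kv = self.kv_down(x)                                   # (b, t, d_latent)
        if latent_cache is not None:                             # append to cached latents
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.q_proj(x).view(b, t, self.h, self.d).transpose(1, 2)
        k = self.k_up(c_kv).view(b, -1, self.h, self.d).transpose(1, 2)
        v = self.v_up(c_kv).view(b, -1, self.h, self.d).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v)
        return self.out(y.transpose(1, 2).reshape(b, t, -1)), c_kv  # cache c_kv, not K/V

x = torch.randn(1, 16, 512)
y, cache = LatentKVAttention()(x)
print(y.shape, cache.shape)  # torch.Size([1, 16, 512]) torch.Size([1, 16, 64])
```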
Long-Context Strategy
- Staged fine-tuning: First to 32K tokens, then to 128K
- MLA outperforms Grouped Query Attention in memory efficiency (see the back-of-the-envelope comparison below)
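To see why caching a latent is cheaper, here is a back-of-the-envelope comparison of per-token KV-cache size. The layer count, KV-head count, and latent width are illustrative placeholders, not the published DeepSeek or baseline configurations, so the exact percentage will differ from the figures quoted above.

```python
# Per-token KV-cache size: GQA stores K and V for every KV head in every
# layer; a latent-attention scheme stores one small latent per layer.
layers, kv_heads, head_dim, d_latent, bytes_per = 60, 8, 128, 512, 2  # fp16/bf16

gqa_bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per
mla_bytes_per_token = layers * d_latent * bytes_per

print(f"GQA: {gqa_bytes_per_token / 1024:.0f} KiB/token")   # 240 KiB
print(f"MLA: {mla_bytes_per_token / 1024:.0f} KiB/token")   # 60 KiB
print(f"reduction: {1 - mla_bytes_per_token / gqa_bytes_per_token:.0%}")  # 75%
```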
Critical Architectural Comparisons
| Feature | GPT OSS | Qwen 3 | DeepSeek V3 |
| --- | --- | --- | --- |
| Model Type | MoE only | Dense + MoE | MoE only |
| Active Params/Token | 3.6B-5.1B | ~1/5 of total params (MoE) | 37B |
| Attention Mechanism | Grouped Query Attention | Grouped Query Attention | Multi-head Latent Attention |
| Long-Context Approach | Native YARN in pre-training (131K) | Inference-time YARN scaling (128K) | Staged fine-tuning (128K) |
| RLHF Efficiency | Substantial alignment layers | Effective with 4K query-verifier pairs | Advanced agent tuning |
Industry-Wide Trends & Implications
The Empirical Nature of LLM Development
Despite similar benchmark results, labs combine architectural components differently, often without first-principles justification. For example:
- DeepSeek's MLA vs. mainstream GQA attention
- Qwen's QK Norm vs. traditional QKV bias
- Divergent YARN implementation strategies
The Hidden Moat: Data Engineering
While architecture is public, data strategies remain proprietary:
- Qwen 3 used 36T tokens plus synthetic data generation
- GPT OSS trained on "trillions of tokens" with a STEM focus
- All models implement sophisticated content filtering
Conclusion: Beyond Benchmarks
These models demonstrate that architectural diversity persists even as performance converges. Key differentiators include:
- Context extension techniques (pre-training vs. fine-tuning vs. inference scaling)
- Hardware-aware optimizations (8-bit training, KV cache compression)
- Post-training innovations (thinking mode toggles, minimal-data RL)
The open-source LLM field remains a laboratory of empirical experimentation where similar results emerge from fundamentally different engineering approaches.