
DeepSeek V4: 1T Parameters, 32B Active. What MoE Means for Your GPU Budget.

March 18, 2026 · 19:02 UTC · News

TL;DR

DeepSeek V4 is a 1-trillion-parameter Mixture-of-Experts model that activates only ~32 billion parameters per token. That's about 3% of the total model doing the actual work. For homelab runners, this means you might squeeze heavily offloaded, quantized inference onto dual RTX 4090s or a single RTX 5090. But "runs" and "runs well" are different conversations. The full model needs ~600 GB of memory even at Q4. The distilled 7B and 33B variants are where most homelabbers will actually live.

A Trillion Parameters Walk Into a Homelab

A year ago, DeepSeek V3 turned heads with 671B total parameters and ~37B active per token. V4 cranks the total to a round trillion while lowering active parameters to ~32B. That's not a typo. They made the model 50% larger and the per-token compute cheaper.

This is the core promise of Mixture-of-Experts: you don't run the whole model. You route each token to a handful of specialized expert modules and let the rest sit idle. V3, for reference, activates 8 of its 256 routed experts (plus one shared expert) per token. V4 reportedly routes each token through 16 expert pathways. More experts per token only lowers the active parameter count if each expert is smaller, so the gain comes from finer-grained experts: better specialization with fewer active params.

The result is a model that benchmarks against frontier-class dense models while spending less compute per token than a 70B dense model. Memory is another matter: every expert still has to be stored somewhere.

How MoE Actually Works (Without the Hand-Waving)

Every transformer layer in a standard dense model has one feed-forward network (FFN). Every token passes through that same FFN. In MoE, that single FFN is replaced by dozens or hundreds of smaller expert FFNs, and a learned router decides which experts handle each token.
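To make the routing step concrete, here is a toy sketch of top-k expert selection in plain Python. It is not DeepSeek's router: the experts are scalar functions, the router logits are random stand-ins for what a learned projection of the token's hidden state would produce, and the names `route_token` and `moe_layer` are made up for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize gates over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = 0.0
    for idx, gate in route_token(router_logits, k):
        out += gate * experts[idx](x)
    return out

# Toy setup (hypothetical): 8 scalar "experts", expert i multiplies its input by i + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
logits = [random.gauss(0, 1) for _ in experts]  # stand-in for learned router scores

chosen = route_token(logits, k=2)
y = moe_layer(1.0, experts, logits, k=2)
print("selected experts and gates:", chosen)
print("layer output:", y)
```

Only the two selected experts execute, but all eight have to exist in memory, because a different token may select a different pair.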

Here's what matters for your GPU budget:

Total parameters: ~1T. This is the sum of all weights: every expert in every layer, plus attention and embeddings. It determines your storage and VRAM requirements for loading the model.
Active parameters: ~32B. This is how many weights each token actually passes through. It determines per-token compute, and with it your inference speed and power draw.
KV cache: Scales with context length, not total parameter count. DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache by 90%+ via low-rank projection, which is the only reason a 1M-token context window is remotely feasible.

The key insight: you need enough VRAM to store the full model (or a quantized version of it), even though you only compute through 3% of it per token. This is where homelab dreams collide with physics.
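The storage-versus-compute split can be put in numbers with a quick sketch. The parameter counts are the article's; the "2 FLOPs per active weight per token" figure is the usual back-of-envelope rule for a forward pass, not a measured number for V4.

```python
TOTAL_PARAMS = 1.0e12   # article's figure for V4
ACTIVE_PARAMS = 32e9    # article's figure for V4

def active_fraction(total, active):
    return active / total

def forward_flops_per_token(params_touched):
    # Rule of thumb: one multiply and one add per weight a token passes through.
    return 2 * params_touched

frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction:            {frac:.1%}")
print(f"MoE forward pass:           ~{moe_flops / 1e9:.0f} GFLOPs/token")
print(f"hypothetical dense 1T pass: ~{dense_flops / 1e9:.0f} GFLOPs/token")
```

Compute tracks the 32B figure; memory tracks the 1T figure. Nothing in the routing reduces the second number.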

The VRAM Math: What You Actually Need

Let's do the napkin math for the full V4 model at different precisions:

Precision	Bytes/Param	Weight Memory	Realistic Total (+ overhead)	Hardware
BF16	2	~2 TB	~2.2 TB	Multi-node H100 cluster
FP8	1	~1 TB	~1.1 TB	8×H200 141GB
Q8	1	~1 TB	~1.1 TB	8×H200 141GB (tight)
Q4	0.5	~500 GB	~600 GB	8×H100 80GB
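The weight-memory column is just parameters times bytes per parameter. A minimal sketch, assuming the article's 1T parameter count and nominal byte widths (real 4-bit formats also store per-block scales, so the effective size lands somewhat above 0.5 bytes per parameter):

```python
import math

TOTAL_PARAMS = 1.0e12  # article's figure for V4
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "Q8": 1.0, "Q4": 0.5}  # nominal widths

def weight_gb(params, bytes_per_param):
    """Weight memory in decimal gigabytes, ignoring KV cache and activations."""
    return params * bytes_per_param / 1e9

def gpus_needed(total_gb, gpu_gb):
    """Minimum GPU count to hold total_gb, assuming perfect packing."""
    return math.ceil(total_gb / gpu_gb)

for name, width in BYTES_PER_PARAM.items():
    gb = weight_gb(TOTAL_PARAMS, width)
    print(f"{name}: {gb:,.0f} GB of weights, "
          f"at least {gpus_needed(gb, 80)} x 80 GB GPUs for weights alone")
```

The GPU counts are lower bounds: they leave no room for KV cache, activations, or framework overhead, which is what the "Realistic Total" column tries to account for.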

For the active parameter slice only (what actually computes per token):

Precision	Active Weight Memory	Fits On
BF16	~64 GB	1×H100 80GB
Q8	~32 GB	2×RTX 4090 (48 GB total)
Q4	~16 GB	1×RTX 4090 (comfortable)

Here's the catch: you can't just load the active slice. The router itself is tiny, but it picks a different set of experts for every token at every layer, so any expert can be needed at any moment. That means the full model has to be resident in fast memory. Otherwise you're constantly swapping experts in from CPU RAM or disk, which murders your tokens-per-second.

The Homelab Reality Check

DeepSeek claims V4 can "run on consumer hardware." Let's stress-test that claim.

Dual RTX 4090s (48 GB total)

At Q4 quantization, the active parameter path (~16 GB) fits comfortably. But you still need the full expert weights accessible, and at Q4 those are ~500 GB. With 48 GB across two cards, less than 10% of the model sits in VRAM; the rest has to live in system RAM or on NVMe, with heavy CPU offloading for non-resident experts. Expect single-digit tokens per second, context windows capped around 2-4K tokens, and quality degradation from Q4 quantization that drops 3-6 points on knowledge benchmarks.

Verdict: it "runs" the way a Civic "runs" the Nürburgring. Technically possible. Not recommended for production.

Single RTX 5090 (32 GB)

Worse than dual 4090s for the full model. Better for the distilled 33B variant, which is a dense model that actually fits. This is probably what DeepSeek means by "consumer-grade."

4× RTX 4090 (96 GB total)

Now we're talking, sort of. 96 GB holds roughly a fifth of the ~500 GB Q4 model, so you're still offloading most experts to system RAM, but more of the hot path stays on-GPU and 4-8K context becomes workable. Still not a great experience for the 1T model, but serviceable for prototyping. You're spending ~$8,000 on GPUs alone though, at which point cloud credits start looking rational.
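For a rough sense of how much of the full model each of these setups can keep on-GPU, here is a small sketch. It uses the article's ~500 GB Q4 weight estimate and ignores KV cache and activations, so the resident fractions are optimistic upper bounds.

```python
Q4_WEIGHTS_GB = 500  # article's Q4 weight estimate for the full model

# VRAM totals for the three configurations discussed above.
configs = {
    "2x RTX 4090": 48,
    "1x RTX 5090": 32,
    "4x RTX 4090": 96,
}

def resident_fraction(vram_gb, weights_gb):
    """Share of the weights that fits in VRAM, capped at 100%."""
    return min(1.0, vram_gb / weights_gb)

for name, vram in configs.items():
    frac = resident_fraction(vram, Q4_WEIGHTS_GB)
    offloaded = Q4_WEIGHTS_GB * (1 - frac)
    print(f"{name}: {frac:.0%} of Q4 weights in VRAM, ~{offloaded:.0f} GB offloaded")
```

Even the four-card build leaves around 400 GB to be served from system RAM or disk, which is why tokens-per-second stays low regardless of how fast the GPUs are.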

The Honest Answer

Most homelabbers should target the distilled variants: the 7B and 33B dense models that DeepSeek will release alongside the full V4. These are trained to mimic V4's behavior and will run on a single consumer GPU. A Q4 33B model on a 4090 gives you a genuinely useful local AI at ~20-30 tokens/second.

What's Actually New in V4 (That Matters for Local Deployment)

Three architectural innovations change the calculus:

Manifold-Constrained Hyper-Connections (mHC)

A training stability technique for trillion-scale models. You don't care about this unless you're fine-tuning, and if you're fine-tuning a 1T model on your homelab, I have questions about your power bill.

Engram Conditional Memory

This is interesting. Engram decouples static knowledge lookups from dynamic reasoning, performing O(1) retrieval for factual knowledge instead of burning GPU cycles on attention-based recall. DeepSeek calls the problem it targets "silent LLM waste": cycles spent on lookups that don't need reasoning. For local deployment, this means fewer wasted FLOPs per token on stuff the model already "knows."

DeepSeek Sparse Attention (DSA)

The reason the 1M-token context window doesn't require 1M tokens' worth of VRAM. DSA focuses compute on the most relevant portions of context, cutting attention costs by ~50%. Combined with MLA's KV cache compression, this makes long contexts viable even on constrained hardware, though "viable" here means enterprise hardware, not your homelab.
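To see why KV cache compression matters at this scale, here is a generic estimate for a standard multi-head attention cache. The layer count, head count, and head dimension below are hypothetical placeholders, not V4's published configuration, and the 90% reduction is simply the article's MLA figure applied as a flat factor.

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size in decimal GB for standard multi-head attention."""
    # Two tensors (K and V) per layer, one head_dim vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# HYPOTHETICAL dimensions chosen for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 60, 128, 128
CONTEXT_TOKENS = 1_000_000

full = kv_cache_gb(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
compressed = full * (1 - 0.90)  # article's "90%+" MLA compression as a flat factor

print(f"uncompressed KV cache at 1M tokens: ~{full:,.0f} GB")
print(f"with a 90% reduction:               ~{compressed:,.0f} GB")
```

Under these assumed dimensions, even a 90% reduction leaves a cache in the hundreds of gigabytes at 1M tokens, which lines up with "enterprise hardware, not your homelab."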

The Self-Hosting Economics

Here's the math nobody wants to do:

Cloud API cost: $0.10-$0.30 per million input tokens (DeepSeek's projected pricing)
Self-hosting break-even: ~300-800M tokens/month, depending on GPU utilization and quantization quality you'll accept
4× RTX 4090 setup cost: ~$8,000 GPUs + $2,000 system + electricity

If you're running fewer than 5M tokens/month, the API is cheaper. Full stop. At $0.30 per million tokens, 5M tokens costs $1.50 a month, less than the electricity to keep four GPUs idling. Self-hosting only makes sense if you're doing batch generation, running fine-tuned models, or have compliance requirements that prohibit sending data to external APIs.

For context: 5M tokens/month is roughly 3.75 million words at the usual ~0.75 words per token, or about 7,500 pages at 500 words per page. If you're generating that much content locally, you probably already know whether self-hosting makes sense for you.
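If you want to run the numbers for your own workload, a small calculator is enough. This is a sketch, not a full cost model: the hardware cost and API price range come from the figures above, the monthly power bill is a hypothetical placeholder, and it prices input tokens only (output tokens are typically billed at a higher rate).

```python
def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    return tokens_per_month / 1e6 * usd_per_million_tokens

def payback_months(hardware_usd, tokens_per_month, usd_per_million_tokens,
                   power_usd_per_month=0.0):
    """Months until avoided API spend covers the hardware; None if it never does."""
    saved = monthly_api_cost(tokens_per_month, usd_per_million_tokens) - power_usd_per_month
    if saved <= 0:
        return None
    return hardware_usd / saved

HARDWARE_USD = 8_000 + 2_000  # article's GPU + system estimate
POWER_USD = 60.0              # HYPOTHETICAL monthly electricity bill; use your own

for tokens in (5e6, 300e6, 800e6):
    for price in (0.10, 0.30):
        cost = monthly_api_cost(tokens, price)
        months = payback_months(HARDWARE_USD, tokens, price, POWER_USD)
        verdict = "never pays back" if months is None else f"pays back in {months:.0f} months"
        print(f"{tokens / 1e6:>4.0f}M tokens/month at ${price:.2f}/M: "
              f"API ${cost:,.2f}/month, self-hosting {verdict}")
```

The result swings widely with the electricity figure and the amortization period you are willing to accept, so plug in your own numbers before committing to hardware.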

How This Compares to the Competition

Model	Total Params	Active Params	Architecture	License	Homelab-Friendly?
DeepSeek V4	1T	32B	MoE	Apache 2.0	Distilled only
DeepSeek V3.2	685B	37B	MoE	MIT	Distilled only
Llama 4 Scout	109B	17B	MoE	Meta License	Q4 on 1× H100
Kimi K2.5	1T	32B	MoE	Restricted	No
Qwen 2.5 72B	72B	72B (dense)	Dense	Qwen License	Q4 on 2× 4090

The MoE advantage is clear at the benchmark level. V4's 80%+ SWE-bench score is in frontier territory. But for homelab deployment, dense models like Qwen 2.5 72B remain more practical unless you're targeting the distilled variants.

The Apache 2.0 Factor

DeepSeek V4 ships under Apache 2.0. That's about as permissive as licenses get on a frontier model: commercial use is allowed, there are no usage-policy carve-outs, and the main obligations are attribution and preserving the license notice. No "open-weight but closed-license" games.

This matters because it means the community will immediately start producing GGUF quantizations, LoRA fine-tunes, and optimized serving configs. The homelab ecosystem around DeepSeek V3 is already mature: V4 will inherit all of that tooling on day one.

What MoE Means for the Future of Local AI

MoE is the architecture that makes "local frontier AI" a coherent phrase instead of an oxymoron. The trajectory is clear:

Total parameters keep growing: V3 was 671B, V4 is 1T, V5 will probably be 2-3T
Active parameters stay flat or shrink: Better routing means less compute per token
The gap between "big" and "usable" keeps widening: The full model is for cloud, the distilled variants are for you

The practical implication: stop watching the total parameter count and start watching the active parameter count and distilled model quality. A 33B distilled model trained from a 1T teacher is a fundamentally different beast than a 33B model trained from scratch. That's where MoE delivers value to homelabbers, not by running the full model locally, but by producing better small models.

Key Takeaways

DeepSeek V4 has 1T total parameters but only activates ~32B per token: a 50% increase in total params over V3 with fewer active params, thanks to improved expert routing.
You need the full model in memory even though you only compute 3% of it: MoE doesn't save you on VRAM, it saves you on compute. The full Q4 model still needs ~600 GB.
Dual RTX 4090s "work" but barely: Expect heavy offloading, single-digit tok/s, and quality loss. The distilled 33B variant is the realistic homelab target.
Self-hosting breaks even at 300-800M tokens/month: Below that, the API at $0.10-$0.30/M tokens is cheaper than electricity and depreciation.
Apache 2.0 licensing means the community will optimize fast: GGUF quants and serving configs will appear within days of release.
Watch the distilled models, not the full model: A 33B model distilled from a 1T teacher should outperform a 33B model trained from scratch. That's the real homelab win.

DeepSeek V4 represents the clearest evidence yet that MoE is the path to democratizing frontier AI, just not in the way the marketing suggests. The trillion parameters aren't for you. The 33B model distilled from a trillion-parameter teacher? That's for you.
