Local or Cloud AI? The Real Math Nobody's Doing

Everyone talks about local AI. Few people know how much VRAM it actually takes to run a 70-billion-parameter model. Even fewer have compared a graphics card's price tag against their monthly cloud bill.
Why Local Inference Deserves Your Attention
For two years, the dominant AI conversation has revolved around cloud APIs: OpenAI, Anthropic, Google. It's the simplest path. It's also the path where your organization has zero control over three critical elements: its data, its unit costs, and its dependency on a single vendor.
Meanwhile, the open-source ecosystem has caught up — and in some cases surpassed — proprietary models. Qwen 3.5, DeepSeek R1, Llama 4, Mistral Large 3: these models compete with GPT-4 and Claude on multiple benchmarks. And they can run on your own hardware.
For decision-makers, the question is no longer "is open-source AI good enough?" It is. The question is: do you control the infrastructure it runs on?
Three Strategic Reasons to Act Now
1. Data Sovereignty
Every cloud API call sends your data to a third party, often American. For organizations subject to data protection legislation (Law 25 in Quebec, GDPR in Europe), SOC 2 requirements, or simply concerned about not sending strategic documents to a third party, local inference eliminates this risk at the source: your data never leaves your infrastructure.
2. Cost Predictability
Cloud APIs charge per token. The more your teams adopt AI, the higher the bill climbs — often unpredictably. Local inference transforms a variable cost into a fixed investment, amortized over 24 months, with marginal electricity consumption.
3. Operational Independence
No vendor outages. No unilateral terms-of-service changes. No network latency. For critical use cases, that's a real advantage. But let's be honest: if your own infrastructure goes down, the result is the same. Operational independence is only an argument if you have the capacity to maintain that infrastructure.
The Economic Calculation: What's It Worth?
Let's take a concrete case. A team of 5 developers uses an AI model via cloud API for code assistance, document review, and content generation. Estimated consumption: 50 million input tokens and 15 million output tokens per month.
Cloud scenario (typical rates March 2026): approximately $150–250/month depending on the provider and model. Simple, predictable, no hardware investment.
Local scenario (used graphics card): investment of ~$700, electricity consumption of about $30/month, amortized over 24 months = ~$60/month all-in.
💡 Break-Even Point
Local becomes cost-effective in roughly 4 to 6 months if your usage is sufficiently regular. Below that volume, cloud remains more economical, and that's perfectly fine. The right choice depends on your actual usage, not a theoretical ideal.
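The break-even claim above follows directly from the numbers in the two scenarios. A minimal sketch of that arithmetic, using the article's March 2026 estimates (adjust the inputs to your own bill):

```python
# Break-even sketch: when does a one-time GPU purchase beat a recurring cloud bill?
# Figures below are the article's estimates; swap in your own.

def breakeven_months(hardware_cost: float,
                     cloud_monthly: float,
                     electricity_monthly: float) -> float:
    """Months until cumulative cloud spend exceeds hardware + electricity."""
    monthly_savings = cloud_monthly - electricity_monthly
    if monthly_savings <= 0:
        return float("inf")  # at this volume, local never pays off
    return hardware_cost / monthly_savings

# Used RTX 3090 (~$700), ~$30/month electricity, vs. a $150-250/month cloud bill
low_usage = breakeven_months(700, 150, 30)   # ~5.8 months
high_usage = breakeven_months(700, 250, 30)  # ~3.2 months
print(f"Break-even: {high_usage:.1f} to {low_usage:.1f} months")
```

Note that the function returns infinity when the cloud bill is below the electricity cost: that is the "sporadic usage" case where cloud simply wins.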
What You Need to Know About Hardware
You don't need to understand GPU architecture in detail to make the right decision. Here's the essential takeaway.
Only one number matters: VRAM, the graphics card's memory. The larger the AI model, the more you need. Modern compression techniques (quantization) cut that requirement roughly to a quarter.
In practice, here's what different card tiers enable:
| Budget | Recommended Card | What You Can Run | For Whom |
|---|---|---|---|
| ~$700 | Used RTX 3090 (24 GB) | Models up to 32B parameters | Solo developer, small team |
| ~$2,000 | RTX 5090 (32 GB) | 32B models comfortably | Technical team of 5-15 |
| ~$3,000 | Project DIGITS (128 GB) | 70B models comfortably | Serious SMB, R&D |
| ~$8,500 | RTX PRO 6000 (96 GB) | 70B models, beginning of 100B+ | Shared inference server |
ℹ️ Translation for Non-Technical Readers
A 9-billion parameter (9B) model handles chat, code assistance, and common writing tasks well. A 32B model offers a significant quality jump in reasoning. A 70B model rivals the best cloud models. Beyond that, you're entering cluster and cloud territory.
Recommendations by Organizational Profile
The Solo Developer or Freelancer
Used RTX 3090 (~$700) or RTX 5060 Ti 16 GB (~$450). You'll run 9B to 14B models comfortably. With the 3090, you scale up to 32B — a significant quality jump for code assistance and writing.
The Technical Team (5-15 People)
RTX 5090 (32 GB, ~$2,000) or 2× used RTX 3090s. You cover models up to 32B comfortably. If you need 70B quality, consider a workstation with an RTX PRO 6000 or keep an eye on NVIDIA's Project DIGITS.
The Organization (Internal Server / Shared Inference)
RTX PRO 6000 (96 GB, ~$8,500) for a single inference server capable of serving a 70B model to multiple simultaneous users. In multi-GPU configuration, you access 100B+ models. For the largest models, the cloud remains essential — but for 90% of use cases, one or two workstation cards suffice.
The Hybrid Approach (Our Recommendation)
In practice, most organizations will benefit from combining both approaches. Local inference for daily tasks, sensitive data, and predictable volumes. Cloud APIs for occasional requests requiring frontier models (Claude Opus, GPT-4o) or to absorb demand spikes.
This is the architecture we deploy at Byrnu with our Cognito framework — an agentic system that can intelligently route requests between local models and cloud APIs based on task complexity, data sensitivity, and available budget.
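Cognito's actual routing logic is not described here, but the idea of routing on sensitivity, complexity, and budget can be sketched in a few lines. Everything below is hypothetical and illustrative: the `Request` fields, the 0.8 threshold, and the route names are assumptions, not the framework's API.

```python
# Minimal, hypothetical sketch of hybrid routing: local-first,
# cloud frontier models as an escalation path. NOT Cognito's implementation.

from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    sensitive: bool          # data that must not leave your infrastructure
    complexity: float        # 0.0-1.0 estimated task difficulty (assumed scale)
    budget_remaining: float  # cloud budget left this month, in dollars

def route(req: Request) -> str:
    if req.sensitive:
        return "local"           # confidentiality trumps everything else
    if req.complexity > 0.8 and req.budget_remaining > 0:
        return "cloud-frontier"  # rare, hard tasks escalate to a frontier model
    return "local"               # default: predictable cost, no data egress

print(route(Request("summarize this internal contract", True, 0.3, 50.0)))
print(route(Request("multi-step architecture review", False, 0.95, 50.0)))
```

The ordering is the point of the design: the sensitivity check runs first so that no confidential request can ever be escalated, regardless of complexity or remaining budget.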
What Changes in the Next 12 Months
Three trends will make the choice even more nuanced.
Memory is finally increasing. Project DIGITS at 128 GB for $3,000 is a strong signal. Rumored RTX 50 Super cards would double the VRAM on several consumer models. More local memory means more models accessible without the cloud.
Models are becoming more efficient. New architectures deliver the quality of a 70B model with the speed of a model 5× smaller. Qwen 3.5 9B, released in March 2026, beats models three times its size from a year earlier. This makes local more viable, but it also makes cloud cheaper per request.
Tooling is simplifying. Ollama, vLLM, llama.cpp: local inference has gone from "research project" to "one terminal command." Meanwhile, cloud APIs are also becoming simpler and more competitive. The friction gap between both approaches is shrinking.
The Decision Is Yours
There's no universal answer. Cloud is the right choice if your usage is sporadic, if you have no confidentiality constraints, or if you don't want to manage infrastructure. It's also the only access to the latest frontier models.
Local is the right choice if you handle sensitive data, want budget predictability at regular volume, and have a minimum of internal technical capacity.
In practice, most organizations in 2026 would benefit from a mix of both. The hardware, models, and tools are there on both sides. What's often missing is having done the math honestly.
The real question isn't "all local" or "all cloud." It's: for each use case, which one makes the most sense?
For the More Technical: The Complete GPU and Model Guide
The Essentials
For local inference, the only metric that matters is VRAM (video memory). A model that doesn't fit in VRAM falls back to CPU offloading — 5 to 30× slower.
Q4 quantization (4-bit compression) divides the memory footprint by four with marginal quality loss. It's the standard in 2026.
ℹ️ VRAM Calculation Formula
VRAM required ≈ (Parameters in billions × bytes per parameter) × 1.18
- FP16: 2 bytes/param → Llama 70B = ~165 GB
- Q8: 1 byte/param → Llama 70B = ~83 GB
- Q4: 0.5 bytes/param → Llama 70B = ~41 GB
Watch out for Mixture of Experts (MoE) models: DeepSeek R1 advertises 671B parameters but only activates 37B per token. The trap: all parameters must still be in VRAM. DeepSeek R1 in Q4 = ~396 GB — cloud territory only.
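The formula and the MoE caveat above can be checked with a few lines of arithmetic. This reproduces the article's own numbers (the 1.18 overhead factor for KV cache, activations, and framework is the article's estimate):

```python
# VRAM estimate from the formula above:
# params (billions) x bytes per parameter x 1.18 overhead factor.

BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def vram_gb(params_billions: float, quant: str) -> float:
    """Rough VRAM requirement in GB for a given model size and quantization."""
    return params_billions * BYTES_PER_PARAM[quant] * 1.18

print(f"Llama 70B FP16: {vram_gb(70, 'fp16'):.0f} GB")  # ~165 GB
print(f"Llama 70B Q8:   {vram_gb(70, 'q8'):.0f} GB")    # ~83 GB
print(f"Llama 70B Q4:   {vram_gb(70, 'q4'):.0f} GB")    # ~41 GB

# MoE trap: DeepSeek R1 activates only 37B params per token,
# but all 671B must still be resident in VRAM.
print(f"DeepSeek R1 Q4: {vram_gb(671, 'q4'):.0f} GB")   # ~396 GB
```

Note that the MoE calculation uses the full parameter count, not the active count: activation sparsity buys speed, not memory.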
Compatibility Grid
The grid below cross-references commercial NVIDIA GPUs (consumer, workstation, data center) with major open-source models at Q4 quantization, showing at a glance what fits in which card.
NVIDIA GPUs vs. OSS Models
VRAM compatibility of commercial NVIDIA graphics cards with the major open-source LLMs (March 2026). Legend: ✓ = fits in VRAM · ~ = tight fit · 2x+ = requires two or more cards · ✗ = doesn't fit.
| LLM Model | RTX 5060 Ti 8GB · 8 GB GDDR7 · ~$400 | RTX 5060 Ti 16GB · 16 GB GDDR7 · ~$450 | RTX 5070 · 12 GB GDDR7 · ~$550 | RTX 5070 Ti · 16 GB GDDR7 · ~$750 | RTX 5080 · 16 GB GDDR7 · ~$1,000 | RTX 5090 · 32 GB GDDR7 · ~$2,000 | RTX 4090 (used) · 24 GB GDDR6X · ~$1,400 | RTX 3090 (used) · 24 GB GDDR6X · ~$700 | RTX A6000 · 48 GB GDDR6 ECC · ~$4,500 | RTX 6000 Ada · 48 GB GDDR6 ECC · ~$6,800 | RTX PRO 6000 · 96 GB GDDR7 ECC · ~$8,500 | L40S · 48 GB GDDR6 · Cloud | A100 80GB · 80 GB HBM2e · Cloud | H100 SXM · 80 GB HBM3 · Cloud | H200 · 141 GB HBM3e · Cloud | B200 · 192 GB HBM3e · Cloud | B300 · 288 GB HBM3e · Cloud | Project DIGITS · 128 GB Unified · ~$3,000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen 3.5 9B · Dense · 9B · ~5 GB Q4 · Chat, code, multimodal | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Phi-4 14B / Qwen 2.5 14B · Dense · 14B · ~8 GB Q4 · Reasoning, code | 2x+ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Mistral Small 3.1 24B · Dense · 24B · ~14 GB Q4 · Multilingual, chat | ✗ | ~ | 2x+ | ~ | ~ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Gemma 3 27B / Qwen 2.5 32B · Dense · 32B · ~19 GB Q4 · Advanced chat, code | ✗ | 2x+ | ✗ | 2x+ | 2x+ | ✓ | ~ | ~ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Llama 3.3 70B / Qwen 2.5 72B · Dense · 72B · ~42 GB Q4 · Open-source frontier | ✗ | ✗ | ✗ | ✗ | ✗ | 2x+ | ✗ | ✗ | ~ | ~ | ✓ | ~ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Llama 4 Scout 109B MoE · MoE · 109B · ~64 GB Q4 · 10M context, general | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 2x+ | 2x+ | ✓ | 2x+ | ~ | ~ | ✓ | ✓ | ✓ | ✓ |
| Qwen3-235B-A22B MoE · MoE · 235B · ~139 GB Q4 · Multilingual, reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 2x+ | ✗ | ✗ | ✗ | 2x+ | ✓ | ✓ | 2x+ |
| Qwen 3.5 397B MoE · MoE · 397B · ~234 GB Q4 · Frontier, multimodal | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 2x+ | ~ | ✗ |
| DeepSeek R1 671B MoE · MoE · 671B · ~396 GB Q4 · CoT reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 2x+ | ✗ |
| Mistral Large 3 675B MoE · MoE · 675B · ~398 GB Q4 · 80+ languages, 256k ctx | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 2x+ | ✗ |
* MoE = Mixture of Experts. Estimated VRAM = model weights + ~18% overhead (KV cache, activations, framework). Prices approximate, March 2026. MoE models load all parameters into VRAM even though only a fraction is active per token.