Local or Cloud AI? The Real Math Nobody's Doing

Frederick Chapleau

Everyone talks about local AI. Few people know how much VRAM it actually takes to run a 70-billion-parameter model. Even fewer have done the math comparing a graphics card's cost to their monthly cloud bill.


Why Local Inference Deserves Your Attention

For two years, the dominant AI conversation has revolved around cloud APIs: OpenAI, Anthropic, Google. It's the simplest path. It's also the path where your organization has zero control over three critical elements: its data, its unit costs, and its dependency on a single vendor.

Meanwhile, the open-source ecosystem has caught up — and in some cases surpassed — proprietary models. Qwen 3.5, DeepSeek R1, Llama 4, Mistral Large 3: these models compete with GPT-4 and Claude on multiple benchmarks. And they can run on your own hardware.

For decision-makers, the question is no longer "is open-source AI good enough?" It is. The question is: do you control the infrastructure it runs on?

Three Strategic Reasons to Act Now

1. Data Sovereignty

Every cloud API call sends your data to a third party, often American. For organizations subject to data protection legislation (Law 25 in Quebec, GDPR in Europe), SOC 2 requirements, or simply concerned about not sending strategic documents to a third party, local inference eliminates this risk at the source: your data never leaves your infrastructure.

2. Cost Predictability

Cloud APIs charge per token. The more your teams adopt AI, the higher the bill climbs — often unpredictably. Local inference transforms a variable cost into a fixed investment, amortized over 24 months, with marginal electricity consumption.

3. Operational Independence

No vendor outages. No unilateral terms-of-service changes. No network latency. For critical use cases, that's a real advantage. But let's be honest: if your own infrastructure goes down, the result is the same. Operational independence is only an argument if you have the capacity to maintain that infrastructure.

The Economic Calculation: What's It Worth?

Let's take a concrete case. A team of 5 developers uses an AI model via cloud API for code assistance, document review, and content generation. Estimated consumption: 50 million input tokens and 15 million output tokens per month.

Cloud scenario (typical rates March 2026): approximately $150–250/month depending on the provider and model. Simple, predictable, no hardware investment.

Local scenario (used graphics card): investment of ~$700, electricity consumption of about $30/month, amortized over 24 months = ~$60/month all-in.

💡 Break-Even Point

Local becomes cost-effective in 4 to 6 months if your usage is sufficiently regular. Below that, cloud remains more economical — and that's perfectly fine. The right choice depends on your actual volume, not a theoretical ideal.
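The arithmetic behind this break-even point can be sketched in a few lines. The per-token rates below ($2 per million input tokens, $8 per million output tokens) are illustrative assumptions, not any provider's actual pricing; they land mid-range in the $150–250 band quoted above.

```python
import math

def cloud_monthly(input_mtok, output_mtok, in_rate=2.0, out_rate=8.0):
    """Monthly cloud API bill in dollars; rates are per million tokens (assumed)."""
    return input_mtok * in_rate + output_mtok * out_rate

def break_even_month(hardware, electricity_per_month, cloud_per_month):
    """First month where cumulative local spend drops below cumulative cloud spend."""
    return math.ceil(hardware / (cloud_per_month - electricity_per_month))

cloud = cloud_monthly(50, 15)              # $220/month for the team above
month = break_even_month(700, 30, cloud)   # break-even in month 4
```

At $220/month the used-3090 scenario pays for itself in four months; at lighter usage (say $100/month) the same card takes ten, which is why actual volume, not a theoretical ideal, drives the decision.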

What You Need to Know About Hardware

You don't need to understand GPU architecture in detail to make the right decision. Here's the essential takeaway.

Only one number matters: VRAM, the graphics card's memory. The larger the AI model, the more you need. With modern compression techniques (quantization), memory requirements drop to roughly a quarter.

In practice, here's what different card tiers enable:

Budget · Recommended Card · What You Can Run · For Whom
~$700 · Used RTX 3090 (24 GB) · Models up to 32B parameters · Solo developer, small team
~$2,000 · RTX 5090 (32 GB) · 32B models comfortably · Technical team of 5-15
~$3,000 · Project DIGITS (128 GB) · 70B models comfortably · Serious SMB, R&D
~$8,500 · RTX PRO 6000 (96 GB) · 70B models, beginning of 100B+ · Shared inference server

ℹ️ Translation for Non-Technical Readers

A 9-billion parameter (9B) model handles chat, code assistance, and common writing tasks well. A 32B model offers a significant quality jump in reasoning. A 70B model rivals the best cloud models. Beyond that, you're entering cluster and cloud territory.

Recommendations by Organizational Profile

The Solo Developer or Freelancer

Used RTX 3090 (~$700) or RTX 5060 Ti 16 GB (~$450). You'll run 9B to 14B models comfortably. With the 3090, you scale up to 32B — a significant quality jump for code assistance and writing.

The Technical Team (5-15 People)

RTX 5090 (32 GB, ~$2,000) or 2× used RTX 3090s. You cover models up to 32B comfortably. If you need 70B quality, consider a workstation with an RTX PRO 6000 or keep an eye on NVIDIA's Project DIGITS.

The Organization (Internal Server / Shared Inference)

RTX PRO 6000 (96 GB, ~$8,500) for a single inference server capable of serving a 70B model to multiple simultaneous users. In multi-GPU configuration, you access 100B+ models. For the largest models, the cloud remains essential — but for 90% of use cases, one or two workstation cards suffice.
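Rough capacity math for that shared-server scenario, assuming Llama-3-class 70B dimensions (80 layers, 8 grouped-query KV heads, head dimension 128) and an FP16 KV cache. These figures are an estimate, not a benchmark:

```python
def kv_cache_gb(layers=80, kv_heads=8, head_dim=128, bytes_per=2, context=8192):
    """Per-session KV cache for a Llama-3-class 70B model (FP16 cache, GQA)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V tensors
    return per_token * context / 1e9

per_user = kv_cache_gb()                  # ~2.7 GB per 8k-token session
headroom = 96 - 42                        # RTX PRO 6000 VRAM minus Q4 70B weights
max_sessions = int(headroom / per_user)   # ~20 concurrent sessions
```

Roughly twenty simultaneous 8k-context sessions fit next to the Q4 weights on a 96 GB card; serving frameworks such as vLLM manage this headroom automatically.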

The Hybrid Approach (Our Recommendation)

In practice, most organizations will benefit from combining both approaches. Local inference for daily tasks, sensitive data, and predictable volumes. Cloud APIs for occasional requests requiring frontier models (Claude Opus, GPT-4o) or to absorb demand spikes.

This is the architecture we deploy at Byrnu with our Cognito framework — an agentic system that can intelligently route requests between local models and cloud APIs based on task complexity, data sensitivity, and available budget.
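A routing policy like this can be sketched as a simple decision function. The code below is a hypothetical illustration of the idea, not Cognito's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Request:
    sensitive: bool       # contains confidential or regulated data?
    needs_frontier: bool  # requires frontier-model quality?

def route(req: Request, spent: float, budget: float) -> str:
    """Sensitive data never leaves local infrastructure; frontier-quality
    tasks go to the cloud while the monthly budget allows; everything
    else stays on the local model."""
    if req.sensitive:
        return "local"
    if req.needs_frontier and spent < budget:
        return "cloud"
    return "local"

route(Request(sensitive=True, needs_frontier=True), spent=0, budget=200)    # "local"
route(Request(sensitive=False, needs_frontier=True), spent=50, budget=200)  # "cloud"
```

A real router would also weigh task complexity scores and queue depth, but the priority order shown (confidentiality first, then budget) is the core of the hybrid approach.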

What Changes in the Next 12 Months

Three trends will make the choice even more nuanced.

Memory is finally increasing. Project DIGITS at 128 GB for $3,000 is a strong signal. Rumored RTX 50 Super cards would double the VRAM on several consumer models. More local memory = more models accessible without cloud.

Models are becoming more efficient. New architectures deliver the quality of a 70B model with the speed of a model 5× smaller. The Qwen 3.5 9B from March 2026 beats models 3× its size from 12 months ago. This makes local more viable — but it also makes cloud cheaper per request.

Tooling is simplifying. Ollama, vLLM, llama.cpp: local inference has gone from "research project" to "one terminal command." Meanwhile, cloud APIs are also becoming simpler and more competitive. The friction gap between both approaches is shrinking.

The Decision Is Yours

There's no universal answer. Cloud is the right choice if your usage is sporadic, if you have no confidentiality constraints, or if you don't want to manage infrastructure. It's also the only way to access the latest frontier models.

Local is the right choice if you handle sensitive data, want budget predictability at regular volume, and have a minimum of internal technical capacity.

In practice, most organizations in 2026 would benefit from a mix of both. The hardware, models, and tools are there on both sides. What's often missing is having done the math honestly.


The real question isn't "all local" or "all cloud." It's: for each use case, which one makes the most sense?


For the More Technical: The Complete GPU and Model Guide

The Essentials

For local inference, the only metric that matters is VRAM (video memory). A model that doesn't fit in VRAM falls back to CPU offloading — 5 to 30× slower.

Q4 quantization (4-bit compression) divides the memory footprint by four with marginal quality loss. It's the standard in 2026.

ℹ️ VRAM Calculation Formula

VRAM required ≈ (Parameters in billions × bytes per parameter) × 1.18

  • FP16: 2 bytes/param → Llama 70B = ~165 GB
  • Q8: 1 byte/param → Llama 70B = ~83 GB
  • Q4: 0.5 bytes/param → Llama 70B = ~41 GB

Watch out for Mixture of Experts (MoE) models: DeepSeek R1 advertises 671B parameters but only activates 37B per token. The trap: all parameters must still be in VRAM. DeepSeek R1 in Q4 = ~396 GB — cloud territory only.
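The formula and the MoE caveat above, as a quick sanity check in code (the numbers match the estimates in this section):

```python
def vram_gb(params_b, bytes_per_param, overhead=1.18):
    """VRAM needed to load a model: weights plus ~18% runtime overhead."""
    return params_b * bytes_per_param * overhead

# Llama 70B at three precision levels
fp16 = vram_gb(70, 2.0)   # ~165 GB
q8   = vram_gb(70, 1.0)   # ~83 GB
q4   = vram_gb(70, 0.5)   # ~41 GB

# MoE trap: DeepSeek R1 activates 37B params per token but loads all 671B
deepseek_q4 = vram_gb(671, 0.5)   # ~396 GB, cloud territory
```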

Compatibility Grid

The tables below cross-reference commercial NVIDIA GPUs (consumer, workstation, data center) with major open-source models. Compare a model's VRAM footprint at a given quantization level against a card's capacity to see what fits on which card.

NVIDIA GPUs vs. Open-Source Models

VRAM compatibility of commercial NVIDIA graphics cards with the major open-source LLMs — March 2026

Card · VRAM · Price
RTX 5060 Ti 8GB · 8 GB GDDR7 · ~$400
RTX 5060 Ti 16GB · 16 GB GDDR7 · ~$450
RTX 5070 · 12 GB GDDR7 · ~$550
RTX 5070 Ti · 16 GB GDDR7 · ~$750
RTX 5080 · 16 GB GDDR7 · ~$1,000
RTX 5090 · 32 GB GDDR7 · ~$2,000
RTX 4090 (used) · 24 GB GDDR6X · ~$1,400
RTX 3090 (used) · 24 GB GDDR6X · ~$700
RTX A6000 · 48 GB GDDR6 ECC · ~$4,500
RTX 6000 Ada · 48 GB GDDR6 ECC · ~$6,800
RTX PRO 6000 · 96 GB GDDR7 ECC · ~$8,500
L40S · 48 GB GDDR6 · Cloud
A100 80GB · 80 GB HBM2e · Cloud
H100 SXM · 80 GB HBM3 · Cloud
H200 · 141 GB HBM3e · Cloud
B200 · 192 GB HBM3e · Cloud
B300 · 288 GB HBM3e · Cloud
Project DIGITS · 128 GB Unified · ~$3,000

Model · Architecture · VRAM (Q4) · Strengths
Qwen 3.5 9B · Dense, 9B · ~5 GB · Chat, code, multimodal
Phi-4 14B / Qwen 2.5 14B · Dense, 14B · ~8 GB · Reasoning, code
Mistral Small 3.1 24B · Dense, 24B · ~14 GB · Multilingual, chat
Gemma 3 27B / Qwen 2.5 32B · Dense, 32B · ~19 GB · Advanced chat, code
Llama 3.3 70B / Qwen 2.5 72B · Dense, 72B · ~42 GB · Open-source frontier
Llama 4 Scout 109B MoE · MoE, 109B · ~64 GB · 10M context, general-purpose
Qwen3-235B-A22B MoE · MoE, 235B · ~139 GB · Multilingual, reasoning
Qwen 3.5 397B MoE · MoE, 397B · ~234 GB · Frontier, multimodal
DeepSeek R1 671B MoE · MoE, 671B · ~396 GB · CoT reasoning
Mistral Large 3 675B MoE · MoE, 675B · ~398 GB · 80+ languages, 256k context

* MoE = Mixture of Experts (active parameters in parentheses). Estimated VRAM = model weights + ~18% overhead (KV cache, activations, framework). Prices are approximate, March 2026. MoE models load all parameters into VRAM even though only a fraction is active per token.