Build and service custom inference servers under €10,000 — private, on-premise AI without cloud costs.
Open-source models that deliver real value for common business tasks — and run on affordable hardware.
| Model | Size | VRAM (4-bit) | VRAM (8-bit) | Est. tok/s (V100) | Best For |
|---|---|---|---|---|---|
| Llama 3.2 3B | 3B | 2.5 GB | 4 GB | 80-100 | Simple tasks, chatbots, classification |
| Mistral 7B | 7B | 5 GB | 8 GB | 50-70 | General purpose, balanced performance |
| Llama 3.1 8B | 8B | 6 GB | 10 GB | 45-60 | General purpose, excellent instruction following |
| Gemma 2 9B | 9B | 7 GB | 11 GB | 40-55 | Enterprise tasks, structured output |
| Qwen 2.5 14B | 14B | 10 GB | 16 GB | 25-35 | Coding, reasoning, multilingual |
| Mistral NeMo 12B | 12B | 9 GB | 14 GB | 30-40 | Long context, document analysis |
| Qwen 2.5 Coder 7B | 7B | 5 GB | 8 GB | 50-70 | Code generation, debugging |
| Qwen 2.5 Coder 32B | 32B | 22 GB | 36 GB | 12-18 | Professional coding, complex reasoning |
| Mixtral 8x7B | 47B* | 24 GB | 45 GB | 15-25 | High-quality output, MoE efficiency |
| Llama 3.3 70B | 70B | 40 GB | 75 GB | 5-8 | Complex reasoning, premium quality |
*Mixtral uses Mixture of Experts: 47B total params, ~13B active per token
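If you want a feel for where the VRAM columns come from, a common rule of thumb is weights ≈ parameters × bits / 8, plus roughly 20-40% overhead for the KV cache and runtime buffers. A minimal sketch (the 30% overhead factor is an assumption; actual usage depends on context length and serving stack):

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# Rule of thumb: weights = params * bits/8 bytes, plus ~20-40% overhead
# for KV cache and runtime buffers (the 0.3 default is an assumption).
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.3) -> float:
    weights_gb = params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB
    return round(weights_gb * (1 + overhead), 1)

for name, params in [("Mistral 7B", 7), ("Qwen 2.5 14B", 14), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{estimate_vram_gb(params, 4)} GB @4-bit, "
          f"~{estimate_vram_gb(params, 8)} GB @8-bit")
```

The results land close to the table above; the spread you see there comes from per-model differences in KV cache size and context length.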
Recommended for chatbots & customer support: Mistral 7B, Llama 3.1 8B, Gemma 2 9B
| Requirement | Minimum | Recommended |
|---|---|---|
| Response quality | Good | Excellent |
| Speed needed | 30+ tok/s | 50+ tok/s |
| VRAM | 8 GB | 16 GB |
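A chatbot is also the quickest thing to prototype. A minimal round-trip against a local Ollama server (assumes the default endpoint at localhost:11434; model tag and prompts are placeholders to adapt):

```python
import requests

# One chat turn against a local Ollama server. "stream": False returns
# the whole reply in a single JSON response instead of token-by-token.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral:7b",
        "messages": [
            {"role": "system", "content": "You are a concise support assistant."},
            {"role": "user", "content": "How do I reset my password?"},
        ],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```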
What it does: Summarizes contracts, reports, and manuals, extracts key facts, and answers questions about long documents.
Recommended for document analysis: Mistral NeMo 12B, Llama 3.1 8B, Gemma 2 9B
| Document Size | Model | VRAM |
|---|---|---|
| Short (<4K tokens) | Mistral 7B | 8 GB |
| Medium (4K-16K) | Mistral NeMo 12B | 16 GB |
| Long (16K-128K) | Llama 3.1 8B | 16 GB |
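For documents that exceed the model's context window, a simple map-reduce pass works well. A sketch (chunking by character count is deliberately naive; with Mistral NeMo's 128K context many documents need no chunking at all):

```python
import requests

def ask(model: str, prompt: str) -> str:
    # One non-streaming completion from a local Ollama server.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    return r.json()["response"]

def summarize(document: str, model: str = "mistral-nemo:12b", chunk_chars: int = 12_000) -> str:
    # Map: summarize each chunk. Reduce: merge the partial summaries.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [ask(model, f"Summarize the key points:\n\n{c}") for c in chunks]
    if len(partials) == 1:
        return partials[0]
    return ask(model, "Merge these partial summaries into one:\n\n" + "\n\n".join(partials))
```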
What it does: Translates business text between languages while preserving tone and technical terminology.
Recommended for translation: Qwen 2.5 14B, Llama 3.1 8B
| Languages | Model | Notes |
|---|---|---|
| EN/DE/FR/ES | Llama 3.1 8B | Good for European languages |
| 30+ languages | Qwen 2.5 14B | Excellent multilingual support |
| Technical docs | Qwen 2.5 14B | Better at preserving terminology |
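In practice, translation quality improves when you pin terminology in the system prompt and keep the temperature low. A sketch (model tag, glossary, and sample sentence are illustrative):

```python
import requests

# Translation with a glossary pinned in the system prompt; a low
# temperature keeps terminology choices consistent across calls.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:14b",
        "messages": [
            {"role": "system", "content": (
                "Translate the user's text from German to English. "
                "Keep product names untranslated, e.g. 'Kundenportal'."
            )},
            {"role": "user", "content": "Das Kundenportal ist ab sofort verfügbar."},
        ],
        "options": {"temperature": 0.2},
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```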
What it does: Generates, reviews, and refactors code, from one-off functions to larger architectural changes.
Recommended for coding: Qwen 2.5 Coder 7B, Qwen 2.5 Coder 32B
| Task | Model | VRAM | Quality |
|---|---|---|---|
| Simple functions | Qwen 2.5 Coder 7B | 8 GB | ⭐⭐⭐⭐ |
| Bug fixes | Qwen 2.5 Coder 7B | 8 GB | ⭐⭐⭐⭐ |
| Code review | Qwen 2.5 Coder 14B | 12 GB | ⭐⭐⭐⭐⭐ |
| Complex refactoring | Qwen 2.5 Coder 32B | 24 GB | ⭐⭐⭐⭐⭐ |
| Architecture decisions | Qwen 2.5 Coder 32B | 24 GB | ⭐⭐⭐⭐⭐ |
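For code generation you usually want deterministic output; setting the temperature to 0 makes runs reproducible. A sketch using the /api/generate endpoint (the prompt is illustrative):

```python
import requests

# Deterministic code generation: temperature 0 means greedy decoding,
# so the same prompt yields the same code on every run.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:7b",
        "prompt": "Write a Python function that validates an IBAN checksum. Return only code.",
        "stream": False,
        "options": {"temperature": 0},
    },
    timeout=300,
)
print(resp.json()["response"])
```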
What it does: Drafts marketing and business copy: social media posts, blog articles, email campaigns, and product descriptions.
Recommended for content generation: Llama 3.1 8B, Mistral 7B, Gemma 2 9B
| Content Type | Model | Why |
|---|---|---|
| Social media posts | Mistral 7B | Fast, creative |
| Blog articles | Llama 3.1 8B | Better structure |
| Email campaigns | Gemma 2 9B | Professional tone |
| Product descriptions | Llama 3.1 8B | Consistent quality |
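Content generation is the opposite case: a higher temperature trades consistency for variety. A sketch (the value is a starting point, not a rule):

```python
import requests

# Raising temperature increases variety between drafts, which is what
# you want for copywriting and what you avoid for code or extraction.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Write three short taglines for a local bakery.",
        "stream": False,
        "options": {"temperature": 0.9},
    },
    timeout=120,
)
print(resp.json()["response"])
```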
What it does: Turns raw data into SQL queries, plain-language interpretations, and written reports.
Recommended for data analysis & reporting: Llama 3.1 8B, Qwen 2.5 14B
| Task | Model | Speed | Quality |
|---|---|---|---|
| SQL generation | Qwen 2.5 Coder 7B | Fast | ⭐⭐⭐⭐ |
| Data interpretation | Qwen 2.5 14B | Medium | ⭐⭐⭐⭐⭐ |
| Report writing | Llama 3.1 8B | Fast | ⭐⭐⭐⭐ |
| Chart descriptions | Gemma 2 9B | Fast | ⭐⭐⭐⭐ |
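When the answer feeds a dashboard or pipeline rather than a human, ask for machine-readable output. Ollama accepts `"format": "json"` to constrain the reply to valid JSON (the figures and keys below are illustrative):

```python
import requests, json

# "format": "json" constrains the model's reply to valid JSON, which
# makes the output safe to parse and feed into downstream tooling.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:14b",
        "messages": [{
            "role": "user",
            "content": "Revenue was 1.2M EUR in Q1 and 1.5M EUR in Q2. "
                       "Return JSON with keys q1, q2, growth_pct.",
        }],
        "format": "json",
        "stream": False,
    },
    timeout=120,
)
print(json.loads(resp.json()["message"]["content"]))
```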
How fast does your model need to be? Typical throughput targets by use case:
| Use Case | Minimum | Comfortable | Excellent |
|---|---|---|---|
| Chatbot | 20 tok/s | 40 tok/s | 60+ tok/s |
| Document processing | 15 tok/s | 30 tok/s | 50+ tok/s |
| Batch processing | 10 tok/s | 20 tok/s | 40+ tok/s |
| Real-time assistance | 30 tok/s | 50 tok/s | 80+ tok/s |
For reference: humans read ~250 words/minute, about 4 words/second, or roughly 5-6 tok/s at ~1.3 tokens per word.
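You don't have to guess your own numbers: Ollama reports them with every response. `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds, so their ratio is your real throughput. A sketch:

```python
import requests

# Compute real throughput from Ollama's response metadata:
# eval_count = generated tokens, eval_duration = generation time in ns.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral:7b", "prompt": "Explain VAT in two sentences.", "stream": False},
    timeout=300,
).json()
print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.1f} tok/s")
```

For comparison, typical figures on common data-center GPUs: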
| Hardware | 7B model tok/s | 14B model tok/s | 70B model tok/s |
|---|---|---|---|
| Tesla P100 | 35-50 | 15-25 | 3-6 |
| Tesla V100 | 50-70 | 25-40 | 5-10 |
| Tesla T4 | 30-45 | 12-20 | 3-5 |
| 2x V100 | 90-120 | 45-70 | 10-18 |
On a single 16 GB GPU (e.g., Tesla T4, V100 16 GB, P100):
✅ Mistral 7B (fast, reliable)
✅ Llama 3.1 8B (great instruction following)
✅ Qwen 2.5 Coder 7B (coding)
✅ Gemma 2 9B (enterprise tasks)
⚠️ Qwen 2.5 14B (tight fit, single GPU)
❌ Mixtral, 70B models
With ~24-32 GB VRAM (e.g., V100 32 GB or 2x T4):
✅ All models from above (faster)
✅ Qwen 2.5 14B (comfortable)
✅ Qwen 2.5 Coder 32B (tight fit)
⚠️ Mixtral 8x7B (quantized, slow)
❌ 70B models
With 48+ GB VRAM (e.g., 2x V100 32 GB):
✅ All smaller models (very fast)
✅ Qwen 2.5 Coder 32B (comfortable)
✅ Mixtral 8x7B (good speed)
⚠️ Llama 3.3 70B (4-bit, usable)
Llama 3.1 8B — Excellent instruction following, good at most tasks, fits on any GPU.
Qwen 2.5 Coder 32B — Competitive with GPT-4-class models on coding benchmarks. Needs 24 GB+ VRAM.
Qwen 2.5 14B — Supports 30+ languages, great translation quality.
Mistral NeMo 12B — 128K context window, excellent for long documents.
Mistral 7B — Punches above its weight, very fast, runs on anything.
All models are available through Ollama:

```bash
ollama run llama3.1:8b
```

We pre-install and configure these on all our servers.