AI Architecture · 6 min read · 3 April 2026

vLLM, Ollama, or llama.cpp? Picking Your On-Premise Inference Server by Concurrency, Not Hype

Most UAE SMEs pick an on-premise LLM server off a demo with one user, then hit a wall when several staff members query it at once. At a single request the gap between Ollama and vLLM is invisible. At ten requests it is catastrophic. Here is the one number that should drive the whole decision: count your concurrent users at peak. Below about five, Ollama is fine. At five or more, you need vLLM, and no amount of tuning closes that gap. This guide maps each server to the real concurrency load of UAE clinics, law firms, and brokerages using measured throughput, not vendor claims.
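If you already have request logs from a pilot, peak concurrency can be counted rather than guessed. The sketch below assumes a hypothetical log of (start, end) timestamps in seconds, one pair per request; the function name and the sample data are illustrative, not taken from any particular server's log format.

```python
from typing import Iterable, Tuple


def peak_concurrency(requests: Iterable[Tuple[float, float]]) -> int:
    """Return the largest number of requests in flight at the same instant.

    Each request is a (start, end) pair in seconds (hypothetical log format).
    """
    events = []
    for start, end in requests:
        events.append((start, 1))   # a request begins
        events.append((end, -1))    # a request finishes
    # Sort by time; at equal timestamps, finishes (-1) are handled before starts (+1)
    # so back-to-back requests are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    in_flight = 0
    peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak


# Illustrative sample: five requests, some overlapping.
sample_log = [(0.0, 4.0), (1.0, 3.0), (2.0, 6.0), (5.0, 7.0), (6.0, 8.0)]
print(peak_concurrency(sample_log))
```

Run it over a busy day rather than an average one, since the decision hinges on the peak, not the mean.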

The Concurrency Gap That Catches Everyone Off Guard

Run a one-user demo and Ollama actually wins. It does roughly 45 tokens per second against vLLM's 38 on a Llama 3.1 8B model. That single-request number is what teams see, and it's why they keep reaching for Ollama in production. The trouble starts the moment a second and third user show up. Ollama processes requests sequentially by default, so each one waits behind the last. Watch what happens as the load climbs. At 8 concurrent requests, vLLM produces 187 tokens per second to Ollama's 82, a 2.3x gap. At 10 concurrent users it's 485 versus 148. Push to 50 concurrent users on a single GPU and vLLM reaches about 920 tokens per second while Ollama flattens out near 155. The crossover sits at roughly 5 simultaneous users. A clinic running one front-desk system alongside one clinical notes assistant stays under that line, and Ollama will hold up. A 20-person law firm where paralegals and partners hammer a document assistant all day does not. There, vLLM is a functional requirement, not an upgrade you can defer. None of this is projection. It's how each server's architecture handles concurrent attention computation, measured on real hardware.
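To make the divergence easier to see, the figures quoted above can be turned into ratios and a rough per-user share. This is only arithmetic on the article's own numbers; dividing aggregate throughput evenly across users is a simplifying assumption, and the names below are illustrative.

```python
# (concurrent requests, vLLM tokens/s, Ollama tokens/s) as quoted in the text above.
MEASUREMENTS = [
    (1, 38, 45),
    (8, 187, 82),
    (10, 485, 148),
    (50, 920, 155),
]


def summarise(rows):
    """Compute the vLLM/Ollama ratio and an even per-user split for each load level."""
    out = []
    for users, vllm, ollama in rows:
        out.append({
            "users": users,
            "ratio": round(vllm / ollama, 2),            # > 1 means vLLM is ahead
            "vllm_per_user": round(vllm / users, 1),      # assumes an even split
            "ollama_per_user": round(ollama / users, 1),  # assumes an even split
        })
    return out


for row in summarise(MEASUREMENTS):
    print(row)
```

The ratio moves from roughly 0.84x at one request to about 2.3x, 3.3x, and 5.9x at 8, 10, and 50, which is the widening gap the section describes.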

Why the Numbers Diverge: PagedAttention Versus a FIFO Queue

The architecture explains the whole gap. vLLM implements PagedAttention, which splits the KV cache (the memory holding the attention keys and values for each token) into fixed-size blocks that do not have to sit next to each other in memory. The idea is borrowed straight from operating-system virtual memory paging. That one design choice all but eliminates KV cache fragmentation, recovers up to 55% of the VRAM that naive implementations waste on padding and over-allocation, and makes continuous batching straightforward. With continuous batching on by default, vLLM processes tokens from several requests in the same forward pass instead of draining one request before it touches the next. So adding concurrent users costs you 10 to 20 percent of throughput up to moderate load, rather than sending it off a cliff.

Ollama works the other way. It wraps llama.cpp internally and serialises requests by default. Set OLLAMA_NUM_PARALLEL=4 and throughput rises three to four times, but per-request latency climbs 20 to 40 percent, and you still don't approach vLLM's concurrent efficiency. The worse problem is model eviction: under mixed-model traffic Ollama unloads and reloads models, and latency spikes by several seconds. Picture a voice-AI system in a Dubai clinic freezing for three seconds mid-sentence while a doctor is with a patient. That alone takes Ollama off the table for any multi-model production deployment.
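The paging idea is easier to see in miniature. What follows is a toy block table in plain Python, not vLLM's actual implementation: the class name, the block size of 16 tokens, and the 512-token contiguous baseline are all assumptions chosen for illustration. It shows two interleaved requests taking blocks on demand from a shared pool, and compares the unused slots with what an up-front contiguous reservation would leave idle.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class PagedKVCache:
    """Toy block table: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Make room for one more token, taking a new block only when the last is full."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * BLOCK_SIZE:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated slots holding no token: only the tail of each last block."""
        return sum(len(self.block_tables[s]) * BLOCK_SIZE - self.seq_lens[s]
                   for s in self.block_tables)


cache = PagedKVCache(num_blocks=64)
# Interleave two requests token by token, roughly as a continuous batch would.
for step in range(40):
    cache.append_token("req-a")
    if step < 10:
        cache.append_token("req-b")

paged_waste = cache.wasted_slots()
# Baseline for comparison: reserve a contiguous 512-token region per request up front.
MAX_LEN = 512
contiguous_waste = sum(MAX_LEN - n for n in cache.seq_lens.values())

print(cache.block_tables)   # req-a's blocks are not adjacent to each other
print(paged_waste, contiguous_waste)
```

Because blocks are handed out as tokens arrive, the two requests end up with interleaved physical blocks, and the only idle slots are at the end of each request's last block.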

Hardware Requirements and Where llama.cpp Belongs

For vLLM, plan on NVIDIA hardware: driver 525 or above, CUDA 12.1 minimum, and GPU compute capability 7.0 or higher. That covers the V100 onward, including the RTX 20, 30, and 40 series, A10G, A100, and H100. VRAM scales with model size. An 8GB card runs a 7B to 8B model at Q4_K_M quantisation; 16 to 24GB handles 14B to 32B models; 70B models want 48 to 96GB. Budget another 500MB to 2GB of VRAM for framework overhead before the model even loads; forgetting it is an easy way to undersize a GPU.

If your UAE site has no NVIDIA GPU at all, llama.cpp is the answer. Think of a branch office on a workstation with an integrated GPU, or an edge box at a remote clinic. On an NVIDIA GPU, llama.cpp does roughly 122 to 128 tokens per second on Llama 3 8B Q4_K_M. On an Apple M4 Max through Metal it manages about 75. On pure CPU with AVX-512 or ARM NEON it falls to 10 to 15 tokens per second, fine for one person summarising a document, useless for anything real-time. The strength of llama.cpp is reach, not concurrency. It runs on CUDA, ROCm, Intel SYCL, Apple Metal, and Vulkan. Use it for CPU-only edge deployments and developer machines. Do not put it in front of a roomful of staff sharing one system.
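The VRAM tiers above can be sanity-checked with a back-of-envelope estimate. The sketch assumes Q4_K_M averages a little under 5 bits per weight (4.85 is used here as an approximation, not a published constant) and takes the upper end of the 500MB to 2GB overhead range from the text. It deliberately leaves out the KV cache, which grows with context length and concurrency, so treat the result as a floor.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.85,  # assumed average for Q4_K_M
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM floor in GB: quantised weights plus framework overhead.

    Excludes the KV cache, so real usage under load will be higher.
    """
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


for size in (8, 14, 32, 70):
    print(f"{size}B -> about {estimate_vram_gb(size)} GB before KV cache")
```

Under these assumptions an 8B model lands below 8GB, a 32B model below 24GB, and a 70B model below 48GB, which lines up with the tiers quoted above while leaving little headroom at the bottom of each range.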

The Decision Matrix and UAE Compliance Reality

It comes down to three cases. Development, demos, and single-user testing go to Ollama: setup is under ten minutes, GGUF quantised models load with one command, and sequential throughput is plenty. Any production deployment that reaches roughly five concurrent users at peak goes to vLLM, and that's most of the clients we see: a five-doctor clinic running separate sessions for front desk and clinical documentation, a law firm where paralegals and partners query a contract analysis system at once, a brokerage with agents pulling property summaries during client meetings. CPU-only or mixed-hardware edge sites go to llama.cpp, either as a single-user endpoint or embedded in a pipeline with pre-processing to shed load.

UAE regulation sharpens all of this. The UAE Personal Data Protection Law covers processing of personal data belonging to UAE residents, so patient records, case files, and client transaction data running through an LLM inference server are in scope. DIFC Regulation 10 on Automated Processing has been in full enforcement since January 2026, and a Data Protection Impact Assessment is required before you deploy AI against personal data. Running the inference server on-premise in the UAE takes cross-border transfer questions off the table for the inference step and gives you a clean posture for any audit. So the deployment choice carries past performance. It shapes whether the deployment clears a DPIA review at all.
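The three cases at the start of this section can be written down as a small function. This is a sketch of the article's rule of thumb, not a sizing tool: the function name and parameters are illustrative, and the default crossover of 5 follows the roughly-five-user figure quoted earlier, so lower it if you want a more conservative cut-off.

```python
def pick_server(peak_concurrent_users: int,
                has_nvidia_gpu: bool,
                production: bool,
                crossover: int = 5) -> str:
    """Map a deployment to one of the three servers using the article's rule of thumb."""
    if not has_nvidia_gpu:
        # CPU-only or mixed-hardware edge sites
        return "llama.cpp"
    if production and peak_concurrent_users >= crossover:
        # Shared production system at or above the crossover
        return "vLLM"
    # Development, demos, single-user testing, or light production load
    return "Ollama"


print(pick_server(peak_concurrent_users=1, has_nvidia_gpu=True, production=False))
print(pick_server(peak_concurrent_users=20, has_nvidia_gpu=True, production=True))
print(pick_server(peak_concurrent_users=2, has_nvidia_gpu=False, production=True))
```

Compliance does not appear as a parameter because it applies in every branch: whichever server is chosen, the DPIA and data-residency points above still hold.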
