On-Premise LLM for a Clinic: The Real Bill of Materials (GPU, Power Draw, and the Maintenance Nobody Quotes You)
Every vendor quoting you an on-premise LLM for your DHA-licensed clinic shows you the GPU price. Almost none show you the electricity meter running in the background, the Ollama update that breaks your integration every two weeks, or the IT hours that quietly land on your payroll. So here is the position the rest of this piece defends: at low query volume, on-premise is a PDPL compliance decision, not a cost-saving one. This teardown gives you the actual line items behind that call. Hardware, power, software maintenance, and what PDPL compliance really costs to get right.
GPU Options and What Your VRAM Budget Actually Buys
Most clinic deployments start with the RTX 4090. At AED 9,999–11,550 from Dubai retailers like MindTech and GCC Gamers, it works for a specific class of models. The 24 GB of VRAM runs any 8B model in GGUF Q4_K_M format without complaint, eating around 6–8 GB and leaving plenty of headroom for context. That covers patient triage chatbots, appointment-intent classification, and summarising short clinical notes.
Where it stops is Llama 3.1 70B or Mixtral 8x7B. The 70B model in Q4_K_M needs 40–43 GB of VRAM to run fully on-GPU, and a single 4090 cannot do it. Your two options are both bad. Offload layers to CPU and you cut throughput by 60–80%. Split the model across two GPUs without NVLink and you add memory-copy latency that makes real-time inference feel sluggish.
The NVIDIA RTX A6000 at AED 17,300 per card (48 GB VRAM) solves this. One A6000 holds a 70B model with room left for a 4–8K KV cache. A full dual-A6000 workstation, built on Threadripper or Xeon with 256 GB RAM and NVMe RAID, runs USD 30,000–33,000 when priced in the US. Add UAE import duties and 5% VAT, and you should budget AED 121,000–140,000 for a complete local procurement. The A100 (80 GB HBM2e, AED 80,000+ per card) buys you more bandwidth and batch throughput, but it is rarely justified for a single-clinic deployment serving fewer than 50 concurrent queries.
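To make the VRAM budget concrete, here is a minimal sketch of the fit check: quantized weights plus KV cache against card capacity. The layer counts, KV-head counts, and head dimension are assumptions taken from the public Llama 3.1 model configs. The fp16 cache and the 1 GB runtime overhead allowance are also assumptions, and the sketch treats GB and GiB loosely, so read the result as a sanity check rather than a guarantee.

```python
# Rough VRAM fit check: quantized weights + KV cache vs. card capacity.
# Architecture numbers are assumptions based on the public Llama 3.1 configs.

GIB = 1024 ** 3

def kv_cache_gib(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GiB: one key and one value vector per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / GIB

def fits(weights_gb, kv_gib, vram_gb, overhead_gb=1.0):
    """True if weights, KV cache, and an assumed runtime overhead fit in VRAM."""
    return weights_gb + kv_gib + overhead_gb <= vram_gb

# Assumed shapes: 70B = 80 layers, 8B = 32 layers; both 8 KV heads of dim 128.
kv_70b = kv_cache_gib(layers=80, kv_heads=8, head_dim=128, context_tokens=8192)
kv_8b = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, context_tokens=8192)

print(f"70B KV cache at 8K context: {kv_70b:.2f} GiB")
print(f"8B KV cache at 8K context:  {kv_8b:.2f} GiB")

# Weight sizes use the upper ends of the ranges quoted in the text.
print("70B Q4_K_M on a 24 GB card:", fits(43, kv_70b, 24))
print("70B Q4_K_M on a 48 GB card:", fits(43, kv_70b, 48))
print("8B Q4_K_M on a 24 GB card: ", fits(8, kv_8b, 24))
```

Under these assumptions the 70B model misses a 24 GB card by a wide margin and fits a 48 GB card with little to spare, which is why longer contexts or concurrent sessions push you toward the second GPU.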
DEWA Will Send You a Bill: Running the Power Numbers
A single RTX 4090 idles at roughly 80 W and peaks between 350 and 450 W under inference load. The A6000 sits closer to 300 W TDP at load. A dual-A6000 workstation pulls 600–750 W at the GPU level alone. Then factor in PSU conversion losses, CPU draw, storage, and the cooling you need even in a server closet. Real wall-draw for a production-grade single-node deployment lands between 900 W and 1,500 W.
Now the meter. Dubai commercial electricity via DEWA runs AED 0.23/kWh on the base slab. Add the fuel surcharge (about AED 0.06/kWh), the municipality fee, and 5% VAT, and a small business drawing above-baseline power pays an effective AED 0.40–0.43/kWh. Run that server 12 hours a day through a standard clinic schedule and you burn 3,942–6,570 kWh per year, or AED 1,700–2,800 annually in electricity. Keep it available 24/7 for background document processing or after-hours query batching and that figure doubles.
This is not the cost that breaks the business case. But it is real, it recurs every year, and it almost never shows up in the vendor's proposal. Build it into your five-year TCO model from day one.
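The electricity arithmetic above is easy to rerun with your own meter readings. The sketch below uses the wall-draw range, the 12-hour schedule, and the effective tariff range quoted in this section. The function names are illustrative, and a 365-day year is an assumption (a clinic that closes on some days will come in lower).

```python
# Annual electricity estimate from wall draw, daily hours, and effective tariff.

DAYS_PER_YEAR = 365  # assumption: the server follows the schedule every day

def annual_kwh(wall_watts, hours_per_day=12):
    """Energy used per year in kWh for a constant wall draw."""
    return wall_watts / 1000 * hours_per_day * DAYS_PER_YEAR

def annual_cost_aed(wall_watts, rate_aed_per_kwh, hours_per_day=12):
    """Yearly electricity cost in AED at a flat effective rate."""
    return annual_kwh(wall_watts, hours_per_day) * rate_aed_per_kwh

for watts in (900, 1500):
    kwh = annual_kwh(watts)
    low = annual_cost_aed(watts, 0.40)
    high = annual_cost_aed(watts, 0.43)
    print(f"{watts} W, 12 h/day: {kwh:,.0f} kWh/yr, AED {low:,.0f}-{high:,.0f}")

# Running 24/7 doubles the hours, and therefore the bill.
print(f"1500 W, 24 h/day: AED {annual_cost_aed(1500, 0.43, hours_per_day=24):,.0f}")
```

The 900 W and 1,500 W cases reproduce the 3,942 and 6,570 kWh figures, and the resulting costs land roughly in line with the AED 1,700–2,800 range once the tariff is taken near the top of its band.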
The Maintenance Calendar Vendors Forget to Show You
Ollama is the most common local inference server for clinic deployments. It has shipped roughly 587 releases over three years, about one every two days, and as of June 2026 it sits at v0.30.10. At that pace, model format support, API behaviour, and CUDA compatibility shift constantly. vLLM, the alternative favoured for higher throughput, ships a new minor version roughly every four to six weeks: v0.11.0 in October 2025, v0.12.0 and v0.13.0 in December 2025, v0.14.0 in January 2026. Any one of them can break a quantization backend or change how the server handles multi-turn context.
And that is just the inference layer. The full stack also carries NVIDIA driver patches tied to CUDA releases, OS-level hardening for a machine sitting on your clinic's internal network, model weight updates when a new fine-tune improves clinical NLP accuracy, and the monitoring pipeline that tells you when the GPU is thermal-throttling at 3 PM on a Dubai summer afternoon.
Industry estimates put the upkeep of a single-node LLM deployment at 0.5–1.0 hours of hands-on work per week. For a clinic, that means roughly 2–4 hours per month of billable IT time if you outsource it, or invisible load on your existing technical staff if you do not. Neither is free. Neither appears on the GPU cost sheet.
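One inexpensive guard against surprise upgrades is to pin the version you validated and flag any drift. The sketch below is one possible approach, not a recommended tool: it assumes Ollama is listening on its default local port and reads the running version from the server's `/api/version` endpoint. The helper names and the drift categories are illustrative.

```python
# Flag drift between the Ollama version you validated and the one running.
import json
import urllib.request

def parse_version(text):
    """Turn '0.30.10', 'v0.30.10', or '0.30.10-rc1' into a tuple of ints."""
    core = text.lstrip("v").split("-")[0]
    return tuple(int(part) for part in core.split("."))

def running_version(base_url="http://localhost:11434"):
    """Ask the local Ollama server for its version (assumes the default port)."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return json.load(resp)["version"]

def drift(pinned, running):
    """Classify how far the running version is from the pinned one."""
    p, r = parse_version(pinned), parse_version(running)
    if r == p:
        return "match"
    if r[:2] == p[:2]:
        return "patch-drift"           # same major.minor, different patch
    return "minor-or-major-drift"      # re-run integration tests before use

# Example with hard-coded strings; in a scheduled job you would call
# drift(PINNED, running_version()) and alert on anything other than "match".
print(drift("0.30.10", "0.30.10"))
print(drift("0.30.10", "0.30.12"))
print(drift("0.30.10", "0.31.0"))
```

Run from a daily scheduled task, a check like this turns a silent upgrade into a ticket someone sees before the triage chatbot starts misbehaving.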
PDPL, NABIDH, and Why On-Premise Is the Compliance Argument, Not the Cost Argument
Here is the honest framing for a UAE clinic. On-premise LLM deployment is a PDPL compliance strategy, not a cost-reduction strategy at low utilisation. Patient data is health data under the Federal Personal Data Protection Law, which carries heightened processing obligations, consent documentation requirements, and data localisation pressure. Run your LLM on a server inside the clinic and patient records never transit to a third-party cloud inference endpoint. That is a clean architectural answer to a regulator asking where your data went.
The rest of the regulatory picture reinforces it. NABIDH, the Dubai Health Authority's Health Information Exchange, imposes additional data governance obligations on DHA-licensed facilities, and on-premise processing sidesteps the contractual complexity of qualifying a cloud AI vendor as a NABIDH-compliant data processor. For DIFC-registered entities specifically, DIFC Regulation 10 on AI impact assessments has been in full enforcement since January 2026, requiring documented assessments for high-risk AI use. A locally hosted model with no external API calls makes that documentation far simpler.
So the regulatory case for on-premise is strong. Where it falls apart is as a cost play. Query the model fewer than a few hundred times a day and a well-negotiated Azure or AWS API contract, with contractual in-region data residency clauses, is cheaper per query. The compliance argument weakens at that point too, though it does not disappear. Know which problem you are actually solving before you order the hardware.
The Decision Framework: When the Numbers Actually Work
Take a single RTX 4090 build for a DHA clinic handling 8B-class models. Roughly AED 10,500 for the GPU (the host machine is extra if you do not already have one), AED 1,700–2,800 in annual electricity, and AED 3,000–6,000 a year in IT maintenance at market rates. Call it AED 15,000–19,000 in year one, dropping to AED 5,000–9,000 a year from year two onward. At 300–500 daily queries, a realistic triage-assistant volume for a mid-size clinic, those unit economics beat cloud API pricing while keeping data fully on-premises.
The dual-A6000 build for 70B capabilities is a different conversation. AED 121,000–140,000 in hardware alone, the same power and maintenance costs on top, and a payback period that needs either very high query volume or a strong compliance mandate from your DHA licensing body. The A100 path only makes sense for a multi-clinic network sharing one inference endpoint, or for processing large batches of medical imaging reports overnight.
Which is why the first question is never which GPU to buy. It is what model size your clinical use case actually requires. For most UAE clinic deployments, an 8B model fine-tuned on clinical Arabic and English handles 80% of the workflow at a quarter of the infrastructure cost. Start there. Validate query volume. Then scale the hardware to what the data tells you, not what a vendor's benchmark deck suggests.
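The single-4090 numbers can be rolled into a five-year view and a cost per query. The sketch below sums the line items quoted in this section. The 365 operating days per year and the cloud price per query are assumptions: the cloud figure is a labeled placeholder, not a quoted price, so substitute your own contract rate.

```python
# Five-year cost of ownership and per-query cost for the single-4090 build.

def yearly_costs(hardware, electricity, maintenance, years=5):
    """Per-year costs: hardware lands in year one, the rest recurs annually."""
    recurring = electricity + maintenance
    return [hardware + recurring] + [recurring] * (years - 1)

def cost_per_query(total_cost, queries_per_day, years=5, days_per_year=365):
    """Average cost per query over the whole period."""
    return total_cost / (queries_per_day * days_per_year * years)

def break_even_queries_per_day(total_cost, cloud_price, years=5, days_per_year=365):
    """Daily volume at which on-premise spend equals cloud spend."""
    return total_cost / (cloud_price * days_per_year * years)

low = yearly_costs(10_500, 1_700, 3_000)    # cheapest case from the text
high = yearly_costs(10_500, 2_800, 6_000)   # most expensive case from the text

print("Year one:", low[0], "to", high[0], "AED")
print("Year two onward:", low[1], "to", high[1], "AED")
print("Five-year total:", sum(low), "to", sum(high), "AED")
print(f"Best case per query:  AED {cost_per_query(sum(low), 500):.3f}")
print(f"Worst case per query: AED {cost_per_query(sum(high), 300):.3f}")

# PLACEHOLDER: hypothetical cloud price per query, not a vendor quote.
PLACEHOLDER_CLOUD_PRICE_AED = 0.10
volume = break_even_queries_per_day(sum(high), PLACEHOLDER_CLOUD_PRICE_AED)
print(f"Break-even volume at the placeholder price: {volume:.0f} queries/day")
```

The break-even volume moves in direct proportion to the cloud price you plug in, so this calculation is only as good as the quote behind it.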
Have questions about your setup?
We help UAE small and medium-sized businesses build AI systems that are compliant, local, and actually effective. The first conversation is free.