Secure On-Premise AI Infrastructure

A private, air-gapped LLM infrastructure inside the customer data centre — vLLM serving on NVIDIA A100s — so every AI workload (RAG, summarisation, internal copilots) runs without sending data to public AI APIs.

01. Challenge

The customer industrial data — equipment telemetry, technical documentation, internal procedures — could not legally or commercially be sent to public AI APIs.

At the same time, multiple internal teams wanted to ship RAG assistants, summarisers and copilots. Without a shared platform, every team would have rebuilt the same serving stack from scratch.

02. Solution

A shared LLM-serving platform on NVIDIA A100 GPUs running vLLM behind an OpenAI-compatible API gateway.

Internal teams consume it like any LLM provider, but every byte stays inside the corporate network. Quota, audit logging and model routing are handled at the gateway so teams do not reinvent infrastructure.

03. Results

100% on-premData residency
Zero data leaves the corporate network
Multi-teamShared infra
Single GPU pool powers multiple internal AI applications
FullAudit
Every prompt and response logged for compliance review

04. Constraints

Air-gapped data centre — no outbound internet access
Multi-tenant: several internal AI applications share the same GPU pool
Must support both batch and low-latency interactive workloads
Compliance: full audit log of every prompt and response

05. Architecture

vLLM serves multiple open-weight models (Llama 3 70B, Mixtral) on a pool of A100 GPUs.

An OpenAI-compatible API gateway handles authentication, per-team quotas, model routing and audit logging.

Storage and observability run on the customer existing on-prem Kubernetes and Prometheus/Grafana stack. The whole environment is air-gapped with controlled artifact ingress for model and dependency updates.

06. Tech Stack

vLLMNVIDIA A100CUDALlama 3 70BMixtralOpenAI-compatible API gatewayKubernetesNVIDIA GPU OperatorPostgreSQLPrometheusGrafana

Portfolio Details