
Models as a Service: Run AI Like Your Own Cloud

March 26, 2026 · 4 min read · 765 words
IBM · AI Security · AI Agents · Generative AI · Open Source
Diagram showing Models as a Service architecture with API gateway, AI platform, and infrastructure layers
Image: Screenshot from YouTube.

Key insights

  • MaaS shifts AI from renting to owning: instead of paying tokens to third-party APIs, you run your own AI infrastructure with full control over which models are deployed and when.
  • Model deprecation is a hidden risk of public APIs: when a provider retires a model version, your applications can break overnight, forcing urgent rewrites.
  • Air-gapped AI is now possible: healthcare and finance organizations can run RAG and agentic AI entirely on-premise, with no data ever leaving their environment.
Source: YouTube
Published March 24, 2026
IBM Technology
Host: Cedric Clyburn

This is an AI-generated summary. The source video may include demos, visuals and additional context.


In Brief

Every time developers use ChatGPT or another public AI API, they send both their data and their money to a third-party server they don't control. Cedric Clyburn, a Senior Developer Advocate at Red Hat presenting on the IBM Technology channel, explains how IBM and the broader industry are adopting a pattern called Models as a Service (MaaS): organizations run their own AI models behind a shared API gateway, keeping full control over cost, privacy, and which models are available.

What is Models as a Service?

Think of it like SaaS (Software as a Service), the model behind Gmail or Dropbox, where a provider hosts software and you access it over the internet, but applied to AI models. MaaS serves multiple models through one API, whether language or vision models, with billing transparency, data privacy controls, and usage tracking all built in. The IT team manages the infrastructure; developers and end users consume the models through a standard interface.

The key difference from public APIs: instead of renting from OpenAI or Google and paying per token (a unit of text the AI processes), you own and operate the infrastructure yourself. You decide which models are available, who can access them, and what happens when something changes.
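Because serving engines such as vLLM expose an OpenAI-compatible endpoint, moving an application from a public provider to an in-house MaaS gateway can be little more than a base-URL change. A minimal sketch of the shared request format; the internal hostname and model name are hypothetical:

```python
import json

# Hypothetical endpoints: the in-house gateway speaks the same
# OpenAI-compatible protocol as the public API, so only the base URL
# (and the credential) changes when an app moves in-house.
PUBLIC_URL = "https://api.openai.com/v1/chat/completions"
MAAS_URL = "https://maas.internal.example.com/v1/chat/completions"  # hypothetical

def chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload; works against either URL."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = chat_request("granite-8b-instruct", "Summarize MaaS in one sentence.")
print(json.dumps(payload))
```

The application code stays identical; only configuration decides whether requests leave the building.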

The hidden problem with public APIs

Public AI providers retire old model versions regularly, sometimes with little notice. When a provider deprecates a model, say version 5 gets replaced by version 6, any application built on version 5 may break. Prompt templates (the instructions developers write to shape AI behavior) that worked perfectly on the old version might produce different results on the new one. That means developers have to stop what they're doing and fix their applications.

With MaaS, your organization controls the model lifecycle. You choose when to upgrade, you test the new version before deploying it, and you can keep running the old version until everything is ready. If you find a better open-source model on Hugging Face (a platform for sharing AI models), your IT team can add it to the MaaS infrastructure and manage it alongside the rest.
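One common way to make that lifecycle control concrete, sketched here under assumed naming conventions, is server-side model aliasing: applications request a stable alias, and the platform team decides which pinned version it resolves to.

```python
# Applications never hard-code a model version; they ask for an alias.
# Upgrading is a one-entry change on the server, and the old version
# stays deployed until every team has migrated. Names are hypothetical.
MODEL_ALIASES = {
    "chat-default": "llama-3.1-8b-instruct-v2",  # current production pin
    "chat-legacy": "llama-3.1-8b-instruct-v1",   # kept for teams mid-migration
}

def resolve(alias: str) -> str:
    """Map a stable application-facing alias to a concrete deployed model."""
    try:
        return MODEL_ALIASES[alias]
    except KeyError:
        raise ValueError(f"unknown model alias: {alias}")
```

With this indirection, a provider-style "version 5 to version 6" switch becomes a scheduled config change rather than an overnight breakage.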

The privacy argument

For healthcare, finance, and other regulated industries, sending data to a third-party AI provider is not just risky. It can be illegal. Patient records and financial information are covered by strict regulations that limit where data can go and who can see it.

With MaaS, organizations can run a fully air-gapped environment, meaning the entire system is cut off from the public internet. The AI models, the databases, the applications, and the hardware all run on-premise or in a private hybrid cloud (a mix of on-site servers and controlled cloud infrastructure). No data ever leaves the building. Yet the organization still gets the full benefits of modern AI: retrieval-augmented generation (RAG, where the AI looks up documents from your own data before answering) and agentic AI (AI that can take actions like calling internal databases or APIs).
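The RAG loop itself needs nothing from the public internet. The toy sketch below keeps both the documents and the retrieval step in-process, using word overlap where a real deployment would use embeddings and a vector database:

```python
# A toy, fully local RAG flow: retrieve the most relevant document from
# an in-memory store, then prepend it to the prompt sent to the
# on-premise model. Documents are illustrative.
DOCS = [
    "Patient intake forms must be retained for seven years.",
    "GPU quotas are reviewed by the platform team each quarter.",
]

def retrieve(question: str) -> str:
    """Pick the document sharing the most words with the question."""
    q = set(question.lower().split())
    return max(DOCS, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(question: str) -> str:
    """Assemble the augmented prompt the local model would receive."""
    return f"Context: {retrieve(question)}\n\nQuestion: {question}"
```

Every step, from document store to prompt assembly to inference, can run on hardware the organization controls.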

How the architecture works

The MaaS stack has three layers:

Infrastructure layer. Tools like Kubernetes or OpenShift (open-source platforms for managing software workloads) tie together on-premise servers, cloud providers, and edge environments. GPUs (graphics processing units, the chips that power AI workloads) can be pooled and scaled dynamically as demand changes.

AI platform layer. On top of Kubernetes sits an AI-specific layer. Tools like vLLM (an open-source engine for running large language models efficiently) and KServe (a tool that manages AI models like independent microservices, small self-contained units of software) handle model serving. Each model runs as its own service, which means you can scale, update, or replace individual models without affecting the rest.
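The "each model is its own service" idea can be sketched as a registry of independently deployed endpoints, so replacing one model's serving backend touches a single entry; the service URLs and model names below are hypothetical:

```python
# Each model is served behind its own in-cluster endpoint. Swapping the
# backend for one model (e.g. after a serving-engine upgrade) leaves
# every other entry, and every other model, untouched.
REGISTRY = {
    "granite-8b": "http://granite-8b.models.svc:8080/v1",
    "llava-vision": "http://llava.models.svc:8080/v1",
}

def endpoint_for(model: str) -> str:
    """Look up the serving endpoint for a deployed model."""
    return REGISTRY[model]

# Rolling one model to a new backend is a single-entry update:
REGISTRY["granite-8b"] = "http://granite-8b-v2.models.svc:8080/v1"
```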

API gateway layer. This is the front door for all AI access inside the organization. The gateway handles authentication (making sure only authorized users and apps can connect), rate limiting (preventing any one team from monopolizing GPU resources), and usage tracking. It also collects observability data using open-source tools like Prometheus, Grafana, and Jaeger, so teams can see exactly what the AI is doing, why it made a decision, and where something went wrong.
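The gateway's rate-limiting role can be illustrated with a token bucket, a common algorithm for this job (the video names the responsibility, not the algorithm; the capacities here are illustrative):

```python
import time

class TokenBucket:
    """Per-team request budget that refills over time, so no single team
    can monopolize the shared GPU pool behind the gateway."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A team with a budget of 2 requests (no refill, for the demo):
bucket = TokenBucket(capacity=2, refill_per_sec=0.0)
results = [bucket.allow() for _ in range(3)]
print(results)  # → [True, True, False]
```

In production the gateway keys one bucket per team or API token and returns an HTTP 429 when `allow()` fails.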

Why this matters

MaaS is quickly becoming the standard for in-house and sovereign AI infrastructure, meaning AI that a country or organization fully controls without depending on foreign cloud providers. Organizations that adopt it get cost transparency, data privacy, and stable model access. Those that don't remain exposed to the risks of model deprecation, vendor lock-in, and data leaving environments they can't audit.


Glossary

  • Models as a Service (MaaS): A pattern where an organization runs its own AI models behind a shared API gateway, like having a private AI cloud instead of renting from OpenAI or Google.
  • RAG (Retrieval-Augmented Generation): A technique where the AI looks up relevant documents from your own data before answering, so it can give accurate answers based on private information.
  • Agentic AI: AI systems that can take actions on their own, like calling databases or APIs, rather than just generating text.
  • API gateway: A front door that sits between users and AI models, handling authentication, rate limiting, billing, and tracking who uses what.
  • Air-gapped: A system completely disconnected from the public internet, used in sensitive environments to prevent any data from leaking out.
  • vLLM: An open-source engine for running large language models efficiently on GPUs.
  • KServe: An open-source tool that manages AI models on Kubernetes, treating each model like an independent microservice.

Sources and resources
