
How Multimodal AI Actually Works

April 6, 2026 · 4 min read · 891 words
IBM · Generative AI · AI Research · Machine Learning
[Image: Martin Keen explaining multimodal AI with diagrams on a whiteboard. Screenshot from YouTube.]

Key insights

  • Shared vector spaces let AI reason across senses simultaneously instead of translating between separate systems
  • Feature-level fusion still dominates enterprise AI because it's cheaper and modular, even though it loses information
  • Video understanding requires time as a dimension, not just more frames. Motion gets baked into the token itself.
  • Any-to-any generation means AI can take in any modality and respond in any other, all from the same shared space
IBM Technology
Host: Martin Keen

This is an AI-generated summary. The source video may include demos, visuals and additional context.


In Brief

Multimodal AI is any AI model that works with more than one type of data at once: text, images, audio, video, and more. Martin Keen, Senior Inventor at IBM, explains how this actually works under the hood: the older modular approach, native multimodality where everything shares the same mathematical space, and models that can generate video from a text prompt.

What is a modality?

Before any of this makes sense, you need to know one word: modality. It just means a type of data. Text is one modality. Images are another. Audio, video, thermal imaging, and LIDAR (laser-based distance scanning) are all modalities too.

Keen makes it simple: "when we say modal, we are talking about data." AI models that only work with text are single-modality models. A multimodal model can handle several types at once. That's it. The word sounds fancy, but the idea is straightforward.

The older approach: two models duct-taped together

The first way engineers built multimodal AI was to take an existing text model and bolt a second model onto it. The second model, called a vision encoder, would look at an image and translate it into a list of numbers: a numerical summary the text model could process.

Think of it like having a friend describe a painting to you over the phone. You get the gist, but you're working from their description, not the actual painting. Some information always gets lost in that handoff.

As Keen puts it, "the LLM is essentially only seeing a summarized description of the data, instead of the raw signal." An LLM (large language model) is the core AI model that processes text. In feature-level fusion, the LLM never sees the actual image — only numbers extracted from it.

This approach is called feature-level fusion, and it's still widely used in enterprise AI today. Why? Because it's cheaper to build and easier to maintain. You can swap out the vision encoder without rebuilding the whole system. It's the practical choice, even if it's not the best one.
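The loss in that handoff is easy to see in a toy sketch. Everything below is purely illustrative (`vision_encoder` and `text_model` are made-up names, not any real system's API): the text model receives only whatever summary the encoder chose to keep.

```python
# Toy sketch of feature-level fusion: the "LLM" never sees the image,
# only a fixed-size numerical summary produced by a separate encoder.

def vision_encoder(image):
    """Compress a grayscale image (list of pixel rows) into a few numbers."""
    pixels = [p for row in image for p in row]
    return [
        sum(pixels) / len(pixels),   # average brightness
        max(pixels) - min(pixels),   # contrast
        len(image),                  # height
        len(image[0]),               # width
    ]

def text_model(prompt, image_features):
    # The text model works only from the summary vector; fine-grained
    # detail (e.g. *where* the bright region is) is already gone.
    return f"{prompt} -> features {image_features}"

image = [[0, 10, 200], [5, 15, 220], [0, 20, 255]]
features = vision_encoder(image)
print(text_model("Describe this image.", features))
```

Note that the encoder here was written before any question was asked: whatever it discards can never be recovered downstream, which is exactly the limitation the article describes.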

Native multimodality: everything in the same room

The better approach is called native multimodality. Instead of two separate models passing notes to each other, everything gets processed in one shared mathematical space called a shared vector space.

Here's the core idea: in a regular text model, every word gets converted into a point in a giant mathematical space. The word "cat" becomes a specific point. "Dog" becomes a nearby point. Words with similar meanings cluster together. This is called an embedding, a way of representing meaning as a location in space.

With native multimodality, images go through the same process. An image is chopped into small tiles (called patches), and each patch gets its own point in that same space. Same with audio. Everything lives in the same mathematical neighborhood, so the model can reason about all of it at once with no translation needed.

The cat analogy Keen uses makes this click: if you drop a picture of a cat into this shared space, it lands close to the word "cat" because they mean similar things. The model doesn't need to "translate" the image into text — it already speaks the same language.
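A toy illustration of that neighborhood, with hand-picked 3-D vectors standing in for learned embeddings (real models learn vectors with hundreds or thousands of dimensions; the values and names here are invented for the example):

```python
import math

# Hand-picked points in a tiny "shared vector space": text tokens and an
# image-patch embedding all live in the same coordinate system.
embeddings = {
    "cat":       [0.90, 0.10, 0.00],  # text token
    "dog":       [0.80, 0.30, 0.00],  # text token, near "cat"
    "cat_photo": [0.88, 0.12, 0.05],  # image patch embedding
    "stock_tip": [0.00, 0.10, 0.95],  # unrelated text
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The picture of a cat lands closest to the word "cat"...
print(cosine(embeddings["cat"], embeddings["cat_photo"]))  # close to 1.0
# ...and far from unrelated text.
print(cosine(embeddings["cat"], embeddings["stock_tip"]))  # near 0
```

Because similarity is just geometry, the model can compare an image patch to a word the same way it compares two words, with no translation step in between.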

This also solves a real problem with the older approach. With feature-level fusion, "the vision encoder processes your image before it knows what question you're asking." It might discard exactly the detail you needed. With a shared vector space, the model looks at your question and the image at the same time, so it knows where to focus.

Video: when time is part of the data

Images are two-dimensional: width and height. But video adds a third dimension: time. This is where things get interesting.

Early multimodal systems handled video by grabbing a handful of individual frames and running each one through the vision encoder. Fast and cheap, but it throws away motion entirely. Keen's example makes the problem obvious: "show me a single frame of somebody holding a water bottle, and I can tell you that there is a person and a water bottle, but I can't tell you if they're putting it down or if they're picking it up."

That information lives in the sequence of frames, not any single one.

Newer models solve this with spatiotemporal patches: instead of flat 2D image tiles, they use small 3D cubes that capture a small area of the screen across a short window of time, say 8 frames at once. Motion gets baked into the token itself, so the model doesn't have to guess what happened between two frames. It just sees it.
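The cutting step can be sketched in a few lines (toy sizes, plain Python lists instead of real tensors; a production model would use 3-D convolutions over much larger cubes):

```python
# Toy spatiotemporal patching: slice a tiny video (time x height x width)
# into 3-D cubes so each token spans several frames, not just one.

def make_video(t, h, w):
    # Bake the frame index into each pixel value so motion across
    # frames is visible inside a single cube.
    return [[[ti * 100 + y * 10 + x for x in range(w)]
             for y in range(h)]
            for ti in range(t)]

def tubelets(video, t=2, p=2):
    """Cut video into cubes of t frames by p x p pixels."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    cubes = []
    for ti in range(0, T, t):
        for y in range(0, H, p):
            for x in range(0, W, p):
                cube = [[row[x:x + p] for row in video[f][y:y + p]]
                        for f in range(ti, ti + t)]
                cubes.append(cube)
    return cubes

video = make_video(4, 4, 4)      # 4 frames of 4x4 pixels
patches = tubelets(video)        # 2 time steps x 2 rows x 2 cols = 8 cubes
print(len(patches))
```

Each cube contains the same screen region across consecutive frames, so the change between frames, i.e. the motion, is part of the token's raw data rather than something inferred afterward.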

Any-to-any: respond in whatever format fits

The final piece is output. Most of this article has covered what goes into a multimodal model. But because everything lives in the same shared vector space, the model can also output across modalities.

Any modality in, any modality out. Keen's example: ask a model how to tie a tie, and it could respond with a few sentences of text and then generate a short video clip showing the steps, because both live in the same space and can be generated from it.

This is what "any-to-any generation" means, and it's what makes native multimodal models fundamentally different from older systems. It's not just about understanding more types of input. It's about being able to respond in whatever format actually helps.

Glossary

  • Modality: A type of data. Text, images, audio, and video are all different modalities.
  • LLM (large language model): The core AI model that processes and generates text.
  • Vision encoder: A separate model that converts images into numbers a text model can process.
  • Feature-level fusion: The approach of combining a vision encoder and a text model, passing numerical summaries between them.
  • Shared vector space: A single mathematical space where all data types are represented as points, so the model can reason about them together.
  • Embedding: The process of converting a word, image patch, or audio chunk into a point in a vector space.
  • Patch: A small tile of an image (or a 3D cube of video frames) that gets its own embedding in the vector space.
  • Spatiotemporal patch: A 3D video token that captures pixels across both space and time. Motion is baked in, not inferred.
  • Any-to-any generation: The ability to take in any combination of modalities and output any combination: text in, video out, for example.
