
How Google's Nano Banana Taught AI to Look Like You

March 29, 2026 · 6 min read · 1,176 words
Google · Gemini · Generative AI · AI Research
Nicole Brichtova and Hansa Srinivasan, the product and engineering leads behind Google's Nano Banana image model
Image: Screenshot from YouTube.

Key insights

  • Character consistency is not just about scale. It required obsessive data curation and team members dedicated to single problems like text rendering. Craft matters in AI, not just compute.
  • Human evaluation is foundational for visual AI. Benchmarks can say '10% better' but miss what makes a model feel right. You can only judge face consistency on faces you know, including your own.
  • Fun became a deliberate entry point to utility. People came for figurines and red carpet selfies, stayed for photo editing, sketch notes, and learning tools. Google's parents and aunts are now using Gemini.
  • The chat interface is becoming a bottleneck for visual creation. Specialized UIs that match user intent are the next competitive frontier, for Google and for startups.
Source: YouTube
Published November 11, 2025
Sequoia Capital
Hosts: Stephanie Zhan, Pat Grady
Guests: Nicole Brichtova, Hansa Srinivasan (Google DeepMind)

This is an AI-generated summary. The source video may include demos, visuals and additional context.


In Brief

When Google DeepMind launched Nano Banana, people could suddenly put themselves on the red carpet, turn into a 3D figurine, or star in their own movie poster. And it actually looked like them. In this episode from Sequoia Capital, Nicole Brichtova (Product Lead) and Hansa Srinivasan (Engineering Lead) explain how they achieved something no other model had: images that actually look like you, from a single photo. The recipe? Obsessive data quality, Gemini's ability to hold lots of context at once, and real humans judging the results. They also talk about the accidental name, why fun is a serious product strategy, and what AI-powered learning could look like in one to three years.

The problem nobody had solved

Every image model before Nano Banana had the same flaw: the generated person didn't quite look like you. Maybe close. But not actually you.

Nicole Brichtova had her own aha moment during an internal demo. She took a photo of herself, typed a simple prompt ("put me on the red carpet") and compared the result against every previous model. "It looked like me," she said. "No other model actually looked like me."

That sounds simple. But character consistency (making an AI-generated person look the same across different scenes, angles, and styles) is surprisingly hard. And the reason is also what makes it hard to measure.

"You can really only judge face consistency on yourself," Nicole explained. If someone shows you an AI version of a stranger, you might think it looks fine. But if you see an AI version of yourself, you'll immediately notice if something is slightly off. The team started running evaluations on their own faces for exactly this reason.

Why scale alone doesn't explain it

Character consistency doesn't just appear when you make a bigger model. The team had a clear goal from the start, and they had a specific recipe to reach it.

Three things made the difference:

Good data that teaches generalization. The model needed to understand what makes a person look like themselves, not just memorize patterns. High-quality data was, as Hansa put it, "the secret sauce."

Gemini's multimodal context window. Because Nano Banana is built on Gemini (Google's main AI model that handles text, images, and more in one system), the model can hold a large amount of information in memory at once. Users can provide multiple images, iterate across turns, and have a real conversation with the model. Two years ago, getting an AI to render your face accurately required 10 photos and 20 minutes of fine-tuning. Now it works from one image.

Obsession with specific problems. The team found that most improvements came down to one thing: "there's a person on the team who's obsessed with making them work." Text rendering kept getting better because someone on the team was personally obsessed with text rendering. That attention to craft, not just raw computing power, was what made the difference.

Human evals beat benchmarks

For visual AI, numbers don't tell the whole story. You can measure that a new model is "10% better" than the previous one. But that number won't capture whether someone feels seen when they look at the output.

"Human evals have been a big game changer for us," Hansa said. The team uses internal testing where team members, artists at Google and Google DeepMind, and even executives play with the models and share what they think. This kind of feedback (how a model makes someone feel) can't easily be automated.

The contrast with math or logic is sharp: for those, you can check if an answer is right or wrong. For image quality, aesthetics, and whether a face looks like a specific person, you need human eyes.

The name was a 2 AM accident

The story behind the name is exactly what it looks like: someone was tired, it was late, and the name just came out.

At 2 AM, during a deadline crunch before releasing the model on Arena (a platform where anonymous AI models compete and users vote on which is better), one of the PMs on Nicole's team got a message asking what to name it. She didn't overthink it. Nano Banana was the result.

Once it went public, the name worked. It was easy to pronounce, emoji-friendly, and felt "very googly," as the team put it. The accident turned into a marketing phenomenon, and Google eventually added banana emojis throughout the Gemini app to help users find the model. (Many users knew they wanted "Nano Banana" but didn't realize it was just Gemini with an image prompt.)

Fun as a gateway to utility

The figurines and red carpet selfies were more than just viral moments.

Hansa's mother started using Gemini to make fun images. Then she discovered she could remove people from the background of old photos. Then she started using it for everyday tasks. That progression, from playful to practical, is exactly what the team hoped for.

"It's a really nice path to fun being a gateway to utility," Nicole said. People came in for figurines and stayed to learn, solve math problems, and create sketch notes of complex topics.

One user told the team they had used Nano Banana to generate visual sketch notes of their father's university chemistry lectures. For the first time in years, they could have a real conversation about his work, because they finally had a visual way in.

Today, roughly 95% of AI output is still text. The team sees that as unfinished business. Most humans don't primarily learn through text alone.

What's next

For consumers, the priority is getting past the prompt engineering phase. Users currently copy and paste hundred-word prompts to get the best results, which is too much friction for most people.

For professionals, the model needs to be consistent 100% of the time, not just most of it. Advertisers and designers need pixel-level precision.

For learning, Nicole's vision for one to three years is clear: "personalized tutors, personalized textbooks." There is no reason two people with different learning styles should learn from the same textbook. Visual AI could change that.

For startups, the advice is to stop building more chatbots. The chat interface works well as an entry point, but it's becoming a bottleneck for visual creation. The next competitive frontier is specialized UIs that match what users actually need to do: designing a room, creating a presentation, or building a learning tool.

Safety: visible and invisible

Every image Nano Banana generates carries two watermarks. One is visible: a "Generated with Gemini" label on the output. The other is SynthID, Google's invisible watermarking technology that can verify whether a piece of content was AI-generated, even after the image has been shared or modified.

SynthID applies to all Google model outputs: images, video, and audio. The team frames it as a tool that lets them release powerful capabilities while maintaining a way to detect and combat misinformation.


Glossary

Character consistency: When an AI-generated person looks the same across different images and scenes: same face, skin tone, and proportions. Hard to achieve and even harder to measure.

Multimodal: An AI model that understands and generates multiple types of content (text, images, video, audio) in the same system. Gemini is a multimodal model.

Human evals: Having real people judge AI output instead of relying only on automated scores. Especially important for visual quality, where "better" is hard to quantify.

SynthID: Google's invisible watermarking technology for AI-generated content. Lets you verify whether something is AI-made without any visible mark on the image.

Arena (Chatbot Arena): A platform where anonymous AI models compete against each other. Users see two outputs and vote on which is better, without knowing which model made which.
