NVIDIA's Nemotron 3 Super Is Free, Open and Fast

Key insights
- NVIDIA released not just the model but a 51-page technical report detailing every step of training, including the dataset. Full transparency at this scale is rare and sets a new benchmark for open AI development.
- Speed is the real headline. Matching the best closed models from 18 months ago is impressive. Being 7x faster than comparable open models while doing so is a different category of achievement.
- Four technical optimizations (compression, parallel text generation, smarter memory usage, and error correction) work together to deliver the speed gains. None is new on its own; the achievement is making all four work together without accuracy loss.
- Jensen Huang's investment of tens of billions in open AI is a business strategy, not just generosity. Every popular open model drives demand for the NVIDIA hardware needed to run it.
This is an AI-generated summary. The source video may include demos, visuals and additional context.
In Brief
NVIDIA just released Nemotron 3 Super, a 120-billion-parameter AI assistant that is free for everyone, forever. What makes it unusual is not just the model itself. They also published a 51-page technical report explaining every step of how it was built, including the training data. Dr. Károly Zsolnai-Fehér of Two Minute Papers breaks down why this matters and why the speed numbers are the real story.
Full transparency, not just free code
Most AI systems are closed and owned by the company that built them. You pay a subscription, and nobody tells you what data the model was trained on, how it was built, or what decisions shaped it. Open models often improve on this, but even many "open" releases are incomplete: the weights are shared while the training recipe is kept private.
Nemotron 3 Super is different. NVIDIA didn't just release the model. They released what Dr. Zsolnai-Fehér describes as the holy bible of creating such a system: a full technical report with every step documented and the training data disclosed. That kind of transparency at this scale is genuinely unusual. It means researchers and developers can learn from it, verify it, build on it, and improve it.
The scale itself is worth noting: 25 trillion tokens (roughly words or word-pieces) of training data fed into a model with 120 billion parameters (the internal values a model learns during training; more parameters generally means greater capability). The result roughly matches the best closed top-tier models (the most advanced AI systems available at the time) from about 18 months ago, models that cost billions of dollars to build and kept every detail secret. Now you can simply download this one.
The speed numbers change the equation
The benchmark scores (standardized tests used to compare AI models) show Nemotron 3 Super near the top of the open model rankings across most tests. That alone would be a solid result. But there is a second story in the data.
NVIDIA released two variants: BF16 (the standard high-precision format) and NVFP4 (their compressed format). They perform at roughly the same accuracy level. But the NVFP4 version is 3.5x faster than NVIDIA's previous comparable model, and up to 7x faster than similarly capable open models. The story is not just the "similarly smart" part. The story is 7x faster while being similarly smart.
That changes the economics of running AI. Faster inference (the process of running a model to get answers) means lower cost per query, which means more people and organizations can afford to deploy it.
Four engineering tricks behind the speed
How do you make a 120-billion-parameter model run 7x faster without losing accuracy? NVIDIA used four techniques in combination. None is entirely new on its own; the achievement is making all four work together cleanly.
NVFP4 quantization is the most visible one. Quantization means compressing the math an AI uses by rounding off digits in calculations, so it runs faster with less memory. Normally, this causes accuracy to degrade fast: round too aggressively and the model outputs nonsense. NVIDIA's approach applies rounding selectively, only where it causes no meaningful harm, leaving the sensitive calculations intact. The result is dramatically less computation with no significant accuracy loss.
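The idea of selective rounding can be sketched in a few lines. This is a toy illustration of the general principle, not NVIDIA's actual NVFP4 format: most values are snapped to a small set of levels (as a 4-bit format would allow), while the largest-magnitude "outlier" values, which are assumed here to be the sensitive ones, are left at full precision.

```python
import numpy as np

def quantize_selective(weights, levels=16, outlier_pct=1.0):
    """Toy selective quantization: round most weights to a coarse grid
    (like a 4-bit format), but keep the largest-magnitude 'outliers'
    in full precision. Illustrative only -- not NVIDIA's NVFP4 scheme."""
    w = np.asarray(weights, dtype=np.float32)
    cutoff = np.percentile(np.abs(w), 100 - outlier_pct)
    outliers = np.abs(w) >= cutoff            # sensitive values: leave intact
    scale = np.abs(w[~outliers]).max() / (levels / 2 - 1)
    q = np.round(w / scale) * scale           # coarse rounding for the rest
    return np.where(outliers, w, q)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
wq = quantize_selective(w)
err = np.abs(w - wq).mean()
```

After quantization, the bulk of the array takes only a handful of distinct values (cheap to store and compute with), while the mean error stays small because the extreme values were never touched.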
Multi-token prediction changes how the model generates text. Standard AI models write one token (roughly one word) at a time. This model calculates 7 tokens simultaneously, then verifies them in a single step. Writing seven words in the time it used to take to write one is, as Dr. Zsolnai-Fehér puts it, another massive speed-up.
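The propose-then-verify loop can be sketched with toy stand-ins. Here `true_next` plays the expensive "real" model, `draft_fn` is a cheap drafter that usually guesses right, and `verify_fn` checks the whole draft in one pass; all three are hypothetical placeholders, and the scheme below is a generic sketch of the idea rather than Nemotron's implementation.

```python
def true_next(seq):
    # stand-in for the expensive "real" model: next token is last + 1
    return seq[-1] + 1

def draft_fn(seq, k):
    # cheap drafter: usually guesses right, but stumbles on multiples of 5
    last = seq[-1]
    return [0 if (last + i) % 5 == 0 else last + i + 1 for i in range(k)]

def verify_fn(seq, draft):
    # one pass: longest prefix of the draft the real model agrees with
    ctx, n = list(seq), 0
    for t in draft:
        if t != true_next(ctx):
            break
        ctx.append(t)
        n += 1
    return n

def generate(prompt, n_tokens, k=7):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        draft = draft_fn(out, k)                  # propose k tokens at once
        ok = verify_fn(out, draft)                # verify in a single step
        out.extend(draft[:ok] if ok else [true_next(out)])  # always advance
    return out[:len(prompt) + n_tokens]
```

When the drafter is mostly right, each loop iteration accepts several tokens at once, which is where the speed-up comes from; when it is wrong, the verifier falls back to one correct token, so the output is never worse than generating one at a time.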
Mamba layers address memory. Traditional AI systems work like a student who re-reads the entire textbook every time they get a question. Mamba layers change this: read the material once, take compressed notes, keep what matters, discard the filler. The system can process large amounts of context without memory overhead growing proportionally.
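The textbook analogy maps onto a simple contrast in code. The sketch below compares a "re-read everything" summary, which scans the full history for every new token, against a fixed-size recurrent state updated once per token. It illustrates the constant-memory idea behind state-space layers; the actual Mamba architecture is considerably more sophisticated.

```python
def rescan_summary(tokens):
    # "re-read the textbook": scan the whole history for every new token
    return [sum(tokens[:i + 1]) / (i + 1) for i in range(len(tokens))]

def recurrent_summary(tokens, decay=0.9):
    # "take compressed notes": one fixed-size state, updated per token
    state, out = 0.0, []
    for x in tokens:
        state = decay * state + (1 - decay) * x  # weighted gist of the past
        out.append(state)
    return out
```

The first version does work proportional to the history length at every step; the second does constant work per step and keeps only one number of state, no matter how long the context grows.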
Stochastic rounding solves a problem the other three techniques introduce. When you round numbers at each step, small errors accumulate. Over many calculation steps, those small errors compound into something significant, like walking 100 steps toward your car but each step being slightly shorter than it should be: you never quite arrive. Stochastic rounding adds carefully designed random noise so some steps are slightly longer and some slightly shorter, making the errors cancel out across many steps. Precise arrival, every time.
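A minimal sketch shows the compounding problem and the fix. Assuming a non-negative accumulator for simplicity (this is a textbook illustration of stochastic rounding, not NVIDIA's training code): repeatedly adding 0.3 to a running total that is rounded to whole numbers at every step never advances under round-to-nearest, while stochastic rounding lands near the true answer.

```python
import math
import random

def stochastic_round(x, rng):
    # round up with probability equal to the fractional part, so the
    # expected rounding error is zero (assumes x >= 0 for simplicity)
    lo = math.floor(x)
    return lo + (rng.random() < x - lo)

rng = random.Random(0)
nearest_total, stochastic_total = 0, 0
for _ in range(1000):
    # the running total is quantized to whole numbers at every step
    nearest_total = round(nearest_total + 0.3)   # 0.3 always rounds away
    stochastic_total = stochastic_round(stochastic_total + 0.3, rng)
# true answer: 1000 * 0.3 = 300
```

Deterministic rounding loses the 0.3 every single time and finishes at zero, exactly the "never quite arrive" failure from the walking analogy; the stochastic version rounds up about 30% of the time, so the errors cancel and the total ends close to 300.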
Jensen's open-source bet
Jensen Huang at NVIDIA is reportedly investing tens of billions of dollars into fully open AI systems like this. That sounds like extraordinary generosity from one of the most valuable companies in the world.
It also makes complete business sense. Every open model NVIDIA releases creates demand for NVIDIA hardware. Running a 120-billion-parameter model requires serious compute, specifically GPUs (graphics processing units, originally built for graphics and now repurposed for AI). The more people and organizations run powerful open models, the more high-end chips they buy. Open AI is, for NVIDIA, a hardware sales strategy dressed as community contribution.
That does not diminish what Nemotron 3 Super represents for everyone else. Closed AI used to dominate. That is shifting. Consumers and developers now have access to top-tier performance, for free, with a full technical explanation of how it works. Whatever the motivation behind it, that is a genuine change.
Glossary
| Term | Definition |
|---|---|
| Parameters | The internal values an AI model learns during training. More parameters generally means a more capable model. Nemotron 3 Super has 120 billion. |
| Quantization (NVFP4) | Compressing the math an AI uses by rounding off digits in calculations, so it runs faster with less memory. NVFP4 is NVIDIA's approach that applies this selectively to avoid accuracy loss. |
| Multi-token prediction | Instead of generating one word at a time, the model predicts several words simultaneously and verifies them in a single step. |
| Mamba layers | A memory architecture that reads input once and takes compressed notes, rather than re-reading everything for every new question. |
| Stochastic rounding | Adding carefully designed random noise during calculations so that small rounding errors cancel out over many steps rather than compounding. |
| Open-weight model | An AI model where the trained weights are publicly available for anyone to download and run. |
Sources and resources
- NVIDIA's New AI Just Changed Everything — Two Minute Papers — Original video by Dr. Károly Zsolnai-Fehér, April 7, 2026
- NVIDIA Nemotron 3 Super Technical Report — Full 51-page paper detailing architecture, training data, and benchmarks
- Jensen Huang — Wikipedia — CEO and co-founder of NVIDIA
- Dr. Károly Zsolnai-Fehér — TU Wien — Host of Two Minute Papers, researcher at TU Wien
Want to go deeper? Watch the full video on YouTube →