Anthropic lets developers control Claude's thinking time

This is an AI-generated summary. The source video may include demos, visuals and additional context.
In Brief
Matt Bleifer at Anthropic asked Claude to build the same traffic simulation three times. First quickly, then thoroughly, then at maximum effort. With ten times more time and ten times more tokens, the model delivered a radically better result: realistic driving patterns, varied car types, and a traffic light placed correctly.
That's the backdrop for his talk at Anthropic's developer conference Code with Claude. The topic: how developers can control exactly how hard Claude should work on a problem. Anthropic calls it the "thinking lever," an effort scale from "low" to "max" that balances time, cost, and quality.
The traffic simulation that explains it all
Bleifer is a product manager on Anthropic's research team. He picked a concrete example to make his point: run the same task with the same model, but let it work for different lengths of time. The task was to build a realistic simulation of cars approaching a traffic light.
At "low" effort, Opus 4.7 took around 50 seconds and used 4,600 tokens. The cars drove. They stopped at red. The traffic light was placed right in the middle of the road, but the simulation worked.
Turned up to "high," the model used roughly twice the time and twice the tokens. The result had varied car types, and the traffic light had moved to the side of the road.
At "max" effort, it took ten times as long and ten times as many tokens. The model built what Bleifer called an "intelligent driver model," where every car responded in its own way to the dynamics around it. The result was significantly better, but it cost significantly more.
The point is simple: let Claude spend more time and tokens on a problem, and the result often improves. The only question is how much that's worth.
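The scaling is easy to put in rough numbers. A minimal sketch using the figures from the demo (about 50 seconds and 4,600 tokens at low effort, with the roughly 2x and 10x multipliers Bleifer quoted; `estimate` is an illustrative helper, not an API):

```python
# Rough time/token scaling from the traffic-simulation demo.
# Baseline figures are the "low" effort numbers quoted in the talk;
# the 2x and 10x multipliers are the approximate ratios Bleifer gave.
BASE_SECONDS = 50
BASE_TOKENS = 4_600

MULTIPLIERS = {"low": 1, "high": 2, "max": 10}

def estimate(effort: str) -> tuple[int, int]:
    """Return a (seconds, tokens) estimate for an effort level."""
    m = MULTIPLIERS[effort]
    return BASE_SECONDS * m, BASE_TOKENS * m

for level in MULTIPLIERS:
    secs, toks = estimate(level)
    print(f"{level:>4}: ~{secs} s, ~{toks:,} tokens")
```

The curve is linear here only because the demo quoted round multipliers; your own tasks will trace their own quality-versus-cost curve.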
Three kinds of tokens Claude uses
To understand what's happening when Claude "thinks," you need to know about three types of tokens.
Thinking tokens are the model's internal scratchpad. Here Claude reasons step by step, weighs alternatives, and works through the problem before answering. You don't see these directly, but they're the foundation underneath the entire response.
Tool calling tokens are used when Claude has to talk to the world outside. Search code, read files, call an API. It's how the model interfaces with its environment.
Text tokens are the answer you actually read. Status updates along the way, summaries, or direct responses to your question.
All three cost something: dollars you pay for tokens, and time you wait. So developers need a dial to turn when they decide how much of each Claude should spend.
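The three-way split can be made concrete with a toy accounting sketch. Everything here is illustrative: `Usage` is a hypothetical structure, and the flat per-million-token price is a placeholder, not Anthropic's actual rates, which vary by model and token type:

```python
from dataclasses import dataclass

@dataclass
class Usage:
    thinking_tokens: int  # internal scratchpad reasoning
    tool_tokens: int      # tool calls and their results
    text_tokens: int      # the visible answer

    def total(self) -> int:
        return self.thinking_tokens + self.tool_tokens + self.text_tokens

    def cost_usd(self, price_per_mtok: float) -> float:
        # Placeholder flat price; real pricing differs per model
        # and per token type.
        return self.total() / 1_000_000 * price_per_mtok

u = Usage(thinking_tokens=3_000, tool_tokens=1_200, text_tokens=400)
print(u.total(), round(u.cost_usd(15.0), 4))
```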
The effort scale and budgets
Anthropic provides two dials. The first is the effort scale with five levels: low, medium, high, extra high, and max. You tell Claude how hard to work, and the model decides on its own how to distribute tokens across the three types.
The second is budgets. A budget is an upper bound. You can say: "Spend at most 100,000 tokens before stopping to check in with me." It could just as easily be a time or cost limit.
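A budget like that can also be enforced in application code. The sketch below makes assumptions: `run_step` is a stand-in for one agent turn, and a real integration would read token counts from the API response's usage field instead:

```python
# Sketch of a check-in loop bounded by a token budget.
# `run_step` simulates one agent turn; swap in real API calls
# and real usage numbers in practice.

TOKEN_BUDGET = 100_000

def run_step(state: int) -> tuple[int, int, bool]:
    """Stand-in agent turn: returns (new_state, tokens_used, done)."""
    return state + 1, 30_000, state + 1 >= 5

def run_with_budget(budget: int) -> tuple[int, int]:
    spent, state, done = 0, 0, False
    while not done:
        state, used, done = run_step(state)
        spent += used
        if spent >= budget:
            # Budget exhausted: stop and check in with the user.
            break
    return state, spent

print(run_with_budget(TOKEN_BUDGET))
```

The same loop works unchanged if the budget is seconds or dollars rather than tokens.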
Bleifer thinks budgets become more important as Claude works longer and longer on the same problem. Today the model thinks in seconds or minutes. He predicts it will soon work for days, weeks, months, even years on a single problem.
Adaptive thinking: Claude decides for itself
Earlier reasoning models followed a fixed pattern: think first, then call tools, then respond. Anthropic updated this with interleaved thinking, which lets Claude reason between each tool call.
Now Anthropic takes another step. With adaptive thinking, Claude decides for itself when and how much to think. It can start with a text response, call a tool, think about the result, call new tools, give an update, and so on. Or it can choose not to think at all for simple queries.
Adaptive thinking is the default from Opus 4.6 onward, and Anthropic now runs all internal benchmarks in this mode. Bleifer describes it as intelligence-maximizing: the same performance or better than interleaved thinking, with a better user experience.
How to pick the right level
Bleifer offered practical guidance for how developers should choose.
Max is for the hardest tasks, but it can show diminishing returns. Test it for your most intelligence-demanding use cases, but don't assume it's always the best value for the money.
Extra high was introduced with Opus 4.7 and is now the default in Claude Code and claude.ai. Bleifer recommends this for most coding and agentic tasks.
High is a good starting point for tasks that require strong intelligence. Test upward if you need more.
Medium fits when you need to keep cost down and can tolerate slightly lower quality.
Low is for short tasks or when latency is critical. But there's a surprise here.
When Anthropic ran Claude Plays Pokémon on low effort, the model treated the game like a speedrun. It skipped trainer battles to save time, used healing items it had stocked up on instead of returning to Pokémon Centers, and spammed an item called "Repel" to avoid random encounters in caves.
Bleifer's observation is sharp: low effort doesn't necessarily mean lower intelligence. It takes a certain kind of cleverness to find ways to minimize effort. Claude's interpretation of "low effort" became a creative strategy to beat the game as fast as possible.
His closing advice, if you don't run your own evals: go with extra high for code. It's usually good value for the money.
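Bleifer's guidance condenses into a simple lookup. The mapping below just restates his recommendations; `pick_effort` is an illustrative helper, and your own evals should override it:

```python
# Condensed version of the per-level guidance above.
# Categories and defaults mirror Bleifer's advice, not an API.

def pick_effort(task: str) -> str:
    recommendations = {
        "hardest": "max",            # test for diminishing returns
        "coding": "extra high",      # default in Claude Code / claude.ai
        "agentic": "extra high",
        "demanding": "high",         # strong intelligence needed
        "cost_sensitive": "medium",  # tolerate slightly lower quality
        "latency_critical": "low",   # short tasks, speed matters
    }
    # "Extra high for code" was his fallback advice without evals.
    return recommendations.get(task, "extra high")

print(pick_effort("coding"), pick_effort("latency_critical"))
```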
Small model or big model at low effort?
Test-time compute is one of two dials developers have. The other is model choice. Should you use a big model at low effort, or a small model at max?
Bleifer's rule of thumb: low effort on a big model is great when the task demands intelligence but you need speed. In the traffic example, Opus 4.7 at low effort used about the same number of tokens as Haiku 4.5 at max, but delivered a better result.
Small models are best when the task is simpler and you need to do it at scale. Classification, information extraction, basic summarization. They also give faster time to first token, so they're good when a user is waiting for a response.
Bleifer's summary: "Use small models for fast time to first token. Use bigger models at lower effort for fast time to last token."
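The rule of thumb falls out of a toy latency model: a small model starts streaming sooner, but a big model at low effort may need far fewer tokens to reach the same quality, so it can finish first. All numbers below are made up for illustration:

```python
# Toy latency model: total time = time to first token plus
# generation time for the remaining tokens. Figures are invented
# to illustrate the tradeoff, not measured.

def time_to_last_token(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    return ttft_s + tokens / tokens_per_s

# Small model: fast start, but needs more tokens for the same quality.
small = time_to_last_token(ttft_s=0.3, tokens=8_000, tokens_per_s=120)
# Big model at low effort: slower start, far fewer tokens needed.
big_low = time_to_last_token(ttft_s=1.0, tokens=3_000, tokens_per_s=60)

print(round(small, 1), round(big_low, 1))
```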
What this means in practice
Anthropic frames this as a step toward one goal: that Claude allocates compute incredibly well when asked. You set a quality bar and a budget. Claude figures out the rest.
For developers, the change is practical. Until now, the choice has been between models: a big slow one, or a small fast one. Now you have an extra dimension within the same model.
The main lesson from Bleifer: build evals where you measure performance, time, and cost against one another. Chart the curve. Pick the level that gives the best value for your use case. And read the transcripts to understand how Claude actually works at each level.
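That eval can be as simple as a table and a filter: score each effort level on quality, time, and cost, then take the cheapest level that clears your quality bar. The numbers below are placeholders for your own benchmark results:

```python
# Sketch of the eval Bleifer recommends. Replace the placeholder
# results with measurements from your own benchmark runs.

results = {
    # level: (quality 0-1, seconds, cost in USD) -- placeholder data
    "low":        (0.62,  50, 0.05),
    "medium":     (0.71,  70, 0.08),
    "high":       (0.80, 100, 0.11),
    "extra high": (0.85, 160, 0.20),
    "max":        (0.88, 500, 0.55),
}

def best_value(results, min_quality=0.8):
    """Cheapest effort level that clears the quality bar."""
    ok = {k: v for k, v in results.items() if v[0] >= min_quality}
    return min(ok, key=lambda k: ok[k][2])

print(best_value(results))
```

Raising `min_quality` walks you up the effort scale; charting all three columns shows where the returns start diminishing.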
Glossary
| Term | Definition |
|---|---|
| Test-time compute | The compute the model uses when actually responding, not when it was trained. |
| Thinking tokens | The model's internal "scratchpad" where it reasons before the answer comes. |
| Tool calling tokens | Tokens used when Claude calls external tools like search, files, or APIs. |
| Effort level | Scale from low to max that controls how much time and how many tokens Claude spends. |
| Adaptive thinking | Claude decides for itself when and how much to think. Default from Opus 4.6 onward. |
| Task budget | Upper bound on tokens, time, or cost Claude can use on a task. |
| Reasoning model | A language model trained to think through problems step by step. |
| Chain of thought | When the model writes out the full reasoning process before giving the answer. |
| Time to first token | How quickly the model starts producing the answer after a prompt. |
| Interleaved thinking | Claude can think between each tool call, not only before or after. |
Sources and resources
Want to go deeper? Watch the full video on YouTube.