Build a 4-Layer AI Browser Automation Stack

Key insights
- Skills are just capabilities. The real power comes from layering agents, commands, and task runners on top
- Agentic UI testing uses natural-language user stories instead of brittle selectors and test configuration
- CLI tools beat MCP servers for browser automation because they use fewer tokens and allow custom layering
This article is a summary of My 4-Layer Claude Code Playwright CLI Skill (Agentic Browser Automation). Watch the video →
In Brief
Dan Disler, the developer behind the IndyDevDan YouTube channel, walks through a four-layer architecture for building AI agents that automate browsers and run UI tests. The system is built on the Playwright CLI and Claude Code's Chrome tools, and demonstrated through a reference codebase called Bowser. The core idea: stop thinking about individual skills, and start stacking them into layered systems that solve entire classes of problems.
The two classes of browser work
Disler identifies two types of work that agents can handle in the browser (0:41). The first is browser automation: having an agent complete tasks like shopping, data entry, or information gathering. The second is UI testing (user interface testing): having agents check that your application works by acting like real users.
The demo shows both in action. An Amazon shopping workflow runs via Claude Code Chrome, adding items to a cart and stopping just short of purchase (11:18). Meanwhile, three parallel Playwright agents open invisible browsers (called "headless" because there is no visible window) to test user stories (step-by-step descriptions of what a real user would do) on Hacker News (1:03).
Explained simply: Think of it like hiring two types of assistants. One runs errands for you (browser automation). The other checks that your shop looks right before customers arrive (UI testing). Both use the same ability to operate a browser, but for different purposes. Unlike real assistants, these agents can be cloned instantly, so you can run ten of them at once.
The four layers
The architecture that Disler calls "Bowser" has four distinct layers, each building on the one below (2:04). Here is how they stack up:
Layer 1: Skills (capability)
Skills are the foundation. They give your agent the raw ability to control a browser. Disler uses two skills: Playwright CLI for invisible, parallel browser sessions that use fewer tokens (the units AI models process, roughly 3-4 characters each), and Claude Code Chrome (activated with the --chrome flag) for using your existing logged-in browser session (2:40).
The key distinction: Playwright CLI runs headless and supports parallel sessions with persistent profiles, making it ideal for testing. Claude Code Chrome uses your actual browser, which means it has access to your logins and cookies, but cannot run in parallel (7:38).
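In Claude Code, a skill is just a markdown file the agent reads before acting. A minimal sketch of what a Playwright CLI skill file might contain, assuming the standard .claude/skills/ layout with YAML frontmatter (the file contents and wording here are illustrative, not Disler's actual skill):

```markdown
---
name: playwright-cli
description: Drive headless browsers via the Playwright CLI for parallel, low-token sessions.
---

# Playwright CLI skill

Use the Playwright command line for browser work. For example:

- Take a full-page screenshot:
  `npx playwright screenshot --full-page https://example.com out.png`

Prefer headless runs, and use one browser instance per task so sessions
can run in parallel without interfering with each other.
```

The point of the file is not the commands themselves but the defaults it bakes in: headless mode, one session per task, screenshot conventions.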
Layer 2: Subagents (scale)
Subagents take a skill and wrap it in a specialized prompt. The simplest version is a generic "Playwright browser agent" that can handle any browser task (6:33). The more powerful version is a browser QA agent with a structured workflow: parse a user story into steps, create a screenshot directory, execute each step, take screenshots along the way, and report pass or fail (7:53).
This is where specialization happens. A skill just says "you can use a browser." A subagent says "here is exactly how to validate a user story, step by step, with evidence."
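In Claude Code, a subagent is also just a markdown file, conventionally in .claude/agents/. A sketch of what the browser QA agent's definition might look like, following the structured workflow described above (the file name and exact wording are illustrative):

```markdown
---
name: browser-qa
description: Validates one user story in a headless browser and reports pass/fail with screenshots.
---

You are a browser QA agent. Given a single user story file:

1. Parse the story into discrete, verifiable steps.
2. Create a screenshot directory named after the story.
3. Execute each step using the Playwright CLI skill.
4. Take a screenshot after every step.
5. Report PASS or FAIL, with a one-line result per step.
```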
Layer 3: Slash commands (orchestration)
Slash commands (custom commands in .claude/commands/) are the coordination layer. They tell multiple subagents what to do and when (12:43). The /ui-review command, for example, finds all user story files, starts one browser QA agent per story, runs them in parallel, collects results, and generates a summary report (15:01).
Disler also introduces "higher-order prompts" (HOPs), commands that take another prompt as a parameter (18:37). The /automate command wraps any browser workflow in a consistent execution pattern, so you can swap out the specific task while keeping the logging and error handling the same.
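Slash commands are prompt files as well, and Claude Code substitutes the $ARGUMENTS placeholder with whatever text follows the command. A sketch of how a higher-order /automate command could wrap any task in a fixed execution pattern (the body shown is an illustration of the pattern, not Disler's actual command):

```markdown
<!-- .claude/commands/automate.md (illustrative) -->
Run the following browser workflow using the Claude Code Chrome skill:

$ARGUMENTS

Regardless of the workflow above:
- Log each action before taking it.
- On any error, capture a screenshot, report the failing step, and stop.
- Never complete a purchase or submit an irreversible form without confirmation.
```

The task varies; the logging and error-handling scaffold around it stays constant. That is what makes it a higher-order prompt.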
Layer 4: Just (reusability)
Just is a task runner (similar to make, but simpler) that sits at the top of the stack (20:03). A justfile in the project root lists every available workflow as a one-line command with overridable parameters. Run just on its own to list all available commands; run just ui-review or just automate-amazon to kick off a complete workflow.
The benefit: you, your team, and your agents always know what is available. No digging through directories to find the right command.
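A sketch of what such a justfile might look like. The recipe names, paths, and the claude -p invocation pattern are assumptions for illustration; just's parameter defaults and {{...}} interpolation are standard syntax:

```just
# justfile (illustrative)

# Running `just` with no arguments lists all recipes
default:
    just --list

# Run one browser QA agent per user story, in parallel
ui-review stories="user-stories/":
    claude -p "/ui-review {{stories}}"

# Wrap any browser task in the /automate higher-order command
automate task:
    claude -p "/automate {{task}}"
```

Defaults like stories="user-stories/" can be overridden at the call site: just ui-review stories=checkout-stories/.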
Why CLI over MCP?
Disler makes a strong case for using CLI tools instead of MCP (Model Context Protocol) servers for browser automation (3:02). MCP servers use more tokens because they have fixed interfaces and wordy back-and-forth communication. CLIs use fewer tokens and let you build your own custom layer on top.
With a CLI skill, you control exactly what defaults are set, what output format is used, and how sessions are managed. With an MCP server, you are locked into the server's design decisions.
Explained simply: Think of the difference between ordering from a fixed menu (MCP server) versus having access to a kitchen where you cook whatever you want (CLI). The menu is convenient but limited. The kitchen lets you make exactly what you need. The analogy breaks down in one respect: once the skill file is written, a CLI actually requires less setup than an MCP server, not more.
Why agentic UI testing?
Traditional UI testing with frameworks like Jest or Vitest requires writing test configuration, targeting specific page elements by their code-level identifiers (CSS selectors), and maintaining fragile test code that breaks when the UI changes. Agentic testing replaces all of that with natural-language user stories (16:11).
A user story file is simple: a name, a URL, and a step-by-step workflow written in plain English. The agent visits the URL and works through the steps like a real user would. If something fails, it reports what went wrong and saves screenshots of every step (3:36).
The trade-off is determinism. Traditional tests run the same way every time. Agent-based tests are non-deterministic, meaning the agent might take slightly different paths. Disler acknowledges this tension but argues that for many use cases, the speed of writing and updating tests outweighs the loss of perfect repeatability (16:53).
Solve classes of problems, not individual tasks
Disler closes with what he calls the meta-theme behind the whole architecture: stop solving individual tasks, and start solving entire classes of problems (23:03). The Bowser codebase is not built for one specific project. It is a template you can bring to any codebase and adapt.
The idea is that every time you solve a problem with agents, the solution should be reusable. Next time you encounter the same problem, the agents should do more and you should do less. Code, Disler argues, is fully commoditized: anyone can produce it cheaply. "Anyone can generate code. That is not an advantage anymore. What is an advantage is your specific solution to the problem you're solving" (5:22).
He also warns against outsourcing the learning to pre-made plugins and other people's prompts. If you cannot build up the layers yourself, you will always be limited by what others have made (25:00).
Checklist: Common pitfalls
- Are you using an MCP server when a CLI would work? MCP servers consume more tokens and are less flexible. Check if a CLI alternative exists first.
- Running Claude Code Chrome in parallel? It does not support parallel sessions. Use Playwright CLI for parallel testing and reserve Chrome mode for personal automation that needs your login sessions (7:38).
- Skipping the screenshot trail? Without screenshots, you cannot verify what the agent actually did. Always configure your QA agent to save screenshots at each step.
- Writing overly vague user stories? The agent works best with concrete, step-by-step instructions. "Test the homepage" is too vague. "Navigate to the homepage, verify the headline contains X, click the first link, verify the page loads" is actionable.
- Jumping straight to commands without testing skills first? Build and verify each layer before stacking the next one. Test the skill, then the subagent, then the command (12:07).
Practical implications
For beginners
Start with Layer 1 only. Write a single Playwright CLI skill that lets your agent navigate to a URL and take a screenshot. Once that works, try wrapping it in a subagent with a specific task.
For teams shipping fast
The /ui-review pattern is immediately useful. Write user stories for your critical workflows, point them at your staging URL (the test version of your site before it goes live), and run the review before each deploy. The non-deterministic nature is a feature here: agents catch visual and interaction issues that rigid selectors miss.
For solo developers
The justfile layer pays off fast when you juggle multiple projects. One place to find and run every automation, every test suite, every browser workflow. No need to remember which command lives where.
Test yourself
- Trade-off: When would you choose Claude Code Chrome over Playwright CLI for a browser task, and when would it be the wrong choice?
- Architecture: You need to test a web app across three browsers (Chrome, Firefox, Safari) and five user stories. How would you structure the four layers to run this efficiently?
- Transfer: How could you adapt the four-layer pattern (skill, subagent, command, task runner) to a completely different domain, like automated code review or database migration?
Glossary
| Term | Definition |
|---|---|
| Agentic engineering | Building software by designing systems of AI agents that collaborate and execute tasks, rather than writing all code manually. |
| Browser automation | Using software to control a web browser programmatically, performing actions like clicking buttons, filling forms, and navigating pages without a human doing it. |
| CLI (Command-Line Interface) | A text-based way to interact with a program by typing commands. Think of it like texting instructions instead of clicking buttons. |
| Claude Code Chrome | A Claude Code feature (activated with --chrome) that lets the AI agent control your actual Chrome browser, with access to your logins and extensions. |
| Headless browser | A browser that runs without a visible window. It does everything a normal browser does, but invisibly in the background. Useful for automated testing. |
| Higher-order prompt (HOP) | A prompt that takes another prompt as input. Like a function that takes a function as a parameter. It wraps variable tasks in a consistent execution pattern. |
| Just | A simple task runner (like make but easier to use) for defining and running project commands from a single justfile. |
| MCP (Model Context Protocol) | A protocol that lets AI models connect to external tools and services. Useful but more token-heavy than CLI alternatives for some tasks. |
| Parallel sessions | Running multiple browser instances at the same time, each handling a different task independently. |
| Persistent profile | Saving browser state (cookies, login sessions) between runs so the agent does not have to log in every time. |
| Playwright CLI | A command-line tool from Microsoft for browser automation. Supports headless mode, parallel sessions, and multiple browser engines. |
| Skill (Claude Code) | A markdown file in .claude/skills/ that teaches the AI agent how to use a specific tool or capability. The foundational building block. |
| Slash command | A reusable prompt file in .claude/commands/ that can be invoked with /command-name. Used for orchestrating complex workflows. |
| Subagent | A child AI agent spawned by a primary agent to handle a specific task. Multiple subagents can run in parallel and report results back. |
| User story | A plain-language description of a workflow from the user's perspective. In agentic testing, it replaces traditional test code with natural language steps. |
Sources and resources
Want to go deeper? Watch the full video on YouTube →