The emergence of AI-powered testing tools is reshaping how developers approach quality assurance for web applications. Magnitude, an open-source testing framework that leverages visual AI agents, has recently sparked significant discussion among developers about the balance between deterministic testing and AI adaptability.
The Two-Model Architecture: Planning vs. Execution
At the core of Magnitude's approach is a distinct separation between planning and execution functions. The framework employs two different AI models: a planner (typically a larger, more capable model like Gemini 2.5 Pro) that develops the overall test strategy, and an executor (Moondream, a smaller 2B parameter model) that handles the actual UI interactions with pixel-level precision.
This architecture addresses a fundamental challenge in AI-based testing: how to make tests both adaptable and consistent. As explained by the Magnitude team in the community discussions, the planner builds a general plan which the executor runs. The key innovation is that this plan can be saved and re-run using only the executor for subsequent tests, making repeated runs faster, cheaper, and more consistent.
In the developers' own words: "Where it gets interesting is that we can save the execution plan that the big model comes up with and run with ONLY Moondream if the plan is specific enough. Then switch back out to the big model if some action path requires adjustment."
When interface changes occur that might break traditional tests, the system can dynamically revert to the planner model to adjust the test strategy, providing a blend of consistency and adaptability that traditional testing frameworks struggle to achieve.
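The resulting control flow can be pictured as a loop that prefers the cheap executor and only escalates on failure. Below is a minimal sketch of that behavior; the `runWithExecutor` and `replanFrom` helpers are hypothetical stand-ins for the two model calls, not Magnitude's actual API:

```typescript
// Illustrative sketch of the planner/executor fallback loop described above.
// None of these names are Magnitude's real API; they only model the behavior.

interface CachedAction {
  description: string; // natural-language description of one web action
}

type Plan = CachedAction[];

async function runTest(
  plan: Plan,
  // Small executor model (e.g. Moondream) attempts one cached action.
  runWithExecutor: (action: CachedAction) => Promise<boolean>,
  // Large planner model (e.g. Gemini 2.5 Pro) repairs the plan from a failed step.
  replanFrom: (plan: Plan, failedIndex: number) => Promise<Plan>,
): Promise<Plan> {
  for (let i = 0; i < plan.length; i++) {
    if (await runWithExecutor(plan[i])) continue;
    // The interface changed enough that the cheap model got stuck:
    // fall back to the planner, splice in the revised steps, and retry.
    const revisedTail = await replanFrom(plan, i);
    plan = [...plan.slice(0, i), ...revisedTail];
    i--; // re-attempt the repaired step with the executor
  }
  return plan; // save this (possibly updated) plan for the next run
}
```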
Magnitude's Two-Model Testing Architecture

- Planner Model
  - Recommended: Gemini 2.5 Pro
  - Alternatives: Models from Anthropic, OpenAI, AWS Bedrock, etc.
  - Function: Develops overall test strategy and adapts to interface changes
- Executor Model
  - Currently only supports Moondream (2B parameters)
  - Function: Handles UI interactions with pixel-level precision
  - Benefits: Fast, cheap, consistent execution
  - Pricing: Moondream offers 5,000 free requests per day (cloud version)
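Wired together, the two models are configured independently. The shape below is a hypothetical sketch of such a configuration; field names like `planner`, `executor`, and `provider` are assumptions for illustration, not the framework's documented schema:

```typescript
// Hypothetical configuration sketch for the two-model setup.
// Field names are illustrative assumptions, not Magnitude's documented schema.
export default {
  url: 'http://localhost:5173', // app under test
  planner: {
    provider: 'google-ai', // larger model: builds and repairs plans
    model: 'gemini-2.5-pro',
    apiKey: process.env.GOOGLE_API_KEY,
  },
  executor: {
    provider: 'moondream', // small 2B model: pixel-precise actions
    apiKey: process.env.MOONDREAM_API_KEY, // cloud tier: 5,000 free requests/day
  },
};
```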
Key Features
- Natural language test case creation (see the sketch after this list)
- Plan caching for consistent test execution
- Dynamic fallback to planner when interfaces change
- CI/CD integration similar to Playwright
- Self-hosting options available for Moondream
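The first two features are easiest to see in code. Here is a rough sketch of what a natural-language test case could look like; the chained `step`/`check`/`data` builder is an assumption modeled on the framework's Playwright-like style, not a verified API:

```typescript
// Illustrative natural-language test case in the style described above.
// The exact builder API is an assumption; only the idea of writing steps
// and checks in plain English comes from the framework.
import { test } from 'magnitude-test';

test('user can log in and see the dashboard')
  .step('Log in with the test account')
    .data({ username: 'demo@example.com', password: 'hunter2' })
    .check('The dashboard page is visible')
  .step('Open the settings panel')
    .check('The settings form shows the account email');
```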
The Determinism Debate
One of the most prominent concerns raised in community discussions centers on test determinism. Traditional automated tests are valued for their consistency and predictability, while AI-based approaches inherently introduce some level of non-determinism.
Magnitude's developers have addressed this concern by explaining that their architecture is specifically designed with determinism in mind. Rather than generating brittle code-based tests that break when interfaces change, Magnitude caches a plan of web actions described in natural language. For example, a cached typing action might include a natural language description of the target and the content to type, allowing the executor model to reliably find the target without relying on DOM selectors.
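Concretely, a cached entry for that kind of typing action could look roughly like this (field names are illustrative, not Magnitude's exact cache format):

```typescript
// Rough shape of one cached action: natural language stands in for DOM
// selectors, so the executor model locates the target visually on each run.
const cachedAction = {
  variant: 'type',                             // kind of web action
  target: 'search bar at the top of the page', // found by Moondream, not by CSS/XPath
  content: 'Hello world',                      // text to type
};
```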
This approach means that as long as the interface remains largely unchanged, tests can run consistently using the cached plan. When significant interface changes occur, the system intelligently falls back to the planner model to adapt the test, creating a new cached plan that can be executed consistently until the next major change.
Beyond Traditional Testing: Accessibility and Usability
An interesting thread in the community discussion explores how AI-based testing might extend beyond traditional functional testing into accessibility and usability evaluation. One commenter pointed out that relying solely on visual testing might let developers off the hook regarding accessibility concerns.
In response, the Magnitude team acknowledged this limitation and expressed interest in developing parallel accessibility tests that would run alongside visual tests but be restricted to using only the accessibility tree. This approach could help developers identify accessibility issues more effectively by simulating different types of disabilities or constraints.
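Magnitude has not shipped such a mode, but purely as an illustration of the idea, a check restricted to the accessibility tree could be built on Playwright's accessibility snapshot (the helper below is a hypothetical sketch):

```typescript
// Hypothetical illustration: a check that can only "see" what assistive
// technologies see, via Playwright's accessibility tree snapshot.
// Magnitude does not ship this; the helper only sketches the idea.
import { chromium } from 'playwright';

async function visibleToScreenReaders(url: string, label: string): Promise<boolean> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // The snapshot contains roles and names from the accessibility tree only,
    // not pixels or raw DOM structure.
    const tree = await page.accessibility.snapshot();
    return JSON.stringify(tree ?? {}).includes(label);
  } finally {
    await browser.close();
  }
}
```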
Some community members have also suggested that the non-deterministic nature of AI testing could actually be leveraged as an advantage for usability testing. By analyzing success rates across multiple test runs, developers might gain insights into how both AI agents and humans interact with their interfaces, potentially revealing usability issues that deterministic tests would miss.
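A crude sketch of that idea: rerun the same AI-driven test several times and treat the pass rate as a usability signal (`runOnce` below is a placeholder for a single non-deterministic test run):

```typescript
// Sketch: treat the pass rate of repeated AI-driven runs as a usability signal.
// `runOnce` is a placeholder for executing one non-deterministic test run.
async function successRate(
  runOnce: () => Promise<boolean>,
  runs = 20,
): Promise<number> {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    if (await runOnce()) passed++;
  }
  return passed / runs; // e.g. 0.6 may hint the flow confuses agents (and perhaps users)
}
```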
Cost and Performance Considerations
The community has shown particular interest in how Magnitude balances cost and performance. The two-model approach addresses this concern directly: the expensive, powerful planner model is used sparingly to develop and adjust test strategies, while the smaller, faster executor model handles the majority of test executions.
This approach significantly reduces costs compared to solutions that rely exclusively on large models like those used in OpenAI's Computer Use or Anthropic's Claude. Moondream, being only a 2B parameter model, is both faster and cheaper to run, with self-hosting options available for teams with specific deployment requirements.
As web application testing continues to evolve, frameworks like Magnitude represent an interesting middle ground between traditional automated testing and fully AI-driven approaches. By intelligently combining the strengths of different AI models and caching execution plans, they offer a glimpse of how testing might evolve to become both more adaptable and more efficient in the future.
Reference: Magnitude: The open source, AI-native testing framework for web apps