A specialized approach to the verification gap
Today we are releasing the alpha version of KeelTest, a VS Code extension designed to handle the most repetitive aspect of the development lifecycle: writing and maintaining unit tests. While general-purpose AI assistants have become standard in the editor, we believe there is a significant gap in how these tools handle verification. KeelTest is our attempt to bridge that gap with a specialized, agentic pipeline built specifically for pytest environments.
The Verification Gap: Why single-shot prompts fail
The industry is currently split between two extremes: autocomplete engines that suggest lines of code, and autonomous engineers that try to handle entire features. KeelTest sits intentionally in the middle. We aren't building a general-purpose agent; we are building a specialized tool that focuses on a single, high-friction task: generating production-ready test suites that actually pass.
Most AI test generators simply pass a file to an LLM and hope the resulting tests are correct. This "single-shot" approach inevitably leads to several failure modes:
- Context Drift: The model forgets which fixtures were defined at the top of the file.
- Dependency Hallucinations: The model assumes mocks exist that were never configured.
- Verification Blindness: The AI has no way of knowing if the code it just wrote actually executes.
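To make these failure modes concrete, here is the kind of deliberately broken output a single-shot generator produces. Everything in it is hypothetical: payment_service, process_payment, and the db_session fixture are invented names for illustration.

```python
# Hypothetical single-shot output exhibiting all three failure modes.
from unittest.mock import patch

def test_process_payment_success(db_session):
    # Context drift: no db_session fixture exists in this file or conftest.py.
    with patch("payment_service.gateway.charge") as mock_charge:
        # Dependency hallucination: the patched module path was never verified
        # against the real package layout.
        mock_charge.return_value = {"status": "ok"}
        # Verification blindness: process_payment was never imported, so this
        # raises NameError, and nothing ever executed the test to notice.
        result = process_payment(db_session, amount=100)
        assert result.status == "ok"
```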
The KeelTest Architecture: Plan, Generate, Triage
To solve these issues, KeelTest implements a multi-stage agentic pipeline that treats test generation as a formal engineering process rather than a text completion task.
Phase 1: The Senior Architect (Planning)
Before a single line of test code is written, KeelTest uses a high-reasoning model (think Claude Opus or GPT-5) to perform Semantic Planning. We use static analysis to map the control flow and identify the external dependencies (databases, APIs, services) that require mocking.
The result isn't code; it's a JSON Test Specification. This spec outlines every edge case, required fixture, and mock strategy for every function in the file. By decoupling "thinking" from "writing," we achieve a much higher baseline pass rate (averaging 85% on initial generation).
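As a rough sketch of what a static pass like this can look like: the snippet below walks a module with Python's ast library and maps each function to the external modules it touches. The EXTERNAL_HINTS allowlist and the function name are illustrative assumptions, not KeelTest's actual heuristic.

```python
import ast

# Modules whose use typically signals an external dependency to mock.
# This allowlist is illustrative, not KeelTest's actual heuristic.
EXTERNAL_HINTS = {"requests", "sqlalchemy", "redis", "boto3", "httpx"}

def find_mock_candidates(source: str) -> dict[str, list[str]]:
    """Map each top-level function to the external modules it references."""
    tree = ast.parse(source)

    # Record every imported name so usages can be traced back to a module.
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported[alias.asname or alias.name] = alias.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                imported[alias.asname or alias.name] = node.module

    plans = {}
    for fn in tree.body:
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        deps = {
            imported[node.id].split(".")[0]
            for node in ast.walk(fn)
            if isinstance(node, ast.Name)
            and node.id in imported
            and imported[node.id].split(".")[0] in EXTERNAL_HINTS
        }
        plans[fn.name] = sorted(deps)
    return plans
```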
Phase 2: The Per-Function Generation Loop
Instead of generating 500 lines of test code at once, KeelTest focuses on individual functions. We isolate function_A, provide the model with the architectural spec, and generate a specific fragment. This keeps the context window tight and prevents the "hallucination drift" common in large-file generation. The slice of the spec the model receives for a single function looks like this:
```json
{
  "imports": {...},
  "functionPlans": [
    {
      "functionName": "calculate_risk_score",
      "testCases": [
        {
          "name": "test_risk_score_with_negative_input",
          "category": "edge_case",
          "mocksNeeded": ["logger"]
        }
      ]
    }
  ]
}
```
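A minimal sketch of the isolation step, assuming the spec format shown above: extract_function_source and build_prompt are illustrative names, and ast.get_source_segment requires Python 3.8+.

```python
import ast

def extract_function_source(source: str, function_name: str) -> str:
    """Pull out a single function so the model sees only what it needs."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) \
                and node.name == function_name:
            return ast.get_source_segment(source, node)
    raise ValueError(f"{function_name!r} not found at module level")

def build_prompt(source: str, function_plan: dict) -> str:
    """Pair one function's source with its slice of the JSON spec."""
    fragment = extract_function_source(source, function_plan["functionName"])
    return (
        "Write pytest tests for exactly this function, following the spec.\n"
        f"Spec: {function_plan}\n"
        f"Source:\n{fragment}"
    )
```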
Phase 3: The Triage Loop (The Secret Sauce)
This is our core differentiator. KeelTest executes every generated fragment in a secure sandbox immediately after generation. If a test fails, our **Triage Agent** performs a deep analysis of the traceback to categorize the failure:
- Hallucination: The test is wrong (e.g., wrong mock setup). Action: Targeted regeneration with the error injected as feedback.
- Source Bug: The test is correct, but your code is broken (e.g., a missing await). Action: Halt retries and flag the bug for the user.
- Mock Issue: A complex technical hurdle (e.g., AsyncMock vs. Mock). Action: Automated fix-up logic.
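For intuition, here is a deliberately naive sketch of the categorization step. The real agent reasons over the full traceback; these string patterns only illustrate the three categories above (the pytest "fixture not found" error, the TypeError raised when a plain Mock is awaited, and the "never awaited" coroutine warning).

```python
import re

# A deliberately naive classifier; category names mirror the list above.
def triage_failure(traceback_text: str) -> str:
    if "can't be used in 'await' expression" in traceback_text:
        # A Mock was awaited where an AsyncMock was needed: automated fix-up.
        return "mock_issue"
    if re.search(r"fixture '\w+' not found|ModuleNotFoundError"
                 r"|does not have the attribute", traceback_text):
        # The test references things that don't exist: regenerate with the
        # traceback injected as feedback.
        return "hallucination"
    if "was never awaited" in traceback_text:
        # Likely a missing await in the code under test: halt retries and
        # flag the bug for the user.
        return "source_bug"
    return "hallucination"  # default to one more regeneration attempt
```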
Real-World Performance: Beyond the 70% Ceiling
In our benchmarks, general-purpose LLMs hit a "quality ceiling" at around 70% pass rate for complex enterprise Python code. By implementing the Triage Loop and the Semantic Planning phase, KeelTest pushes this ceiling to 90%+. We aren't just giving you more code; we are giving you a verified suite where the failures are often actual bugs in your implementation discovered during the generation process.
Alpha Strengths and Roadmap
Alpha Strengths:
- Async Logic: High success rate in generating complex asyncio tests and fixtures.
- Dependency Injection: Automatically identifies and mocks standard Python dependencies (FastAPI, SQLAlchemy, Redis).
- State Management: Correctly handles fixture sharing and cleanup between test cases.
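As an illustration of the output style these strengths aim for, here is a hand-written example of an async test with a shared, cleaned-up mock fixture. UserService and its Redis-backed cache are hypothetical and inlined so the example is self-contained, and the test assumes the pytest-asyncio plugin is installed.

```python
import pytest
from unittest.mock import AsyncMock

# Hypothetical code under test, inlined here so the example runs standalone.
class UserService:
    def __init__(self, cache):
        self.cache = cache

    async def get_user(self, user_id: int):
        cached = await self.cache.get(f"user:{user_id}")
        if cached is None:
            user = {"id": user_id}  # stand-in for a real DB lookup
            await self.cache.set(f"user:{user_id}", user)
            return user
        return cached

# The kind of test we aim to emit: an AsyncMock-backed fixture with cleanup.
@pytest.fixture
def redis_client():
    client = AsyncMock()        # stands in for an async Redis client
    yield client
    client.reset_mock()         # per-test cleanup of shared fixture state

@pytest.mark.asyncio
async def test_get_user_cache_miss(redis_client):
    redis_client.get.return_value = None       # simulate a cache miss
    service = UserService(cache=redis_client)
    user = await service.get_user(user_id=42)
    assert user["id"] == 42
    redis_client.set.assert_awaited_once()     # cache populated exactly once
```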
Known Limitations & Future:
- Project Scope: Currently optimized for Python/pytest. Support for JS/TS (Vitest/Jest) is in experimental beta.
- Performance: Real verification takes time. A full verified suite takes 30-60 seconds to generate: slower than "chat," but significantly faster than writing the suite manually.
Get Started
We are actively looking for feedback from the engineering community. You can install the extension directly from the Marketplace and start using it on your Python projects today. Every generation helps refine our triage algorithms and makes the agentic loop smarter.
- Install: VS Code Marketplace
- Documentation: Product Overview
- Community: Join our Discord to share your feedback and ask questions.
