When Anthropic announced Project Glasswing, the framing was that Mythos-level capability was a prerequisite for serious vulnerability discovery. AISLE tested that assumption directly. Eight models were run against the same vulnerability types Mythos found, including the FreeBSD NFS buffer overflow that featured in Anthropic’s showcase. All eight found it. A 3.6B-parameter model costing $0.11 per million tokens recovered the exploit. A 5.1B-active-parameter open model reconstructed the complete vulnerability chain for a 27-year-old OpenBSD bug that required reasoning about integer overflow across multiple code paths.
The finding is not that Mythos is overhyped. It is that capability in this domain does not scale smoothly with model size or cost. The frontier across security tasks is jagged rather than graduated: a small model that confidently finds one class of vulnerability may completely miss another, and model rankings reorder entirely depending on task type. This makes blanket claims about frontier-model exclusivity for security work difficult to sustain. The specific capabilities that matter depend on the specific vulnerability surface you are scanning.
Where small models do consistently underperform is false positive rate. Several models in the test flagged patched code as still vulnerable, producing noise that drowns out genuine findings. For a human analyst reviewing results, a high false positive rate is workable but expensive. For an automated pipeline that needs to triage at scale, it becomes a system design problem. The frontier models have better precision in addition to broader recall. But AISLE’s central observation is that the architecture of how you deploy models matters more than which model you pick: a team running inexpensive models broadly across a codebase finds more bugs in aggregate than a team running one expensive model selectively, because coverage compounds faster than per-query quality.
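The coverage-versus-quality trade-off can be made concrete with back-of-envelope arithmetic. All of the numbers below are illustrative assumptions for the sake of the sketch, not figures from AISLE's tests: a cheap model with modest per-unit recall scanning everything versus a frontier model with high recall scanning a budget-limited slice.

```python
# Hypothetical back-of-envelope model of coverage vs. per-query quality.
# Every number here is an illustrative assumption, not measured data.

FILES = 2000            # units of code available to scan
BUG_RATE = 0.01         # assumed fraction of units containing a real bug

# Strategy A: cheap model, budget covers every unit
cheap_recall = 0.45     # assumed per-unit detection rate
cheap_coverage = FILES

# Strategy B: frontier model, budget covers only a selective slice
frontier_recall = 0.85
frontier_coverage = 150

expected_cheap = cheap_coverage * BUG_RATE * cheap_recall
expected_frontier = frontier_coverage * BUG_RATE * frontier_recall

print(f"cheap, broad:     {expected_cheap:.1f} expected true findings")
print(f"frontier, narrow: {expected_frontier:.1f} expected true findings")
```

Under these assumptions the broad cheap scan surfaces several times more true findings in aggregate, even though it misses a larger share of bugs in any individual file. The point is only that expected findings scale linearly with coverage, so a large coverage gap dominates a moderate recall gap.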
The consequence for teams doing internal security review is that waiting for exclusive access to frontier security models, or budgeting as though frontier-model pricing is the floor for meaningful AI-assisted scanning, is the wrong frame. The scaffolding around the model, including how you structure the search, how you triage results, and how you maintain relationships with upstream maintainers to route findings, is where the durable advantage lives. That work does not require Mythos.
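One piece of that scaffolding, the triage step, can be sketched briefly. The data structures and filter rules below are hypothetical constructs for illustration, not AISLE's or Anthropic's actual tooling: findings from many cheap scans are deduplicated, findings that point at already-patched code are dropped (the dominant false-positive mode noted above), and agreement across independent models is required before a finding reaches a human.

```python
# Minimal triage sketch for aggregating findings from many cheap model scans.
# Finding, patched_lines, and the voting rule are all hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    vuln_class: str      # e.g. "integer-overflow"
    model: str           # which scanner produced the flag
    confidence: float

def triage(findings, patched_lines, min_votes=2):
    """Apply two cheap filters before any human sees a finding:
    1. drop anything pointing at code already patched upstream;
    2. require agreement from at least `min_votes` distinct models,
       which suppresses single-model false positives."""
    live = [f for f in findings if (f.file, f.line) not in patched_lines]
    votes = {}
    for f in live:
        votes.setdefault((f.file, f.line, f.vuln_class), set()).add(f.model)
    kept = {k for k, models in votes.items() if len(models) >= min_votes}
    return [f for f in live if (f.file, f.line, f.vuln_class) in kept]
```

Cross-model voting deliberately trades some recall for precision; the cheaper the individual scans, the more models you can afford to run, which is what makes the vote meaningful in the first place.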
What does require frontier capability is the harder edge of the task: novel vulnerability classes that require synthesis across large context windows, exploit chain construction for complex multi-step bugs, and anything that demands deep semantic reasoning about concurrent or architecture-specific behaviour. For those tasks, the capability gap between small and frontier models is real. The mistake is assuming that everything in security review sits at that edge, when the distribution of actual bugs skews heavily toward patterns that a well-scaffolded small model will find reliably.