LangChain ran 138 tests across seven agentic task categories (file operations, tool use, retrieval, conversation, memory management, summarisation, and unit tests) comparing Claude Opus 4.6 against two open-weight models hosted via third-party inference providers. The results: Opus scored 68%, GLM-5 via Baseten scored 64%, MiniMax M2.7 via OpenRouter scored 57%. The performance gap (4 and 11 percentage points, respectively) is real and probably matters for high-stakes tasks. The cost gap is also real and almost certainly matters for production deployments at scale.
GLM-5 on Baseten runs at 0.65 seconds time-to-first-token and 70 tokens per second; Opus 4.6 runs at 2.56 seconds and 34 tokens per second. On throughput alone, the open model is roughly twice as fast, and it returns its first token about four times sooner. On cost, MiniMax M2.7 works out to approximately $12 per day at 10 million tokens per day versus around $250 per day for Opus, a difference of about $238 per day, or roughly $87,000 per year at that volume. For many production agentic workloads, especially high-volume automation that does not require peak reasoning quality on every step, that cost difference changes the architecture decision.
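The arithmetic behind those figures is easy to reproduce. The sketch below uses only the numbers quoted above; the model labels, the function names, and the 500-token response length are illustrative assumptions, and the latency estimate ignores network and tool-call overhead.

```python
# Figures quoted in the text above; the dictionary keys are illustrative labels.
DAILY_COST_USD = {"opus-4.6": 250.0, "minimax-m2.7": 12.0}  # at 10M tokens/day
LATENCY = {
    # label: (time to first token in seconds, tokens per second)
    "opus-4.6": (2.56, 34.0),
    "glm-5-baseten": (0.65, 70.0),
}


def annual_cost_difference(expensive: str, cheap: str, days: int = 365) -> float:
    """Yearly saving from moving the whole daily volume to the cheaper model."""
    return (DAILY_COST_USD[expensive] - DAILY_COST_USD[cheap]) * days


def response_time(model: str, output_tokens: int) -> float:
    """Rough seconds for one response: time to first token plus generation time."""
    ttft, tokens_per_second = LATENCY[model]
    return ttft + output_tokens / tokens_per_second


if __name__ == "__main__":
    saving = annual_cost_difference("opus-4.6", "minimax-m2.7")
    print(f"annual difference: ${saving:,.0f}")
    for model in LATENCY:
        # 500 output tokens is an assumed response length, not a benchmark figure.
        print(f"{model}: {response_time(model, 500):.1f} s for 500 output tokens")
```

Under these assumptions the annual difference comes to $86,870, and a 500-token response takes about 7.8 seconds on GLM-5 versus about 17.3 seconds on Opus 4.6.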
The framing of “crossing a threshold” is accurate but needs precision. The threshold is not “open models are as good as frontier models.” It is “open models are good enough for a large and growing subset of production agentic tasks, and the cost/latency delta now makes them the default choice for those tasks.” File operations, structured retrieval, and deterministic tool use are where the gap narrows most. Complex multi-hop reasoning and tasks that depend on broad world knowledge are still where frontier APIs have a meaningful advantage.
Teams running agents in production face a routing question: which steps actually require frontier-quality reasoning, and which are well served by a faster, cheaper model? Many agentic pipelines are built with a single model for all steps because that is the simplest architecture. Introducing model routing adds complexity, but it can remove most of the cost of every step that moves to the cheaper model. LangChain notes that switching models is a one-line code change in its SDK, so the infrastructure friction is low.
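One way to picture the routing decision is a small lookup that sends known-easy step kinds to the open-weight model and everything else to the frontier model. This is a minimal sketch, not LangChain's API: the step taxonomy, the model labels, and the `blended_daily_cost` helper are all hypothetical, and the cost estimate assumes spend scales linearly with token volume using the $250 and $12 daily figures quoted earlier.

```python
from dataclasses import dataclass

# Hypothetical model labels; substitute whatever identifiers your provider uses.
FRONTIER = "frontier-model"
OPEN_WEIGHT = "open-weight-model"

# Assumed taxonomy: step kinds where the gap is described as narrowest.
CHEAP_STEP_KINDS = {"file_operation", "structured_retrieval", "deterministic_tool_use"}


@dataclass(frozen=True)
class Step:
    kind: str
    prompt: str


def choose_model(step: Step) -> str:
    """Route known-easy step kinds to the open-weight model, the rest to frontier."""
    return OPEN_WEIGHT if step.kind in CHEAP_STEP_KINDS else FRONTIER


def blended_daily_cost(cheap_share: float, frontier_daily: float = 250.0,
                       cheap_daily: float = 12.0) -> float:
    """Daily cost when `cheap_share` of token volume moves to the cheaper model.

    Assumes cost is linear in tokens at a fixed total daily volume.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_daily + (1.0 - cheap_share) * frontier_daily


if __name__ == "__main__":
    print(choose_model(Step("file_operation", "list the files in ./src")))
    print(choose_model(Step("multi_hop_reasoning", "reconcile these three reports")))
    print(f"${blended_daily_cost(0.8):.2f} per day with 80% of tokens routed")
```

Unknown step kinds default to the frontier model, so a misclassified step costs money rather than quality. With a hypothetical 80% of tokens routed to the cheaper model, the blended cost is about $60 per day instead of $250.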
The benchmark was conducted in early April 2026. Model capabilities are changing quickly enough that specific numbers will date fast. The directional conclusion is more durable: open-weight models have crossed from “viable for experimentation” to “viable for production” on core agentic tasks, and the cost argument for using them where appropriate is now compelling.