H Company released Holo3 this week: a 35-billion-parameter mixture-of-experts model with roughly 10 billion active parameters that scores 78.85% on OSWorld-Verified, the current standard benchmark for desktop computer use. That is a new state of the art, and the weights are available on Hugging Face under Apache 2.0. The score matters less than how they got there.
The previous generation of computer-use models was trained primarily on human demonstration data: record a human performing tasks, then use those recordings as the training signal. This scales poorly because collecting diverse, high-quality demonstrations is expensive and slow. Holo3’s training uses what H Company calls an “Agentic Learning Flywheel”: a Synthetic Environment Factory where coding agents programmatically generate new websites and web applications from scratch, complete with end-to-end validation scripts. The model trains on an expanding set of environments it has never seen, rather than a fixed corpus of recorded demonstrations. This is closer in spirit to how AlphaGo learned (generate novel states, evaluate outcomes, learn from the distribution) than to supervised imitation learning.
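To make the idea concrete, here is a minimal sketch of pairing each generated environment with a programmatic validation check, so that outcomes can be scored without human demonstrations. It is an illustration under stated assumptions, not H Company's pipeline: every name (`Environment`, `make_environment`, `run_episode`, the two toy policies) is hypothetical, and the "environment" is a trivial form-filling stub rather than a generated web application.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Environment:
    """Toy stand-in for a generated web app plus its validation script."""
    name: str
    required_fields: List[str]          # fields the task asks the agent to fill
    validate: Callable[[State], bool]   # end-to-end check on the final state


def make_environment(seed: int) -> Environment:
    """Hypothetical factory: builds a new form-filling task from a seed."""
    rng = random.Random(seed)
    pool = ["name", "email", "sku", "quantity", "address"]
    fields = rng.sample(pool, k=rng.randint(2, 4))

    def validate(state: State) -> bool:
        # Success means every required field ended up non-empty.
        return all(state.get(f) for f in fields)

    return Environment(name=f"form-{seed}", required_fields=fields, validate=validate)


def run_episode(env: Environment, policy: Callable[[Environment], State]) -> bool:
    """Run a policy in the environment and score the result with the validator."""
    final_state = policy(env)
    return env.validate(final_state)


def careful_policy(env: Environment) -> State:
    """Toy policy that fills in every required field."""
    return {f: "filled" for f in env.required_fields}


def sloppy_policy(env: Environment) -> State:
    """Toy policy that always skips the last required field."""
    return {f: "filled" for f in env.required_fields[:-1]}


# Each seed yields a fresh environment; the validator supplies the success signal.
envs = [make_environment(seed) for seed in range(5)]
careful_results = [run_episode(e, careful_policy) for e in envs]
sloppy_results = [run_episode(e, sloppy_policy) for e in envs]
print("careful:", careful_results)
print("sloppy: ", sloppy_results)
```

The point of the design is that the pass/fail signal comes from the environment's own validator, so adding a new environment adds new training signal without any recording step.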
The more meaningful test is H Company’s internal evaluation: 486 multi-step tasks designed to reflect actual enterprise work, spanning e-commerce operations, business software, team collaboration tools, and multi-application workflows. A representative task: parse supplier pricing from a PDF, cross-reference it against budget constraints in a spreadsheet, then compose and send personalised emails to the relevant vendors. That chain requires the model to maintain context across applications, handle failure states, and make decisions mid-task without human intervention. The 78.85% OSWorld-Verified score tells you the model is capable; the 486-task suite tells you it was built for production use cases.
For teams building browser or desktop automation, Holo3 is immediately relevant. The model is available via the Hugging Face Inference API on a free tier. The Apache 2.0 licence means you can fine-tune it for a specific application domain and deploy it commercially without negotiating a separate licence, provided you keep the required attribution and licence notices. The MoE architecture keeps inference compute lower than you would expect for a 35B model, since only ~10B parameters are active for each token, although all 35B parameters still have to be held in memory.
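A quick back-of-the-envelope calculation shows what the active-parameter figure does and does not buy you. The sketch below uses the parameter counts quoted above; the 2-FLOPs-per-active-parameter rule of thumb and the 2-bytes-per-parameter (bf16) assumption are mine, not figures from H Company, and the estimate ignores the KV cache, activations, and routing overhead.

```python
TOTAL_PARAMS = 35e9    # total parameters, as quoted in the announcement
ACTIVE_PARAMS = 10e9   # approximate parameters active per token


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weight_memory_gb(total: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return total * bytes_per_param / 1e9


def forward_flops_per_token(params: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per parameter used, per token."""
    return 2 * params


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
memory_bf16 = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=2)  # assumes bf16 weights
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction per token: {fraction:.0%}")
print(f"weight memory at 2 bytes/param: {memory_bf16:.0f} GB")
print(f"compute vs. a dense 35B model: {moe_flops / dense_flops:.0%}")
```

Under these assumptions, per-token compute is a little under 30% of a dense model of the same size, while the weights alone still occupy about 70 GB at 2 bytes per parameter. The saving is in compute and latency, not in the hardware needed to load the model.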
OSWorld-Verified covers a well-defined set of environments. Production deployments involve enterprise applications with bespoke UI patterns, security restrictions, and failure modes that benchmarks do not capture. Holo3 is a strong base model; whether it works out of the box for your specific internal tooling depends on how far that tooling deviates from the training distribution. The self-improving synthetic environment approach does at least provide a path for adaptation: generate environments that mirror your specific applications, fine-tune, evaluate.
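One simple way to find out where your tooling deviates from the training distribution is to break your own task results down by application and look at the weakest category first. The sketch below assumes you already have pass/fail outcomes from your own evaluation runs; the function name, the category labels, and the results are made-up placeholders, not measurements of Holo3.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of passed tasks for each category."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Placeholder outcomes for two hypothetical internal applications.
results = [
    ("ticketing", True), ("ticketing", True), ("ticketing", False), ("ticketing", True),
    ("legacy_erp", False), ("legacy_erp", False), ("legacy_erp", True), ("legacy_erp", False),
]

rates = pass_rate_by_category(results)
weakest = min(rates, key=rates.get)  # the category to target with new environments

for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
print("generate environments for:", weakest)
```

Re-running the same breakdown after fine-tuning gives a like-for-like check on whether the added environments helped the category they were built for.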