LM Studio 0.4.0 shipped a feature that quietly changes what local inference looks like in practice. The desktop GUI is now optional. The company has extracted its inference engine into llmster, a standalone background daemon with a full command-line interface for downloading, loading, and serving models. That is useful on its own. What makes it more interesting is the addition of an Anthropic-compatible endpoint at POST /v1/messages, which means any tool that targets the Anthropic API can be redirected to a locally running model with two lines of shell configuration.
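Because the endpoint mirrors Anthropic's Messages API, a plain curl request exercises it directly. A minimal sketch, assuming llmster is serving on port 1234 (the port used in the demo below) and that the local server does not enforce authentication; the request body follows the standard Anthropic Messages format:

```shell
# Sketch of a raw request against the local Anthropic-compatible endpoint.
# Port and model name are taken from the article; the server may not be
# running in your environment, so a connection failure is tolerated here.
BODY='{"model": "gemma-4-26b-a4b", "max_tokens": 128, "messages": [{"role": "user", "content": "Say hello in one word."}]}'
curl -s http://localhost:1234/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d "$BODY" || true
```

Any client that can produce this request shape — which is to say, anything built on the Anthropic SDK — can talk to the local server.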

The practical demonstration is Claude Code running against Gemma 4 on a MacBook Pro with 48GB of RAM. You set ANTHROPIC_BASE_URL=http://localhost:1234 and ANTHROPIC_MODEL=gemma-4-26b-a4b, and Claude Code stops talking to Anthropic’s servers entirely. The reported throughput is 51 tokens per second. That is slower than the cloud API on a fast connection, but fast enough to be usable for focused, single-file tasks. No API costs, no rate limits, no network dependency, no data leaving the machine.
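The redirection itself really is just those two variables. A sketch of the shell configuration, with the Claude Code launch left as a comment since it assumes the CLI is installed:

```shell
# Point Claude Code at the local llmster server instead of Anthropic's API.
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_MODEL=gemma-4-26b-a4b
# claude    # then launch Claude Code as usual; requests stay on this machine
```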

This has legs beyond the cost and privacy angle. The Anthropic-compatible endpoint means the ecosystem of tools built around the Anthropic SDK gains local inference support without any code changes. If you have an internal tool that calls the Claude API, you can test it against a local model by changing a single environment variable. For evaluation pipelines or development workflows where you want to iterate without accumulating API costs, that is genuinely useful. The OpenAI-compatible endpoint that most local inference tools ship does not help here, because tools written against the Anthropic SDK do not speak OpenAI's format.
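The swap can even be scoped per invocation rather than per shell, which makes cloud-versus-local comparison runs cheap to set up. A sketch, where `run_eval` is a hypothetical stand-in for any script that calls the Claude API (the function name and its body are placeholders, not part of llmster or the SDK):

```shell
# run_eval is a hypothetical stand-in for a script built on the Anthropic SDK.
run_eval() {
  echo "base=${ANTHROPIC_BASE_URL:-https://api.anthropic.com} model=${ANTHROPIC_MODEL:-cloud-default}"
}

# Cloud run: the tool's defaults apply.
run_eval
# Local run: override only for this invocation; the tool's code is unchanged.
ANTHROPIC_BASE_URL=http://localhost:1234 ANTHROPIC_MODEL=gemma-4-26b-a4b run_eval
```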

A 26B-parameter model running on a MacBook is not a replacement for Claude 4.6 Opus on complex multi-step tasks. What you get is a privacy-preserving, zero-cost alternative for the parts of your workflow where a capable but not top-tier model is sufficient: code search, small edits, explaining what a function does, generating boilerplate. Knowing which tasks fall into that category, and routing them locally, is the actual skill.

LM Studio’s headless mode also has a role to play for server deployments. A background daemon that starts on boot and serves models over a local port is a different product category than a desktop app you open when you want to chat. Whether the performance and reliability hold up for sustained workloads is still an open question, but the architecture is now the right shape for embedding local inference into development workflows rather than treating it as a separate tool you switch to.