Forge: guardrails push an 8B local model to near-frontier reliability on agent tasks

Forge is an open-source reliability layer for self-hosted LLM tool-calling that wraps small local models in composable guardrails — rescue parsing for malformed tool calls, retry nudges, step enforcement — alongside VRAM-aware context management with tiered compaction. The headline claim from the project’s 26-scenario eval suite is that Ministral-3 8B Instruct Q8 on llama-server reaches 86.5% overall and 76% on the hardest tier, putting a quantized 8B model in striking distance of frontier APIs for multi-step agentic workflows.

The library exposes three integration surfaces: a full WorkflowRunner with a SlotWorker for multi-agent GPU sharing, middleware that plugs the guardrail stack into an existing orchestration loop, and an OpenAI-compatible proxy that transparently upgrades any client (opencode, Continue, aider) pointed at a local backend. A key design trick in the proxy is a synthetic respond tool injected when tools are present — forcing the model to stay in tool-calling mode rather than choosing between text and tool output, which small models do poorly. Backends include Ollama, llama.cpp’s llama-server, Llamafile, and Anthropic as a frontier baseline.

The work ships with 865 deterministic unit tests, a batch eval harness with resumable JSONL output, and an accompanying ACM-published paper documenting the framework and ablations. The practical pitch: teams running local inference no longer need a 70B-class model to get reliable agent loops — the gap can be closed at the scaffolding layer.