Launch notes
Today we are opening the doors to LeemerGLM, a discovery-first model built on Gemma 3 4B with vision built in, so it can read the web, interpret screens, and answer with grounded, testable steps.
We benchmarked frontier and open models for latency, safety, and tool-handling. Gemma 3 4B hit the sweet spot: small enough for responsive inference on our clusters, aligned enough to avoid hallucinated links, and flexible enough to fine-tune on our expert traces. It is also fully multimodal, so we can ship a single brain that reasons over pixels and tokens together.
The result is an expert that feels like an on-call teammate. Ask it to inspect a graph, critique onboarding copy, or reason through search results; LeemerGLM will combine the visuals with retrieved text and cite the sources it trusts.
We picked Gemma 3 4B as the spine because it is small enough to serve instantly yet carries frontier-scale safety, retrieval, and tool-use priors. It gives us a balanced reasoning core that stays grounded even under load.
LeemerGLM ingests screenshots, product docs, graphs, and code snippets without leaving the flow. Vision is not a bolt-on; it is how the model reasons about interfaces, diagrams, and noisy real-world data.
Inside LeemerChat, LeemerGLM sits beside Grok-4.1, GPT-5.1, and Gemini 3 Pro. Our router chooses the right expert for each step: Grok for speed, Gemini for world knowledge, GPT-5.1 for longform reasoning, and LeemerGLM for grounded multimodal synthesis.
We fine-tuned Gemma 3 4B on our routing traces to teach refusal patterns, citation-heavy answers, and low-latency search planning. This gave us a stable text specialist that could be trusted as the default brain for UI assistance.
Next we co-trained on product screenshots, debugging traces, and Figma exports. The goal was fast layout recognition and the ability to narrate what matters on a canvas—buttons, errors, and user flows—without verbose noise.
Finally we wired LeemerGLM into our expert router. It now pairs with Perplexity Sonar for retrieval, hands code to Grok-4.1-Fast, and defers deep synthesis to GPT-5.1. Each expert returns citations so the fusion layer can reconcile answers transparently.
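To make that division of labor concrete, here is a hypothetical sketch of the dispatch step. The task labels and model identifiers are illustrative stand-ins, not our production routing logic.

```ts
// Hypothetical dispatch step. Task labels and model identifiers are
// illustrative stand-ins, not production routing logic.
type Expert =
  | "perplexity/sonar"
  | "x-ai/grok-4.1-fast"
  | "openai/gpt-5.1"
  | "leemerchat/leemer-glm";

type StepKind = "retrieve" | "code" | "synthesize" | "multimodal";

function pickExpert(kind: StepKind): Expert {
  switch (kind) {
    case "retrieve":   return "perplexity/sonar";      // web retrieval with citations
    case "code":       return "x-ai/grok-4.1-fast";    // fast code suggestions
    case "synthesize": return "openai/gpt-5.1";        // deep longform synthesis
    case "multimodal": return "leemerchat/leemer-glm"; // grounded vision + text
  }
}
```

Because every expert returns citations, the fusion layer can diff their answers against the same sources instead of trusting any single model's output.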
The production route mirrors the marketing promise from /leemer-glm: authenticated access, transparent expert routing, and fast token streaming. We documented the flow so you can plug the endpoint straight into your product.
Authenticated by design
The /api/leemer-glm route checks for a signed-in Clerk session before work begins. That protects GPU time and keeps your conversations scoped to your account so history, preferences, and files stay private.
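As a concrete sketch, here is roughly what that guard looks like in a Next.js App Router handler using Clerk's server helpers. The file path and placeholder body are illustrative, not the production code.

```ts
// app/api/leemer-glm/route.ts -- hypothetical shape of the auth guard.
import { auth } from "@clerk/nextjs/server";

export async function POST(_req: Request) {
  // Reject before any GPU work: no signed-in Clerk session, no inference.
  const { userId } = await auth();
  if (!userId) {
    return new Response("Unauthorized", { status: 401 });
  }

  // ...hand off to the orchestrator, scoped to userId, and stream back...
  return new Response(null, { status: 501 }); // placeholder in this sketch
}
```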
API pricing + base model
Public API access is priced at $0.10 input / $0.30 output per million tokens for leemerchat/leemer-glm, which runs on our fine-tuned Gemma 3 4B base so responses stay multimodal by default.
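For a quick sense of scale, here is a back-of-envelope estimator at those rates; the token counts in the example are made up.

```ts
// Back-of-envelope cost for leemerchat/leemer-glm at the published rates.
const INPUT_USD_PER_M = 0.10;  // $ per million input tokens
const OUTPUT_USD_PER_M = 0.30; // $ per million output tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_USD_PER_M + (outputTokens / 1e6) * OUTPUT_USD_PER_M;
}

// Example: a screenshot-heavy prompt with 8,000 input and 1,000 output
// tokens costs roughly a tenth of a cent per call.
console.log(estimateCostUSD(8_000, 1_000).toFixed(4)); // "0.0011"
```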
Streaming router events
Responses are streamed as server-sent events. The first payload includes the router's chosen experts and the reasoning behind the selection, followed by the synthesis stream from the orchestrator.
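The exact payload schema is not documented here, so the shapes below are assumptions that match the description: a router event first, token deltas after, and an error event on failure.

```ts
// Assumed event shapes for the SSE stream; field names are guesses that
// match the description above, not a documented schema.
interface RouterEvent {
  type: "router";
  experts: string[]; // e.g. ["perplexity/sonar", "leemerchat/leemer-glm"]
  reasoning: string; // why these experts were selected for this request
}

interface TokenEvent {
  type: "token";
  delta: string; // next chunk of the synthesized answer
}

interface StreamErrorEvent {
  type: "error";
  message: string; // structured error emitted before the stream closes
}

type LeemerEvent = RouterEvent | TokenEvent | StreamErrorEvent;
```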
Graceful failure modes
If the synthesis stream is unavailable, the route emits a structured error event before closing the connection. We also guard against missing OpenRouter credentials, returning a clear 503 so clients can retry elsewhere.
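A minimal sketch of that credential guard follows; the environment variable name and error body are assumptions.

```ts
// Hypothetical credential guard. The env var name and error body are
// assumptions; 503 (not 500) signals a retryable upstream problem.
function guardUpstream(): Response | null {
  if (!process.env.OPENROUTER_API_KEY) {
    return new Response(JSON.stringify({ error: "missing_upstream_credentials" }), {
      status: 503,
      headers: { "Content-Type": "application/json" },
    });
  }
  return null; // credentials present; proceed to the orchestrator
}
```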
API flow
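Putting the pieces together, here is a minimal client sketch. It assumes the event shapes from the streaming section (redeclared so the sketch is self-contained) and one JSON "data:" line per SSE frame; treat it as a starting point rather than an official SDK.

```ts
// Minimal client sketch for the /api/leemer-glm flow. Assumes the event
// shapes from the streaming section and one JSON "data:" line per frame.
type LeemerEvent =
  | { type: "router"; experts: string[]; reasoning: string }
  | { type: "token"; delta: string }
  | { type: "error"; message: string };

async function callLeemerGLM(prompt: string): Promise<string> {
  const res = await fetch("/api/leemer-glm", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (res.status === 503) throw new Error("Upstream unavailable; retry another provider");
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  let answer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += value;
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // keep any partial frame for the next read
    for (const frame of frames) {
      const event = JSON.parse(frame.replace(/^data: /, "")) as LeemerEvent;
      if (event.type === "router") console.log("experts:", event.experts);
      else if (event.type === "token") answer += event.delta;
      else throw new Error(event.message); // structured error before close
    }
  }
  return answer;
}
```

Logging the first router event gives you an audit trail of which experts handled each request, which is what makes replay and debugging tractable.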
Use case: dashboard UX audit
Upload a dashboard screenshot and ask for the three fastest UX improvements. LeemerGLM identifies drop-off points, highlights mismatched typography, and drafts component-level fixes with Tailwind-ready code.
Use case: data storytelling memo
Paste a CSV preview and a chart image. The model cross-references both, suggests a narrative arc, and drafts a one-page memo with footnotes so you can ship executive updates without editing.
Use case: incident triage
Share a stack trace plus a screenshot of the failing page. LeemerGLM outlines probable causes, routes code suggestions to Grok, and hands back a concise runbook you can paste into PagerDuty notes.
Use case: direct API integration
Call the /api/leemer-glm endpoint directly from your product. The SSE stream surfaces the chosen experts and reasoning so you can log, debug, and replay responses without guessing what happened inside the orchestrator.