This page is for consumer projects — apps and dev containers that need to use the shared LLM runtime (and, eventually, the public edge proxy) provided by platform-services.
Consumer projects reach platform-services through the llm-gateway — a priority-aware async job queue that is the single entry point for both LLM (ollama) and document parsing (docling) work. The gateway API and wiring are identical regardless of which deployment you point at.
| Deployment | Where it runs | Models loaded | When to use |
|---|---|---|---|
| OCI (default) | Oracle Cloud Ampere A1 | Production models (72B-class LLM, VLM-based docling) | Default for everything — production consumer deployments and local consumer development alike. |
| Local platform-services | Developer machine | Small models only (8B-class) | Special cases: working offline, intentionally exercising the small models, or when OCI is unavailable. |
Concretely:
Consumers reach platform-services through the gateway via the
published host port. Your consumer doesn’t join the
platform_default network — it stays decoupled from
platform-services’ lifecycle and keeps starting/running when the
platform is rebuilding or down.
services:
  app:
    extra_hosts:
      - "llm-gateway:host-gateway"

Behind the scenes, llm-gateway resolves via
/etc/hosts to the host gateway IP; TCP connections then
land on the published host port (11435).
URLs:

- Gateway: http://llm-gateway:11435 (private; via
  host-gateway + tunnel — see Pointing local dev at OCI below).
  If your machine also runs a local platform-services stack on
  :11435, the OCI tunnel binds :21435
  instead to avoid the port collision — point your consumer at
  21435 for OCI work. See Sharing localhost with a local
  platform-services for the full pattern. make tunnel
  from platform-services already does this; the trap is consumer URLs
  pinned to :11435 silently landing on the local stack
  instead of the tunnel.
- Docs: https://soggplatform.dedyn.io/ (public
  TLS edge — reachable from anywhere, no extra_hosts wiring or tunnel
  needed)

Why the two endpoints have different reachability stories: the gateway is a write-capable LLM dispatch API with no auth/rate-limit yet, so it’s deliberately kept off the public internet. Docs are read-only and safe to publish, so they’re served straight from the public TLS edge — no per-distro setup required.
Don’t hardcode these in app code — parameterize them via environment
variables (LLM_GATEWAY_URL,
PLATFORM_DOCS_URL). Same code then works under local dev,
deployed prod, and the SSH-tunnel variant documented further down.
When platform-services is up, your consumer reaches it. When it’s down, the consumer still starts — gateway calls fail at the call site with a connection error, not at compose-up time. That’s the design intent.
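A minimal sketch of that pattern, assuming a Python consumer and only the standard library. The helper names `load_endpoints` and `gateway_reachable` are illustrative and not part of platform-services; the only names taken from this page are the two environment variables and the `/diagnostics` path.

```python
import os
import urllib.error
import urllib.request


def load_endpoints(env=os.environ):
    """Read both URLs from the environment; fail loudly if either is unset."""
    missing = [k for k in ("LLM_GATEWAY_URL", "PLATFORM_DOCS_URL") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "gateway": env["LLM_GATEWAY_URL"].rstrip("/"),
        "docs": env["PLATFORM_DOCS_URL"].rstrip("/"),
    }


def gateway_reachable(gateway_url, timeout_s=5.0):
    """Return False instead of raising when the platform is down.

    The failure surfaces here, at the call site, not at compose-up time.
    """
    try:
        with urllib.request.urlopen(f"{gateway_url}/diagnostics", timeout=timeout_s):
            return True
    except (urllib.error.URLError, OSError):
        return False
```

Switching a consumer between a local stack and the OCI tunnel then only means changing `LLM_GATEWAY_URL` (for example from port 11435 to 21435), with no code change.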
Not supported. ollama and docling are internal services of platform-services and have no consumer contract. The gateway owns all backend dispatch, including:
LIFECYCLE_IDLE_TIMEOUT_SECONDS of no activity

If you were previously wired to ollama:11434 or
docling:5005, see Migrating from direct backend
access below.
| Service | URL |
|---|---|
| llm-gateway | http://llm-gateway:11435 (private — see note below) |
| platform-docs (page) | https://soggplatform.dedyn.io/ |
| platform-docs (model list) | https://soggplatform.dedyn.io/models.json |
| platform-docs (deploy info) | https://soggplatform.dedyn.io/status.json |
Notes:
- platform-docs is public: no extra_hosts or tunnel needed. The same content is also
  served on http://platform-docs:5006/ to containers on the
  OCI host itself (in-host shortcut); local-dev consumers should ignore
  that and use the public URL.
- llm-gateway is private: it needs the host-gateway
  alias plus either a local platform-services stack or the SSH tunnel
  documented below.
- /models.json is proxied live from ollama’s
  /api/tags via platform-docs — hit this for model discovery
  rather than hardcoding model tags in your config.
- ollama’s :11434 and docling’s :5001 are
  not part of the consumer contract. On the OCI
  deployment they aren’t reachable from consumers at all (large host is on
  a private VCN). On a local platform-services stack the runtime still
  publishes them for ops debugging, but consumers should not use them —
  see What about direct ollama / docling access? above.

The gateway exposes a uniform async job API for both backends. Submit a job, get an ID back immediately, poll for the result. Jobs are expected to run roughly 5–15 minutes on the current hardware (longer once we cut over to 72B-class models) — long enough that holding an HTTP connection open across the work is the wrong default.
| Verb | Path | Purpose |
|---|---|---|
| POST | /v1/jobs | Submit a job. Returns { id, status, tier, backend, queue_position }. |
| GET | /v1/jobs | List recent jobs (ops visibility). Query params: status (default active = queued+running+failed; or queued / running / completed / failed / all), backend (ollama / docling), limit (1–200, default 20). Returns a compact view per job — result payloads are replaced by result_bytes, errors trimmed to 500 chars. Use /v1/jobs/{id} for the full payload. |
| GET | /v1/jobs/{id} | Status + result (or error). Poll this. |
| DELETE | /v1/jobs/{id} | Cancel a job. Only valid while status=queued. |
| GET | /queue | Aggregate stats: in-flight (one global slot), per-tier queue depths, last-24h counters. Includes a timestamp field for cross-sample correlation. |
| GET | /diagnostics | Live backend health and a single-field state summarizing the whole platform (stopped, starting, ready_cold, warming, ready_warm, busy, failed, backend_unhealthy, disabled). Also exposes warm (boolean), loaded_models (in RAM right now), available_models (on disk), plus the raw per-backend probes. Use state and warm for routing decisions; use the raw fields for ops debug. |
{
  "endpoint": "/api/chat",
  "payload": { "model": "qwen2.5:72b-instruct-q4_K_M", "messages": [...] },
  "priority": "interactive",
  "backend": "ollama"
}

- endpoint — the upstream path to call on the chosen
  backend. For ollama: /api/chat, /api/generate,
  /api/embeddings, /v1/chat/completions, … For
  docling: /v1/convert/source/async.
- payload — the body you would have POSTed directly to
  the backend, verbatim (with caveats noted per-backend below).
- priority — optional. interactive or
  batch. Omitted → recorded as batch. See
  Choosing a priority for your job below. An unknown value
  (e.g. urgent) is rejected with 400 rather than silently
  falling back to batch.
- backend — optional. ollama or
  docling. Omitted → the gateway derives it from the endpoint
  path (/v1/convert/* → docling, anything else → ollama). If
  you supply backend explicitly and it contradicts the
  endpoint (e.g. endpoint=/api/chat +
  backend=docling), the submit is rejected with 400 — there
  is one unambiguous dispatch path per job, and the gateway loud-fails
  rather than silently routing to the wrong backend.

Set priority on the submit body to one of:
- interactive — drained ahead of any queued
  batch work. For human-facing, latency-sensitive calls
  (editor assistant, chat UI).
- batch — drained when no interactive work is queued. For
  bulk ingestion, indexing, scheduled jobs. This is the default when
  priority is omitted.

Tier is per-job, not per-caller: a single consumer can mix
interactive and batch submissions freely. There is no central
caller→tier mapping; the gateway trusts whatever priority
value the consumer declares (the trust model assumes a small number of
fully-trusted consumers on the internal network).
Set the X-Caller-Id request header to a short, stable
identifier for your consumer project
(e.g. my-editor-assistant). This is recorded on every job
and surfaces in /v1/jobs/{id} and the gateway log — useful
for tracing your own traffic and for the operator when debugging — but
it does not drive tier assignment. If the header is
absent, the gateway falls back to the peer IP as the caller identifier;
don’t rely on the fallback, it’s a backstop.
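The submission rules described so far (allowed priorities, backend derived from the endpoint path, explicit caller header) can be mirrored client-side so a bad request fails before it reaches the gateway. The sketch below is illustrative: `derive_backend` and `build_submit` are hypothetical helper names, and treating `/v1/convert/*` as a simple prefix match is an assumption about how the gateway applies that rule.

```python
VALID_PRIORITIES = {"interactive", "batch"}


def derive_backend(endpoint):
    """Mirror the documented rule: /v1/convert/* -> docling, anything else -> ollama."""
    return "docling" if endpoint.startswith("/v1/convert/") else "ollama"


def build_submit(endpoint, payload, caller_id, priority=None, backend=None):
    """Return (headers, body) for POST /v1/jobs.

    Raises ValueError for the two cases the gateway is documented to
    reject with HTTP 400: an unknown priority, or an explicit backend
    that contradicts the endpoint.
    """
    if priority is not None and priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority {priority!r}")
    if backend is not None and backend != derive_backend(endpoint):
        raise ValueError(f"backend {backend!r} contradicts endpoint {endpoint!r}")
    body = {"endpoint": endpoint, "payload": payload}
    if priority is not None:
        body["priority"] = priority  # omitted means the gateway records batch
    if backend is not None:
        body["backend"] = backend
    return {"X-Caller-Id": caller_id}, body
```

The returned pair can be passed straight to an HTTP client, for example as the `headers` and `json` arguments of a POST to `/v1/jobs`.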
Preemption is not supported. An interactive job submitted while a batch job is in flight waits for that batch job to finish (can be 5–15 min for docling work, longer for big inference). Once the in-flight slot frees, the interactive job dispatches before any queued batch. If sub-batch-duration latency matters for your use case, raise it — the gateway design will need to change.
The gateway runs one job at a time globally across all backends. An ollama job in flight blocks a docling dispatch and vice versa; two queued jobs of any combination run sequentially.
Why one slot, not one per backend: the on-demand large instance is CPU-bound on both workloads (no GPU on the free-tier ARM shape), so running ollama and docling in parallel halves each one’s throughput. Serial dispatch with priority tiers gives each job full CPU and keeps the queue model simple. Rationale and the trade-off in full: plans/priority-queueing.md Queue shape section.
The queue_position returned on submission counts every
queued job ahead of yours globally — 1 means “next to run,”
regardless of which backend any of the queued jobs target.
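To make the dispatch order concrete, here is a small simulation of the single-slot, two-tier queue. It is a sketch of the behavior described on this page, not the gateway's code; in particular, first-in-first-out ordering within a tier is an assumption, and `drain_order` and `queue_position` are hypothetical names.

```python
from collections import deque


def drain_order(in_flight, submissions):
    """Return the order in which jobs run.

    in_flight: id of the job currently running (it is never preempted).
    submissions: list of (job_id, tier) in arrival order.
    """
    queues = {"interactive": deque(), "batch": deque()}
    for job_id, tier in submissions:
        queues[tier].append(job_id)
    order = [in_flight]  # the running job always finishes first
    while queues["interactive"] or queues["batch"]:
        # Interactive work drains ahead of any queued batch work.
        tier = "interactive" if queues["interactive"] else "batch"
        order.append(queues[tier].popleft())
    return order


def queue_position(job_id, submissions):
    """1-indexed global position among queued jobs (1 means next to run)."""
    queued = drain_order(None, submissions)[1:]
    return queued.index(job_id) + 1
```

With a batch job `b0` running and `b1`, `b2`, `i1` submitted in that order, the interactive job waits for `b0` but then runs before both queued batch jobs.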
queued ─────► running ─────► completed
  │              │
  │              └────► failed (backend error, timeout, gateway restart)
  └────► failed (cancelled via DELETE, or platform-initiated bulk cancel)
If the gateway operator clears the queue (incident recovery, stuck
backend, etc.), every queued job transitions to
status=failed with the recorded error set to a
message the operator chose — by default
cancelled by platform services, but operators may include
incident context
(e.g. large instance OOM, restart pending). Treat this as a
normal job-level failure on the consumer side: surface
error to the user, resubmit if your workflow needs to
retry.
While status=queued, the response includes
queue_position (1-indexed, global; 1 means
“next to run,” counting every queued job ahead of yours regardless of
backend). Once a job starts, the response includes
started_at and queue_position is dropped. On
completion, the verbatim backend response is returned under
result.
While status=running, the response also includes a
phase field describing what the gateway is currently doing
with your job. The bundled dispatchers emit:
| phase | Meaning |
|---|---|
| waiting_for_backend | Worker claimed your job; lifecycle is bringing the on-demand large up (only relevant on a cold start) |
| ollama_dispatching | POST to ollama just sent; phase about to refine to one of the two below within ~2 s |
| ollama_loading_model | The model your call requested is being loaded from disk into RAM (cold-load cost) |
| ollama_generating | The model is resident; ollama is actively producing tokens |
| docling_submitting | Sending your PDF to docling’s async submit endpoint |
| docling_polling | docling accepted; gateway is polling for completion |
| docling_fetching_result | docling reported success; gateway is fetching the JSON result |
phase is a UX affordance — a consumer can render
meaningfully different “loading model…” vs “generating…” states without
guessing — but it’s not a control plane signal. Don’t write logic that
depends on phase transitions firing in a specific order or at all
(e.g. a very fast inference may transition straight to
completed before the watcher even writes
ollama_generating). Treat it as a hint, not a contract.
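One way to consume phase as a hint is a lookup with a generic fallback, so a missing or unfamiliar value never breaks the UI. This is a sketch; the label strings and the `ux_label` helper are made up for illustration, while the phase names come from the table above.

```python
PHASE_LABELS = {
    "waiting_for_backend": "Waking the backend",
    "ollama_dispatching": "Sending request",
    "ollama_loading_model": "Loading model",
    "ollama_generating": "Generating",
    "docling_submitting": "Uploading document",
    "docling_polling": "Converting document",
    "docling_fetching_result": "Fetching result",
}


def ux_label(job):
    """Map a /v1/jobs/{id} response to a display string.

    Never assumes a phase is present or known: a fast job may skip
    phases entirely, and new phase names may appear later.
    """
    status = job.get("status")
    if status == "queued":
        pos = job.get("queue_position")
        return f"Queued (position {pos})" if pos is not None else "Queued"
    if status == "running":
        return PHASE_LABELS.get(job.get("phase"), "Working")
    if status == "completed":
        return "Done"
    if status == "failed":
        return "Failed"
    return "Unknown"
```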
Gateway restarts: queued jobs survive (state is persisted to SQLite).
Any job that was actively running at restart time is marked
failed with an error indicating the gateway restarted — the
partial work is gone, and silently re-running could double-charge a
non-idempotent caller. Resubmit if needed.
Retention. Completed and failed job records are kept
for 72 hours after completed_at, then
deleted by a background sweep that runs every hour. Both knobs are
env-tunable on the gateway service
(GATEWAY_JOB_TTL_SECONDS, default 259200;
GATEWAY_CLEANUP_INTERVAL, default 3600).
Queued and running jobs are never swept regardless of
age — a job that takes a week to drain stays in the queue with
no retention pressure. After sweep, GET /v1/jobs/{id}
returns HTTP 404 with body {"error": "job not found"} —
indistinguishable from an ID that never existed. Poll each result within
72 h of its completed_at, store the result on your side, or
bump the TTL further for batch workloads. The
completed_last_24h / failed_last_24h counters
on /queue are independent of the retention window — they
always report the last 24 h of activity.
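The retention arithmetic can be sketched as follows, using the documented defaults. The helper names are hypothetical, and the upper bound assumes the hourly sweep removes a record within one cleanup interval of its TTL expiring; the page does not state that bound explicitly.

```python
from datetime import datetime, timedelta, timezone

JOB_TTL_SECONDS = 259200         # documented GATEWAY_JOB_TTL_SECONDS default (72 h)
CLEANUP_INTERVAL_SECONDS = 3600  # documented GATEWAY_CLEANUP_INTERVAL default


def sweep_window(completed_at, ttl_s=JOB_TTL_SECONDS, interval_s=CLEANUP_INTERVAL_SECONDS):
    """Return (earliest, latest) instants at which the record may be deleted."""
    done = datetime.fromisoformat(completed_at)
    earliest = done + timedelta(seconds=ttl_s)
    return earliest, earliest + timedelta(seconds=interval_s)


def safe_to_fetch(completed_at, now=None):
    """True while the record is still guaranteed to be inside the TTL."""
    now = now or datetime.now(timezone.utc)
    return now < sweep_window(completed_at)[0]
```

A consumer that cannot poll within that window should copy `result` into its own storage as soon as the job completes.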
The production default is the 72B-class model on the OCI
large host — qwen2.5:72b-instruct-q4_K_M. This is
what consumers should target both in deployed prod and from dev (via the
OCI tunnel). Qwen3 has no 72B size (its ladder jumps 32B → 235B), so the
production target is Qwen2.5-72B; if you see a reference to “qwen3:72b”
anywhere, it’s a docs error.
The 8B small profile
(qwen3:8b-q4_K_M-nothink) is kept for legacy
compatibility and minimal-resource tests only — running ollama
locally on a developer machine that can’t hold a 72B model, or driving a
regression suite where the smaller model’s faster inference is more
important than its quality. Do not depend on the small profile
being available on the OCI large.
Hardcoding a model name in your consumer ties it to one deployment. Parameterize via env var, same way as the gateway URL:
# Production / dev-against-OCI (default — what every consumer
# should target unless they have a specific reason not to)
LLM_GATEWAY_URL=http://llm-gateway:11435 # or 21435 if tunneled
LLM_MODEL=qwen2.5:72b-instruct-q4_K_M
# Local platform-services only (legacy / minimal-resource tests)
LLM_GATEWAY_URL=http://llm-gateway:11435
LLM_MODEL=qwen3:8b-q4_K_M-nothink

The model set actually loaded on each deployment lives in models/profiles/
(large.sh is the production target; small.sh
is legacy). To discover what’s currently loaded on the gateway you’re
pointing at, hit /diagnostics on the gateway —
ollama.ps[].name lists what’s resident in RAM right now,
ollama.tags lists everything on disk. The legacy
https://soggplatform.dedyn.io/models.json proxy of
/api/tags still works for public model discovery.
Submitting a job whose model name is in
available_models but not in loaded_models is
legal and pays a one-time disk→RAM load cost — the
gateway forwards the request to ollama, ollama loads the model from
local disk (the bind-mounted weights volume), and then runs inference.
The HTTP request just blocks while loading. From the consumer’s
perspective the call takes longer than usual, and the phase
field on /v1/jobs/{id} reports
ollama_loading_model during the load window so you can
render a meaningful UX state instead of “still loading…”. Once loaded,
the model stays resident under OLLAMA_KEEP_ALIVE=-1 and
subsequent calls don’t pay the cost.
Submitting a job whose model name is not in
available_models at all is an operator-side
configuration gap — the host this gateway points at hasn’t pulled the
model. Ollama’s behavior in that case varies by version (recent releases
auto-pull, older releases 404). Either way, don’t rely on
auto-pull for a 40 GB production model — a silent 10–20 minute
background download is the wrong default for an inference path.
Pre-flight check by reading /diagnostics.available_models
before submitting; if the model you need isn’t there, that’s an operator
ask, not a consumer retry.
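The three cases above (resident, on disk only, not pulled) reduce to two membership checks on the /diagnostics response. A minimal sketch, with `model_readiness` as a hypothetical helper name:

```python
def model_readiness(diag, model):
    """Classify a model name against a parsed /diagnostics response.

    Returns "resident" (in RAM, fast path), "on_disk" (legal, pays the
    disk-to-RAM load), or "missing" (operator ask, do not retry).
    """
    if model in diag.get("loaded_models", []):
        return "resident"
    if model in diag.get("available_models", []):
        return "on_disk"
    return "missing"
```

A consumer would typically submit for the first two outcomes and surface the third to whoever operates the platform.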
These are calibration numbers for sizing consumer-side timeouts, not SLOs. They depend on the OCI large’s actual shape (currently 20 OCPU / 140 GB, CPU inference — no GPU) and the specific model loaded. Treat them as ballpark; measure against your own workload once you’ve got something in production.
For the production model (qwen2.5:72b-instruct-q4_K_M) —
numbers below are measured, not estimated, from the
2026-05-23 autostatement verify run on the current OCI large host (20
OCPU, 140 GB, CPU inference):
| Scenario | Measured wall-time | What’s dominating |
|---|---|---|
| state=stopped → docling first call ready | ~70 s | OCI boot + reach ready_warm for the docling backend |
| Docling: 9-page, 660 KB PDF | 86–164 s | docling itself; not gateway-side |
| Ollama cold load (72B disk→RAM) | 20 min 6 s | block-volume read speed for 40 GB of weights |
| Ollama warm typical extraction (small labelling prompt) | 3 min 49 s – 4 min 22 s | token generation on CPU |
| Ollama cold first call (load + typical extraction) | ~24 min | 20-min load + ~4-min generate; fits inside the 60-min GATEWAY_OLLAMA_TIMEOUT default with headroom |
The cold-load cost is large enough that the gateway pre-warms
the production model in the background after boot. When the
operator sets
LIFECYCLE_WARM_MODEL=qwen2.5:72b-instruct-q4_K_M on the
gateway, the lifecycle controller kicks a warm probe (a single-token
inference against that model) as a fire-and-forget background task
immediately after backend health passes. From the consumer side this
means:
- state=starting covers only OCI boot + the backend
  health probe (~70 s on the current OCI large), not the 20-min
  cold-load.
- Once state flips to ready, the typical post-boot state
  is ready_cold for a few minutes (model not yet resident)
  and then ready_warm once the background probe completes the
  load.

If LIFECYCLE_WARM_MODEL is not set, no
background probe is spawned and the first consumer job after each cold
start triggers the load inside its own HTTP budget. Same wall-time, just
different attribution.
GET /diagnostics surfaces the probe outcome under
lifecycle.warm:

"lifecycle": {
  ...
  "warm": {
    "model": "qwen2.5:72b-instruct-q4_K_M",
    "last_attempt_at": "2026-05-25T10:14:23.117+00:00",
    "last_outcome": "success",
    "last_duration_seconds": 1187.4,
    "last_error": null,
    "model_loaded_at": "2026-05-25T10:14:23.117+00:00"
  }
}
Fields:

- last_outcome is one of success,
  timeout, transient_error,
  fatal_error, cancelled, error, or
  null (no probe run yet on the current boot). For consumer
  routing, treat anything other than success as “first call
  may pay cold-load” — the specific non-success value is operator-facing
  detail. Existing consumers that gate on
  last_outcome == "success" continue to work; the
  cancelled and error values are additive, not
  breaking.
- last_outcome=success plus model_loaded_at
  set is the strongest signal that the next call is warm.
  loaded_models on the same /diagnostics
  response is the live ground truth; model_loaded_at is a
  “when did we last positively confirm it” timestamp useful for predicting
  freshness over longer windows.
- A non-success outcome (probe timed out, hit a transient
  error, was cancelled by an unrelated lifecycle event, or raised an
  unexpected exception) doesn’t block READY — the controller stays ready
  and the first consumer call paying cold-load is the recovery path.
  Operator signal only.
- fatal_error indicates a misconfigured
  LIFECYCLE_WARM_MODEL (HTTP 4xx — most commonly 404 model
  not found). Consumer jobs using a different model name continue
  to work; consumer jobs using the same model name will see the same
  error. State stays READY in either case.
- cancelled means the probe was cancelled mid-flight by a
  reconcile-driven state transition (OCI reported the large stopped while
  the probe was still running). error means an unexpected
  exception escaped the probe — treat as a bug signal worth surfacing to
  the platform team.

The gateway holds the ollama HTTP request server-side until ollama
actually returns, up to GATEWAY_OLLAMA_TIMEOUT (default 60
min). Your consumer is polling /v1/jobs/{id}, not waiting
on that HTTP call — each poll returns in milliseconds with
status=running and the appropriate phase.
phase=ollama_loading_model is now the normal
signal that you’re paying cold-load (rather than the rare fallback it
was when warm-on-boot blocked READY);
phase=ollama_generating flips when ollama starts producing
tokens. Your HTTP client’s per-request timeout only needs to
cover one poll, not the whole job. The knob that matters for
cold starts is how long your polling loop is willing to wait overall —
for the 72B on the current OCI large, budget up to ~25 min for a cold
first call (load + generate) and use the phase field to
render meaningful state in the meantime.
For pre-flight smoke tests: a 5-token output against a trivial prompt
("Say pong.") completes in well under 30 seconds when
state=ready_warm AND the requested model is in
loaded_models.
If your verify runs produce additional measurements (especially on different output sizes or with different prompts), contribute them back — the table above is anchored on one filing’s worth of data plus the cold-load probe.
import os, time, requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]   # e.g. http://llm-gateway:11435
MODEL = os.environ["LLM_MODEL"]           # e.g. qwen2.5:72b-instruct-q4_K_M
HEADERS = {"X-Caller-Id": "my-editor-assistant"}

submit = requests.post(
    f"{GATEWAY}/v1/jobs",
    headers=HEADERS,
    json={
        "endpoint": "/api/chat",
        "payload": {
            "model": MODEL,
            "messages": [{"role": "user", "content": "Summarize ..."}],
        },
        "priority": "interactive",
    },
)
submit.raise_for_status()
job_id = submit.json()["id"]

while True:
    r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS)
    r.raise_for_status()
    job = r.json()
    if job["status"] == "completed":
        print(job["result"])
        break
    if job["status"] == "failed":
        raise RuntimeError(job["error"])
    time.sleep(5)

A 5-second poll cadence is fine for 5–15 minute jobs. The gateway
forces stream: false on ollama calls — you receive the full
response in result, never tokens-in-flight.
Docling input goes inline in the payload as base64. The gateway does not accept multipart uploads — that simplifies the gateway and keeps the submission shape uniform between backends. For a 10 MB PDF this means ~13 MB of base64 text in the request body, well inside the gateway’s 64 MB body cap.
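The size estimate follows from base64 encoding every 3 input bytes as 4 output characters. A short sketch of the arithmetic; the helper names are hypothetical, reading the 64 MB cap as 64 MiB is an assumption, and the overhead allowance for the rest of the JSON body is an arbitrary illustrative value.

```python
import base64

GATEWAY_BODY_CAP_BYTES = 64 * 1024 * 1024  # documented 64 MB cap, read here as MiB (assumption)


def base64_len(n_bytes):
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


def fits_body_cap(n_bytes, overhead_bytes=4096):
    """Rough check that an inline document fits in one submit request.

    overhead_bytes is a guess at the size of the surrounding JSON.
    """
    return base64_len(n_bytes) + overhead_bytes <= GATEWAY_BODY_CAP_BYTES


# The formula agrees with the standard library encoder.
assert base64_len(1000) == len(base64.b64encode(bytes(1000)))
```

For a 10 MiB file this gives 13,981,016 characters, which matches the "~13 MB" figure above; under these assumptions the largest file that still fits is a little under 48 MiB.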
The gateway forwards your payload to docling’s
/v1/convert/source/async, polls until docling reports the
task terminal, and returns the docling result JSON verbatim in
result.
Important — conversion options must be nested under
options. docling-serve accepts
conversion knobs (do_ocr, to_formats,
do_table_structure, md_page_break_placeholder,
etc.) under an options sub-object, not as
top-level siblings of sources. The gateway is a verbatim
passthrough; if you put options at the top level they reach docling but
get silently ignored, and you’ll see defaults instead — most visibly: no
\f page-break markers in markdown. Confirmed in production
by an autostatement regression on 2026-05-23.
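A consumer can guard against this mistake by normalizing the payload before submitting. The sketch below is one hypothetical approach: `nest_docling_options` is not part of the gateway, and the key set only lists the knobs named in the paragraph above, so it is not exhaustive.

```python
# Knobs named on this page; docling-serve accepts more than these.
DOCLING_OPTION_KEYS = {
    "do_ocr",
    "to_formats",
    "do_table_structure",
    "md_page_break_placeholder",
}


def nest_docling_options(payload):
    """Return a copy of payload with stray top-level knobs moved under options.

    Values already present under options win over stray top-level ones.
    Other top-level fields (sources, include_images) are left untouched.
    """
    fixed = {k: v for k, v in payload.items() if k not in DOCLING_OPTION_KEYS}
    stray = {k: v for k, v in payload.items() if k in DOCLING_OPTION_KEYS}
    if stray:
        fixed["options"] = {**stray, **payload.get("options", {})}
    return fixed
```

Raising an error on stray keys instead of moving them would be the stricter alternative, and arguably the better one in a test suite.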
Images: default is dropped. The gateway interprets a
top-level include_images: bool field on the docling payload
(default false). When false the gateway both (a) sets
docling’s image_export_mode=placeholder to suppress
server-side rendering of images into the result, and (b) strips any
image data that does land in the result before storing it. The strip
nulls result.document.json_content.pages[<n>].image
(the rendered page bitmaps — the main bloat source on real PDFs) and
pictures[].image (semantic-object detections), and replaces
inline base64 data URIs in md_content /
html_content with placeholders. Set
include_images: true if you actually need the image bytes
(e.g. an upcoming vision-model interpreter); be aware that a single PDF
page can produce 1–10 MB of base64 image data at the default 144 dpi and
the gateway stores the full result for the job-TTL window. Background:
plans/docling-image-handling.md.
import base64, os, time, requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]
HEADERS = {"X-Caller-Id": "my-doc-ingest"}

with open("annual-report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("ascii")

submit = requests.post(
    f"{GATEWAY}/v1/jobs",
    headers=HEADERS,
    json={
        "endpoint": "/v1/convert/source/async",
        "backend": "docling",
        "priority": "batch",
        "payload": {
            "sources": [
                {
                    "kind": "file",
                    "base64_string": pdf_b64,
                    "filename": "annual-report.pdf",
                }
            ],
            # Gateway-level knob. Default is false (drop images).
            # Sibling of `sources` / `options`; the gateway extracts
            # it from the payload before forwarding to docling.
            # "include_images": False,
            # All docling conversion knobs go under `options`.
            # Putting them at the top level alongside `sources`
            # results in docling silently using defaults — see the
            # warning above this snippet.
            "options": {
                "to_formats": ["md", "json"],
                "do_ocr": True,
                "do_table_structure": True,
                "md_page_break_placeholder": "\f",
            },
        },
    },
)
submit.raise_for_status()
job_id = submit.json()["id"]

while True:
    r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS)
    r.raise_for_status()
    job = r.json()
    if job["status"] == "completed":
        result = job["result"]  # the verbatim docling result JSON
        break
    if job["status"] == "failed":
        raise RuntimeError(job["error"])
    time.sleep(5)

The docling result shape — markdown, JSON document tree, etc. — is
whatever docling returns at /v1/result/{task_id}. The
gateway is a pass-through for that body. A 5-line regression test that
asserts
result["document"]["md_content"].count("\f") >= n_pages
would catch a future shape regression like the 2026-05-23 incident in
seconds.
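Such a regression check might look like the sketch below. It assumes the form-feed placeholder from the example above and uses the same threshold as the assertion quoted in the text; `count_page_breaks` and `check_page_breaks` are hypothetical names.

```python
def count_page_breaks(result):
    """Count form-feed markers in the markdown of a docling result."""
    return result["document"]["md_content"].count("\f")


def check_page_breaks(result, n_pages):
    """Raise AssertionError when the markdown has fewer markers than expected."""
    found = count_page_breaks(result)
    if found < n_pages:
        raise AssertionError(
            f"expected at least {n_pages} page-break markers, found {found}; "
            "check that conversion knobs are nested under 'options'"
        )
```

Running it against a small fixture PDF with a known page count keeps the check fast enough for every CI run.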
Job size caps. Two timeouts apply, both server-side
and configured on the gateway / docling services in docker-compose.yml:

- GATEWAY_DOCLING_TIMEOUT (default 1200 s) — how long the
  gateway will keep polling docling before giving up on a job and marking
  it failed. Includes submission + polling + result
  fetch.
- DOCLING_SERVE_MAX_DOCUMENT_TIMEOUT (default 900 s) —
  how long any single docling conversion may run before the docling worker
  abandons it. Independent of submit endpoint.

For ~150-page annual-report PDFs, both caps are comfortable. If you have larger documents, raise both before pushing them through.
GET /queue returns a snapshot without affecting the
queue:
{
  "in_flight": 1,
  "queued": { "interactive": 0, "batch": 3 },
  "running_by_tier": { "interactive": 0, "batch": 1 },
  "completed_last_24h": 47,
  "failed_last_24h": 2,
  "timestamp": "2026-05-23T09:25:01.124+00:00"
}

in_flight is 0 or 1 — there is
a single global slot across all backends (see Single global
in-flight slot above). Hit this before a non-urgent submission if
you want to be a good citizen: if queued.batch is deep, you
might choose to defer. timestamp is the gateway’s
wall-clock at snapshot time; use it to correlate samples across time
without relying on the relative
lifecycle.last_activity_age_seconds.
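A "good citizen" check on the snapshot might look like this sketch. The threshold and the helper names `jobs_ahead` and `should_defer` are hypothetical; the field names come from the /queue response above.

```python
def jobs_ahead(snapshot, tier):
    """Jobs that would run before a new submission in the given tier.

    Counts the in-flight job plus queued interactive work, and for a
    batch submission also the queued batch work. Interactive jobs that
    arrive later can still jump ahead of a batch job.
    """
    queued = snapshot.get("queued", {})
    ahead = snapshot.get("in_flight", 0) + queued.get("interactive", 0)
    if tier == "batch":
        ahead += queued.get("batch", 0)
    return ahead


def should_defer(snapshot, max_ahead=3):
    """Illustrative politeness rule for non-urgent batch work."""
    return jobs_ahead(snapshot, "batch") > max_ahead
```

With the sample snapshot shown earlier (one job in flight, three batch jobs queued), a new batch job has four jobs ahead of it and an interactive one has only the in-flight job.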
The response also includes a lifecycle block with the
on-demand large-instance state and (when running) two uptime views:
{
  "lifecycle": {
    "state": "ready",
    "in_flight": 0,
    "last_activity_age_seconds": 12.4,
    "last_error": null,
    "session": {
      "started_at": "2026-05-23T08:12:01.034+00:00",
      "uptime_seconds": 4501.2
    },
    "month_to_date": {
      "wall_hours": 18.42,
      "ocpu_hours": 368.4,
      "gb_hours": 2578.8,
      "month_start": "2026-05-01T00:00:00+00:00",
      "next_reset": "2026-06-01T00:00:00+00:00"
    }
  }
}

session is null when the large is stopped.
month_to_date is informational today — useful if you want
to know roughly how much of the monthly OCI free-tier budget has been
spent so far this UTC month. A future release will surface
remaining-budget on each job-status response and add a dedicated
/usage endpoint with the same data; until then
/queue.lifecycle.month_to_date is the place to look.
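The sample month_to_date figures are consistent with wall hours multiplied by the shape of the large host quoted on this page (20 OCPU, 140 GB). The sketch below reproduces that arithmetic; whether the gateway computes the counters exactly this way is an assumption, and `usage_from_wall_hours` is a hypothetical name.

```python
LARGE_OCPUS = 20       # shape of the current OCI large, as quoted on this page
LARGE_MEMORY_GB = 140


def usage_from_wall_hours(wall_hours, ocpus=LARGE_OCPUS, memory_gb=LARGE_MEMORY_GB):
    """Derive OCPU-hours and GB-hours from instance wall-clock hours."""
    return {
        "ocpu_hours": round(wall_hours * ocpus, 1),
        "gb_hours": round(wall_hours * memory_gb, 1),
    }
```

For the sample's 18.42 wall hours this gives 368.4 OCPU-hours and 2578.8 GB-hours, matching the response shown above.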
For decisions that need to happen before you submit — “is
the backend even up?”, “is the model I want already in RAM, or am I
about to pay a cold-load cost?” — GET /diagnostics gives
you a single derived state field plus a few denormalized
lists:
{
  "state": "ready_warm",
  "warm": true,
  "loaded_models": ["qwen2.5:72b-instruct-q4_K_M"],
  "available_models": [
    "qwen2.5:72b-instruct-q4_K_M",
    "qwen3:8b-q4_K_M-nothink"
  ],
  "ollama": { ... raw per-backend detail ... },
  "docling": { ... raw per-backend detail ... },
  "lifecycle": { ... state, in_flight, last_activity_age_seconds, last_error, warm — same as /queue's lifecycle EXCEPT /queue additionally embeds session / month_to_date / large_shape (the uptime-accounting block) ... },
  "timestamp": "2026-05-23T09:25:02.341+00:00"
}

state is the single source of truth for “what’s the
platform doing right now,” and is one of:
| state | Meaning | Your next call will… | Wait or submit? |
|---|---|---|---|
| stopped | Large instance is idle-asleep (auto-stopped after LIFECYCLE_IDLE_TIMEOUT_SECONDS of no activity, default 2 h) — the next submission will auto-wake it | trigger a ~70 s boot + warm-on-boot loading the 72B (~20 min) in the background; first call pays cold-load if it races the probe | just submit |
| starting | OCI boot in progress (either triggered by a previous submission or by make redeploy) | wait until boot finishes | submit (will queue) |
| stopping | OCI shutdown in progress | wait, or queue and the boot will retrigger | submit (will queue and trigger wake) |
| backend_unhealthy | Instance up, ollama/docling not responding | likely fail; investigate before retry | don’t submit until resolved |
| ready_cold | Backends up, no model in RAM | pay disk→RAM load (~20 min for the 72B on the current OCI large) | submit |
| warming | A model load is in progress (typically the background warm-on-boot probe; can also be an in-flight job’s own cold-load) | wait briefly; first call after ready_warm is fast | submit (queues briefly) |
| ready_warm | Model resident, idle | run immediately | submit (fast) |
| busy | Model resident, currently generating | queue behind the in-flight job | submit (queues briefly) |
| failed | Lifecycle in failed state (e.g. OCI capacity, IAM revoked, OCID deleted) | submissions will likely fail; the lifecycle auto-clears after the failed_cooldown_seconds window and retries | wait, or fix the underlying OCI-side issue |
| disabled | Lifecycle controller off (local dev / non-OCI deploys) | reach the backend directly with no boot logic | submit |
Sleep and wake — the operational model in one
paragraph. The on-demand large instance auto-stops itself after
LIFECYCLE_IDLE_TIMEOUT_SECONDS (default 7200 s / 2 h) of no
gateway dispatches, and auto-wakes on the next job
submission — the gateway intercepts every job submit and, if
the large is stopped, issues an OCI start before
dispatching. Consumers don’t need to explicitly trigger
wake. There is no “operator paused the platform, hold your
submission” state in this API; that scenario is covered by
disabled (controller off entirely) or failed
(lifecycle gave up). For every other state a consumer can safely submit
and let the gateway handle whatever’s needed; the phase
field on /v1/jobs/{id} will then narrate boot → load →
generation.
warm is the convenience boolean:
lifecycle == ready AND at least one model in RAM. True
implies your next ollama inference starts immediately — no boot, no
load. Use warm for the simple “is this a fast path?” check;
use state when you need the full picture.
loaded_models and available_models let you
check whether the specific model you want is resident (fast next call)
or just on disk (will pay the disk→RAM load, measured at ~20 min for the
72B on the current OCI large). If the model you need isn’t in
available_models at all, you’ve targeted the wrong host or
the model hasn’t been refreshed there yet — that’s an operator issue,
not something to retry around.
The raw ollama.ps, ollama.tags,
docling.health, and lifecycle blocks are there
for ops debugging. Each backend reports errors in-band, so a single
broken backend doesn’t blind you to the other — useful when you’re
triaging which side is having a bad day.
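The state table can be collapsed into a small routing helper. This is a sketch under the guidance above, not gateway code: the function name and the returned labels are made up for illustration, and an unrecognized state is deliberately reported as such instead of being guessed at.

```python
# States in which the table says a submission is fine (it may queue or wake the large).
SUBMIT_STATES = {
    "stopped", "starting", "stopping", "ready_cold",
    "warming", "ready_warm", "busy", "disabled",
}
# States in which the table says to hold off.
HOLD_STATES = {"failed", "backend_unhealthy"}


def routing_decision(diag):
    """Turn a parsed /diagnostics response into 'submit_fast', 'submit', 'hold' or 'unknown'."""
    state = diag.get("state")
    if state in HOLD_STATES:
        return "hold"
    if state in SUBMIT_STATES:
        # warm means lifecycle ready and a model resident: no boot, no load.
        if state == "ready_warm" and diag.get("warm"):
            return "submit_fast"
        return "submit"
    return "unknown"
```

Treating "unknown" as "submit and let the job's phase narrate progress" or as "hold" is a consumer-side policy choice; this page does not prescribe one.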
The minimal examples earlier in this page elide a few things that
matter in real consumers — pre-flight /diagnostics check,
phase-aware UX, robust timeout, and explicit failure handling.
Copy-paste the snippet below as a starting point rather than re-derive
it. It assumes the model name and Choosing a priority for your
job sections above have been read.
"""Minimal but production-shaped consumer of the platform gateway.

Reads LLM_GATEWAY_URL and LLM_MODEL from env (no defaults: failing
loudly beats defaulting to a wrong gateway, see #picking-a-model-name).
"""
import os
import time
from typing import Callable, Optional

import requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]  # e.g. http://llm-gateway:11435
MODEL = os.environ["LLM_MODEL"]  # e.g. qwen2.5:72b-instruct-q4_K_M
CALLER = "my-consumer"  # short stable identifier; logged
HEADERS = {"X-Caller-Id": CALLER}
PRIORITY = "batch"  # "interactive" or "batch"; default batch


class GatewayError(RuntimeError):
    """Anything the gateway tells us went wrong. .job is the recorded
    row when the failure was a job-level error (so callers can read
    .job['error'], .job['phase'], etc.); None for pre-submit failures."""

    def __init__(self, message: str, job: Optional[dict] = None):
        super().__init__(message)
        self.job = job


def diagnostics() -> dict:
    r = requests.get(f"{GATEWAY}/diagnostics", timeout=5)
    r.raise_for_status()
    return r.json()


def preflight(model: str) -> dict:
    """Confirm the gateway will accept work for the given model.
    Raises GatewayError if the platform is in a non-submittable state
    or the model isn't available on the target host."""
    d = diagnostics()
    if d["state"] in ("failed", "backend_unhealthy"):
        raise GatewayError(f"platform not ready: state={d['state']}")
    if model not in d["available_models"] and d["available_models"]:
        # Empty list means we couldn't reach ollama (likely instance
        # asleep); submission will trigger wake and we'll learn for
        # sure then. Only fail loudly when we have a definitive list.
        raise GatewayError(
            f"model {model!r} not on disk; available={d['available_models']}"
        )
    return d


def submit(backend: str, endpoint: str, payload: dict) -> str:
    r = requests.post(
        f"{GATEWAY}/v1/jobs",
        headers=HEADERS,
        json={
            "backend": backend,
            "endpoint": endpoint,
            "payload": payload,
            "priority": PRIORITY,
        },
        timeout=10,
    )
    r.raise_for_status()
    return r.json()["id"]


def wait_until_terminal(
    job_id: str,
    *,
    timeout_s: int = 7200,
    poll_interval_s: float = 5.0,
    on_phase_change: Optional[Callable[[Optional[str]], None]] = None,
) -> dict:
    """Poll /v1/jobs/{id} until completed or failed. Calls
    on_phase_change(phase) once per (deduped) transition; the right
    place to render UX state."""
    deadline = time.monotonic() + timeout_s
    last_phase: Optional[str] = "__init__"
    while time.monotonic() < deadline:
        r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}",
                         headers=HEADERS, timeout=10)
        r.raise_for_status()
        job = r.json()
        phase = job.get("phase")
        if phase != last_phase:
            if on_phase_change:
                on_phase_change(phase)
            last_phase = phase
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise GatewayError(f"job failed: {job.get('error')}", job=job)
        time.sleep(poll_interval_s)
    raise GatewayError(
        f"job {job_id} did not terminate within {timeout_s}s; "
        f"last phase was {last_phase!r}"
    )


# Example: chat completion
def chat(messages: list, timeout_s: int = 7200) -> str:
    preflight(MODEL)
    job_id = submit("ollama", "/api/chat", {"model": MODEL, "messages": messages})
    print(f"submitted {job_id}")
    job = wait_until_terminal(
        job_id,
        timeout_s=timeout_s,
        on_phase_change=lambda p: print(f"  phase={p}"),
    )
    return job["result"]["message"]["content"]


if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Say pong."}]))
What this snippet encodes that the minimal examples don’t:
- Pre-flight /diagnostics with a defensive empty-list check. When the instance is asleep, available_models is empty and we don’t fail loudly: submission will wake it and we’ll learn then. Only fail when we have a definitive list that excludes the requested model.
- GatewayError carries the job row (including the final phase), so failure callers can read error.job["phase"] and render “failed during ollama_loading_model” instead of generic “failed.”
- A poll that runs past timeout_s raises GatewayError with the last observed phase, rather than a bare TimeoutError.
Adapt the chat() example to your shape: docling jobs use
submit("docling", "/v1/convert/source/async", {...}),
embeddings use submit("ollama", "/api/embeddings", {...}).
The submit/poll/phase plumbing stays identical.
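Phase-aware failure reporting can be sketched as a small formatter over the job row. This is a minimal sketch, assuming the row exposes phase and error keys the way the snippet above reads them. describe_failure is a hypothetical helper name, and the sample error text in the comment is made up for illustration.

```python
from typing import Optional


def describe_failure(job: Optional[dict]) -> str:
    """Turn the job row attached to a failure into a short message.

    `job` is what the snippet above stores on GatewayError.job: the
    recorded row for job-level failures, None for pre-submit failures.
    """
    if job is None:
        return "failed before submission"
    error = job.get("error") or "no error detail recorded"
    phase = job.get("phase")
    if phase:
        return f"failed during {phase}: {error}"
    return f"failed: {error}"


# Illustrative row only; real rows come from GET /v1/jobs/{id}.
# describe_failure({"phase": "ollama_loading_model", "error": "out of memory"})
```

In a consumer this would typically sit in the except branch around chat(), called as describe_failure(err.job).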
For day-to-day consumer development, run your consumer locally and point it at the OCI gateway via an SSH tunnel. This gives you local hot-reload, editor tooling, and dev container niceties while the LLM and docling work runs on production-sized models in OCI.
Why this is the default rather than local platform-services:
Reach for local platform-services instead when you have a concrete reason: working offline, intentionally testing the small models, or OCI is unavailable. Switching is one env var change.
Two tunnel patterns — pick by where your consumer runs:
Either way, one-time bootstrap is the same: generate key → authorize on OCI → SSH alias → run tunnel. Errors: Troubleshooting.
Two viable topologies depending on where the consumer process runs.
Same key, same OCI authorization, same SSH alias — only the location of
the ssh -L and the consumer’s URL differ.
Option A — tunnel on the WSL host.
ssh -L runs on the WSL distro; the consumer reaches it via
host-gateway. Works cleanly for consumers running
natively on the distro (no container). For consumer dev
containers under Docker Desktop + WSL2, also requires WSL in mirrored
networking mode — otherwise WSL-host-bound ports are invisible to
containers, regardless of 0.0.0.0 bind.
consumer (in container or native) → host-gateway → ssh -L on WSL → OCI:localhost:11435
Option B — tunnel inside the dev container.
ssh -L runs in the consumer container; the consumer hits
its own loopback. Sidesteps host-side networking entirely. Requires
bind-mounting the WSL distro’s ~/.ssh into the container so
the same key/config are available.
consumer (in container) → localhost → ssh -L in same container → OCI:localhost:11435
Each WSL distro (or local machine) that wants the tunnel gets its own key. Don’t copy keys between distros — that defeats their isolation. In the consumer-project distro:
ssh-keygen -t ed25519 -C "<distro-name>-tunnel" -f ~/.ssh/oci_arm
cat ~/.ssh/oci_arm.pub
This is a separate key from the operator’s full-access key documented in ../README.md → SSH access. That key exists for managing the instance; this one is for tunnels only.
The key lives in the WSL distro’s ~/.ssh/ regardless of
which tunnel pattern you use (see Run the
tunnel). Under Option B the dev container bind-mounts that directory
read-only and uses the same key — no copying, no separate identity. “One
key per distro” means per WSL distro.
The public half goes onto OCI’s ~/.ssh/authorized_keys,
prefixed with restrictions so the key cannot be used for anything except
forwarding the gateway port:
command="echo tunnel-only access; exit 1",no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,permitopen="localhost:11435" ssh-ed25519 AAAA... <distro-name>-tunnel
Why not the restrict umbrella keyword?
OpenSSH’s restrict is the documented one-word equivalent of
the four no-* options below. But on OpenSSH 9.6p1 (Ubuntu
24.04, current OCI image) we observed that
restrict,permitopen=... parses correctly — sshd’s debug log
shows the permitopen target listed — yet forwarding gets denied with
administratively prohibited anyway. The expanded form (each
restriction named individually) behaves correctly. Verified on
2026-05-19 during the autostatement onboarding; the pattern in this doc
deliberately avoids restrict so future onboardings don’t
repeat the debug session.
Where this command runs. Not from the consumer
distro — that’s the distro we’re granting access, so it can’t
authorize itself yet. The append runs from a machine that already has
admin SSH to OCI — typically the operator’s
platform-services WSL distro. One-liner from there:
echo 'command="echo tunnel-only access; exit 1",no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,permitopen="localhost:11435" ssh-ed25519 AAAA... <distro-name>-tunnel' \
| ssh oci-arm 'cat >> ~/.ssh/authorized_keys'
Verify it landed:
ssh oci-arm 'tail -1 ~/.ssh/authorized_keys'
ssh oci-arm 'tail -1 ~/.ssh/authorized_keys' | grep -oE 'ssh-ed25519 \S+ \S+$' | ssh-keygen -lf -
The grep -oE step strips the long options prefix (which
contains spaces inside command="...") and isolates the bare
<keytype> <keydata> <comment> so
ssh-keygen -lf - can fingerprint it. Cross-check the
printed fingerprint against the one the consumer distro printed in
Generate the key (step 1). If they match, the key is intact
through paste.
After this one-time bootstrap, the consumer distro talks to OCI directly forever — the operator distro is just the “trusted introducer” that vouches for the new key on day one.
What each option does:
- command="echo tunnel-only access; exit 1": forces every non-tunnel session to print and exit, blocking ssh oci-arm 'cmd' from running arbitrary commands. Port forwarding uses a separate SSH protocol path and runs before the forced command, so forwards still work.
- no-pty: denies PTY allocation (no interactive shell).
- no-agent-forwarding: blocks SSH agent forwarding.
- no-X11-forwarding: blocks X11 forwarding.
- no-user-rc: blocks execution of ~/.ssh/rc at login.
- permitopen="localhost:11435": permits forwarding to llm-gateway only.
If this key leaks, the worst an attacker can do is open forwards to the gateway port. They can’t get a shell, run commands, read files, forward agent or X11, or pivot to any other port.
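Hand-pasting the options prefix is where typos creep in, so the entry can also be assembled programmatically. This is a minimal sketch, assuming a single-line ed25519 public key with a comment that has no spaces. tunnel_only_entry and bare_key are hypothetical helper names; the options string is copied from the authorized_keys line above and deliberately uses the expanded no-* form.

```python
import re

# Restrictions copied from the authorized_keys line on this page.
OPTIONS = (
    'command="echo tunnel-only access; exit 1",'
    "no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,"
    'permitopen="localhost:11435"'
)


def tunnel_only_entry(public_key: str) -> str:
    """Prefix a bare '<keytype> <keydata> <comment>' line with the options."""
    key = public_key.strip()
    if not re.fullmatch(r"ssh-ed25519 \S+ \S+", key):
        raise ValueError("expected 'ssh-ed25519 <keydata> <comment>' on one line")
    return f"{OPTIONS} {key}"


def bare_key(entry: str) -> str:
    """Recover the bare key from an entry (same regex as the grep -oE step)."""
    match = re.search(r"ssh-ed25519 \S+ \S+$", entry.strip())
    if match is None:
        raise ValueError("no ed25519 key found at the end of the entry")
    return match.group(0)
```

Round-tripping a key through tunnel_only_entry and bare_key should return the original line, which is the same property the fingerprint cross-check relies on.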
In the consumer-project distro’s ~/.ssh/config:
Host oci-arm
HostName 79.76.60.187
User ubuntu
IdentityFile ~/.ssh/oci_arm
IdentitiesOnly yes
Pick the option matching where your consumer runs.
permitopen on OCI checks only the remote
destination, so the same authorized_keys line works for either
option.
ssh -L 0.0.0.0:11435:localhost:11435 oci-arm -N
The 0.0.0.0 bind (not the default 127.0.0.1) is what
lets a consumer dev container reach the port via
host-gateway. On Docker Desktop + WSL2 this also
requires the distro to be in mirrored networking mode — in Windows
~/.wslconfig:
[wsl2]
networkingMode=mirrored
then wsl --shutdown from PowerShell. Without mirroring,
WSL-bound ports aren’t visible to containers no matter how the tunnel
binds, and Option B is the right path.
Consumer-side wiring:
extra_hosts:
- "llm-gateway:host-gateway"
environment:
LLM_GATEWAY_URL: "http://llm-gateway:11435"
Bind-mount the WSL distro’s ~/.ssh into the container so
the same key, config, and known_hosts from the bootstrap are available.
In the consumer’s devcontainer.json, add to
mounts:
"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind,readonly"
(${localEnv:HOME} resolves to the WSL distro’s
$HOME. Adjust /home/vscode to your container
user.) Rebuild the container. Then from a terminal inside it:
ssh -L 11435:localhost:11435 oci-arm -N
No 0.0.0.0 needed: the tunnel and the consumer share
the container’s loopback. Consumer-side wiring:
environment:
LLM_GATEWAY_URL: "http://localhost:11435"
No llm-gateway:host-gateway entry in
extra_hosts — the URL points at the container’s own
loopback.
SSH tunnels die with sleep, network changes, or laptop suspend. Under
Option A: wrap with autossh or a
systemd --user unit. Under Option B: re-run in the
container terminal, or add a postStartCommand that
backgrounds it. Skip both until it actually annoys you.
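Before reaching for autossh, a quick probe can tell you whether the tunnel's local listener is still there. This is a minimal sketch using only the standard library; tunnel_is_up is a hypothetical helper name and the defaults assume the Option B loopback bind on 11435. A True result only shows that something accepts TCP connections on the port. It does not prove the gateway behind the tunnel is answering.

```python
import socket


def tunnel_is_up(host: str = "127.0.0.1", port: int = 11435,
                 timeout_s: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out,
        # unreachable) when nothing is listening on the target.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

If this returns False, re-running the ssh -L command is usually the whole fix.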
If you run the tunnel from the same machine that already has a local platform-services compose stack up, port 11435 is already claimed by the local stack. Bind the tunnel to an alternate local port instead — pick something far enough from the regular range to be unambiguous (the 21000 prefix is a useful convention):
ssh -L 21435:localhost:11435 oci-arm -N
This is the convention make tunnel in platform-services
itself uses — it binds *:21435 precisely so the local stack
on :11435 and the OCI tunnel can coexist.
The consumer chooses which to hit by switching its
LLM_GATEWAY_URL env var between the two port numbers:
- http://llm-gateway:11435 → local platform-services
- http://llm-gateway:21435 → OCI via tunnel
The extra_hosts mapping is identical for both; only the
port number discriminates. permitopen on the OCI side only
checks the remote target (localhost:11435), never
the local bind port — so the same authorized_keys line works regardless
of which local port you pick.
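The port convention above is easy to encode once so that no script hard-codes a port. This is a minimal sketch; the target names "local" and "oci" and the helper gateway_url are illustrative, while the host name and port numbers are the ones this page uses.

```python
# Port convention from this page: 11435 is the local stack,
# 21435 is the OCI tunnel when both share one machine.
GATEWAY_PORTS = {"local": 11435, "oci": 21435}


def gateway_url(target: str, host: str = "llm-gateway") -> str:
    """Build the value for LLM_GATEWAY_URL for a named target."""
    if target not in GATEWAY_PORTS:
        raise ValueError(
            f"unknown target {target!r}; expected one of {sorted(GATEWAY_PORTS)}"
        )
    return f"http://{host}:{GATEWAY_PORTS[target]}"
```

Failing on an unknown target follows the same reasoning as the no-defaults rule for LLM_GATEWAY_URL: a loud error beats silently landing on the wrong stack.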
Diagnosing a wrong-port hit. If your consumer points
at :11435 and you have a local stack running, the call
lands on the local stack silently — there’s no error, just stale-looking
data. The cleanest tells in a GET /queue response from the
current OCI gateway:
- lifecycle block present (with state, warm, in_flight, last_activity_age_seconds, …)
- timestamp field at the top level
- in_flight is a scalar integer, NOT an in_flight_by_backend dict
If any of those are missing, you’re hitting a pre-current gateway,
most likely the local stack on :11435. Switch the consumer
URL to :21435 (assuming make tunnel is up) and
re-check.
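The three tells can be checked mechanically against a parsed GET /queue body. This is a minimal sketch under one loudly stated assumption: in_flight is read from inside the lifecycle block, as the first tell suggests, and an in_flight_by_backend key at either level counts as the old shape. The function name and the returned labels are hypothetical.

```python
def missing_gateway_markers(queue_body: dict) -> list:
    """Return the tells of a current gateway that are absent from the body.

    An empty list means the response looks like the current gateway.
    """
    missing = []
    lifecycle = queue_body.get("lifecycle")
    if not isinstance(lifecycle, dict):
        missing.append("lifecycle block")
        lifecycle = {}
    if "timestamp" not in queue_body:
        missing.append("top-level timestamp")
    in_flight = lifecycle.get("in_flight")
    # bool is a subclass of int in Python, so rule it out explicitly.
    is_scalar_int = isinstance(in_flight, int) and not isinstance(in_flight, bool)
    has_old_dict = ("in_flight_by_backend" in queue_body
                    or "in_flight_by_backend" in lifecycle)
    if not is_scalar_int or has_old_dict:
        missing.append("scalar in_flight")
    return missing
```

A non-empty result is the cue to re-check which port LLM_GATEWAY_URL points at.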
How granular the local↔︎OCI switch is depends on how your consumer
reads the env var. If it’s baked in at container start (e.g. fixed in
the compose file), switching means a container restart. If it’s read per
process invocation (common Python pattern:
os.environ.get(...) inside the entry point), you can flip
per command — and even run one process against OCI and another against
local simultaneously from the same container:
LLM_GATEWAY_URL=http://llm-gateway:21435 \
python -m scripts.coverage_probe_pass2 --full
- priority: batch so they don’t preempt interactive prod work.
- autossh.
- extra_hosts: ["llm-gateway:host-gateway"] wiring directly (same Docker daemon, no tunnel needed).
- LLM_GATEWAY_URL to http://llm-gateway:11435 and LLM_MODEL to an 8B tag; output quality will differ from prod, but the wiring exercises end-to-end.

| Symptom | Fix |
|---|---|
| ssh: Could not resolve hostname oci-arm | No Host oci-arm block on this distro. Add the SSH alias, or use -i ~/.ssh/oci_arm ubuntu@<oci-ip> inline. |
| Tunnel opens but container curl times out / connection-refuses | Under Option A: ssh -L bound to 127.0.0.1, or WSL not in mirrored networking mode (WSL-bound ports invisible to containers). Re-bind to 0.0.0.0, enable mirrored mode in ~/.wslconfig, or switch to Option B, the cleanest fix on Docker Desktop + WSL2. |
| administratively prohibited on forward | authorized_keys missing permitopen="localhost:11435", or it uses restrict (broken on OpenSSH 9.6p1; see Authorize the key). |
| GET /v1/jobs/{id} returns HTTP 404 | 72 h retention sweep deleted the record. Poll within 72 h of completed_at, or persist results client-side. See Job lifecycle. |
If your consumer was wired to call ollama or docling directly, the migration is:
- extra_hosts (if it wasn’t already): "llm-gateway:host-gateway". Drop the old ollama: and docling: entries from the same list.
- /v1/jobs with endpoint: "/api/chat" (or whichever ollama path) and your old payload as payload. Then poll GET /v1/jobs/{id} until terminal. Worked example above.
- /v1/result body verbatim under result.
- platform_default), drop the networks: block; pattern A is gone. The extra_hosts: wiring above is the only supported shape.
The streaming-tokens code path (if you had one) doesn’t carry over: the gateway never streams. For 5–15 minute generations nobody is watching tokens form anyway; if you genuinely need streaming, that’s a feature request, not a migration step.
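The request-side half of the migration amounts to wrapping the old payload in a job envelope. This is a minimal sketch that mirrors the JSON body the submit() helper on this page posts to /v1/jobs. to_gateway_job is a hypothetical name, and the validation reflects only the backend and priority values mentioned on this page.

```python
def to_gateway_job(backend: str, endpoint: str, payload: dict,
                   priority: str = "batch") -> dict:
    """Wrap a payload formerly sent straight to a backend as a gateway job."""
    if backend not in ("ollama", "docling"):
        raise ValueError(f"unknown backend {backend!r}")
    if priority not in ("interactive", "batch"):
        raise ValueError(f"unknown priority {priority!r}")
    return {
        "backend": backend,
        "endpoint": endpoint,  # the path you used to call directly
        "payload": payload,    # your old request body, unchanged
        "priority": priority,
    }


# Example: a chat body that used to go straight to ollama's /api/chat.
old_body = {
    "model": "qwen2.5:72b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Say pong."}],
}
job_body = to_gateway_job("ollama", "/api/chat", old_body)
```

The resulting dict is what gets sent as the JSON body of the POST; polling and result handling follow the worked example earlier on this page.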
X-Caller-Id header
is trusted, not verified. Suitable for an internal network of
cooperating consumers; not suitable for untrusted clients.