Connecting to platform-services

This page is for consumer projects — apps and dev containers that need to use the shared LLM runtime (and, eventually, the public edge proxy) provided by platform-services.

How this works

Consumer projects reach platform-services through the llm-gateway — a priority-aware async job queue that is the single entry point for both LLM (ollama) and document parsing (docling) work. The gateway API and wiring are identical regardless of which deployment you point at.

Two deployments, one of them is the default

Deployment	Where it runs	Models loaded	When to use
OCI (default)	Oracle Cloud Ampere A1	Production models (72B-class LLM, VLM-based docling)	Default for everything — production consumer deployments and local consumer development alike.
Local platform-services	Developer machine	Small models only (8B-class)	Special cases: working offline, intentionally exercising the small models, or when OCI is unavailable.

Concretely:

Production consumer deployments run on the same Oracle host as platform-services and reach the gateway in-host.
Local consumer development points at the OCI gateway via an SSH tunnel (covered in Pointing local dev at OCI below). This is the day-to-day dev path — the consumer code runs locally with all the usual editor/hot-reload niceties, but the actual LLM and docling work happens on real production-sized models in OCI.
Running consumers against local platform-services is the fallback. The wiring is identical to OCI (same compose, same env vars, just a different gateway URL), so flipping back and forth is one env var change. Just remember that the small models will produce noticeably worse output on prompts tuned for the large ones — local dev validates wiring, not output quality.

Connecting

Consumers reach platform-services through the gateway via the published host port. Your consumer doesn’t join the platform_default network — it stays decoupled from platform-services’ lifecycle and keeps starting/running when the platform is rebuilding or down.

services:
  app:
    extra_hosts:
      - "llm-gateway:host-gateway"

Behind the scenes, llm-gateway resolves via /etc/hosts to the host gateway IP; TCP connections then land on the published host port (11435).

URLs:

llm-gateway → http://llm-gateway:11435 (private; via host-gateway + tunnel — see Pointing local dev at OCI below). If your machine also runs a local platform-services stack on :11435, the OCI tunnel binds :21435 instead to avoid the port collision — point your consumer at 21435 for OCI work. See Sharing localhost with a local platform-services for the full pattern. make tunnel from platform-services already does this; the trap is consumer URLs pinned to :11435 silently landing on the local stack instead of the tunnel.
platform-docs → https://soggplatform.dedyn.io/ (public TLS edge — reachable from anywhere, no extra_hosts wiring or tunnel needed)

Why the two endpoints have different reachability stories: the gateway is a write-capable LLM dispatch API with no auth/rate-limit yet, so it’s deliberately kept off the public internet. Docs are read-only and safe to publish, so they’re served straight from the public TLS edge — no per-distro setup required.

Don’t hardcode these in app code — parameterize them via environment variables (LLM_GATEWAY_URL, PLATFORM_DOCS_URL). Same code then works under local dev, deployed prod, and the SSH-tunnel variant documented further down.
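A minimal sketch of that parameterization, assuming only the two variable names given above. The `Endpoints` dataclass and the `load_endpoints` helper are hypothetical names used for illustration, not part of any platform-services library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Endpoints:
    gateway_url: str
    docs_url: str


def load_endpoints(env: Optional[Mapping[str, str]] = None) -> Endpoints:
    """Read endpoint URLs from the environment instead of hardcoding them."""
    env = os.environ if env is None else env
    # No default for the gateway: the right port (:11435 or :21435) depends
    # on the deployment, so a missing value should fail loudly (KeyError).
    gateway = env["LLM_GATEWAY_URL"].rstrip("/")
    # The docs URL is public and the same everywhere, so a default is safe.
    docs = env.get("PLATFORM_DOCS_URL", "https://soggplatform.dedyn.io").rstrip("/")
    return Endpoints(gateway_url=gateway, docs_url=docs)
```

Failing on a missing LLM_GATEWAY_URL is a design choice in this sketch: a silent default of :11435 is exactly the trap described above when a local stack shares the machine with the tunnel.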

When platform-services is up, your consumer reaches it. When it’s down, the consumer still starts — gateway calls fail at the call site with a connection error, not at compose-up time. That’s the design intent.

What about direct ollama / docling access?

Not supported. ollama and docling are internal services of platform-services and have no consumer contract. The gateway owns all backend dispatch, including:

Priority-aware queueing (interactive vs batch tier)
Single global in-flight slot across both backends
Job-state persistence across client disconnects and gateway restarts
On-demand large-instance lifecycle: ollama and docling live on a separate OCI instance that the gateway spins up on the next job submission after it’s been idle-stopped, and stops again after LIFECYCLE_IDLE_TIMEOUT_SECONDS of no activity

If you were previously wired to ollama:11434 or docling:5001, see Migrating from direct backend access below.

What’s served

Service	URL
llm-gateway	`http://llm-gateway:11435` (private — see note below)
platform-docs (page)	`https://soggplatform.dedyn.io/`
platform-docs (model list)	`https://soggplatform.dedyn.io/models.json`
platform-docs (deploy info)	`https://soggplatform.dedyn.io/status.json`

Notes:

platform-docs is on a public TLS edge — browsers, external agents, and consumer containers all reach it the same way. No extra_hosts or tunnel needed. The same content is also served on http://platform-docs:5006/ to containers on the OCI host itself (in-host shortcut); local-dev consumers should ignore that and use the public URL.
llm-gateway is not on the public edge — it’s a write-capable LLM dispatch API with no auth yet, so it stays private. Reach it from a consumer container via the host-gateway alias plus either a local platform-services stack or the SSH tunnel documented below.
/models.json is proxied live from ollama’s /api/tags via platform-docs — hit this for model discovery rather than hardcoding model tags in your config.
ollama’s :11434 and docling’s :5001 are not part of the consumer contract. On the OCI deployment they aren’t reachable from consumers at all (large host is on a private VCN). On a local platform-services stack the runtime still publishes them for ops debugging, but consumers should not use them — see What about direct ollama / docling access? above.

Calling the gateway

The gateway exposes a uniform async job API for both backends. Submit a job, get an ID back immediately, poll for the result. Jobs are expected to run roughly 5–15 minutes on the current hardware (longer once we cut over to 72B-class models) — long enough that holding an HTTP connection open across the work is the wrong default.

Verb	Path	Purpose
`POST`	`/v1/jobs`	Submit a job. Returns `{ id, status, tier, backend, queue_position }`.
`GET`	`/v1/jobs`	List recent jobs (ops visibility). Query params: `status` (default `active` = queued+running+failed; or `queued` / `running` / `completed` / `failed` / `all`), `backend` (`ollama` / `docling`), `limit` (1–200, default 20). Returns a compact view per job — `result` payloads are replaced by `result_bytes`, errors trimmed to 500 chars. Use `/v1/jobs/{id}` for the full payload.
`GET`	`/v1/jobs/{id}`	Status + result (or error). Poll this.
`DELETE`	`/v1/jobs/{id}`	Cancel a job. Only valid while `status=queued`.
`GET`	`/queue`	Aggregate stats: in-flight (one global slot), per-tier queue depths, last-24h counters. Includes a `timestamp` field for cross-sample correlation.
`GET`	`/diagnostics`	Live backend health and a single-field `state` summarizing the whole platform (`stopped`, `starting`, `stopping`, `ready_cold`, `warming`, `ready_warm`, `busy`, `failed`, `backend_unhealthy`, `disabled`). Also exposes `warm` (boolean), `loaded_models` (in RAM right now), `available_models` (on disk), plus the raw per-backend probes. Use `state` and `warm` for routing decisions; use the raw fields for ops debug.

Submission shape

{
  "endpoint": "/api/chat",
  "payload": { "model": "qwen2.5:72b-instruct-q4_K_M", "messages": [...] },
  "priority": "interactive",
  "backend": "ollama"
}

endpoint — the upstream path to call on the chosen backend. For ollama: /api/chat, /api/generate, /api/embeddings, /v1/chat/completions, … For docling: /v1/convert/source/async.
payload — the body you would have POSTed directly to the backend, verbatim (with caveats noted per-backend below).
priority — optional. interactive or batch. Omitted → recorded as batch. See Choosing a priority for your job below. An unknown value (e.g. urgent) is rejected with 400 rather than silently falling back to batch.
backend — optional. ollama or docling. Omitted → the gateway derives it from the endpoint path (/v1/convert/* → docling, anything else → ollama). If you supply backend explicitly and it contradicts the endpoint (e.g. endpoint=/api/chat + backend=docling), the submit is rejected with 400 — there is one unambiguous dispatch path per job, and the gateway loud-fails rather than silently routing to the wrong backend.
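The field rules above can be mirrored client-side so that a bad submission fails before it reaches the network. This is a sketch of the documented behaviour, not the gateway's own code: `derive_backend` and `build_submission` are hypothetical helpers, and the prefix match on `/v1/convert/` is an assumption about how the gateway matches `/v1/convert/*`.

```python
from typing import Any, Dict, Optional

VALID_PRIORITIES = {"interactive", "batch"}
VALID_BACKENDS = {"ollama", "docling"}


def derive_backend(endpoint: str) -> str:
    """Documented rule: /v1/convert/* goes to docling, anything else to ollama."""
    return "docling" if endpoint.startswith("/v1/convert/") else "ollama"


def build_submission(
    endpoint: str,
    payload: Dict[str, Any],
    priority: Optional[str] = None,
    backend: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a POST /v1/jobs body, rejecting inputs the gateway would 400."""
    if priority is not None and priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority {priority!r}")
    if backend is not None:
        if backend not in VALID_BACKENDS:
            raise ValueError(f"unknown backend {backend!r}")
        if backend != derive_backend(endpoint):
            raise ValueError(f"backend {backend!r} contradicts endpoint {endpoint!r}")
    body: Dict[str, Any] = {"endpoint": endpoint, "payload": payload}
    # Optional fields are left out entirely when not supplied, so the
    # gateway applies its own defaults (batch tier, derived backend).
    if priority is not None:
        body["priority"] = priority
    if backend is not None:
        body["backend"] = backend
    return body
```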

Choosing a priority for your job

Set priority on the submit body to one of:

interactive — drained ahead of any queued batch work. For human-facing, latency-sensitive calls (editor assistant, chat UI).
batch — drained when no interactive work is queued. For bulk ingestion, indexing, scheduled jobs. This is the default when priority is omitted.

Tier is per-job, not per-caller: a single consumer can mix interactive and batch submissions freely. There is no central caller→tier mapping; the gateway trusts whatever priority value the consumer declares (the trust model assumes a small number of fully-trusted consumers on the internal network).

Set the X-Caller-Id request header to a short, stable identifier for your consumer project (e.g. my-editor-assistant). This is recorded on every job and surfaces in /v1/jobs/{id} and the gateway log — useful for tracing your own traffic and for the operator when debugging — but it does not drive tier assignment. If the header is absent, the gateway falls back to the peer IP as the caller identifier; don’t rely on the fallback, it’s a backstop.

Preemption is not supported. An interactive job submitted while a batch job is in flight waits for that batch job to finish (can be 5–15 min for docling work, longer for big inference). Once the in-flight slot frees, the interactive job dispatches before any queued batch. If sub-batch-duration latency matters for your use case, raise it — the gateway design will need to change.

Single global in-flight slot

The gateway runs one job at a time globally across all backends. An ollama job in flight blocks a docling dispatch and vice versa; two queued jobs of any combination run sequentially.

Why one slot, not one per backend: the on-demand large instance is CPU-bound on both workloads (no GPU on the free-tier ARM shape), so running ollama and docling in parallel halves each one’s throughput. Serial dispatch with priority tiers gives each job full CPU and keeps the queue model simple. Rationale and the trade-off in full: plans/priority-queueing.md Queue shape section.

The queue_position returned on submission counts every queued job ahead of yours globally — 1 means “next to run,” regardless of which backend any of the queued jobs target.
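To make the ordering concrete, here is a toy in-memory model of the behaviour described in this section and the previous one: two tiers, one global slot, no preemption. It is an illustration only, not the gateway's implementation. `ToyQueue` is a hypothetical name, and it assumes that "ahead of yours" means ahead in drain order (all queued interactive jobs, then batch jobs in arrival order).

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class ToyQueue:
    """Toy model: interactive drains before batch, one job in flight globally."""

    def __init__(self) -> None:
        self._tiers: Dict[str, Deque[str]] = {"interactive": deque(), "batch": deque()}
        self.in_flight: Optional[str] = None

    def _drain_order(self) -> List[str]:
        return list(self._tiers["interactive"]) + list(self._tiers["batch"])

    def submit(self, job_id: str, priority: str = "batch") -> int:
        """Queue a job and return its 1-indexed global queue position."""
        self._tiers[priority].append(job_id)
        return self.queue_position(job_id)

    def queue_position(self, job_id: str) -> int:
        return self._drain_order().index(job_id) + 1

    def dispatch_next(self) -> Optional[str]:
        """Start the next job, or return None while the single slot is busy."""
        if self.in_flight is not None:
            return None  # no preemption: the running job keeps the slot
        for tier in ("interactive", "batch"):
            if self._tiers[tier]:
                self.in_flight = self._tiers[tier].popleft()
                return self.in_flight
        return None

    def finish(self) -> None:
        self.in_flight = None
```

In this model an interactive job submitted behind a running batch job gets position 1 but still waits for the slot to free, which is the latency caveat called out under preemption.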

Job lifecycle

queued ─────► running ─────► completed
   │              │
   │              └────► failed (backend error, timeout, gateway restart)
   └────► failed (cancelled via DELETE, or platform-initiated bulk cancel)

If the gateway operator clears the queue (incident recovery, stuck backend, etc.), every queued job transitions to status=failed with the recorded error set to a message the operator chose — by default cancelled by platform services, but operators may include incident context (e.g. large instance OOM, restart pending). Treat this as a normal job-level failure on the consumer side: surface error to the user, resubmit if your workflow needs to retry.

While status=queued, the response includes queue_position (1-indexed, global; 1 means “next to run,” counting every queued job ahead of yours regardless of backend). Once a job starts, the response includes started_at and queue_position is dropped. On completion, the verbatim backend response is returned under result.

While status=running, the response also includes a phase field describing what the gateway is currently doing with your job. The bundled dispatchers emit:

`phase`	Meaning
`waiting_for_backend`	Worker claimed your job; lifecycle is bringing the on-demand large up (only relevant on a cold start)
`ollama_dispatching`	POST to ollama just sent; phase about to refine to one of the two below within ~2 s
`ollama_loading_model`	The model your call requested is being loaded from disk into RAM (cold-load cost)
`ollama_generating`	The model is resident; ollama is actively producing tokens
`docling_submitting`	Sending your PDF to docling’s async submit endpoint
`docling_polling`	docling accepted; gateway is polling for completion
`docling_fetching_result`	docling reported success; gateway is fetching the JSON result

phase is a UX affordance — a consumer can render meaningfully different “loading model…” vs “generating…” states without guessing — but it’s not a control plane signal. Don’t write logic that depends on phase transitions firing in a specific order or at all (e.g. a very fast inference may transition straight to completed before the watcher even writes ollama_generating). Treat it as a hint, not a contract.
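One way to honour the "hint, not a contract" rule is a lookup with a generic fallback, so an unknown or missing phase never breaks the UI. The label strings and the `job_label` helper below are illustrative choices, not something the gateway defines.

```python
from typing import Any, Dict

# Display text per documented phase; wording is a UI choice, not an API value.
PHASE_LABELS = {
    "waiting_for_backend": "Waking the backend",
    "ollama_dispatching": "Sending request",
    "ollama_loading_model": "Loading model",
    "ollama_generating": "Generating",
    "docling_submitting": "Uploading document",
    "docling_polling": "Converting document",
    "docling_fetching_result": "Fetching result",
}


def job_label(job: Dict[str, Any]) -> str:
    """Turn a /v1/jobs/{id} response into a short status line."""
    status = job.get("status")
    if status == "queued":
        position = job.get("queue_position")
        return f"Queued (position {position})" if position is not None else "Queued"
    if status == "running":
        # Missing or unrecognised phases fall back to a generic label.
        return PHASE_LABELS.get(job.get("phase"), "Working")
    return str(status)
```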

Gateway restarts: queued jobs survive (state is persisted to SQLite). Any job that was actively running at restart time is marked failed with an error indicating the gateway restarted — the partial work is gone, and silently re-running could double-charge a non-idempotent caller. Resubmit if needed.

Retention. Completed and failed job records are kept for 72 hours after completed_at, then deleted by a background sweep that runs every hour. Both knobs are env-tunable on the gateway service (GATEWAY_JOB_TTL_SECONDS, default 259200; GATEWAY_CLEANUP_INTERVAL, default 3600). Queued and running jobs are never swept regardless of age — a job that takes a week to drain stays in the queue with no retention pressure. After sweep, GET /v1/jobs/{id} returns HTTP 404 with body {"error": "job not found"} — indistinguishable from an ID that never existed. Poll each result within 72 h of its completed_at, store the result on your side, or bump the TTL further for batch workloads. The completed_last_24h / failed_last_24h counters on /queue are independent of the retention window — they always report the last 24 h of activity.
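A consumer that defers fetching results can compute the conservative fetch deadline itself. The sketch below assumes completed_at uses the same ISO 8601 format as the other timestamps in this page and that the gateway runs with the default TTL; `sweep_eligible_at` and `still_fetchable` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_JOB_TTL_SECONDS = 259200  # documented GATEWAY_JOB_TTL_SECONDS default (72 h)


def sweep_eligible_at(completed_at: str, ttl_seconds: int = DEFAULT_JOB_TTL_SECONDS) -> datetime:
    """Earliest moment the background sweep may delete the record."""
    return datetime.fromisoformat(completed_at) + timedelta(seconds=ttl_seconds)


def still_fetchable(
    completed_at: str,
    now: Optional[datetime] = None,
    ttl_seconds: int = DEFAULT_JOB_TTL_SECONDS,
) -> bool:
    """True while the record is guaranteed not to have been swept yet."""
    now = datetime.now(timezone.utc) if now is None else now
    return now < sweep_eligible_at(completed_at, ttl_seconds)
```

Because the sweep runs on an interval, a record may survive somewhat past this deadline, but a consumer should not count on that.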

Picking a model name

The production default is the 72B-class model on the OCI large host — qwen2.5:72b-instruct-q4_K_M. This is what consumers should target both in deployed prod and from dev (via the OCI tunnel). Qwen3 has no 72B size (its ladder jumps 32B → 235B), so the production target is Qwen2.5-72B; if you see a reference to “qwen3:72b” anywhere, it’s a docs error.

The 8B small profile (qwen3:8b-q4_K_M-nothink) is kept for legacy compatibility and minimal-resource tests only — running ollama locally on a developer machine that can’t hold a 72B model, or driving a regression suite where the smaller model’s faster inference is more important than its quality. Do not depend on the small profile being available on the OCI large.

Hardcoding a model name in your consumer ties it to one deployment. Parameterize via env var, same way as the gateway URL:

# Production / dev-against-OCI (default — what every consumer
# should target unless they have a specific reason not to)
LLM_GATEWAY_URL=http://llm-gateway:11435  # or :21435 when the OCI tunnel shares the machine with a local stack
LLM_MODEL=qwen2.5:72b-instruct-q4_K_M

# Local platform-services only (legacy / minimal-resource tests)
LLM_GATEWAY_URL=http://llm-gateway:11435
LLM_MODEL=qwen3:8b-q4_K_M-nothink

The model set actually loaded on each deployment lives in models/profiles/ (large.sh is the production target; small.sh is legacy). To discover what’s currently loaded on the gateway you’re pointing at, hit /diagnostics on the gateway — ollama.ps[].name lists what’s resident in RAM right now, ollama.tags lists everything on disk. The legacy https://soggplatform.dedyn.io/models.json proxy of /api/tags still works for public model discovery.

What happens when the model isn’t loaded in RAM

Submitting a job whose model name is in available_models but not in loaded_models is legal and pays a one-time disk→RAM load cost: the gateway forwards the request to ollama, ollama loads the model from local disk (the bind-mounted weights volume), and then runs inference. The gateway’s upstream HTTP request to ollama simply blocks while the model loads. From the consumer’s perspective the job takes longer than usual, and the phase field on /v1/jobs/{id} reports ollama_loading_model during the load window, so you can render a specific “loading model…” state instead of a generic spinner. Once loaded, the model stays resident under OLLAMA_KEEP_ALIVE=-1 and subsequent calls don’t pay the cost.

Submitting a job whose model name is not in available_models at all is an operator-side configuration gap — the host this gateway points at hasn’t pulled the model. Ollama’s behavior in that case varies by version (recent releases auto-pull, older releases 404). Either way, don’t rely on auto-pull for a 40 GB production model — a silent 10–20 minute background download is the wrong default for an inference path. Pre-flight check by reading /diagnostics.available_models before submitting; if the model you need isn’t there, that’s an operator ask, not a consumer retry.
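The pre-flight check described in these two paragraphs reduces to a three-way classification over the /diagnostics response. A minimal sketch, where `model_readiness` and its return labels are hypothetical names:

```python
from typing import Any, Dict


def model_readiness(diagnostics: Dict[str, Any], model: str) -> str:
    """Classify a model against a parsed GET /diagnostics response.

    Returns "resident" (in RAM, fast path), "on_disk" (submit is legal but
    pays the disk-to-RAM load), or "missing" (operator ask, do not retry).
    """
    if model in (diagnostics.get("loaded_models") or []):
        return "resident"
    if model in (diagnostics.get("available_models") or []):
        return "on_disk"
    return "missing"
```

A consumer would typically submit for "resident" and "on_disk" (with a longer polling budget for the latter) and surface "missing" to an operator instead of retrying.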

Latency expectations

These are calibration numbers for sizing consumer-side timeouts, not SLOs. They depend on the OCI large’s actual shape (currently 20 OCPU / 140 GB, CPU inference — no GPU) and the specific model loaded. Treat them as ballpark; measure against your own workload once you’ve got something in production.

For the production model (qwen2.5:72b-instruct-q4_K_M) — numbers below are measured, not estimated, from the 2026-05-23 autostatement verify run on the current OCI large host (20 OCPU, 140 GB, CPU inference):

Scenario	Measured wall-time	What’s dominating
`state=stopped` → docling first call ready	~70 s	OCI boot + reach `ready_warm` for the docling backend
Docling: 9-page, 660 KB PDF	86–164 s	docling itself; not gateway-side
Ollama cold load (72B disk→RAM)	20 min 6 s	block-volume read speed for 40 GB of weights
Ollama warm typical extraction (small labelling prompt)	3 min 49 s – 4 min 22 s	token generation on CPU
Ollama cold first call (load + typical extraction)	~24 min	20-min load + ~4-min generate; fits inside the 60-min `GATEWAY_OLLAMA_TIMEOUT` default with headroom

The cold-load cost is large enough that the gateway pre-warms the production model in the background after boot. When the operator sets LIFECYCLE_WARM_MODEL=qwen2.5:72b-instruct-q4_K_M on the gateway, the lifecycle controller kicks a warm probe (a single-token inference against that model) as a fire-and-forget background task immediately after backend health passes. From the consumer side this means:

state=starting covers only OCI boot + the backend health probe (~70 s on the current OCI large), not the 20-min cold-load.
After state flips to ready, the model is typically not yet resident: expect ready_cold (or warming while the background probe’s load is in progress) for roughly 20 minutes on the current OCI large, and then ready_warm once the probe completes the load.
A consumer call that arrives during the warm window doesn’t wait for the probe to finish. It dispatches immediately, and either (a) ollama already has the model resident and runs at warm speed, or (b) it races the probe and both wait on the same per-model load inside ollama. Either way, the cold-load cost is paid once per cold start and not duplicated.

If LIFECYCLE_WARM_MODEL is not set, no background probe is spawned and the first consumer job after each cold start triggers the load inside its own HTTP budget. Same wall-time, just different attribution.

GET /diagnostics surfaces the probe outcome under lifecycle.warm:

"lifecycle": {
  ...
  "warm": {
    "model": "qwen2.5:72b-instruct-q4_K_M",
    "last_attempt_at": "2026-05-25T10:14:23.117+00:00",
    "last_outcome": "success",
    "last_duration_seconds": 1187.4,
    "last_error": null,
    "model_loaded_at": "2026-05-25T10:14:23.117+00:00"
  }
}

Fields:

last_outcome is one of success, timeout, transient_error, fatal_error, cancelled, error, or null (no probe run yet on the current boot). For consumer routing, treat anything other than success as “first call may pay cold-load” — the specific non-success value is operator-facing detail. Existing consumers that gate on last_outcome == "success" continue to work; the cancelled and error values are additive, not breaking.
last_outcome=success plus model_loaded_at set is the strongest signal that the next call is warm. loaded_models on the same /diagnostics response is the live ground truth; model_loaded_at is a “when did we last positively confirm it” timestamp useful for predicting freshness over longer windows.
A non-success outcome (probe timed out, hit a transient error, was cancelled by an unrelated lifecycle event, or raised an unexpected exception) doesn’t block READY — the controller stays ready and the first consumer call paying cold-load is the recovery path. Operator signal only.
fatal_error indicates a misconfigured LIFECYCLE_WARM_MODEL (HTTP 4xx — most commonly 404 model not found). Consumer jobs using a different model name continue to work; consumer jobs using the same model name will see the same error. State stays READY in either case.
cancelled means the probe was cancelled mid-flight by a reconcile-driven state transition (OCI reported the large stopped while the probe was still running). error means an unexpected exception escaped the probe — treat as a bug signal worth surfacing to the platform team.
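For consumer routing, the field semantics above collapse to three cases. The sketch below is one possible reading of them; `interpret_warm_probe` and its return labels are hypothetical, and the input is a parsed /diagnostics response.

```python
from typing import Any, Dict


def interpret_warm_probe(diagnostics: Dict[str, Any], model: str) -> str:
    """Summarise lifecycle.warm for a consumer that wants to call `model`.

    "warm_confirmed":    probe succeeded for this model and recorded a load time
    "expect_error":      probe hit fatal_error on this same model name
    "may_pay_cold_load": everything else, including a probe for another model
    """
    warm = (diagnostics.get("lifecycle") or {}).get("warm") or {}
    same_model = warm.get("model") == model
    outcome = warm.get("last_outcome")
    if same_model and outcome == "fatal_error":
        return "expect_error"
    if same_model and outcome == "success" and warm.get("model_loaded_at"):
        return "warm_confirmed"
    # timeout, transient_error, cancelled, error, and null are operator
    # detail; for routing they all mean the first call may pay cold-load.
    return "may_pay_cold_load"
```

Since loaded_models is the live ground truth, a consumer that needs certainty should still check it on the same response.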

Polling patience on cold starts

The gateway holds the ollama HTTP request server-side until ollama actually returns, up to GATEWAY_OLLAMA_TIMEOUT (default 60 min). Your consumer is polling /v1/jobs/{id}, not waiting on that HTTP call: each poll returns in milliseconds with status=running and the appropriate phase. phase=ollama_loading_model is now the normal signal that you’re paying cold-load (rather than the rare fallback it was when warm-on-boot blocked READY); phase flips to ollama_generating when ollama starts producing tokens. Your HTTP client’s per-request timeout only needs to cover one poll, not the whole job. The knob that matters for cold starts is how long your polling loop is willing to wait overall: for the 72B on the current OCI large, budget up to ~25 min for a cold first call (load + generate) and use the phase field to render meaningful state in the meantime.
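A polling loop with an overall budget might look like the sketch below. `wait_for_job`, `JobFailed`, and `JobTimedOut` are hypothetical names; the fetch function, clock, and sleep are injected so the loop can be exercised without a gateway. The 25-minute default follows the cold-start figure above and the 5-second cadence is an assumption.

```python
import time
from typing import Any, Callable, Dict


class JobFailed(RuntimeError):
    """The gateway reported status=failed for the job."""


class JobTimedOut(TimeoutError):
    """The consumer's own overall polling budget ran out."""


def wait_for_job(
    fetch: Callable[[], Dict[str, Any]],
    budget_seconds: float = 25 * 60,
    poll_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    on_update: Callable[[Dict[str, Any]], None] = lambda job: None,
) -> Any:
    """Poll until the job completes, fails, or the overall budget is spent."""
    deadline = clock() + budget_seconds
    while True:
        job = fetch()  # one GET /v1/jobs/{id}, already parsed to a dict
        if job["status"] == "completed":
            return job["result"]
        if job["status"] == "failed":
            raise JobFailed(job.get("error", "unknown error"))
        on_update(job)  # e.g. render queue_position or phase
        if clock() >= deadline:
            raise JobTimedOut(f"job still {job['status']} after {budget_seconds} s")
        sleep(poll_seconds)
```

With the requests library, `fetch` could be `lambda: requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS, timeout=10).json()`. Running out of budget only stops the consumer from waiting; the job itself keeps running on the gateway.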

For pre-flight smoke tests: a 5-token output against a trivial prompt ("Say pong.") completes in well under 30 seconds when state=ready_warm AND the requested model is in loaded_models.

If your verify runs produce additional measurements (especially on different output sizes or with different prompts), contribute them back — the table above is anchored on one filing’s worth of data plus the cold-load probe.

Submitting an ollama job

import os, time, requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]  # e.g. http://llm-gateway:11435
MODEL = os.environ["LLM_MODEL"]          # e.g. qwen2.5:72b-instruct-q4_K_M
HEADERS = {"X-Caller-Id": "my-editor-assistant"}

submit = requests.post(
    f"{GATEWAY}/v1/jobs",
    headers=HEADERS,
    json={
        "endpoint": "/api/chat",
        "payload": {
            "model": MODEL,
            "messages": [{"role": "user", "content": "Summarize ..."}],
        },
        "priority": "interactive",
    },
)
submit.raise_for_status()
job_id = submit.json()["id"]

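# Poll until the job reaches a terminal state. Each poll returns quickly,
# so the per-request timeout only has to cover a single poll.
while True:
    r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS, timeout=30)
    r.raise_for_status()
    job = r.json()
    if job["status"] == "completed":
        print(job["result"])  # the verbatim ollama response
        break
    if job["status"] == "failed":
        raise RuntimeError(job["error"])
    time.sleep(5)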

A 5-second poll cadence is fine for 5–15 minute jobs. The gateway forces stream: false on ollama calls — you receive the full response in result, never tokens-in-flight.

Submitting a docling job

Docling input goes inline in the payload as base64. The gateway does not accept multipart uploads — that simplifies the gateway and keeps the submission shape uniform between backends. For a 10 MB PDF this means ~13 MB of base64 text in the request body, well inside the gateway’s 64 MB body cap.
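The size arithmetic is easy to check before sending: padded base64 emits 4 output characters for every 3 input bytes, rounded up. In the sketch below the helper names are hypothetical, the 64 MB cap is read as 64 MiB, and the 4 KiB allowance for the surrounding JSON is a guess; all three are assumptions.

```python
import base64

GATEWAY_BODY_CAP_BYTES = 64 * 1024 * 1024  # documented 64 MB cap, read as MiB (assumption)


def base64_size(raw_bytes: int) -> int:
    """Length of standard padded base64 for an input of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def fits_body_cap(
    raw_bytes: int,
    overhead_bytes: int = 4096,  # rough allowance for the JSON around the string
    cap_bytes: int = GATEWAY_BODY_CAP_BYTES,
) -> bool:
    """True when the encoded document plus JSON overhead stays under the cap."""
    return base64_size(raw_bytes) + overhead_bytes <= cap_bytes


# Sanity check against the real encoder for a small input.
assert base64_size(5) == len(base64.b64encode(b"x" * 5))
```

For a 10 MiB file this gives 13,981,016 characters, which matches the ~13 MB figure above. By the same arithmetic, a single source file larger than roughly 48 MiB would not fit under a 64 MiB cap.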

The gateway forwards your payload to docling’s /v1/convert/source/async, polls until docling reports the task terminal, and returns the docling result JSON verbatim in result.

Important — conversion options must be nested under options. docling-serve accepts conversion knobs (do_ocr, to_formats, do_table_structure, md_page_break_placeholder, etc.) under an options sub-object, not as top-level siblings of sources. The gateway is a verbatim passthrough; if you put options at the top level they reach docling but get silently ignored, and you’ll see defaults instead — most visibly: no \f page-break markers in markdown. Confirmed in production by an autostatement regression on 2026-05-23.
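Because misplaced knobs are silently ignored, a consumer-side guard can catch them before submission. The sketch below only knows the four knob names listed above (docling accepts more, so the set is deliberately incomplete), and `misplaced_options` and `nest_options` are hypothetical helpers.

```python
from typing import Any, Dict, List

# Only the knobs named in this page; extend it for the ones you actually use.
KNOWN_OPTION_KEYS = {
    "do_ocr",
    "to_formats",
    "do_table_structure",
    "md_page_break_placeholder",
}


def misplaced_options(payload: Dict[str, Any]) -> List[str]:
    """Conversion knobs found at the top level instead of under `options`."""
    return sorted(key for key in payload if key in KNOWN_OPTION_KEYS)


def nest_options(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload with top-level knobs moved under `options`."""
    fixed = {k: v for k, v in payload.items() if k not in KNOWN_OPTION_KEYS}
    options = dict(payload.get("options") or {})
    for key in misplaced_options(payload):
        # A value already nested under `options` wins over the stray one.
        options.setdefault(key, payload[key])
    if options:
        fixed["options"] = options
    return fixed
```

Raising on a non-empty `misplaced_options` result in tests is arguably safer than silently repairing the payload in production, since it points at the call site that built it wrong.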

Images: default is dropped. The gateway interprets a top-level include_images: bool field on the docling payload (default false). When false the gateway both (a) sets docling’s image_export_mode=placeholder to suppress server-side rendering of images into the result, and (b) strips any image data that does land in the result before storing it. The strip nulls result.document.json_content.pages[<n>].image (the rendered page bitmaps — the main bloat source on real PDFs) and pictures[].image (semantic-object detections), and replaces inline base64 data URIs in md_content / html_content with placeholders. Set include_images: true if you actually need the image bytes (e.g. an upcoming vision-model interpreter); be aware that a single PDF page can produce 1–10 MB of base64 image data at the default 144 dpi and the gateway stores the full result for the job-TTL window. Background: plans/docling-image-handling.md.

import base64, os, time, requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]
HEADERS = {"X-Caller-Id": "my-doc-ingest"}

with open("annual-report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("ascii")

submit = requests.post(
    f"{GATEWAY}/v1/jobs",
    headers=HEADERS,
    json={
        "endpoint": "/v1/convert/source/async",
        "backend": "docling",
        "priority": "batch",
        "payload": {
            "sources": [
                {
                    "kind": "file",
                    "base64_string": pdf_b64,
                    "filename": "annual-report.pdf",
                }
            ],
            # Gateway-level knob. Default is false (drop images).
            # Sibling of `sources` / `options`; the gateway extracts
            # it from the payload before forwarding to docling.
            # "include_images": False,
            # All docling conversion knobs go under `options`.
            # Putting them at the top level alongside `sources`
            # results in docling silently using defaults — see the
            # warning above this snippet.
            "options": {
                "to_formats": ["md", "json"],
                "do_ocr": True,
                "do_table_structure": True,
                "md_page_break_placeholder": "\f",
            },
        },
    },
)
submit.raise_for_status()
job_id = submit.json()["id"]

# Poll until the job reaches a terminal state; the timeout covers one poll only.
while True:
    r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS, timeout=30)
    r.raise_for_status()
    job = r.json()
    if job["status"] == "completed":
        result = job["result"]  # the verbatim docling result JSON
        break
    if job["status"] == "failed":
        raise RuntimeError(job["error"])
    time.sleep(5)

The docling result shape (markdown, JSON document tree, etc.) is whatever docling returns at /v1/result/{task_id}. The gateway is a pass-through for that body. A 5-line regression test that asserts result["document"]["md_content"].count("\f") >= n_pages - 1 (page breaks fall between pages, so an n-page document has n - 1 of them) would catch a future shape regression like the 2026-05-23 incident in seconds.

Job size caps. Two timeouts apply, both server-side and configured on the gateway / docling services in docker-compose.yml:

GATEWAY_DOCLING_TIMEOUT (default 1200 s) — how long the gateway will keep polling docling before giving up on a job and marking it failed. Includes submission + polling + result fetch.
DOCLING_SERVE_MAX_DOCUMENT_TIMEOUT (default 900 s) — how long any single docling conversion may run before the docling worker abandons it. Independent of submit endpoint.

For ~150-page annual-report PDFs, both caps are comfortable. If you have larger documents, raise both before pushing them through.

Checking platform load

GET /queue returns a snapshot without affecting the queue:

{
  "in_flight": 1,
  "queued": { "interactive": 0, "batch": 3 },
  "running_by_tier": { "interactive": 0, "batch": 1 },
  "completed_last_24h": 47,
  "failed_last_24h": 2,
  "timestamp": "2026-05-23T09:25:01.124+00:00"
}

in_flight is 0 or 1 — there is a single global slot across all backends (see Single global in-flight slot above). Hit this before a non-urgent submission if you want to be a good citizen: if queued.batch is deep, you might choose to defer. timestamp is the gateway’s wall-clock at snapshot time; use it to correlate samples across time without relying on the relative lifecycle.last_activity_age_seconds.
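A "good citizen" check for non-urgent work could be as small as the sketch below. The `should_defer` name and the threshold of 5 are arbitrary illustrations. It counts both tiers because a newly submitted batch job runs behind every queued interactive and batch job.

```python
from typing import Any, Dict


def should_defer(queue_snapshot: Dict[str, Any], max_jobs_ahead: int = 5) -> bool:
    """Decide whether to postpone a non-urgent batch submission.

    queue_snapshot is a parsed GET /queue response. The in-flight job is
    counted too, since it also has to finish before a new batch job runs.
    """
    queued = queue_snapshot.get("queued") or {}
    ahead = (
        queued.get("interactive", 0)
        + queued.get("batch", 0)
        + queue_snapshot.get("in_flight", 0)
    )
    return ahead >= max_jobs_ahead
```

With the sample snapshot above (one job in flight, three batch jobs queued) four jobs are ahead, so the default threshold of 5 would not defer.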

The response also includes a lifecycle block with the on-demand large-instance state and (when running) two uptime views:

{
  "lifecycle": {
    "state": "ready",
    "in_flight": 0,
    "last_activity_age_seconds": 12.4,
    "last_error": null,
    "session": {
      "started_at": "2026-05-23T08:12:01.034+00:00",
      "uptime_seconds": 4501.2
    },
    "month_to_date": {
      "wall_hours": 18.42,
      "ocpu_hours": 368.4,
      "gb_hours": 2578.8,
      "month_start": "2026-05-01T00:00:00+00:00",
      "next_reset": "2026-06-01T00:00:00+00:00"
    }
  }
}

session is null when the large is stopped. month_to_date is informational today — useful if you want to know roughly how much of the monthly OCI free-tier budget has been spent so far this UTC month. A future release will surface remaining-budget on each job-status response and add a dedicated /usage endpoint with the same data; until then /queue.lifecycle.month_to_date is the place to look.
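The sample month_to_date values are consistent with simple multiplication by the instance shape quoted under Latency expectations (20 OCPU, 140 GB): 18.42 wall hours × 20 = 368.4 OCPU-hours and 18.42 × 140 = 2578.8 GB-hours. That relationship is inferred from the example numbers, not a documented formula, so treat this sketch (with its hypothetical `usage_from_wall_hours` helper) as a sanity check only.

```python
from typing import Dict

# Shape of the current OCI large as quoted in this page; adjust if it changes.
LARGE_OCPUS = 20
LARGE_MEMORY_GB = 140


def usage_from_wall_hours(
    wall_hours: float,
    ocpus: int = LARGE_OCPUS,
    memory_gb: int = LARGE_MEMORY_GB,
) -> Dict[str, float]:
    """Derive OCPU-hours and GB-hours from wall-clock hours and the shape."""
    return {
        "wall_hours": wall_hours,
        "ocpu_hours": wall_hours * ocpus,
        "gb_hours": wall_hours * memory_gb,
    }
```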

Knowing when the platform is ready for your call

For decisions that need to happen before you submit — “is the backend even up?”, “is the model I want already in RAM, or am I about to pay a cold-load cost?” — GET /diagnostics gives you a single derived state field plus a few denormalized lists:

{
  "state": "ready_warm",
  "warm": true,
  "loaded_models": ["qwen2.5:72b-instruct-q4_K_M"],
  "available_models": [
    "qwen2.5:72b-instruct-q4_K_M",
    "qwen3:8b-q4_K_M-nothink"
  ],
  "ollama": { ... raw per-backend detail ... },
  "docling": { ... raw per-backend detail ... },
  "lifecycle": { ... state, in_flight, last_activity_age_seconds, last_error, warm — same as /queue's lifecycle EXCEPT /queue additionally embeds session / month_to_date / large_shape (the uptime-accounting block) ... },
  "timestamp": "2026-05-23T09:25:02.341+00:00"
}

state is the single source of truth for “what’s the platform doing right now,” and is one of:

`state`	Meaning	Your next call will…	Wait or submit?
`stopped`	Large instance is idle-asleep (auto-stopped after `LIFECYCLE_IDLE_TIMEOUT_SECONDS` of no activity, default 2 h) — the next submission will auto-wake it	trigger a ~70 s boot + warm-on-boot loading the 72B (~20 min) in the background; first call pays cold-load if it races the probe	just submit
`starting`	OCI boot in progress (either triggered by a previous submission or by `make redeploy`)	wait until boot finishes	submit (will queue)
`stopping`	OCI shutdown in progress	wait, or queue and the boot will retrigger	submit (will queue and trigger wake)
`backend_unhealthy`	Instance up, ollama/docling not responding	likely fail; investigate before retry	don’t submit until resolved
`ready_cold`	Backends up, no model in RAM	pay disk→RAM load (~20 min measured for 72B)	submit
`warming`	A model load is in progress (typically the background warm-on-boot probe; can also be an in-flight job’s own cold-load)	wait briefly; first call after `ready_warm` is fast	submit (queues briefly)
`ready_warm`	Model resident, idle	run immediately	submit (fast)
`busy`	Model resident, currently generating	queue behind the in-flight job	submit (queues briefly)
`failed`	Lifecycle in failed state (e.g. OCI capacity, IAM revoked, OCID deleted)	submissions will likely fail; the lifecycle auto-clears after the `failed_cooldown_seconds` window and retries	wait, or fix the underlying OCI-side issue
`disabled`	Lifecycle controller off (local dev / non-OCI deploys)	reach the backend directly with no boot logic	submit
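
If you want the table above as code, a minimal sketch follows. It relies only on the state strings listed in the table; the function name next_action and the three action labels ("submit", "wait", "hold") are made up for illustration, and treating an unrecognised state as "hold" is a conservative assumption rather than documented gateway behaviour.

```python
from typing import Dict

# Mirrors the "Wait or submit?" column of the state table.
# "hold" = don't submit until resolved; "wait" = wait for the cooldown
# or fix the OCI-side issue; "submit" = the gateway will handle the rest.
_ACTIONS: Dict[str, str] = {
    "stopped": "submit",
    "starting": "submit",
    "stopping": "submit",
    "backend_unhealthy": "hold",
    "ready_cold": "submit",
    "warming": "submit",
    "ready_warm": "submit",
    "busy": "submit",
    "failed": "wait",
    "disabled": "submit",
}


def next_action(state: str) -> str:
    """Map a /diagnostics state to a consumer action.

    Unknown states fall back to "hold" (an assumption: be conservative
    when the gateway reports something this table doesn't know about).
    """
    return _ACTIONS.get(state, "hold")
```

A caller would typically feed it the state field of the /diagnostics body and only submit when the answer is "submit".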

Sleep and wake — the operational model in one paragraph. The on-demand large instance auto-stops itself after LIFECYCLE_IDLE_TIMEOUT_SECONDS (default 7200 s / 2 h) of no gateway dispatches, and auto-wakes on the next job submission: the gateway intercepts every job submit and, if the large instance is stopped, issues an OCI start before dispatching. Consumers don’t need to explicitly trigger wake. There is no “operator paused the platform, hold your submission” state in this API; that scenario is covered by disabled (controller off entirely) or failed (lifecycle gave up). In every state except failed and backend_unhealthy, a consumer can safely submit and let the gateway handle whatever’s needed; the phase field on /v1/jobs/{id} will then narrate boot → load → generation.

warm is the convenience boolean: lifecycle == ready AND at least one model in RAM. True implies your next ollama inference starts immediately — no boot, no load. Use warm for the simple “is this a fast path?” check; use state when you need the full picture.

loaded_models and available_models let you check whether the specific model you want is resident (fast next call) or just on disk (will pay ~1–3 min load). If the model you need isn’t in available_models at all, you’ve targeted the wrong host or the model hasn’t been refreshed there yet — that’s an operator issue, not something to retry around.
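
One way to turn those two lists into a decision, sketched under the assumption that /diagnostics returns the shape shown above. The helper name model_residency and its four return labels are hypothetical; the "unknown" case covers an empty available_models list, which means ollama couldn't be reached (likely because the instance is asleep).

```python
def model_residency(diag: dict, model: str) -> str:
    """Classify a model against a /diagnostics body.

    Returns one of (labels are illustrative, not part of the API):
      "resident" - in loaded_models, next call is fast
      "on_disk"  - in available_models only, next call pays the load
      "unknown"  - available_models is empty, so nothing can be concluded
      "missing"  - a definitive list exists and excludes the model
    """
    loaded = diag.get("loaded_models") or []
    available = diag.get("available_models") or []
    if model in loaded:
        return "resident"
    if model in available:
        return "on_disk"
    if not available:
        return "unknown"
    return "missing"
```

Only "missing" is worth failing on; it is the operator-issue case described above.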

The raw ollama.ps, ollama.tags, docling.health, and lifecycle blocks are there for ops debugging. Each backend reports errors in-band, so a single broken backend doesn’t blind you to the other — useful when you’re triaging which side is having a bad day.

Reference: end-to-end consumer pattern

The minimal examples earlier in this page elide a few things that matter in real consumers: a pre-flight /diagnostics check, phase-aware UX, a robust timeout, and explicit failure handling. Copy the snippet below as a starting point rather than re-deriving it. It assumes you have read the model name and Choosing a priority for your job sections above.

"""Minimal but production-shaped consumer of the platform gateway.

Reads LLM_GATEWAY_URL and LLM_MODEL from env (no defaults — failing
loudly beats defaulting to a wrong gateway, see #picking-a-model-name).
"""
import os
import time
from typing import Callable, Optional

import requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]    # e.g. http://llm-gateway:11435
MODEL = os.environ["LLM_MODEL"]            # e.g. qwen2.5:72b-instruct-q4_K_M
CALLER = "my-consumer"                     # short stable identifier; logged
HEADERS = {"X-Caller-Id": CALLER}
PRIORITY = "batch"                         # "interactive" or "batch"; default batch


class GatewayError(RuntimeError):
    """Anything the gateway tells us went wrong. .job is the recorded
    row when the failure was a job-level error (so callers can read
    .job['error'], .job['phase'], etc.); None for pre-submit failures."""
    def __init__(self, message: str, job: Optional[dict] = None):
        super().__init__(message)
        self.job = job


def diagnostics() -> dict:
    r = requests.get(f"{GATEWAY}/diagnostics", timeout=5)
    r.raise_for_status()
    return r.json()


def preflight(model: str) -> dict:
    """Confirm the gateway will accept work for the given model.
    Raises GatewayError if the platform is in a non-submittable state
    or the model isn't available on the target host."""
    d = diagnostics()
    if d["state"] in ("failed", "backend_unhealthy"):
        raise GatewayError(f"platform not ready: state={d['state']}")
    if model not in d["available_models"] and d["available_models"]:
        # Empty list means we couldn't reach ollama (likely instance
        # asleep); submission will trigger wake and we'll learn for
        # sure then. Only fail loudly when we have a definitive list.
        raise GatewayError(
            f"model {model!r} not on disk; available={d['available_models']}"
        )
    return d


def submit(backend: str, endpoint: str, payload: dict) -> str:
    r = requests.post(
        f"{GATEWAY}/v1/jobs",
        headers=HEADERS,
        json={
            "backend": backend,
            "endpoint": endpoint,
            "payload": payload,
            "priority": PRIORITY,
        },
        timeout=10,
    )
    r.raise_for_status()
    return r.json()["id"]


def wait_until_terminal(
    job_id: str,
    *,
    timeout_s: int = 7200,
    poll_interval_s: float = 5.0,
    on_phase_change: Optional[Callable[[Optional[str]], None]] = None,
) -> dict:
    """Poll /v1/jobs/{id} until completed or failed. Calls
    on_phase_change(phase) once per (deduped) transition — the right
    place to render UX state."""
    deadline = time.monotonic() + timeout_s
    last_phase: Optional[str] = "__init__"
    while time.monotonic() < deadline:
        r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}",
                         headers=HEADERS, timeout=10)
        r.raise_for_status()
        job = r.json()
        phase = job.get("phase")
        if phase != last_phase:
            if on_phase_change:
                on_phase_change(phase)
            last_phase = phase
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise GatewayError(f"job failed: {job.get('error')}", job=job)
        time.sleep(poll_interval_s)
    raise GatewayError(
        f"job {job_id} did not terminate within {timeout_s}s; "
        f"last phase was {last_phase!r}"
    )


# Example: chat completion
def chat(messages: list, timeout_s: int = 7200) -> str:
    preflight(MODEL)
    job_id = submit("ollama", "/api/chat", {"model": MODEL, "messages": messages})
    print(f"submitted {job_id}")
    job = wait_until_terminal(
        job_id,
        timeout_s=timeout_s,
        on_phase_change=lambda p: print(f"  phase={p}"),
    )
    return job["result"]["message"]["content"]


if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Say pong."}]))

What this snippet encodes that the minimal examples don’t:

Pre-flight /diagnostics with a defensive empty-list check. When the large instance is asleep, available_models is empty, so we don’t fail loudly: submission will wake it and we’ll learn then. Only fail when we have a definitive list that excludes the requested model.
Phase-change callback for UX state. Deduped against the previous phase so callers only see real transitions.
Distinct error type that preserves the gateway’s recorded job row (including the final phase), so failure callers can read error.job["phase"] and render “failed during ollama_loading_model” instead of generic “failed.”
Explicit timeout with the last-seen phase reported on expiry — far easier to diagnose than a bare TimeoutError.
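
To make the third point concrete, here is a small sketch of turning a failed job row into a user-facing message. The helper name describe_failure and the exact wording are illustrative; the only fields it reads are phase and error, which the snippet above already relies on.

```python
from typing import Optional


def describe_failure(job: Optional[dict]) -> str:
    """Render a failure message from the job row carried on GatewayError.job.

    job is None for pre-submit failures (for example a failed preflight).
    """
    if job is None:
        return "failed before submission"
    phase = job.get("phase")
    error = job.get("error") or "no error recorded"
    if phase:
        return f"failed during {phase}: {error}"
    return f"failed: {error}"
```

In a consumer this would sit in the except GatewayError as err: branch, called as describe_failure(err.job).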

Adapt the chat() example to your shape: docling jobs use submit("docling", "/v1/convert/source/async", {...}), embeddings use submit("ollama", "/api/embeddings", {...}). The submit/poll/phase plumbing stays identical.

Pointing local dev at OCI (the default dev path)

For day-to-day consumer development, run your consumer locally and point it at the OCI gateway via an SSH tunnel. This gives you local hot-reload, editor tooling, and dev container niceties while the LLM and docling work runs on production-sized models in OCI.

Why this is the default rather than local platform-services:

Model parity with prod. Your consumer’s prompts run against the same models prod will use. Behaviour you observe in dev is what consumers see in production.
No 40+ GB model pulls on every new dev environment. Bootstrapping a new machine doesn’t have to wait on a giant download.
Local machines can’t host the 72B models anyway. Local platform-services is sized for 8B models — useful for exercising wiring, not for evaluating output quality.

Reach for local platform-services instead when you have a concrete reason: working offline, intentionally testing the small models, or OCI is unavailable. Switching is one env var change.

Quickstart

Two tunnel patterns — pick by where your consumer runs:

Inside a Docker dev container → Option B: tunnel from inside the container. The path that “just works” on Docker Desktop + WSL2 without host-side networking changes.
Native on the WSL distro / Linux / macOS (no container) → Option A: tunnel from the host.

Either way, one-time bootstrap is the same: generate key → authorize on OCI → SSH alias → run tunnel. Errors: Troubleshooting.

How it works

Two viable topologies depending on where the consumer process runs. Same key, same OCI authorization, same SSH alias — only the location of the ssh -L and the consumer’s URL differ.

Option A — tunnel on the WSL host. ssh -L runs on the WSL distro; the consumer reaches it via host-gateway. Works cleanly for consumers running natively on the distro (no container). For consumer dev containers under Docker Desktop + WSL2, also requires WSL in mirrored networking mode — otherwise WSL-host-bound ports are invisible to containers, regardless of 0.0.0.0 bind.

consumer (in container or native) → host-gateway → ssh -L on WSL → OCI:localhost:11435

Option B — tunnel inside the dev container. ssh -L runs in the consumer container; the consumer hits its own loopback. Sidesteps host-side networking entirely. Requires bind-mounting the WSL distro’s ~/.ssh into the container so the same key/config are available.

consumer (in container) → localhost → ssh -L in same container → OCI:localhost:11435

One SSH key per dev distro

Each WSL distro (or local machine) that wants the tunnel gets its own key. Don’t copy keys between distros — that defeats their isolation. In the consumer-project distro:

ssh-keygen -t ed25519 -C "<distro-name>-tunnel" -f ~/.ssh/oci_arm
cat ~/.ssh/oci_arm.pub

This is a separate key from the operator’s full-access key documented in ../README.md → SSH access. That key exists for managing the instance; this one is for tunnels only.

The key lives in the WSL distro’s ~/.ssh/ regardless of which tunnel pattern you use (see Run the tunnel). Under Option B the dev container bind-mounts that directory read-only and uses the same key — no copying, no separate identity. “One key per distro” means per WSL distro.

Authorize the key as tunnel-only

The public half goes onto OCI’s ~/.ssh/authorized_keys, prefixed with restrictions so the key cannot be used for anything except forwarding the gateway port:

command="echo tunnel-only access; exit 1",no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,permitopen="localhost:11435" ssh-ed25519 AAAA... <distro-name>-tunnel

Why not the restrict umbrella keyword? OpenSSH’s restrict is the documented one-word umbrella for the four no-* options above, but it also disables port forwarding. On OpenSSH 9.6p1 (Ubuntu 24.04, current OCI image) we observed that restrict,permitopen=... parses correctly (sshd’s debug log shows the permitopen target listed), yet forwarding gets denied with administratively prohibited anyway. That is consistent with the documented semantics: under restrict, forwarding stays off unless port-forwarding is also listed to re-enable it. The expanded form (each restriction named individually) never disables port forwarding in the first place and behaves correctly. Verified on 2026-05-19 during the autostatement onboarding; the pattern in this doc deliberately avoids restrict so future onboardings don’t repeat the debug session.

Where this command runs. Not from the consumer distro — that’s the distro we’re granting access, so it can’t authorize itself yet. The append runs from a machine that already has admin SSH to OCI — typically the operator’s platform-services WSL distro. One-liner from there:

echo 'command="echo tunnel-only access; exit 1",no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,permitopen="localhost:11435" ssh-ed25519 AAAA... <distro-name>-tunnel' \
  | ssh oci-arm 'cat >> ~/.ssh/authorized_keys'

Verify it landed:

ssh oci-arm 'tail -1 ~/.ssh/authorized_keys'
ssh oci-arm 'tail -1 ~/.ssh/authorized_keys' | grep -oE 'ssh-ed25519 \S+ \S+$' | ssh-keygen -lf -

The grep -oE step strips the long options prefix (which contains spaces inside command="...") and isolates the bare <keytype> <keydata> <comment> so ssh-keygen -lf - can fingerprint it. Cross-check the printed fingerprint against the one the consumer distro printed in Generate the key (step 1). If they match, the key survived the paste intact.
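
If you'd rather do the same check without shelling out, the sketch below reproduces what the grep plus ssh-keygen pipeline computes: it isolates the key data on an authorized_keys line and derives the SHA256 fingerprint (SHA-256 over the decoded key blob, base64 without padding). The function name sha256_fingerprint is made up for this example, and like the grep pattern it assumes an ssh-ed25519 key with at most a single-word comment.

```python
import base64
import hashlib
import re

# Same idea as the grep -oE pattern: key type, key data, optional
# one-word comment, anchored at the end of the line.
_KEY_RE = re.compile(r"(ssh-ed25519) (\S+)(?: (\S+))?\s*$")


def sha256_fingerprint(authorized_keys_line: str) -> str:
    """Return the SHA256:... fingerprint for the key on the given line.

    Any options prefix (command="...", permitopen="...", ...) is ignored.
    """
    match = _KEY_RE.search(authorized_keys_line)
    if match is None:
        raise ValueError("no ssh-ed25519 key found on line")
    blob = base64.b64decode(match.group(2), validate=True)
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Running it on both the bare .pub line and the restricted authorized_keys line should give the same string, since the options prefix does not take part in the fingerprint.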

After this one-time bootstrap, the consumer distro talks to OCI directly forever — the operator distro is just the “trusted introducer” that vouches for the new key on day one.

What each option does:

command="echo tunnel-only access; exit 1" — forces every non-tunnel session to print and exit, blocking ssh oci-arm 'cmd' from running arbitrary commands. Port forwarding uses a separate SSH protocol path and runs before the forced command, so forwards still work.
no-pty — denies PTY allocation (no interactive shell).
no-agent-forwarding — blocks SSH agent forwarding.
no-X11-forwarding — blocks X11 forwarding.
no-user-rc — blocks execution of ~/.ssh/rc at login.
permitopen="localhost:11435" — permits forwarding to llm-gateway only.

If this key leaks, the worst an attacker can do is open forwards to the gateway port. They can’t get a shell, run commands, read files, forward agent or X11, or pivot to any other port.
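
Since the quoting in that line is easy to get wrong by hand, here is a sketch that assembles it from a public key. The helper name tunnel_only_entry is hypothetical; the options are exactly the expanded set listed above, with no restrict keyword.

```python
def tunnel_only_entry(public_key_line: str, target: str = "localhost:11435") -> str:
    """Build a tunnel-only authorized_keys line from the contents of a .pub file.

    public_key_line is '<ssh-ed25519> <keydata> [comment]'; target is the
    only host:port the key may forward to.
    """
    parts = public_key_line.split()
    if len(parts) < 2 or parts[0] != "ssh-ed25519":
        raise ValueError("expected 'ssh-ed25519 <keydata> [comment]'")
    options = ",".join([
        'command="echo tunnel-only access; exit 1"',  # forced command for sessions
        "no-pty",
        "no-agent-forwarding",
        "no-X11-forwarding",
        "no-user-rc",
        f'permitopen="{target}"',  # forwarding allowed to the gateway port only
    ])
    return options + " " + " ".join(parts)
```

The returned string is what gets appended to ~/.ssh/authorized_keys on OCI; the append itself still has to run from a machine with admin SSH.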

SSH config alias

In the consumer-project distro’s ~/.ssh/config:

Host oci-arm
  HostName 79.76.60.187
  User ubuntu
  IdentityFile ~/.ssh/oci_arm
  IdentitiesOnly yes

Run the tunnel

Pick the option matching where your consumer runs. permitopen on OCI checks only the remote destination, so the same authorized_keys line works for either option.

Option A — from the WSL host

ssh -L 0.0.0.0:11435:localhost:11435 oci-arm -N

The 0.0.0.0 bind (not the default 127.0.0.1) is what lets a consumer dev container reach the port via host-gateway. On Docker Desktop + WSL2 this also requires the distro to be in mirrored networking mode — in Windows ~/.wslconfig:

[wsl2]
networkingMode=mirrored

then wsl --shutdown from PowerShell. Without mirroring, WSL-bound ports aren’t visible to containers no matter how the tunnel binds, and Option B is the right path.

Consumer-side wiring:

extra_hosts:
  - "llm-gateway:host-gateway"
environment:
  LLM_GATEWAY_URL: "http://llm-gateway:11435"

Option B — from inside the dev container

Bind-mount the WSL distro’s ~/.ssh into the container so the same key, config, and known_hosts from the bootstrap are available. In the consumer’s devcontainer.json, add to mounts:

"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind,readonly"

(${localEnv:HOME} resolves to the WSL distro’s $HOME. Adjust /home/vscode to your container user.) Rebuild the container. Then from a terminal inside it:

ssh -L 11435:localhost:11435 oci-arm -N

No 0.0.0.0 needed — the tunnel and the consumer share the container’s loopback. Consumer-side wiring:

environment:
  LLM_GATEWAY_URL: "http://localhost:11435"

No llm-gateway:host-gateway entry in extra_hosts — the URL points at the container’s own loopback.

Persistence

SSH tunnels die with sleep, network changes, or laptop suspend. Under Option A: wrap with autossh or a systemd --user unit. Under Option B: re-run in the container terminal, or add a postStartCommand that backgrounds it. Skip both until it actually annoys you.
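
If you want something lighter than autossh before reaching for it, a restart loop is enough. The sketch below is an illustration, not part of platform-services: keep_tunnel_up, its parameters, and the backoff numbers are all assumptions, and the run and sleep hooks exist only so the loop can be exercised without a real ssh process.

```python
import subprocess
import time
from typing import Callable, List


def keep_tunnel_up(
    argv: List[str],
    max_restarts: int = 5,
    base_delay_s: float = 2.0,
    run: Callable[[List[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the tunnel command and restart it whenever it exits.

    Waits base_delay_s, doubling per restart (capped at 60 s), and gives
    up after max_restarts restarts. Returns the number of restarts made.
    """
    restarts = 0
    while True:
        run(argv)  # blocks for as long as the tunnel stays up
        if restarts >= max_restarts:
            return restarts
        sleep(min(base_delay_s * 2 ** restarts, 60.0))
        restarts += 1
```

A typical call would pass something like ["ssh", "-o", "ServerAliveInterval=30", "-o", "ExitOnForwardFailure=yes", "-L", "11435:localhost:11435", "oci-arm", "-N"], so that a dead connection makes ssh exit instead of hanging.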

If you run the tunnel from the same machine that already has a local platform-services compose stack up, port 11435 is already claimed by the local stack. Bind the tunnel to an alternate local port instead — pick something far enough from the regular range to be unambiguous (the 21000 prefix is a useful convention):

ssh -L 21435:localhost:11435 oci-arm -N

This is the convention make tunnel in platform-services itself uses — it binds *:21435 precisely so the local stack on :11435 and the OCI tunnel can coexist.

The consumer chooses which to hit by switching its LLM_GATEWAY_URL env var between the two port numbers:

http://llm-gateway:11435 → local platform-services
http://llm-gateway:21435 → OCI via tunnel

The extra_hosts mapping is identical for both; only the port number discriminates. permitopen on the OCI side only checks the remote target (localhost:11435), never the local bind port — so the same authorized_keys line works regardless of which local port you pick.
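
The three ssh -L variants on this page differ only in the local bind address and local port, which a small builder makes explicit. This is a sketch; tunnel_argv and its parameter names are invented for illustration, while the remote target stays fixed at localhost:11435 because that is what permitopen allows.

```python
from typing import List

REMOTE_TARGET = "localhost:11435"  # the only destination permitopen allows


def tunnel_argv(
    local_port: int = 11435,
    bind_all: bool = False,
    host_alias: str = "oci-arm",
) -> List[str]:
    """Build the ssh command line for the gateway tunnel.

    bind_all=True binds 0.0.0.0 (Option A, reachable from containers via
    host-gateway); the default binds loopback only (Option B).
    """
    bind = "0.0.0.0:" if bind_all else ""
    return ["ssh", "-L", f"{bind}{local_port}:{REMOTE_TARGET}", host_alias, "-N"]
```

For example, tunnel_argv(21435) gives the coexistence form shown above, and tunnel_argv(bind_all=True) gives the Option A form.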

Diagnosing a wrong-port hit. If your consumer points at :11435 and you have a local stack running, the call lands on the local stack silently — there’s no error, just stale-looking data. The cleanest tells in a GET /queue response from the current OCI gateway:

lifecycle block present (with state, warm, in_flight, last_activity_age_seconds, …)
timestamp field at the top level
in_flight is a scalar integer, NOT an in_flight_by_backend dict

If any of those are missing, you’re hitting a pre-current gateway — most likely the local stack on :11435. Switch the consumer URL to :21435 (assuming make tunnel is up) and re-check.
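
Those three tells can be checked mechanically against a parsed GET /queue body. The sketch below is hedged on one point: this page doesn't pin down whether in_flight sits inside lifecycle or at the top level, so the helper (looks_like_current_gateway, a made-up name) accepts either.

```python
def looks_like_current_gateway(queue_body: dict) -> bool:
    """Return True if a GET /queue body carries the current-gateway tells.

    Checks: a lifecycle block, a top-level timestamp, and in_flight as a
    plain integer rather than an in_flight_by_backend dict.
    """
    lifecycle = queue_body.get("lifecycle")
    if not isinstance(lifecycle, dict) or "timestamp" not in queue_body:
        return False
    if "in_flight_by_backend" in queue_body or "in_flight_by_backend" in lifecycle:
        return False
    # Assumption: look in lifecycle first, then at the top level.
    in_flight = lifecycle.get("in_flight", queue_body.get("in_flight"))
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(in_flight, int) and not isinstance(in_flight, bool)
```

A False result while pointing at :11435 is a strong hint that the local stack answered instead of OCI.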

How granular the local↔︎OCI switch is depends on how your consumer reads the env var. If it’s baked in at container start (e.g. fixed in the compose file), switching means a container restart. If it’s read per process invocation (common Python pattern: os.environ.get(...) inside the entry point), you can flip per command — and even run one process against OCI and another against local simultaneously from the same container:

LLM_GATEWAY_URL=http://llm-gateway:21435 \
  python -m scripts.coverage_probe_pass2 --full
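
For the per-invocation pattern, a sketch of reading the variable inside the entry point and reporting which side it points at. The name which_gateway and its labels are hypothetical, and the port-to-target mapping only holds under the two-port convention above (11435 for the local stack, 21435 for the OCI tunnel).

```python
from typing import Mapping
from urllib.parse import urlsplit


def which_gateway(env: Mapping[str, str]) -> str:
    """Classify LLM_GATEWAY_URL by port under the two-port convention.

    Raises KeyError when the variable is unset or empty, since failing
    loudly beats defaulting to a wrong gateway.
    """
    url = env.get("LLM_GATEWAY_URL")
    if not url:
        raise KeyError("LLM_GATEWAY_URL is not set")
    port = urlsplit(url).port
    if port == 21435:
        return "oci-tunnel"
    if port == 11435:
        return "local"  # only true when a local stack owns :11435
    return "unknown"
```

Called as which_gateway(os.environ) at the top of the entry point, it gives a one-line log message that makes a wrong-port run obvious.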

Trade-offs

Latency. ~10–30 ms per request from Northern Europe to Stockholm. Negligible for 5–15 minute jobs where one extra round trip on submission is invisible against the work itself.
Bandwidth on docling uploads. PDFs are base64’d in the request body and travel over the tunnel. A 5 MB PDF (~7 MB base64) on a 20 Mbps upstream is ~3 seconds before docling starts. Fine for “interpret one or a few documents at a time”; notable for batch ingestion.
Shared compute. The deployed runtime has a single global in-flight slot across all backends. Heavy local dev usage queues behind (or in front of) prod traffic. The gateway’s priority tiers are how you cooperate — submit dev jobs with priority: batch so they don’t preempt interactive prod work.
Tunnel lifecycle. SSH tunnels die with sleep, network changes, or laptop suspend. Either restart by hand or wrap in autossh.
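
The bandwidth figure in the docling bullet is simple arithmetic, and it is worth redoing for your own file sizes and uplink. A sketch follows; the function names are made up, and it ignores JSON framing, SSH overhead, and TCP ramp-up, so treat the result as a lower bound.

```python
def base64_size_bytes(raw_bytes: int) -> int:
    """Size of the base64 encoding: 4 output bytes per 3 input bytes, padded."""
    return (raw_bytes + 2) // 3 * 4


def upload_seconds(raw_bytes: int, upstream_mbps: float) -> float:
    """Seconds needed to push the base64 body over an uplink of the given Mbps."""
    bits = base64_size_bytes(raw_bytes) * 8
    return bits / (upstream_mbps * 1_000_000)
```

For a 5 MB PDF on a 20 Mbps uplink, upload_seconds(5_000_000, 20) comes out a little under 2.7 s, which matches the roughly 3 seconds quoted in the bullet.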

When not to reach for the tunnel

For a consumer deployed alongside platform-services on the OCI host. Use the regular extra_hosts: ["llm-gateway:host-gateway"] wiring directly — same Docker daemon, no tunnel needed.
When the consumer needs to keep working offline. The tunnel needs the remote host to be reachable; if you need to work on a plane, fall back to local platform-services with the small models (set LLM_GATEWAY_URL to http://llm-gateway:11435 and LLM_MODEL to an 8B tag — output quality will differ from prod, but the wiring exercises end-to-end).
When OCI is unavailable (capacity error on the on-demand large instance, region outage). Same fallback as above.

Troubleshooting

Symptom	Fix
`ssh: Could not resolve hostname oci-arm`	No `Host oci-arm` block on this distro. Add the SSH alias, or use `-i ~/.ssh/oci_arm ubuntu@<oci-ip>` inline.
Tunnel opens but container `curl` times out / connection-refuses	Under Option A: `ssh -L` bound to 127.0.0.1, or WSL not in mirrored networking mode (WSL-bound ports invisible to containers). Re-bind to `0.0.0.0`, enable mirrored mode in `~/.wslconfig`, or switch to Option B — the cleanest fix on Docker Desktop + WSL2.
`administratively prohibited` on forward	`authorized_keys` missing `permitopen="localhost:11435"`, or it uses `restrict` without `port-forwarding` (`restrict` also disables port forwarding; see Authorize the key).
`GET /v1/jobs/{id}` returns HTTP 404	72 h retention sweep deleted the record. Poll within 72 h of `completed_at`, or persist results client-side. See Job lifecycle.
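
For the last row, persisting results client-side can be as small as writing the terminal job row to disk as soon as polling returns it. The sketch below assumes the job row carries an id field (the submit response does; the polled row is assumed to match), and persist_job is a made-up name.

```python
import json
from pathlib import Path


def persist_job(job: dict, directory: Path) -> Path:
    """Write a terminal job row to <directory>/<id>.json and return the path.

    Writes to a temporary file first and renames it, so a crash mid-write
    never leaves a truncated result behind.
    """
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{job['id']}.json"
    tmp = directory / f"{job['id']}.json.tmp"
    tmp.write_text(json.dumps(job, indent=2), encoding="utf-8")
    tmp.replace(path)
    return path
```

Calling it right after the job reaches completed means the 72 h retention sweep no longer matters to your consumer.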

Migrating from direct backend access

If your consumer was wired to call ollama or docling directly, the migration is:

Add the gateway to extra_hosts (if it wasn’t already): "llm-gateway:host-gateway". Drop the old ollama: and docling: entries from the same list.
Replace direct backend calls with gateway job submissions. For ollama: POST /v1/jobs with endpoint: "/api/chat" (or whichever ollama path) and your old payload as payload. Then poll GET /v1/jobs/{id} until terminal. Worked example above.
For docling: switch from multipart upload to JSON+base64 submission via the gateway. Worked example above. The result shape is unchanged — the gateway returns docling’s /v1/result body verbatim under result.
If you were on the deprecated pattern A (joining platform_default), drop the networks: block — pattern A is gone. The extra_hosts: wiring above is the only supported shape.
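
The second step amounts to wrapping the payload you used to send directly in a gateway job body. A minimal sketch, assuming the POST /v1/jobs fields used in the reference snippet (backend, endpoint, payload, priority); the function name to_gateway_job and its validation rules are illustrative.

```python
def to_gateway_job(
    backend: str,
    endpoint: str,
    payload: dict,
    priority: str = "batch",
) -> dict:
    """Wrap a former direct-backend payload as a POST /v1/jobs body.

    The payload is passed through unchanged; only the envelope is new.
    """
    if backend not in ("ollama", "docling"):
        raise ValueError(f"unknown backend: {backend!r}")
    if priority not in ("interactive", "batch"):
        raise ValueError(f"unknown priority: {priority!r}")
    if not endpoint.startswith("/"):
        raise ValueError("endpoint must be a path such as '/api/chat'")
    return {
        "backend": backend,
        "endpoint": endpoint,
        "payload": dict(payload),  # shallow copy so the caller's dict is untouched
        "priority": priority,
    }
```

The dict it returns is what goes in the json= argument of the POST; polling GET /v1/jobs/{id} afterwards is unchanged from the reference snippet.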

The streaming-tokens code path (if you had one) doesn’t carry over — the gateway never streams. For 5–15 minute generations nobody is watching tokens form anyway; if you genuinely need streaming, that’s a feature request, not a migration step.

What’s not here

Authentication — the platform network is trusted; anything on it (or anything reaching the published gateway port) can use the gateway. There is no auth at the API layer, and the X-Caller-Id header is trusted, not verified. Suitable for an internal network of cooperating consumers; not suitable for untrusted clients.
Rate limiting / per-caller quotas — not implemented. The gateway enforces priority and serialization but does not cap how much any one caller can submit.
Streaming partial output from the gateway — clients see status changes via polling, not tokens-in-flight.
Direct access to ollama or docling — internal to platform-services, see What about direct ollama / docling access? above.
Tutorials for the ollama or docling APIs — see their upstream docs. (Exception: the docling submission shape used by the gateway is covered in Submitting a docling job above.)