Connecting to platform-services

This page is for consumer projects — apps and dev containers that need to use the shared LLM runtime (and, eventually, the public edge proxy) provided by platform-services.

How this works

Consumer projects reach platform-services through the llm-gateway — a priority-aware async job queue that is the single entry point for both LLM (ollama) and document parsing (docling) work. The gateway API and wiring are identical regardless of which deployment you point at.

Two deployments, one of them is the default

| Deployment | Where it runs | Models loaded | When to use |
| --- | --- | --- | --- |
| OCI (default) | Oracle Cloud Ampere A1 | Production models (72B-class LLM, VLM-based docling) | Default for everything: production consumer deployments and local consumer development alike. |
| Local platform-services | Developer machine | Small models only (8B-class) | Special cases: working offline, intentionally exercising the small models, or when OCI is unavailable. |

Concretely:

Connecting

Consumers reach platform-services through the gateway via the published host port. Your consumer doesn’t join the platform_default network — it stays decoupled from platform-services’ lifecycle and keeps starting/running when the platform is rebuilding or down.

services:
  app:
    extra_hosts:
      - "llm-gateway:host-gateway"

Behind the scenes, llm-gateway resolves via /etc/hosts to the host gateway IP; TCP connections then land on the published host port (11435).

URLs:

Why the two endpoints have different reachability stories: the gateway is a write-capable LLM dispatch API with no auth/rate-limit yet, so it’s deliberately kept off the public internet. Docs are read-only and safe to publish, so they’re served straight from the public TLS edge — no per-distro setup required.

Don’t hardcode these in app code — parameterize them via environment variables (LLM_GATEWAY_URL, PLATFORM_DOCS_URL). Same code then works under local dev, deployed prod, and the SSH-tunnel variant documented further down.
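One way to follow this is to resolve both URLs once at startup and fail loudly when either is missing. The sketch below assumes a hypothetical helper named load_endpoints (it is not part of platform-services); only the two variable names come from this page.

```python
import os
from typing import Mapping, NamedTuple


class Endpoints(NamedTuple):
    gateway_url: str
    docs_url: str


def load_endpoints(env: Mapping[str, str] = os.environ) -> Endpoints:
    """Read both URLs from the environment; raise if either is unset or empty."""
    required = ("LLM_GATEWAY_URL", "PLATFORM_DOCS_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required env vars: {', '.join(missing)}")
    # Strip a trailing slash so f"{gateway_url}/v1/jobs" never doubles it.
    return Endpoints(
        gateway_url=env["LLM_GATEWAY_URL"].rstrip("/"),
        docs_url=env["PLATFORM_DOCS_URL"].rstrip("/"),
    )
```

Passing the environment as a mapping keeps the helper easy to exercise in tests without touching the real process environment.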

When platform-services is up, your consumer reaches it. When it’s down, the consumer still starts — gateway calls fail at the call site with a connection error, not at compose-up time. That’s the design intent.

What about direct ollama / docling access?

Not supported. ollama and docling are internal services of platform-services and have no consumer contract. The gateway owns all backend dispatch, including:

If you were previously wired to ollama:11434 or docling:5005, see Migrating from direct backend access below.

What’s served

| Service | URL |
| --- | --- |
| llm-gateway | http://llm-gateway:11435 (private; see note below) |
| platform-docs (page) | https://soggplatform.dedyn.io/ |
| platform-docs (model list) | https://soggplatform.dedyn.io/models.json |
| platform-docs (deploy info) | https://soggplatform.dedyn.io/status.json |

Notes:

Calling the gateway

The gateway exposes a uniform async job API for both backends. Submit a job, get an ID back immediately, poll for the result. Jobs are expected to run roughly 5–15 minutes on the current hardware (longer once we cut over to 72B-class models) — long enough that holding an HTTP connection open across the work is the wrong default.

| Verb | Path | Purpose |
| --- | --- | --- |
| POST | /v1/jobs | Submit a job. Returns { id, status, tier, backend, queue_position }. |
| GET | /v1/jobs | List recent jobs (ops visibility). Query params: status (default active = queued+running+failed; or queued / running / completed / failed / all), backend (ollama / docling), limit (1–200, default 20). Returns a compact view per job: result payloads are replaced by result_bytes, errors trimmed to 500 chars. Use /v1/jobs/{id} for the full payload. |
| GET | /v1/jobs/{id} | Status + result (or error). Poll this. |
| DELETE | /v1/jobs/{id} | Cancel a job. Only valid while status=queued. |
| GET | /queue | Aggregate stats: in-flight (one global slot), per-tier queue depths, last-24h counters. Includes a timestamp field for cross-sample correlation. |
| GET | /diagnostics | Live backend health and a single-field state summarizing the whole platform (stopped, starting, ready_cold, warming, ready_warm, busy, failed, backend_unhealthy, disabled). Also exposes warm (boolean), loaded_models (in RAM right now), available_models (on disk), plus the raw per-backend probes. Use state and warm for routing decisions; use the raw fields for ops debug. |
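The query parameters on GET /v1/jobs are easy to get subtly wrong, so it can help to validate them client-side before the request goes out. A minimal sketch, assuming a hypothetical helper jobs_list_url; the accepted values are the ones listed in the table above.

```python
from typing import Optional
from urllib.parse import urlencode

STATUSES = {"active", "queued", "running", "completed", "failed", "all"}
BACKENDS = {"ollama", "docling"}


def jobs_list_url(gateway: str, status: str = "active",
                  backend: Optional[str] = None, limit: int = 20) -> str:
    """Build a GET /v1/jobs URL, rejecting values the table does not list."""
    if status not in STATUSES:
        raise ValueError(f"unknown status {status!r}")
    if backend is not None and backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}")
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    params = {"status": status, "limit": limit}
    if backend is not None:
        params["backend"] = backend
    return f"{gateway}/v1/jobs?{urlencode(params)}"
```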

Submission shape

{
  "endpoint": "/api/chat",
  "payload": { "model": "qwen2.5:72b-instruct-q4_K_M", "messages": [...] },
  "priority": "interactive",
  "backend": "ollama"
}

Choosing a priority for your job

Set priority on the submit body to either interactive or batch.

Tier is per-job, not per-caller: a single consumer can mix interactive and batch submissions freely. There is no central caller→tier mapping; the gateway trusts whatever priority value the consumer declares (the trust model assumes a small number of fully-trusted consumers on the internal network).

Set the X-Caller-Id request header to a short, stable identifier for your consumer project (e.g. my-editor-assistant). This is recorded on every job and surfaces in /v1/jobs/{id} and the gateway log — useful for tracing your own traffic and for the operator when debugging — but it does not drive tier assignment. If the header is absent, the gateway falls back to the peer IP as the caller identifier; don’t rely on the fallback, it’s a backstop.
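Because the gateway trusts the declared priority and only falls back to the peer IP when the header is absent, it is worth building the headers and body in one place. The sketch below is illustrative: build_submission is a hypothetical helper, and the two priority values are the ones used in the examples on this page.

```python
from typing import Optional, Tuple

PRIORITIES = ("interactive", "batch")


def build_submission(caller_id: str, endpoint: str, payload: dict,
                     priority: str, backend: Optional[str] = None) -> Tuple[dict, dict]:
    """Return (headers, json_body) for POST /v1/jobs."""
    if priority not in PRIORITIES:
        raise ValueError(f"priority must be one of {PRIORITIES}, got {priority!r}")
    if not caller_id.strip():
        # Never rely on the peer-IP fallback: always send a stable identifier.
        raise ValueError("caller_id must be a non-empty identifier")
    body = {"endpoint": endpoint, "payload": payload, "priority": priority}
    if backend is not None:
        body["backend"] = backend
    return {"X-Caller-Id": caller_id}, body
```

The returned pair can be passed straight to requests.post(..., headers=headers, json=body).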

Preemption is not supported. An interactive job submitted while a batch job is in flight waits for that batch job to finish (can be 5–15 min for docling work, longer for big inference). Once the in-flight slot frees, the interactive job dispatches before any queued batch. If sub-batch-duration latency matters for your use case, raise it — the gateway design will need to change.

Single global in-flight slot

The gateway runs one job at a time globally across all backends. An ollama job in flight blocks a docling dispatch and vice versa; two queued jobs of any combination run sequentially.

Why one slot, not one per backend: the on-demand large instance is CPU-bound on both workloads (no GPU on the free-tier ARM shape), so running ollama and docling in parallel halves each one’s throughput. Serial dispatch with priority tiers gives each job full CPU and keeps the queue model simple. Rationale and the trade-off in full: plans/priority-queueing.md Queue shape section.

The queue_position returned on submission counts every queued job ahead of yours globally — 1 means “next to run,” regardless of which backend any of the queued jobs target.
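The ordering rules above (one slot, interactive ahead of batch, no preemption) can be modelled in a few lines. This is a toy model for reasoning about queue_position, not the gateway's implementation; in particular it assumes first-come-first-served ordering within a tier, which this page does not state explicitly.

```python
import heapq
from itertools import count
from typing import List, Tuple

TIER_RANK = {"interactive": 0, "batch": 1}


def dispatch_order(submissions: List[Tuple[str, str]]) -> List[str]:
    """Given (job_name, tier) pairs in arrival order, all queued behind one
    in-flight job, return the order in which they would dispatch."""
    arrival = count()
    # Sort key: tier first, then arrival order (assumed FIFO within a tier).
    heap = [(TIER_RANK[tier], next(arrival), name) for name, tier in submissions]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Under this model a job's queue_position is its 1-based index in the returned list, and the in-flight job is never displaced.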

Job lifecycle

queued ─────► running ─────► completed
   │              │
   │              └────► failed (backend error, timeout, gateway restart)
   └────► failed (cancelled via DELETE, or platform-initiated bulk cancel)

If the gateway operator clears the queue (incident recovery, stuck backend, etc.), every queued job transitions to status=failed with the recorded error set to a message the operator chose — by default cancelled by platform services, but operators may include incident context (e.g. large instance OOM, restart pending). Treat this as a normal job-level failure on the consumer side: surface error to the user, resubmit if your workflow needs to retry.

While status=queued, the response includes queue_position (1-indexed, global; 1 means “next to run,” counting every queued job ahead of yours regardless of backend). Once a job starts, the response includes started_at and queue_position is dropped. On completion, the verbatim backend response is returned under result.

While status=running, the response also includes a phase field describing what the gateway is currently doing with your job. The bundled dispatchers emit:

| phase | Meaning |
| --- | --- |
| waiting_for_backend | Worker claimed your job; lifecycle is bringing the on-demand large up (only relevant on a cold start) |
| ollama_dispatching | POST to ollama just sent; phase about to refine to one of the two below within ~2 s |
| ollama_loading_model | The model your call requested is being loaded from disk into RAM (cold-load cost) |
| ollama_generating | The model is resident; ollama is actively producing tokens |
| docling_submitting | Sending your PDF to docling’s async submit endpoint |
| docling_polling | docling accepted; gateway is polling for completion |
| docling_fetching_result | docling reported success; gateway is fetching the JSON result |

phase is a UX affordance — a consumer can render meaningfully different “loading model…” vs “generating…” states without guessing — but it’s not a control plane signal. Don’t write logic that depends on phase transitions firing in a specific order or at all (e.g. a very fast inference may transition straight to completed before the watcher even writes ollama_generating). Treat it as a hint, not a contract.
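Treating phase as a hint means every lookup needs a fallback for a missing or unrecognised value. A minimal sketch of that idea; the label strings are hypothetical UI copy and phase_label is not a gateway API.

```python
from typing import Optional

# Phase names come from the table above; the labels are illustrative only.
PHASE_LABELS = {
    "waiting_for_backend": "Starting the backend",
    "ollama_dispatching": "Sending request",
    "ollama_loading_model": "Loading model",
    "ollama_generating": "Generating",
    "docling_submitting": "Uploading document",
    "docling_polling": "Converting document",
    "docling_fetching_result": "Fetching result",
}


def phase_label(status: str, phase: Optional[str]) -> str:
    """Pick a display label; never fail on a missing or unknown phase."""
    if status == "running":
        return PHASE_LABELS.get(phase, "Working")
    # For queued / completed / failed, the status itself is the label.
    return status.capitalize()
```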

Gateway restarts: queued jobs survive (state is persisted to SQLite). Any job that was actively running at restart time is marked failed with an error indicating the gateway restarted — the partial work is gone, and silently re-running could double-charge a non-idempotent caller. Resubmit if needed.

Retention. Completed and failed job records are kept for 72 hours after completed_at, then deleted by a background sweep that runs every hour. Both knobs are env-tunable on the gateway service (GATEWAY_JOB_TTL_SECONDS, default 259200; GATEWAY_CLEANUP_INTERVAL, default 3600). Queued and running jobs are never swept regardless of age — a job that takes a week to drain stays in the queue with no retention pressure. After sweep, GET /v1/jobs/{id} returns HTTP 404 with body {"error": "job not found"} — indistinguishable from an ID that never existed. Poll each result within 72 h of its completed_at, store the result on your side, or bump the TTL further for batch workloads. The completed_last_24h / failed_last_24h counters on /queue are independent of the retention window — they always report the last 24 h of activity.
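A poll loop therefore has four outcomes to handle, not three: still pending, completed, failed, and gone (swept or never existed). The sketch below classifies a single poll response; classify_poll is a hypothetical helper, and it assumes a successful poll returns HTTP 200.

```python
from typing import Any, Optional, Tuple


def classify_poll(status_code: int, body: Optional[dict]) -> Tuple[str, Any]:
    """Map one GET /v1/jobs/{id} response to (outcome, detail).

    outcome is "pending", "completed", "failed", or "gone".
    """
    if status_code == 404:
        # Swept after the TTL or never existed: the two are indistinguishable.
        return "gone", None
    if status_code != 200 or body is None:
        raise RuntimeError(f"unexpected gateway response: HTTP {status_code}")
    status = body["status"]
    if status == "completed":
        return "completed", body["result"]
    if status == "failed":
        return "failed", body.get("error")
    # queued or running; queue_position is absent once the job has started.
    return "pending", body.get("queue_position")
```

Persisting the result as soon as the outcome is "completed" avoids ever depending on the retention window.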

Picking a model name

The production default is the 72B-class model on the OCI large host: qwen2.5:72b-instruct-q4_K_M. This is what consumers should target both in deployed prod and from dev (via the OCI tunnel). Qwen3 has no 72B size (its ladder jumps 32B → 235B), so the production target is Qwen2.5-72B; if you see a reference to “qwen3:72b” anywhere, it’s a docs error.

The 8B small profile (qwen3:8b-q4_K_M-nothink) is kept for legacy compatibility and minimal-resource tests only — running ollama locally on a developer machine that can’t hold a 72B model, or driving a regression suite where the smaller model’s faster inference is more important than its quality. Do not depend on the small profile being available on the OCI large.

Hardcoding a model name in your consumer ties it to one deployment. Parameterize via env var, same way as the gateway URL:

# Production / dev-against-OCI (default — what every consumer
# should target unless they have a specific reason not to)
LLM_GATEWAY_URL=http://llm-gateway:11435  # or 21435 if tunneled
LLM_MODEL=qwen2.5:72b-instruct-q4_K_M

# Local platform-services only (legacy / minimal-resource tests)
LLM_GATEWAY_URL=http://llm-gateway:11435
LLM_MODEL=qwen3:8b-q4_K_M-nothink

The model set actually loaded on each deployment lives in models/profiles/ (large.sh is the production target; small.sh is legacy). To discover what’s currently loaded on the gateway you’re pointing at, hit /diagnostics on the gateway — ollama.ps[].name lists what’s resident in RAM right now, ollama.tags lists everything on disk. The legacy https://soggplatform.dedyn.io/models.json proxy of /api/tags still works for public model discovery.

What happens when the model isn’t loaded in RAM

Submitting a job whose model name is in available_models but not in loaded_models is legal and pays a one-time disk→RAM load cost — the gateway forwards the request to ollama, ollama loads the model from local disk (the bind-mounted weights volume), and then runs inference. The HTTP request just blocks while loading. From the consumer’s perspective the call takes longer than usual, and the phase field on /v1/jobs/{id} reports ollama_loading_model during the load window so you can render a meaningful UX state instead of “still loading…”. Once loaded, the model stays resident under OLLAMA_KEEP_ALIVE=-1 and subsequent calls don’t pay the cost.

Submitting a job whose model name is not in available_models at all is an operator-side configuration gap — the host this gateway points at hasn’t pulled the model. Ollama’s behavior in that case varies by version (recent releases auto-pull, older releases 404). Either way, don’t rely on auto-pull for a 40 GB production model — a silent 10–20 minute background download is the wrong default for an inference path. Pre-flight check by reading /diagnostics.available_models before submitting; if the model you need isn’t there, that’s an operator ask, not a consumer retry.
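The two cases above, plus the case where the lists are simply not known yet, can be told apart from a single /diagnostics response. A minimal sketch, assuming a hypothetical helper model_readiness that takes the parsed JSON body; treating an empty available_models list as "unknown" rather than "missing" is a choice made here so that a sleeping instance is not mistaken for a misconfigured one.

```python
def model_readiness(diag: dict, model: str) -> str:
    """Classify a model as "resident", "on_disk", "missing", or "unknown"."""
    loaded = diag.get("loaded_models") or []
    available = diag.get("available_models") or []
    if model in loaded:
        return "resident"   # next call skips the disk-to-RAM load
    if model in available:
        return "on_disk"    # legal to submit; pays the one-time load cost
    if not available:
        return "unknown"    # no definitive list (e.g. backend not reachable yet)
    return "missing"        # operator ask, not a consumer retry
```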

Latency expectations

These are calibration numbers for sizing consumer-side timeouts, not SLOs. They depend on the OCI large’s actual shape (currently 20 OCPU / 140 GB, CPU inference — no GPU) and the specific model loaded. Treat them as ballpark; measure against your own workload once you’ve got something in production.

For the production model (qwen2.5:72b-instruct-q4_K_M) — numbers below are measured, not estimated, from the 2026-05-23 autostatement verify run on the current OCI large host (20 OCPU, 140 GB, CPU inference):

| Scenario | Measured wall-time | What’s dominating |
| --- | --- | --- |
| state=stopped → docling first call ready | ~70 s | OCI boot + reach ready_warm for the docling backend |
| Docling: 9-page, 660 KB PDF | 86–164 s | docling itself; not gateway-side |
| Ollama cold load (72B disk→RAM) | 20 min 6 s | block-volume read speed for 40 GB of weights |
| Ollama warm typical extraction (small labelling prompt) | 3 min 49 s – 4 min 22 s | token generation on CPU |
| Ollama cold first call (load + typical extraction) | ~24 min | 20-min load + ~4-min generate; fits inside the 60-min GATEWAY_OLLAMA_TIMEOUT default with headroom |

The cold-load cost is large enough that the gateway pre-warms the production model in the background after boot. When the operator sets LIFECYCLE_WARM_MODEL=qwen2.5:72b-instruct-q4_K_M on the gateway, the lifecycle controller kicks a warm probe (a single-token inference against that model) as a fire-and-forget background task immediately after backend health passes. From the consumer side this means:

If LIFECYCLE_WARM_MODEL is not set, no background probe is spawned and the first consumer job after each cold start triggers the load inside its own HTTP budget. Same wall-time, just different attribution.

GET /diagnostics surfaces the probe outcome under lifecycle.warm:

"lifecycle": {
  ...
  "warm": {
    "model": "qwen2.5:72b-instruct-q4_K_M",
    "last_attempt_at": "2026-05-25T10:14:23.117+00:00",
    "last_outcome": "success",
    "last_duration_seconds": 1187.4,
    "last_error": null,
    "model_loaded_at": "2026-05-25T10:14:23.117+00:00"
  }
}

Fields:

Polling patience on cold starts

The gateway holds the ollama HTTP request server-side until ollama actually returns, up to GATEWAY_OLLAMA_TIMEOUT (default 60 min). Your consumer is polling /v1/jobs/{id}, not waiting on that HTTP call — each poll returns in milliseconds with status=running and the appropriate phase. phase=ollama_loading_model is now the normal signal that you’re paying cold-load (rather than the rare fallback it was when warm-on-boot blocked READY); phase=ollama_generating flips when ollama starts producing tokens. Your HTTP client’s per-request timeout only needs to cover one poll, not the whole job. The knob that matters for cold starts is how long your polling loop is willing to wait overall — for the 72B on the current OCI large, budget up to ~25 min for a cold first call (load + generate) and use the phase field to render meaningful state in the meantime.

For pre-flight smoke tests: a 5-token output against a trivial prompt ("Say pong.") completes in well under 30 seconds when state=ready_warm AND the requested model is in loaded_models.

If your verify runs produce additional measurements (especially on different output sizes or with different prompts), contribute them back — the table above is anchored on one filing’s worth of data plus the cold-load probe.

Submitting an ollama job

import os, time, requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]  # e.g. http://llm-gateway:11435
MODEL = os.environ["LLM_MODEL"]          # e.g. qwen2.5:72b-instruct-q4_K_M
HEADERS = {"X-Caller-Id": "my-editor-assistant"}

submit = requests.post(
    f"{GATEWAY}/v1/jobs",
    headers=HEADERS,
    json={
        "endpoint": "/api/chat",
        "payload": {
            "model": MODEL,
            "messages": [{"role": "user", "content": "Summarize ..."}],
        },
        "priority": "interactive",
    },
)
submit.raise_for_status()
job_id = submit.json()["id"]

while True:
    r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS)
    r.raise_for_status()
    job = r.json()
    if job["status"] == "completed":
        print(job["result"])
        break
    if job["status"] == "failed":
        raise RuntimeError(job["error"])
    time.sleep(5)

A 5-second poll cadence is fine for 5–15 minute jobs. The gateway forces stream: false on ollama calls — you receive the full response in result, never tokens-in-flight.

Submitting a docling job

Docling input goes inline in the payload as base64. The gateway does not accept multipart uploads — that simplifies the gateway and keeps the submission shape uniform between backends. For a 10 MB PDF this means ~13 MB of base64 text in the request body, well inside the gateway’s 64 MB body cap.
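The size arithmetic is worth checking before reading a large file into memory: base64 emits 4 output bytes for every 3 input bytes, rounded up to a whole group. A small sketch, assuming the 64 MB cap is counted in binary megabytes and allowing a hypothetical 4 KiB for the rest of the JSON body; both assumptions should be checked against the gateway's actual limit.

```python
BODY_CAP_BYTES = 64 * 1024 * 1024  # assumption: "64 MB" means 64 MiB


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def fits_body_cap(raw_bytes: int, overhead_bytes: int = 4096) -> bool:
    """True if the encoded file plus JSON overhead stays under the cap."""
    return base64_size(raw_bytes) + overhead_bytes <= BODY_CAP_BYTES
```

For a 10 MB (10,000,000-byte) PDF, base64_size gives 13,333,336 bytes, matching the ~13 MB figure above.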

The gateway forwards your payload to docling’s /v1/convert/source/async, polls until docling reports the task terminal, and returns the docling result JSON verbatim in result.

Important — conversion options must be nested under options. docling-serve accepts conversion knobs (do_ocr, to_formats, do_table_structure, md_page_break_placeholder, etc.) under an options sub-object, not as top-level siblings of sources. The gateway is a verbatim passthrough; if you put options at the top level they reach docling but get silently ignored, and you’ll see defaults instead — most visibly: no \f page-break markers in markdown. Confirmed in production by an autostatement regression on 2026-05-23.
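Since the failure is silent, a cheap client-side guard can catch the mistake before submission. The sketch below flags conversion knobs that sit at the top level of a docling payload; misplaced_options is a hypothetical helper, and the key set contains only the knobs named on this page, so it is not docling's full option list.

```python
# Only the knobs named in this section; docling-serve accepts more.
DOCLING_OPTION_KEYS = {
    "do_ocr",
    "to_formats",
    "do_table_structure",
    "md_page_break_placeholder",
}


def misplaced_options(payload: dict) -> list:
    """Return known conversion knobs found at the top level of the payload,
    where docling would silently ignore them."""
    return sorted(key for key in payload if key in DOCLING_OPTION_KEYS)
```

A consumer might raise an error whenever the returned list is non-empty, rather than discover the problem from missing page breaks.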

Images: default is dropped. The gateway interprets a top-level include_images: bool field on the docling payload (default false). When false the gateway both (a) sets docling’s image_export_mode=placeholder to suppress server-side rendering of images into the result, and (b) strips any image data that does land in the result before storing it. The strip nulls result.document.json_content.pages[<n>].image (the rendered page bitmaps — the main bloat source on real PDFs) and pictures[].image (semantic-object detections), and replaces inline base64 data URIs in md_content / html_content with placeholders. Set include_images: true if you actually need the image bytes (e.g. an upcoming vision-model interpreter); be aware that a single PDF page can produce 1–10 MB of base64 image data at the default 144 dpi and the gateway stores the full result for the job-TTL window. Background: plans/docling-image-handling.md.

import base64, os, time, requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]
HEADERS = {"X-Caller-Id": "my-doc-ingest"}

with open("annual-report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("ascii")

submit = requests.post(
    f"{GATEWAY}/v1/jobs",
    headers=HEADERS,
    json={
        "endpoint": "/v1/convert/source/async",
        "backend": "docling",
        "priority": "batch",
        "payload": {
            "sources": [
                {
                    "kind": "file",
                    "base64_string": pdf_b64,
                    "filename": "annual-report.pdf",
                }
            ],
            # Gateway-level knob. Default is false (drop images).
            # Sibling of `sources` / `options`; the gateway extracts
            # it from the payload before forwarding to docling.
            # "include_images": False,
            # All docling conversion knobs go under `options`.
            # Putting them at the top level alongside `sources`
            # results in docling silently using defaults — see the
            # warning above this snippet.
            "options": {
                "to_formats": ["md", "json"],
                "do_ocr": True,
                "do_table_structure": True,
                "md_page_break_placeholder": "\f",
            },
        },
    },
)
submit.raise_for_status()
job_id = submit.json()["id"]

while True:
    r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}", headers=HEADERS)
    r.raise_for_status()
    job = r.json()
    if job["status"] == "completed":
        result = job["result"]  # the verbatim docling result JSON
        break
    if job["status"] == "failed":
        raise RuntimeError(job["error"])
    time.sleep(5)

The docling result shape — markdown, JSON document tree, etc. — is whatever docling returns at /v1/result/{task_id}. The gateway is a pass-through for that body. A 5-line regression test that asserts result["document"]["md_content"].count("\f") >= n_pages would catch a future shape regression like the 2026-05-23 incident in seconds.
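Such a test can be as small as the sketch below. The result path is the one quoted above; the helper names are hypothetical, and the threshold is left to the caller because the right bound depends on whether docling emits the placeholder per page or only between pages, which is worth confirming against a real result.

```python
def count_page_breaks(result: dict, placeholder: str = "\f") -> int:
    """Count page-break placeholders in the markdown of a docling result."""
    return result["document"]["md_content"].count(placeholder)


def assert_page_breaks(result: dict, min_breaks: int) -> None:
    """Fail loudly when the markdown has fewer page breaks than expected."""
    found = count_page_breaks(result)
    if found < min_breaks:
        raise AssertionError(
            f"expected at least {min_breaks} page breaks, found {found}; "
            "check that conversion knobs are nested under 'options'"
        )
```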

Job size caps. Two timeouts apply, both server-side and configured on the gateway / docling services in docker-compose.yml:

For ~150-page annual-report PDFs, both caps are comfortable. If you have larger documents, raise both before pushing them through.

Checking platform load

GET /queue returns a snapshot without affecting the queue:

{
  "in_flight": 1,
  "queued": { "interactive": 0, "batch": 3 },
  "running_by_tier": { "interactive": 0, "batch": 1 },
  "completed_last_24h": 47,
  "failed_last_24h": 2,
  "timestamp": "2026-05-23T09:25:01.124+00:00"
}

in_flight is 0 or 1 — there is a single global slot across all backends (see Single global in-flight slot above). Hit this before a non-urgent submission if you want to be a good citizen: if queued.batch is deep, you might choose to defer. timestamp is the gateway’s wall-clock at snapshot time; use it to correlate samples across time without relying on the relative lifecycle.last_activity_age_seconds.
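The "good citizen" check can be reduced to two small functions over the /queue snapshot. This is a sketch under assumptions: the helper names and the default depth threshold are hypothetical, and jobs_ahead assumes first-come-first-served ordering within a tier.

```python
def jobs_ahead(queue: dict, priority: str) -> int:
    """Estimate how many jobs would run before a new submission."""
    ahead = queue["in_flight"] + queue["queued"]["interactive"]
    if priority == "batch":
        # Batch work also waits behind every queued batch job.
        ahead += queue["queued"]["batch"]
    return ahead


def should_defer(queue: dict, priority: str, max_batch_depth: int = 5) -> bool:
    """Defer non-urgent work when the batch queue is already deep."""
    return priority == "batch" and queue["queued"]["batch"] >= max_batch_depth
```

With the sample snapshot above, a new interactive job has 1 job ahead of it (the in-flight one) and a new batch job has 4.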

The response also includes a lifecycle block with the on-demand large-instance state and (when running) two uptime views:

{
  "lifecycle": {
    "state": "ready",
    "in_flight": 0,
    "last_activity_age_seconds": 12.4,
    "last_error": null,
    "session": {
      "started_at": "2026-05-23T08:12:01.034+00:00",
      "uptime_seconds": 4501.2
    },
    "month_to_date": {
      "wall_hours": 18.42,
      "ocpu_hours": 368.4,
      "gb_hours": 2578.8,
      "month_start": "2026-05-01T00:00:00+00:00",
      "next_reset": "2026-06-01T00:00:00+00:00"
    }
  }
}

session is null when the large is stopped. month_to_date is informational today — useful if you want to know roughly how much of the monthly OCI free-tier budget has been spent so far this UTC month. A future release will surface remaining-budget on each job-status response and add a dedicated /usage endpoint with the same data; until then /queue.lifecycle.month_to_date is the place to look.
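The sample values above are consistent with the two resource figures being wall-clock hours multiplied by the instance shape (20 OCPU, 140 GB, as quoted in the latency section). The sketch below reproduces that arithmetic; treating it as the gateway's actual accounting formula is an assumption, and usage_from_wall_hours is a hypothetical helper.

```python
def usage_from_wall_hours(wall_hours: float, ocpus: int = 20,
                          memory_gb: int = 140) -> dict:
    """Derive OCPU-hours and GB-hours from wall-clock hours and shape."""
    return {
        "ocpu_hours": round(wall_hours * ocpus, 1),
        "gb_hours": round(wall_hours * memory_gb, 1),
    }
```

For wall_hours = 18.42 this gives 368.4 OCPU-hours and 2578.8 GB-hours, the same figures as the sample response.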

Knowing when the platform is ready for your call

For decisions that need to happen before you submit — “is the backend even up?”, “is the model I want already in RAM, or am I about to pay a cold-load cost?” — GET /diagnostics gives you a single derived state field plus a few denormalized lists:

{
  "state": "ready_warm",
  "warm": true,
  "loaded_models": ["qwen2.5:72b-instruct-q4_K_M"],
  "available_models": [
    "qwen2.5:72b-instruct-q4_K_M",
    "qwen3:8b-q4_K_M-nothink"
  ],
  "ollama": { ... raw per-backend detail ... },
  "docling": { ... raw per-backend detail ... },
  "lifecycle": { ... state, in_flight, last_activity_age_seconds, last_error, warm  same as /queue's lifecycle EXCEPT /queue additionally embeds session / month_to_date / large_shape (the uptime-accounting block) ... },
  "timestamp": "2026-05-23T09:25:02.341+00:00"
}

state is the single source of truth for “what’s the platform doing right now,” and is one of:

| state | Meaning | Your next call will… | Wait or submit? |
| --- | --- | --- | --- |
| stopped | Large instance is idle-asleep (auto-stopped after LIFECYCLE_IDLE_TIMEOUT_SECONDS of no activity, default 2 h); the next submission will auto-wake it | trigger a ~70 s boot + warm-on-boot loading the 72B (~20 min) in the background; first call pays cold-load if it races the probe | just submit |
| starting | OCI boot in progress (either triggered by a previous submission or by make redeploy) | wait until boot finishes | submit (will queue) |
| stopping | OCI shutdown in progress | wait, or queue and the boot will retrigger | submit (will queue and trigger wake) |
| backend_unhealthy | Instance up, ollama/docling not responding | likely fail; investigate before retry | don’t submit until resolved |
| ready_cold | Backends up, no model in RAM | pay disk→RAM load (~20 min for 72B, per the measured cold load) | submit |
| warming | A model load is in progress (typically the background warm-on-boot probe; can also be an in-flight job’s own cold-load) | wait briefly; first call after ready_warm is fast | submit (queues briefly) |
| ready_warm | Model resident, idle | run immediately | submit (fast) |
| busy | Model resident, currently generating | queue behind the in-flight job | submit (queues briefly) |
| failed | Lifecycle in failed state (e.g. OCI capacity, IAM revoked, OCID deleted) | submissions will likely fail; the lifecycle auto-clears after the failed_cooldown_seconds window and retries | wait, or fix the underlying OCI-side issue |
| disabled | Lifecycle controller off (local dev / non-OCI deploys) | reach the backend directly with no boot logic | submit |

Sleep and wake — the operational model in one paragraph. The on-demand large instance auto-stops itself after LIFECYCLE_IDLE_TIMEOUT_SECONDS (default 7200 s / 2 h) of no gateway dispatches, and auto-wakes on the next job submission — the gateway intercepts every job submit and, if the large is stopped, issues an OCI start before dispatching. Consumers don’t need to explicitly trigger wake. There is no “operator paused the platform, hold your submission” state in this API; that scenario is covered by disabled (controller off entirely) or failed (lifecycle gave up). For every other state a consumer can safely submit and let the gateway handle whatever’s needed; the phase field on /v1/jobs/{id} will then narrate boot → load → generation.
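The "Wait or submit?" guidance can be collapsed into a single lookup. A minimal sketch, assuming a hypothetical helper action_for_state; holding on an unrecognised state is a conservative choice made here, not documented gateway behaviour.

```python
# States where submitting is safe and the gateway handles boot, load, and queueing.
SUBMIT_STATES = {
    "stopped", "starting", "stopping", "ready_cold",
    "warming", "ready_warm", "busy", "disabled",
}
# States where a submission is likely to fail until something is fixed.
HOLD_STATES = {"backend_unhealthy", "failed"}


def action_for_state(state: str) -> str:
    """Return "submit" or "hold" for a /diagnostics state value."""
    if state in SUBMIT_STATES:
        return "submit"
    # Known-bad states and anything unrecognised are both held back.
    return "hold"
```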

warm is the convenience boolean: lifecycle == ready AND at least one model in RAM. True implies your next ollama inference starts immediately — no boot, no load. Use warm for the simple “is this a fast path?” check; use state when you need the full picture.

loaded_models and available_models let you check whether the specific model you want is resident (fast next call) or just on disk (will pay the disk→RAM load, measured at about 20 min for the 72B). If the model you need isn’t in available_models at all, you’ve targeted the wrong host or the model hasn’t been refreshed there yet. That’s an operator issue, not something to retry around.

The raw ollama.ps, ollama.tags, docling.health, and lifecycle blocks are there for ops debugging. Each backend reports errors in-band, so a single broken backend doesn’t blind you to the other — useful when you’re triaging which side is having a bad day.

Reference: end-to-end consumer pattern

The minimal examples earlier on this page leave out a few things that matter in real consumers: a pre-flight /diagnostics check, phase-aware UX, a robust timeout, and explicit failure handling. Copy the snippet below as a starting point rather than re-deriving it. It assumes you have read the Picking a model name and Choosing a priority for your job sections above.

"""Minimal but production-shaped consumer of the platform gateway.

Reads LLM_GATEWAY_URL and LLM_MODEL from env (no defaults — failing
loudly beats defaulting to a wrong gateway, see #picking-a-model-name).
"""
import os
import time
from typing import Callable, Optional

import requests

GATEWAY = os.environ["LLM_GATEWAY_URL"]    # e.g. http://llm-gateway:11435
MODEL = os.environ["LLM_MODEL"]            # e.g. qwen2.5:72b-instruct-q4_K_M
CALLER = "my-consumer"                     # short stable identifier; logged
HEADERS = {"X-Caller-Id": CALLER}
PRIORITY = "batch"                         # "interactive" or "batch"; default batch


class GatewayError(RuntimeError):
    """Anything the gateway tells us went wrong. .job is the recorded
    row when the failure was a job-level error (so callers can read
    .job['error'], .job['phase'], etc.); None for pre-submit failures."""
    def __init__(self, message: str, job: Optional[dict] = None):
        super().__init__(message)
        self.job = job


def diagnostics() -> dict:
    r = requests.get(f"{GATEWAY}/diagnostics", timeout=5)
    r.raise_for_status()
    return r.json()


def preflight(model: str) -> dict:
    """Confirm the gateway will accept work for the given model.
    Raises GatewayError if the platform is in a non-submittable state
    or the model isn't available on the target host."""
    d = diagnostics()
    if d["state"] in ("failed", "backend_unhealthy"):
        raise GatewayError(f"platform not ready: state={d['state']}")
    if model not in d["available_models"] and d["available_models"]:
        # Empty list means we couldn't reach ollama (likely instance
        # asleep); submission will trigger wake and we'll learn for
        # sure then. Only fail loudly when we have a definitive list.
        raise GatewayError(
            f"model {model!r} not on disk; available={d['available_models']}"
        )
    return d


def submit(backend: str, endpoint: str, payload: dict) -> str:
    r = requests.post(
        f"{GATEWAY}/v1/jobs",
        headers=HEADERS,
        json={
            "backend": backend,
            "endpoint": endpoint,
            "payload": payload,
            "priority": PRIORITY,
        },
        timeout=10,
    )
    r.raise_for_status()
    return r.json()["id"]


def wait_until_terminal(
    job_id: str,
    *,
    timeout_s: int = 7200,
    poll_interval_s: float = 5.0,
    on_phase_change: Optional[Callable[[Optional[str]], None]] = None,
) -> dict:
    """Poll /v1/jobs/{id} until completed or failed. Calls
    on_phase_change(phase) once per (deduped) transition — the right
    place to render UX state."""
    deadline = time.monotonic() + timeout_s
    last_phase: Optional[str] = "__init__"
    while time.monotonic() < deadline:
        r = requests.get(f"{GATEWAY}/v1/jobs/{job_id}",
                         headers=HEADERS, timeout=10)
        r.raise_for_status()
        job = r.json()
        phase = job.get("phase")
        if phase != last_phase:
            if on_phase_change:
                on_phase_change(phase)
            last_phase = phase
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise GatewayError(f"job failed: {job.get('error')}", job=job)
        time.sleep(poll_interval_s)
    raise GatewayError(
        f"job {job_id} did not terminate within {timeout_s}s; "
        f"last phase was {last_phase!r}"
    )


# Example: chat completion
def chat(messages: list, timeout_s: int = 7200) -> str:
    preflight(MODEL)
    job_id = submit("ollama", "/api/chat", {"model": MODEL, "messages": messages})
    print(f"submitted {job_id}")
    job = wait_until_terminal(
        job_id,
        timeout_s=timeout_s,
        on_phase_change=lambda p: print(f"  phase={p}"),
    )
    return job["result"]["message"]["content"]


if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Say pong."}]))

What this snippet encodes that the minimal examples don’t:

Adapt the chat() example to your shape: docling jobs use submit("docling", "/v1/convert/source/async", {...}), embeddings use submit("ollama", "/api/embeddings", {...}). The submit/poll/phase plumbing stays identical.

Pointing local dev at OCI (the default dev path)

For day-to-day consumer development, run your consumer locally and point it at the OCI gateway via an SSH tunnel. This gives you local hot-reload, editor tooling, and dev container niceties while the LLM and docling work runs on production-sized models in OCI.

Why this is the default rather than local platform-services:

Reach for local platform-services instead when you have a concrete reason: working offline, intentionally testing the small models, or OCI is unavailable. Switching is one env var change.

Quickstart

Two tunnel patterns — pick by where your consumer runs:

Either way, the one-time bootstrap is the same: generate key → authorize on OCI → SSH alias → run tunnel. For errors, see Troubleshooting.

How it works

Two viable topologies depending on where the consumer process runs. Same key, same OCI authorization, same SSH alias — only the location of the ssh -L and the consumer’s URL differ.

Option A — tunnel on the WSL host. ssh -L runs on the WSL distro; the consumer reaches it via host-gateway. Works cleanly for consumers running natively on the distro (no container). For consumer dev containers under Docker Desktop + WSL2, also requires WSL in mirrored networking mode — otherwise WSL-host-bound ports are invisible to containers, regardless of 0.0.0.0 bind.

consumer (in container or native) → host-gateway → ssh -L on WSL → OCI:localhost:11435

Option B — tunnel inside the dev container. ssh -L runs in the consumer container; the consumer hits its own loopback. Sidesteps host-side networking entirely. Requires bind-mounting the WSL distro’s ~/.ssh into the container so the same key/config are available.

consumer (in container) → localhost → ssh -L in same container → OCI:localhost:11435

One SSH key per dev distro

Each WSL distro (or local machine) that wants the tunnel gets its own key. Don’t copy keys between distros — that defeats their isolation. In the consumer-project distro:

ssh-keygen -t ed25519 -C "<distro-name>-tunnel" -f ~/.ssh/oci_arm
cat ~/.ssh/oci_arm.pub

This is a separate key from the operator’s full-access key documented in ../README.md (SSH access). That key exists for managing the instance; this one is for tunnels only.

The key lives in the WSL distro’s ~/.ssh/ regardless of which tunnel pattern you use (see Run the tunnel). Under Option B the dev container bind-mounts that directory read-only and uses the same key — no copying, no separate identity. “One key per distro” means per WSL distro.

Authorize the key as tunnel-only

The public half goes onto OCI’s ~/.ssh/authorized_keys, prefixed with restrictions so the key cannot be used for anything except forwarding the gateway port:

command="echo tunnel-only access; exit 1",no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,permitopen="localhost:11435" ssh-ed25519 AAAA... <distro-name>-tunnel
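If you onboard several distros, the entry above can be assembled from a public key line instead of being hand-edited. This is a minimal sketch; the helper name tunnel_only_entry is hypothetical, and the option list is copied from the line above.

```python
# Options copied from the tunnel-only authorized_keys entry in this doc.
TUNNEL_OPTIONS = (
    'command="echo tunnel-only access; exit 1"',
    "no-pty",
    "no-agent-forwarding",
    "no-X11-forwarding",
    "no-user-rc",
    'permitopen="localhost:11435"',
)


def tunnel_only_entry(public_key_line: str) -> str:
    """Prefix '<keytype> <keydata> [comment]' with the tunnel-only options."""
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<keytype> <keydata> [comment]'")
    # Options are comma-joined with no spaces between them, then one space, then the key.
    return ",".join(TUNNEL_OPTIONS) + " " + " ".join(parts)
```

Feeding it the output of cat ~/.ssh/oci_arm.pub yields a line in the same format as the one shown above.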

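Why not the restrict umbrella keyword? OpenSSH’s restrict is often treated as a one-word equivalent of the four no-* options used here, but per sshd(8) it does more: it also disables port forwarding. On OpenSSH 9.6p1 (Ubuntu 24.04, current OCI image) we observed that restrict,permitopen=... parses correctly (sshd’s debug log shows the permitopen target listed), yet forwarding gets denied with administratively prohibited anyway. That is consistent with the documented behavior: permitopen only narrows which destinations may be forwarded to and does not re-enable forwarding that restrict switched off. The documented way to re-enable it is the separate port-forwarding option, which we have not tested here. The expanded form (each restriction named individually) behaves correctly. Verified on 2026-05-19 during the autostatement onboarding; the pattern in this doc deliberately avoids restrict so future onboardings don’t repeat the debug session.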

Where this command runs. Not from the consumer distro: that’s the distro being granted access, so it can’t authorize itself yet. The append runs from a machine that already has admin SSH to OCI, typically the operator’s platform-services WSL distro. One-liner from there:

echo 'command="echo tunnel-only access; exit 1",no-pty,no-agent-forwarding,no-X11-forwarding,no-user-rc,permitopen="localhost:11435" ssh-ed25519 AAAA... <distro-name>-tunnel' \
  | ssh oci-arm 'cat >> ~/.ssh/authorized_keys'

Verify it landed:

ssh oci-arm 'tail -1 ~/.ssh/authorized_keys'
ssh oci-arm 'tail -1 ~/.ssh/authorized_keys' | grep -oE 'ssh-ed25519 \S+ \S+$' | ssh-keygen -lf -

The grep -oE step strips the long options prefix (which contains spaces inside command="...") and isolates the bare <keytype> <keydata> <comment> so ssh-keygen -lf - can fingerprint it. Cross-check the printed fingerprint against the one ssh-keygen printed in the consumer distro when the key was generated (step 1); ssh-keygen -lf ~/.ssh/oci_arm.pub reprints it if it has scrolled away. If they match, the key survived the paste intact.
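For reference, the fingerprint that ssh-keygen -lf prints is the unpadded base64 of the SHA-256 digest of the decoded key data. The sketch below reproduces that from an authorized_keys line, options prefix included. The function name is hypothetical and the regular expression assumes an ssh-ed25519 key, as generated earlier.

```python
import base64
import hashlib
import re


def sha256_fingerprint(authorized_keys_line: str) -> str:
    """Return the 'SHA256:...' fingerprint of the ssh-ed25519 key on the line."""
    # Skip any options prefix by locating '<keytype> <keydata>' directly.
    match = re.search(r"ssh-ed25519 (\S+)", authorized_keys_line)
    if match is None:
        raise ValueError("no ssh-ed25519 key found in line")
    blob = base64.b64decode(match.group(1))
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest base64-encoded with the '=' padding stripped.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

The returned string corresponds to the second field of the ssh-keygen -lf output, so it can be compared directly against what the consumer distro printed.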

After this one-time bootstrap, the consumer distro talks to OCI directly forever — the operator distro is just the “trusted introducer” that vouches for the new key on day one.

What each option does:

If this key leaks, the worst an attacker can do is open forwards to the gateway port. They can’t get a shell, run commands, read files, forward agent or X11, or pivot to any other port.

SSH config alias

In the consumer-project distro’s ~/.ssh/config:

Host oci-arm
  HostName 79.76.60.187
  User ubuntu
  IdentityFile ~/.ssh/oci_arm
  IdentitiesOnly yes

Run the tunnel

Pick the option matching where your consumer runs. permitopen on OCI checks only the remote destination, so the same authorized_keys line works for either option.

Option A — from the WSL host

ssh -L 0.0.0.0:11435:localhost:11435 oci-arm -N

The 0.0.0.0 bind (not the default 127.0.0.1) is what lets a consumer dev container reach the port via host-gateway. On Docker Desktop + WSL2 this also requires the distro to be in mirrored networking mode — in Windows ~/.wslconfig:

[wsl2]
networkingMode=mirrored

then wsl --shutdown from PowerShell. Without mirroring, WSL-bound ports aren’t visible to containers no matter how the tunnel binds, and Option B is the right path.

Consumer-side wiring:

extra_hosts:
  - "llm-gateway:host-gateway"
environment:
  LLM_GATEWAY_URL: "http://llm-gateway:11435"

Option B — from inside the dev container

Bind-mount the WSL distro’s ~/.ssh into the container so the same key, config, and known_hosts from the bootstrap are available. In the consumer’s devcontainer.json, add to mounts:

"source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind,readonly"

(${localEnv:HOME} resolves to the WSL distro’s $HOME. Adjust /home/vscode to your container user.) Rebuild the container. Then from a terminal inside it:

ssh -L 11435:localhost:11435 oci-arm -N

No 0.0.0.0 needed — the tunnel and the consumer share the container’s loopback. Consumer-side wiring:

environment:
  LLM_GATEWAY_URL: "http://localhost:11435"

No llm-gateway:host-gateway entry in extra_hosts — the URL points at the container’s own loopback.

Persistence

SSH tunnels die with sleep, network changes, or laptop suspend. Under Option A: wrap with autossh or a systemd --user unit. Under Option B: re-run in the container terminal, or add a postStartCommand that backgrounds it. Skip both until it actually annoys you.
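If neither autossh nor a systemd unit is at hand, a small supervisor loop is one alternative. The sketch below is not what make tunnel or this doc prescribes; keep_alive and TUNNEL_CMD are hypothetical names, and the command shown is the Option B form with standard OpenSSH keepalive options added so a dead connection makes ssh exit instead of hanging.

```python
import subprocess
import time
from typing import List, Optional

# Option B tunnel command plus keepalives; adjust the -L spec for Option A.
TUNNEL_CMD = [
    "ssh", "-N",
    "-o", "ServerAliveInterval=30",
    "-o", "ServerAliveCountMax=3",
    "-o", "ExitOnForwardFailure=yes",
    "-L", "11435:localhost:11435",
    "oci-arm",
]


def keep_alive(cmd: List[str], max_restarts: Optional[int] = None,
               base_delay_s: float = 2.0, max_delay_s: float = 60.0) -> int:
    """Run cmd and relaunch it whenever it exits; return the number of launches."""
    launches = 0
    delay = base_delay_s
    while True:
        started = time.monotonic()
        subprocess.run(cmd, check=False)  # blocks until the tunnel process exits
        launches += 1
        if max_restarts is not None and launches > max_restarts:
            return launches
        # A run that lasted over a minute counts as healthy: reset the backoff.
        if time.monotonic() - started > 60:
            delay = base_delay_s
        time.sleep(delay)
        delay = min(delay * 2, max_delay_s)
```

Calling keep_alive(TUNNEL_CMD) runs until interrupted; the exponential backoff keeps it from hammering OCI while the network is down.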

Sharing localhost with a local platform-services

If you run the tunnel from the same machine that already has a local platform-services compose stack up, port 11435 is already claimed by the local stack. Bind the tunnel to an alternate local port instead — pick something far enough from the regular range to be unambiguous (the 21000 prefix is a useful convention):

ssh -L 21435:localhost:11435 oci-arm -N

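This is the convention make tunnel in platform-services itself uses: it binds *:21435 precisely so the local stack on :11435 and the OCI tunnel can coexist. Note that the command above binds 127.0.0.1 only; if a containerized consumer reaches the tunnel via host-gateway, bind 0.0.0.0:21435 instead, as in Option A.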

The consumer chooses which to hit by switching its LLM_GATEWAY_URL env var between the two port numbers:

The extra_hosts mapping is identical for both; only the port number discriminates. permitopen on the OCI side only checks the remote target (localhost:11435), never the local bind port — so the same authorized_keys line works regardless of which local port you pick.
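Since the port number is the only discriminator, a quick first check is which of the two ports accepts connections from where the consumer runs. The following is a minimal sketch using only the standard library; port_open is a hypothetical helper, and an open port says nothing about which gateway is answering on it.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

From inside the consumer container, port_open("llm-gateway", 11435) and port_open("llm-gateway", 21435) show whether the local stack, the tunnel, or both are reachable.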

Diagnosing a wrong-port hit. If your consumer points at :11435 and you have a local stack running, the call lands on the local stack silently — there’s no error, just stale-looking data. The cleanest tells in a GET /queue response from the current OCI gateway:

If any of those are missing, you’re hitting a pre-current gateway — most likely the local stack on :11435. Switch the consumer URL to :21435 (assuming make tunnel is up) and re-check.

How granular the local↔︎OCI switch is depends on how your consumer reads the env var. If it’s baked in at container start (e.g. fixed in the compose file), switching means a container restart. If it’s read per process invocation (common Python pattern: os.environ.get(...) inside the entry point), you can flip per command — and even run one process against OCI and another against local simultaneously from the same container:

LLM_GATEWAY_URL=http://llm-gateway:21435 \
  python -m scripts.coverage_probe_pass2 --full
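The per-invocation pattern amounts to resolving the URL when it is needed, not at import or container start. A minimal sketch follows; the default value and the gateway_url name are assumptions, so substitute whatever default your project prefers.

```python
import os

# Assumed default: the extra_hosts alias on the regular gateway port.
DEFAULT_GATEWAY_URL = "http://llm-gateway:11435"


def gateway_url() -> str:
    """Resolve the gateway base URL at call time, without a trailing slash."""
    # Reading the environment here (not at module import) lets each process
    # invocation be pointed at a different port.
    return os.environ.get("LLM_GATEWAY_URL", DEFAULT_GATEWAY_URL).rstrip("/")
```

With this shape, the one-off override shown above takes effect for that command only and leaves other processes in the same container untouched.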

Trade-offs

When not to reach for the tunnel

Troubleshooting

| Symptom | Fix |
| --- | --- |
| ssh: Could not resolve hostname oci-arm | No Host oci-arm block on this distro. Add the SSH alias, or use -i ~/.ssh/oci_arm ubuntu@<oci-ip> inline. |
| Tunnel opens but curl from the container times out or is refused | Under Option A: ssh -L bound to 127.0.0.1, or WSL not in mirrored networking mode (WSL-bound ports invisible to containers). Re-bind to 0.0.0.0, enable mirrored mode in ~/.wslconfig, or switch to Option B, which is the cleanest fix on Docker Desktop + WSL2. |
| administratively prohibited on forward | authorized_keys is missing permitopen="localhost:11435", or it uses restrict, which also disables port forwarding (see Authorize the key). |
| GET /v1/jobs/{id} returns HTTP 404 | The 72 h retention sweep deleted the record. Poll within 72 h of completed_at, or persist results client-side. See Job lifecycle. |
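For the 404 row, persisting results client-side can be as simple as writing each terminal job record to disk as soon as polling returns it. The sketch below assumes the job record is JSON-serializable (it arrives as JSON from the gateway); save_job, load_job, and the one-file-per-job layout are hypothetical choices, not part of the gateway.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def save_job(job_id: str, job: Dict[str, Any], cache_dir: Path) -> Path:
    """Write the terminal job record to cache_dir/<job_id>.json."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{job_id}.json"
    tmp = cache_dir / f"{job_id}.json.tmp"
    # Write to a temp file, then rename, so a crash never leaves a partial record.
    tmp.write_text(json.dumps(job), encoding="utf-8")
    tmp.replace(path)
    return path


def load_job(job_id: str, cache_dir: Path) -> Optional[Dict[str, Any]]:
    """Return the cached record, or None if this job was never saved."""
    path = cache_dir / f"{job_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Checking load_job before polling means a record swept from the gateway after 72 h is still available locally.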

Migrating from direct backend access

If your consumer was wired to call ollama or docling directly, the migration is:

  1. Add the gateway to extra_hosts (if it wasn’t already): "llm-gateway:host-gateway". Drop the old ollama: and docling: entries from the same list.
  2. Replace direct backend calls with gateway job submissions. For ollama: POST /v1/jobs with endpoint: "/api/chat" (or whichever ollama path) and your old payload as payload. Then poll GET /v1/jobs/{id} until terminal. Worked example above.
  3. For docling: switch from multipart upload to JSON+base64 submission via the gateway. Worked example above. The result shape is unchanged — the gateway returns docling’s /v1/result body verbatim under result.
  4. If you were on the deprecated pattern A (joining platform_default), drop the networks: block — pattern A is gone. The extra_hosts: wiring above is the only supported shape.

The streaming-tokens code path (if you had one) doesn’t carry over — the gateway never streams. For 5–15 minute generations nobody is watching tokens form anyway; if you genuinely need streaming, that’s a feature request, not a migration step.

What’s not here