dev.aiaggies.net · runbook

Runbook & reference

Everything you need to operate, extend, and tear down this deployment. Share-safe: it contains no API key material. See the splash page (/) for the public-facing overview.

Architecture

End-to-end request flow from an OpenAI-compatible client (curl, the OpenAI SDKs, the pi coding agent, etc.) through this Cloud Run proxy to a Gemini model on Vertex AI. Solid arrows are the request path; dashed arrows are the streamed SSE response.

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'background': 'transparent',
    'primaryColor': '#152036',
    'primaryTextColor': '#e8eef4',
    'primaryBorderColor': '#7fb3ff',
    'secondaryColor': '#0f1626',
    'tertiaryColor': '#0a1120',
    'lineColor': '#7fb3ff',
    'clusterBkg': 'rgba(127,179,255,0.05)',
    'clusterBorder': 'rgba(127,179,255,0.35)',
    'fontFamily': 'Inter, system-ui, sans-serif',
    'fontSize': '14px'
  }
}}%%
flowchart LR
  C["OpenAI-compatible client
(curl · OpenAI SDK · pi agent)"]
  subgraph GCP ["Google Cloud Project · rajdphd-prep"]
    direction TB
    R["Cloud Run
ai-proxy
FastAPI"]
    V["Vertex AI
OpenAI-compat endpoint"]
    M(["Gemini 2.5 Flash
Gemini 3.1 Pro Preview"])
  end
  C ==>|"POST /v1/chat/completions
x-api-key: sk_live_…"| R
  R ==>|"validate key · rewrite alias
attach SA bearer"| V
  V ==> M
  M -. "SSE chunks" .-> V
  V -. "SSE chunks" .-> R
  R -. "SSE chunks" .-> C

1. What this is

A single Cloud Run service (ai-proxy) that fronts Vertex AI's OpenAI-compatible Chat Completions endpoint. Its only jobs are:

Terminate TLS on your own hostname (dev.aiaggies.net).
Validate a caller-supplied x-api-key against an issued-key list.
Rewrite short model aliases (flash) to the full Google model IDs.
Attach a Google service-account bearer and forward to Vertex.
Pass the upstream response (a JSON body, or an SSE stream when stream: true) back unchanged.

There is no Apigee, no database, no queue, no Load Balancer. The service scales to zero when idle; at single-developer usage the non-model bill is effectively $0.
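The forwarding step can be pictured as a pure function. This is a minimal sketch, not the code in proxy/main.py. `vertex_openai_base` and `build_upstream_request` are hypothetical names. The URL shape (`…/projects/{PROJECT}/locations/{LOCATION}/endpoints/openapi/chat/completions`, with no location prefix on the host for `global`) comes from Vertex AI's OpenAI-compatibility docs, not from this deployment. The service-account token is passed in so the function stays testable.

```python
import json


def vertex_openai_base(project: str, location: str) -> str:
    """Return the (assumed) base URL of Vertex AI's OpenAI-compatible endpoint."""
    # Regional endpoints prefix the host with the location; "global" does not.
    if location == "global":
        host = "aiplatform.googleapis.com"
    else:
        host = f"{location}-aiplatform.googleapis.com"
    return f"https://{host}/v1/projects/{project}/locations/{location}/endpoints/openapi"


def build_upstream_request(body: dict, sa_token: str, project: str, location: str):
    """Build (url, headers, payload) for forwarding one chat completion to Vertex.

    The caller's x-api-key / Authorization header is deliberately not copied:
    Vertex only ever sees the service-account bearer token.
    """
    url = vertex_openai_base(project, location) + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {sa_token}",
        "Content-Type": "application/json",
    }
    # The body is forwarded as-is; the model alias has already been rewritten.
    return url, headers, json.dumps(body).encode("utf-8")
```

In a service, the token could come from google-auth. Call `creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])`, then `creds.refresh(google.auth.transport.requests.Request())`, then read `creds.token`. On Cloud Run, those credentials resolve to the service's attached account (ai-proxy-sa).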

2. URLs & base paths

Custom domain: https://dev.aiaggies.net
Cloud Run URL: the service's default https://…run.app address (printed as status.url by the describe command in section 11)
OpenAI SDK base_url: https://dev.aiaggies.net/v1
Splash: /
Docs (this page): /docs
Health: /healthz
Models: /v1/models
Chat: /v1/chat/completions

While the managed TLS certificate for dev.aiaggies.net is still being issued, call the Cloud Run URL directly (append /v1 for SDK base URLs). Both hostnames route to the same service and behave identically.

3. Authentication

Every request to /v1/* must include either an x-api-key header or a standard OpenAI-style Authorization: Bearer header. Both map to the same issued-key table; use whichever your client sends by default.

# pi, curl examples, anything explicit
x-api-key: sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# stock OpenAI SDKs, the OpenAI Agents SDK
Authorization: Bearer sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Missing, unknown, or disabled keys return 401. Keys are loaded from the API_KEYS_JSON environment variable at startup.
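Here is a minimal sketch of the check described above. It assumes the entry shape shown in section 9 (`id`, `key`, `enabled`). The function names are hypothetical. Treating an entry without `enabled` as disabled is this sketch's own choice. `hmac.compare_digest` keeps the comparison constant-time.

```python
import hmac
import json
from typing import Mapping, Optional


def load_keys(api_keys_json: str) -> dict[str, str]:
    """Map key -> id for every enabled entry in API_KEYS_JSON."""
    entries = json.loads(api_keys_json)
    return {e["key"]: e["id"] for e in entries if e.get("enabled", False)}


def extract_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the caller's key from x-api-key or Authorization: Bearer."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("x-api-key"):
        return lowered["x-api-key"].strip()
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None


def authenticate(headers: Mapping[str, str], keys: dict[str, str]) -> Optional[str]:
    """Return the key id on success, or None (the caller should answer 401)."""
    supplied = extract_key(headers)
    if supplied is None:
        return None
    for key, key_id in keys.items():
        # Constant-time comparison avoids leaking key prefixes through timing.
        if hmac.compare_digest(supplied.encode(), key.encode()):
            return key_id
    return None
```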

4. Endpoints

GET /

Public splash page. No auth.

GET /docs

This page. No auth.

GET /healthz

Liveness. Returns {"ok": true, "models": [...]}.

GET /v1/models

OpenAI-compatible model list. Requires an API key (x-api-key or Authorization: Bearer).

POST /v1/chat/completions

OpenAI-compatible chat completion; set "stream": true for an SSE response. Requires an API key (x-api-key or Authorization: Bearer).

5. Model aliases

Clients use short aliases; the proxy rewrites them server-side before calling Vertex. This means you can change backing models without touching client code.

flash → google/gemini-2.5-flash

pro-preview → google/gemini-3.1-pro-preview

Controlled by MODEL_FLASH_ID and MODEL_PRO_PREVIEW_ID env vars.
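Here is a sketch of the rewrite. It assumes the proxy reads the two env vars once and falls back to the IDs in the table above. `alias_table`, `rewrite_model`, and `UnknownModel` are hypothetical names. Anything that is not an alias is rejected, which matches the 400 for raw Vertex IDs noted in section 8f.

```python
import os
from typing import Mapping


def alias_table(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Alias -> Vertex model ID; env vars override the assumed defaults."""
    return {
        "flash": env.get("MODEL_FLASH_ID", "google/gemini-2.5-flash"),
        "pro-preview": env.get("MODEL_PRO_PREVIEW_ID", "google/gemini-3.1-pro-preview"),
    }


class UnknownModel(ValueError):
    """Raised for anything that is not a known alias (map this to HTTP 400)."""


def rewrite_model(body: dict, aliases: dict[str, str]) -> dict:
    """Return a copy of the request body with the alias replaced by the Vertex ID."""
    alias = body.get("model")
    if alias not in aliases:
        raise UnknownModel(f"unknown model {alias!r}; use one of {sorted(aliases)}")
    return {**body, "model": aliases[alias]}
```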

6. Quick start

.env

# replace with your issued key
AI_API_BASE=https://dev.aiaggies.net/v1
API_KEY=sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

curl

set -a && . ./.env && set +a   # export the vars so the Python and Node examples can read them too

# list models
curl -sS "$AI_API_BASE/models" -H "x-api-key: $API_KEY" | jq

# chat completion
curl -sS -X POST "$AI_API_BASE/chat/completions" \
  -H "content-type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{"model":"flash","messages":[{"role":"user","content":"Say hi."}]}' | jq

Python (OpenAI SDK)

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["AI_API_BASE"],
    api_key=os.environ["API_KEY"],
)

# list
for m in client.models.list().data:
    print(m.id, "->", getattr(m, "vertex_id", ""))

# chat
resp = client.chat.completions.create(
    model="flash",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(resp.choices[0].message.content)

JavaScript / Node (OpenAI SDK)

// Run as an ES module (e.g. a .mjs file): import syntax and top-level await need ESM.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.AI_API_BASE,
  apiKey: process.env.API_KEY,
});

const resp = await client.chat.completions.create({
  model: "flash",
  messages: [{ role: "user", content: "Say hi." }],
});
console.log(resp.choices[0].message.content);

7. Using pi as a sandboxed client

pi is a minimal terminal coding agent that speaks the OpenAI Chat Completions wire format, so it plugs directly into this proxy. The pi author recommends running it inside a container — there are no permission prompts by design. The setup below keeps pi isolated from everything on the host except the project directory you launch it from.

What was set up

A throwaway Docker image (pi-sandbox:latest) built from ~/Development/pi-sandbox/Dockerfile — node:22-bookworm-slim + @mariozechner/pi-coding-agent + git, ripgrep, jq, python3, curl. Runs as a non-root user.
A host-side wrapper at ~/Development/pi-sandbox/pi that runs docker run with exactly two bind mounts: the current directory as /workspace and ~/.pi-sandbox as the container's ~/.pi (so sessions, auth, and installed pi packages persist). Nothing else from $HOME is visible to pi.
A custom provider declared in ~/.pi-sandbox/agent/models.json pointing at this Cloud Run service. The wrapper forwards AI_API_BASE and API_KEY from the shell, and the provider config tells pi to send x-api-key (Bearer also works if you switch pi to that style).
A streaming SSE passthrough added to /v1/chat/completions on this proxy. pi defaults to stream: true; without passthrough the proxy would buffer the SSE response and return an empty completion. Section 11 (Operations) shows how to check the latest revision and redeploy.
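For reference, this is roughly what a streaming client such as pi reads from the passthrough. Each chunk arrives as one `data: {json}` line, events are separated by blank lines, and the stream ends with `data: [DONE]`. Below is a minimal parser sketch with hypothetical helper names. It rebuilds the assistant text from `choices[0].delta.content` and assumes single-line `data:` fields, as OpenAI-style streams use.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, stopping at the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, ": comments", event/id fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield payload


def collect_completion(lines: Iterable[str]) -> str:
    """Concatenate the content deltas of choice 0 across all chunks."""
    parts = []
    for payload in iter_sse_data(lines):
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) == 0:
                # Role-only or tool-call deltas carry no content.
                parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)
```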

~/.pi-sandbox/agent/models.json

This file makes pi aware of the aiaggies provider and its two models. "API_KEY" is an env var name — pi resolves it against the container env at request time, so no key is stored on disk.

{
  "providers": {
    "aiaggies": {
      "baseUrl": "https://dev.aiaggies.net/v1",
      "api": "openai-completions",
      "apiKey": "API_KEY",
      "headers": { "x-api-key": "API_KEY" },
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false,
        "maxTokensField": "max_tokens"
      },
      "models": [
        {
          "id": "flash",
          "name": "Gemini 2.5 Flash (via aiaggies)",
          "reasoning": false,
          "input": ["text", "image"],
          "contextWindow": 1048576,
          "maxTokens": 65536,
          "cost": { "input": 0.075, "output": 0.30, "cacheRead": 0.01875, "cacheWrite": 0 }
        },
        {
          "id": "pro-preview",
          "name": "Gemini 3.1 Pro Preview (via aiaggies)",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 1048576,
          "maxTokens": 65536,
          "cost": { "input": 1.25, "output": 10.0, "cacheRead": 0.3125, "cacheWrite": 0 }
        }
      ]
    }
  }
}
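The `cost` blocks let pi report spend per session. Here is a back-of-the-envelope sketch of the same arithmetic. It assumes the rates are USD per million tokens and that cached input is billed at `cacheRead` instead of `input`; check pi's documentation for both. `estimate_cost` and `models_by_id` are hypothetical helpers. Keep the rates in sync with Google's published Vertex pricing.

```python
import json


def estimate_cost(model_cfg: dict, input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request from a models.json "cost" block.

    Assumes per-1M-token rates and that cached tokens replace regular input.
    """
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed total input tokens")
    rates = model_cfg["cost"]
    uncached = input_tokens - cached_input_tokens
    return (uncached * rates["input"]
            + cached_input_tokens * rates["cacheRead"]
            + output_tokens * rates["output"]) / 1_000_000


def models_by_id(path: str, provider: str = "aiaggies") -> dict:
    """Load models.json and index one provider's models by alias."""
    with open(path) as f:
        cfg = json.load(f)
    return {m["id"]: m for m in cfg["providers"][provider]["models"]}
```

Usage: `estimate_cost(models_by_id(os.path.expanduser("~/.pi-sandbox/agent/models.json"))["flash"], 12_000, 800)`.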

Launching pi

# 1. load AI_API_BASE + API_KEY into the shell so the wrapper forwards them
cd ~/Development/vertex-ai-dev
set -a && . ./.env && set +a

# 2. interactive TUI, isolated to the current directory
~/Development/pi-sandbox/pi --provider aiaggies --model flash

# non-interactive one-shot
~/Development/pi-sandbox/pi -p "summarize the repo" --provider aiaggies --model flash

# list configured models (confirms aiaggies/flash + pro-preview are registered)
~/Development/pi-sandbox/pi --list-models

# drop into a shell inside the sandbox
~/Development/pi-sandbox/pi shell

# rebuild the image when upgrading pi itself
~/Development/pi-sandbox/pi rebuild

To run the wrapper as plain pi, symlink it onto your PATH (writing to /usr/local/bin may need sudo): ln -s ~/Development/pi-sandbox/pi /usr/local/bin/pi.

Sandbox reach

Host resource	Inside the container	Notes
Current dir ($(pwd))	/workspace	read/write
~/.pi-sandbox	~/.pi	persistent state
~/.ssh, ~/.aws, rest of $HOME	not mounted	unreachable
Docker socket	not mounted	no escape via the host Docker daemon
Network	enabled	required for LLM API calls
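A quick probe can confirm the table from inside the container, for example after `~/Development/pi-sandbox/pi shell`; the image ships python3. It is a sanity check, not a security audit. `sandbox_report` is a hypothetical helper, and `/var/run/docker.sock` is assumed to be the socket path. The `~/.ssh` check only shows whether the container user's own home has one. Inside the sandbox you would expect `workspace_writable` and `pi_state_present` to be True and the rest False.

```python
import os
from pathlib import Path


def sandbox_report(home: Path = Path.home()) -> dict[str, bool]:
    """Probe what this process can see; intended to run inside `pi shell`."""
    workspace = Path("/workspace")
    return {
        "workspace_writable": workspace.is_dir() and os.access(workspace, os.W_OK),
        "pi_state_present": (home / ".pi").is_dir(),
        "ssh_visible": (home / ".ssh").exists(),
        "aws_visible": (home / ".aws").exists(),
        "docker_socket_visible": Path("/var/run/docker.sock").exists(),
        # The image is built to run as a non-root user.
        "running_as_root": hasattr(os, "geteuid") and os.geteuid() == 0,
    }


if __name__ == "__main__":
    for check, value in sandbox_report().items():
        print(f"{check:24} {value}")
```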

If pi returns an empty assistant message, the proxy is probably on a revision without SSE passthrough — check section 11 and redeploy.

8. Using the OpenAI Agents SDK

The OpenAI Agents SDK drives multi-step agents with tool calling and streaming. This deployment uses the SDK's Chat Completions model class (OpenAIChatCompletionsModel), so agents hit the same POST /v1/chat/completions path as curl, pi, and the OpenAI Python SDK, with no proxy changes. Tracing to OpenAI's hosted observability is disabled because your key authenticates against this gateway, not api.openai.com.

8a. Review: what shipped

Everything below lives in the repo folder agents-harness/ next to proxy/. See section 13 for a file list.

Phase 1 — plain agents (no container)

Examples 01_hello_agent.py, 02_agent_with_tool.py, 03_streaming.py — prove Runner, @function_tool, and Runner.run_streamed against Gemini through the proxy.
harness/client.py: one AsyncOpenAI client whose base_url comes from AIAGGIES_BASE_URL (https://dev.aiaggies.net/v1, or the Cloud Run URL plus /v1), wrapped in OpenAIChatCompletionsModel for the aliases flash / pro-preview.
Proxy auth — the service accepts Authorization: Bearer <sk_live_…> (what the OpenAI SDK sends) as well as x-api-key, so Agents SDK code needs no custom headers.

Phase 2 — Docker sandbox harness

Examples 04_sandbox_docker.py, 05_shell_edit.py, 06_skills_sandbox.py — SandboxAgent + DockerSandboxClient, workspace manifest, python:3.14-slim.
Shell-only tools for Chat Completions — the SDK's default Filesystem capability registers apply_patch, which is a hosted tool shape and does not serialize on the Chat Completions code path. These examples use harness/chat_completions_sandbox.py (Shell + optional Skills) and edit files via exec_command. Full detail: agents-harness/PHASE2.md.
Compaction is omitted in those examples; it is aimed at the Responses API compaction channel and is not required for short demos.

8b. Flow — Phase 1 (Agent & Runner)

Your Python process calls Vertex Gemini through this proxy. Solid arrows: HTTP request; dashed: streamed text (when stream: true).

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'background': 'transparent',
    'primaryColor': '#152036',
    'primaryTextColor': '#e8eef4',
    'primaryBorderColor': '#7fb3ff',
    'secondaryColor': '#0f1626',
    'tertiaryColor': '#0a1120',
    'lineColor': '#7fb3ff',
    'clusterBkg': 'rgba(127,179,255,0.05)',
    'clusterBorder': 'rgba(127,179,255,0.35)',
    'fontFamily': 'Inter, system-ui, sans-serif',
    'fontSize': '14px'
  }
}}%%
flowchart TB
  subgraph dev ["Your machine · agents-harness"]
    EX["examples/01-03.py
Agent · Runner"]
    HC["harness/client.py
AsyncOpenAI · Bearer"]
    EX --> HC
  end
  subgraph cr ["Cloud Run"]
    PX["ai-proxy
validate key · map flash to Vertex ID"]
  end
  subgraph vtx ["Vertex AI"]
    GM["Gemini"]
  end
  HC ==>|"POST /v1/chat/completions"| PX
  PX ==> GM
  GM -. "tokens / JSON" .-> PX
  PX -. "SSE or JSON body" .-> HC

8c. Flow — Phase 2 (Docker sandbox)

Tool execution happens inside a disposable container; LLM calls still originate from your Python process on the host and use the same Bearer-authenticated path to the proxy as Phase 1.

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'background': 'transparent',
    'primaryColor': '#152036',
    'primaryTextColor': '#e8eef4',
    'primaryBorderColor': '#7fb3ff',
    'secondaryColor': '#0f1626',
    'tertiaryColor': '#0a1120',
    'lineColor': '#7fb3ff',
    'clusterBkg': 'rgba(127,179,255,0.05)',
    'clusterBorder': 'rgba(127,179,255,0.35)',
    'fontFamily': 'Inter, system-ui, sans-serif',
    'fontSize': '14px'
  }
}}%%
flowchart LR
  subgraph host ["Host OS"]
    PY["Python · examples/04-06
SandboxAgent · Runner"]
    CL["gemini_flash()
same client as Phase 1"]
    PY --> CL
  end
  subgraph dock ["Docker"]
    CTR["Container
/workspace · shell tools"]
  end
  subgraph gw ["Gateway + model"]
    PX2["ai-proxy"]
    GM2["Gemini"]
  end
  PY <-->|"exec_command · Skills"| CTR
  CL ==>|"HTTPS Bearer"| PX2
  PX2 ==> GM2
  GM2 -.-> PX2
  PX2 -.-> CL

8d. How to run

From the repo root, load the same env you use for curl (or define AIAGGIES_BASE_URL and AIAGGIES_API_KEY in agents-harness/.env).

cd ~/Development/vertex-ai-dev
set -a && . ./.env && set +a
cd agents-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt   # openai-agents[docker] + python-dotenv

# map to the names the harness expects (skip if agents-harness/.env already defines them)
export AIAGGIES_BASE_URL="$AI_API_BASE"
export AIAGGIES_API_KEY="$API_KEY"
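Before running the examples, a small preflight can catch the common misconfigurations: unset variables, a base URL without https:// or without /v1, or a key that is not an issued sk_live_ key. `preflight` is a hypothetical helper, not part of agents-harness/.

```python
import os
from typing import Mapping
from urllib.parse import urlparse


def preflight(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems with the harness env (empty list = looks OK)."""
    problems = []
    base = env.get("AIAGGIES_BASE_URL", "")
    key = env.get("AIAGGIES_API_KEY", "")
    if not base:
        problems.append("AIAGGIES_BASE_URL is not set")
    else:
        url = urlparse(base)
        if url.scheme != "https":
            problems.append(f"base URL should use https, got {url.scheme or 'no scheme'}")
        if not url.path.rstrip("/").endswith("/v1"):
            problems.append("base URL should end in /v1")
    if not key:
        problems.append("AIAGGIES_API_KEY is not set")
    elif not key.startswith("sk_live_"):
        problems.append("API key does not look like an issued sk_live_ key")
    return problems


if __name__ == "__main__":
    print("\n".join(preflight()) or "env looks OK")
```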

Scripts (run from agents-harness/ with venv active):

Script	Needs Docker	What it proves
`01_hello_agent.py`	no	Minimal `Runner.run`
`02_agent_with_tool.py`	no	Python `@function_tool` round-trip
`03_streaming.py`	no	SSE token stream via proxy
`04_sandbox_docker.py`	yes	Read workspace with shell
`05_shell_edit.py`	yes	Edit file via `exec_command`
`06_skills_sandbox.py`	yes	Inline `Skills` + shell

python examples/01_hello_agent.py
python examples/02_agent_with_tool.py
python examples/03_streaming.py
# Phase 2
python examples/04_sandbox_docker.py
python examples/05_shell_edit.py
python examples/06_skills_sandbox.py

8e. Code snippets (reference)

Shared wiring (`harness/client.py`)

import os

from agents import OpenAIChatCompletionsModel, set_tracing_disabled
from openai import AsyncOpenAI

# Traces would go to api.openai.com, which does not accept aiaggies keys.
set_tracing_disabled(True)

_client = AsyncOpenAI(
    base_url=os.environ["AIAGGIES_BASE_URL"],   # https://dev.aiaggies.net/v1 or <Cloud Run URL>/v1
    api_key=os.environ["AIAGGIES_API_KEY"],     # sk_live_…, sent as Authorization: Bearer
)

def gemini_flash():
    return OpenAIChatCompletionsModel(model="flash", openai_client=_client)

def gemini_pro():
    return OpenAIChatCompletionsModel(model="pro-preview", openai_client=_client)

Bare agent

import asyncio
from agents import Agent, Runner
from harness import gemini_flash

async def main():
    agent = Agent(
        name="Greeter",
        instructions="Reply in one short sentence.",
        model=gemini_flash(),
    )
    result = await Runner.run(agent, "Say hello and name the model you are.")
    print(result.final_output)

asyncio.run(main())

Tool-calling

Confirms that Gemini plans tool calls through the OpenAI-compat surface and that the proxy passes them through intact.

import asyncio

from agents import Agent, Runner, function_tool
from harness import gemini_flash

@function_tool
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""  # the docstring becomes the tool description
    return a + b

agent = Agent(
    name="Calculator",
    instructions="Use the `add` tool for any sum. After it returns, state the result briefly.",
    model=gemini_flash(),
    tools=[add],
)

async def main():
    result = await Runner.run(agent, "What is 2024 plus 1776?")
    print(result.final_output)   # e.g. "The sum is 3800."

asyncio.run(main())

Streaming

import asyncio

from agents import Agent, Runner
from harness import gemini_flash
from openai.types.responses import ResponseTextDeltaEvent

async def main():
    agent = Agent(name="Storyteller", instructions="Be vivid.", model=gemini_flash())
    result = Runner.run_streamed(agent, "Describe a rainy evening in three paragraphs.")

    # The SDK normalizes raw events to Responses-style types, even on the Chat Completions path.
    async for event in result.stream_events():
        if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
            print(event.data.delta, end="", flush=True)

asyncio.run(main())

8f. Notes & limitations

Tracing must be off. The SDK's default tracer posts to api.openai.com; your aiaggies key does not authenticate against that host. Always call set_tracing_disabled(True).
Model aliases only. Use "flash" or "pro-preview"; the proxy rewrites these to the full Vertex model IDs. Passing a raw Vertex ID returns a 400.
OpenAIChatCompletionsModel, not OpenAIResponsesModel. This proxy exposes Chat Completions; the Responses API is not forwarded. The two model shapes support different feature sets, so the SDK docs recommend picking one per workflow.
Sandbox harness: see sections 8a and 8c. If you stay on Chat Completions + Gemini, do not enable the default Filesystem capability; use Shell (plus Skills) instead. PHASE2.md explains why.

9. Managing API keys

Keys live in the API_KEYS_JSON Cloud Run env var as a JSON array. Each entry has an id (for logs), the key, and an enabled flag. No database required — revoke by flipping enabled to false (or removing the entry) and updating the service.

Format

[
  {"id": "raj-laptop", "key": "sk_live_xxxx...", "enabled": true},
  {"id": "raj-ci",     "key": "sk_live_yyyy...", "enabled": true}
]

Generate a new key

python3 -c 'import secrets; print("sk_live_" + secrets.token_urlsafe(32))'
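A sketch like this can issue a key and record it in the api-keys.json file that the update command below reads. `new_key` and `add_key` are hypothetical helpers that reuse the one-liner's recipe. api-keys.json holds live keys, so keep it out of version control.

```python
import json
import secrets
from pathlib import Path


def new_key() -> str:
    # Same recipe as the one-liner: 32 random bytes, URL-safe base64, no padding.
    return "sk_live_" + secrets.token_urlsafe(32)


def add_key(path: Path, key_id: str) -> str:
    """Append an enabled entry to api-keys.json and return the new key."""
    entries = json.loads(path.read_text()) if path.exists() else []
    if any(e["id"] == key_id for e in entries):
        raise ValueError(f"id {key_id!r} already exists")
    key = new_key()
    entries.append({"id": key_id, "key": key, "enabled": True})
    path.write_text(json.dumps(entries, indent=2) + "\n")
    return key
```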

Update keys without rebuilding the image

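# --update-env-vars touches only API_KEYS_JSON; --set-env-vars would remove every other env var.
# ^##^ switches gcloud's list delimiter so the commas inside the JSON survive.
gcloud run services update ai-proxy \
  --region=us-central1 --project=rajdphd-prep \
  --update-env-vars="^##^API_KEYS_JSON=$(jq -c . api-keys.json)"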

Env vars are visible to anyone with roles/run.viewer on the project. For real separation of duties, move API_KEYS_JSON into Secret Manager, grant ai-proxy-sa roles/secretmanager.secretAccessor on that secret, and read it at startup.
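Here is a minimal sketch of the Secret Manager variant. It assumes the google-cloud-secret-manager package is added to proxy/requirements.txt and that ai-proxy-sa holds roles/secretmanager.secretAccessor. `load_api_keys_json` and the `API_KEYS_SECRET` env var are hypothetical. When that variable is unset, the function falls back to the current plain env var.

```python
import os
from typing import Mapping


def load_api_keys_json(env: Mapping[str, str] = os.environ) -> str:
    """Return the API_KEYS_JSON text, from Secret Manager if configured.

    API_KEYS_SECRET (hypothetical) holds a full version resource name, e.g.
    projects/rajdphd-prep/secrets/<NAME>/versions/latest.
    """
    secret_name = env.get("API_KEYS_SECRET")
    if not secret_name:
        return env["API_KEYS_JSON"]  # current behaviour: plain env var
    # Imported lazily so the env-var path needs no extra dependency.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(request={"name": secret_name})
    return response.payload.data.decode("utf-8")
```

Alternatively, Cloud Run can inject a secret as an environment variable with `gcloud run services update ai-proxy --set-secrets=API_KEYS_JSON=api-keys:latest` (secret name hypothetical), which needs no code change. You would likely have to remove the plain API_KEYS_JSON env var first.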

10. GCP resources

Project: rajdphd-prep
Region: us-central1
Cloud Run service: ai-proxy
Service account: ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com
IAM role: roles/aiplatform.user
Domain mapping: dev.aiaggies.net → ai-proxy
DNS record: dev CNAME ghs.googlehosted.com.

APIs enabled: run.googleapis.com, aiplatform.googleapis.com, artifactregistry.googleapis.com, cloudbuild.googleapis.com.

11. Operations

Redeploy after a code change

cd ~/Development/vertex-ai-dev

gcloud run deploy ai-proxy \
  --source=./proxy \
  --region=us-central1 --project=rajdphd-prep \
  --service-account=ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com \
  --allow-unauthenticated \
  --min-instances=0 --max-instances=3 \
  --quiet

Read structured logs

gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="ai-proxy"' \
  --project=rajdphd-prep --limit=50 --format=json --freshness=30m
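The JSON output is easier to scan when summarized. The sketch below counts (status, path) pairs. It assumes Cloud Run request-log entries carry the standard LogEntry fields `httpRequest.status` and `httpRequest.requestUrl`. Entries without `httpRequest` (application stdout) are skipped. `summarize` and `print_summary` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize(entries: list[dict]) -> Counter:
    """Count (status, path) pairs across Cloud Run request-log entries."""
    counts: Counter = Counter()
    for entry in entries:
        req = entry.get("httpRequest")
        if not req:
            continue  # application logs carry no httpRequest
        path = urlparse(req.get("requestUrl", "")).path or "?"
        counts[(req.get("status"), path)] += 1
    return counts


def print_summary(entries: list[dict]) -> None:
    """Print the most frequent (status, path) pairs first."""
    for (status, path), n in summarize(entries).most_common():
        print(f"{n:5}  {status}  {path}")
```

Usage: save the command's output with `> logs.json`, then call `print_summary(json.load(open("logs.json")))`.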

Current revision & URL

gcloud run services describe ai-proxy \
  --region=us-central1 --project=rajdphd-prep \
  --format='value(status.latestReadyRevisionName,status.url)'

Roll back one revision

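# find <PREVIOUS_REVISION_NAME> with: gcloud run revisions list --service=ai-proxy --region=us-central1 --project=rajdphd-prep
gcloud run services update-traffic ai-proxy \
  --region=us-central1 --project=rajdphd-prep \
  --to-revisions=<PREVIOUS_REVISION_NAME>=100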

Check custom-domain cert status

gcloud beta run domain-mappings describe \
  --domain=dev.aiaggies.net \
  --region=us-central1 --project=rajdphd-prep \
  --format='value(status.conditions[].type,status.conditions[].status,status.conditions[].message)'

When CertificateProvisioned=True and Ready=True, HTTPS on the custom domain is live.

12. Cost

Cloud Run: scale-to-zero; the free tier covers ~2M requests / 180k vCPU-s / 360k GiB-s per month.
Artifact Registry: a few MB; pennies per month.
Cloud Build: only runs on deploy; free tier covers casual use.
Domain mapping: $0. No Load Balancer.
Cloud Logging: 50 GiB/mo ingest free.
Vertex AI tokens: pay per token at Google's published rates — the only real cost.

At single-user scale, expect about $0 per month for infrastructure; spend grows only with actual use.
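LLM calls keep a request open for seconds while Vertex streams tokens. Cloud Run bills instance time for that whole span, so request duration matters more than request count. Below is a rough upper-bound sketch against Cloud Run's published free-tier allowances: 2M requests, 180,000 vCPU-seconds, and 360,000 GiB-seconds per month. It assumes 1 vCPU and 0.5 GiB per instance and ignores the overlap that concurrency gives you. `free_tier_usage` is a hypothetical helper.

```python
# Cloud Run monthly free-tier allowances (request-based billing).
FREE_REQUESTS = 2_000_000
FREE_VCPU_SECONDS = 180_000
FREE_GIB_SECONDS = 360_000


def free_tier_usage(requests: int, avg_seconds: float,
                    vcpu: float = 1.0, memory_gib: float = 0.5) -> dict[str, float]:
    """Fraction of each free-tier allowance used in a month (1.0 = exhausted).

    Upper bound: every request is assumed to occupy an instance on its own for
    avg_seconds; vcpu / memory_gib default to an assumed 1 vCPU / 0.5 GiB.
    """
    busy_seconds = requests * avg_seconds
    return {
        "requests": requests / FREE_REQUESTS,
        "vcpu_seconds": busy_seconds * vcpu / FREE_VCPU_SECONDS,
        "gib_seconds": busy_seconds * memory_gib / FREE_GIB_SECONDS,
    }
```

For example, 3,000 chat requests a month averaging 20 seconds each use about a third of the vCPU allowance (`free_tier_usage(3000, 20)["vcpu_seconds"]` is 1/3).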

13. Files in the repo

proxy/main.py: FastAPI app: routes, auth, Vertex forwarding.
proxy/pages.py: Splash + this runbook HTML.
proxy/Dockerfile: Container used by gcloud run deploy --source.
proxy/requirements.txt: Python deps (FastAPI, httpx, google-auth, requests).
deploy.sh: One-shot idempotent deploy script.
SPEC.md: Design contract this implementation satisfies.
agents-harness/: OpenAI Agents SDK harness (harness/client.py, harness/chat_completions_sandbox.py, examples 01–06, PHASE2.md).
html/index.html: Local documentation site (runs via python3 serve.py).
.env: Local-only; holds AI_API_BASE and API_KEY.

14. Teardown

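Removes the Cloud Run service, domain mapping, IAM binding, and service account this project created. Vertex AI itself stays enabled. Not covered: the container images that gcloud run deploy --source pushed to Artifact Registry, and the dev CNAME at your DNS provider.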

gcloud beta run domain-mappings delete --domain=dev.aiaggies.net \
  --region=us-central1 --project=rajdphd-prep --quiet

gcloud run services delete ai-proxy \
  --region=us-central1 --project=rajdphd-prep --quiet

gcloud projects remove-iam-policy-binding rajdphd-prep \
  --member="serviceAccount:ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user" --condition=None --quiet

gcloud iam service-accounts delete ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com \
  --project=rajdphd-prep --quiet