Runbook & reference
Everything you need to operate, extend, and tear down this deployment. Share-safe — contains no API key material. See splash for the public face.
Architecture
End-to-end request flow from an OpenAI-compatible client (curl, the OpenAI SDKs, the pi coding agent, etc.) through this Cloud Run proxy to a Gemini model on Vertex AI. Solid arrows are the request path; dashed arrows are the streamed SSE response.
%%{init: {
'theme': 'base',
'themeVariables': {
'background': 'transparent',
'primaryColor': '#152036',
'primaryTextColor': '#e8eef4',
'primaryBorderColor': '#7fb3ff',
'secondaryColor': '#0f1626',
'tertiaryColor': '#0a1120',
'lineColor': '#7fb3ff',
'clusterBkg': 'rgba(127,179,255,0.05)',
'clusterBorder': 'rgba(127,179,255,0.35)',
'fontFamily': 'Inter, system-ui, sans-serif',
'fontSize': '14px'
}
}}%%
flowchart LR
C["OpenAI-compatible client
(curl · OpenAI SDK · pi agent)"]
subgraph GCP ["Google Cloud Project · rajdphd-prep"]
direction TB
R["Cloud Run
ai-proxy
FastAPI"]
V["Vertex AI
OpenAI-compat endpoint"]
M(["Gemini 2.5 Flash
Gemini 3.1 Pro Preview"])
end
C ==>|"POST /v1/chat/completions
x-api-key: sk_live_…"| R
R ==>|"validate key · rewrite alias
attach SA bearer"| V
V ==> M
M -. "SSE chunks" .-> V
V -. "SSE chunks" .-> R
R -. "SSE chunks" .-> C
1. What this is
A single Cloud Run service (ai-proxy) that fronts
Vertex AI's OpenAI-compatible Chat Completions endpoint. Its only jobs are:
- Terminate TLS on your own hostname (dev.aiaggies.net).
- Validate a caller-supplied x-api-key against an issued-key list.
- Rewrite short model aliases (flash) to the full Google model IDs.
- Attach a Google service-account bearer and forward to Vertex.
- Pass the upstream JSON back unchanged.
There is no Apigee, no database, no queue, no Load Balancer. The service scales to zero when idle; at single-developer usage the non-model bill is effectively $0.
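The forwarding step can be pictured as a pure function that takes the already-validated request body plus a service-account access token and returns what to send upstream. This is a minimal sketch, not the code in proxy/main.py: the helper name build_upstream_request is hypothetical, and the URL follows the shape Google documents for the Vertex AI OpenAI-compatible endpoint as best understood here, so verify it against current docs before relying on it.

```python
import json

def build_upstream_request(body: dict, access_token: str,
                           project: str = "rajdphd-prep",
                           region: str = "us-central1") -> dict:
    """Return URL, headers, and JSON payload for the Vertex call (hypothetical helper)."""
    # Assumed URL shape for Vertex AI's OpenAI-compatible Chat Completions endpoint.
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/endpoints/openapi/chat/completions"
    )
    headers = {
        # Only the service-account bearer is attached; dropping the caller's
        # own key at this point is an assumption of this sketch.
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return {"url": url, "headers": headers, "content": json.dumps(body)}

req = build_upstream_request(
    {"model": "google/gemini-2.5-flash",
     "messages": [{"role": "user", "content": "Say hi."}]},
    access_token="ya29.example-token",  # placeholder, not a real token
)
print(req["url"])
```

Keeping this step free of network calls makes it easy to unit-test separately from the HTTP client that actually sends the request.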
2. URLs & base paths
- Custom domain: https://dev.aiaggies.net
- Cloud Run URL: http://dev.aiaggies.net
- OpenAI SDK base_url: https://dev.aiaggies.net/v1
- Splash: /
- Docs (this page): /docs
- Health: /healthz
- Models: /v1/models
- Chat: /v1/chat/completions
While the managed TLS cert for dev.aiaggies.net is still issuing, call the
Cloud Run URL directly. Both hostnames reach the same service and behave identically.
3. Authentication
Every request to /v1/* must include either an
x-api-key header or a standard OpenAI-style
Authorization: Bearer header. Both map to the same issued-key
table; use whichever your client sends by default.
# pi, curl examples, anything explicit
x-api-key: sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# stock OpenAI SDKs, the OpenAI Agents SDK
Authorization: Bearer sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Missing, unknown, or disabled keys return 401. Keys are loaded from the
API_KEYS_JSON environment variable at startup.
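The rules above (two accepted headers, one key table, 401 on anything else) can be sketched in a few functions. This is an illustrative sketch rather than the proxy's actual code: load_keys, extract_key, and authenticate are hypothetical names, and the key-table format is the one described in section 9.

```python
import json
from typing import Mapping, Optional

def load_keys(api_keys_json: str) -> dict:
    """Map each enabled key to its id, from an API_KEYS_JSON-style array."""
    entries = json.loads(api_keys_json)
    return {e["key"]: e["id"] for e in entries if e.get("enabled", False)}

def extract_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the caller's key from x-api-key or Authorization: Bearer, else None."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("x-api-key"):
        return lowered["x-api-key"].strip()
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None

def authenticate(headers: Mapping[str, str], keys: dict) -> tuple:
    """Return (status_code, key_id); 401 for missing, unknown, or disabled keys."""
    key = extract_key(headers)
    if key is None or key not in keys:
        return 401, None
    return 200, keys[key]

# Placeholder keys for illustration only.
KEYS = load_keys(
    '[{"id": "raj-laptop", "key": "sk_live_aaa", "enabled": true},'
    ' {"id": "raj-ci", "key": "sk_live_bbb", "enabled": false}]'
)
print(authenticate({"x-api-key": "sk_live_aaa"}, KEYS))
print(authenticate({"Authorization": "Bearer sk_live_aaa"}, KEYS))
print(authenticate({"x-api-key": "sk_live_bbb"}, KEYS))  # disabled key
```

A production version would likely compare keys in constant time (for example with hmac.compare_digest) instead of a plain dictionary lookup.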
4. Endpoints
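- / · Public splash page. No auth.
- /docs · This page. No auth.
- /healthz · Liveness. Returns {"ok": true, "models": [...]}.
- /v1/models · OpenAI-compatible model list. Requires x-api-key (or Authorization: Bearer).
- /v1/chat/completions · OpenAI-compatible chat completion (POST). Requires x-api-key (or Authorization: Bearer).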
5. Model aliases
Clients use short aliases; the proxy rewrites them server-side before calling Vertex. This means you can change backing models without touching client code.
flash → google/gemini-2.5-flash
pro-preview → google/gemini-3.1-pro-preview
Controlled by MODEL_FLASH_ID and MODEL_PRO_PREVIEW_ID env vars.
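A minimal sketch of the server-side rewrite, assuming the two env vars simply override a built-in default mapping. The function name rewrite_model and the choice to raise ValueError for unknown names are illustrative; the real proxy may structure this differently.

```python
import os

# Defaults mirror the alias mapping in this section; env var names come from this runbook.
ALIASES = {
    "flash": os.environ.get("MODEL_FLASH_ID", "google/gemini-2.5-flash"),
    "pro-preview": os.environ.get("MODEL_PRO_PREVIEW_ID", "google/gemini-3.1-pro-preview"),
}

def rewrite_model(body: dict, aliases: dict = ALIASES) -> dict:
    """Return a copy of the request body with the alias replaced by the full model ID."""
    alias = body.get("model")
    if alias not in aliases:
        # Section 8f notes that non-alias model names are rejected with a 400;
        # a caller of this helper would translate the exception into that status.
        raise ValueError(f"unknown model alias: {alias!r}")
    return {**body, "model": aliases[alias]}

out = rewrite_model({"model": "flash", "messages": []})
print(out["model"])
```

Because the function returns a copy, the original client payload stays available for logging under its short alias.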
6. Quick start
.env
# replace with your issued key
AI_API_BASE=https://dev.aiaggies.net/v1
API_KEY=sk_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
curl
source .env
# list models
curl -sS "$AI_API_BASE/models" -H "x-api-key: $API_KEY" | jq
# chat completion
curl -sS -X POST "$AI_API_BASE/chat/completions" \
  -H "content-type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{"model":"flash","messages":[{"role":"user","content":"Say hi."}]}' | jq
Python (OpenAI SDK)
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["AI_API_BASE"],
    api_key=os.environ["API_KEY"],
)

# list
for m in client.models.list().data:
    print(m.id, "->", getattr(m, "vertex_id", ""))

# chat
resp = client.chat.completions.create(
    model="flash",
    messages=[{"role": "user", "content": "Say hi."}],
)
print(resp.choices[0].message.content)
JavaScript / Node (OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.AI_API_BASE,
  apiKey: process.env.API_KEY,
});

const resp = await client.chat.completions.create({
  model: "flash",
  messages: [{ role: "user", content: "Say hi." }],
});
console.log(resp.choices[0].message.content);
7. Using pi as a sandboxed client
pi is a minimal terminal coding agent that speaks the OpenAI Chat Completions wire format, so it plugs directly into this proxy. The pi author recommends running it inside a container — there are no permission prompts by design. The setup below keeps pi isolated from everything on the host except the project directory you launch it from.
What was set up
- A throwaway Docker image (pi-sandbox:latest) built from ~/Development/pi-sandbox/Dockerfile: node:22-bookworm-slim + @mariozechner/pi-coding-agent + git, ripgrep, jq, python3, curl. Runs as a non-root user.
- A host-side wrapper at ~/Development/pi-sandbox/pi that runs docker run with exactly two bind mounts: the current directory as /workspace and ~/.pi-sandbox as the container's ~/.pi (so sessions, auth, and installed pi packages persist). Nothing else from $HOME is visible to pi.
- A custom provider declared in ~/.pi-sandbox/agent/models.json pointing at this Cloud Run service. The wrapper forwards AI_API_BASE and API_KEY from the shell, and the provider config tells pi to send x-api-key (Bearer also works if you switch pi to that style).
- A streaming-SSE passthrough added to /v1/chat/completions on this proxy. pi defaults to stream: true; without passthrough the proxy would buffer the SSE response and return an empty completion. See section 11 (Operations) for the revision that shipped this.
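To make the passthrough idea concrete, the sketch below relays upstream chunks untouched and, separately, shows how a client reassembles text from the Chat Completions SSE events. It is a simplified illustration using simulated bytes, not the proxy's implementation; relay_sse and collect_text are hypothetical names.

```python
import json
from typing import Iterable, Iterator

def relay_sse(upstream_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Passthrough: yield each upstream chunk as soon as it arrives, unmodified."""
    for chunk in upstream_chunks:
        yield chunk

def collect_text(sse_bytes: bytes) -> str:
    """Reassemble assistant text from Chat Completions SSE events (client-side view)."""
    parts = []
    for line in sse_bytes.decode("utf-8").splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)

# Simulated upstream stream delivered in two chunks.
fake_upstream = [
    b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n',
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n\ndata: [DONE]\n\n',
]
relayed = b"".join(relay_sse(fake_upstream))
print(collect_text(relayed))
```

In a FastAPI service, a generator like relay_sse would typically be wrapped in a StreamingResponse with media type text/event-stream, so that chunks reach the client as they arrive instead of being buffered.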
~/.pi-sandbox/agent/models.json
This file makes pi aware of the aiaggies provider and its two
models. "API_KEY" is an env var name — pi resolves it
against the container env at request time, so no key is stored on disk.
{
  "providers": {
    "aiaggies": {
      "baseUrl": "https://dev.aiaggies.net/v1",
      "api": "openai-completions",
      "apiKey": "API_KEY",
      "headers": { "x-api-key": "API_KEY" },
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false,
        "maxTokensField": "max_tokens"
      },
      "models": [
        {
          "id": "flash",
          "name": "Gemini 2.5 Flash (via aiaggies)",
          "reasoning": false,
          "input": ["text", "image"],
          "contextWindow": 1048576,
          "maxTokens": 65536,
          "cost": { "input": 0.075, "output": 0.30, "cacheRead": 0.01875, "cacheWrite": 0 }
        },
        {
          "id": "pro-preview",
          "name": "Gemini 3.1 Pro Preview (via aiaggies)",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 1048576,
          "maxTokens": 65536,
          "cost": { "input": 1.25, "output": 10.0, "cacheRead": 0.3125, "cacheWrite": 0 }
        }
      ]
    }
  }
}
Launching pi
# 1. load AI_API_BASE + API_KEY into the shell so the wrapper forwards them
cd ~/Development/vertex-ai-dev
set -a && . ./.env && set +a
# 2. interactive TUI, isolated to the current directory
~/Development/pi-sandbox/pi --provider aiaggies --model flash
# non-interactive one-shot
~/Development/pi-sandbox/pi -p "summarize the repo" --provider aiaggies --model flash
# list configured models (confirms aiaggies/flash + pro-preview are registered)
~/Development/pi-sandbox/pi --list-models
# drop into a shell inside the sandbox
~/Development/pi-sandbox/pi shell
# rebuild the image when upgrading pi itself
~/Development/pi-sandbox/pi rebuild
To invoke the wrapper as just pi, symlink it onto your PATH:
ln -s ~/Development/pi-sandbox/pi /usr/local/bin/pi
Sandbox reach
- Current dir ($(pwd)) → /workspace · read/write
- ~/.pi-sandbox → ~/.pi in the container · persistent state
- ~/.ssh, ~/.aws, rest of $HOME: not mounted · unreachable
- Docker socket: not mounted · no container escape
- Network: enabled · required for LLM API calls
If pi returns an empty assistant message, the proxy is probably on a revision without SSE passthrough — check section 11 and redeploy.
8. Using the OpenAI Agents SDK
The OpenAI Agents
SDK drives multi-step agents with tool calling and streaming. This
deployment uses its Chat Completions model shape
(OpenAIChatCompletionsModel) so the same POST /v1/chat/completions
path as curl, pi, and the OpenAI Python SDK works unchanged. Tracing to
OpenAI's hosted observability is disabled because your key is for this
gateway, not api.openai.com.
8a. Review: what shipped
Everything below lives in the repo folder agents-harness/
next to proxy/. See section 13 for a file list.
Phase 1 — plain agents (no container)
- Examples 01_hello_agent.py, 02_agent_with_tool.py, 03_streaming.py: prove Runner, @function_tool, and Runner.run_streamed against Gemini through the proxy.
- harness/client.py: one AsyncOpenAI client with base_url set to the proxy's /v1 path (the Cloud Run URL or your custom domain) and OpenAIChatCompletionsModel for aliases flash / pro-preview.
- Proxy auth: the service accepts Authorization: Bearer <sk_live_…> (what the OpenAI SDK sends) as well as x-api-key, so Agents SDK code needs no custom headers.
Phase 2 — Docker sandbox harness
- Examples 04_sandbox_docker.py, 05_shell_edit.py, 06_skills_sandbox.py: SandboxAgent + DockerSandboxClient, workspace manifest, python:3.14-slim.
- Shell-only tools for Chat Completions: the SDK's default Filesystem capability registers apply_patch, which is a hosted tool shape and does not serialize on the Chat Completions code path. These examples use harness/chat_completions_sandbox.py (Shell + optional Skills) and edit files via exec_command. Full detail: agents-harness/PHASE2.md.
- Compaction is omitted in those examples; it is aimed at the Responses API compaction channel and is not required for short demos.
8b. Flow — Phase 1 (Agent & Runner)
Your Python process calls Vertex Gemini through this proxy. Solid
arrows: HTTP request; dashed: streamed text (when stream: true).
%%{init: {
'theme': 'base',
'themeVariables': {
'background': 'transparent',
'primaryColor': '#152036',
'primaryTextColor': '#e8eef4',
'primaryBorderColor': '#7fb3ff',
'secondaryColor': '#0f1626',
'tertiaryColor': '#0a1120',
'lineColor': '#7fb3ff',
'clusterBkg': 'rgba(127,179,255,0.05)',
'clusterBorder': 'rgba(127,179,255,0.35)',
'fontFamily': 'Inter, system-ui, sans-serif',
'fontSize': '14px'
}
}}%%
flowchart TB
subgraph dev ["Your machine · agents-harness"]
EX["examples/01-03.py
Agent · Runner"]
HC["harness/client.py
AsyncOpenAI · Bearer"]
EX --> HC
end
subgraph cr ["Cloud Run"]
PX["ai-proxy
validate key · map flash to Vertex ID"]
end
subgraph vtx ["Vertex AI"]
GM["Gemini"]
end
HC ==>|"POST /v1/chat/completions"| PX
PX ==> GM
GM -. "tokens / JSON" .-> PX
PX -. "SSE or JSON body" .-> HC
8c. Flow — Phase 2 (Docker sandbox)
Tool execution happens inside a disposable container; LLM calls still originate from your Python process on the host and use the same Bearer-authenticated path to the proxy as Phase 1.
%%{init: {
'theme': 'base',
'themeVariables': {
'background': 'transparent',
'primaryColor': '#152036',
'primaryTextColor': '#e8eef4',
'primaryBorderColor': '#7fb3ff',
'secondaryColor': '#0f1626',
'tertiaryColor': '#0a1120',
'lineColor': '#7fb3ff',
'clusterBkg': 'rgba(127,179,255,0.05)',
'clusterBorder': 'rgba(127,179,255,0.35)',
'fontFamily': 'Inter, system-ui, sans-serif',
'fontSize': '14px'
}
}}%%
flowchart LR
subgraph host ["Host OS"]
PY["Python · examples/04-06
SandboxAgent · Runner"]
CL["gemini_flash()
same client as Phase 1"]
PY --> CL
end
subgraph dock ["Docker"]
CTR["Container
/workspace · shell tools"]
end
subgraph gw ["Gateway + model"]
PX2["ai-proxy"]
GM2["Gemini"]
end
PY <-->|"exec_command · Skills"| CTR
CL ==>|"HTTPS Bearer"| PX2
PX2 ==> GM2
GM2 -.-> PX2
PX2 -.-> CL
8d. How to run
From the repo root, load the same env you use for curl (or define
AIAGGIES_BASE_URL and AIAGGIES_API_KEY in
agents-harness/.env).
cd ~/Development/vertex-ai-dev
set -a && . ./.env && set +a
cd agents-harness
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt # openai-agents[docker] + python-dotenv
# optional: map names the harness expects
export AIAGGIES_BASE_URL="$AI_API_BASE"
export AIAGGIES_API_KEY="$API_KEY"
Scripts (run from agents-harness/ with venv active):
| Script | Needs Docker | What it proves |
|---|---|---|
| 01_hello_agent.py | no | Minimal Runner.run |
| 02_agent_with_tool.py | no | Python @function_tool round-trip |
| 03_streaming.py | no | SSE token stream via proxy |
| 04_sandbox_docker.py | yes | Read workspace with shell |
| 05_shell_edit.py | yes | Edit file via exec_command |
| 06_skills_sandbox.py | yes | Inline Skills + shell |
python examples/01_hello_agent.py
python examples/02_agent_with_tool.py
python examples/03_streaming.py
# Phase 2
python examples/04_sandbox_docker.py
python examples/05_shell_edit.py
python examples/06_skills_sandbox.py
8e. Code snippets (reference)
Shared wiring (harness/client.py)
import os

from agents import AsyncOpenAI, OpenAIChatCompletionsModel, set_tracing_disabled

# The default tracer posts to api.openai.com, which this gateway key cannot use.
set_tracing_disabled(True)

_client = AsyncOpenAI(
    base_url=os.environ["AIAGGIES_BASE_URL"],  # e.g. https://dev.aiaggies.net/v1
    api_key=os.environ["AIAGGIES_API_KEY"],    # sk_live_…
)

def gemini_flash():
    return OpenAIChatCompletionsModel(model="flash", openai_client=_client)

def gemini_pro():
    return OpenAIChatCompletionsModel(model="pro-preview", openai_client=_client)
Bare agent
import asyncio

from agents import Agent, Runner
from harness import gemini_flash

async def main():
    agent = Agent(
        name="Greeter",
        instructions="Reply in one short sentence.",
        model=gemini_flash(),
    )
    result = await Runner.run(agent, "Say hello and name the model you are.")
    print(result.final_output)

asyncio.run(main())
Tool-calling
Confirms that Gemini plans tool calls through the OpenAI-compat surface and that our proxy serializes them correctly.
import asyncio

from agents import Agent, Runner, function_tool
from harness import gemini_flash

@function_tool
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

agent = Agent(
    name="Calculator",
    instructions="Use the `add` tool for any sum. After it returns, state the result briefly.",
    model=gemini_flash(),
    tools=[add],
)

async def main():
    # Runner.run is a coroutine, so it must be awaited inside an async function.
    result = await Runner.run(agent, "What is 2024 plus 1776?")
    print(result.final_output)  # e.g. "The sum is 3800."

asyncio.run(main())
Streaming
import asyncio
from agents import Agent, Runner
from harness import gemini_flash
from openai.types.responses import ResponseTextDeltaEvent

agent = Agent(name="Storyteller", instructions="Be vivid.", model=gemini_flash())

async def main():
    result = Runner.run_streamed(agent, "Describe a rainy evening in three paragraphs.")
    # Print text deltas as they arrive; other event types are ignored.
    async for event in result.stream_events():
        if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
            print(event.data.delta, end="", flush=True)

asyncio.run(main())
8f. Notes & limitations
- Tracing must be off. The SDK's default tracer posts to api.openai.com; your aiaggies key does not authenticate against that host. Always call set_tracing_disabled(True).
- Model aliases only. Use "flash" or "pro-preview"; the proxy rewrites these to the full Vertex model IDs. Passing a raw Vertex ID returns a 400.
- OpenAIChatCompletionsModel, not OpenAIResponsesModel. This proxy exposes Chat Completions; the Responses API is not forwarded. The two model shapes support different feature sets, so the SDK docs recommend picking one per workflow.
- Sandbox harness: see sections 8a and 8c. Do not enable the default Filesystem capability if you stay on Chat Completions + Gemini; use Shell (+ Skills), and see PHASE2.md for the rationale.
9. Managing API keys
Keys live in the API_KEYS_JSON Cloud Run env var as a JSON array. Each entry
has an id (for logs), the key, and an enabled flag.
No database required — revoke by flipping enabled to false
(or removing the entry) and updating the service.
Format
[
{"id": "raj-laptop", "key": "sk_live_xxxx...", "enabled": true},
{"id": "raj-ci", "key": "sk_live_yyyy...", "enabled": true}
]
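Revoking a key is just an edit to this array, so it can be scripted. The sketch below flips enabled to false for one id and emits compact single-line JSON suitable for the API_KEYS_JSON value. The helper name revoke and the sample keys are hypothetical; the result still has to be pushed with the gcloud update command in this section.

```python
import json

def revoke(keys_json: str, key_id: str) -> str:
    """Return compact JSON with the entry named key_id disabled (hypothetical helper)."""
    entries = json.loads(keys_json)
    if not any(e["id"] == key_id for e in entries):
        raise KeyError(f"no key with id {key_id!r}")
    for e in entries:
        if e["id"] == key_id:
            e["enabled"] = False
    # Compact separators keep the value on one line, similar to `jq -c .`.
    return json.dumps(entries, separators=(",", ":"))

before = (
    '[{"id":"raj-laptop","key":"sk_live_aaa","enabled":true},'
    '{"id":"raj-ci","key":"sk_live_bbb","enabled":true}]'
)
after = revoke(before, "raj-ci")
print(after)
```

Disabling the entry instead of deleting it keeps the id around, which can help when matching older log lines to a key.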
Generate a new key
python3 -c 'import secrets; print("sk_live_" + secrets.token_urlsafe(32))'
Update keys without rebuilding the image
gcloud run services update ai-proxy \
  --region=us-central1 --project=rajdphd-prep \
  --set-env-vars="^##^API_KEYS_JSON=$(cat api-keys.json | jq -c .)"
Env vars are visible to anyone with roles/run.viewer on the project. For real
separation of duties, move API_KEYS_JSON into Secret Manager and read it at
startup with roles/secretmanager.secretAccessor.
10. GCP resources
- Project: rajdphd-prep
- Region: us-central1
- Cloud Run service: ai-proxy
- Service account: ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com
- IAM role: roles/aiplatform.user
- Domain mapping: dev.aiaggies.net → ai-proxy
- DNS record: dev CNAME ghs.googlehosted.com.
APIs enabled: run.googleapis.com, aiplatform.googleapis.com,
artifactregistry.googleapis.com, cloudbuild.googleapis.com.
11. Operations
Redeploy after a code change
cd ~/Development/vertex-ai-dev
gcloud run deploy ai-proxy \
  --source=./proxy \
  --region=us-central1 --project=rajdphd-prep \
  --service-account=ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com \
  --allow-unauthenticated \
  --min-instances=0 --max-instances=3 \
  --quiet
Read structured logs
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="ai-proxy"' \
  --project=rajdphd-prep --limit=50 --format=json --freshness=30m
Current revision & traffic
gcloud run services describe ai-proxy \
  --region=us-central1 --project=rajdphd-prep \
  --format='value(status.latestReadyRevisionName,status.url)'
Roll back one revision
gcloud run services update-traffic ai-proxy \
  --region=us-central1 --project=rajdphd-prep \
  --to-revisions=<PREVIOUS_REVISION_NAME>=100
Check custom-domain cert status
gcloud beta run domain-mappings describe \
  --domain=dev.aiaggies.net \
  --region=us-central1 --project=rajdphd-prep \
  --format='value(status.conditions[].type,status.conditions[].status,status.conditions[].message)'
When CertificateProvisioned=True and Ready=True, HTTPS on the
custom domain is live.
12. Cost
- Cloud Run: scale-to-zero; free tier covers ~2M requests / 180k vCPU-s / 360k GiB-s per month.
- Artifact Registry: a few MB; pennies per month.
- Cloud Build: only runs on deploy; free tier covers casual use.
- Domain mapping: $0. No Load Balancer.
- Cloud Logging: 50 GiB/mo ingest free.
- Vertex AI tokens: pay per token at Google's published rates — the only real cost.
At single-user scale, expect $0 per month for infra. Cost scales linearly with actual use only.
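Since model tokens are the only meaningful cost, a rough estimate is simple arithmetic. The sketch below assumes rates are quoted in USD per million tokens and reuses the illustrative figures from the "flash" entry of the pi models.json in section 7. Those figures are not authoritative prices, so check Google's published rates before budgeting. The function name and the example workload are hypothetical.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Estimated USD cost, with rates in USD per 1,000,000 tokens (assumed unit)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Illustrative rates copied from the "flash" entry in models.json (section 7).
FLASH_INPUT_RATE = 0.075
FLASH_OUTPUT_RATE = 0.30

# Hypothetical workload: 200 requests, each ~2,000 input and ~500 output tokens.
monthly = estimate_cost(200 * 2_000, 200 * 500, FLASH_INPUT_RATE, FLASH_OUTPUT_RATE)
print(f"${monthly:.4f}")
```

Under these assumptions the workload comes to about six cents, which fits the claim that cost tracks actual use.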
13. Files in the repo
- proxy/main.py: FastAPI app (routes, auth, Vertex forwarding).
- proxy/pages.py: Splash + this runbook HTML.
- proxy/Dockerfile: Container used by gcloud run deploy --source.
- proxy/requirements.txt: Python deps (FastAPI, httpx, google-auth, requests).
- deploy.sh: One-shot idempotent deploy script.
- SPEC.md: Design contract this implementation satisfies.
- agents-harness/: OpenAI Agents SDK harness with harness/client.py, examples 01–06, PHASE2.md, chat_completions_sandbox.py.
- html/index.html: Local documentation site (runs via python3 serve.py).
- .env: Local-only; holds AI_API_BASE and API_KEY.
14. Teardown
Removes all GCP resources this project created. Vertex AI itself stays enabled.
gcloud beta run domain-mappings delete --domain=dev.aiaggies.net \
  --region=us-central1 --project=rajdphd-prep --quiet
gcloud run services delete ai-proxy \
  --region=us-central1 --project=rajdphd-prep --quiet
gcloud projects remove-iam-policy-binding rajdphd-prep \
  --member="serviceAccount:ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user" --condition=None --quiet
gcloud iam service-accounts delete ai-proxy-sa@rajdphd-prep.iam.gserviceaccount.com \
  --project=rajdphd-prep --quiet