Per-Agent Pod Architecture Plan
Issue: TEC-62
Goal: Eliminate single-pod process-sharing for claude_local agents; replace it with one K8s Job per run.
Current State (already done)
| Item | Status |
|---|---|
| Fork paperclipai/paperclip → craigedmunds/paperclip | ✅ |
| integration branch created and active | ✅ |
| upstream remote set to paperclipai/paperclip | ✅ |
| opencode_remote adapter added to integration branch | ✅ |
| images.yaml enrollment updated to craigedmunds/paperclip@integration | ✅ |
Remaining Work
1. Regenerate image-factory CDK8s manifests
The generated `dist/cdk8s/image-factory.k8s.yaml` still references `paperclipai/paperclip@master`.
Run CDK8s synthesis to pick up the `images.yaml` changes and point the Kargo Warehouse at the correct repo/branch.
Steps:
1. `cd repos/image-factory/cdk8s && python main.py` (or equivalent task)
2. Commit the updated `dist/cdk8s/image-factory.k8s.yaml` to image-factory-state
3. Push → ArgoCD syncs → Kargo Warehouse watches `craigedmunds/paperclip@integration`

Files: `repos/image-factory-state/dist/cdk8s/image-factory.k8s.yaml`
2. claude_k8s adapter
New adapter package: packages/adapters/claude-k8s/
Interface: Implements ServerAdapterModule (same as claude_local):
- `execute(ctx: AdapterExecutionContext): Promise<AdapterExecutionResult>`
- `testEnvironment(ctx): Promise<AdapterEnvironmentTestResult>`
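The adapter surface can be sketched as below. The field names inside `AdapterExecutionContext` and the two result types are assumptions extrapolated from claude_local, not confirmed signatures:

```typescript
// Sketch of the interface claude_k8s must implement. Field names are
// illustrative assumptions, not the real Paperclip type definitions.
interface AdapterExecutionContext {
  runId: string;
  agentId: string;
  prompt: string;
  env: Record<string, string>;        // PAPERCLIP_* vars passed to the run
  onSpawn?: (handle: string) => void; // local PID today; Job/pod name on K8s
}

interface AdapterExecutionResult {
  exitCode: number;
  output: string;
}

interface AdapterEnvironmentTestResult {
  ok: boolean;
  message?: string;
}

interface ServerAdapterModule {
  execute(ctx: AdapterExecutionContext): Promise<AdapterExecutionResult>;
  testEnvironment(ctx: AdapterExecutionContext): Promise<AdapterEnvironmentTestResult>;
}
```

Because both adapters implement the same interface, the server can swap claude_local for claude_k8s purely via agent config.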
Execution flow (replaces `runChildProcess` with a K8s Job):

    execute() called
      → Build Job spec:
          name: paperclip-run-{runId}
          namespace: <config.namespace>
          serviceAccountName: <config.serviceAccount>
          image: <config.image>              # same claude image as claude_local
          command: ["claude", "--print", "-", "--output-format", "stream-json", ...]
          env: PAPERCLIP_* vars + ANTHROPIC_API_KEY (from secret ref)
          resources: config.resources        # cpu/memory limits per agent
          volumeMounts:
            - name: workspace
              mountPath: /workspace
              subPath: workspaces/<agentId>  # isolation per agent
      → Create Job via @kubernetes/client-node BatchV1Api
      → Stream logs from pod stdout/stderr via CoreV1Api log streaming
      → Parse streaming JSON (reuse claude-local parse.ts)
      → Delete Job on completion
      → Return AdapterExecutionResult
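The spec-building step above can be sketched as a pure function, which keeps it testable without a cluster; `ClaudeK8sConfig` mirrors the config fields listed below, and the create/stream/delete steps would go through `@kubernetes/client-node` (`BatchV1Api` for Job create/delete, `Log` for pod log streaming). Treat this as a sketch under those assumptions, not the final implementation:

```typescript
// Sketch: build the batch/v1 Job manifest for one run. Pure function — the
// actual API calls (createNamespacedJob, log streaming, deleteNamespacedJob)
// live in execute() and are omitted here.
interface ClaudeK8sConfig {
  namespace: string;
  image: string;
  serviceAccount: string;
  pvcName: string;
  resources: { requests: Record<string, string>; limits: Record<string, string> };
}

function buildJobSpec(cfg: ClaudeK8sConfig, runId: string, agentId: string) {
  return {
    apiVersion: "batch/v1",
    kind: "Job",
    metadata: { name: `paperclip-run-${runId}`, namespace: cfg.namespace },
    spec: {
      backoffLimit: 0,              // one attempt per run; retries are Paperclip's job
      ttlSecondsAfterFinished: 300, // let K8s garbage-collect finished Jobs
      template: {
        spec: {
          restartPolicy: "Never",
          serviceAccountName: cfg.serviceAccount,
          containers: [{
            name: "claude",
            image: cfg.image,
            command: ["claude", "--print", "-", "--output-format", "stream-json"],
            resources: cfg.resources,
            env: [{
              name: "ANTHROPIC_API_KEY",
              valueFrom: { secretKeyRef: { name: "claude-api-key", key: "ANTHROPIC_API_KEY" } },
            }],
            volumeMounts: [{
              name: "workspace",
              mountPath: "/workspace",
              subPath: `workspaces/${agentId}`, // per-agent isolation on the shared PVC
            }],
          }],
          volumes: [{ name: "workspace", persistentVolumeClaim: { claimName: cfg.pvcName } }],
        },
      },
    },
  };
}
```

Setting `backoffLimit: 0` and `restartPolicy: Never` keeps run semantics identical to the local child process: one attempt, with failure handling left to the server.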
Adapter config fields:

    namespace: paperclip-agents            # K8s namespace for Jobs
    image: ghcr.io/craigedmunds/paperclip:latest
    serviceAccount: paperclip-agent-runner
    resources:
      requests: { cpu: "500m", memory: "2Gi" }
      limits: { cpu: "2", memory: "4Gi" }
    pvcName: paperclip-agent-workspace     # shared RWX PVC
    graceSec: 30

Workspace isolation:
- Shared PVC mounted at `/workspace` with `subPath: workspaces/<agentId>/`
- No git credentials in pods — all git ops go through the Paperclip API (push/pull handled by the server)
- Session state persisted to the PVC subpath across runs
Registration:
- Add to `server/src/adapters/registry.ts` alongside claude_local
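The registration mechanics depend on how `server/src/adapters/registry.ts` is actually structured; assuming it exports a map of adapter id → module (a guess, not a confirmed shape), the addition is roughly one line:

```typescript
// Hypothetical sketch of server/src/adapters/registry.ts — the real module's
// export style is not confirmed; object stubs stand in for the adapter modules.
const claudeLocalAdapter = { id: "claude_local" }; // existing adapter (stand-in)
const claudeK8sAdapter = { id: "claude_k8s" };     // new adapter (stand-in)

const adapters: Record<string, { id: string }> = {
  claude_local: claudeLocalAdapter,
  claude_k8s: claudeK8sAdapter, // <-- the registration this step adds
};
```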
3. K8s infrastructure for agent pods
New manifests in repos/k8s-lab/ or as part of the Paperclip app:
- Namespace: `paperclip-agents`
- ServiceAccount: `paperclip-agent-runner` with minimal RBAC (read secrets, write to workspace PVC)
- PVC: `paperclip-agent-workspace` — RWX (shared across Jobs), large enough (e.g. 50Gi)
- Secret: `claude-api-key` — ANTHROPIC_API_KEY for agent pods
- NetworkPolicy: agents can only reach the Paperclip API + external APIs, not internal cluster services
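Under those requirements, the core manifests might look like the following sketch; resource names match the plan above, while the storage size and any class/selector details are placeholders to adjust for the cluster:

```yaml
# Sketch only — names match the plan; storage size/class are placeholders.
# Secret and NetworkPolicy manifests omitted for brevity.
apiVersion: v1
kind: Namespace
metadata:
  name: paperclip-agents
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: paperclip-agent-runner
  namespace: paperclip-agents
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperclip-agent-workspace
  namespace: paperclip-agents
spec:
  accessModes: ["ReadWriteMany"]  # RWX so concurrent Jobs can share it
  resources:
    requests:
      storage: 50Gi
```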
4. Image factory .builders/ per adapter
In the integration branch of craigedmunds/paperclip, add builder directories:
- `.builders/claude-k8s/` — Dockerfile for the claude_k8s image (claude CLI + Node runtime)
- `.builders/opencode-remote/` — if a dedicated image is needed
These feed into the image-factory pipeline (CDK8s dockerfile config per builder).
Implementation Order
- CDK8s regen (quick, unblocks Kargo pipeline) → PR to image-factory-state
- claude_k8s adapter (core feature) → PR to craigedmunds/paperclip@integration
- K8s infra manifests (namespace, SA, PVC, RBAC) → PR to k8s-lab
- Image factory builder dirs → PR to craigedmunds/paperclip@integration
- Wire it up — update Paperclip agent config in cluster to use claude_k8s
Notes
- The `integration` branch name is confirmed (board requested `release` or `integration`, not `custom-adapters`)
- Merge discipline: `upstream/master → origin/master → origin/integration` (cherry-pick upstream releases)
- Per-run Job naming uses `runId` to avoid conflicts; TTL or manual delete on completion
- The `onSpawn` callback in the execute context can be used to report the Job/pod name instead of a local PID
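One caveat on the naming note: Job names feed into pod names and labels, which must satisfy DNS-1123 label rules (lowercase alphanumerics and hyphens, at most 63 characters, alphanumeric at each end), so if `runId` values can contain other characters the adapter may need a small sanitizer. A possible sketch:

```typescript
// Build a K8s-safe Job name from a run id: lowercase, replace illegal
// characters with '-', and keep within the 63-char DNS-1123 label limit.
function jobNameForRun(runId: string): string {
  const safe = runId
    .toLowerCase()
    .replace(/[^a-z0-9-]/g, "-") // replace anything outside [a-z0-9-]
    .replace(/^-+|-+$/g, "");    // labels cannot start or end with '-'
  return `paperclip-run-${safe}`.slice(0, 63).replace(/-+$/g, "");
}
```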