Per-Agent Pod Architecture Plan

Issue: TEC-62

Goal: Eliminate single-pod process sharing for claude_local agents and replace it with one K8s Job per run.


Current State (already done)

Item                         Status
Fork paperclipai/paperclip   craigedmunds/paperclip
integration branch           created and active
upstream remote              set to paperclipai/paperclip
opencode_remote adapter      added to integration branch
images.yaml enrollment       updated to craigedmunds/paperclip@integration

Remaining Work

1. Regenerate image-factory CDK8s manifests

The generated dist/cdk8s/image-factory.k8s.yaml still references paperclipai/paperclip@master. Run CDK8s synthesis to pick up the images.yaml changes and point the Kargo Warehouse at the correct repo/branch.

Steps:

  • cd repos/image-factory/cdk8s && python main.py (or equivalent task)
  • Commit the updated dist/cdk8s/image-factory.k8s.yaml to image-factory-state
  • Push → ArgoCD syncs → Kargo Warehouse watches craigedmunds/paperclip@integration

Files: repos/image-factory-state/dist/cdk8s/image-factory.k8s.yaml


2. claude_k8s adapter

New adapter package: packages/adapters/claude-k8s/

Interface: Implements ServerAdapterModule (same as claude_local):

  • execute(ctx: AdapterExecutionContext): Promise<AdapterExecutionResult>
  • testEnvironment(ctx): Promise<AdapterEnvironmentTestResult>

Execution flow (replaces runChildProcess with K8s Job):

execute() called
  → Build Job spec:
      name: paperclip-run-{runId}
      namespace: <config.namespace>
      serviceAccountName: <config.serviceAccount>
      image: <config.image>  # same claude image as claude_local
      command: ["claude", "--print", "-", "--output-format", "stream-json", ...]
      env: PAPERCLIP_* vars + ANTHROPIC_API_KEY (from secret ref)
      resources: config.resources (cpu/memory limits per agent)
      volumeMounts:
        - name: workspace
          mountPath: /workspace
          subPath: workspaces/<agentId>   # isolation per agent
  → Create Job via @kubernetes/client-node BatchV1Api
  → Stream logs from pod stdout/stderr via CoreV1Api log streaming
  → Parse streaming JSON (reuse claude-local parse.ts)
  → Delete Job on completion
  → Return AdapterExecutionResult
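
The log-streaming and parsing steps above have one practical wrinkle: pod logs arrive in arbitrary chunks, so a JSON object can be split across two reads. The sketch below shows one way to buffer chunks into complete lines before parsing, assuming the stream-json output is newline-delimited JSON. It is a stand-in for illustration, not the existing claude-local parse.ts; NdjsonBuffer, parseLines, and StreamEvent are hypothetical names.

```typescript
// One parsed event from the stream; the real event shape is not specified here.
type StreamEvent = Record<string, unknown>;

// Parses each non-empty line as JSON and keeps only plain objects.
function parseLines(lines: string[]): StreamEvent[] {
  const events: StreamEvent[] = [];
  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const value: unknown = JSON.parse(trimmed);
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        events.push(value as StreamEvent);
      }
    } catch {
      // Lines that are not valid JSON (for example stray warnings) are skipped.
    }
  }
  return events;
}

// Accumulates log chunks and emits events only for complete lines.
class NdjsonBuffer {
  private pending = "";

  // Feed one chunk of log text; returns the events for every line it completed.
  push(chunk: string): StreamEvent[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an unfinished line, or "" if the chunk ended on "\n".
    this.pending = lines.pop() ?? "";
    return parseLines(lines);
  }

  // Call once when the log stream ends to drain a final line without a newline.
  flush(): StreamEvent[] {
    const rest = this.pending;
    this.pending = "";
    return parseLines([rest]);
  }
}
```

The adapter would call push() for each chunk read from the pod log stream and flush() once the stream closes, before the Job is deleted.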

Adapter config fields:

namespace: paperclip-agents       # K8s namespace for Jobs
image: ghcr.io/craigedmunds/paperclip:latest
serviceAccount: paperclip-agent-runner
resources:
  requests: { cpu: "500m", memory: "2Gi" }
  limits: { cpu: "2", memory: "4Gi" }
pvcName: paperclip-agent-workspace  # shared RWX PVC
graceSec: 30
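
As a rough illustration of how these config fields could map onto the Job spec from the execution flow, the sketch below builds a batch/v1 Job manifest as a plain object. Several details are assumptions rather than part of the plan: the names ClaudeK8sConfig and buildJobManifest, mapping graceSec to terminationGracePeriodSeconds, backoffLimit: 0, and the Secret key name. The command array and PAPERCLIP_* variables are passed in by the caller.

```typescript
// Hypothetical type mirroring the adapter config fields listed above.
interface ClaudeK8sConfig {
  namespace: string;
  image: string;
  serviceAccount: string;
  resources: {
    requests: { cpu: string; memory: string };
    limits: { cpu: string; memory: string };
  };
  pvcName: string;
  graceSec: number;
}

// Assumption: the claude-api-key Secret stores the key under ANTHROPIC_API_KEY.
const API_KEY_SECRET = { name: "claude-api-key", key: "ANTHROPIC_API_KEY" };

function buildJobManifest(
  config: ClaudeK8sConfig,
  runId: string,
  agentId: string,
  command: string[],
  env: Record<string, string>,
) {
  return {
    apiVersion: "batch/v1",
    kind: "Job",
    metadata: { name: `paperclip-run-${runId}`, namespace: config.namespace },
    spec: {
      backoffLimit: 0, // assumption: a failed run is reported, not retried
      template: {
        spec: {
          restartPolicy: "Never",
          serviceAccountName: config.serviceAccount,
          // Assumption: graceSec is the pod's SIGTERM-to-SIGKILL window.
          terminationGracePeriodSeconds: config.graceSec,
          containers: [
            {
              name: "claude",
              image: config.image,
              command,
              env: [
                ...Object.keys(env).map((name) => ({ name, value: env[name] })),
                { name: "ANTHROPIC_API_KEY", valueFrom: { secretKeyRef: API_KEY_SECRET } },
              ],
              resources: config.resources,
              volumeMounts: [
                { name: "workspace", mountPath: "/workspace", subPath: `workspaces/${agentId}` },
              ],
            },
          ],
          volumes: [
            { name: "workspace", persistentVolumeClaim: { claimName: config.pvcName } },
          ],
        },
      },
    },
  };
}
```

Building the manifest as a pure function keeps it testable without a cluster; the resulting object would then be handed to the BatchV1Api client when the Job is created.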

Workspace isolation:

  • Shared PVC mounted at /workspace with subPath: workspaces/<agentId>/
  • No git credentials in pods — all git ops go through Paperclip API (push/pull handled by server)
  • Session state persisted to PVC subpath across runs
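
Because every agent shares one PVC, the subPath is the only thing separating workspaces, so it is worth refusing any agentId that could escape its directory. A minimal sketch follows, assuming agent IDs are simple tokens such as UUIDs; workspaceSubPath is a hypothetical helper name and the allowed character set is an assumption.

```typescript
// Returns the PVC subPath for an agent, rejecting IDs that could traverse
// out of the workspaces/ directory (path separators, "..", empty strings).
function workspaceSubPath(agentId: string): string {
  // Assumption: IDs contain only letters, digits, "_" and "-".
  if (!/^[A-Za-z0-9][A-Za-z0-9_-]*$/.test(agentId)) {
    throw new Error(`unsafe agentId for workspace path: ${JSON.stringify(agentId)}`);
  }
  return `workspaces/${agentId}`;
}
```

Calling this once when building the Job spec means a malformed ID fails the run up front instead of mounting another agent's directory.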

Registration:

  • Add to server/src/adapters/registry.ts alongside claude_local

3. K8s infrastructure for agent pods

New manifests in repos/k8s-lab/ or as part of the Paperclip app:

  • Namespace: paperclip-agents
  • ServiceAccount: paperclip-agent-runner with minimal RBAC (read secrets, write to workspace PVC)
  • PVC: paperclip-agent-workspace — RWX (shared across Jobs), large enough (e.g. 50Gi)
  • Secret: claude-api-key — ANTHROPIC_API_KEY for agent pods
  • NetworkPolicy: agents can only reach Paperclip API + external APIs, not internal cluster services
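
To make the NetworkPolicy item more concrete, the sketch below builds an egress-only policy as a plain object: DNS, the Paperclip API, and external addresses are allowed, while private address ranges are excluded. Everything specific here is an assumption for illustration: the ApiTarget fields, the policy name agent-egress, DNS living in kube-system, and the cluster's pod and service CIDRs falling inside the RFC 1918 ranges.

```typescript
// Hypothetical description of where the Paperclip API runs in the cluster.
interface ApiTarget {
  namespace: string;                 // namespace of the Paperclip server
  podLabels: Record<string, string>; // labels selecting the server pods
  port: number;                      // TCP port of the Paperclip API
}

// RFC 1918 ranges, assumed to cover the cluster's pod and service CIDRs.
const PRIVATE_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"];

function buildAgentEgressPolicy(agentNamespace: string, api: ApiTarget) {
  return {
    apiVersion: "networking.k8s.io/v1",
    kind: "NetworkPolicy",
    metadata: { name: "agent-egress", namespace: agentNamespace },
    spec: {
      podSelector: {}, // empty selector: applies to every pod in the namespace
      policyTypes: ["Egress"],
      egress: [
        {
          // DNS lookups (assumption: cluster DNS runs in kube-system)
          to: [{ namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": "kube-system" } } }],
          ports: [{ protocol: "UDP", port: 53 }, { protocol: "TCP", port: 53 }],
        },
        {
          // Paperclip API pods only
          to: [
            {
              namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": api.namespace } },
              podSelector: { matchLabels: api.podLabels },
            },
          ],
          ports: [{ protocol: "TCP", port: api.port }],
        },
        {
          // External APIs: any address except the private ranges
          to: [{ ipBlock: { cidr: "0.0.0.0/0", except: PRIVATE_CIDRS } }],
        },
      ],
    },
  };
}
```

NetworkPolicy rules are additive allow-lists, so once a pod is selected by a policy with the Egress type, anything not matched by one of these three rules is dropped. This only takes effect if the cluster's network plugin enforces NetworkPolicy.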

4. Image factory .builders/ per adapter

In the integration branch of craigedmunds/paperclip, add builder directories:

  • .builders/claude-k8s/ — Dockerfile for the claude_k8s image (claude CLI + Node runtime)
  • .builders/opencode-remote/ — if a dedicated image is needed

These feed into the image-factory pipeline (CDK8s dockerfile config per builder).


Implementation Order

  1. CDK8s regen (quick, unblocks Kargo pipeline) → PR to image-factory-state
  2. claude_k8s adapter (core feature) → PR to craigedmunds/paperclip@integration
  3. K8s infra manifests (namespace, SA, PVC, RBAC) → PR to k8s-lab
  4. Image factory builder dirs → PR to craigedmunds/paperclip@integration
  5. Wire it up — update Paperclip agent config in cluster to use claude_k8s

Notes

  • The integration branch name is confirmed (the board requested release or integration, not custom-adapters)
  • Merge discipline: upstream/master → origin/master → origin/integration (cherry-pick upstream releases)
  • Per-run Job naming uses runId to avoid conflicts; TTL or manual delete on completion
  • onSpawn callback in execute context can be used to report the Job/pod name instead of a local PID
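
On the Job naming note: Kubernetes recommends that Job names follow DNS label rules (lowercase alphanumerics and "-", at most 63 characters), so a runId may need normalizing before it is used in a name. A minimal sketch, assuming runIds are short tokens such as UUIDs; jobNameForRun is a hypothetical helper name.

```typescript
const JOB_NAME_PREFIX = "paperclip-run-";
const MAX_NAME_LENGTH = 63; // DNS label limit

// Derives a Job name from a runId, normalized to DNS label characters.
function jobNameForRun(runId: string): string {
  const suffix = runId
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-") // replace anything outside [a-z0-9-]
    .replace(/^-+|-+$/g, "");     // trim leading and trailing dashes
  if (suffix === "") {
    throw new Error("runId has no characters usable in a Job name");
  }
  // Truncation keeps the name valid but could collide for very long runIds
  // that share a prefix; UUID-length IDs fit without truncation.
  return (JOB_NAME_PREFIX + suffix).slice(0, MAX_NAME_LENGTH).replace(/-+$/, "");
}
```

If the TTL option is chosen for cleanup, the Job spec field for it is ttlSecondsAfterFinished, which lets the cluster remove finished Jobs even if the adapter crashes before its own delete call.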