OpenCode Slack Integration - Epic Breakdown

Overview

This document provides the complete epic and story breakdown for OpenCode Slack Integration, decomposing the requirements from the PRD and Architecture into implementable stories.

Requirements Inventory

Functional Requirements

FR-1: Slack Command Interface (P0)

  • /opencode start opens a workflow form
  • Form includes: category, project, repositories, task title/description, priority
  • Form submission creates a new work session
  • System creates dedicated thread for the work session
  • User receives confirmation with session details

FR-2: BMAD Routing (P0)

  • System analyzes task title and description
  • Provides routing suggestion (architect/PM/builder/party-mode) with confidence score
  • Suggests AI model: Claude Code (API) for architecture/reasoning, Qwen 2.5 Coder 32B (local) for code generation
  • Model mapping is configurable in global application state
  • Shows reasoning for agent type and model suggestion
  • User can confirm or override routing and model decision
  • Routing decision recorded in session state

FR-3: Workspace Management (P0)

  • System invokes task builder:init BUILDER_NAME={project} REPOS={repos}
  • Workspace includes workspace-root and specified repositories
  • Each workspace isolated from other sessions
  • Workspace state persists across OpenCode session restarts
  • Workspace can be cleaned up when work is complete

FR-4: OpenCode Session Bridge (P0)

  • System spawns OpenCode session in builder workspace
  • Session integrates with standard OpenCode session management (visible in opencode session list)
  • Session appears in OpenCode web UI alongside manually-created sessions
  • User messages from Slack forwarded to OpenCode
  • OpenCode output parsed and formatted for Slack
  • Questions from agents detected and presented as interactive messages
  • Progress milestones posted to thread
  • Session accessible from both Slack and OpenCode web UI (with conflict handling)

FR-5: Lab Deployment (P0)

  • System creates namespace: {project}-lab
  • Deployment uses ConfigMap/PVC strategy (no image builds for MVP)
  • Ingress created at {project}.builder.lab.ctoaas.co
  • System waits for pods ready before declaring success
  • Lab URL posted to Slack thread

FR-6: Test Execution (P0)

  • Unit tests run in OpenCode session context
  • Integration tests run against lab URL
  • Acceptance tests verify critical flows
  • Test results formatted and posted to Slack
  • All-passing status clearly indicated

FR-7: Question Handling (P1)

  • Questions posted to Slack with interactive buttons
  • Timeout is configurable (default: 2 business hours)
  • Recommended option indicated
  • If no response within timeout, work continues with recommended option
  • Late responses accepted and handled appropriately
  • Questions can spawn new threads for complex discussions
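
The timeout fallback in FR-7 can be sketched with asyncio. This is a minimal illustration rather than the Gateway's implementation: `wait_for_answer`, the future that carries the user's click, and the use of plain seconds instead of business hours are all assumptions.

```python
import asyncio
from typing import Tuple


async def wait_for_answer(
    answer: "asyncio.Future[str]", recommended: str, timeout_s: float
) -> Tuple[str, bool]:
    """Return (chosen_option, timed_out).

    If the user answers before timeout_s elapses, their choice wins;
    otherwise work continues with the recommended option.
    """
    try:
        # shield() keeps the underlying future alive after a timeout, so a
        # late response can still be observed and handled.
        choice = await asyncio.wait_for(asyncio.shield(answer), timeout_s)
        return choice, False
    except asyncio.TimeoutError:
        return recommended, True
```

Because the future is shielded, the caller can keep a reference to it and react to a late click after work has already continued with the recommended option.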

FR-8: Progress Updates (P1)

  • Major milestones posted to thread (phase complete, tests passing, deployment ready)
  • Significant changes within a phase are reported
  • Summary posted every 1-2 hours if no milestones reached
  • Updates configurable per project
  • Updates include emojis and clear formatting

FR-9: Multi-Project Support (P0)

  • Each project has isolated workspace and session
  • Sessions can run concurrently
  • User can switch between projects via Slack
  • Session state tracked per project
  • No cross-contamination of context between projects

FR-10: Session Persistence (P1)

  • Session state stored in git (.session-state.yaml)
  • OpenCode sessions can be resumed after restart
  • Work context preserved across days/weeks
  • User can explicitly pause/resume sessions

Non-Functional Requirements

NFR-1: Response Time (P0)

  • Slack commands acknowledge within 3 seconds
  • Deployment to lab namespace completes in < 30 seconds
  • Question responses forwarded to OpenCode within 1 second

NFR-2: Reliability (P0)

  • System recovers gracefully from Slack webhook failures (retry logic)
  • Session state never lost (git-backed persistence)
  • Deployment failures clearly communicated with recovery options

NFR-3: Security (P0)

  • Slack webhook signatures verified
  • K8s service account with least-privilege RBAC
  • Lab namespaces isolated from production
  • No secrets exposed in Slack messages
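
For any HTTP endpoint that receives Slack requests, signature verification follows Slack's v0 signing scheme: an HMAC-SHA256 over `v0:{timestamp}:{body}` keyed with the app's signing secret. The sketch below assumes the raw body and the `X-Slack-Request-Timestamp` and `X-Slack-Signature` header values have already been extracted; the function name and the `now` parameter (added for testability) are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
    max_age_s: int = 300,
) -> bool:
    """Check a request against Slack's v0 signing scheme.

    body must be the raw, unparsed request body.
    """
    current = time.time() if now is None else now
    try:
        if abs(current - int(timestamp)) > max_age_s:
            return False  # stale request: possible replay
    except ValueError:
        return False  # timestamp header is not an integer
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```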

NFR-4: Scalability (P2)

  • MVP: Single user, < 10 concurrent sessions
  • Future: Multiple users, 50+ concurrent sessions

Additional Requirements

From Architecture:

Bridge Plugin WebSocket Integration:

  • Bridge plugin establishes WebSocket connection to Gateway on startup
  • Request/response pattern using request_id for correlation
  • Event streaming for real-time updates (no polling)
  • Reconnection logic with exponential backoff
  • Plugin handles commands: session.create, session.message, session.list, session.abort
  • Plugin streams events: session.created, message.part.streamed, tool.executed, agent.milestone, session.idle/active

State Persistence Strategy:

  • Mount PVC to ~/.local/share/opencode for OpenCode state persistence
  • OpenCode auth tokens persist across pod restarts
  • Session metadata and history persist across pod restarts
  • Plugin code loaded from PVC workspace (/workspace/.opencode-plugins/opencode-bridge)
  • No Docker rebuild needed for plugin changes

OpenCode SDK Session Management:

  • Use @opencode-ai/sdk/dist/v2/client.js for programmatic session control
  • Sessions created via SDK appear in OpenCode web UI
  • Agent routing via agent: 'builder' parameter
  • Model selection via model: { providerID, modelID } parameter

Kubernetes Integration:

  • Three components: LGTM Stack (observability), Gateway (Python FastAPI), Codev Pod (OpenCode + Bridge Plugin)
  • Gateway namespace: ai-dev
  • Gateway service: ClusterIP gateway.ai-dev.svc.cluster.local:8000
  • Secrets via ClusterExternalSecret pattern
  • Shared PVC: code-server-storage (ReadWriteMany) mounted by codev + gateway
  • No public ingress for Gateway (Slack Socket Mode outbound)
  • ArgoCD auto-sync for lab environment

Observability Requirements:

  • Gateway logs → Loki
  • OpenCode plugin logs → Loki
  • Grafana dashboards for session tracking, question latency, Slack interactions
  • Distributed tracing for question flow across components (future)

Code Review Follow-ups (Medium Priority):

  • M1: Race condition in double-click handling (services/gateway/services/slack_app.py:64-66)
  • M2: Request validation in permission endpoint (services/gateway/api/opencode.py:36)
  • M3: Hardcoded timeout values inconsistent (bridge plugin 310s vs gateway 300s)

FR Coverage Map

Functional Requirements:

  • FR-1 (Slack Command Interface) → Epic 2
  • FR-2 (BMAD Routing) → Epic 2
  • FR-3 (Workspace Management) → Epic 2
  • FR-4 (OpenCode Session Bridge) → Epic 2
  • FR-5 (Lab Deployment) → Epic 3
  • FR-6 (Test Execution) → Epic 3
  • FR-7 (Question Handling) → Epic 1
  • FR-8 (Progress Updates) → Epic 2
  • FR-9 (Multi-Project Support) → Epic 2
  • FR-10 (Session Persistence) → Epic 2

Non-Functional Requirements:

  • NFR-1 (Response Time) → Epic 1 (baseline), Epic 4 (optimization)
  • NFR-2 (Reliability) → Epic 1 (foundation), Epic 4 (hardening)
  • NFR-3 (Security) → Epic 1 (baseline), Epic 4 (hardening)
  • NFR-4 (Scalability) → Deferred (post-MVP)

Architecture Requirements:

  • WebSocket Integration → Epic 1
  • State Persistence → Epic 1
  • Plugin Loading from PVC → Epic 1
  • K8s Deployment → Epic 1
  • Observability (LGTM) → Epic 1
  • Code Review Follow-ups (M1-M3) → Epic 1

Epic List

Epic 1: Steel Thread Production Deployment

Goal: Permission bridge is fully deployed to K8s with WebSocket integration, state persistence, and observability - production-ready foundation.

User Outcome: Developers receive Slack notifications when OpenCode agents request permissions and can approve/deny from mobile. System runs in production with full observability and zero-rebuild iteration capability.

FRs covered: FR-7 (Question Handling)

Architecture covered: WebSocket bidirectional communication, state persistence (PVC mounts), plugin loading from PVC, K8s deployment (Gateway + Codev), LGTM observability, ArgoCD integration, code review hardening (M1-M3)


Epic 2: Session Management & Async Interaction

Goal: Enable developers to start OpenCode sessions from Slack, receive progress updates, and interact asynchronously across days/weeks.

User Outcome: Developers can initiate development work via /opencode start command, system routes to appropriate BMAD agents, manages isolated workspaces, and provides async progress updates. Work context persists across sessions.

FRs covered: FR-1 (Slack Command Interface), FR-2 (BMAD Routing), FR-3 (Workspace Management), FR-4 (OpenCode Session Bridge), FR-8 (Progress Updates), FR-9 (Multi-Project Support), FR-10 (Session Persistence)


Epic 3: Lab Deployment & Testing

Goal: Automatically deploy work-in-progress code to isolated Kubernetes lab namespaces and execute automated tests.

User Outcome: Developers can deploy completed work to isolated K8s namespaces with automatic ingress creation, run automated tests (unit/integration/acceptance), and receive test results in Slack.

FRs covered: FR-5 (Lab Deployment), FR-6 (Test Execution)


Epic 4: Production Optimization

Goal: Harden system for production reliability, performance, and security beyond MVP baseline.

User Outcome: System meets production SLAs with comprehensive error handling, retry logic, performance monitoring, and security hardening.

NFRs covered: NFR-1 (Response Time optimization), NFR-2 (Reliability hardening), NFR-3 (Security hardening)


Epic 1: Steel Thread Production Deployment

Goal: Permission bridge is fully deployed to K8s with WebSocket integration, state persistence, and observability - production-ready foundation.

Story 1.1: WebSocket Server in Gateway

As a developer, I want the Gateway to accept WebSocket connections from the Bridge Plugin, So that bidirectional real-time communication is established for session commands and event streaming.

Acceptance Criteria:

Given Gateway service is running When Bridge Plugin initiates WebSocket connection to ws://gateway:8000/ws/bridge Then Connection is accepted and established And Gateway logs successful connection with connection ID

Given WebSocket connection is established When Bridge sends JSON message with type: 'ping' Then Gateway responds with type: 'pong' And Connection remains active

Given WebSocket connection is lost When Bridge attempts to reconnect Then Gateway accepts reconnection And Previous connection state is cleaned up

Given Invalid JSON is received When Gateway processes the message Then Error is logged And Connection remains open (doesn’t crash)
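
The ping/pong and invalid-JSON criteria above can be sketched as a pure message handler that the WebSocket receive loop calls for each frame. The function and logger names are hypothetical, and how the surrounding FastAPI endpoint wires it in is left out.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("gateway.ws")


def handle_bridge_message(raw: str) -> Optional[dict]:
    """Return the reply for one inbound frame, or None if there is no reply.

    Malformed input is logged and swallowed so the caller's receive loop,
    and therefore the connection, keeps running.
    """
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.error("invalid JSON from bridge: %s", exc)
        return None
    if not isinstance(message, dict):
        logger.error("expected a JSON object, got %s", type(message).__name__)
        return None
    if message.get("type") == "ping":
        return {"type": "pong"}
    return None
```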


Story 1.2: WebSocket Client in Bridge Plugin ✅

As a developer, I want the Bridge Plugin to establish WebSocket connection to Gateway on startup, So that it can send events and receive commands from Gateway.

Acceptance Criteria:

  • AC1: Bridge Plugin starts with GATEWAY_WS=ws://gateway:8000/ws/bridge - WebSocket connection to Gateway is established - Connection ready event is logged
  • AC2: WebSocket connection established - Gateway sends command { type: 'ping' } - Plugin responds with { type: 'pong' } - Round-trip latency is logged
  • AC3: Connection is lost - Plugin detects disconnect - Plugin attempts reconnection with exponential backoff (1s, 2s, 4s, 8s, max 30s) - Reconnection attempts are logged
  • AC4: Gateway is unreachable on startup - Plugin initialization runs - Plugin continues to retry connection in background - Logs connection failures without crashing
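
The backoff schedule in AC3 (1s, 2s, 4s, 8s, capped at 30s) can be sketched as a generator. It is shown in Python for illustration only; the plugin itself is TypeScript, and the function and parameter names are hypothetical.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(
    base_s: float = 1.0, factor: float = 2.0, cap_s: float = 30.0
) -> Iterator[float]:
    """Yield reconnect delays that double each attempt up to cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


# First seven delays: 1, 2, 4, 8, 16, 30, 30 seconds.
first_delays = list(islice(backoff_delays(), 7))
```

A reconnect loop would sleep for each yielded delay between attempts and create a fresh generator after a successful connection so the next outage starts again at 1s.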

Implementation:

  • plugins/opencode-bridge/src/websocket-client.ts - BridgeWebSocketClient class
  • plugins/opencode-bridge/src/plugin.ts:25-35 - WebSocket initialization in plugin
  • plugins/opencode-bridge/src/plugin.test.ts:54-73 - Test for WebSocket client integration

Tests: 26 passing (all plugin tests)


Story 1.3: Session Command Protocol (Gateway → Bridge) ✅

As a Gateway developer, I want to send session management commands to Bridge via WebSocket, So that I can create, message, list, and abort OpenCode sessions programmatically.

Acceptance Criteria:

  • AC1: WebSocket connection established - Gateway sends session.create - Bridge creates OpenCode session using SDK - Bridge responds with session.created
  • AC2: Session exists - Gateway sends session.message - Bridge sends message to OpenCode session via SDK - Bridge responds with session.message.sent
  • AC3: Multiple sessions exist - Gateway sends session.list - Bridge queries OpenCode SDK - Bridge responds with session.list
  • AC4: Session running - Gateway sends session.abort - Bridge aborts session via SDK - Bridge responds with session.aborted
  • AC5: Command fails (invalid session ID) - Bridge processes command - Bridge responds with session.error

Implementation:

  • plugins/opencode-bridge/src/handlers/session-commands.ts - Session command handlers (handleSessionCreate, handleSessionMessage, handleSessionList, handleSessionAbort)
  • plugins/opencode-bridge/src/plugin.ts:45-76 - WebSocket message handler routing
  • plugins/opencode-bridge/src/handlers/session-commands.test.ts - 12 tests for all session commands

Tests: 39 passing (12 new session command tests + 27 existing)


Story 1.4: Event Streaming Protocol (Bridge → Gateway) ✅

As a Bridge Plugin developer, I want to stream OpenCode events to Gateway via WebSocket, So that Gateway can format and forward events to Slack in real-time.

Acceptance Criteria:

  • AC1: OpenCode streams message part - Bridge receives message.part.streamed - Bridge sends to Gateway with session_id, message_id, content, role
  • AC2: OpenCode executes tool - Bridge receives tool.executed - Bridge sends to Gateway with session_id, tool, file, status
  • AC3: Session becomes idle - Bridge receives session.idle - Bridge sends to Gateway with session_id
  • AC4: Session becomes active - Bridge receives session.active - Bridge sends to Gateway with session_id
  • AC5: Message streaming starts - Bridge receives message.started - Bridge sends to Gateway with session_id, message_id
  • AC6: Message streaming completes - Bridge receives message.completed - Bridge sends to Gateway with session_id, message_id
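
The per-event fields listed in AC1 to AC6 can be captured as a lookup table that the Gateway checks on receipt. This is a sketch: the field names come from the criteria above, but the function name is hypothetical, and whether the Gateway rejects or merely logs an incomplete event is not specified here.

```python
from typing import Dict, List, Tuple

REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "message.part.streamed": ("session_id", "message_id", "content", "role"),
    "tool.executed": ("session_id", "tool", "file", "status"),
    "session.idle": ("session_id",),
    "session.active": ("session_id",),
    "message.started": ("session_id", "message_id"),
    "message.completed": ("session_id", "message_id"),
}


def missing_fields(event: dict) -> List[str]:
    """List required fields absent from an event; [] means well-formed.

    Raises ValueError for an event type outside the six listed above.
    """
    event_type = event.get("type")
    if event_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown event type: {event_type!r}")
    return [name for name in REQUIRED_FIELDS[event_type] if name not in event]
```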

Implementation:

  • plugins/opencode-bridge/src/handlers/event-streaming.ts - Event streaming handler
  • plugins/opencode-bridge/src/plugin.ts:147 - Integrated into event hook
  • services/gateway/main.py:157-168 - Gateway receives and logs events
  • plugins/opencode-bridge/src/handlers/event-streaming.test.ts - 10 tests for all event types
  • services/gateway/test_websocket.py:87-197 - 6 tests for Gateway event reception

Tests:

  • Bridge: 65 passing (10 new event streaming tests)
  • Gateway: 36 passing (6 new event reception tests)
  • Coverage: 95.52% (100% on event-streaming.ts)

Story 1.5: OpenCode State Persistence via PVC ✅

As a developer, I want OpenCode state persisted to PVC-mounted storage, So that sessions, auth tokens, and history survive pod restarts.

Acceptance Criteria:

  • AC1: Codev pod configured with PVC mount - volumeMount to /home/opencode/.local/share/opencode from subPath .opencode-data - OpenCode state directory points to PVC
  • AC2: User authenticates with Anthropic OAuth - Auth token saved to auth.json - Token written to PVC at .opencode-data/auth.json - Token persists after pod restart
  • AC3: OpenCode session created - Session metadata written to storage/session/ - Session data written to PVC - Session appears in list after pod restart
  • AC4: Message history exists - Pod restarts - Full message history available in resumed session - No data loss
  • AC5: PVC mount fails on startup - Pod initialization runs - Pod fails with clear error - Error logged to stdout/stderr (K8s default behavior)

Implementation:

  • infrastructure/kustomize/components/codev/deployment.yaml - Codev deployment with PVC mounts
  • infrastructure/kustomize/components/codev/pvc.yaml - PVC definition (ReadWriteMany, 10Gi)
  • infrastructure/kustomize/components/codev/service.yaml - ClusterIP service for OpenCode
  • infrastructure/kustomize/components/codev/service-account.yaml - ServiceAccount for pod
  • infrastructure/kustomize/components/codev/README.md - Deployment and testing documentation

K8s Configuration:

volumeMounts:
  - name: code-server-storage
    mountPath: /workspace                              # Code files
  - name: code-server-storage
    mountPath: /home/opencode/.local/share/opencode   # State persistence
    subPath: .opencode-data
  - name: code-server-storage
    mountPath: /workspace/.opencode-plugins            # Plugin code (Story 1.6)
    subPath: .opencode-plugins

PVC Structure:

code-server-storage/
├── repos/                  # Workspace code
├── .opencode-data/         # OpenCode state (NEW)
│   ├── auth.json          # OAuth tokens
│   └── storage/           # Sessions, messages, parts
└── .opencode-plugins/      # Plugin code (Story 1.6)

Story 1.6: Plugin Loading from PVC Workspace

As a developer, I want Bridge Plugin loaded from PVC workspace on pod startup, So that plugin code changes don’t require Docker rebuilds.

Acceptance Criteria:

Given Plugin source exists at /workspace/.opencode-plugins/opencode-bridge/ When Codev pod starts and runs entrypoint script Then Script runs cd /workspace/.opencode-plugins/opencode-bridge && npm install && npm link And Plugin is globally linked

Given Plugin is linked globally When Entrypoint script runs cd ~/.config/opencode && npm link @opencode-bridge Then OpenCode can discover the plugin And Plugin loads on OpenCode startup

Given Plugin code is modified on PVC When Pod restarts Then New plugin code is loaded (via npm install + link) And No Docker rebuild is required

Given Plugin has npm dependencies When npm install runs in plugin directory Then Dependencies are installed to plugin’s node_modules And Installation completes successfully

Given Plugin npm install fails When Entrypoint script detects failure Then Pod logs error details And Pod continues startup (graceful degradation)


Story 1.7: Gateway Pod K8s Deployment

As a platform operator, I want Gateway deployed as K8s pod in ai-dev namespace, So that it runs in production with proper resource limits and secrets.

Acceptance Criteria:

Given Gateway image is built and pushed to ghcr.io/craigedmunds/opencode-slack-gateway:latest When Kustomize applies infrastructure/kustomize/components/opencode-slack-gateway/ Then Deployment creates Gateway pod in ai-dev namespace And Pod is running with status Ready

Given Gateway pod is deployed When Pod starts Then Environment variables are loaded from ClusterExternalSecret And SLACK_BOT_TOKEN and SLACK_APP_TOKEN are available

Given Gateway pod needs persistent storage When Pod mounts PVC code-server-storage Then PVC is mounted at /workspace (shared with Codev) And Gateway can read/write to shared filesystem

Given Gateway service is created When Service manifest is applied Then ClusterIP service gateway.ai-dev.svc.cluster.local:8000 is created And Service routes to Gateway pod port 8000

Given Resource limits are defined When Pod is scheduled Then Pod requests 256Mi memory, 100m CPU And Pod limits 512Mi memory, 500m CPU

Given Gateway crashes When Pod exit occurs Then K8s restarts pod automatically And Restart count increments


Story 1.8: Codev Pod Updates for Bridge Plugin

As a platform operator, I want Codev pod updated to load Bridge Plugin and configure Gateway URL, So that Bridge can connect to Gateway on pod startup.

Acceptance Criteria:

Given Codev pod Dockerfile is updated When Image is built Then Entrypoint script includes plugin loading logic And GATEWAY_URL environment variable is set to http://gateway.ai-dev.svc.cluster.local:8000

Given Pod starts with plugin source on PVC When Entrypoint runs plugin installation Then Bridge Plugin is installed and linked And OpenCode loads plugin on startup

Given OpenCode starts with Bridge Plugin loaded When Plugin initialization runs Then WebSocket connection to the /ws/bridge endpoint at the $GATEWAY_URL host (using the ws:// scheme, as with GATEWAY_WS in Story 1.2) is established And Connection success is logged

Given Gateway is not yet running When Bridge Plugin attempts connection Then Plugin retries with exponential backoff And Pod doesn’t crash waiting for Gateway

Given Codev pod restarts When Pod comes back up Then Bridge Plugin reconnects to Gateway And WebSocket connection is re-established


Story 1.9: LGTM Observability Stack Deployment

As a platform operator, I want LGTM stack deployed to capture logs and metrics, So that I can monitor Gateway and Bridge behavior in production.

Acceptance Criteria:

Given LGTM component exists in k8s-lab/components/lgtm/ When ArgoCD syncs the component Then LGTM pod is running in lgtm namespace And Services are available: Grafana (3000), Loki (3100), OTLP (4317/4318)

Given LGTM stack is running When Ingress is created at lgtm.lab.ctoaas.co Then Grafana UI is accessible via browser And Default dashboards are loaded

Given Gateway pod is running When Gateway logs to stdout Then Logs are captured by Loki And Logs are queryable in Grafana

Given Bridge Plugin logs events When Plugin writes to OpenCode log files Then Logs are captured by Loki And Logs are searchable by session ID

Given PVC is mounted for LGTM data When Pod restarts Then Grafana dashboards and Loki data persist And No data loss occurs


Story 1.10: ArgoCD Integration for Auto-Sync

As a platform operator, I want ArgoCD Application configured for ai-dev components, So that Git commits automatically deploy to lab environment.

Acceptance Criteria:

Given ArgoCD Application is defined in k8s-lab/other-seeds/ai-dev.yaml When Application manifest specifies source https://github.com/craigedmunds/ai-dev Then ArgoCD syncs from infrastructure/kustomize/components path And Target namespace is ai-dev

Given Auto-sync is enabled When Git commit is pushed to ai-dev repo Then ArgoCD detects change within 3 minutes And New manifests are applied automatically

Given Sync fails (invalid YAML) When ArgoCD attempts sync Then Application status shows degraded And Error details are visible in ArgoCD UI

Given Multiple components exist (gateway, codev updates) When ArgoCD syncs Then All components are applied in correct order And Dependencies are respected

Given Manual sync is triggered When Operator clicks “Sync” in ArgoCD UI Then Sync completes successfully And All resources show healthy status


Story 1.11: Code Review Hardening (M1-M3)

As a developer, I want code review follow-ups addressed, So that production deployment is hardened against race conditions, validation gaps, and timeout inconsistencies.

Acceptance Criteria:

Given Double-click can occur on Slack button (M1) When Two clicks arrive simultaneously Then Lock is acquired before pop() operation And Only first click processes, second returns early And No duplicate responses occur

Given Permission request arrives at Gateway (M2) When Request contains session_id Then Session ID format is validated (length >= 8, alphanumeric) And Invalid session IDs return HTTP 400 with error message

Given Timeout values are hardcoded (M3) When Gateway starts Then Timeout is read from env var PERMISSION_TIMEOUT_SECONDS (default: 300) And Bridge Plugin uses GATEWAY_TIMEOUT + 10 buffer And Timeouts are configurable without code changes

Given Lock implementation is added (M1) When Concurrent clicks occur in load test Then No race conditions occur in 1000 click simulation And All responses are correctly deduplicated

Given Session validation is added (M2) When Invalid session IDs are sent (empty, too short, special chars) Then All invalid formats are rejected And Valid formats pass through
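
The three follow-ups can be sketched together. This is an illustration under assumptions, not the code in `slack_app.py` or `opencode.py`: the class and function names are hypothetical, a `threading.Lock` is assumed (an asyncio-based handler might use `asyncio.Lock` instead), and the session ID pattern follows the criterion literally, so a prefixed ID such as `ses_xyz` would not pass unless the underscore is added to the allowed set.

```python
import os
import re
import threading
from typing import Dict, Optional

# M3: one configurable source of truth for the timeout.
PERMISSION_TIMEOUT_SECONDS = int(os.environ.get("PERMISSION_TIMEOUT_SECONDS", "300"))

# M2: at least 8 alphanumeric characters.
SESSION_ID_RE = re.compile(r"[A-Za-z0-9]{8,}")


def is_valid_session_id(session_id: object) -> bool:
    """Return True only for strings matching the M2 format rule."""
    return isinstance(session_id, str) and SESSION_ID_RE.fullmatch(session_id) is not None


class PendingPermissions:
    """M1: pop under a lock so only the first click gets the request."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[str, dict] = {}

    def add(self, request_id: str, request: dict) -> None:
        with self._lock:
            self._pending[request_id] = request

    def claim(self, request_id: str) -> Optional[dict]:
        """Return the request to the first caller, None to every later one."""
        with self._lock:
            return self._pending.pop(request_id, None)
```

A button handler would call `claim` and return early on `None`, which is what deduplicates the second click.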


Epic 2: Session Management & Async Interaction

Goal: Enable developers to start OpenCode sessions from Slack, receive progress updates, and interact asynchronously across days/weeks.

Story 2.1: Slack Slash Command /opencode start

As a developer, I want to initiate development work via /opencode start command in Slack, So that I can start OpenCode sessions from mobile without terminal access.

Acceptance Criteria:

Given Slack app is installed in workspace When User types /opencode start in any channel Then Workflow form appears with fields: category, project, repositories, task title, task description, priority And Form loads within 3 seconds

Given Form is displayed When User fills required fields and submits Then Gateway receives form data via Socket Mode And Slack acknowledges submission with “Starting session…” message

Given Form submission fails (network error) When Submission times out Then User sees error message “Failed to submit. Please try again.” And Form data is preserved for retry

Given User cancels form When Cancel button is clicked Then Form closes without action And No session is created


Story 2.2: BMAD Agent Routing Suggestion

As a developer, I want the system to suggest an appropriate BMAD agent based on my task, So that work is routed to the right agent type (architect/PM/builder/party-mode).

Acceptance Criteria:

Given Task title is “Design authentication architecture” When Routing algorithm analyzes task Then Suggestion is “architect” with confidence >80% And Reasoning includes “contains ‘design’ and ‘architecture’ keywords”

Given Task title is “Implement login API” When Routing algorithm analyzes task Then Suggestion is “builder” with confidence >70% And Reasoning includes “contains ‘implement’ keyword”

Given Task title is ambiguous (“Fix the thing”) When Routing algorithm analyzes task Then Suggestion is “builder” (default) with confidence <50% And Reasoning includes “insufficient context for confident routing”

Given Routing suggestion is displayed When User reviews suggestion Then User can confirm or override routing decision And Override option shows all 4 agent types

Given Routing decision is made When Session is created Then Routing decision is recorded in session state And OpenCode session uses selected agent type
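
One way to satisfy the routing criteria above is a keyword scorer. The sketch below is an assumption throughout: the keyword lists, the confidence formula, and the names `suggest_agent` and `Suggestion` are invented for illustration, chosen only so that the three example titles produce the stated agents and confidence bands.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists; the real routing vocabulary is not specified.
KEYWORDS = {
    "architect": {"design", "architecture"},
    "pm": {"requirements", "roadmap"},
    "builder": {"implement", "build"},
    "party-mode": {"brainstorm"},
}
DEFAULT_AGENT = "builder"


@dataclass
class Suggestion:
    agent: str
    confidence: float
    reasoning: str


def suggest_agent(title: str, description: str = "") -> Suggestion:
    """Suggest an agent type from keywords in the task text."""
    tokens = re.findall(r"[a-z]+", f"{title} {description}".lower())
    ordered = list(dict.fromkeys(tokens))  # unique words, in order of appearance
    hits = {agent: [w for w in ordered if w in kws] for agent, kws in KEYWORDS.items()}
    agent = max(hits, key=lambda name: len(hits[name]))
    matched = hits[agent]
    if not matched:
        return Suggestion(DEFAULT_AGENT, 0.3, "insufficient context for confident routing")
    # Assumed formula: one hit gives 0.75, two or more approach the 0.95 cap.
    confidence = min(0.95, 0.6 + 0.15 * len(matched))
    quoted = " and ".join(f"'{w}'" for w in matched)
    noun = "keyword" if len(matched) == 1 else "keywords"
    return Suggestion(agent, confidence, f"contains {quoted} {noun}")
```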


Story 2.3: AI Model Selection Suggestion

As a developer, I want the system to suggest an AI model based on task complexity, So that work uses Claude Code for complex reasoning or Qwen Coder for implementation.

Acceptance Criteria:

Given Agent type is “architect” or “pm” When Model selection runs Then Suggestion is “Claude Code” (Anthropic API) And Reasoning is “Complex reasoning required for architecture/planning”

Given Agent type is “builder” When Model selection runs Then Suggestion is “Qwen 2.5 Coder 32B” (local) And Reasoning is “Code generation optimized for local model”

Given Model suggestion is displayed When User reviews suggestion Then User can confirm or override model choice And Override shows both available models

Given Model selection is made When Session is created Then Model choice is recorded in session state And OpenCode session uses selected model

Given Suggested model is unavailable When Health check fails Then User is prompted to select alternative model And Session creation waits for user decision
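
The model selection rules above reduce to a small lookup with a health check in front of it. In this sketch the function name and the `is_healthy` callback are hypothetical, and returning `None` to mean "prompt the user" (including for agent types with no stated mapping, such as party-mode) is an assumption.

```python
from typing import Callable, Optional, Tuple

MODEL_BY_AGENT = {
    "architect": ("Claude Code", "Complex reasoning required for architecture/planning"),
    "pm": ("Claude Code", "Complex reasoning required for architecture/planning"),
    "builder": ("Qwen 2.5 Coder 32B", "Code generation optimized for local model"),
}


def suggest_model(
    agent: str, is_healthy: Callable[[str], bool]
) -> Optional[Tuple[str, str]]:
    """Return (model, reasoning), or None when the user must choose.

    None covers both an agent type with no mapping and a suggested
    model whose health check fails.
    """
    suggestion = MODEL_BY_AGENT.get(agent)
    if suggestion is None or not is_healthy(suggestion[0]):
        return None
    return suggestion
```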


Story 2.4: Builder Workspace Creation via Par

As a developer, I want isolated builder workspace created for my task, So that my work doesn’t interfere with other projects.

Acceptance Criteria:

Given Session is approved with project “domain-apis” and repos “workspace-root,domain-apis” When Gateway invokes task builder:init BUILDER_NAME=domain-apis-auth REPOS=workspace-root,domain-apis Then Par creates worktree at .builders/domain-apis-auth/repos/ And Workspace includes workspace-root and domain-apis repositories

Given Workspace creation succeeds When Gateway checks workspace directory Then Directory .builders/domain-apis-auth/repos/workspace-root exists And Directory .builders/domain-apis-auth/repos/domain-apis exists

Given Workspace creation fails (repo not found) When Par returns error Then Gateway posts error to Slack thread And Session creation is aborted

Given Workspace already exists for builder name When Gateway invokes builder:init Then Par reuses existing workspace And Workspace is reset to clean state

Given Multiple sessions are active When Each session has different builder name Then Each workspace is isolated in separate .builders/ subdirectory And No cross-contamination occurs
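
The `task builder:init` invocation can be assembled as an argument list so that project and repository names never pass through a shell. The helper name and the character whitelist are assumptions; only the command shape comes from the criteria above.

```python
import re
from typing import List

# Assumed whitelist: letters, digits, dot, underscore, hyphen.
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def builder_init_argv(builder_name: str, repos: List[str]) -> List[str]:
    """Build the argv for `task builder:init` without going through a shell."""
    if not repos:
        raise ValueError("at least one repository is required")
    for value in [builder_name, *repos]:
        if not SAFE_NAME.fullmatch(value):
            raise ValueError(f"unsafe name: {value!r}")
    return [
        "task",
        "builder:init",
        f"BUILDER_NAME={builder_name}",
        f"REPOS={','.join(repos)}",
    ]
```

The result can be passed to `subprocess.run(argv, check=True)`; a non-zero exit then surfaces as an exception that the Gateway can report to the Slack thread.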


Story 2.5: OpenCode Session Creation via WebSocket

As a developer, I want OpenCode session created programmatically in builder workspace, So that session integrates with standard OpenCode session list and web UI.

Acceptance Criteria:

Given Workspace exists at .builders/domain-apis-auth/repos/ When Gateway sends WebSocket command { type: 'session.create', workspace: '/workspace/.builders/domain-apis-auth', task: 'Build login API', agent: 'builder', model: { providerID: 'anthropic', modelID: 'claude-sonnet-4' } } Then Bridge creates session via OpenCode SDK And Bridge responds with { type: 'session.created', session_id: 'ses_xyz', session: {...} }

Given Session is created When User runs opencode session list Then Session appears in list with title “Build login API” And Session directory shows workspace path

Given Session is created When User opens OpenCode web UI Then Session is visible in session list And Session can be accessed from web UI

Given Session creation fails (invalid workspace) When Bridge attempts to create session Then Bridge responds with { type: 'session.error', error: 'Workspace not found' } And Gateway posts error to Slack


Story 2.6: Session State Persistence to Git

As a developer, I want session state persisted to git, So that work context survives days/weeks and service restarts.

Acceptance Criteria:

Given Session is created with ID ses_xyz When Gateway writes session state Then File .session-state.yaml is created in builder workspace And State includes session_id, slack_thread_ts, routing_decision, model_choice, status

Given Session state changes (question answered) When Gateway updates state Then .session-state.yaml is updated on PVC And Git commit is created with message “chore: Session state checkpoint”

Given Gateway pod restarts When Pod comes back up and reads .session-state.yaml Then Session state is loaded from file And Session can be resumed without data loss

Given Session completes When Final state is written Then State file shows status “completed” And Git commit records completion

Given State file is corrupted When Gateway attempts to read state Then Error is logged And Session is marked as unrecoverable
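
Reading and writing `.session-state.yaml` can be sketched with the standard library alone. This assumes the state is a flat mapping and writes each value as a JSON scalar, which is also valid YAML; a real implementation would more likely use a YAML library. The function names are hypothetical, and the git commit step is omitted.

```python
import json
import os
import tempfile
from pathlib import Path

STATE_FILE = ".session-state.yaml"


def write_state(workspace: Path, state: dict) -> Path:
    """Atomically write flat session state as `key: value` lines."""
    body = "".join(f"{key}: {json.dumps(value)}\n" for key, value in state.items())
    fd, tmp = tempfile.mkstemp(dir=workspace, prefix=".session-state.")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(body)
    target = workspace / STATE_FILE
    os.replace(tmp, target)  # readers never observe a half-written file
    return target


def read_state(workspace: Path) -> dict:
    """Load state; raise ValueError if the file is corrupted."""
    state = {}
    text = (workspace / STATE_FILE).read_text(encoding="utf-8")
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"corrupted state line: {line!r}")
        try:
            state[key] = json.loads(value)
        except json.JSONDecodeError as exc:
            raise ValueError(f"corrupted value for {key!r}") from exc
    return state
```

On `ValueError` the Gateway can log the error and mark the session as unrecoverable, as the last criterion requires.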


Story 2.7: Slack Thread Creation and Mapping

As a developer, I want dedicated Slack thread created for my session, So that all updates for this work are organized in one conversation.

Acceptance Criteria:

Given Session is created with ID ses_xyz When Gateway creates Slack thread Then Thread is created in project-specific channel (or user’s DM) And Initial message includes session ID, task title, routing decision, model choice

Given Thread is created with timestamp thread_ts_123 When Gateway maps session to thread Then Mapping ses_xyz → thread_ts_123 is stored in session state And Mapping persists to .session-state.yaml

Given Session state is loaded after restart When Gateway reads thread mapping Then Future updates post to correct thread And No orphaned messages occur

Given Thread creation fails (channel not found) When Gateway attempts to create thread Then Error is logged And Session creation is aborted with user notification


Story 2.8: Agent Progress Milestone Updates

As a developer, I want major milestones posted to the Slack thread, So that I know when the agent completes phases without constant monitoring.

Acceptance Criteria:

Given Agent completes analysis phase When Bridge streams { type: 'agent.milestone', session_id: 'ses_xyz', milestone: 'analysis_complete', description: 'Requirements analyzed' } Then Gateway posts to thread: ”✅ Milestone: Requirements analyzed” And Message includes timestamp

Given Agent completes implementation phase When Milestone event is received Then Gateway posts: ”🎉 Milestone: Implementation complete” And Message includes summary of changes

Given Tests pass When Milestone tests_passing is received Then Gateway posts: “✅ Milestone: All tests passing (45/45)” And Message is formatted with emoji and clear status

Given Multiple milestones occur rapidly When Events arrive within 30 seconds Then Gateway batches updates into single message And Slack thread isn’t spammed
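The batching criterion above can be sketched as a small accumulator. This is a minimal illustration, not the Gateway's actual design: the class name MilestoneBatcher and its methods are hypothetical, and a real Gateway would also need a timer to flush the last batch when no further event arrives.

```python
from dataclasses import dataclass, field
from typing import Optional

BATCH_WINDOW_SECONDS = 30.0  # window taken from the acceptance criterion


@dataclass
class MilestoneBatcher:
    """Hypothetical helper: groups milestones that arrive within one window."""

    window: float = BATCH_WINDOW_SECONDS
    pending: list = field(default_factory=list)
    window_start: Optional[float] = None

    def add(self, description: str, now: float) -> list:
        """Record a milestone and return any batched messages ready to post."""
        ready = []
        # If the current window has expired, emit it before starting a new one.
        if self.window_start is not None and now - self.window_start >= self.window:
            ready.append(self.flush())
        if self.window_start is None:
            self.window_start = now
        self.pending.append(description)
        return ready

    def flush(self) -> str:
        """Combine all pending milestones into a single Slack message body."""
        lines = [f"✅ Milestone: {d}" for d in self.pending]
        self.pending = []
        self.window_start = None
        return "\n".join(lines)
```

Passing the current time in as an argument keeps the sketch deterministic and easy to test; the caller decides which clock to use.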


Story 2.9: Agent Output Streaming to Slack

As a developer, I want agent output streamed to Slack in real-time, So that I can follow the agent’s thinking and progress.

Acceptance Criteria:

Given Agent streams message content When Bridge sends { type: 'message.part.streamed', session_id: 'ses_xyz', content: 'I will implement authentication using JWT...' } Then Gateway updates Slack message with accumulated content And Message shows “🤔 Agent is thinking…”

Given Message streaming completes When Bridge sends { type: 'message.completed', session_id: 'ses_xyz', message_id: 'msg_456' } Then Gateway posts final message with full content And “Thinking…” indicator is removed

Given Output exceeds Slack message limit (3000 chars) When Content accumulates beyond limit Then Gateway posts multiple messages in sequence And Messages are numbered “(1/3), (2/3), (3/3)”

Given Streaming is interrupted (connection lost) When Reconnection occurs Then Gateway resumes from last known position And No duplicate content is posted
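The message-limit criterion above can be sketched as a chunking function. The function name split_for_slack is hypothetical, and the 3000-character limit is taken from the criterion as written. The sketch splits on raw character counts, so a real implementation would likely prefer to break at line or word boundaries.

```python
SLACK_CHUNK_LIMIT = 3000  # character limit stated in the acceptance criterion


def split_for_slack(content: str, limit: int = SLACK_CHUNK_LIMIT) -> list:
    """Split content into numbered chunks, each no longer than `limit`."""
    if len(content) <= limit:
        return [content]
    # Reserve room for a suffix such as " (12/34)"; 12 characters covers
    # up to 9999 chunks.
    reserve = 12
    body = limit - reserve
    parts = [content[i:i + body] for i in range(0, len(content), body)]
    total = len(parts)
    return [f"{part} ({n}/{total})" for n, part in enumerate(parts, start=1)]
```

For example, a 7000-character output yields three messages ending in "(1/3)", "(2/3)", and "(3/3)".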


Story 2.10: Tool Execution Visibility

As a developer, I want tool executions reported to Slack, So that I know which files the agent is modifying.

Acceptance Criteria:

Given Agent writes file When Bridge sends { type: 'tool.executed', tool: 'file_write', file: 'src/auth.ts', status: 'success' } Then Gateway posts: “🔧 Wrote file: src/auth.ts”

Given Agent runs tests When Tool execution event for bash with command npm test is received Then Gateway posts: “🧪 Running tests: npm test”

Given Agent reads files When Multiple file_read events occur rapidly Then Gateway batches into summary: “📖 Read 5 files”

Given Tool execution fails When Status is ‘error’ Then Gateway posts: “❌ Tool failed: file_write - Permission denied” And Error details are included
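The tool-visibility criteria above amount to a mapping from event payloads to Slack text. The sketch below is one possible shape. The event fields tool, file, and status come from the example payload; the fields command and error, and both function names, are assumptions made for illustration.

```python
def format_tool_event(event: dict) -> str:
    """Turn a single tool.executed event into a one-line Slack message."""
    tool = event.get("tool", "unknown")
    if event.get("status") == "error":
        # "error" is an assumed field carrying the failure reason.
        reason = event.get("error", "unknown error")
        return f"❌ Tool failed: {tool} - {reason}"
    if tool == "file_write":
        return f"🔧 Wrote file: {event['file']}"
    if tool == "bash":
        # "command" is an assumed field; only the test-run case is specified.
        return f"🧪 Running tests: {event['command']}"
    return f"🔧 Ran tool: {tool}"


def summarize_reads(events: list) -> str:
    """Collapse a burst of file_read events into one summary line."""
    count = sum(1 for e in events if e.get("tool") == "file_read")
    return f"📖 Read {count} files"
```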


Story 2.11: Multi-Project Concurrent Sessions

As a developer, I want to run multiple projects in parallel, So that I can context-switch between different work streams.

Acceptance Criteria:

Given Session 1 exists for “domain-apis” project When User starts session 2 for “market-making” project Then Both sessions run in isolated workspaces And Sessions have different builder names and workspace directories

Given Multiple sessions are active When Events arrive for different session IDs Then Each event routes to correct Slack thread And No cross-contamination occurs

Given User views session list in Slack When Command /opencode list is issued (future feature placeholder) Then All active sessions are displayed with status And User can identify which sessions are active

Given Sessions exceed limit (10 concurrent - NFR-4) When User attempts 11th session Then Error message: “Maximum concurrent sessions reached (10)” And User prompted to complete existing session first
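Event routing and the concurrency limit above can both hang off a single session-to-thread registry. The following is a minimal in-memory sketch; SessionRegistry and SessionLimitError are hypothetical names, and persistence to .session-state.yaml is left out.

```python
MAX_CONCURRENT_SESSIONS = 10  # limit from NFR-4 as cited in the criterion


class SessionLimitError(Exception):
    """Raised when a new session would exceed the concurrency limit."""


class SessionRegistry:
    """Hypothetical registry mapping session IDs to Slack thread timestamps."""

    def __init__(self, limit: int = MAX_CONCURRENT_SESSIONS):
        self.limit = limit
        self._threads = {}  # session_id -> thread_ts

    def register(self, session_id: str, thread_ts: str) -> None:
        is_new = session_id not in self._threads
        if is_new and len(self._threads) >= self.limit:
            raise SessionLimitError(
                f"Maximum concurrent sessions reached ({self.limit})"
            )
        self._threads[session_id] = thread_ts

    def thread_for(self, session_id: str):
        """Return the thread for an event's session, or None if unknown."""
        return self._threads.get(session_id)

    def complete(self, session_id: str) -> None:
        """Free a slot when a session finishes."""
        self._threads.pop(session_id, None)
```

Keying every lookup by session ID is what prevents cross-contamination between threads: an event can only ever resolve to the thread registered for its own session.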


Story 2.12: UI-Initiated Session Configuration Collection

As a developer, I want sessions created in OpenCode UI to collect configuration lazily based on actual needs, So that I can start exploratory sessions with zero friction and only provide builder/Slack config when required.

Acceptance Criteria:

Given User creates session in OpenCode UI with title “Add rate limiting” When Session is created Then Session is registered with minimal state (id, title, directory) And No builder or Slack configuration is collected yet And Session type is “exploratory”

Given Exploratory session is active When User asks questions and agent reads code Then Work proceeds without any configuration prompts And Agent uses Read, Grep, Glob tools freely

Given User attempts first write operation (Edit, Write, Bash with file modification) When Plugin intercepts write tool execution Then Execution is paused And User is prompted for builder configuration

Given User is in OpenCode web UI when write is attempted When Builder config prompt is needed Then OpenCode modal appears with fields:

  • Project: [Dropdown of projects from .ai/projectlist.md or “Create New”]
  • Workspace Name: [Auto-filled: {project-id}-{task-slug}]
  • Repositories: [Multi-select, pre-selected based on detected repos]
  • Category: [Select, inferred from project]

And Optional section: “Setup Slack notifications now?” (unchecked by default)

Given User is in Slack when write is attempted When Builder config prompt is needed Then Slack form appears in existing thread or DM with same fields And Optional section: “Setup Slack notifications?” (checked by default)

Given Builder config is provided When User submits form Then Gateway invokes task builder:init BUILDER_NAME={workspace_name} REPOS={repos} And Builder workspace is created at .builders/{workspace_name}/ And Session is moved to builder workspace And Session type changes to “work” And Write operation proceeds

Given Builder config form includes optional Slack section When User enables Slack notifications and submits Then Both builder and Slack configs are saved And Slack thread is created in specified channel And Future notifications route to Slack

Given User is in OpenCode web UI working When User goes offline (presence detection: no web activity >10min) Then Next notification triggers Slack config collection And Slack form appears: “Where should I post updates for ‘{session.title}’?” And Fields: Channel (inferred), Priority (medium default)

Given Slack config is provided via form When User submits Then Slack thread is created in chosen channel And Pending notification is posted to thread And Mapping session_id → thread_ts is saved to .session-state.yaml

Given User creates session, goes offline, then starts building When Both configs are eventually needed Then Slack config is collected first (when going offline) And Builder config is collected second (when attempting write in Slack) And The reverse order is equally supported when the write is attempted before the user goes offline

Given Project inference runs on session title “Add auth to OpenCode Slack” When System matches against .ai/projectlist.md Then Project 0012 (OpenCode Slack Integration) is suggested And Workspace name defaults to “0012-add-auth-to-opencode-slack” And Repositories default to [“ai-dev”] (from project metadata) And Category defaults to “ai-dev”

Given Project inference cannot determine project with confidence When Multiple projects match or none match Then Dropdown shows all active projects And “Create New Project” option is available And If selected, next available project ID (e.g., 0014) is assigned

Given Session with builder config exists When Gateway restarts Then .session-state.yaml is loaded from PVC And Builder config (project_id, workspace_name, repos, category) is restored And Session can resume work without re-prompting

Given Session with Slack config exists When Gateway restarts Then Slack thread mapping is loaded from state file And Future notifications route to correct thread And No duplicate threads are created

Given User opts to provide both configs at once When Builder config form shows optional Slack section Then User can check “Setup Slack notifications now” And Single form submission provides both configs And No second prompt occurs later

Implementation Notes:

  • Builder config collection: SessionConfigManager.ensure_builder_config()
  • Slack config collection: SessionConfigManager.ensure_slack_config()
  • Project ID inference: Match session title/repos/category to .ai/projectlist.md
  • Workspace naming: .builders/{project-id}-{slugified-title}/
  • State persistence: Both configs saved to .session-state.yaml on PVC
  • Presence detection: Track last_web_activity to determine if user is in OpenCode UI
  • Write detection: Plugin hook on tool.beforeExecute for write operations
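The workspace naming rule in the notes above can be sketched as follows. The function names are hypothetical, and the slug rule (lowercase, with runs of non-alphanumeric characters collapsed to a single hyphen) is an assumption chosen to reproduce the “0012-add-auth-to-opencode-slack” example from the acceptance criteria.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def workspace_name(project_id: str, title: str) -> str:
    """Build the {project-id}-{slugified-title} workspace name."""
    return f"{project_id}-{slugify(title)}"


def workspace_path(project_id: str, title: str) -> str:
    """Build the builder workspace directory under .builders/."""
    return f".builders/{workspace_name(project_id, title)}/"
```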

Test Scenarios:

  1. Exploratory session (no config): User asks “How does auth work?” - no prompts
  2. Build in UI: User attempts edit - OpenCode modal appears - builder initialized
  3. Go offline then notify: User leaves - agent has question - Slack form appears - thread created
  4. Build while offline: User in Slack asks to “Add feature” - Slack form for builder config - workspace created
  5. Bundle both configs: User checks optional Slack section in builder form - both configs saved - no second prompt
  6. Project inference: Session title matches project 0012 - workspace name auto-filled “0012-add-notifications”
  7. State recovery: Gateway restarts - session state loaded - configs restored - work resumes
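Scenarios 1 and 2 above depend on a gate that lets read-only tools through and pauses the first write until builder config exists. A minimal sketch of that gate, assuming a hypothetical Session record and a before_execute function standing in for the tool.beforeExecute hook:

```python
from dataclasses import dataclass
from typing import Optional

WRITE_TOOLS = {"Edit", "Write"}  # Bash counts only when it modifies files


@dataclass
class Session:
    """Hypothetical minimal session state (id, title, directory)."""

    id: str
    title: str
    directory: str
    session_type: str = "exploratory"
    builder_config: Optional[dict] = None


def before_execute(session: Session, tool: str, modifies_files: bool = False) -> str:
    """Decide whether a tool call may proceed or must wait for builder config."""
    is_write = tool in WRITE_TOOLS or (tool == "Bash" and modifies_files)
    if is_write and session.builder_config is None:
        return "prompt_builder_config"
    return "proceed"


def apply_builder_config(session: Session, config: dict) -> None:
    """Store the submitted config and promote the session to a work session."""
    session.builder_config = config
    session.session_type = "work"
```

How the plugin determines that a Bash command modifies files is not specified in the criteria, so the sketch takes it as an input flag.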

Epic 3: Lab Deployment & Testing

Goal: Automatically deploy work-in-progress code to isolated Kubernetes lab namespaces and execute automated tests.

Story 3.1: Namespace Creation for Builder

As a developer, I want a dedicated K8s namespace created for my work, So that deployment is isolated from other projects.

Acceptance Criteria:

Given Agent completes implementation When User clicks “Deploy to Lab” button in Slack Then Gateway creates namespace {builder-name}-lab And Namespace is labeled with builder name and project

Given Namespace already exists When Deploy is triggered Then Gateway reuses existing namespace And Previous resources are cleaned up first

Given Namespace creation fails (RBAC) When Gateway lacks permissions Then Error is posted to Slack with RBAC details And Deployment is aborted

Given Namespace is created When Deployment completes or fails Then Namespace remains active for testing And User manually cleans up with /opencode cleanup
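The namespace naming and labeling criteria above can be sketched as a manifest builder. The label keys builder and project and the function name are assumptions. The 63-character check reflects the Kubernetes rule that namespace names must be valid DNS labels.

```python
def lab_namespace(builder_name: str, project: str) -> dict:
    """Build a Namespace manifest named {builder-name}-lab."""
    name = f"{builder_name}-lab"
    if len(name) > 63:
        # Namespace names must fit in a DNS label (63 characters at most).
        raise ValueError(f"namespace name too long: {name!r}")
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            # Label keys are illustrative; the criteria only require that
            # builder name and project are recorded as labels.
            "labels": {"builder": builder_name, "project": project},
        },
    }
```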


Story 3.2: ConfigMap/PVC Deployment Strategy

As a developer, I want code deployed via ConfigMaps and PVCs, So that deployment is fast without Docker image builds.

Acceptance Criteria:

Given Small files (<1MB) exist in workspace When Gateway creates ConfigMap Then ConfigMap contains file contents as data entries And ConfigMap is named {builder-name}-config

Given Large files (>1MB) exist When Gateway prepares deployment Then Files are written to PVC And Deployment mounts PVC for file access

Given ConfigMap deployment is created When Pod starts Then ConfigMap data is mounted at /app/config And Application can read files

Given Code changes occur When Re-deployment is triggered Then ConfigMap is updated And Pods are restarted to pick up changes
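The size-based split above can be sketched as a planning step that runs before any resources are created. The function name is hypothetical. Treating 1MB as 1,000,000 bytes, and sending files of exactly that size to the PVC, are assumptions, since the criteria only cover the cases below and above the threshold.

```python
CONFIGMAP_FILE_LIMIT = 1_000_000  # "1MB" from the criteria; decimal MB assumed


def plan_deployment(files: dict) -> tuple:
    """Split {path: size_in_bytes} into (configmap_paths, pvc_paths)."""
    configmap, pvc = [], []
    for path, size in sorted(files.items()):
        if size < CONFIGMAP_FILE_LIMIT:
            configmap.append(path)
        else:
            pvc.append(path)
    return configmap, pvc
```

One caveat the per-file rule does not cover: Kubernetes caps the total data in a single ConfigMap at 1 MiB, so many small files can still exceed the limit together and may need to be spread across several ConfigMaps or moved to the PVC.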


Story 3.3: Ingress Creation with Cert

As a developer, I want an ingress created at a predictable URL, So that I can access the deployed service from a browser or Postman.

Acceptance Criteria:

Given Service is deployed in namespace domain-apis-auth-lab When Gateway creates ingress Then Ingress host is domain-apis-auth.lab.ctoaas.co And Ingress routes to service port

Given Ingress is created When Cert-manager processes ingress Then TLS certificate is issued within 2 minutes And HTTPS is available

Given Deployment completes When Gateway posts lab URL to Slack Then URL is https://domain-apis-auth.lab.ctoaas.co And URL is clickable in Slack

Given Ingress creation fails (DNS) When Gateway detects failure Then Error is posted to Slack with details And User can retry deployment
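The predictable URL in the criteria above follows directly from the namespace name. A minimal sketch of that derivation, with hypothetical function names and the lab domain taken from the example host:

```python
LAB_DOMAIN = "lab.ctoaas.co"  # domain taken from the example host above


def lab_host(namespace: str) -> str:
    """Derive the ingress host from a {builder-name}-lab namespace."""
    suffix = "-lab"
    if not namespace.endswith(suffix) or len(namespace) == len(suffix):
        raise ValueError(f"not a lab namespace: {namespace!r}")
    builder_name = namespace[: -len(suffix)]
    return f"{builder_name}.{LAB_DOMAIN}"


def lab_url(namespace: str) -> str:
    """Return the HTTPS URL that is posted to Slack."""
    return f"https://{lab_host(namespace)}"
```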


Story 3.4: Pod Readiness Waiting

As a developer, I want deployment to wait for pods to be ready, So that I don’t get the lab URL before the service is actually running.

Acceptance Criteria:

Given Deployment is applied When Pods are starting Then Gateway polls pod status every 5 seconds And Slack shows “⏳ Waiting for pods to be ready…”

Given Pods become ready When All pods show status Running with readiness probe passing Then Gateway posts “✅ Deployment ready” And Lab URL is posted to thread

Given Pods fail to become ready (CrashLoopBackOff) When 3 minutes elapse without ready state Then Gateway posts error: “❌ Deployment failed - pods not ready” And Pod logs are attached to Slack message

Given Deployment times out (>5 minutes) When Timeout is reached Then Deployment is marked failed And User is prompted to check logs
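The polling behavior above can be sketched as a loop with an injected readiness check. The function name and parameters are hypothetical. The 5-second interval and 5-minute timeout come from the criteria, and the readiness check itself (for example, a query against the Kubernetes API) is left to the caller.

```python
import time
from typing import Callable


def wait_for_ready(
    is_ready: Callable[[], bool],
    timeout: float = 300.0,   # 5-minute deployment timeout
    interval: float = 5.0,    # poll every 5 seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until is_ready() returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        # Stop if the next poll would land after the deadline.
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

Injecting clock and sleep lets the loop be tested without real waiting. The separate 3-minute CrashLoopBackOff rule could reuse the same function with timeout=180.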


Story 3.5: Unit Test Execution

As a developer, I want unit tests run in OpenCode session, So that I know tests pass before deploying to lab.

Acceptance Criteria:

Given Agent completes implementation When Tests are run via npm test or pytest Then Test output is captured And Results are parsed for pass/fail status

Given Tests pass (exit code 0) When Gateway formats results Then Slack shows “✅ Unit Tests: 45/45 passing” And Test summary includes duration

Given Tests fail (exit code 1) When Gateway formats results Then Slack shows “❌ Unit Tests: 42/45 passing (3 failures)” And Failed test names are listed

Given Tests cannot run (missing dependencies) When Test command fails Then Error is posted: “⚠️ Tests skipped - dependencies missing” And Deployment proceeds with warning


Story 3.6: Integration Test Execution Against Lab

As a developer, I want integration tests run against the deployed lab URL, So that I validate end-to-end flows before promoting to production.

Acceptance Criteria:

Given Service is deployed at https://domain-apis-auth.lab.ctoaas.co When Integration tests run with TEST_URL=https://domain-apis-auth.lab.ctoaas.co npm run test:integration Then Tests execute against live deployment And Results are captured

Given Integration tests pass When Gateway formats results Then Slack shows “✅ Integration Tests: 12/12 passing”

Given Integration tests fail When Gateway formats results Then Slack shows “❌ Integration Tests: 10/12 passing (2 failures)” And Failed test details are included

Given Lab URL is not reachable When Integration tests attempt connection Then Tests fail with connection error And Gateway posts: “❌ Lab deployment not reachable”


Story 3.7: Test Results Summary in Slack

As a developer, I want a comprehensive test summary posted to Slack, So that I can quickly assess quality before promoting.

Acceptance Criteria:

Given All tests complete (unit + integration) When Gateway compiles results Then Slack message includes:

  • ✅ Unit Tests: 45/45 passing
  • ✅ Integration Tests: 12/12 passing
  • 🎉 All tests passing

Given Some tests fail When Summary is posted Then Failed test names are listed And Overall status shows “⚠️ Some tests failing”

Given Tests are skipped (not applicable) When Summary is posted Then Status shows “⏭️ Tests skipped” And Reason is included (e.g., “no test files found”)
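The summary formats above can be sketched as a pair of formatting functions. SuiteResult and both function names are hypothetical, and parsing pass/fail counts out of npm test or pytest output is out of scope here.

```python
from dataclasses import dataclass, field


@dataclass
class SuiteResult:
    """Hypothetical parsed result for one test suite."""

    name: str
    passed: int
    total: int
    failed_names: list = field(default_factory=list)


def format_suite(result: SuiteResult) -> str:
    """Format one suite line in the style used by the criteria."""
    if result.passed == result.total:
        return f"✅ {result.name}: {result.passed}/{result.total} passing"
    failures = result.total - result.passed
    return f"❌ {result.name}: {result.passed}/{result.total} passing ({failures} failures)"


def format_summary(results: list) -> str:
    """Combine suite lines with an overall status line."""
    if not results:
        return "⏭️ Tests skipped"
    lines = [format_suite(r) for r in results]
    if all(r.passed == r.total for r in results):
        lines.append("🎉 All tests passing")
    else:
        lines.append("⚠️ Some tests failing")
        lines.extend(f"  • {name}" for r in results for name in r.failed_names)
    return "\n".join(lines)
```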


Epic 4: Production Optimization

Goal: Harden system for production reliability, performance, and security beyond MVP baseline.

Story 4.1: Response Time Monitoring

As an operator, I want response time metrics collected and visualized, So that I can ensure SLA compliance (<3s Slack ack, <30s deploy, <1s question forward).

Acceptance Criteria:

Given Slack command is received When Gateway processes command Then Duration is recorded to Prometheus metric slack_command_duration_seconds And Metric is labeled with command type

Given Deployment is triggered When Deployment completes Then Duration is recorded to deployment_duration_seconds And Metric includes namespace label

Given Permission question is forwarded When Question reaches Slack Then Latency is recorded to permission_forward_latency_seconds

Given Grafana dashboard is loaded When Operator views dashboard Then p50, p95, p99 latencies are visible And SLA violations are highlighted


Story 4.2: Webhook Retry Logic

As an operator, I want Slack webhook failures retried automatically, So that transient network issues don’t lose events.

Acceptance Criteria:

Given Slack API call fails (503 Service Unavailable) When Gateway detects failure Then Request is retried with exponential backoff, starting at 1s and doubling after each attempt (1s, 2s, 4s, and so on) And Maximum 5 retries are attempted

Given Retry succeeds on attempt 3 When Request completes Then Success is logged And Retry count is recorded to metrics

Given All retries fail When Maximum retries reached Then Event is logged to dead letter queue And Alert is sent to operator
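The retry criteria above can be sketched as a wrapper around the Slack call. All names here are hypothetical. The sketch reads “maximum 5 retries” as one initial attempt plus up to five retries, and it models the dead letter queue as a plain list.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Stand-in for a retryable failure such as HTTP 503."""


def call_with_retry(
    send: Callable[[dict], object],
    event: dict,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    dead_letters: Optional[list] = None,
):
    """Call send(event), retrying transient failures with doubling delays."""
    delay = base_delay
    for attempt in range(max_retries + 1):  # initial attempt plus retries
        try:
            return send(event)
        except TransientError:
            if attempt == max_retries:
                break
            sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s, 16s
    # All attempts failed: park the event for later inspection.
    if dead_letters is not None:
        dead_letters.append(event)
    return None
```

A production version would typically add jitter to the delays and honor a Retry-After header when Slack returns one; both are omitted here for brevity.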


Story 4.3: Security Hardening (Webhook Signature Verification)

As an operator, I want Slack webhook signatures verified, So that malicious requests are rejected.

Acceptance Criteria:

Given Slack webhook is received When Gateway validates signature using SLACK_SIGNING_SECRET Then Request is accepted only if signature matches And Invalid signatures return HTTP 401

Given Signature validation fails When Request is rejected Then Rejection is logged with source IP And Metric webhook_rejections_total increments
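The signature check above can be sketched with the standard library alone. It follows Slack's published v0 signing scheme: an HMAC-SHA256 over "v0:{timestamp}:{body}", compared against the X-Slack-Signature header, with the timestamp taken from X-Slack-Request-Timestamp. The function name and parameters are hypothetical, and the 5-minute replay window is Slack's recommendation rather than a requirement stated in this story.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,      # value of X-Slack-Request-Timestamp
    body: bytes,         # raw request body, before any parsing
    signature: str,      # value of X-Slack-Signature
    now: Optional[float] = None,
    tolerance: int = 300,
) -> bool:
    """Return True only for a fresh request with a matching signature."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    # Reject stale requests to limit replay attacks.
    if abs(now - ts) > tolerance:
        return False
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A handler would return HTTP 401 and increment webhook_rejections_total whenever this function returns False.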


Story 4.4: Comprehensive Error Recovery

As a developer, I want deployment failures clearly communicated with recovery options, So that I know how to fix issues and retry.

Acceptance Criteria:

Given Deployment fails (pod crash) When Gateway detects failure Then Slack message includes:

  • ❌ Error description
  • 📋 Pod logs (last 50 lines)
  • 🔁 Retry button
  • 🛠️ Debug instructions

Given User clicks retry button When Retry is triggered Then Deployment is attempted again And Previous failed resources are cleaned up first


