OpenCode Slack Integration - Epic Breakdown
Overview
This document provides the complete epic and story breakdown for OpenCode Slack Integration, decomposing the requirements from the PRD and Architecture into implementable stories.
Requirements Inventory
Functional Requirements
FR-1: Slack Command Interface (P0)
- /opencode start opens a workflow form
- Form includes: category, project, repositories, task title/description, priority
- Form submission creates a new work session
- System creates dedicated thread for the work session
- User receives confirmation with session details
FR-2: BMAD Routing (P0)
- System analyzes task title and description
- Provides routing suggestion (architect/PM/builder/party-mode) with confidence score
- Suggests AI model: Claude Code (API) for architecture/reasoning, Qwen 2.5 Coder 32B (local) for code generation
- Configurable in global application state
- Shows reasoning for agent type and model suggestion
- User can confirm or override routing and model decision
- Routing decision recorded in session state
FR-3: Workspace Management (P0)
- System invokes task builder:init BUILDER_NAME={project} REPOS={repos}
- Workspace includes workspace-root and specified repositories
- Each workspace isolated from other sessions
- Workspace state persists across OpenCode session restarts
- Workspace can be cleaned up when work is complete
FR-4: OpenCode Session Bridge (P0)
- System spawns OpenCode session in builder workspace
- Session integrates with standard OpenCode session management (visible in opencode session list)
- Session appears in OpenCode web UI alongside manually-created sessions
- User messages from Slack forwarded to OpenCode
- OpenCode output parsed and formatted for Slack
- Questions from agents detected and presented as interactive messages
- Progress milestones posted to thread
- Session accessible from both Slack and OpenCode web UI (with conflict handling)
FR-5: Lab Deployment (P0)
- System creates namespace: {project}-lab
- Deployment uses ConfigMap/PVC strategy (no image builds for MVP)
- Ingress created at {project}.builder.lab.ctoaas.co
- System waits for pods ready before declaring success
- Lab URL posted to Slack thread
FR-6: Test Execution (P0)
- Unit tests run in OpenCode session context
- Integration tests run against lab URL
- Acceptance tests verify critical flows
- Test results formatted and posted to Slack
- All-passing status clearly indicated
FR-7: Question Handling (P1)
- Questions posted to Slack with interactive buttons
- Timeout configured (default: 2 business hours)
- Recommended option indicated
- If no response within timeout, work continues with recommended option
- Late responses accepted and handled appropriately
- Questions can spawn new threads for complex discussions
FR-8: Progress Updates (P1)
- Major milestones posted to thread (phase complete, tests passing, deployment ready)
- Significant changes within a phase are reported
- Summary posted every 1-2 hours if no milestones reached
- Updates configurable per project
- Updates include emojis and clear formatting
FR-9: Multi-Project Support (P0)
- Each project has isolated workspace and session
- Sessions can run concurrently
- User can switch between projects via Slack
- Session state tracked per project
- No cross-contamination of context between projects
FR-10: Session Persistence (P1)
- Session state stored in git (.session-state.yaml)
- OpenCode sessions can be resumed after restart
- Work context preserved across days/weeks
- User can explicitly pause/resume sessions
Non-Functional Requirements
NFR-1: Response Time (P0)
- Slack commands acknowledge within 3 seconds
- Deployment to lab namespace completes in < 30 seconds
- Question responses forwarded to OpenCode within 1 second
NFR-2: Reliability (P0)
- System recovers gracefully from Slack webhook failures (retry logic)
- Session state never lost (git-backed persistence)
- Deployment failures clearly communicated with recovery options
NFR-3: Security (P0)
- Slack webhook signatures verified
- K8s service account with least-privilege RBAC
- Lab namespaces isolated from production
- No secrets exposed in Slack messages
NFR-4: Scalability (P2)
- MVP: Single user, < 10 concurrent sessions
- Future: Multiple users, 50+ concurrent sessions
Additional Requirements
From Architecture:
Bridge Plugin WebSocket Integration:
- Bridge plugin establishes WebSocket connection to Gateway on startup
- Request/response pattern using request_id for correlation
- Event streaming for real-time updates (no polling)
- Reconnection logic with exponential backoff
- Plugin handles commands: session.create, session.message, session.list, session.abort
- Plugin streams events: session.created, message.part.streamed, tool.executed, agent.milestone, session.idle/active
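The request/response correlation described above can be sketched from the Gateway side as follows. This is a minimal illustration rather than the project's implementation: the PendingRequests class, the req-N id format, and the demo() helper are hypothetical names, and the real WebSocket send/receive calls are replaced by plain JSON strings.

```python
import asyncio
import itertools
import json


class PendingRequests:
    """Tracks in-flight commands so replies can be matched by request_id."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self._futures = {}  # request_id -> asyncio.Future awaiting the reply

    def prepare(self, command: dict):
        """Attach a fresh request_id; return (wire_text, future_for_reply)."""
        request_id = f"req-{next(self._counter)}"  # hypothetical id format
        future = asyncio.get_running_loop().create_future()
        self._futures[request_id] = future
        return json.dumps({**command, "request_id": request_id}), future

    def resolve(self, raw_reply: str) -> bool:
        """Complete the matching future; return False for unsolicited events."""
        reply = json.loads(raw_reply)
        future = self._futures.pop(reply.get("request_id"), None)
        if future is None or future.done():
            return False
        future.set_result(reply)
        return True


async def demo() -> dict:
    pending = PendingRequests()
    wire, future = pending.prepare({"type": "session.list"})
    sent = json.loads(wire)
    # Stand-in for the Bridge: echo the request_id back with a reply payload.
    pending.resolve(json.dumps(
        {"type": "session.list", "request_id": sent["request_id"], "sessions": []}
    ))
    return await asyncio.wait_for(future, timeout=1)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Messages without a known request_id (streamed events such as session.idle) fall through resolve() with False, so the same socket can carry both replies and event streams.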
State Persistence Strategy:
- Mount PVC to ~/.local/share/opencode for OpenCode state persistence
- OpenCode auth tokens persist across pod restarts
- Session metadata and history persist across pod restarts
- Plugin code loaded from PVC workspace (/workspace/.opencode-plugins/opencode-bridge)
- No Docker rebuild needed for plugin changes
OpenCode SDK Session Management:
- Use @opencode-ai/sdk/dist/v2/client.js for programmatic session control
- Sessions created via SDK appear in OpenCode web UI
- Agent routing via agent: 'builder' parameter
- Model selection via model: { providerID, modelID } parameter
Kubernetes Integration:
- Three components: LGTM Stack (observability), Gateway (Python FastAPI), Codev Pod (OpenCode + Bridge Plugin)
- Gateway namespace: ai-dev
- Gateway service: ClusterIP gateway.ai-dev.svc.cluster.local:8000
- Secrets via ClusterExternalSecret pattern
- Shared PVC: code-server-storage (ReadWriteMany) mounted by codev + gateway
- No public ingress for Gateway (Slack Socket Mode outbound)
- ArgoCD auto-sync for lab environment
Observability Requirements:
- Gateway logs → Loki
- OpenCode plugin logs → Loki
- Grafana dashboards for session tracking, question latency, Slack interactions
- Distributed tracing for question flow across components (future)
Code Review Follow-ups (Medium Priority):
- M1: Race condition in double-click handling (services/gateway/services/slack_app.py:64-66)
- M2: Request validation in permission endpoint (services/gateway/api/opencode.py:36)
- M3: Hardcoded timeout values inconsistent (bridge plugin 310s vs gateway 300s)
FR Coverage Map
Functional Requirements:
- FR-1 (Slack Command Interface) → Epic 2
- FR-2 (BMAD Routing) → Epic 2
- FR-3 (Workspace Management) → Epic 2
- FR-4 (OpenCode Session Bridge) → Epic 2
- FR-5 (Lab Deployment) → Epic 3
- FR-6 (Test Execution) → Epic 3
- FR-7 (Question Handling) → Epic 1
- FR-8 (Progress Updates) → Epic 2
- FR-9 (Multi-Project Support) → Epic 2
- FR-10 (Session Persistence) → Epic 2
Non-Functional Requirements:
- NFR-1 (Response Time) → Epic 1 (baseline), Epic 4 (optimization)
- NFR-2 (Reliability) → Epic 1 (foundation), Epic 4 (hardening)
- NFR-3 (Security) → Epic 1 (baseline), Epic 4 (hardening)
- NFR-4 (Scalability) → Deferred (post-MVP)
Architecture Requirements:
- WebSocket Integration → Epic 1
- State Persistence → Epic 1
- Plugin Loading from PVC → Epic 1
- K8s Deployment → Epic 1
- Observability (LGTM) → Epic 1
- Code Review Follow-ups (M1-M3) → Epic 1
Epic List
Epic 1: Steel Thread Production Deployment
Goal: Permission bridge is fully deployed to K8s with WebSocket integration, state persistence, and observability - production-ready foundation.
User Outcome: Developers receive Slack notifications when OpenCode agents request permissions and can approve/deny from mobile. System runs in production with full observability and zero-rebuild iteration capability.
FRs covered: FR-7 (Question Handling)
Architecture covered: WebSocket bidirectional communication, state persistence (PVC mounts), plugin loading from PVC, K8s deployment (Gateway + Codev), LGTM observability, ArgoCD integration, code review hardening (M1-M3)
Epic 2: Session Management & Async Interaction
Goal: Enable developers to start OpenCode sessions from Slack, receive progress updates, and interact asynchronously across days/weeks.
User Outcome: Developers can initiate development work via /opencode start command, system routes to appropriate BMAD agents, manages isolated workspaces, and provides async progress updates. Work context persists across sessions.
FRs covered: FR-1 (Slack Command Interface), FR-2 (BMAD Routing), FR-3 (Workspace Management), FR-4 (OpenCode Session Bridge), FR-8 (Progress Updates), FR-9 (Multi-Project Support), FR-10 (Session Persistence)
Epic 3: Lab Deployment & Testing
Goal: Automatically deploy work-in-progress code to isolated Kubernetes lab namespaces and execute automated tests.
User Outcome: Developers can deploy completed work to isolated K8s namespaces with automatic ingress creation, run automated tests (unit/integration/acceptance), and receive test results in Slack.
FRs covered: FR-5 (Lab Deployment), FR-6 (Test Execution)
Epic 4: Production Optimization
Goal: Harden system for production reliability, performance, and security beyond MVP baseline.
User Outcome: System meets production SLAs with comprehensive error handling, retry logic, performance monitoring, and security hardening.
NFRs covered: NFR-1 (Response Time optimization), NFR-2 (Reliability hardening), NFR-3 (Security hardening)
Epic 1: Steel Thread Production Deployment
Goal: Permission bridge is fully deployed to K8s with WebSocket integration, state persistence, and observability - production-ready foundation.
Story 1.1: WebSocket Server in Gateway
As a developer, I want the Gateway to accept WebSocket connections from the Bridge Plugin, So that bidirectional real-time communication is established for session commands and event streaming.
Acceptance Criteria:
Given Gateway service is running
When Bridge Plugin initiates WebSocket connection to ws://gateway:8000/ws/bridge
Then Connection is accepted and established
And Gateway logs successful connection with connection ID
Given WebSocket connection is established
When Bridge sends JSON message with type: 'ping'
Then Gateway responds with type: 'pong'
And Connection remains active
Given WebSocket connection is lost
When Bridge attempts to reconnect
Then Gateway accepts reconnection
And Previous connection state is cleaned up
Given Invalid JSON is received
When Gateway processes the message
Then Error is logged
And Connection remains open (doesn’t crash)
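The ping/pong and invalid-JSON criteria above could be handled by a small pure function that the WebSocket endpoint calls for each incoming frame. This is a sketch under assumptions: handle_bridge_message and the logger name are hypothetical, and the actual socket handling in the Gateway is not shown.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("gateway.ws")  # hypothetical logger name


def handle_bridge_message(raw: str) -> Optional[dict]:
    """Return the reply to send back, or None when no reply is needed.

    Malformed input is logged and ignored so the caller can keep the
    connection open instead of letting an exception close it.
    """
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON from bridge: %s", exc)
        return None
    if not isinstance(message, dict):
        logger.error("Expected a JSON object, got %s", type(message).__name__)
        return None
    if message.get("type") == "ping":
        return {"type": "pong"}
    return None  # other message types would be dispatched here


if __name__ == "__main__":
    print(handle_bridge_message('{"type": "ping"}'))
    print(handle_bridge_message("{not json"))
```

Keeping the parsing and dispatch separate from the socket loop makes the "doesn’t crash" criterion testable without opening a real connection.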
Story 1.2: WebSocket Client in Bridge Plugin ✅
As a developer, I want the Bridge Plugin to establish WebSocket connection to Gateway on startup, So that it can send events and receive commands from Gateway.
Acceptance Criteria:
- AC1: Bridge Plugin starts with GATEWAY_WS=ws://gateway:8000/ws/bridge - WebSocket connection to Gateway is established - Connection ready event is logged
- AC2: WebSocket connection established - Gateway sends command { type: 'ping' } - Plugin responds with { type: 'pong' } - Round-trip latency is logged
- AC3: Connection is lost - Plugin detects disconnect - Plugin attempts reconnection with exponential backoff (1s, 2s, 4s, 8s, max 30s) - Reconnection attempts are logged
- AC4: Gateway is unreachable on startup - Plugin initialization runs - Plugin continues to retry connection in background - Logs connection failures without crashing
Implementation:
- plugins/opencode-bridge/src/websocket-client.ts - BridgeWebSocketClient class
- plugins/opencode-bridge/src/plugin.ts:25-35 - WebSocket initialization in plugin
- plugins/opencode-bridge/src/plugin.test.ts:54-73 - Test for WebSocket client integration
Tests: 26 passing (all plugin tests)
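The reconnection schedule in AC3 (1s, 2s, 4s, 8s, max 30s) is a capped doubling sequence. The actual plugin is TypeScript; the sketch below only illustrates the delay calculation in Python, and the backoff_delays name and its parameters are hypothetical.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


if __name__ == "__main__":
    # First seven delays: 1, 2, 4, 8, 16, then held at the 30 s cap.
    print(list(islice(backoff_delays(), 7)))
```

A caller would sleep for each yielded delay between connection attempts and restart the generator after a successful connection.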
Story 1.3: Session Command Protocol (Gateway → Bridge) ✅
As a Gateway developer, I want to send session management commands to Bridge via WebSocket, So that I can create, message, list, and abort OpenCode sessions programmatically.
Acceptance Criteria:
- AC1: WebSocket connection established - Gateway sends session.create - Bridge creates OpenCode session using SDK - Bridge responds with session.created
- AC2: Session exists - Gateway sends session.message - Bridge sends message to OpenCode session via SDK - Bridge responds with session.message.sent
- AC3: Multiple sessions exist - Gateway sends session.list - Bridge queries OpenCode SDK - Bridge responds with session.list
- AC4: Session running - Gateway sends session.abort - Bridge aborts session via SDK - Bridge responds with session.aborted
- AC5: Command fails (invalid session ID) - Bridge processes command - Bridge responds with session.error
Implementation:
- plugins/opencode-bridge/src/handlers/session-commands.ts - Session command handlers (handleSessionCreate, handleSessionMessage, handleSessionList, handleSessionAbort)
- plugins/opencode-bridge/src/plugin.ts:45-76 - WebSocket message handler routing
- plugins/opencode-bridge/src/handlers/session-commands.test.ts - 12 tests for all session commands
Tests: 39 passing (12 new session command tests + 27 existing)
Story 1.4: Event Streaming Protocol (Bridge → Gateway) ✅
As a Bridge Plugin developer, I want to stream OpenCode events to Gateway via WebSocket, So that Gateway can format and forward events to Slack in real-time.
Acceptance Criteria:
- AC1: OpenCode streams message part - Bridge receives message.part.streamed - Bridge sends to Gateway with session_id, message_id, content, role
- AC2: OpenCode executes tool - Bridge receives tool.executed - Bridge sends to Gateway with session_id, tool, file, status
- AC3: Session becomes idle - Bridge receives session.idle - Bridge sends to Gateway with session_id
- AC4: Session becomes active - Bridge receives session.active - Bridge sends to Gateway with session_id
- AC5: Message streaming starts - Bridge receives message.started - Bridge sends to Gateway with session_id, message_id
- AC6: Message streaming completes - Bridge receives message.completed - Bridge sends to Gateway with session_id, message_id
Implementation:
- plugins/opencode-bridge/src/handlers/event-streaming.ts - Event streaming handler
- plugins/opencode-bridge/src/plugin.ts:147 - Integrated into event hook
- services/gateway/main.py:157-168 - Gateway receives and logs events
- plugins/opencode-bridge/src/handlers/event-streaming.test.ts - 10 tests for all event types
- services/gateway/test_websocket.py:87-197 - 6 tests for Gateway event reception
Tests:
- Bridge: 65 passing (10 new event streaming tests)
- Gateway: 36 passing (6 new event reception tests)
- Coverage: 95.52% (100% on event-streaming.ts)
Story 1.5: OpenCode State Persistence via PVC ✅
As a developer, I want OpenCode state persisted to PVC-mounted storage, So that sessions, auth tokens, and history survive pod restarts.
Acceptance Criteria:
- AC1: Codev pod configured with PVC mount - volumeMount to /home/opencode/.local/share/opencode from subPath .opencode-data - OpenCode state directory points to PVC
- AC2: User authenticates with Anthropic OAuth - Auth token saved to auth.json - Token written to PVC at .opencode-data/auth.json - Token persists after pod restart
- AC3: OpenCode session created - Session metadata written to storage/session/ - Session data written to PVC - Session appears in list after pod restart
- AC4: Message history exists - Pod restarts - Full message history available in resumed session - No data loss
- AC5: PVC mount fails on startup - Pod initialization runs - Pod fails with clear error - Error logged to stdout/stderr (K8s default behavior)
Implementation:
- infrastructure/kustomize/components/codev/deployment.yaml - Codev deployment with PVC mounts
- infrastructure/kustomize/components/codev/pvc.yaml - PVC definition (ReadWriteMany, 10Gi)
- infrastructure/kustomize/components/codev/service.yaml - ClusterIP service for OpenCode
- infrastructure/kustomize/components/codev/service-account.yaml - ServiceAccount for pod
- infrastructure/kustomize/components/codev/README.md - Deployment and testing documentation
K8s Configuration:
volumeMounts:
  - name: code-server-storage
    mountPath: /workspace  # Code files
  - name: code-server-storage
    mountPath: /home/opencode/.local/share/opencode  # State persistence
    subPath: .opencode-data
  - name: code-server-storage
    mountPath: /workspace/.opencode-plugins  # Plugin code (Story 1.6)
    subPath: .opencode-plugins
PVC Structure:
code-server-storage/
├── repos/ # Workspace code
├── .opencode-data/ # OpenCode state (NEW)
│ ├── auth.json # OAuth tokens
│ └── storage/ # Sessions, messages, parts
└── .opencode-plugins/ # Plugin code (Story 1.6)
Story 1.6: Plugin Loading from PVC Workspace
As a developer, I want Bridge Plugin loaded from PVC workspace on pod startup, So that plugin code changes don’t require Docker rebuilds.
Acceptance Criteria:
Given Plugin source exists at /workspace/.opencode-plugins/opencode-bridge/
When Codev pod starts and runs entrypoint script
Then Script runs cd /workspace/.opencode-plugins/opencode-bridge && npm install && npm link
And Plugin is globally linked
Given Plugin is linked globally
When Entrypoint script runs cd ~/.config/opencode && npm link @opencode-bridge
Then OpenCode can discover the plugin
And Plugin loads on OpenCode startup
Given Plugin code is modified on PVC
When Pod restarts
Then New plugin code is loaded (via npm install + link)
And No Docker rebuild is required
Given Plugin has npm dependencies
When npm install runs in plugin directory
Then Dependencies are installed to plugin’s node_modules
And Installation completes successfully
Given Plugin npm install fails
When Entrypoint script detects failure
Then Pod logs error details
And Pod continues startup (graceful degradation)
Story 1.7: Gateway Pod K8s Deployment
As a platform operator, I want Gateway deployed as K8s pod in ai-dev namespace, So that it runs in production with proper resource limits and secrets.
Acceptance Criteria:
Given Gateway image is built and pushed to ghcr.io/craigedmunds/opencode-slack-gateway:latest
When Kustomize applies infrastructure/kustomize/components/opencode-slack-gateway/
Then Deployment creates Gateway pod in ai-dev namespace
And Pod is running with status Ready
Given Gateway pod is deployed
When Pod starts
Then Environment variables are loaded from ClusterExternalSecret
And SLACK_BOT_TOKEN and SLACK_APP_TOKEN are available
Given Gateway pod needs persistent storage
When Pod mounts PVC code-server-storage
Then PVC is mounted at /workspace (shared with Codev)
And Gateway can read/write to shared filesystem
Given Gateway service is created
When Service manifest is applied
Then ClusterIP service gateway.ai-dev.svc.cluster.local:8000 is created
And Service routes to Gateway pod port 8000
Given Resource limits are defined
When Pod is scheduled
Then Pod requests 256Mi memory, 100m CPU
And Pod limits 512Mi memory, 500m CPU
Given Gateway crashes
When Pod exit occurs
Then K8s restarts pod automatically
And Restart count increments
Story 1.8: Codev Pod Updates for Bridge Plugin
As a platform operator, I want Codev pod updated to load Bridge Plugin and configure Gateway URL, So that Bridge can connect to Gateway on pod startup.
Acceptance Criteria:
Given Codev pod Dockerfile is updated
When Image is built
Then Entrypoint script includes plugin loading logic
And GATEWAY_URL environment variable is set to http://gateway.ai-dev.svc.cluster.local:8000
Given Pod starts with plugin source on PVC
When Entrypoint runs plugin installation
Then Bridge Plugin is installed and linked
And OpenCode loads plugin on startup
Given OpenCode starts with Bridge Plugin loaded
When Plugin initialization runs
Then WebSocket connection to $GATEWAY_URL/ws/bridge is established
And Connection success is logged
Given Gateway is not yet running
When Bridge Plugin attempts connection
Then Plugin retries with exponential backoff
And Pod doesn’t crash waiting for Gateway
Given Codev pod restarts
When Pod comes back up
Then Bridge Plugin reconnects to Gateway
And WebSocket connection is re-established
Story 1.9: LGTM Observability Stack Deployment
As a platform operator, I want LGTM stack deployed to capture logs and metrics, So that I can monitor Gateway and Bridge behavior in production.
Acceptance Criteria:
Given LGTM component exists in k8s-lab/components/lgtm/
When ArgoCD syncs the component
Then LGTM pod is running in lgtm namespace
And Services are available: Grafana (3000), Loki (3100), OTLP (4317/4318)
Given LGTM stack is running
When Ingress is created at lgtm.lab.ctoaas.co
Then Grafana UI is accessible via browser
And Default dashboards are loaded
Given Gateway pod is running
When Gateway logs to stdout
Then Logs are captured by Loki
And Logs are queryable in Grafana
Given Bridge Plugin logs events
When Plugin writes to OpenCode log files
Then Logs are captured by Loki
And Logs are searchable by session ID
Given PVC is mounted for LGTM data
When Pod restarts
Then Grafana dashboards and Loki data persist
And No data loss occurs
Story 1.10: ArgoCD Integration for Auto-Sync
As a platform operator, I want ArgoCD Application configured for ai-dev components, So that Git commits automatically deploy to lab environment.
Acceptance Criteria:
Given ArgoCD Application is defined in k8s-lab/other-seeds/ai-dev.yaml
When Application manifest specifies source https://github.com/craigedmunds/ai-dev
Then ArgoCD syncs from infrastructure/kustomize/components path
And Target namespace is ai-dev
Given Auto-sync is enabled
When Git commit is pushed to ai-dev repo
Then ArgoCD detects change within 3 minutes
And New manifests are applied automatically
Given Sync fails (invalid YAML)
When ArgoCD attempts sync
Then Application status shows degraded
And Error details are visible in ArgoCD UI
Given Multiple components exist (gateway, codev updates)
When ArgoCD syncs
Then All components are applied in correct order
And Dependencies are respected
Given Manual sync is triggered
When Operator clicks “Sync” in ArgoCD UI
Then Sync completes successfully
And All resources show healthy status
Story 1.11: Code Review Hardening (M1-M3)
As a developer, I want code review follow-ups addressed, So that production deployment is hardened against race conditions, validation gaps, and timeout inconsistencies.
Acceptance Criteria:
Given Double-click can occur on Slack button (M1)
When Two clicks arrive simultaneously
Then Lock is acquired before pop() operation
And Only first click processes, second returns early
And No duplicate responses occur
Given Permission request arrives at Gateway (M2)
When Request contains session_id
Then Session ID format is validated (length >= 8, alphanumeric)
And Invalid session IDs return HTTP 400 with error message
Given Timeout values are hardcoded (M3)
When Gateway starts
Then Timeout is read from env var PERMISSION_TIMEOUT_SECONDS (default: 300)
And Bridge Plugin uses GATEWAY_TIMEOUT + 10 buffer
And Timeouts are configurable without code changes
Given Lock implementation is added (M1)
When Concurrent clicks occur in load test
Then No race conditions occur in 1000 click simulation
And All responses are correctly deduplicated
Given Session validation is added (M2)
When Invalid session IDs are sent (empty, too short, special chars)
Then All invalid formats are rejected
And Valid formats pass through
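The three follow-ups above can be sketched together. This is an illustration under assumptions, not the code in services/gateway: the _pending store, the function names, and the use of threading.Lock (rather than an asyncio lock) are hypothetical. One caveat: a strictly alphanumeric rule, as written in the M2 criterion, would reject an ID containing an underscore such as the ses_xyz placeholder used in later stories, so the exact character set needs confirming.

```python
import os
import threading

_pending_lock = threading.Lock()
_pending = {}  # hypothetical store: request id -> pending permission request


def claim_pending(request_id: str):
    """M1: pop under a lock so only the first click gets the request."""
    with _pending_lock:
        return _pending.pop(request_id, None)


def is_valid_session_id(session_id: object) -> bool:
    """M2 as written: a string of at least 8 ASCII alphanumeric characters."""
    return (
        isinstance(session_id, str)
        and len(session_id) >= 8
        and session_id.isascii()
        and session_id.isalnum()
    )


def permission_timeout_seconds() -> int:
    """M3: read the Gateway timeout from the environment (default 300)."""
    return int(os.environ.get("PERMISSION_TIMEOUT_SECONDS", "300"))


def bridge_timeout_seconds() -> int:
    """M3: the Bridge waits 10 seconds longer than the Gateway."""
    return permission_timeout_seconds() + 10


def simulate_clicks(request_id: str, clicks: int = 1000) -> int:
    """Fire concurrent clicks at one request; return how many were processed."""
    _pending[request_id] = {"status": "waiting"}
    winners = []

    def click() -> None:
        if claim_pending(request_id) is not None:
            winners.append(request_id)

    threads = [threading.Thread(target=click) for _ in range(clicks)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(winners)


if __name__ == "__main__":
    print(simulate_clicks("req1"), is_valid_session_id("abc12345"))
```

In a FastAPI handler, a False result from is_valid_session_id would map to the HTTP 400 response named in the criterion.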
Epic 2: Session Management & Async Interaction
Goal: Enable developers to start OpenCode sessions from Slack, receive progress updates, and interact asynchronously across days/weeks.
Story 2.1: Slack Slash Command /opencode start
As a developer,
I want to initiate development work via /opencode start command in Slack,
So that I can start OpenCode sessions from mobile without terminal access.
Acceptance Criteria:
Given Slack app is installed in workspace
When User types /opencode start in any channel
Then Workflow form appears with fields: category, project, repositories, task title, task description, priority
And Form loads within 3 seconds
Given Form is displayed
When User fills required fields and submits
Then Gateway receives form data via Socket Mode
And Slack acknowledges submission with “Starting session…” message
Given Form submission fails (network error)
When Submission times out
Then User sees error message “Failed to submit. Please try again.”
And Form data is preserved for retry
Given User cancels form
When Cancel button is clicked
Then Form closes without action
And No session is created
Story 2.2: BMAD Agent Routing Suggestion
As a developer, I want system to suggest appropriate BMAD agent based on my task, So that work is routed to the right agent type (architect/PM/builder/party-mode).
Acceptance Criteria:
Given Task title is “Design authentication architecture”
When Routing algorithm analyzes task
Then Suggestion is “architect” with confidence >80%
And Reasoning includes “contains ‘design’ and ‘architecture’ keywords”
Given Task title is “Implement login API”
When Routing algorithm analyzes task
Then Suggestion is “builder” with confidence >70%
And Reasoning includes “contains ‘implement’ keyword”
Given Task is ambiguous “Fix the thing”
When Routing algorithm analyzes task
Then Suggestion is “builder” (default) with confidence <50%
And Reasoning includes “insufficient context for confident routing”
Given Routing suggestion is displayed
When User reviews suggestion
Then User can confirm or override routing decision
And Override option shows all 4 agent types
Given Routing decision is made
When Session is created
Then Routing decision is recorded in session state
And OpenCode session uses selected agent type
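One way to satisfy the three routing examples above is a simple keyword count. The sketch below is only that: the keyword lists, the confidence formula, and the names suggest_agent and RoutingSuggestion are all assumptions chosen to reproduce the documented examples, not the project's routing algorithm.

```python
import re
from dataclasses import dataclass

# Illustrative keyword lists; the real routing vocabulary is not specified here.
AGENT_KEYWORDS = {
    "architect": ("design", "architecture"),
    "pm": ("requirements", "roadmap", "prioritize"),
    "builder": ("implement", "build"),
    "party-mode": ("brainstorm",),
}
DEFAULT_AGENT = "builder"


@dataclass
class RoutingSuggestion:
    agent: str
    confidence: float  # 0.0 to 1.0
    reasoning: str


def suggest_agent(title: str, description: str = "") -> RoutingSuggestion:
    """Pick the agent whose keywords appear most often in the task text."""
    words = set(re.findall(r"[a-z]+", f"{title} {description}".lower()))
    hits = {
        agent: [kw for kw in keywords if kw in words]
        for agent, keywords in AGENT_KEYWORDS.items()
    }
    agent, matched = max(hits.items(), key=lambda item: len(item[1]))
    if not matched:
        return RoutingSuggestion(
            DEFAULT_AGENT, 0.3, "insufficient context for confident routing"
        )
    # Assumed scoring: one hit gives about 0.75, two or more are capped at 0.95.
    confidence = min(0.95, 0.55 + 0.2 * len(matched))
    quoted = " and ".join(f"'{kw}'" for kw in matched)
    noun = "keyword" if len(matched) == 1 else "keywords"
    return RoutingSuggestion(agent, confidence, f"contains {quoted} {noun}")


if __name__ == "__main__":
    for task in ("Design authentication architecture", "Implement login API", "Fix the thing"):
        print(task, "->", suggest_agent(task))
```

Returning the reasoning string alongside the score keeps the suggestion explainable, which the confirm-or-override step relies on.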
Story 2.3: AI Model Selection Suggestion
As a developer, I want system to suggest AI model based on task complexity, So that work uses Claude Code for complex reasoning or Qwen Coder for implementation.
Acceptance Criteria:
Given Agent type is “architect” or “pm”
When Model selection runs
Then Suggestion is “Claude Code” (Anthropic API)
And Reasoning is “Complex reasoning required for architecture/planning”
Given Agent type is “builder”
When Model selection runs
Then Suggestion is “Qwen 2.5 Coder 32B” (local)
And Reasoning is “Code generation optimized for local model”
Given Model suggestion is displayed
When User reviews suggestion
Then User can confirm or override model choice
And Override shows both available models
Given Model selection is made
When Session is created
Then Model choice is recorded in session state
And OpenCode session uses selected model
Given Suggested model is unavailable
When Health check fails
Then User is prompted to select alternative model
And Session creation waits for user decision
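The model-selection rules above reduce to a lookup plus a health check. In this sketch the suggest_model function, the ModelSuggestion type, and the idea of passing in a set of healthy models are hypothetical. The criteria do not name a default for party-mode, so the sketch asks the user in that case instead of guessing.

```python
from dataclasses import dataclass
from typing import Optional

CLAUDE = "Claude Code"
QWEN = "Qwen 2.5 Coder 32B"

# Defaults taken from the acceptance criteria; party-mode is left out on purpose.
_DEFAULTS = {
    "architect": (CLAUDE, "Complex reasoning required for architecture/planning"),
    "pm": (CLAUDE, "Complex reasoning required for architecture/planning"),
    "builder": (QWEN, "Code generation optimized for local model"),
}


@dataclass
class ModelSuggestion:
    model: Optional[str]      # None means no usable suggestion
    reasoning: str
    needs_user_choice: bool   # True when session creation must wait for the user


def suggest_model(agent_type: str, healthy_models: set) -> ModelSuggestion:
    default = _DEFAULTS.get(agent_type)
    if default is None:
        return ModelSuggestion(
            None, f"No documented default for agent type {agent_type!r}", True
        )
    model, reasoning = default
    if model not in healthy_models:
        return ModelSuggestion(None, f"{model} failed its health check", True)
    return ModelSuggestion(model, reasoning, False)


if __name__ == "__main__":
    print(suggest_model("builder", {CLAUDE, QWEN}))
    print(suggest_model("builder", {CLAUDE}))
```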
Story 2.4: Builder Workspace Creation via Par
As a developer, I want isolated builder workspace created for my task, So that my work doesn’t interfere with other projects.
Acceptance Criteria:
Given Session is approved with project “domain-apis” and repos “workspace-root,domain-apis”
When Gateway invokes task builder:init BUILDER_NAME=domain-apis-auth REPOS=workspace-root,domain-apis
Then Par creates worktree at .builders/domain-apis-auth/repos/
And Workspace includes workspace-root and domain-apis repositories
Given Workspace creation succeeds
When Gateway checks workspace directory
Then Directory .builders/domain-apis-auth/repos/workspace-root exists
And Directory .builders/domain-apis-auth/repos/domain-apis exists
Given Workspace creation fails (repo not found)
When Par returns error
Then Gateway posts error to Slack thread
And Session creation is aborted
Given Workspace already exists for builder name
When Gateway invokes builder:init
Then Par reuses existing workspace
And Workspace is reset to clean state
Given Multiple sessions are active
When Each session has different builder name
Then Each workspace is isolated in separate .builders/ subdirectory
And No cross-contamination occurs
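The command line and the directories the Gateway checks afterwards can be derived from the builder name and repo list. The helper names below are hypothetical, and the name-validation pattern is an added assumption intended to keep each workspace inside its own .builders/ subdirectory. Actually running the command (for example via subprocess) is left out.

```python
import re
from pathlib import PurePosixPath

# Assumed naming rule: lowercase letters, digits, and hyphens only.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]*")


def _check_name(name: str) -> str:
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"unsafe builder or repo name: {name!r}")
    return name


def builder_init_command(builder_name: str, repos: list) -> list:
    """Build the argv for: task builder:init BUILDER_NAME=... REPOS=..."""
    _check_name(builder_name)
    repo_arg = ",".join(_check_name(repo) for repo in repos)
    return ["task", "builder:init", f"BUILDER_NAME={builder_name}", f"REPOS={repo_arg}"]


def expected_repo_dirs(builder_name: str, repos: list, root: str = ".builders") -> list:
    """Directories that should exist once workspace creation succeeds."""
    base = PurePosixPath(root) / _check_name(builder_name) / "repos"
    return [str(base / _check_name(repo)) for repo in repos]


if __name__ == "__main__":
    repos = ["workspace-root", "domain-apis"]
    print(builder_init_command("domain-apis-auth", repos))
    print(expected_repo_dirs("domain-apis-auth", repos))
```

Passing the command as an argument list rather than a shell string avoids quoting problems if a project name ever contains unexpected characters.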
Story 2.5: OpenCode Session Creation via WebSocket
As a developer, I want OpenCode session created programmatically in builder workspace, So that session integrates with standard OpenCode session list and web UI.
Acceptance Criteria:
Given Workspace exists at .builders/domain-apis-auth/repos/
When Gateway sends WebSocket command { type: 'session.create', workspace: '/workspace/.builders/domain-apis-auth', task: 'Build login API', agent: 'builder', model: { providerID: 'anthropic', modelID: 'claude-sonnet-4' } }
Then Bridge creates session via OpenCode SDK
And Bridge responds with { type: 'session.created', session_id: 'ses_xyz', session: {...} }
Given Session is created
When User runs opencode session list
Then Session appears in list with title “Build login API”
And Session directory shows workspace path
Given Session is created
When User opens OpenCode web UI
Then Session is visible in session list
And Session can be accessed from web UI
Given Session creation fails (invalid workspace)
When Bridge attempts to create session
Then Bridge responds with { type: 'session.error', error: 'Workspace not found' }
And Gateway posts error to Slack
Story 2.6: Session State Persistence to Git
As a developer, I want session state persisted to git, So that work context survives days/weeks and service restarts.
Acceptance Criteria:
Given Session is created with ID ses_xyz
When Gateway writes session state
Then File .session-state.yaml is created in builder workspace
And State includes session_id, slack_thread_ts, routing_decision, model_choice, status
Given Session state changes (question answered)
When Gateway updates state
Then .session-state.yaml is updated on PVC
And Git commit is created with message “chore: Session state checkpoint”
Given Gateway pod restarts
When Pod comes back up and reads .session-state.yaml
Then Session state is loaded from file
And Session can be resumed without data loss
Given Session completes
When Final state is written
Then State file shows status “completed”
And Git commit records completion
Given State file is corrupted
When Gateway attempts to read state
Then Error is logged
And Session is marked as unrecoverable
Story 2.7: Slack Thread Creation and Mapping
As a developer, I want dedicated Slack thread created for my session, So that all updates for this work are organized in one conversation.
Acceptance Criteria:
Given Session is created with ID ses_xyz
When Gateway creates Slack thread
Then Thread is created in project-specific channel (or user’s DM)
And Initial message includes session ID, task title, routing decision, model choice
Given Thread is created with timestamp thread_ts_123
When Gateway maps session to thread
Then Mapping ses_xyz → thread_ts_123 is stored in session state
And Mapping persists to .session-state.yaml
Given Session state is loaded after restart
When Gateway reads thread mapping
Then Future updates post to correct thread
And No orphaned messages occur
Given Thread creation fails (channel not found)
When Gateway attempts to create thread
Then Error is logged
And Session creation is aborted with user notification
Story 2.8: Agent Progress Milestone Updates
As a developer, I want major milestones posted to Slack thread, So that I know when agent completes phases without constant monitoring.
Acceptance Criteria:
Given Agent completes analysis phase
When Bridge streams { type: 'agent.milestone', session_id: 'ses_xyz', milestone: 'analysis_complete', description: 'Requirements analyzed' }
Then Gateway posts to thread: “✅ Milestone: Requirements analyzed”
And Message includes timestamp
Given Agent completes implementation phase
When Milestone event is received
Then Gateway posts: “🎉 Milestone: Implementation complete”
And Message includes summary of changes
Given Tests pass
When Milestone tests_passing is received
Then Gateway posts: “✅ Milestone: All tests passing (45/45)”
And Message formatted with emoji and clear status
Given Multiple milestones occur rapidly
When Events arrive within 30 seconds
Then Gateway batches updates into single message
And Slack thread isn’t spammed
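The batching rule in the last scenario can be illustrated with an offline grouping function. This is a sketch under assumptions: batch_milestones is a hypothetical name, the 30-second window is measured from the first event of each batch, and events exactly 30 seconds apart are treated as "within" the window. A live Gateway would also need a timer to flush a batch, which is not shown.

```python
def batch_milestones(events: list, window_seconds: float = 30.0) -> list:
    """Group (timestamp, description) pairs into one Slack message per window.

    events must be sorted by timestamp (seconds). Each returned string is
    one message, with one milestone per line.
    """
    batches = []  # each entry: (timestamp of first event, list of descriptions)
    for timestamp, description in events:
        if batches and timestamp - batches[-1][0] <= window_seconds:
            batches[-1][1].append(description)
        else:
            batches.append((timestamp, [description]))
    return [
        "\n".join(f"✅ Milestone: {text}" for text in descriptions)
        for _, descriptions in batches
    ]


if __name__ == "__main__":
    events = [
        (0, "Requirements analyzed"),
        (10, "Implementation complete"),
        (100, "All tests passing"),
    ]
    for message in batch_milestones(events):
        print(message)
        print("---")
```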
Story 2.9: Agent Output Streaming to Slack
As a developer, I want agent output streamed to Slack in real-time, So that I can follow agent’s thinking and progress.
Acceptance Criteria:
Given Agent streams message content
When Bridge sends { type: 'message.part.streamed', session_id: 'ses_xyz', content: 'I will implement authentication using JWT...' }
Then Gateway updates Slack message with accumulated content
And Message shows “🤔 Agent is thinking…”
Given Message streaming completes
When Bridge sends { type: 'message.completed', session_id: 'ses_xyz', message_id: 'msg_456' }
Then Gateway posts final message with full content
And “Thinking…” indicator is removed
Given Output exceeds Slack message limit (3000 chars)
When Content accumulates beyond limit
Then Gateway posts multiple messages in sequence
And Messages are numbered “(1/3), (2/3), (3/3)”
Given Streaming is interrupted (connection lost)
When Reconnection occurs
Then Gateway resumes from last known position
And No duplicate content is posted
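The splitting and numbering rule can be illustrated with a short Python sketch. The function name `split_for_slack` is hypothetical, and the sketch assumes the 3000-character limit applies to the content of each chunk, not to the numbering prefix; a real implementation would likely also prefer to break on line boundaries.

```python
from typing import List

SLACK_CHUNK_LIMIT = 3000  # limit quoted in the acceptance criterion


def split_for_slack(text: str, limit: int = SLACK_CHUNK_LIMIT) -> List[str]:
    """Split text into numbered chunks whose content is at most `limit` characters."""
    if len(text) <= limit:
        return [text]  # short output is posted as-is, without numbering
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)]
    total = len(chunks)
    return [f"({n}/{total}) {chunk}" for n, chunk in enumerate(chunks, start=1)]
```

For example, 7000 characters of output would produce three messages prefixed “(1/3)”, “(2/3)” and “(3/3)”.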
Story 2.10: Tool Execution Visibility
As a developer, I want tool executions reported to Slack, So that I know what files agent is modifying.
Acceptance Criteria:
Given Agent writes file
When Bridge sends { type: 'tool.executed', tool: 'file_write', file: 'src/auth.ts', status: 'success' }
Then Gateway posts: “🔧 Wrote file: src/auth.ts”
Given Agent runs tests
When Tool execution event arrives for bash with command npm test
Then Gateway posts: “🧪 Running tests: npm test”
Given Agent reads files
When Multiple file_read events occur rapidly
Then Gateway batches into summary: “📖 Read 5 files”
Given Tool execution fails
When Status is ‘error’
Then Gateway posts: “❌ Tool failed: file_write - Permission denied”
And Error details included
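The event-to-message mapping above might look like the following Python sketch. The function names are hypothetical, and the `command` and `error` fields are assumptions: the criteria only show `type`, `tool`, `file` and `status` on the event.

```python
from typing import Dict, List


def format_tool_event(event: Dict[str, str]) -> str:
    """Render one tool.executed event as a Slack line (field names partly assumed)."""
    tool = event.get("tool", "unknown")
    if event.get("status") == "error":
        # `error` is an assumed field carrying the failure reason
        return f"❌ Tool failed: {tool} - {event.get('error', 'unknown error')}"
    if tool == "file_write":
        return f"🔧 Wrote file: {event['file']}"
    if tool == "bash" and event.get("command", "").startswith("npm test"):
        # `command` is an assumed field holding the shell command
        return f"🧪 Running tests: {event['command']}"
    return f"🔧 Ran tool: {tool}"


def summarize_reads(events: List[Dict[str, str]]) -> str:
    """Collapse a burst of file_read events into a single summary line."""
    count = sum(1 for e in events if e.get("tool") == "file_read")
    return f"📖 Read {count} files"
```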
Story 2.11: Multi-Project Concurrent Sessions
As a developer, I want to run multiple projects in parallel, So that I can context-switch between different work streams.
Acceptance Criteria:
Given Session 1 exists for “domain-apis” project
When User starts session 2 for “market-making” project
Then Both sessions run in isolated workspaces
And Sessions have different builder names and workspace directories
Given Multiple sessions are active
When Events arrive for different session IDs
Then Each event routes to correct Slack thread
And No cross-contamination occurs
Given User views session list in Slack
When Command /opencode list is issued (future feature placeholder)
Then All active sessions are displayed with status
And User can identify which sessions are active
Given Sessions exceed limit (10 concurrent - NFR-4)
When User attempts 11th session
Then Error message: “Maximum concurrent sessions reached (10)”
And User prompted to complete existing session first
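Per-session routing and the concurrency cap can be sketched together as a small registry. This is an illustrative Python sketch only: `SessionRegistry`, `SessionLimitError` and their methods are hypothetical names, and state is kept in memory, not in .session-state.yaml.

```python
from typing import Dict

MAX_CONCURRENT_SESSIONS = 10  # NFR-4 limit quoted in the criteria


class SessionLimitError(Exception):
    """Raised when a new session would exceed the concurrency limit."""


class SessionRegistry:
    """Hypothetical in-memory map of session_id -> Slack thread_ts."""

    def __init__(self, limit: int = MAX_CONCURRENT_SESSIONS) -> None:
        self.limit = limit
        self._threads: Dict[str, str] = {}

    def start(self, session_id: str, thread_ts: str) -> None:
        if len(self._threads) >= self.limit:
            raise SessionLimitError(
                f"Maximum concurrent sessions reached ({self.limit})"
            )
        self._threads[session_id] = thread_ts

    def thread_for(self, session_id: str) -> str:
        """Look up the thread for an event so it never lands in another session's thread."""
        return self._threads[session_id]

    def complete(self, session_id: str) -> None:
        """Free a slot when the user completes a session."""
        self._threads.pop(session_id, None)
```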
Story 2.12: UI-Initiated Session Configuration Collection
As a developer, I want sessions created in OpenCode UI to collect configuration lazily based on actual needs, So that I can start exploratory sessions with zero friction and only provide builder/Slack config when required.
Acceptance Criteria:
Given User creates session in OpenCode UI with title “Add rate limiting”
When Session is created
Then Session is registered with minimal state (id, title, directory)
And No builder or Slack configuration is collected yet
And Session type is “exploratory”
Given Exploratory session is active
When User asks questions and agent reads code
Then Work proceeds without any configuration prompts
And Agent uses Read, Grep, Glob tools freely
Given User attempts first write operation (Edit, Write, Bash with file modification)
When Plugin intercepts write tool execution
Then Execution is paused
And User is prompted for builder configuration
Given User is in OpenCode web UI when write is attempted
When Builder config prompt is needed
Then OpenCode modal appears with fields:
- Project: [Dropdown of projects from .ai/projectlist.md or “Create New”]
- Workspace Name: [Auto-filled: {project-id}-{task-slug}]
- Repositories: [Multi-select, pre-selected based on detected repos]
- Category: [Select, inferred from project]
And Optional section: “Setup Slack notifications now?” (unchecked by default)
Given User is in Slack when write is attempted
When Builder config prompt is needed
Then Slack form appears in existing thread or DM with same fields
And Optional section: “Setup Slack notifications?” (checked by default)
Given Builder config is provided
When User submits form
Then Gateway invokes task builder:init BUILDER_NAME={workspace_name} REPOS={repos}
And Builder workspace is created at .builders/{workspace_name}/
And Session is moved to builder workspace
And Session type changes to “work”
And Write operation proceeds
Given Builder config form includes optional Slack section
When User enables Slack notifications and submits
Then Both builder and Slack configs are saved
And Slack thread is created in specified channel
And Future notifications route to Slack
Given User is in OpenCode web UI working
When User goes offline (presence detection: no web activity >10min)
Then Next notification triggers Slack config collection
And Slack form appears: “Where should I post updates for ‘{session.title}’?”
And Fields: Channel (inferred), Priority (medium default)
Given Slack config is provided via form
When User submits
Then Slack thread is created in chosen channel
And Pending notification is posted to thread
And Mapping session_id → thread_ts is saved to .session-state.yaml
Given User creates session, goes offline, then starts building
When Both configs are eventually needed
Then Slack config collected first (when going offline)
And Builder config collected second (when attempting write in Slack)
And Both configs can be collected in either order
Given Project inference runs on session title “Add auth to OpenCode Slack”
When System matches against .ai/projectlist.md
Then Project 0012 (OpenCode Slack Integration) is suggested
And Workspace name defaults to “0012-add-auth-to-opencode-slack”
And Repositories default to [“ai-dev”] (from project metadata)
And Category defaults to “ai-dev”
Given Project inference cannot determine project with confidence
When Multiple projects match or none match
Then Dropdown shows all active projects
And “Create New Project” option is available
And If selected, next available project ID (e.g., 0014) is assigned
Given Session with builder config exists
When Gateway restarts
Then .session-state.yaml is loaded from PVC
And Builder config (project_id, workspace_name, repos, category) is restored
And Session can resume work without re-prompting
Given Session with Slack config exists
When Gateway restarts
Then Slack thread mapping is loaded from state file
And Future notifications route to correct thread
And No duplicate threads are created
Given User opts to provide both configs at once
When Builder config form shows optional Slack section
Then User can check “Setup Slack notifications now”
And Single form submission provides both configs
And No second prompt occurs later
Implementation Notes:
- Builder config collection: SessionConfigManager.ensure_builder_config()
- Slack config collection: SessionConfigManager.ensure_slack_config()
- Project ID inference: Match session title/repos/category to .ai/projectlist.md
- Workspace naming: .builders/{project-id}-{slugified-title}/
- State persistence: Both configs saved to .session-state.yaml on PVC
- Presence detection: Track last_web_activity to determine if user is in OpenCode UI
- Write detection: Plugin hook on tool.beforeExecute for write operations
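The workspace naming rule can be illustrated with a short Python sketch. The helper names `slugify` and `workspace_name` are hypothetical, and the slug rule (lowercase, with each run of non-alphanumeric characters collapsed to a single hyphen) is an assumption chosen to match the “0012-add-auth-to-opencode-slack” example in the acceptance criteria.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def workspace_name(project_id: str, title: str) -> str:
    """Build the {project-id}-{slugified-title} workspace name."""
    return f"{project_id}-{slugify(title)}"


def workspace_path(project_id: str, title: str) -> str:
    """Directory under .builders/ where the workspace would be created."""
    return f".builders/{workspace_name(project_id, title)}/"
```

With project 0012 and the title “Add auth to OpenCode Slack”, `workspace_name` yields 0012-add-auth-to-opencode-slack.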
Test Scenarios:
- Exploratory session (no config): User asks “How does auth work?” - no prompts
- Build in UI: User attempts edit - OpenCode modal appears - builder initialized
- Go offline then notify: User leaves - agent has question - Slack form appears - thread created
- Build while offline: User in Slack asks to “Add feature” - Slack form for builder config - workspace created
- Bundle both configs: User checks optional Slack section in builder form - both configs saved - no second prompt
- Project inference: Session title matches project 0012 - workspace name auto-filled “0012-add-notifications”
- State recovery: Gateway restarts - session state loaded - configs restored - work resumes
Epic 3: Lab Deployment & Testing
Goal: Automatically deploy work-in-progress code to isolated Kubernetes lab namespaces and execute automated tests.
Story 3.1: Namespace Creation for Builder
As a developer, I want dedicated K8s namespace created for my work, So that deployment is isolated from other projects.
Acceptance Criteria:
Given Agent completes implementation
When User clicks “Deploy to Lab” button in Slack
Then Gateway creates namespace {builder-name}-lab
And Namespace is labeled with builder name and project
Given Namespace already exists
When Deploy is triggered
Then Gateway reuses existing namespace
And Previous resources are cleaned up first
Given Namespace creation fails (RBAC)
When Gateway lacks permissions
Then Error is posted to Slack with RBAC details
And Deployment is aborted
Given Namespace is created
When Deployment completes or fails
Then Namespace remains active for testing
And User manually cleans up with /opencode cleanup
Story 3.2: ConfigMap/PVC Deployment Strategy
As a developer, I want code deployed via ConfigMaps and PVCs, So that deployment is fast without Docker image builds.
Acceptance Criteria:
Given Small files (<1MB) exist in workspace
When Gateway creates ConfigMap
Then ConfigMap contains file contents as data entries
And ConfigMap is named {builder-name}-config
Given Large files (>1MB) exist
When Gateway prepares deployment
Then Files are written to PVC
And Deployment mounts PVC for file access
Given ConfigMap deployment is created
When Pod starts
Then ConfigMap data is mounted at /app/config
And Application can read files
Given Code changes occur
When Re-deployment is triggered
Then ConfigMap is updated
And Pods are restarted to pick up changes
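The size-based split between ConfigMap and PVC can be sketched as follows. This is a minimal Python sketch under stated assumptions: `partition_files` is a hypothetical helper, file sizes are passed in directly instead of being read from the workspace, and files of exactly 1MB (a case the criteria leave open) are sent to the PVC. Kubernetes also caps the total data in one ConfigMap at 1 MiB, so a real implementation would need to check the combined size as well.

```python
from typing import Dict, List, Tuple

CONFIGMAP_MAX_BYTES = 1024 * 1024  # 1MB threshold from the criteria


def partition_files(
    sizes: Dict[str, int], threshold: int = CONFIGMAP_MAX_BYTES
) -> Tuple[List[str], List[str]]:
    """Return (configmap_files, pvc_files) for a mapping of path -> size in bytes."""
    configmap = sorted(path for path, size in sizes.items() if size < threshold)
    # Assumption: files at or above the threshold go to the PVC
    pvc = sorted(path for path, size in sizes.items() if size >= threshold)
    return configmap, pvc
```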
Story 3.3: Ingress Creation with Cert
As a developer, I want ingress created at predictable URL, So that I can access deployed service from browser/Postman.
Acceptance Criteria:
Given Service is deployed in namespace domain-apis-auth-lab
When Gateway creates ingress
Then Ingress host is domain-apis-auth.lab.ctoaas.co
And Ingress routes to service port
Given Ingress is created
When Cert-manager processes ingress
Then TLS certificate is issued within 2 minutes
And HTTPS is available
Given Deployment completes
When Gateway posts lab URL to Slack
Then URL is https://domain-apis-auth.lab.ctoaas.co
And URL is clickable in Slack
Given Ingress creation fails (DNS)
When Gateway detects failure
Then Error is posted to Slack with details
And User can retry deployment
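The naming convention that links builder name, namespace and lab URL can be written down as a short Python sketch. The function names are hypothetical; the “-lab” suffix and the lab.ctoaas.co domain come from the acceptance criteria in Stories 3.1 and 3.3.

```python
LAB_SUFFIX = "-lab"
LAB_DOMAIN = "lab.ctoaas.co"


def lab_namespace(builder_name: str) -> str:
    """Namespace for a builder: {builder-name}-lab."""
    return f"{builder_name}{LAB_SUFFIX}"


def lab_host(namespace: str) -> str:
    """Ingress host derived by dropping the -lab suffix from the namespace."""
    if not namespace.endswith(LAB_SUFFIX):
        raise ValueError(f"not a lab namespace: {namespace}")
    return f"{namespace[: -len(LAB_SUFFIX)]}.{LAB_DOMAIN}"


def lab_url(namespace: str) -> str:
    """HTTPS URL posted to Slack once the deployment is reachable."""
    return f"https://{lab_host(namespace)}"
```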
Story 3.4: Pod Readiness Waiting
As a developer, I want deployment to wait for pods to be ready, So that I don’t get lab URL before service is actually running.
Acceptance Criteria:
Given Deployment is applied
When Pods are starting
Then Gateway polls pod status every 5 seconds
And Slack shows “⏳ Waiting for pods to be ready…”
Given Pods become ready
When All pods show status Running with readiness probe passing
Then Gateway posts “✅ Deployment ready”
And Lab URL is posted to thread
Given Pods fail to become ready (CrashLoopBackOff)
When 3 minutes elapse without ready state
Then Gateway posts error: “❌ Deployment failed - pods not ready”
And Pod logs are attached to Slack message
Given Deployment times out (>5 minutes)
When Timeout is reached
Then Deployment is marked failed
And User is prompted to check logs
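The polling behaviour can be sketched as a generic wait loop. This Python sketch is illustrative only: `wait_for_ready` and its parameters are hypothetical, and the readiness check is injected as a callable so that no particular Kubernetes client API is assumed. The 5-second interval and 5-minute timeout are the values from the criteria.

```python
import time
from typing import Callable

POLL_INTERVAL_SECONDS = 5.0   # polling interval from the criteria
DEPLOY_TIMEOUT_SECONDS = 300.0  # overall 5-minute timeout from the criteria


def wait_for_ready(
    is_ready: Callable[[], bool],
    timeout: float = DEPLOY_TIMEOUT_SECONDS,
    interval: float = POLL_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True   # caller posts the ready message and lab URL
        if clock() >= deadline:
            return False  # caller marks the deployment failed
        sleep(interval)
```

Passing `sleep` and `clock` as parameters is a design choice that keeps the loop testable without real waiting.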
Story 3.5: Unit Test Execution
As a developer, I want unit tests run in OpenCode session, So that I know tests pass before deploying to lab.
Acceptance Criteria:
Given Agent completes implementation
When Tests are run via npm test or pytest
Then Test output is captured
And Results are parsed for pass/fail status
Given Tests pass (exit code 0)
When Gateway formats results
Then Slack shows “✅ Unit Tests: 45/45 passing”
And Test summary includes duration
Given Tests fail (exit code 1)
When Gateway formats results
Then Slack shows “❌ Unit Tests: 42/45 passing (3 failures)”
And Failed test names are listed
Given Tests cannot run (missing dependencies)
When Test command fails
Then Error is posted: “⚠️ Tests skipped - dependencies missing”
And Deployment proceeds with warning
Story 3.6: Integration Test Execution Against Lab
As a developer, I want integration tests run against deployed lab URL, So that I validate end-to-end flows before promoting to production.
Acceptance Criteria:
Given Service is deployed at https://domain-apis-auth.lab.ctoaas.co
When Integration tests run with TEST_URL=https://domain-apis-auth.lab.ctoaas.co npm run test:integration
Then Tests execute against live deployment
And Results are captured
Given Integration tests pass
When Gateway formats results
Then Slack shows “✅ Integration Tests: 12/12 passing”
Given Integration tests fail
When Gateway formats results
Then Slack shows “❌ Integration Tests: 10/12 passing (2 failures)”
And Failed test details are included
Given Lab URL is not reachable
When Integration tests attempt connection
Then Tests fail with connection error
And Gateway posts: “❌ Lab deployment not reachable”
Story 3.7: Test Results Summary in Slack
As a developer, I want comprehensive test summary posted to Slack, So that I can quickly assess quality before promoting.
Acceptance Criteria:
Given All tests complete (unit + integration)
When Gateway compiles results
Then Slack message includes:
- ✅ Unit Tests: 45/45 passing
- ✅ Integration Tests: 12/12 passing
- 🎉 All tests passing
Given Some tests fail
When Summary is posted
Then Failed test names are listed
And Overall status shows “⚠️ Some tests failing”
Given Tests are skipped (not applicable)
When Summary is posted
Then Status shows “⏭️ Tests skipped”
And Reason is included (e.g., “no test files found”)
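The summary format can be sketched in Python as follows. `SuiteResult`, `format_suite` and `format_summary` are hypothetical names, and the sketch covers only the pass/fail counts and overall status line; listing failed test names and skip reasons is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SuiteResult:
    """Hypothetical container for one test suite's counts."""

    name: str
    passed: int
    total: int

    @property
    def ok(self) -> bool:
        return self.passed == self.total


def format_suite(result: SuiteResult) -> str:
    """One line per suite, matching the message shapes in the criteria."""
    if result.ok:
        return f"✅ {result.name}: {result.passed}/{result.total} passing"
    failures = result.total - result.passed
    return f"❌ {result.name}: {result.passed}/{result.total} passing ({failures} failures)"


def format_summary(results: List[SuiteResult]) -> str:
    """Combine suite lines with an overall status line."""
    if not results:
        return "⏭️ Tests skipped"
    lines = [format_suite(r) for r in results]
    all_ok = all(r.ok for r in results)
    lines.append("🎉 All tests passing" if all_ok else "⚠️ Some tests failing")
    return "\n".join(lines)
```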
Epic 4: Production Optimization
Goal: Harden system for production reliability, performance, and security beyond MVP baseline.
Story 4.1: Response Time Monitoring
As an operator, I want response time metrics collected and visualized, So that I can ensure SLA compliance (<3s Slack ack, <30s deploy, <1s question forward).
Acceptance Criteria:
Given Slack command is received
When Gateway processes command
Then Duration is recorded to Prometheus metric slack_command_duration_seconds
And Metric is labeled with command type
Given Deployment is triggered
When Deployment completes
Then Duration is recorded to deployment_duration_seconds
And Metric includes namespace label
Given Permission question is forwarded
When Question reaches Slack
Then Latency is recorded to permission_forward_latency_seconds
Given Grafana dashboard is loaded
When Operator views dashboard
Then p50, p95, p99 latencies are visible
And SLA violations are highlighted
Story 4.2: Webhook Retry Logic
As an operator, I want Slack webhook failures retried automatically, So that transient network issues don’t lose events.
Acceptance Criteria:
Given Slack API call fails (503 Service Unavailable)
When Gateway detects failure
Then Request is retried with exponential backoff (1s, 2s, 4s, doubling each attempt)
And Maximum 5 retries are attempted
Given Retry succeeds on attempt 3
When Request completes
Then Success is logged
And Retry count is recorded to metrics
Given All retries fail
When Maximum retries reached
Then Event is logged to dead letter queue
And Alert is sent to operator
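The retry policy can be sketched as a small wrapper. This Python sketch is an illustration under assumptions: `call_with_retry` and `TransientError` are hypothetical names, the delay doubles from 1 second on each retry, and the caller is expected to handle the final exception by writing to the dead letter queue and alerting.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as HTTP 503."""


def call_with_retry(
    send: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call send(); on TransientError retry with delays of 1s, 2s, 4s, ...

    Returns (result, retries_used). Re-raises the last error once
    max_retries retries have been exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(), attempt
        except TransientError:
            if attempt == max_retries:
                raise  # caller logs to dead letter queue and alerts
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The returned retry count is what would feed the retry metric mentioned in the criteria.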
Story 4.3: Security Hardening (Webhook Signature Verification)
As an operator, I want Slack webhook signatures verified, So that malicious requests are rejected.
Acceptance Criteria:
Given Slack webhook is received
When Gateway validates signature using SLACK_SIGNING_SECRET
Then Request is accepted only if signature matches
And Invalid signatures return HTTP 401
Given Signature validation fails
When Request is rejected
Then Rejection is logged with source IP
And Metric webhook_rejections_total increments
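Signature verification can be sketched with Python's standard library. The sketch follows Slack's documented v0 signing scheme (HMAC-SHA256 over "v0:{timestamp}:{body}", compared against the X-Slack-Signature header); the function names and the way the Gateway passes in the header values are assumptions, and the five-minute replay window is Slack's recommendation, not something stated in this story.

```python
import hashlib
import hmac

MAX_SKEW_SECONDS = 60 * 5  # replay window recommended by Slack


def compute_signature(signing_secret: str, timestamp: str, body: bytes) -> str:
    """Compute the v0 signature for a raw request body."""
    basestring = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), basestring, hashlib.sha256).hexdigest()
    return f"v0={digest}"


def is_valid_request(
    signing_secret: str, timestamp: str, body: bytes, signature: str, now: float
) -> bool:
    """Return True only for a fresh request whose signature matches."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False  # stale or future-dated request, possible replay
    expected = compute_signature(signing_secret, timestamp, body)
    # Constant-time comparison avoids leaking the signature through timing
    return hmac.compare_digest(expected, signature)
```

A handler would call `is_valid_request` with the raw body bytes before parsing, and respond with HTTP 401 when it returns False.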
Story 4.4: Comprehensive Error Recovery
As a developer, I want deployment failures clearly communicated with recovery options, So that I know how to fix issues and retry.
Acceptance Criteria:
Given Deployment fails (pod crash)
When Gateway detects failure
Then Slack message includes:
- ❌ Error description
- 📋 Pod logs (last 50 lines)
- 🔁 Retry button
- 🛠️ Debug instructions
Given User clicks retry button
When Retry is triggered
Then Deployment is attempted again
And Previous failed resources are cleaned up first