Product Requirements Document - Grid Exit Strategy - Phases 2-5

Author: Craig
Date: 2026-02-01

Success Criteria

User Success (You as Grid Trader)

Decision Confidence:

  • You can articulate WHY you entered and exited every position using audit trail evidence
  • You have at least 30 minutes of warning between exit state transitions in 90%+ of cases
  • You can answer “why didn’t you exit here?” for any historical moment using immutable decision records

Capital Protection:

  • Zero stop-loss breaches during normal market conditions (excluding “world-defining moments”)
  • System provides the WARNING state 1-2 hours (minimum 1 hour) before LATEST_ACCEPTABLE_EXIT
  • At least 2 hours between LATEST_ACCEPTABLE_EXIT and MANDATORY_EXIT states
  • No catastrophic exits (defined as: hitting exchange stop-loss instead of graceful exit)

Operational Clarity:

  • Exit state transitions are clear and actionable (you know what WARNING/LATEST_ACCEPTABLE/MANDATORY mean in real-time)
  • System evaluates regime hourly with consistent decision logic
  • Restart gates prevent premature re-entry after trend stops

Business Success (Capital Scaling & Investor Readiness)

Capital Scaling Milestone:

  • Double capital stake from £1K to £2K within Phase 2-5 validation period (2-4 weeks live operation)
  • System proven ready to support £10K capital allocation (risk calculations, position sizing, audit trails all scale)

Investor Credibility:

  • Complete immutable audit trail in Git showing every decision with timestamps
  • Backtesting results demonstrate exit strategy would have prevented historical drawdowns
  • Ability to generate “decision quality” reports showing regime classification accuracy vs outcomes
  • Clean separation of “recommendation quality” (was regime correct?) vs “action quality” (did I follow the recommendation?)

Exit Quality Metrics:

  • KPI framework operational and tracking exit quality (how early did we exit vs when we should have?)
  • Historical analysis showing system identified trend breakouts before significant capital loss
  • Documented evidence of false-positive rate (stopped grids that stayed range-bound)

Technical Success

Phase 2 Complete:

  • Three-gate restart logic implemented and tested (Directional Energy Decay → Mean Reversion Return → Tradable Volatility)
  • Exit state transitions functional (WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT)
  • State transition tracking in decision records
  • Historical data loading supports gate evaluation

Phase 3 Complete:

  • KuCoin position tracker integrated and returning accurate position data
  • Capital risk calculator quantifying exposure in real-time
  • Enhanced notifications include risk metrics (current exposure, distance to stop-loss, time in exit state)

Phase 4 Complete:

  • 100% test coverage for new exit logic (matching Phase 1 quality: 60+ tests, all passing)
  • Backtesting framework operational and validated against 3-6 months of historical data
  • CI/CD integration preventing regression
  • Documented test scenarios covering edge cases (volatility spikes, gap moves, data failures)

Phase 5 Complete:

  • Hourly evaluation cadence operational with monitoring
  • Audit logging captures all state transitions with context (regime metrics, confidence scores, gate status)
  • KPI tracking framework operational
  • Documentation complete for investor presentation

Measurable Outcomes

Completion Criteria (Phases 2-5 “Done”):

  • ✅ All code implemented with 100% test pass rate
  • ✅ Backtested against historical trend breakouts (3-6 months data)
  • ✅ Validated with £1K live capital for 2-4 weeks
  • ✅ Capital doubled to £2K during validation period
  • ✅ Zero stop-loss breaches during validation period (excluding black swan events)
  • ✅ Audit trail complete and investor-ready
  • ✅ System ready to scale to £10K capital allocation

3-Month Success (Post Phase 2-5):

  • Operating at £10K capital with same exit quality metrics
  • Clean track record of exit decisions with measurable outcomes
  • Investor presentation materials complete with backtesting evidence

12-Month Vision:

  • £100K+ capital with external investment
  • Exit strategy proven across multiple market regimes
  • Published track record of regime classification accuracy
  • Multi-symbol support (beyond single grid)

Product Scope

MVP - Minimum Viable Product (Phases 2-5)

Core Exit Strategy:

  • Exit state machine (WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT)
  • Three-gate restart logic preventing premature grid restart
  • Position risk quantification from KuCoin API
  • Enhanced notifications with risk context
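The exit state machine named above can be sketched as a small transition table. This is an illustrative sketch, not the system's actual implementation: the `ExitState`/`can_transition` names and the de-escalation policy (states can drop back to NORMAL if the regime recovers) are assumptions; the PRD itself only specifies the escalation path.

```python
from enum import Enum

class ExitState(Enum):
    """Exit urgency states, ordered by severity (names from this PRD)."""
    NORMAL = 0
    WARNING = 1
    LATEST_ACCEPTABLE_EXIT = 2
    MANDATORY_EXIT = 3

# Allowed transitions: escalate one step at a time, or de-escalate back
# to NORMAL when the regime recovers (the de-escalation rule is an
# assumption; the PRD only describes escalation).
ALLOWED = {
    ExitState.NORMAL: {ExitState.WARNING},
    ExitState.WARNING: {ExitState.LATEST_ACCEPTABLE_EXIT, ExitState.NORMAL},
    ExitState.LATEST_ACCEPTABLE_EXIT: {ExitState.MANDATORY_EXIT, ExitState.NORMAL},
    ExitState.MANDATORY_EXIT: set(),
}

def can_transition(current: ExitState, target: ExitState) -> bool:
    """Return True if the state machine permits current -> target."""
    return target in ALLOWED[current]
```

Encoding the transitions as data (rather than scattered `if` statements) makes the escalation path auditable, which matches the decision-record requirements elsewhere in this document.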

Quality & Validation:

  • Comprehensive test coverage (60+ tests, 100% pass)
  • Backtesting framework with 3-6 months historical validation
  • CI/CD integration

Operational Foundation:

  • Hourly evaluation cadence with monitoring
  • Complete audit logging in Git
  • KPI tracking framework
  • Static HTML dashboards with Chart.js

Explicitly Out of Scope for MVP:

  • Multi-symbol concurrent grids (single ETH-USDT only)
  • Automated grid creation (human approval required)
  • 15-minute evaluation cadence (hourly sufficient based on research)
  • Advanced real-time dashboards
  • Performance optimization beyond functional requirements

Post-MVP Growth Path

Detailed roadmap documented in Project Scoping section below, including:

  • Phase 6 (3-month): Capital scaling to £10K with proven system
  • Phase 7 (6-month): Investor preparation and multi-symbol validation
  • Phase 8 (12-month+): Enhanced automation, ML refinements, multi-exchange support

User Journeys

Journey 1: Craig - Active Grid Trader (Exit Protection)

Situation: It’s Tuesday morning, 9:15 AM. Craig has an active ETH-USDT grid running with £1,200 capital deployed. The grid has been harvesting profitable oscillations for 3 days in a clean range around the 3,200 level. His phone buzzes with a Pushover notification: “⚠️ WARNING - ETH regime transitioning. Confidence 0.68 → 0.54. Review recommended.”

Opening Scene - Warning Detection:

Craig opens the notification link on his phone. The decision interface shows:

  • Current regime: TRANSITION (was RANGE_OK 15 minutes ago)
  • Confidence: 0.54 (dropped from 0.68)
  • Exit state: WARNING
  • Key metrics: ADX rising (25 → 32), Bollinger Bandwidth expanding (0.034 → 0.041)
  • Gate status: Gate 1 (Directional Energy Decay) FAILING - TrendScore crossed 35 threshold
  • Time in WARNING: 15 minutes
  • Estimated time to LATEST_ACCEPTABLE_EXIT: 1-2 hours

Craig thinks: “This is exactly what Phase 1 was built for - early warning before things get ugly.”

Rising Action - Monitoring Escalation:

45 minutes later, another notification: “🔶 LATEST_ACCEPTABLE_EXIT - ETH trend strengthening. ADX 38, efficiency ratio 0.72. Exit recommended within 2 hours.”

Craig checks the decision record:

  • Regime: TRANSITION → TREND (confirmed for 3/5 bars)
  • ADX: 38 and rising
  • Efficiency Ratio: 0.72 (directional persistence strong)
  • Exit state: LATEST_ACCEPTABLE_EXIT
  • Current position: Grid is net long 0.8 ETH (market moving up, sold into strength)
  • Distance to stop-loss: $280 (still 8.7% buffer)
  • Audit trail shows: “WARNING triggered at 09:15, LATEST_ACCEPTABLE_EXIT at 10:00”

Craig has a decision to make: Exit now with graceful unwinding, or wait and risk MANDATORY_EXIT?

Climax - Decisive Action:

Craig decides to exit. He manually stops the grid in KuCoin (3 clicks: Stop Grid → Keep Assets → Confirm). Within 2 minutes, the grid is stopped. Current PnL: +£47 profit on this grid session.

He updates the decision record via the web interface:

  • Action taken: STOP_GRID
  • Reason: “Trend confirmed, ADX rising through 35, efficiency ratio shows directional persistence”
  • Outcome: Graceful exit with profit intact

The system records:

  • Exit state progression: WARNING (09:15) → LATEST_ACCEPTABLE_EXIT (10:00) → USER_STOPPED (10:45)
  • Total warning time: 90 minutes
  • Stop-loss distance at exit: 8.7% (never threatened)
  • Grid cooldown: 60 minutes before restart eligibility

Resolution - Post-Exit Validation:

Two hours later, ETH has moved to 3,280. The system records this as a “catastrophic exit avoided.”

24 hours later, the system performs automatic evaluation:

  • Regime classification: CORRECT (remained TREND for 18 hours)
  • Exit timing: OPTIMAL (exited 90 minutes into trend, avoided 8% adverse move)
  • Warning lead time: 90 minutes (met success criteria: 30+ min warning)
  • KPI recorded: Exit quality score 9/10 (early exit, preserved capital, clean audit trail)

Craig’s new reality:

  • Capital preserved with profit
  • Clean audit trail showing “system warned → I exited → trend confirmed”
  • Confidence in system’s protective capability
  • Restart gates now active: waiting for directional energy decay before re-entry

Journey 2: Craig - Historical Decision Reviewer (Investor Preparation)

Situation: It’s Friday evening, 3 months into validation. Craig is preparing materials for his personal decision to scale from £1K to £10K. He needs to demonstrate to himself that the exit system actually works before committing serious capital.

Opening Scene:

Craig opens the market-maker-data Git repository containing 3 months of immutable decision records. He runs the analysis script:

task analyze-exit-quality --period 2026-01-01 to 2026-03-31

Rising Action:

The KPI dashboard generates:

  • Total exit events: 12
  • SLAR (Stop-Loss Avoidance Rate): 100% (12/12 exits before stop-loss)
  • PRR (Profit Retention Ratio): 82% average (preserved £394 of £480 potential profit)
  • TTDR (True Transition Detection Rate): 83% (10/12 regime breaks correctly identified)
  • FER (False Exit Rate): 17% (2/12 exits after which the range resumed)
  • Average warning time: 95 minutes (exceeds 30-minute minimum)
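The headline KPIs above can be computed from the recorded exit events. The sketch below shows the arithmetic only; the `ExitEvent` field names are illustrative placeholders, not the system's actual decision-record schema.

```python
from dataclasses import dataclass

@dataclass
class ExitEvent:
    """One recorded exit. Field names are illustrative, not the real schema."""
    hit_stop_loss: bool        # did the exchange stop-loss fire instead of a graceful exit?
    range_resumed: bool        # did the range resume after exiting (a false exit)?
    profit_preserved: float    # GBP retained at exit
    profit_potential: float    # GBP that was available at the regime peak

def exit_kpis(events: list[ExitEvent]) -> dict[str, float]:
    """Compute SLAR, FER and PRR over a period's exit events."""
    n = len(events)
    graceful = sum(1 for e in events if not e.hit_stop_loss)
    false_exits = sum(1 for e in events if e.range_resumed)
    preserved = sum(e.profit_preserved for e in events)
    potential = sum(e.profit_potential for e in events)
    return {
        "SLAR": graceful / n,          # Stop-Loss Avoidance Rate
        "FER": false_exits / n,        # False Exit Rate
        "PRR": preserved / potential,  # Profit Retention Ratio
    }
```

A report such as “SLAR 100% (12/12), PRR 82%” is then a direct ratio over the immutable Git records, which keeps “recommendation quality” measurable independently of the narrative.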

Climax:

Craig reviews the 2 false positives in detail:

  • Exit #3 (Feb 12): WARNING → stopped grid → range resumed after 6 hours. Lost £18 in potential profit but preserved £67 existing profit. Restart gates prevented immediate re-entry; missed 2 days of ranging.
  • Exit #7 (Mar 5): Similar pattern - cautious exit, range continued.

The audit trail shows his reasoning at the time: “ADX rising, efficiency ratio climbing, better safe than sorry.” Looking back, the system was correctly identifying volatility expansion, even though the regime ultimately held.

Resolution:

Craig’s conclusion: “2 false exits cost me £45 in missed profit. But the 10 true exits saved me from an estimated £620 in stop-loss hits. Net benefit: £575. More importantly - I can articulate WHY every decision was made, and the false positives were defensible given the data available.”

He updates his personal scaling decision document: “Exit system validated. Ready for £10K capital.”


Journey 3: External Investor - Track Record Evaluation

Situation: 18 months later. Craig is meeting with Sarah, an angel investor considering deploying £100K into his systematic grid trading fund. She’s reviewing his track record before committing capital.

Opening Scene:

Sarah receives access to Craig’s investor presentation repository. She’s evaluating whether this is “real systematic trading” or “lucky gambling with post-hoc justification.”

Rising Action:

Sarah reviews the evidence:

  1. Immutable Decision Records (Git):

    • Every recommendation timestamped and committed before action
    • No retroactive editing (Git history proves it)
    • Clear separation: “What did the system recommend?” vs “What did Craig do?”
  2. Exit Quality Metrics (18 months):

    • 87 total exit events
    • SLAR: 97% (3 stop-loss hits during black swan events)
    • PRR: 79% (preserved majority of range-trading profits)
    • Monthly capital growth: 4.2% average (compounded)
  3. Failure Analysis:

    • Craig documents the 3 stop-loss hits:
      • May 2026: Exchange outage prevented manual exit (system correctly identified MANDATORY_EXIT, Craig couldn’t execute)
      • Aug 2026: “Ignored LATEST_ACCEPTABLE_EXIT recommendation - my mistake, learned lesson”
      • Nov 2026: Flash crash exceeded all historical volatility bounds (unpredictable)

Climax:

Sarah asks the critical question: “How do I know you didn’t just get lucky? What happens when regimes behave differently?”

Craig shows her the backtesting framework:

  • Exit logic backtested against 3 years of historical data
  • Would have avoided 23/27 major drawdown periods
  • The 4 missed signals all occurred in low-liquidity Asian hours (now monitored)

Resolution:

Sarah’s conclusion: “This isn’t perfect, but it’s systematic, transparent, and learns from failures. The audit trail gives me confidence that capital is protected by process, not luck. I’m in.”


Journey 4: System Administrator - Deployment & Monitoring

Situation: Craig needs to deploy the Phase 2 restart gates logic to production after completing testing.

Opening Scene:

Craig (wearing his DevOps hat) reviews the deployment checklist:

  • All tests passing (62 tests, 100% coverage)
  • Backtesting complete
  • Configuration updated with new gate thresholds
  • Docker image built and pushed to registry

Rising Action:

He deploys using the standard workflow:

task deploy-metrics-service --env production
kubectl apply -f k8s/metrics-service/deployment.yaml

The ArgoCD pipeline automatically:

  • Validates configuration schema
  • Runs smoke tests against production API
  • Gradually rolls out new pods (blue-green deployment)
  • Monitors error rates and latency

Climax:

15 minutes after deployment, Craig receives a Slack alert: “Metrics service error rate: 0.2% (Gate evaluation failing for BTC-USDT)”

He checks the logs:

ERROR: Gate 1 evaluation failed - insufficient historical data for OU half-life calculation
Symbol: BTC-USDT, Required: 240 bars, Available: 187 bars

Resolution:

Craig realizes BTC-USDT is a newly added symbol without enough historical data. He updates the configuration to delay gate evaluation until sufficient data is collected:

grids:
  - id: btc-grid-1
    symbol: BTC-USDT
    gate_evaluation_delay: 48h  # Wait for data collection

He redeploys. The error rate returns to 0% and the system is stable.

The incident is logged in the decision record system: “Deployment incident - insufficient data for new symbol gate evaluation. Resolution: delay gate evaluation. Prevention: add data sufficiency check to deployment validation.”


Journey 5: Kubernetes CronJob - Scheduled Evaluation

Situation: The regime evaluation system runs as a Kubernetes CronJob, executing hourly without external orchestration.

Opening Scene:

Every hour, Kubernetes triggers the metrics-service cronjob pod:

# k8s/metrics-service/cronjob.yaml
schedule: "0 * * * *"
command: ["task", "evaluate-regime"]
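For context, a fuller manifest implied by the fragment above might look like the following sketch. The image name, concurrency policy, and resource settings are illustrative assumptions, not the project's actual deployment values:

```yaml
# Hedged sketch of the full CronJob manifest; image name and
# backoff/concurrency settings are illustrative assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: metrics-service-evaluate
spec:
  schedule: "0 * * * *"          # hourly, on the hour
  concurrencyPolicy: Forbid      # never overlap evaluations
  jobTemplate:
    spec:
      backoffLimit: 2            # retry a failed evaluation twice
      template:
        spec:
          restartPolicy: Never   # stateless: each run is independent
          containers:
            - name: metrics-service
              image: registry.example/metrics-service:latest
              command: ["task", "evaluate-regime"]
              env:
                - name: MARKET_MAKER_DATA_REPOSITORY_BASE_PATH
                  value: /data/market-maker-data
```

`concurrencyPolicy: Forbid` matters here: if one evaluation hangs past the hour, overlapping runs could race on the Git repository.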

Rising Action:

The cronjob pod executes:

  1. Reads configuration from environment variables (overriding environment.yaml defaults)
  2. Fetches latest market data from KuCoin API
  3. Calculates all 6 regime metrics (ADX, efficiency ratio, autocorrelation, OU half-life, slope, Bollinger bandwidth)
  4. Evaluates three restart gates (if grid is stopped)
  5. Classifies regime and determines exit state
  6. Creates decision record and commits to Git
  7. Sends notifications via configured channels (Pushover directly, or webhook to n8n if available)

Climax:

The evaluation detects a regime transition:

  • Regime: RANGE_OK → TRANSITION
  • Exit state: NORMAL → WARNING
  • Decision record created: decisions/2026-02-01/dec-eth-091500.yaml

The cronjob attempts to commit to Git repository:

git add decisions/2026-02-01/dec-eth-091500.yaml
git commit -m "[ETH-USDT] WARNING state detected - regime TRANSITION"
git push origin main

Potential Issue:

Git push fails (network timeout). The cronjob implements retry logic:

  • Attempt 1: Failed (timeout)
  • Attempt 2 (30s delay): Failed
  • Attempt 3 (60s delay): Success

Decision record committed. Audit trail intact.
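The retry schedule above (immediate, then 30s, then 60s) can be sketched as a generic backoff helper. The function name and the injectable `sleep` parameter are illustrative choices for testability, not the system's real API:

```python
import time
from typing import Callable

def retry_with_backoff(op: Callable[[], bool],
                       attempts: int = 3,
                       base_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `op` until it succeeds, waiting 30s then 60s between attempts
    (the schedule shown in this journey). `sleep` is injectable so the
    policy can be unit-tested without real delays."""
    for attempt in range(attempts):
        if op():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * (attempt + 1))  # 30s after attempt 1, 60s after attempt 2
    return False
```

The `op` callable would wrap the actual push, e.g. `lambda: subprocess.run(["git", "push", "origin", "main"]).returncode == 0`. On final failure the commit still exists locally, so the audit record is preserved and can be pushed later.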

Resolution:

Notification sent via Pushover API (direct integration, no n8n dependency):

POST https://api.pushover.net/1/messages.json
{
  "token": "...",
  "user": "...",
  "message": "⚠️ WARNING - ETH regime transitioning. Confidence 0.68 → 0.54",
  "priority": 1,
  "url": "https://regime-dashboard/decisions/dec-eth-091500"
}

Craig receives notification on phone. System continues evaluating hourly.
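A direct call like the POST shown above can be built with the standard library alone, with no n8n dependency. This is a sketch: Pushover also accepts form encoding, and the helper name is illustrative.

```python
import json
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"

def build_pushover_request(token: str, user: str, message: str,
                           priority: int, link: str) -> urllib.request.Request:
    """Build the Pushover POST shown above as a JSON body.
    Sending is left to the caller: urllib.request.urlopen(req)."""
    body = json.dumps({"token": token, "user": user, "message": message,
                       "priority": priority, "url": link}).encode()
    return urllib.request.Request(PUSHOVER_URL, data=body,
                                  headers={"Content-Type": "application/json"},
                                  method="POST")
```

Separating request construction from sending keeps the notification payload unit-testable without network access, which fits the stateless-cronjob design.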

Operational Notes:

  • Cronjob pod uses same Taskfile commands available locally: task evaluate-regime
  • Configuration via environment variables: MARKET_MAKER_DATA_REPOSITORY_BASE_PATH=/data/market-maker-data
  • Logs streamed to stdout, captured by Kubernetes logging
  • Pod exits cleanly after each evaluation (stateless execution)
  • Next evaluation triggered by Kubernetes scheduler in 1 hour

Journey Requirements Summary

These five journeys reveal the following capability requirements:

From Journey 1 (Active Trading):

  • Hourly regime evaluation with exit state classification
  • Exit state machine (NORMAL → WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT)
  • Three-gate restart logic
  • Push notifications with context
  • Manual action recording
  • Cooldown enforcement

From Journey 2 (Historical Review):

  • KPI analysis framework (SLAR, PRR, TTDR, FER, ERT)
  • Git-backed immutable decision records
  • Analysis tooling (scripts, dashboards)
  • Time-period filtering
  • False positive/negative identification
  • Audit trail completeness

From Journey 3 (Investor Evaluation):

  • Investor-grade reporting
  • Backtesting framework (3+ years historical data)
  • Failure analysis documentation
  • Separation of recommendation vs action
  • Track record visualization
  • Credibility evidence (immutability, transparency)

From Journey 4 (DevOps):

  • Production deployment workflow
  • Blue-green deployment support
  • Error monitoring and alerting
  • Configuration validation
  • Data sufficiency checks
  • Incident logging
  • Rollback capability

From Journey 5 (Kubernetes CronJob):

  • Kubernetes CronJob deployment support
  • Taskfile-based execution (local simulation possible)
  • Environment variable configuration override system
  • Git commit retry logic with backoff
  • Direct Pushover API integration (no n8n dependency initially)
  • Stateless execution (each run independent)
  • Kubernetes logging integration
  • Graceful error handling and exit codes
  • Configuration validation on startup

Optional n8n Integration (Growth Feature):

  • Webhook endpoint for manual triggering
  • n8n workflow orchestration for advanced notification routing
  • Multi-channel notification distribution (Email, Slack, SMS via n8n)

Domain-Specific Requirements

Project Classification:

  • Domain: Fintech - Algorithmic Trading
  • Complexity: High
  • Context: Brownfield (adding Phases 2-5 exit strategy to existing regime management system)

Compliance & Regulatory

Current Scope (Phases 2-5 - Personal Capital Trading):

  • Personal capital trading (£1K-£10K scale) - no regulatory oversight required
  • Regulatory compliance deferred to post-Phase 5 (external capital threshold)
  • Git commit history provides sufficient audit integrity without independent verification or cryptographic signing
  • Assumption: Personal capital trading does not trigger FCA algorithmic trading requirements (see RAIA log A006)

Out of Scope for MVP:

  • FCA registration or compliance
  • MiFID II algorithmic trading requirements
  • External investor regulatory framework
  • Legal review scheduled before £100K external capital raise

Security Architecture

API Security:

  • KuCoin API keys with IP whitelist required + no-withdrawal permissions enforced
  • Threat model: Prevent unauthorized trading and capital extraction
  • Kubernetes secrets for sensitive configuration (not in code/config files)

Data Protection:

  • Decision records repository: Private Git repository (market-maker-data)
  • Access control: Restricted to operator only during validation phase
  • Data in transit: Pushover notifications encrypted, HTTPS for all API calls
  • GDPR: Personal trading data only (no third-party PII)

Decision Interface:

  • Authentication: Not required for MVP (local K8s cluster + VPN access)
  • Network isolation: Accessible only within VPN perimeter
  • Future enhancement: OAuth2 ingress mechanism available for public exposure post-MVP
  • Hosting: Private Kubernetes cluster (not public-facing)

Technical Constraints

Evaluation Cadence:

  • MVP (Phases 2-5): 1-hour evaluation cycle (schedule: "0 * * * *")
  • Rationale: Research indicates 12-24 hour warning window for regime transitions (see RAIA log A001, A004)
  • Future enhancement: Adaptive cadence (state-based evaluation frequency) if validation shows need for faster response
  • Action: Validate assumption via backtesting in Phase 4 (see RAIA Action 1)

Exchange Integration - KuCoin:

  • Grid management limitation: KuCoin spot grids cannot be managed via API (manual stop/start via UI only)
  • Human-in-loop requirement: System generates recommendations, human executes in KuCoin UI
  • Data dependencies: Market data (OHLCV), account balance, position tracking all via KuCoin API
  • Rate limits: 1-hour evaluation cycle well within KuCoin API rate limits
  • API call volume: Reduced overhead compared to 15-minute cadence

Configuration Management:

  • Schema validation: Configuration validated on startup with retry logic
  • Deployment safety: Invalid configuration keeps previous deployment running (blue-green deployment)
  • Environment overrides: Support environment variable overrides for Kubernetes deployment flexibility
  • Validation checks: Pre-deployment validation catches configuration errors before production rollout

Data Availability:

  • Historical data requirements: Sufficient data needed for gate evaluation (240+ bars for OU half-life)
  • Data sufficiency checks: Validate sufficient data exists before enabling gate evaluation for new symbols
  • Backfill support: Tools to collect historical data for new symbols before production use
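The sufficiency check described above (and the `gate_evaluation_delay` fix from Journey 4) can be sketched as a single eligibility predicate. The 240-bar figure comes from the incident log in this document; the function name and 48-hour default are illustrative:

```python
from datetime import datetime, timedelta

REQUIRED_BARS = 240  # OU half-life requirement quoted in the deployment incident

def gate_evaluation_ready(bars_available: int,
                          symbol_added_at: datetime,
                          now: datetime,
                          evaluation_delay: timedelta = timedelta(hours=48)) -> bool:
    """A symbol is eligible for gate evaluation only when it has enough
    history AND its configured collection delay has elapsed. The 48h
    default mirrors the gate_evaluation_delay used in Journey 4."""
    return (bars_available >= REQUIRED_BARS
            and now - symbol_added_at >= evaluation_delay)
```

Running this predicate as a pre-deployment validation step is exactly the “prevention” noted in the Journey 4 incident record.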

Resilience & Failure Handling

Exchange Outage (Acceptable Risk):

  • Scenario: KuCoin unavailable during MANDATORY_EXIT state
  • Mitigation: Document as known limitation (see RAIA R002)
  • Rationale: Manual execution dependency means system cannot auto-execute anyway
  • Monitoring: Track exchange availability incidents for future multi-exchange planning (see RAIA Action 2)

Market Data Feed Failure (Retry with Backoff):

  • Scenario: KuCoin market data API fails during evaluation cycle
  • Mitigation: Retry 2-3 times with exponential backoff before declaring failure
  • Failure handling: Log error, skip current cycle, attempt next cycle in 1 hour
  • Alert threshold: After N consecutive failures, send “DATA UNAVAILABLE - MANUAL MONITORING REQUIRED” alert
  • Rationale: Transient API issues shouldn’t trigger false alarms, but prolonged outage needs operator awareness

Git Commit Failure (Retry, Then Log and Continue):

  • Scenario: Decision record created but Git push fails
  • Mitigation: Retry the push with backoff (as in Journey 5); if all attempts fail, log locally and continue operation (see RAIA R003)
  • Rationale: Notification is still delivered (Pushover) so the operator can act; a temporary audit gap is non-critical for the validation phase
  • Future enhancement: Persistent retry queue for commits that fail all attempts (post-Phase 5)

Configuration Errors (Validation with Rollback):

  • Scenario: Invalid configuration deployed to production
  • Mitigation:
    • Pre-deployment: Schema validation in deployment pipeline
    • Startup validation: Validate configuration on pod startup, retry with backoff if validation fails
    • Deployment safety: Blue-green deployment keeps previous version running if new version fails validation
  • Rationale: Configuration errors are preventable and should never reach production

Notification Delivery Failure:

  • Scenario: Pushover API unavailable or rate-limited
  • Mitigation: Log failure, attempt retry on next evaluation cycle
  • Monitoring: Track notification delivery success rate
  • Rationale: Missing single notification is acceptable if subsequent cycle succeeds

Integration Requirements

KuCoin Exchange API:

  • Market data: OHLCV data at multiple timeframes (1m, 15m, 1h, 4h)
  • Account data: Balance queries for capital allocation calculations
  • Position tracking: Current grid status, order fills, PnL tracking
  • Authentication: API key + secret + passphrase with IP whitelist
  • Error handling: Graceful degradation on API failures, retry logic for transient errors

Git Repository (market-maker-data):

  • Decision records: Immutable YAML files, one per recommendation
  • Metrics history: Hourly snapshots of system state
  • Commit strategy: Atomic commits with descriptive messages including symbol and state
  • Push failures: Log and continue (acceptable gap in audit trail during outages)
  • Access control: Private repository, SSH key authentication from Kubernetes pods

Pushover Notifications:

  • Direct API integration: No n8n dependency for MVP
  • Priority levels: NORMAL, WARNING, LATEST_ACCEPTABLE_EXIT, MANDATORY_EXIT map to Pushover priority
  • Rate limiting: Prevent notification spam (max 1 notification per state transition)
  • Delivery tracking: Log notification attempts and responses

Optional n8n Integration (Post-MVP):

  • Webhook triggers: Manual evaluation triggering
  • Multi-channel notifications: Email, Slack, SMS routing
  • Workflow orchestration: Complex notification logic

Risk Mitigations

Domain-Specific Risks:

Fast Regime Transitions:

  • Risk: Regime may transition faster than 1-hour evaluation cycle can detect (see RAIA R001)
  • Mitigation:
    • Backtesting to validate 12-24 hour warning window assumption (see RAIA Action 1)
    • Monitor near-miss scenarios during validation
    • Prepared to implement 15-minute cadence if needed
  • Trigger: If >20% of regime transitions provide <2 hour warning window

Exchange Outage During Critical Exit:

  • Risk: Cannot execute manual exit when KuCoin is unavailable (see RAIA R002)
  • Mitigation: Accept as known limitation (manual execution dependency)
  • Future: Multi-exchange diversification (post-Phase 5)
  • Monitoring: Track incidents during validation (see RAIA Action 2)

API Rate Limiting:

  • Risk: Excessive API calls trigger rate limits, blocking market data access
  • Mitigation:
    • 1-hour evaluation cycle well within KuCoin rate limits
    • Retry logic with exponential backoff prevents rapid retry storms
    • Monitor API usage to stay under limits

Data Staleness:

  • Risk: Stale market data leads to incorrect regime classification (see RAIA R006)
  • Mitigation:
    • Timestamp all market data fetches
    • Retry logic ensures fresh data attempts before failure
    • Alert operator if data age exceeds acceptable threshold

Capital Loss from False Positives:

  • Risk: Excessive false exits erode capital through missed ranging periods (see RAIA R004)
  • Mitigation:
    • Three-gate restart logic prevents premature re-entry
    • Backtesting validates false positive rate <30% (see RAIA A005, Action 3)
    • KPI tracking measures false exit impact

Regulatory Change:

  • Risk: Crypto regulations change, grid trading becomes restricted (see RAIA R005)
  • Mitigation: Monitor regulatory landscape, prepared to halt operations if needed
  • Legal review: Scheduled before external capital raise (see RAIA Action 4)

Crypto Trading Domain Specifics

24/7 Market Operations:

  • Implication: No market close, regime can shift anytime (overnight, weekends)
  • Mitigation: 1-hour evaluation cycle runs continuously via Kubernetes CronJob
  • Monitoring: System uptime monitoring, alert on CronJob failures

High Volatility Environment:

  • Implication: Crypto moves faster than traditional markets, tighter response windows
  • Mitigation: Gate thresholds calibrated for crypto volatility patterns (not traditional asset volatility)
  • Validation: Backtesting with crypto-specific volatility scenarios (Phase 4)

Single Exchange Dependency (KuCoin):

  • Risk: Exchange-specific outages, API changes, or policy changes affect operations (see RAIA I002)
  • Mitigation: Accept as validation phase limitation
  • Future: Multi-exchange architecture (post-Phase 5)

Grid Trading Mechanics:

  • KuCoin limitation: Spot grids not manageable via API (manual UI interaction required) (see RAIA I001)
  • Implication: System is decision support only, not automated execution
  • Benefit: Human-in-loop preserves control, reduces regulatory complexity

Assumptions & Actions

Critical Assumptions Requiring Validation:

  • A001: Regime transitions provide 12-24 hour warning windows → Validate in Phase 4 backtesting
  • A004: 1-hour evaluation cadence sufficient for capital protection → Monitor during Phases 2-5
  • A005: False positive rate <30% is acceptable → Measure via KPI framework
  • A006: Personal capital trading exempt from FCA regulation → Legal review before £100K

Key Actions:

  1. Action 1: Validate 1-hour cadence assumption via backtesting (Phase 4, Due: 2026-04-01)
  2. Action 3: Measure false positive rate via KPI framework (Phase 4-5, Due: 2026-04-15)
  3. Action 5: Return to domain requirements after validation data available (Due: 2026-05-01)
  4. Action 6: Quarterly RAIA review (Next: 2026-05-01)

Full RAIA Log: See .ai/projects/market-making/RAIA.md for complete Risks, Assumptions, Issues, and Actions tracking.

Innovation & Novel Patterns

Detected Innovation Areas

1. Tiered Exit Urgency Model

Innovation: Progressive exit states with explicit time windows for human decision-making, replacing binary stop-loss logic.

Differentiator: Traditional grid trading uses binary stop-losses (triggered or not triggered). This system implements a tiered urgency model:

  • WARNING: Early signal (2+ warning conditions met), 4-hour notification rate limit, provides 1-2 hour buffer to LATEST_ACCEPTABLE_EXIT
  • LATEST_ACCEPTABLE_EXIT: Regime assumptions failing, 2-hour notification rate limit, recommended exit window of 4-8 hours
  • MANDATORY_EXIT: Confirmed regime break, 1-hour notification rate limit, immediate exit recommended

Why This Matters: Provides graduated response time appropriate to signal strength. Users aren’t forced to choose between “no alert” or “emergency exit” - there are intermediate states that allow thoughtful decision-making while preserving capital protection.

Novel Aspect: Explicit modeling of decision urgency as progressive states with corresponding time buffers, rather than treating all exit signals as equivalent.


2. Sequential Three-Gate Restart Logic

Innovation: Post-exit restart requires sequential validation through three gates (not parallel checks), preventing premature re-entry during trend continuations.

Gate Structure:

  • Gate 1 (Directional Energy Decay): Must pass FIRST - validates trend strength has subsided (ADX falling, TrendScore low, no persistent directional swings)
  • Gate 2 (Mean Reversion Return): Evaluated ONLY after Gate 1 passes - validates mean-reverting behavior has returned (negative autocorrelation, short OU half-life, price oscillations reverting)
  • Gate 3 (Tradable Volatility): Evaluated ONLY after Gate 2 passes - validates volatility is in tradable range (not too low, not expanding)
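The sequential structure above (Gate N+1 is reached only if Gate N passes) can be sketched as a short-circuiting check. All threshold values here are illustrative placeholders, not the calibrated production thresholds:

```python
def evaluate_restart_gates(metrics: dict) -> tuple[bool, str]:
    """Sequential three-gate restart check: each gate is evaluated only
    if the previous one passed. Thresholds are illustrative placeholders."""
    # Gate 1 - Directional Energy Decay: trend strength must have subsided.
    if not (metrics["adx"] < 25 and metrics["trend_score"] < 35):
        return False, "Gate 1: directional energy still present"
    # Gate 2 - Mean Reversion Return: reached only after Gate 1 passes.
    if not (metrics["autocorr_lag1"] < 0 and metrics["ou_half_life"] < 24):
        return False, "Gate 2: mean reversion not confirmed"
    # Gate 3 - Tradable Volatility: reached only after Gate 2 passes.
    if not (0.01 <= metrics["bb_bandwidth"] <= 0.05):
        return False, "Gate 3: volatility outside tradable range"
    return True, "All gates passed: restart eligible"
```

The early returns enforce the ordering by construction: a failing Gate 1 means Gates 2 and 3 are never even computed, which is the forced progression the design calls for.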

Differentiator: Traditional trading systems use simple cooldown periods (“don’t trade for N hours after stop”). This implements sequential validation: you can’t evaluate mean reversion until directional energy has decayed, and you can’t evaluate volatility until mean reversion is confirmed.

Why This Matters: Prevents “stop-restart churn” where a grid is exited during a trend, then immediately re-entered before the trend fully resolves, leading to multiple stop-losses.

Novel Aspect: Sequential gating architecture (Gate N+1 only evaluated if Gate N passes) creates a forced progression through stability checks.


3. Multi-Metric Regime Consensus with 2+ Condition Triggering

Innovation: WARNING state requires 2+ warning conditions to trigger (not a single condition), using consensus across 6 regime metrics.

Metrics Used:

  1. ADX (trend strength)
  2. Efficiency Ratio (directional persistence)
  3. Lag-1 Autocorrelation (mean reversion detection)
  4. OU Half-Life (mean reversion speed)
  5. Normalized Slope (directional bias)
  6. Bollinger Bandwidth (volatility regime)

Consensus Logic:

  • Single warning condition = NORMAL state (no alert)
  • 2+ warning conditions = WARNING state (alert sent)
  • This prevents false alarms from single noisy indicators
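
A sketch of the voting rule (condition names are illustrative):

```python
def consensus_state(conditions: dict) -> str:
    """2+ independent warning conditions escalate to WARNING;
    a single noisy indicator stays NORMAL."""
    met = [name for name, is_met in conditions.items() if is_met]
    return "WARNING" if len(met) >= 2 else "NORMAL"


# A lone ADX spike does not alert:
#   consensus_state({"adx_rising": True}) -> "NORMAL"
# Multiple independent signals do:
#   consensus_state({"adx_rising": True, "confidence_declining": True}) -> "WARNING"
```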

Differentiator: Most technical analysis uses individual indicators or simple “AND” logic. This implements a voting mechanism - regime classification emerges from consensus, and WARNING requires multiple independent signals.

Why This Matters: Reduces false positive rate while maintaining sensitivity to genuine regime transitions. A single spike in ADX doesn’t trigger an alert, but ADX rising + confidence declining + efficiency ratio increasing = legitimate warning.

Novel Aspect: The explicit 2+ condition requirement to trigger WARNING, preventing single-indicator noise from generating actionable alerts.


4. Asymmetric Automation Philosophy

Innovation: System can automatically reduce risk (send alerts), but NEVER automatically deploys capital.

Design Principle:

  • Auto-Alert, Manual-Execute: System generates exit recommendations 24/7, but human must execute in KuCoin UI
  • Asymmetric Authority: System can escalate warnings (NORMAL → WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT) but cannot create grids or deploy capital without explicit approval
  • Human-in-Loop by Design: Not an afterthought or “manual override” - it’s the core architecture

Differentiator: Most trading systems are either fully automated (system trades without human input) or fully manual (human monitors 24/7). This explicitly separates monitoring (automated) from execution (manual).

Why This Matters:

  • Regulatory: Simpler compliance (no automated trading license needed)
  • Risk: Capital deployment requires human judgment, reducing catastrophic automation failures
  • Control: Operator maintains final authority while benefiting from 24/7 monitoring

Novel Aspect: The explicit articulation and implementation of “asymmetric automation” as a design philosophy, not just “we’ll add automation later.”


5. Investor-First Audit Trail Architecture

Innovation: Git-backed immutable decision records designed for investor scrutiny from day one (not added later).

Architecture:

  • Every recommendation committed to Git BEFORE notification sent
  • State transitions logged with timestamps, metrics, and reasoning
  • Separation of “system recommendation” vs “user action” tracked independently
  • No database, no retroactive editing - immutable audit trail via version control
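
A sketch of the commit-before-notify ordering; the commit and notify callables stand in for the real GitPython and Pushover integrations, and all names here are assumptions:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def record_decision(
    data_dir: Path,
    decision: dict,
    commit: Callable[[Path], None],
    notify: Callable[[str], None],
) -> Path:
    """Write an immutable decision record, commit it, and only then notify.

    Injecting `commit` and `notify` enforces the audit-first ordering
    in one place."""
    ts = datetime.now(timezone.utc)
    payload = json.dumps(decision, sort_keys=True, indent=2)
    # A content hash in the filename makes retroactive edits detectable.
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    path = data_dir / f"{ts:%Y-%m-%dT%H%M%S}-{digest}.json"
    path.write_text(payload)
    commit(path)                  # audit trail FIRST...
    notify(decision["summary"])   # ...notification second
    return path
```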

Differentiator: Most trading systems add logging as an afterthought. This makes audit credibility a first-class design requirement, shaping the entire data architecture.

Why This Matters:

  • Investor Credibility: Can answer “why didn’t you exit here?” for any historical moment
  • Performance Analysis: Separate tracking of recommendation quality (was the system right?) vs action quality (did the operator follow advice?)
  • Scaling Enabler: Clean audit trail is prerequisite for external capital (£100K+)

Novel Aspect: Using Git version control as the primary data store specifically for investor-credible audit trails, rather than traditional database logging.


Market Context & Competitive Landscape

Existing Approaches to Grid Exit:

  1. Manual Monitoring: Trader watches markets 24/7, decides when to exit grids

    • Limitation: Doesn’t scale, requires constant attention, subject to emotion/fatigue
  2. Simple Stop-Loss: Set stop-loss at X% below grid range, exit when hit

    • Limitation: Binary decision, often triggers at maximum loss, no early warning
  3. Trailing Stops: Stop-loss moves with price, locks in some profit

    • Limitation: Still binary, no regime awareness, can trigger during normal volatility
  4. Automated Trading Bots: Fully automated grid management with various exit rules

    • Limitation: Black-box decision-making, no human judgment, regulatory complexity

How This Differs:

This system combines:

  • Regime structure analysis (not just price levels)
  • Tiered urgency (not binary triggers)
  • Multi-metric consensus (not single indicators)
  • Human-in-loop (not fully automated)
  • Sequential restart validation (not simple cooldowns)
  • Investor-grade audit trails (not just operator logs)

Positioning: Structured decision support for systematic grid traders who want to scale capital while maintaining human judgment and building credible track records.


Validation Approach

Critical Questions to Answer:

Q1: Does tiered exit urgency preserve more capital than binary stop-losses?

  • Validation Method: Backtesting (Phase 4) - compare tiered exit vs simple stop-loss on 3-6 months historical data
  • Success Metric: 75%+ profit retention ratio (preserve majority of range-trading profits)
  • Measure: Average exit timing (how early do we exit vs when stop-loss would have hit?)

Q2: Does the 2+ condition WARNING logic reduce false positives without missing real transitions?

  • Validation Method: Track False Exit Rate (FER) during validation phase
  • Success Metric: FER <30% (see RAIA A005)
  • Measure: Exits where range resumed after stop vs exits where trend confirmed

Q3: Do sequential restart gates prevent stop-restart churn?

  • Validation Method: Track re-entry timing after exits, measure stop-loss hits on restarted grids
  • Success Metric: <10% of restarted grids hit stop-loss within 24 hours
  • Measure: Time between exit and successful re-entry, profitability of restarted grids

Q4: Does 1-hour evaluation cadence provide sufficient warning time?

  • Validation Method: Backtesting to measure actual regime transition warning windows (see RAIA A001, A004)
  • Success Metric: ≥80% of transitions provide >2 hour warning window
  • Measure: Time from WARNING to MANDATORY_EXIT in historical data
  • Fallback: If <80%, implement 15-minute cadence or adaptive evaluation frequency

Q5: Does multi-metric consensus improve regime classification accuracy?

  • Validation Method: Compare 6-metric consensus vs individual metrics
  • Success Metric: Higher True Transition Detection Rate (TTDR) with consensus vs single indicators
  • Measure: Regime classification accuracy in backtesting (correctly identified RANGE vs TREND)

Risk Mitigation

Innovation Risk 1: Excessive Complexity

  • Risk: Tiered states, sequential gates, and multi-metric consensus add complexity that may not improve outcomes vs simpler approaches
  • Mitigation: Backtesting comparison against simpler baselines (binary stop-loss, single indicator, no gates)
  • Fallback: If complex approach doesn’t outperform, simplify to best-performing baseline
  • Validation Trigger: If backtesting shows <10% improvement vs simple stop-loss, question complexity

Innovation Risk 2: False Positive Rate Too High

  • Risk: 2+ condition WARNING logic may still generate too many false exits (FER >30%)
  • Mitigation: Tunable thresholds via YAML config, conservative/aggressive presets available
  • Fallback: Increase WARNING requirement to 3+ conditions, or tighten individual condition thresholds
  • Validation Trigger: Track FER in Phase 4, adjust thresholds if >30%

Innovation Risk 3: 1-Hour Cadence Insufficient

  • Risk: Regime transitions may occur faster than 1-hour evaluation can detect (see RAIA R001)
  • Mitigation: Backtesting measures actual warning windows in historical data
  • Fallback: Implement 15-minute cadence or adaptive evaluation (NORMAL: 1h, WARNING: 15min, LATEST_ACCEPTABLE: 5min)
  • Validation Trigger: If >20% of transitions provide <2 hour warning, implement faster cadence

Innovation Risk 4: Sequential Gates Too Restrictive

  • Risk: Three-gate restart logic prevents timely re-entry, causing excessive opportunity cost
  • Mitigation: Track time-to-restart and profitability of missed ranging periods
  • Fallback: Parallel gate evaluation (all gates checked simultaneously) or reduce to 2 gates
  • Validation Trigger: If average time-to-restart >48 hours and missed profit >20% of preserved capital

Innovation Risk 5: Human-in-Loop Execution Delay

  • Risk: Manual execution introduces delay that negates early warning benefits
  • Mitigation: Measure Exit Reaction Time (ERT) - time from alert to actual exit
  • Fallback: If ERT consistently >30 minutes, consider API-based grid management (if KuCoin adds support) or multi-exchange architecture
  • Validation Trigger: Track ERT in operational phase, identify if manual execution is bottleneck

Innovation Risk 6: Audit Trail Overhead

  • Risk: Git commits for every decision create operational friction or repository bloat
  • Mitigation: Lightweight JSON/YAML files, daily aggregation, automated cleanup for old data
  • Fallback: Database logging with Git export for investor presentation
  • Validation Trigger: If Git operations slow evaluation >500ms or repo size >1GB, reconsider architecture

Backend Decision Support System - Specific Requirements

Project-Type Overview

This is a batch processing system with Git-based persistence, not a web API. The system runs as a Kubernetes CronJob executing Python modules directly with file-based output to a Git repository mounted on a Persistent Volume Claim (PVC).

Architecture:

KuCoin API → Python Evaluation → Git Commit (PVC) → Static Dashboard Generation → Git Push

Key Characteristics:

  • No HTTP API endpoints, no REST services, no client-server architecture
  • Scheduled Python execution (hourly via Kubernetes CronJob)
  • Git repository on PVC for persistence and retry capability
  • Static HTML dashboards with Chart.js visualizations generated every hour
  • Stateless job execution with all state loaded from/saved to Git

Data Pipeline

Processing Flow:

  1. Data Acquisition: Fetch OHLCV from KuCoin API, load recent metrics from Git PVC
  2. Regime Analysis: Calculate 6 metrics, classify regime, calculate confidence
  3. Exit State Evaluation: Evaluate WARNING/LATEST_ACCEPTABLE_EXIT/MANDATORY_EXIT conditions
  4. Gate Evaluation: If grid stopped, evaluate three sequential gates
  5. State Transition Tracking: Log state changes with rate limiting
  6. Decision Record Creation: Create immutable decision records
  7. Dashboard Generation: Generate HTML/JavaScript dashboard with Chart.js
  8. Data Persistence: Commit all files to Git (on PVC), push to remote with retry

No additional data transformation or aggregation stages are needed for the MVP - the pipeline is complete as described.
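
Under the stateless-job model, the eight stages above reduce to a strictly ordered run loop; a sketch, with stage names as placeholders for the real modules:

```python
def run_cycle(stages, context):
    """Run pipeline stages strictly in order; any exception aborts the cycle.
    Because all state lives in Git on the PVC, the next hourly run starts clean."""
    completed = []
    for name, stage in stages:
        stage(context)  # each stage reads/writes the shared context dict
        completed.append(name)
    return completed


# The eight MVP stages, in order (names illustrative):
STAGE_NAMES = [
    "acquire_data", "analyze_regime", "evaluate_exit_state", "evaluate_gates",
    "track_transitions", "create_decision_records", "generate_dashboard",
    "persist_to_git",
]
```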

Data Schemas: See SCHEMA.md for complete schema definitions (metrics, exit states, decision records, configuration).


Static Dashboard Generation

Execution: Dashboards generated as part of the same CronJob (not separate process)

Frequency: Every hour (regenerated with each evaluation)

Format: HTML with JavaScript charts (Chart.js library)

Structure: One dashboard HTML file per hour with embedded data for that evaluation period

  • File naming: dashboards/{symbol}/{YYYY-MM-DD}-{HH}.html
  • Self-contained: Data embedded in HTML (no external API calls)
  • Viewable via: file:// protocol locally, or simple HTTP server, or Git hosting

Visualizations (Essential - support recommendations/decisions):

  • Current regime classification and confidence
  • Exit state (NORMAL/WARNING/LATEST_ACCEPTABLE_EXIT/MANDATORY_EXIT)
  • All 6 metrics with current values and trends
  • Gate evaluation status (if grid stopped)
  • Recent state transition history
  • Decision recommendation (if actionable)

Technology Stack:

  • Chart.js for interactive visualizations
  • HTML5/CSS3 for layout
  • Embedded JSON data in <script> tags
  • No server-side rendering needed
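
A minimal sketch of the self-contained dashboard idea - evaluation data embedded as JSON in a script tag so the file needs no external API. Plain string templating is used here for brevity; the real service would use Jinja2 and Chart.js per the stack above, and all names are assumptions:

```python
import json
from datetime import datetime
from pathlib import Path

PAGE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{symbol} {stamp}</title></head>
<body>
<h1>{symbol} exit state: {exit_state}</h1>
<script id="evaluation-data" type="application/json">{payload}</script>
<!-- Chart.js would read the JSON above and render charts client-side -->
</body></html>
"""


def write_dashboard(root: Path, symbol: str, evaluation: dict, now: datetime) -> Path:
    # File naming follows dashboards/{symbol}/{YYYY-MM-DD}-{HH}.html
    out = root / "dashboards" / symbol / f"{now:%Y-%m-%d}-{now:%H}.html"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(PAGE.format(
        symbol=symbol,
        stamp=f"{now:%Y-%m-%d %H}:00",
        exit_state=evaluation["exit_state"],
        payload=json.dumps(evaluation),
    ))
    return out
```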

Error Handling & Resilience

KuCoin API Failures

Strategy: Retry 2-3 times with exponential backoff, then skip cycle

Acceptable for MVP: Yes

Persistent Failure Alerting: Yes - if API fails for multiple consecutive cycles (threshold: 3+ consecutive failures), send alert notification

Monitoring: Track API response times, error rates, success/failure counts → send to Grafana Loki for observability
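
A sketch of the retry policy; the fetch callable stands in for the actual KuCoin client, and the delay schedule is illustrative:

```python
import logging
import time

logger = logging.getLogger("kucoin")


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry with exponential backoff (1s, 2s, 4s, ...); on exhaustion return
    None so the caller skips this cycle instead of crashing the CronJob."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # real code would catch requests.RequestException
            logger.warning("KuCoin fetch failed (attempt %d/%d): %s",
                           attempt + 1, attempts, exc)
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None  # skip cycle; consecutive-failure alerting is handled separately
```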

Git Push Failures

Strategy: Log locally on PVC, continue operation (acceptable gap in audit trail for MVP)

PVC Design: Yes - CronJob should use PVC so Git repo persists between runs and doesn’t require full clone each time

Retry on Subsequent Cycles: Yes - if push failed previously, retry push on next cycle before committing new data

Implementation:

# Retry a previously failed push before committing new data (GitPython sketch)
from git import GitCommandError

if has_unpushed_commits():
    try:
        repo.remote("origin").push()
        logger.info("Pushed previously failed commits")
    except GitCommandError as exc:
        logger.warning("Previous commits still not pushed: %s", exc)

# Continue with current evaluation either way

Metric Calculation Errors

Strategy: Evaluation should continue with remaining metrics if one fails

Critical vs Optional: All metrics are conceptually critical, BUT:

  • If a metric calculation fails (e.g., OU half-life non-stationary), continue with remaining metrics
  • If enough metrics succeed to calculate confidence, generate recommendation WITH additional error information
  • Include metric calculation errors in notification/dashboard

Implementation Approach:

  • Calculate all metrics with error handling per metric
  • Track which metrics succeeded vs failed
  • If confidence can be calculated (even with partial metrics), proceed
  • Include error context: “Recommendation based on 5/6 metrics (OU half-life calculation failed - data non-stationary)”

Error Notification: If confidence level is high enough for entry/exit recommendation, communicate the recommendation WITH error details about failed metrics
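
A sketch of the per-metric error handling; the calculator registry and error-message wording are assumptions:

```python
import logging

logger = logging.getLogger("metrics")


def calculate_all_metrics(calculators: dict, candles):
    """Run every metric with per-metric error handling; return successes and
    failures separately so confidence can be computed from partial results."""
    succeeded, failed = {}, {}
    for name, calc in calculators.items():
        try:
            succeeded[name] = calc(candles)
        except Exception as exc:
            failed[name] = str(exc)
            logger.warning("Metric %s failed: %s", name, exc)
    return succeeded, failed


def describe_coverage(succeeded: dict, failed: dict) -> str:
    """Error context for notifications/dashboards."""
    total = len(succeeded) + len(failed)
    note = f"Recommendation based on {len(succeeded)}/{total} metrics"
    if failed:
        note += " (" + "; ".join(f"{k} failed: {v}" for k, v in failed.items()) + ")"
    return note
```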

Configuration Validation Failures

Strategy: Fail fast on startup if config invalid

Acceptable: Yes - pod won’t start if config has errors

Blue-Green Deployment: Previous version should continue running if new version fails validation (Kubernetes deployment strategy)
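
A stdlib sketch of the fail-fast behavior (the real service validates with Pydantic models; the config keys shown are hypothetical):

```python
import sys

REQUIRED_KEYS = {"warning_conditions_required", "evaluation_interval_hours", "symbol"}


def validate_config(config: dict) -> dict:
    """Fail fast: raise on any structural problem so the pod never starts
    with a half-valid config."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"Missing config keys: {sorted(missing)}")
    if config["warning_conditions_required"] < 1:
        raise ValueError("warning_conditions_required must be >= 1")
    return config


def load_or_die(config: dict) -> dict:
    try:
        return validate_config(config)
    except ValueError as exc:
        # Non-zero exit -> pod fails to start -> previous version keeps running
        print(f"FATAL: invalid configuration: {exc}", file=sys.stderr)
        sys.exit(1)
```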


Performance & Scalability

Processing Time Constraints

Maximum Acceptable Time: Not a concern functionally (even >1 hour would work), but:

  • Error threshold: If evaluation takes >5 minutes, log ERROR (potential performance issue)
  • Warning threshold: If evaluation takes >1 minute, log WARNING
  • Target: Complete evaluation in <30 seconds for typical case

No specific hard performance requirements - hourly cadence provides plenty of buffer
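
The thresholds above can be sketched as a classification plus a timing wrapper (names assumed):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("perf")


# Thresholds from the constraints above: target <30s, WARNING >1min, ERROR >5min.
def classify_duration(elapsed_seconds: float) -> str:
    if elapsed_seconds > 300:
        return "ERROR"
    if elapsed_seconds > 60:
        return "WARNING"
    return "OK"


@contextmanager
def timed_evaluation():
    start = time.monotonic()
    yield
    elapsed = time.monotonic() - start
    level = classify_duration(elapsed)
    # "OK" has no logging level, so it falls back to INFO.
    logger.log(getattr(logging, level, logging.INFO),
               "Evaluation completed in %.1fs (%s)", elapsed, level)
```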

Git Repository Size Management

Retention Policy: Keep all historical data forever for MVP

Cleanup: No automated deletion for MVP

Action Item: Create action in RAIA log to revisit data retention policy at end of MVP (after validation phase complete)

Current Assessment: Not a concern - estimate ~10-50 KB per evaluation × 24 hours × 365 days ≈ 87-438 MB/year (manageable)


Configuration Management

Hot-Reload

Not required: Configuration changes take effect on next CronJob execution (no need for hot-reload)

Acceptable Delay: Up to 1 hour between config change and effect (next hourly run)

Versioning

Primary Versioning: Git commit hash of config file (tracked in decision records)

Image Versioning: Docker image version stored in image metadata (immutable)

Enhancement: Consider writing Docker image version to output files alongside config Git hash

  • Provides complete traceability: “This decision used config version X running on image version Y”
  • Useful for debugging if image code changes behavior

Implementation:

# In metrics files
system_version:
  config_git_hash: "a3f8d92e"
  image_version: "v1.2.3"  # From Docker image label

Monitoring & Observability

Metrics Collection: All metrics (Kubernetes pod metrics, application metrics, timing data) sent to Grafana Loki

Required Monitoring/Alerting:

  1. CronJob Execution Failures: Job didn’t run at expected time
  2. Evaluation Errors: Job ran but threw exceptions
  3. KuCoin API Degradation: High failure rate (3+ consecutive failures)
  4. Git Push Failures: Persistent issues (3+ consecutive push failures)
  5. Metric Calculation Anomalies: Values out of expected ranges or calculation failures
  6. Exit State Transitions: Log all WARNING/LATEST_ACCEPTABLE_EXIT/MANDATORY_EXIT transitions
  7. Performance Degradation: Evaluation taking >5 minutes (ERROR) or >1 minute (WARNING)

Raw Metrics to Loki:

  • KuCoin API response times
  • KuCoin API error counts and types
  • CronJob execution duration
  • Internal processing step timings (metric calculation, Git operations, dashboard generation)
  • Errors and exceptions with full stack traces
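
Loki’s push API accepts JSON streams with nanosecond-epoch timestamps; a sketch of building the payload (labels and the endpoint hostname in the comment are assumptions):

```python
import time


def loki_payload(labels, lines, ts_ns=None):
    """Build a Loki push payload for POST /loki/api/v1/push.

    Each value is a [nanosecond-epoch-string, log-line] pair."""
    ts = str(ts_ns if ts_ns is not None else time.time_ns())
    return {"streams": [{"stream": labels, "values": [[ts, line] for line in lines]}]}


# Sending (sketch; endpoint assumed):
#   urllib.request.urlopen(urllib.request.Request(
#       "http://loki.monitoring:3100/loki/api/v1/push",
#       data=json.dumps(payload).encode(),
#       headers={"Content-Type": "application/json"}))
```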

Alert Channels:

  • Pushover (direct API) for critical alerts
  • Grafana for historical metrics and dashboards
  • Optional: Webhook to n8n for advanced routing (future enhancement)

Data Retention & Cleanup

Current Policy: No automated cleanup for MVP

All Data Retained:

  • Raw metrics: Forever
  • Decision records: Forever (audit trail requirement)
  • Exit state transitions: Forever
  • Dashboards: Forever

Future Review: Action item in RAIA log to revisit retention policy after MVP validation phase


Technology Stack

Core Stack:

  • Python 3.11+
  • Pydantic for schema validation
  • GitPython for Git operations
  • PyYAML for YAML parsing
  • Requests for KuCoin API calls
  • Chart.js for dashboard visualizations

Additional Dependencies:

  • Jinja2 (or similar) for HTML template rendering
  • JSON for embedded data in dashboards

Deployment:

  • Kubernetes CronJob
  • PVC for Git repository persistence
  • exit_strategy_config.yaml loaded from the Git repo on the PVC (no ConfigMap; see Configuration Sources)
  • ExternalSecrets for KuCoin API keys (central secret store)

Deployment & Operations

Kubernetes Deployment

Current Status: Already working in Kubernetes

CronJob Configuration:

  • Schedule: 0 * * * * (hourly, on the hour)
  • PVC mount: Git repository persists between runs
  • No need to clone repo each time
  • Retry capability for failed Git pushes

PVC Design:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: market-maker-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi  # Adjust based on growth

Configuration Sources

Configuration managed via:

  1. Git Repository: exit_strategy_config.yaml committed to Git, loaded from PVC
  2. ExternalSecrets: KuCoin API keys from central secret store (already implemented)
  3. Environment Variables: Overrides for deployment-specific settings (data paths, logging levels)

No ConfigMap needed - configuration comes from Git repo on PVC

Configuration Flow:

  1. Configuration YAML committed to Git (workspace-root)
  2. Git repo cloned/updated on PVC by CronJob
  3. Python reads config from PVC path
  4. Config Git hash recorded in decision records

Implementation Considerations

File Organization:

repos/market-making/metrics-service/
├── src/
│   ├── regime/              # Regime detection (Phase 1 complete)
│   ├── exit_strategy/       # Exit strategy (Phase 2 target)
│   ├── schemas/             # Pydantic models (Phase 2)
│   ├── persistence/         # Git operations with retry (Phase 2)
│   ├── dashboards/          # Dashboard generation (Phase 5)
│   └── monitoring/          # Loki integration (Phase 5)
├── config/
│   └── exit_strategy_config.yaml
└── k8s/
    ├── cronjob.yaml
    ├── pvc.yaml
    └── external-secrets.yaml

Git Operations with PVC:

  • First run: Clone repository to PVC
  • Subsequent runs: git pull to update, commit new files, push with retry
  • Failed push: Accumulates commits on PVC, retries on next cycle
  • PVC ensures no data loss even if push fails

Dashboard Generation:

  • Generate HTML file per hour: dashboards/{symbol}/{YYYY-MM-DD}-{HH}.html
  • Embed evaluation data as JSON in <script> tag
  • Chart.js renders interactive charts client-side
  • Self-contained files (no external API calls)
  • Commit dashboards to Git for version control and distribution

Stateless Job Execution:

  • CronJob pod starts, mounts PVC with Git repo
  • Loads config and historical data from Git
  • Performs evaluation
  • Writes new files to Git (on PVC)
  • Commits and pushes
  • Generates dashboard
  • Pod exits
  • Next run starts fresh (but Git repo on PVC persists)

Project Scoping & Phased Development

MVP Strategy & Philosophy

MVP Approach: Validation-First Capital Protection System

This MVP follows a prove-before-scale philosophy. The system must demonstrate capital protection capability with £1K before committing £10K. The MVP is NOT “minimum features to launch” - it’s “minimum features to confidently scale capital.”

Why This Scope:

  • Can’t skip validation: Backtesting + live testing required before £10K deployment
  • Need visibility: Position risk quantification essential for informed exit decisions
  • Must measure success: KPI tracking proves system works (not just feels right)
  • Investor readiness: Complete audit trail + track record enables external capital (12-month vision)

Resource Requirements:

  • Development: Solo developer (Craig) with AI assistance
  • Capital: £1K validation → £10K scale → £100K+ external investment
  • Timeline: 2-4 weeks validation after Phases 2-5 complete
  • Infrastructure: Kubernetes cluster (already operational), KuCoin API access

What This MVP Proves:

  1. Exit strategy preserves capital during regime transitions (75%+ profit retention)
  2. System provides actionable warnings before catastrophic exits (95%+ stop-loss avoidance)
  3. False positive rate acceptable (<30% - not stopping grids unnecessarily)
  4. Human-in-loop execution viable (operator responds within acceptable windows)
  5. Audit trail sufficient for investor scrutiny

MVP Feature Set (Phases 2-5)

Current State (Phase 1 - COMPLETE):

  • ✅ Six regime metrics operational (ADX, Efficiency Ratio, Autocorrelation, OU Half-Life, Normalized Slope, Bollinger Bandwidth)
  • ✅ Regime classification working (RANGE_OK, RANGE_WEAK, TRANSITION, TREND)
  • ✅ Git-backed storage with Kubernetes CronJob (hourly evaluation)
  • ✅ Basic Pushover notifications functional
  • ⚠️ Data Quality Issue: Hardcoded dummy values in engine.py must be fixed before Phases 2-5 (see implementation-plan.md Phase 1)

Core User Journeys Supported:

Journey 1: Active Grid Trader (Exit Protection)

  • Real-time exit state evaluation (WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT)
  • Push notifications with actionable recommendations
  • Position risk visibility (capital at risk, profit give-back estimates)
  • Manual exit execution with state tracking

Journey 2: Historical Decision Reviewer (Self-Validation)

  • Git-backed immutable decision records
  • KPI analysis framework (SLAR, PRR, TTDR, FER metrics)
  • Backtesting framework showing system would have worked
  • Track record for personal scaling decision

Journey 5: Kubernetes CronJob (Scheduled Evaluation)

  • Stateless hourly execution with PVC-backed Git persistence
  • Retry logic for Git push failures
  • Direct Pushover API integration (no n8n dependency)
  • Static HTML dashboard generation with Chart.js

Must-Have Capabilities:

Phase 2: Exit Strategy Core

  • Exit State Machine: Progressive urgency states (NORMAL → WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT)
  • Three-Gate Restart Logic: Sequential validation (Directional Energy Decay → Mean Reversion Return → Tradable Volatility)
  • Multi-Condition Triggering: Require 2+ warning conditions to prevent false alarms
  • State Transition Tracking: Git-logged transitions with timestamps and reasoning
  • Historical Data Loading: Load last 12-24 hours of metrics for persistence checks

Trigger Logic Implemented:

  • MANDATORY_EXIT: TREND regime detected, 2+ consecutive closes outside range, directional structure confirmed
  • LATEST_ACCEPTABLE_EXIT: TRANSITION persists (≥2×4h OR ≥4×1h bars), OU half-life ≥2× baseline, volatility expansion >1.25×
  • WARNING: 2+ conditions met (TRANSITION probability ≥40%, confidence declining, efficiency ratio rising, mean reversion slowing, volatility expanding)
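
The tiers compose by precedence - most urgent first, with the 2+ consensus rule applying only to WARNING. A sketch, where the boolean inputs stand in for the concrete trigger checks above:

```python
def evaluate_exit_state(mandatory_met, latest_met, warning_conditions):
    """Most urgent tier wins; WARNING needs 2+ of its five conditions."""
    if mandatory_met:
        return "MANDATORY_EXIT"
    if latest_met:
        return "LATEST_ACCEPTABLE_EXIT"
    if sum(warning_conditions) >= 2:
        return "WARNING"
    return "NORMAL"
```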

Phase 3: Position Risk Quantification

  • KuCoin Position Tracking: Fetch real-time position data via API
  • Capital Risk Calculator: Quantify capital at risk, profit give-back estimates, stop-loss distance in ATR
  • Enhanced Notifications: All exit state alerts include position risk context
  • Graceful Degradation: System continues if KuCoin API unavailable (uses last known positions)

Notification Enhancements:

  • WARNING: “Capital at risk: $120.50, Review within 24h”
  • LATEST_ACCEPTABLE_EXIT: “Expected give-back if delayed 12h: $4-7, Exit within 4-12h”
  • MANDATORY_EXIT: “Stop-loss distance: 0.6 ATR (CRITICAL), Exit NOW”

Phase 4: Testing & Validation

  • Unit Tests: 60+ tests covering metric calculations, exit triggers, state transitions
  • Integration Tests: End-to-end flow (regime → exit state → notification → Git commit)
  • Backtesting Framework: Replay historical metrics (3-6 months data), validate exit quality
  • CI/CD Pipeline: GitHub Actions with quality gates (80%+ coverage, all tests pass)

Backtesting Success Criteria:

  • Profit Retention Ratio ≥75% (preserved majority of range profits)
  • Stop-Loss Avoidance Rate ≥95% (exited before stop-loss in 95%+ scenarios)
  • False Exit Rate ≤30% (acceptable false positive rate)
  • Average warning lead time ≥30 minutes (met timing requirements)
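
Illustrative formulas for the ratio metrics above; the PRD fixes targets rather than exact math, so these definitions are assumptions:

```python
def profit_retention_ratio(profit_at_exit: float, peak_profit: float) -> float:
    """PRR: fraction of peak range-trading profit preserved at exit."""
    return profit_at_exit / peak_profit if peak_profit > 0 else 0.0


def stop_loss_avoidance_rate(graceful_exits: int, total_exits: int) -> float:
    """SLAR: share of exits executed before the exchange stop-loss fired."""
    return graceful_exits / total_exits if total_exits else 1.0


def false_exit_rate(false_exits: int, total_exits: int) -> float:
    """FER: share of exits where the range resumed (exit was unnecessary)."""
    return false_exits / total_exits if total_exits else 0.0
```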

Phase 5: Operational Foundation

  • Evaluation Cadence: 1-hour CronJob execution (0 * * * *) - matches 12-24h warning window assumption
  • Audit Logging: Complete Git-backed state transitions, notification delivery tracking, operator action recording
  • KPI Tracking Framework: Calculate SLAR, PRR, TTDR, FER, MEC metrics from audit logs
  • Static Dashboards: HTML/Chart.js visualizations generated hourly, committed to Git
  • Monitoring Integration: All metrics/logs sent to Grafana Loki for observability

Dashboard Visualizations:

  • Current regime classification and confidence score
  • Exit state (NORMAL/WARNING/LATEST_ACCEPTABLE_EXIT/MANDATORY_EXIT)
  • All 6 metrics with current values and trends
  • Gate evaluation status (if grid stopped)
  • Recent state transition history

Out of Scope for MVP

Explicitly NOT Included:

Multi-Symbol Support:

  • MVP: Single ETH-USDT grid only (SINGLE_GRID mode)
  • Rationale: Prove exit strategy works for one symbol before scaling
  • Future: Multi-symbol portfolio management (post-MVP growth feature)

Automated Grid Creation:

  • MVP: Human approval required for all grid starts
  • Rationale: Preserve human judgment, reduce regulatory complexity
  • Future: Automated creation with high-confidence thresholds (post-MVP)

Advanced Dashboards:

  • MVP: Static HTML/Chart.js files generated hourly
  • Rationale: Sufficient for validation phase, investor-presentable
  • Future: Real-time interactive dashboards, performance attribution analysis (post-MVP)

15-Minute Evaluation Cadence:

  • MVP: 1-hour evaluation cycle
  • Rationale: Research indicates 12-24h warning windows (RAIA A001, A004) - hourly sufficient
  • Future: Adaptive cadence (state-based frequency) if validation shows need (post-MVP)

Automated Cleanup:

  • MVP: Keep all data forever (no retention policy)
  • Rationale: Preserve complete audit trail for validation analysis
  • Future: Revisit after MVP complete (RAIA action item)

Multi-Exchange Support:

  • MVP: KuCoin only
  • Rationale: Single exchange simplifies integration, acceptable for validation
  • Future: Multi-exchange diversification reduces outage risk (post-MVP)

Performance Optimization:

  • MVP: Functional performance (evaluation <5 minutes acceptable)
  • Rationale: 1-hour cadence provides plenty of buffer
  • Future: Caching, async processing if needed (post-MVP)

Post-MVP Roadmap

Phase 6: Capital Scaling (3-Month Horizon)

Objective: Operate at £10K capital with proven exit strategy

Prerequisites:

  • MVP validation complete (2-4 weeks live operation with £1K)
  • Capital doubled to £2K during validation
  • Zero stop-loss breaches during validation period
  • KPIs meet targets (SLAR ≥95%, PRR ≥75%, TTDR ≥70%)

Enhancements:

  • Track record documentation for personal scaling decision
  • Threshold tuning based on real performance data
  • KPI trend analysis (monthly reports)

Timeline: Month 4-6 after MVP complete


Phase 7: Investor Preparation (6-Month Horizon)

Objective: Package track record for external capital raise (£100K+)

Prerequisites:

  • 3+ months operation at £10K capital
  • Consistent monthly capital growth (4%+ average)
  • Clean failure analysis documentation
  • Backtesting validated against 3+ years historical data

Deliverables:

  • Investor Presentation: Track record visualization, backtesting evidence, failure analysis
  • Separation of Concerns: “System recommendation quality” vs “Operator action quality” metrics
  • Regulatory Review: Legal assessment before external capital (RAIA A006)
  • Multi-Symbol Validation: Expand beyond ETH-USDT, prove approach generalizes

Timeline: Month 7-12 after MVP complete


Phase 8: Growth Features (12-Month+ Horizon)

Objective: Scale operations with enhanced automation and intelligence

Enhanced Automation:

  • Automated grid creation with high-confidence thresholds (human override available)
  • Multi-symbol portfolio management (concurrent grids across symbols)
  • Dynamic capital allocation based on regime confidence

Analytics & Reporting:

  • Real-time visual dashboards (replace static HTML)
  • Automated investor reports (monthly performance summaries)
  • Performance attribution analysis (which decisions drove returns)
  • Regime classification accuracy tracking (learn from misclassifications)

Intelligence Enhancements:

  • Machine learning for regime classification refinement (adaptive to market structure changes)
  • Adaptive gate thresholds based on market conditions (not static YAML config)
  • Predictive exit timing optimization (earlier warnings for faster regime transitions)

Risk Management Expansion:

  • Portfolio-level risk limits (not just per-grid)
  • Correlation analysis across symbols (avoid concentrated exposure)
  • Multi-exchange support (KuCoin + Binance + others for outage protection)

Timeline: Month 13+ after MVP complete


Progressive Feature Roadmap Summary

MVP (Phases 2-5): Capital Protection Foundation

  • Exit strategy + validation + operational foundation
  • £1K validation → confident £10K scale
  • 2-4 weeks live operation
  • Done When: KPIs proven, audit trail complete, zero stop-loss breaches

Phase 6 (Post-MVP): Capital Scaling

  • Operate at £10K with proven system
  • 3 months track record building
  • Done When: Consistent 4%+ monthly growth, ready for investor presentation

Phase 7 (6-Month): Investor Readiness

  • Multi-symbol validation
  • External capital preparation (£100K+)
  • Done When: Investor presentation complete, regulatory review done

Phase 8 (12-Month+): Growth & Intelligence

  • Enhanced automation (within asymmetric philosophy)
  • ML-based refinements
  • Multi-exchange portfolio management
  • Done When: Operating at £100K+ scale with external investment

Risk Mitigation Strategy

Technical Risks:

Innovation Risk 1: 1-Hour Cadence Insufficient

  • Risk: Regime transitions may occur faster than hourly evaluation can detect (RAIA R001)
  • Mitigation: Backtesting validates actual warning windows in historical data (Phase 4)
  • Fallback: Implement 15-minute cadence if >20% of transitions provide <2h warning
  • Validation Trigger: Monitor during Phases 2-5, measure warning lead times in KPI framework

Innovation Risk 2: False Exit Rate Too High

  • Risk: 2+ condition WARNING logic may still generate excessive false exits (FER >30%)
  • Mitigation: Tunable thresholds via YAML config, conservative/aggressive presets
  • Fallback: Increase WARNING requirement to 3+ conditions, or tighten individual thresholds
  • Validation Trigger: Track FER in Phase 4 backtesting, adjust before live deployment
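
As a sketch of the tunable-threshold mitigation, a hypothetical YAML preset layout is shown below. All key names are illustrative assumptions, not the actual config schema; the point is that switching presets (or moving from 2+ to 3+ WARNING conditions) is a one-line config change, not a code change.

```yaml
# Illustrative exit-threshold presets (key names are hypothetical).
exit_thresholds:
  active_preset: conservative     # or "aggressive"
  presets:
    conservative:
      warning_min_conditions: 3   # fallback: require 3+ triggering metrics
      adx_trend_threshold: 30
      efficiency_ratio_max: 0.45
    aggressive:
      warning_min_conditions: 2
      adx_trend_threshold: 25
      efficiency_ratio_max: 0.55
```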

Innovation Risk 3: Sequential Gates Too Restrictive

  • Risk: Three-gate restart logic prevents timely re-entry, incurring excessive opportunity cost
  • Mitigation: Track time-to-restart and profitability of missed ranging periods (KPI framework)
  • Fallback: Parallel gate evaluation or reduce to 2 gates
  • Validation Trigger: If average time-to-restart >48h AND missed profit >20% of preserved capital

Market Risks:

Fast Regime Transitions (RAIA R001)

  • Risk: Market moves faster than the 1-hour cycle can detect, leaving insufficient warning time
  • Mitigation:
    • Backtesting validates 12-24h warning window assumption (RAIA Action 1)
    • Monitor near-miss scenarios during validation
    • Prepared to implement 15-minute cadence if needed
  • Trigger: If >20% of transitions provide <2h warning window

Exchange Outage During Critical Exit (RAIA R002)

  • Risk: Cannot execute a manual exit if KuCoin is unavailable during MANDATORY_EXIT
  • Mitigation: Accept as known limitation (manual execution dependency)
  • Future: Multi-exchange diversification (Phase 8)
  • Monitoring: Track incidents during validation (RAIA Action 2)

Capital Loss from False Positives (RAIA R004)

  • Risk: Excessive false exits erode capital through missed ranging periods
  • Mitigation:
    • Three-gate restart logic prevents premature re-entry
    • Backtesting validates FER <30% (RAIA A005, Action 3)
    • KPI tracking measures false exit impact
  • Trigger: If FER >30% in backtesting, tighten WARNING thresholds

Resource Risks:

Data Quality Issues Block Progress

  • Risk: Phase 1's hardcoded dummy values must be fixed before Phase 2-5 outputs can be trusted
  • Mitigation: Phase 1 prioritized, 40-60 hours estimated (see implementation-plan.md)
  • Status: In progress (ADX complete, 11% of Phase 1 done)
  • Contingency: Allocate 20% buffer time for unexpected data issues

Testing Reveals Major Bugs

  • Risk: Phase 4 backtesting shows exit logic fundamentally flawed
  • Mitigation: Test early (consider Phase 4 before Phases 2-3), iterate on thresholds
  • Fallback: Simplify trigger logic (remove complex conditions), use proven baselines
  • Contingency: Budget 50% additional time if major redesign needed

Scope Validation & Constraints

What Makes This the Right MVP:

Can Validate Core Value Proposition:

  • Exit strategy proven to preserve capital (backtesting + live testing)
  • Tiered urgency model tested (WARNING → LATEST_ACCEPTABLE → MANDATORY progression)
  • Sequential gates validated (prevent premature re-entry)
  • Human-in-loop execution proven viable (operator can respond in time)

Can Make Confident Scaling Decision:

  • KPI framework provides objective success measures (SLAR, PRR, TTDR)
  • Audit trail shows “did system work?” vs “did I follow advice?”
  • Backtesting + 2-4 weeks live operation = sufficient confidence for £10K
  • Track record foundation for future investor presentation

Can Be Completed in Reasonable Timeframe:

  • Phase 1: 2-3 weeks (data quality fix)
  • Phases 2-5: 4-6 weeks (exit strategy + validation + operational)
  • Total: 6-9 weeks development + 2-4 weeks validation = 2-3 months to “MVP Done”

Boundaries Tested:

Could validate without Phase 3 (Position Risk)? NO

  • Need “capital at risk: $120” visibility for informed exit decisions
  • Essential for £10K scale confidence
  • Position risk quantification is must-have

Could validate with basic text dashboards (no Chart.js)? NO

  • User explicitly requires charts for regime trend assessment
  • Visual confirmation of exit state transitions aids decision-making
  • Chart.js is lightweight, not over-engineering

Could validate without backtesting (Phase 4)? NO

  • Can’t trust exit logic without historical validation
  • Need objective proof of 75% profit retention, 95% stop-loss avoidance
  • De-risks £10K capital deployment
  • Backtesting is must-have

Could simplify Phase 5 (Operational)? YES - Potential optimization

  • Could defer fancy KPI dashboards (manual calculation acceptable)
  • 1-hour cadence already correct (not 15-min)
  • Simple YAML audit logs sufficient initially (enhance later)
  • Simplification Opportunity: Streamline Phase 5 to basic logging + manual KPIs

Phase Sequence Validation:

Current Plan: Phase 1 → 2 → 3 → 4 → 5 (sequential)

Alternative Considered: Phase 1 → 4 → 2 → 3 → 5 (backtest-first)

  • Benefit: Validate exit logic via backtesting BEFORE building Phases 2-3
  • Risk: Delays an operational system, and iterating on thresholds is harder without working code
  • Decision: Keep current sequence (2→3→4) for faster feedback loop, but Phase 4 can start in parallel with Phase 3

Recommended Optimization:

  • Phase 1: Data Quality (BLOCKER - must complete first)
  • Phase 2 + Phase 4 (partial): Build exit strategy WHILE creating backtesting framework
  • Phase 3: Position Risk (can parallelize with Phase 4 backtesting)
  • Phase 4 (complete): Validate everything before deployment
  • Phase 5: Operational polish

Success Criteria (MVP “Done”)

Completion Criteria (All Must Be Met):

Code Complete:

  • All Phase 2-5 code implemented with 100% test pass rate
  • No critical bugs, no hardcoded dummy values
  • Configuration complete and validated

Backtesting Validation (Phase 4):

  • Exit logic tested against 3-6 months historical data
  • Profit Retention Ratio ≥75%
  • Stop-Loss Avoidance Rate ≥95%
  • False Exit Rate ≤30%
  • Average warning lead time ≥30 minutes

Live Capital Validation (2-4 Weeks):

  • Operated with £1K live capital for 2-4 weeks
  • Experienced multiple regime cycles (at least 2-3 TRANSITION events)
  • Zero stop-loss breaches during validation period (excluding black swan events)
  • KPIs meet targets in live operation (not just backtesting)

Capital Scaling Milestone:

  • Capital doubled from £1K to £2K during validation period
  • Proves system protects capital WHILE capturing ranging profits
  • Demonstrates profitability, not just capital preservation

Audit Trail Complete:

  • All decision records committed to Git with timestamps
  • State transitions logged with reasoning and metrics
  • Can answer “why didn’t you exit here?” for any historical moment
  • Separation of system recommendations vs operator actions tracked

System Ready for £10K:

  • Risk calculations scale correctly (position sizing, stop-loss placement)
  • Position tracking handles larger capital amounts
  • Notification system tested and reliable
  • Operator confident in decision-making process

MVP Declared “Done” When: All six completion criteria met + personal decision: “I’m ready to deploy £10K confidently.”

3-Month Success (Post Phase 2-5):

  • Operating at £10K capital with same exit quality metrics
  • Consistent monthly growth (4%+ average)
  • Clean track record of exit decisions with measurable outcomes
  • Investor presentation materials ready (if pursuing external capital)

12-Month Vision:

  • £100K+ capital with external investment
  • Exit strategy proven across multiple market regimes (bull, bear, ranging, volatile)
  • Published track record of regime classification accuracy
  • Multi-symbol support (beyond single ETH-USDT grid)

Functional Requirements

Regime Analysis & Classification

FR1: System can fetch OHLCV market data from exchange API
FR2: System can calculate six regime metrics (ADX, Efficiency Ratio, Autocorrelation, OU Half-Life, Normalized Slope, Bollinger Bandwidth)
FR3: System can classify market regime into four states (RANGE_OK, RANGE_WEAK, TRANSITION, TREND)
FR4: System can calculate regime confidence score
FR5: System can persist regime analysis results to version-controlled storage
FR6: System can load historical regime analysis for trend evaluation
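
Of the six regime metrics in FR2, the Efficiency Ratio gives a feel for the kind of calculation involved. A minimal sketch, assuming Kaufman's standard definition (net directional move divided by total path length); the function name and window handling are ours, not the system's API:

```python
def efficiency_ratio(closes: list[float]) -> float:
    """Kaufman Efficiency Ratio over a window of closing prices.
    Near 1.0 = cleanly trending; near 0.0 = choppy/ranging."""
    net_move = abs(closes[-1] - closes[0])
    # Total path length: sum of absolute bar-to-bar moves.
    path = sum(abs(b - a) for a, b in zip(closes, closes[1:]))
    return net_move / path if path else 0.0

# Perfectly directional series -> 1.0
print(efficiency_ratio([100, 101, 102, 103, 104]))
# Choppy series that ends where it started -> 0.0
print(efficiency_ratio([100, 102, 100, 102, 100]))
```

A low ER is one input the classifier (FR3) would weigh toward RANGE_OK; a sustained high ER weighs toward TREND.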

Exit Strategy Management

FR7: System can evaluate current exit state based on regime analysis (NORMAL, WARNING, LATEST_ACCEPTABLE_EXIT, MANDATORY_EXIT)
FR8: System can detect MANDATORY_EXIT conditions (TREND regime, consecutive closes outside range, directional structure confirmed)
FR9: System can detect LATEST_ACCEPTABLE_EXIT conditions (TRANSITION persistence, mean reversion degradation, volatility expansion)
FR10: System can detect WARNING conditions requiring 2+ triggering metrics
FR11: System can track exit state transitions with timestamps and reasons
FR12: System can evaluate three sequential restart gates (Directional Energy Decay, Mean Reversion Return, Tradable Volatility)
FR13: System can enforce gate sequencing (Gate N+1 only evaluated if Gate N passes)
FR14: System can track gate status history for stopped grids
FR15: System can determine grid eligibility for restart based on gate progression
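
The strict gate sequencing in FR13 (Gate N+1 only evaluated if Gate N passes) can be sketched as below. The gate names follow FR12, but the predicate thresholds are illustrative assumptions, not the system's tuned values:

```python
from typing import Callable

# A gate is a named predicate over the current regime metrics.
GateCheck = Callable[[dict], bool]

def evaluate_restart_gates(metrics: dict,
                           gates: list[tuple[str, GateCheck]]) -> dict:
    """Evaluate restart gates strictly in order (FR13): a later gate is
    only checked once every earlier gate has passed this cycle."""
    status: dict = {}
    for name, check in gates:
        passed = check(metrics)
        status[name] = passed
        if not passed:
            break  # remaining gates are not evaluated at all
    status["eligible_for_restart"] = all(
        status.get(name, False) for name, _ in gates
    )
    return status

# Hypothetical thresholds for the three FR12 gates.
gates = [
    ("directional_energy_decay", lambda m: m["adx"] < 20),
    ("mean_reversion_return",    lambda m: m["autocorrelation"] < 0),
    ("tradable_volatility",      lambda m: m["bollinger_bandwidth"] > 0.02),
]

# ADX still elevated: gate 1 fails, so gates 2-3 never run this cycle.
print(evaluate_restart_gates(
    {"adx": 25, "autocorrelation": -0.1, "bollinger_bandwidth": 0.05}, gates))
```

Persisting each cycle's `status` dict gives the gate history required by FR14, and `eligible_for_restart` is the FR15 determination.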

Risk Assessment

FR16: System can fetch active position data from exchange API
FR17: System can calculate unrealized PnL for active positions
FR18: System can calculate capital at risk based on current positions and stop-loss distance
FR19: System can estimate profit give-back if exit delayed by specified hours
FR20: System can calculate stop-loss distance in ATR units
FR21: System can track grid position health relative to configured boundaries
FR22: System can gracefully degrade when position data unavailable (use last known state)
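
FR18 and FR20 reduce to simple arithmetic. A hedged sketch, ignoring fees, slippage, and partial fills; function names are ours for illustration:

```python
def capital_at_risk(position_size: float, entry_price: float,
                    stop_loss_price: float) -> float:
    """Worst-case loss if the stop-loss fills exactly at its level (FR18).
    Illustrative only: ignores fees, slippage, and funding."""
    return position_size * abs(entry_price - stop_loss_price)

def stop_distance_atr(current_price: float, stop_loss_price: float,
                      atr: float) -> float:
    """Distance to the stop-loss expressed in ATR units (FR20)."""
    return abs(current_price - stop_loss_price) / atr

# 0.5 ETH long entered at 2000 USDT with stop at 1900:
print(capital_at_risk(0.5, 2000.0, 1900.0))     # 50.0 USDT at risk
print(stop_distance_atr(1980.0, 1900.0, 40.0))  # 2.0 ATRs of headroom
```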

Notification & Alerting

FR23: Operator can receive exit state notifications via push notification service
FR24: System can rate-limit notifications based on exit state urgency (WARNING: 4h, LATEST_ACCEPTABLE: 2h, MANDATORY: 1h)
FR25: System can include position risk context in notifications (capital at risk, profit give-back, stop-loss distance)
FR26: System can include regime metrics in notifications (confidence, verdict, triggering conditions)
FR27: System can track notification delivery status (sent, delivered, failed)
FR28: System can prevent duplicate notifications for unchanged exit states
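
The rate limits in FR24 and the duplicate suppression in FR28 combine into a single check per evaluation cycle. A sketch with illustrative function and constant names (NORMAL is assumed to never notify):

```python
from datetime import datetime, timedelta
from typing import Optional

# Minimum interval between repeat notifications per exit state (FR24).
RATE_LIMITS = {
    "WARNING": timedelta(hours=4),
    "LATEST_ACCEPTABLE_EXIT": timedelta(hours=2),
    "MANDATORY_EXIT": timedelta(hours=1),
}

def should_notify(state: str, last_sent: Optional[datetime],
                  now: datetime) -> bool:
    """Suppress duplicate notifications for an unchanged exit state (FR28)
    until that state's rate-limit window has elapsed."""
    if state not in RATE_LIMITS:   # NORMAL and unknown states never notify
        return False
    if last_sent is None:          # first notification for this state
        return True
    return now - last_sent >= RATE_LIMITS[state]

now = datetime(2026, 2, 1, 12, 0)
print(should_notify("WARNING", now - timedelta(hours=3), now))        # still inside 4h window
print(should_notify("MANDATORY_EXIT", now - timedelta(hours=2), now)) # 1h window elapsed
```

A state *transition* (e.g. WARNING to MANDATORY_EXIT) would reset `last_sent`, so escalations notify immediately rather than waiting out the old window.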

Audit & Decision Tracking

FR29: System can create immutable decision records with timestamps, regime state, and exit recommendations
FR30: System can commit decision records to version-controlled storage before sending notifications
FR31: System can track configuration version (Git hash) used for each decision
FR32: System can track system image version used for each decision
FR33: Operator can query historical decision records by date range, symbol, or exit state
FR34: System can track operator actions (grid stopped, grid started, exit declined) separately from system recommendations
FR35: System can maintain separation between “system recommendation quality” and “operator action quality”
FR36: System can provide complete audit trail for investor scrutiny
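
A decision record satisfying FR29-FR32 and FR34 might look like the YAML below. Every field name is hypothetical, shown only to make the traceability requirements concrete; the actual schema is a design decision for Phase 2.

```yaml
# Illustrative decision record, committed to Git BEFORE notification (FR30).
timestamp: "2026-02-01T14:00:03Z"
symbol: ETH-USDT
regime:
  verdict: TRANSITION
  confidence: 0.64
exit_state: WARNING
triggering_conditions: [adx_rising, efficiency_ratio_elevated]
config_version: "a1b2c3d"            # Git hash of config used (FR31)
image_version: "grid-exit:0.4.2"     # system image version (FR32)
operator_action: null                # recorded separately from the
                                     # recommendation (FR34/FR35)
```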

Validation & Analysis

FR37: System can replay historical metrics for backtesting exit strategy
FR38: System can calculate Profit Retention Ratio (PRR) from historical data
FR39: System can calculate Stop-Loss Avoidance Rate (SLAR) from historical data
FR40: System can calculate True Transition Detection Rate (TTDR) from historical data
FR41: System can calculate False Exit Rate (FER) from historical data
FR42: System can calculate Exit Reaction Time (ERT) when operator action data available
FR43: System can generate KPI reports for specified time periods
FR44: System can identify false positive exits (regime returned to RANGE after exit)
FR45: System can identify false negative exits (regime transitioned but no exit signal)
FR46: Operator can compare backtesting results against live operation results
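
FER (FR41, using the false-positive definition in FR44) and PRR (FR38) are straightforward ratios. A sketch with illustrative field names; the real backtester would derive these from replayed decision records:

```python
def false_exit_rate(exits: list[dict]) -> float:
    """FER: share of exits where the regime returned to a RANGE state
    after the exit (false positive per FR44). 'regime_after' is an
    illustrative field name."""
    if not exits:
        return 0.0
    false_exits = sum(1 for e in exits
                      if e["regime_after"].startswith("RANGE"))
    return false_exits / len(exits)

def profit_retention_ratio(profit_at_exit: float,
                           peak_profit: float) -> float:
    """PRR: profit actually banked at exit versus peak open profit before
    the transition (FR38). 1.0 = exited at the top; target is >= 0.75."""
    if peak_profit <= 0:
        return 0.0
    return profit_at_exit / peak_profit

exits = [{"regime_after": "RANGE_OK"}, {"regime_after": "TREND"},
         {"regime_after": "TREND"},    {"regime_after": "RANGE_WEAK"}]
print(false_exit_rate(exits))           # 0.5 -- above the 30% target
print(profit_retention_ratio(80, 100))  # 0.8 -- meets the >=75% target
```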

System Operations

FR47: System can execute regime evaluation on scheduled intervals (hourly)
FR48: System can validate configuration schema on startup
FR49: System can retry failed Git push operations on subsequent evaluation cycles
FR50: System can generate static HTML dashboards with embedded visualizations
FR51: System can track evaluation execution time and log performance warnings
FR52: System can send operational metrics to logging infrastructure (Grafana Loki)
FR53: System can handle partial metric calculation failures (continue with available metrics)
FR54: System can log metric calculation errors in decision records
FR55: Operator can override configuration via environment variables
FR56: System can detect and alert on persistent API failures (3+ consecutive failures)

Non-Functional Requirements

Performance

NFR-P1: Regime evaluation completes within 5 minutes (allows 55-minute buffer before next hourly cycle)
NFR-P2: Evaluation time exceeding 1 minute triggers WARNING log entry
NFR-P3: Evaluation time exceeding 5 minutes triggers ERROR log entry and operator alert
NFR-P4: Notification delivery latency <60 seconds from decision record creation
NFR-P5: Git commit and push operations complete within 10 seconds under normal conditions
NFR-P6: Historical data loading (12-24 hours of metrics) completes within 30 seconds

Reliability

NFR-R1: System availability ≥99% during validation phase (acceptable: ~7 hours downtime per month)
NFR-R2: CronJob execution success rate ≥98% (missed evaluations acceptable if isolated)
NFR-R3: Failed Git push operations retry automatically on subsequent evaluation cycles
NFR-R4: Failed KuCoin API calls retry up to 3 times with exponential backoff before declaring failure
NFR-R5: System continues exit state evaluation when position data unavailable (graceful degradation)
NFR-R6: Persistent failures (3+ consecutive cycles) trigger operator alerts
NFR-R7: Configuration errors detected on startup prevent deployment (fail-fast with rollback to previous version)
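
The retry policy in NFR-R4 is standard exponential backoff. A minimal sketch, not the system's actual client code:

```python
import time

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky API call up to max_attempts times with exponential
    backoff (NFR-R4): delays of base_delay, 2x, 4x, ... between attempts.
    Re-raises the last exception once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # declare failure after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```

Declared failures would then feed the consecutive-failure counter that drives the NFR-R6 operator alert.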

Security

NFR-S1: Exchange API keys stored in external secrets management (not in code/config files)
NFR-S2: API keys restricted with IP whitelist and no-withdrawal permissions
NFR-S3: Decision records stored in private Git repository with access limited to operator
NFR-S4: All API communications use HTTPS/TLS encryption
NFR-S5: Pushover notifications encrypted in transit
NFR-S6: No credentials or API keys logged in application logs or decision records
NFR-S7: Git repository authentication uses SSH keys (not HTTPS credentials)

Integration

NFR-I1: System tolerates KuCoin API response times up to 5 seconds (retry if exceeded)
NFR-I2: KuCoin API rate limits are not exceeded (hourly cadence well within limits)
NFR-I3: Pushover API failures do not block evaluation completion
NFR-I4: Git push failures do not prevent decision record creation (stored locally, pushed later)
NFR-I5: System handles KuCoin API maintenance windows gracefully (uses last known data, alerts operator)
NFR-I6: Failed notification delivery tracked and retried on next evaluation cycle
NFR-I7: API integration errors include actionable context (error type, retry count, next action)

Data Integrity

NFR-D1: Decision records are immutable once committed to Git (no retroactive editing)
NFR-D2: All decision records include Git commit hash for configuration version traceability
NFR-D3: All decision records include Docker image version for code traceability
NFR-D4: Metric calculation errors logged in decision records (transparent failure tracking)
NFR-D5: Partial metric failures documented with specific metrics unavailable
NFR-D6: Timestamp precision to the second for all decision records and state transitions
NFR-D7: Operator actions tracked separately from system recommendations (no conflation)
NFR-D8: Historical data retained indefinitely for MVP (no automated deletion)