Architecture Decision Document

This document is built collaboratively through step-by-step discovery. Sections are appended as we work through each architectural decision together.

Project Context Analysis

Requirements Overview

System Purpose: The Market-Making System is a comprehensive algorithmic trading decision support platform that enables profitable grid trading at scale through regime-aware monitoring, tiered exit protection, and systematic capital management. The system prioritizes capital preservation over profit maximization, providing 24/7 market monitoring with human-in-loop execution to maintain operator control while benefiting from automated risk detection.

Functional Requirements:

Core Capabilities (20 major requirements identified, grouped below into 17 capability areas):

  1. Market Regime Classification (Req 1): Four-regime taxonomy (RANGE_OK, RANGE_WEAK, TRANSITION, TREND) with multi-timeframe analysis (1h primary + 4h confirmation) using 6 independent metrics
  2. Recommendation Engine (Req 2, 6, 7): Two recommendation types (GRID_MANAGE, GRID_SETUP) with constrained action sets and confidence-based escalation
  3. Confidence Scoring (Req 3): Conservative calibration with multiple penalties (time-based <36h, position-based, maturity, spacing) capped at 0.95
  4. Decision Record Management (Req 4): Git-backed immutable YAML files tracking recommendations, actions, and evaluations at 24h/72h/7d horizons
  5. Grid Configuration Management (Req 5, 16): YAML-based configurations with detailed parameters (price bounds, grid levels, amounts per grid, profit percentages) and version history
  6. Capital Management (Req 8): SINGLE_GRID mode with global reserve enforcement and unlocked balance calculations
  7. Automation Controls (Req 9): Asymmetric automation philosophy - auto-reduce risk (STOP_GRID), manual capital deployment (CREATE_GRID requires approval)
  8. Timeout & Cooldown Management (Req 10): Verdict-based timeouts (30min TRANSITION, 15min TREND) and action-based cooldowns (60min after stop, 120min after declined setup)
  9. API Security (Req 11): No withdrawal permissions, IP whitelist enforcement, trade permissions validation
  10. Exchange Integration (Req 12): Abstract Exchange_Interface with KuCoin as first implementation
  11. Trade Monitoring & Notifications (Req 13): Pushover API integration (direct), optional webhook support, rate-limited context-rich alerts
  12. Performance Evaluation (Req 14): Multi-horizon assessment (24h/72h/7d) with USD-denominated economic impact tracking
  13. Metrics Collection (Req 15): Hourly snapshots with minute-level price granularity stored in Git
  14. Grid Restart Gates (Req 17): Sequential three-gate evaluation (Directional Energy Decay → Mean Reversion Return → Tradable Volatility) required after stops
  15. Probationary Grid Management (Req 18): Conservative parameters (50-60% allocation, wider spacing) for the 0.60-0.80 confidence range
  16. Grid State Management (Req 19): History array as single source of truth (not separate enabled field)
  17. Historical Data Management (Req 20): Backfill tools and automated cleanup with retention policy enforcement

Grid Exit Strategy Feature (Current Development Focus - Phases 2-5):

  • Tiered exit states: NORMAL → WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT
  • Multi-metric consensus: WARNING requires 2+ conditions (prevents false alarms)
  • State transition tracking with rate limiting
  • Historical data loading for persistence checks
  • Position risk quantification with KuCoin integration
  • KPI tracking framework (7 metrics: SLAR, PRR, TTDR, FER, ERT, EAW, MEC)
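
The tiered exit states and the multi-metric consensus rule can be sketched as follows. This is a minimal illustration, not the project's actual API: the enum, condition names, and function are assumptions.

```python
from enum import Enum

class ExitState(Enum):
    """Tiered exit states in escalation order (Grid Exit Strategy feature)."""
    NORMAL = 0
    WARNING = 1
    LATEST_ACCEPTABLE_EXIT = 2
    MANDATORY_EXIT = 3

def evaluate_warning(conditions: dict[str, bool]) -> bool:
    """WARNING requires 2+ independent conditions to agree (multi-metric consensus).

    Condition names (e.g. "adx_rising") are illustrative; the real trigger
    modules would supply their own boolean signals.
    """
    return sum(conditions.values()) >= 2
```

The consensus threshold is tunable: raising it to 3+ conditions trades sensitivity for a lower false exit rate.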

Non-Functional Requirements:

Performance:

  • Evaluation cadence: 1-hour (current), 15-minute (planned post-validation)
  • Target response time: <30 seconds per evaluation cycle
  • Warning threshold: >1 minute evaluation time triggers logging
  • Error threshold: >5 minutes evaluation time triggers alerts
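
The warning and error thresholds can be enforced with a simple timing wrapper around the evaluation cycle. This is an illustrative sketch; the wrapper name and logger setup are assumptions.

```python
import logging
import time

logger = logging.getLogger("evaluation")

WARN_THRESHOLD_S = 60    # >1 minute: log a warning
ERROR_THRESHOLD_S = 300  # >5 minutes: log at error level (drives alerting)

def timed_evaluation(evaluate):
    """Run one evaluation cycle and apply the duration thresholds above."""
    start = time.monotonic()
    result = evaluate()
    elapsed = time.monotonic() - start
    if elapsed > ERROR_THRESHOLD_S:
        logger.error("Evaluation took %.1fs (alert threshold exceeded)", elapsed)
    elif elapsed > WARN_THRESHOLD_S:
        logger.warning("Evaluation took %.1fs (warning threshold exceeded)", elapsed)
    return result
```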

Reliability:

  • Graceful degradation on API failures (retry 2-3x with exponential backoff)
  • Git push failures: log locally, continue operation, retry on next cycle
  • Circuit breaker patterns for persistent failures (3+ consecutive)
  • 24/7 availability via Kubernetes CronJob scheduling

Security:

  • KuCoin API keys: No withdrawal permissions, IP whitelist mandatory
  • Kubernetes secrets via ExternalSecrets integration
  • Private Git repository for decision records (SSH authentication)
  • VPN-only access to decision interface (no public exposure for MVP)

Audit & Compliance:

  • Immutable decision records (Git-backed, no retroactive editing)
  • Configuration versioning (Git commit hash tracked in every decision)
  • Separation of recommendation quality vs action quality tracking
  • Investor-grade audit trail for external capital readiness (£100K+ scaling target)

Scalability:

  • Current: Single grid, single symbol (ETH-USDT), £1K capital
  • Near-term: £10K capital (3-4 months post-Phase 5 validation)
  • Long-term: Multi-grid, multi-symbol, £100K+ external capital
  • Architecture designed for multi-exchange extension

Data Management:

  • Git-as-database pattern (no traditional database required)
  • PVC-backed Git repository in Kubernetes (persists across pod restarts)
  • Minute-level price granularity (up to 540 days 4h data)
  • Automated cleanup with retention policy (to be defined post-MVP)

Observability:

  • Grafana Loki integration for all logs and metrics
  • Structured logging with error context
  • Performance metrics (API response times, evaluation duration, error rates)
  • State transition tracking (all exit state changes logged)

Scale & Complexity

Complexity Assessment: HIGH

Justification:

  • Multi-subsystem platform: 12 major subsystems comprising 40+ modules
  • Complex state management (regime classification, exit states, gate sequencing)
  • Financial system integration with real capital at risk
  • Multi-metric consensus logic requiring 2+ conditions
  • Sequential gate evaluation with forced progression
  • Multi-timeframe coordination (1h/4h synchronization)
  • Immutable audit trail requirements for investor credibility
  • ~180-250 hours remaining development effort across Phases 2-5

Primary Technical Domain: Financial Technology - Algorithmic Trading Decision Support

Architecture Style: Event-driven batch processing with Git-based immutable audit logging

Project Context: Brownfield - extending existing regime detection system (Phase 1 complete) with exit strategy, position risk, testing, and operational capabilities

Current Maturity: ~40-50% complete toward MVP

  • ✅ Phase 1: Regime detection with 6 real metrics (60 tests passing)
  • 🚀 Phase 2: Exit strategy ready to start (50-70h effort)
  • 📋 Phase 3: Position risk planned (30-40h effort)
  • 📋 Phase 4: Testing & validation planned (40-50h effort)
  • 📋 Phase 5: Operational improvements planned (20-30h effort)

Estimated Architectural Components: 40+ modules across 12 major subsystems

Component Categories:

  • Data Ingestion (3 components): API clients, interface abstraction, backfill tools
  • Regime Analysis (4 components): 6 metric calculators, classifier, confidence scorer, multi-timeframe coordinator
  • Exit Strategy (4 components): 3 trigger modules, state tracker, rate limiter, historical loader
  • Grid Management (4 components): Config manager, state determiner, probationary recommender, capital allocator
  • Restart Gates (4 components): 3 gate evaluators, sequential orchestrator
  • Decision & Audit (4 components): Record creator, action appender, evaluation appender, version tracker
  • Position & Risk (4 components): Position tracker, risk calculator, profit estimator, stop-loss monitor
  • Notification (4 components): Pushover client, webhook dispatcher, rate limiter, message builder
  • Metrics Collection (4 components): Snapshot collector, price aggregator, Git persistence, PVC handler
  • Dashboard (4 components): Visualization generator, HTML builder, dashboard packager, trend analyzer
  • Testing & Validation (4 components): Backtesting framework, KPI calculator, test suite, integration orchestrator
  • Infrastructure (5 components): CronJob manager, ExternalSecrets, ArgoCD, Loki logging, Docker builder

Technical Constraints & Dependencies

Hard Constraints:

  1. KuCoin API Limitation: Spot grids cannot be managed via API

    • Impact: Manual UI execution required (human-in-loop by necessity)
    • Benefit: Turned into design strength (regulatory simplicity, operator control)
    • Mitigation: System generates recommendations, human executes in KuCoin UI
  2. No Database Requirement: Git-as-database pattern enforced

    • Rationale: Simplicity, immutable audit trail, version control built-in
    • Impact: All state must serialize to YAML/JSON files in Git
    • Challenge: Historical data loading requires file I/O from Git PVC
  3. Personal Capital Only (MVP): No external investor funds until post-Phase 5

    • Current: £1K validation capital
    • Near-term: £10K personal capital (3-4 month target)
    • Long-term: £100K+ external capital (requires completed audit trail)
    • Impact: No regulatory compliance requirements for MVP
  4. API Security Requirements:

    • No withdrawal permissions (hard requirement)
    • IP whitelist enforcement (hard requirement)
    • Kubernetes secrets only (no config file credentials)
  5. Stateless Execution: Each CronJob run must be independent

    • Load state from Git PVC
    • Evaluate regime and exit conditions
    • Commit results to Git
    • Exit cleanly (no persistent processes)

Soft Constraints (Assumptions to Validate):

  1. 1-Hour Evaluation Cadence: Assumption that 12-24 hour warning windows exist for regime transitions

    • RAIA A001, A004: To be validated in Phase 4 backtesting
    • Fallback: Switch to 15-minute cadence if <80% transitions provide >2h warning
  2. 2+ Condition WARNING Trigger: Assumption that requiring 2+ metrics prevents false positives without missing true transitions

    • RAIA A005: Target False Exit Rate <30%
    • Tunable: Can increase to 3+ conditions if FER >30%
  3. False Positive Tolerance: Assumption that <30% false exit rate is acceptable

    • Economic impact: Missed ranging periods vs avoided stop-losses
    • To be measured via KPI framework in Phase 4-5
  4. Single Grid Sufficient: Assumption that one active grid at a time is sufficient for validation phase

    • Future: Multi-grid support for £10K+ capital scaling
    • Architecture designed for extension (grid_id tracking everywhere)

External Dependencies:

  1. KuCoin Exchange API:

    • Market data (OHLCV at multiple timeframes)
    • Account data (balance, positions)
    • Trade history (fills, orders)
    • Rate limits: 1-hour cadence well within limits
    • Availability: System must handle API outages gracefully
  2. Git Repository (market-maker-data):

    • Private repository for decision records and metrics
    • SSH authentication from Kubernetes pods
    • PVC-backed clone (persists across restarts)
    • Push failures acceptable (retry on next cycle)
  3. Pushover API:

    • Direct notification delivery (no n8n dependency for MVP)
    • Rate limiting: Built into application logic
    • Fallback: Log notifications if API unavailable
  4. Kubernetes Infrastructure:

    • CronJob scheduling (hourly execution)
    • PVC provisioning (Git repository storage)
    • ExternalSecrets integration (KuCoin API keys)
    • Grafana Loki (logging and observability)
    • ArgoCD (deployment automation)
  5. Optional: n8n Integration (Post-MVP):

    • Webhook endpoints for advanced orchestration
    • Multi-channel notification routing (Email, Slack, SMS)
    • Not required for core functionality

Technology Stack (Current):

  • Python 3.11+
  • Pydantic (schema validation)
  • GitPython (Git operations)
  • PyYAML (YAML parsing)
  • Requests (KuCoin API calls)
  • Chart.js (dashboard visualizations)
  • NumPy (metric calculations)
  • Pytest (testing framework - 60 tests passing)

Cross-Cutting Concerns Identified

1. Configuration Management

  • YAML-based configuration with environment variable overrides for Kubernetes
  • Schema validation on load (fail fast if invalid)
  • Git commit hash versioning (every decision references config version)
  • Convention-based naming for env vars (e.g., MARKET_MAKER_DATA_REPOSITORY_BASE_PATH)
  • Blue-green deployment support (invalid config keeps previous version running)
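
The environment-variable override convention can be sketched as below. `apply_override` and `apply_env_overrides` are hypothetical helpers (not the project's actual loader) that greedily match variable segments against existing underscored config keys; the config dict itself would come from `yaml.safe_load`.

```python
import os

PREFIX = "MARKET_MAKER_"

def apply_override(config: dict, segments: list[str], value: str) -> None:
    """Greedily match env-var segments against existing (underscored) config keys.

    e.g. MARKET_MAKER_DATA_REPOSITORY_BASE_PATH -> config["data_repository"]["base_path"]
    """
    node = config
    while segments:
        # Try the longest joined prefix first, so "data_repository" wins over "data".
        for cut in range(len(segments), 0, -1):
            key = "_".join(segments[:cut]).lower()
            if key in node and (cut == len(segments) or isinstance(node[key], dict)):
                if cut == len(segments):
                    node[key] = value
                    return
                node, segments = node[key], segments[cut:]
                break
        else:
            return  # no matching key: ignore unknown override rather than guess

def apply_env_overrides(config: dict) -> dict:
    """Apply all MARKET_MAKER_* overrides to a loaded config dict."""
    for name, value in os.environ.items():
        if name.startswith(PREFIX):
            apply_override(config, name[len(PREFIX):].split("_"), value)
    return config
```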

2. Error Handling & Resilience

  • API retry logic: 2-3 attempts with exponential backoff
  • Circuit breakers: 3+ consecutive failures trigger alerts
  • Git push failures: Log locally, continue operation, retry on next cycle
  • Metric calculation errors: Continue with remaining metrics, include error context in notifications
  • Graceful degradation: Partial metrics acceptable if confidence calculable
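
The retry policy can be sketched as a small helper; `with_retries` is illustrative, not the project's actual implementation.

```python
import logging
import time

logger = logging.getLogger("resilience")

def with_retries(call, attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # retries exhausted; caller decides how to degrade
            delay = base_delay * (2 ** attempt)
            logger.warning("Attempt %d failed (%s); retrying in %.2fs",
                           attempt + 1, exc, delay)
            time.sleep(delay)
```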

3. Logging & Observability

  • Structured logging to Grafana Loki
  • Performance metrics: API response times, evaluation duration, error rates
  • State transition logging: All exit state changes captured
  • Alert thresholds: >1min evaluation (WARNING), >5min evaluation (ERROR)
  • Git push success/failure tracking

4. Schema Validation

  • Pydantic models for all data structures:
    • Metrics files (regime, confidence, detailed_analysis, gate_evaluation)
    • Exit state transitions (transitions array, last_notification timestamps)
    • Decision records (recommendation, action_records, evaluation_records)
    • Configuration (exit_rules, notifications, gates)
  • Pre-commit validation (reject invalid data before Git commit)
  • Runtime validation (load-time validation with clear error messages)

5. Testing Strategy

  • Unit tests: 90%+ coverage target (60 tests passing for Phase 1)
  • Integration tests: 5+ scenarios for state progression
  • Real data validation: Run against last 7 days of market-maker-data
  • Backtesting: 3-6 months historical validation in Phase 4
  • KPI tracking: 7 metrics (SLAR, PRR, TTDR, FER, ERT, EAW, MEC)

6. Data Retention & Cleanup

  • Current policy: Keep all data forever (MVP)
  • Estimated growth: 10-50 KB per evaluation × 24 hours × 365 days ≈ 87-438 MB/year
  • Action item: Revisit retention policy at end of MVP (RAIA log entry needed)
  • Monthly maintenance: Automated cleanup procedures (Req 20)

7. Security

  • API key management: Kubernetes ExternalSecrets integration
  • No withdrawal permissions: Enforced at API key level
  • IP whitelist: Required for KuCoin API access
  • Private Git repository: SSH authentication, no public access
  • Decision interface: VPN-only access (no OAuth for MVP)

8. Cooldown & Timeout Management

  • Per-grid cooldowns (not global):
    • 60 minutes after STOP_GRID action
    • 120 minutes after declined/expired Grid_Setup
  • Verdict-based timeouts:
    • 30 minutes for TRANSITION regime
    • 15 minutes for TREND regime
    • No timeout for RANGE_OK/RANGE_WEAK
  • Enforcement: State tracking in exit_states/{symbol}/{date}.yaml files

9. Rate Limiting

  • Notification rate limits per exit state:
    • WARNING: 4 hours minimum between same-state notifications
    • LATEST_ACCEPTABLE_EXIT: 2 hours minimum
    • MANDATORY_EXIT: 1 hour minimum
  • Purpose: Prevent notification fatigue while ensuring critical alerts break through
  • Implementation: last_notification timestamps in state transition files

10. Multi-Timeframe Coordination

  • Primary timeframe: 1 hour (decision timeframe)
  • Confirmation timeframe: 4 hours (structural validation)
  • Early warning context: 5min, 15min (not used for decisions)
  • Data requirements: 21d/1m, 120d/15m, 270d/1h, 540d/4h
  • Synchronization: Historical data loading must handle timeframe alignment

Starter Architecture Analysis

Project Foundation Context

This is a brownfield project with Phase 1 complete (~40-50% toward MVP). Rather than selecting a starter template, we’re documenting the established architecture that Phase 1 created, which serves as the “effective starter” for all future development (Phases 2-5).

Primary Technology Domain

Backend Batch Processing System - Financial Technology Decision Support

Not a web application, API service, or interactive system. This is a scheduled batch processor that:

  • Runs as Kubernetes CronJob (hourly execution currently)
  • Loads state from Git repository on PVC
  • Performs regime analysis and exit state evaluation
  • Commits results back to Git
  • Exits cleanly (stateless execution model)
  • Generates static HTML dashboards (no server required)
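
The execution model above can be summarized as a skeleton; every helper here is a placeholder standing in for the real subsystems, not the project's actual code.

```python
def load_state_from_git() -> dict:
    """Placeholder: read YAML/JSON state from the PVC-backed Git clone."""
    return {"symbol": "ETH-USDT"}

def evaluate(state: dict) -> dict:
    """Placeholder: 6-metric regime classification plus exit-state evaluation."""
    return {"regime": "RANGE_OK", "exit_state": "NORMAL"}

def commit_results(results: dict) -> None:
    """Placeholder: serialize results and make one atomic Git commit."""

def run_once() -> int:
    """One stateless CronJob execution: load -> evaluate -> commit -> exit."""
    state = load_state_from_git()
    results = evaluate(state)
    commit_results(results)
    return 0  # exit cleanly; the next run starts fresh from Git
```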

Established Architecture Pattern: “Git-Backed Batch Processor”

Phase 1 established a unique architectural pattern optimized for financial decision support with immutable audit trails:

Core Pattern: Event-driven batch processing + Git-as-database + Stateless execution

Why this pattern:

  • Immutability: Git provides version-controlled, tamper-evident audit trail (investor credibility requirement)
  • Simplicity: No database to operate, backup, or maintain
  • Reproducibility: Every evaluation can be replayed from Git history
  • Transparency: Decision records are human-readable YAML/JSON files
  • Resilience: Git push failures don’t crash system (retry on next cycle)

Technology Stack (Phase 1 Established)

Language & Runtime:

  • Python 3.11+ (strict version requirement)
  • Type hints throughout (enforced in code reviews)
  • No async/await (synchronous batch processing sufficient)

Core Dependencies:

# Data Validation & Serialization
pydantic>=2.0          # Schema validation for all data structures
pyyaml>=6.0           # YAML parsing for config and decision records
 
# Git Operations
GitPython>=3.1        # Git repository management
 
# API Integration
requests>=2.31        # KuCoin API calls (synchronous, retry logic built-in)
 
# Numerical Computing
numpy>=1.24           # Metric calculations (ADX, OU process, etc.)
 
# Testing
pytest>=7.4           # Test framework (60 tests passing)
pytest-cov>=4.1       # Coverage reporting (100% for Phase 1 metrics)

Visualization (Dashboard Generation):

  • Chart.js (JavaScript library, embedded in static HTML)
  • No server-side rendering
  • Self-contained HTML files with embedded data

Infrastructure:

  • Kubernetes CronJob (scheduling)
  • PVC (Persistent Volume Claim) for Git repository storage
  • ExternalSecrets (KuCoin API key injection)
  • Grafana Loki (structured logging)
  • ArgoCD (GitOps deployment)
  • Docker (containerization via GitHub Actions)

Project Structure & Organization

Established by Phase 1 (60 tests passing, production-ready):

metrics-service/
├── src/
│   ├── config/                    # Configuration management
│   │   ├── loader.py             # YAML config loading + env var overrides
│   │   └── validator.py          # Schema validation (fail fast on invalid)
│   │
│   ├── exchanges/                 # Exchange integration layer
│   │   ├── kucoin_client.py      # KuCoin API wrapper
│   │   └── exchange_interface.py # Abstract interface (future multi-exchange)
│   │
│   ├── regime/                    # Regime analysis engine (Phase 1 COMPLETE)
│   │   ├── engine.py             # Main regime classifier
│   │   ├── classifier.py         # 4-regime taxonomy logic
│   │   ├── confidence.py         # Conservative confidence scoring
│   │   ├── metrics/              # 6 metric calculators
│   │   │   ├── __init__.py
│   │   │   ├── adx.py           # Average Directional Index
│   │   │   ├── efficiency_ratio.py
│   │   │   ├── autocorrelation.py
│   │   │   ├── ou_process.py    # Ornstein-Uhlenbeck half-life
│   │   │   ├── slope.py         # Normalized slope
│   │   │   └── bollinger.py     # Bollinger Band bandwidth
│   │   └── gates/                # Restart gate evaluators
│   │       ├── gate1_energy.py  # Directional Energy Decay
│   │       ├── gate2_reversion.py # Mean Reversion Return
│   │       └── gate3_volatility.py # Tradable Volatility
│   │
│   ├── exit_strategy/             # Exit state engine (Phase 2 - IN PROGRESS)
│   │   ├── evaluator.py          # Main exit state evaluator (~30% complete)
│   │   ├── state_tracker.py     # State transition tracking
│   │   ├── triggers/             # Trigger modules (to be implemented)
│   │   │   ├── mandatory.py
│   │   │   ├── latest_acceptable.py
│   │   │   └── warning.py
│   │   └── history_loader.py    # Historical data loading
│   │
│   ├── grid/                      # Grid configuration management
│   │   ├── config_manager.py    # YAML-based grid definitions
│   │   ├── state_determiner.py  # History array-based state logic
│   │   ├── capital_allocator.py # SINGLE_GRID mode + reserve enforcement
│   │   └── probationary.py      # Conservative grid recommender
│   │
│   ├── metrics/                   # Metrics collection & storage
│   │   ├── collector.py          # Hourly snapshot collection
│   │   ├── aggregator.py        # Minute-level price aggregation
│   │   └── persistence.py       # Git commit/push operations
│   │
│   ├── interfaces/                # Abstract interfaces
│   │   ├── exchange.py           # Exchange abstraction
│   │   └── notification.py      # Notification abstraction
│   │
│   ├── spotcheck/                 # Utility modules
│   │   └── validators.py
│   │
│   ├── git_manager.py            # Git operations wrapper
│   └── init.py                   # Entry point orchestration
│
├── tests/                         # Test suite (60 tests passing Phase 1)
│   ├── regime/
│   │   ├── metrics/              # 8 tests per metric × 6 metrics = 48 tests
│   │   │   ├── test_adx.py
│   │   │   ├── test_efficiency_ratio.py
│   │   │   ├── test_autocorrelation.py
│   │   │   ├── test_ou_process.py
│   │   │   ├── test_slope.py
│   │   │   └── test_bollinger.py
│   │   └── test_classifier.py   # Integration tests
│   ├── grid/
│   │   └── test_state_determiner.py
│   └── conftest.py               # Shared fixtures
│
├── config/                        # Configuration files
│   ├── environment.yaml          # Default configuration
│   └── exit_strategy_config.yaml # Exit rules (Phase 2)
│
├── scripts/                       # Operational scripts
│   ├── send_regime_notifications.py # Pushover integration
│   └── collect_metrics.py        # Metrics collection orchestrator
│
├── infra/                         # Kubernetes manifests
│   └── metrics-service/
│       ├── cronjob.yaml          # CronJob definition
│       ├── pvc.yaml              # Git repository PVC
│       └── externalsecret.yaml   # API key injection
│
├── .venv/                         # Virtual environment (local dev)
├── pyproject.toml                # Dependencies + pytest config
├── Taskfile.yml                  # Task automation (local + CI/CD)
└── Dockerfile                    # Container image build

Key Organizational Patterns:

  1. Flat module structure: No deep nesting (max 2-3 levels)
  2. Feature-based organization: regime/, exit_strategy/, grid/ (not layered like models/, services/)
  3. Tests mirror src structure: tests/regime/metrics/test_adx.py mirrors src/regime/metrics/adx.py
  4. Interfaces separate from implementations: interfaces/ contains abstractions, exchanges/ contains implementations
  5. Single responsibility modules: Each .py file has one clear purpose

Architectural Decisions Provided by Phase 1 Foundation

Language & Type Safety:

  • Python 3.11+ with strict type hints (enforced)
  • Pydantic models for all data structures (runtime validation)
  • No dynamic typing or Any types (except when unavoidable)
  • Docstrings required for public functions

Configuration Management:

  • YAML-based configuration (human-readable, version-controlled)
  • Environment variable overrides with convention: MARKET_MAKER_<NESTED_KEY>
  • Schema validation on load (fail fast, never run with invalid config)
  • Git commit hash versioning (every decision references config version)

Data Persistence:

  • Git as primary data store (no database)
  • YAML for structured data (decision records, metrics)
  • YAML for state transitions (exit states, gate tracking); JSON reserved for raw data arrays
  • One file per evaluation/decision (atomic commits)
  • PVC-backed Git repository in Kubernetes

Testing Strategy:

  • Pytest as test framework
  • 100% coverage target for new code (achieved in Phase 1)
  • Test file naming: test_<module>.py
  • Shared fixtures in conftest.py
  • Integration tests in separate test_integration_*.py files

Error Handling:

  • Explicit exception handling (no bare except: clauses)
  • Retry logic with exponential backoff for API calls
  • Graceful degradation (partial metrics acceptable)
  • Structured logging with error context
  • Circuit breakers for persistent failures

Code Style:

  • Black formatting (enforced in CI/CD)
  • Isort for import organization
  • Pylint for code quality
  • Maximum line length: 100 characters
  • Docstring format: Google style

Git Workflow:

  • Feature branches: feature/<phase>-<description>
  • Conventional commits: [Phase N] <type>: <description>
  • Squash merges to main
  • No direct commits to main

Deployment Pattern:

  • Docker multi-stage builds (small final image)
  • GitHub Actions for CI/CD
  • ArgoCD for GitOps deployment
  • Blue-green deployment (invalid config keeps previous version)
  • Image tagging: <version> (semantic versioning)

Development Experience Features

Local Development:

  • Taskfile.yml for common operations:
    • task install - Set up virtual environment
    • task test - Run test suite
    • task lint - Run linters
    • task format - Auto-format code
    • task evaluate-regime - Run regime evaluation locally
    • task collect-metrics - Simulate CronJob execution

Hot Reloading:

  • Not applicable (batch processing, not web server)
  • Use task evaluate-regime to test changes locally

Debugging:

  • Standard Python debugger (pdb)
  • VSCode launch configurations provided
  • Logging to console in development mode

Testing Infrastructure:

  • pytest -v for verbose output
  • pytest --cov for coverage reporting
  • pytest -k <pattern> for selective test execution
  • Fixtures for mocked KuCoin API responses
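
A sketch of such a fixture. The row layout follows KuCoin's kline response order ([time, open, close, high, low, volume, turnover]); the values are synthetic and the names are assumptions.

```python
import pytest

def sample_kucoin_candles() -> list[list[str]]:
    """Synthetic KuCoin kline rows: [time, open, close, high, low, volume, turnover]."""
    return [
        ["1704067200", "2300.0", "2310.0", "2315.0", "2295.0", "120.5", "277150.0"],
        ["1704070800", "2310.0", "2305.0", "2318.0", "2301.0", "98.2", "226350.0"],
    ]

@pytest.fixture
def kucoin_candles():
    """Fixture form for tests (would live in conftest.py for sharing)."""
    return sample_kucoin_candles()
```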

Documentation:

  • Docstrings in code (Google style)
  • README.md with quick start guide
  • DEVELOPER-HANDOFF.md for new developers
  • Architecture decisions in this document

Extension Points for Future Development

For Phase 2 (Exit Strategy):

  • Add trigger modules in src/exit_strategy/triggers/
  • Follow established pattern: One trigger type per file
  • Add tests in tests/exit_strategy/triggers/
  • 8+ test cases per trigger (boundary conditions, edge cases)

For Phase 3 (Position Risk):

  • Add position tracker in src/position/
  • Implement PositionTrackerInterface (create new interface)
  • Add risk calculator in src/risk/
  • Follow KuCoin client pattern for API integration

For Phase 4 (Testing & Validation):

  • Add backtesting framework in src/backtest/
  • Add KPI calculator in src/kpi/
  • Create integration test scenarios in tests/integration/

For Phase 5 (Operational):

  • Enhance logging in existing modules
  • Add audit logger in src/audit/
  • Add KPI dashboard generator in src/dashboard/

Migration Notes for New Developers

Coming from Web Development:

  • This is NOT a web server - it’s a batch job
  • No HTTP requests to handle (except outbound API calls)
  • No real-time state - everything loads from Git each run
  • “Deploy” means update CronJob, not restart server

Coming from Database-Heavy Systems:

  • No database queries - read YAML/JSON files from Git
  • No migrations - schema changes via Pydantic model updates
  • No transactions - Git commits are atomic operations
  • No indexing - file-based lookups are sufficient at current scale

Coming from Microservices:

  • This is a monolith by design (simplicity > distributed complexity)
  • No service-to-service calls (except KuCoin API)
  • No message queues - Git repository is the “queue”
  • No service discovery - fixed CronJob schedule

Key Principles Established by Phase 1

  1. Simplicity over Cleverness: Straightforward code beats clever abstractions
  2. Immutability over Mutability: Append-only decision records, never edit
  3. Explicit over Implicit: Configuration visible, no magic defaults
  4. Type Safety over Dynamic: Pydantic validation, no Any types
  5. Testability over Speed: 100% coverage more important than micro-optimizations
  6. Capital Preservation over Profit Optimization: Safety-first decision logic
  7. Human-in-Loop over Full Automation: System recommends, human decides

What Phase 1 Proves

Technical Viability:

  • ✅ Git-as-database works at hourly evaluation scale
  • ✅ Pydantic validation catches config errors before deployment
  • ✅ Kubernetes CronJob scheduling reliable (no missed evaluations)
  • ✅ PVC-backed Git repository persists across pod restarts
  • ✅ 6 metric calculators produce trustworthy values (replaced hardcoded dummies)

Development Velocity:

  • ✅ 60 tests passing with 100% coverage (40-60 hour effort)
  • ✅ Clean module boundaries enable parallel development
  • ✅ Taskfile commands work identically in local + CI/CD
  • ✅ Type hints catch errors at development time

Operational Readiness:

  • ✅ Blue-green deployment prevents bad config from reaching production
  • ✅ Grafana Loki logging provides visibility into evaluation runs
  • ✅ Git commit history serves as complete audit trail
  • ✅ ExternalSecrets integration keeps API keys secure

What This Means for Phases 2-5:

Follow the patterns established in Phase 1. Don’t reinvent:

  • Module organization (feature-based, flat structure)
  • Testing approach (8+ tests per module, 100% coverage)
  • Configuration management (YAML + env var overrides)
  • Git operations (load → evaluate → commit → exit)
  • Error handling (retry logic, graceful degradation)

Add new capabilities by extending existing patterns:

  • New trigger modules → src/exit_strategy/triggers/
  • New gate evaluators → src/regime/gates/
  • New metric calculators → src/regime/metrics/
  • New integrations → src/<integration>/ with interface in src/interfaces/

Core Architectural Decisions

Decision Context

This section documents architectural decisions established by Phase 1 (60 tests passing, production-ready). These patterns are definitive for Phases 2-5 - extend them, don’t reinvent them.

Phase 1 proved these decisions work in production with real capital at risk. New development should follow these established patterns unless there’s a compelling reason to diverge (document exceptions in RAIA.md).

Decision Priority Analysis

Critical Decisions (Already Established by Phase 1):

  • Data persistence via Git (no traditional database)
  • Pydantic schema validation (fail fast on invalid data)
  • Stateless batch execution (load → evaluate → commit → exit)
  • Python 3.11+ with strict type hints
  • Kubernetes CronJob deployment model
  • 100% test coverage for new code

Important Decisions (Already Established by Phase 1):

  • YAML for configuration, decisions, and state
  • JSON only for raw data arrays
  • ExternalSecrets for API key management
  • Grafana Loki for observability
  • ArgoCD for GitOps deployment
  • Black + Isort + Pylint for code quality

Deferred Decisions (Post-MVP):

  • Data retention policy (currently keep everything forever)
  • Multi-exchange support (KuCoin only for MVP)
  • Multi-grid concurrent execution (single grid only for MVP)
  • 15-minute evaluation cadence (hourly for MVP, validate 12-24h warning window assumption first)
  • Automated grid creation (manual approval required for MVP)

Data Architecture

Decision: Git-as-Database Pattern

  • Choice: Git repository as primary data store (no PostgreSQL, MongoDB, etc.)
  • Version: GitPython 3.1+
  • Rationale:
    • Immutable audit trail (version-controlled, tamper-evident)
    • Simplicity (no database to operate, backup, or maintain)
    • Reproducibility (every evaluation can be replayed from Git history)
    • Transparency (decision records are human-readable YAML files)
  • Affects: All subsystems (regime, exit strategy, grid management, metrics collection)
  • Provided by: Phase 1 design decision
  • Production Proven: ✅ Yes (Phase 1 complete)

Decision: Pydantic Schema Validation

  • Choice: Pydantic 2.0+ for all data structure validation
  • Version: pydantic>=2.0
  • Rationale:
    • Runtime validation (catch config errors before Git commit)
    • Type safety (strict typing with validation)
    • Schema evolution (versioned models)
    • Developer experience (clear error messages)
  • Affects: Configuration loading, metrics files, decision records, exit state transitions
  • Pattern: One Pydantic model per file type
    # src/schemas/metrics.py
    class MetricsFile(BaseModel):
        symbol: str
        timestamp: datetime
        regime: RegimeType
        confidence: float = Field(ge=0.0, le=1.0)
        # ... rest of schema
  • Production Proven: ✅ Yes (Phase 1 complete)

Decision: YAML for Recommendations/State, JSON ONLY for Raw Data

  • Choice: YAML for all recommendations, decisions, and state tracking. JSON ONLY for raw data arrays.
  • Version: pyyaml>=6.0
  • Rationale:
    • YAML: Human-readable, supports comments, better for ALL state and recommendations
    • JSON: Machine-readable, better ONLY for raw data arrays (price data, etc.)
    • Both validated via Pydantic models (no raw dict manipulation)
  • File Structure:
    market-maker-data/
    ├── metrics/{symbol}/{YYYY-MM-DD}-{HH}.yaml     # YAML (regime, confidence, analysis)
    ├── decisions/{YYYY-MM-DD}/dec-{symbol}-{HHMMSS}.yaml  # YAML (recommendations)
    ├── exit_states/{symbol}/{YYYY-MM-DD}.yaml      # YAML (state transitions)
    └── raw_data/{symbol}/{YYYY-MM-DD}.json         # JSON (only for raw price arrays if needed)
    
  • Production Proven: ✅ Yes (Phase 1 complete)
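
As a hedged illustration of the "YAML in, validated Pydantic model out" rule (the DecisionRecord fields below are hypothetical, not the production schema):

```python
from datetime import datetime

import yaml
from pydantic import BaseModel


class DecisionRecord(BaseModel):
    symbol: str
    timestamp: datetime
    recommendation: str


def load_decision(yaml_text: str) -> DecisionRecord:
    # safe_load then validate: an invalid file fails fast here,
    # before anything reaches the Git commit step.
    return DecisionRecord(**yaml.safe_load(yaml_text))


record = load_decision(
    "symbol: ETH-USDT\n"
    "timestamp: 2026-02-02T14:00:00Z\n"
    "recommendation: GRID_MANAGE\n"
)
```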

Decision: Schema Evolution Strategy

  • Choice: Pydantic model versioning with backward-compatible reads
  • Pattern:
    class MetricsFileV2(BaseModel):
        schema_version: str = "2.0"
        # ... new fields with defaults
        
    def load_metrics_file(path: Path) -> MetricsFileV2:
        raw_data = yaml.safe_load(path.read_text())
        if raw_data.get("schema_version") == "1.0":
            raw_data = migrate_v1_to_v2(raw_data)
        return MetricsFileV2(**raw_data)
  • Migration Strategy: Write migration functions, no automated backfill (lazy migration on read)
  • Rationale: Preserves immutability of historical records while allowing schema evolution
  • Deferred to: Phase 4 (first real schema change expected)

Decision: Historical Data Loading & Caching

  • Choice: Load from Git on demand, in-memory caching within single evaluation run
  • Pattern:
    class HistoryLoader:
        def __init__(self, git_repo_path: Path):
            self._repo_path = git_repo_path  # kept for on-demand Git reads
            self._cache: Dict[str, MetricsFile] = {}
            
        def load_last_n_hours(self, symbol: str, n: int) -> List[MetricsFile]:
            # Load from Git, cache in memory for this run
            # No persistent cache (stateless execution)
  • Rationale: Stateless execution model means no persistent cache, in-memory cache sufficient for single run
  • Performance: 24 hourly files × ~50KB each ≈ 1.2MB per load (acceptable for hourly evaluation)
  • Implementation Phase: Phase 2 (exit strategy needs historical data)

Decision: Multi-Timeframe Data Synchronization

  • Choice: Primary timeframe (1h) loads 4h confirmation data as needed
  • Pattern: Each metric calculator specifies required timeframes, engine loads all required data upfront
  • Rationale: Simpler than streaming/incremental loading, sufficient for batch processing
  • Already Implemented: ✅ Yes (Phase 1 loads 1h + 4h data for regime classification)
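
A sketch of the "each calculator declares its required timeframes" pattern; the Protocol and class names are assumptions for illustration:

```python
from typing import Dict, List, Protocol


class MetricCalculator(Protocol):
    required_timeframes: List[str]

    def calculate(self, data: Dict[str, list]) -> float: ...


class ADXCalculator:
    required_timeframes = ["1h", "4h"]  # 1h primary + 4h confirmation

    def calculate(self, data: Dict[str, list]) -> float:
        return 0.0  # placeholder for the real ADX math


def timeframes_to_load(calculators: List[MetricCalculator]) -> List[str]:
    """Union of every calculator's requirements, so the engine can load
    all required data once, up front, before evaluation starts."""
    seen: List[str] = []
    for calc in calculators:
        for tf in calc.required_timeframes:
            if tf not in seen:
                seen.append(tf)
    return seen
```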

State Management

Decision: Exit State Persistence

  • Choice: Daily YAML files per symbol with transitions array
  • File Structure: exit_states/{symbol}/{YYYY-MM-DD}.yaml
  • Schema:
    symbol: ETH-USDT
    grid_id: eth-grid-1
    date: 2026-02-02
    transitions:
      - timestamp: "2026-02-02T14:00:00Z"
        from_state: NORMAL
        to_state: WARNING
        reasons:
          - "Condition 1"
          - "Condition 2"
        metrics:
          adx: 28.5
          efficiency_ratio: 0.62
    last_notification:
      WARNING: "2026-02-02T14:00:00Z"
      LATEST_ACCEPTABLE_EXIT: null
      MANDATORY_EXIT: null
  • Rationale: YAML for state transitions (human-readable audit trail), daily files keep file sizes manageable
  • Implementation Phase: Phase 2

Decision: Gate Evaluation Tracking

  • Choice: Embedded in metrics YAML files (not separate state)
  • Location: gate_evaluation section inside metrics/{symbol}/{YYYY-MM-DD}-{HH}.yaml
  • Rationale: Gate status is part of regime analysis output, co-locating with metrics simplifies loading
  • Already Implemented: ⚠️ Partial (structure defined in SCHEMA.md, implementation pending Phase 2)

Decision: Rate Limiting State

  • Choice: Store in exit state transitions file (last_notification timestamps)
  • Pattern: Check last_notification[state] timestamp; notify only if the current time is at or past last_notification[state] + rate-limit threshold
  • Rationale: Co-locating with exit state transitions keeps related data together
  • Rate Limits:
    • WARNING: 4 hours minimum
    • LATEST_ACCEPTABLE_EXIT: 2 hours minimum
    • MANDATORY_EXIT: 1 hour minimum
  • Implementation Phase: Phase 2
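
The rate-limit check can be sketched as a pure function over the last_notification map from the exit state file (the function name is illustrative):

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Minimum interval between repeated notifications per exit state
RATE_LIMITS = {
    "WARNING": timedelta(hours=4),
    "LATEST_ACCEPTABLE_EXIT": timedelta(hours=2),
    "MANDATORY_EXIT": timedelta(hours=1),
}


def should_notify(
    state: str,
    last_notification: Dict[str, Optional[datetime]],
    now: datetime,
) -> bool:
    """True if this state has never been notified, or the rate-limit
    window has fully elapsed since the last notification."""
    last = last_notification.get(state)
    if last is None:
        return True
    return now - last >= RATE_LIMITS[state]
```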

Decision: Grid State Determination

  • Choice: History array as single source of truth (not separate enabled field)
  • Pattern:
    def is_grid_running(grid_config: Dict) -> bool:
        history = grid_config.get("history", [])
        if not history:
            return False
        last_entry = history[-1]
        return "enabled" in last_entry and "disabled" not in last_entry
  • Rationale: Single source of truth, no conflicting state
  • Already Implemented: ✅ Yes (Phase 1 complete, Requirement 19)

Error Recovery & Resilience

Decision: Partial Metric Failure Handling

  • Choice: Continue with N/6 metrics if ≥4 available, abort if <4
  • Pattern:
    try:
        adx = calculate_adx(data)
    except MetricCalculationError as e:
        logger.error(f"ADX calculation failed: {e}")
        adx = None  # Continue with None, confidence scorer handles missing metrics
     
    if available_metrics < 4:
        raise InsufficientMetricsError("Cannot calculate confidence with <4 metrics")
  • Rationale: Graceful degradation, regime classification still possible with partial metrics
  • Already Implemented: ✅ Yes (Phase 1 handles missing metrics gracefully)

Decision: Git Conflict Resolution

  • Choice: Retry with pull + merge (automatic for non-conflicting)
  • Pattern:
    try:
        repo.git.push()
    except GitCommandError:
        repo.git.pull(rebase=True)  # Rebase our commit on top
        repo.git.push()
  • Rationale: Conflicts unlikely (single CronJob instance, hourly execution), automatic retry sufficient
  • Escalation: If retry fails, log error, continue operation (commit on next cycle)
  • Already Implemented: ✅ Yes (Phase 1 git_manager.py)

Decision: API Circuit Breaker

  • Choice: 3 consecutive failures → circuit OPEN (stop trying for 30 minutes)
  • Pattern:
    class KuCoinClient:
        def __init__(self):
            self._failure_count = 0
            self._circuit_open_until = None
        
        def fetch_ohlcv(self, symbol, timeframe):
            if self._circuit_open_until and now() < self._circuit_open_until:
                raise CircuitBreakerOpenError()
            
            try:
                result = self._api_call(...)
                self._failure_count = 0  # Reset on success
                return result
            except APIError:
                self._failure_count += 1
                if self._failure_count >= 3:
                    self._circuit_open_until = now() + timedelta(minutes=30)
                raise
  • Rationale: Prevents hammering failed API, 30-minute timeout allows temporary outages to clear
  • Implementation Phase: Phase 3 (position tracking adds more API calls)

Decision: API Retry Logic

  • Choice: Exponential backoff, 2-3 attempts max
  • Pattern:
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        retry=retry_if_exception_type(APIError)
    )
    def fetch_with_retry(self, endpoint, params):
        return self._http_get(endpoint, params)
  • Library: tenacity (add to dependencies)
  • Rationale: Handles transient network issues without manual retry logic
  • Already Implemented: ⚠️ Partial (manual retry in some places, not consistent)
  • Standardization Phase: Phase 3 (apply tenacity library consistently)

Notification Architecture

Decision: Notification Priority Mapping

  • Choice: Exit states map to Pushover priority levels
  • Mapping:
    PRIORITY_MAP = {
        ExitState.NORMAL: -1,              # Low priority (quiet notification)
        ExitState.WARNING: 0,              # Normal priority
        ExitState.LATEST_ACCEPTABLE_EXIT: 1,  # High priority (bypass quiet hours)
        ExitState.MANDATORY_EXIT: 2,       # Emergency priority (requires acknowledgment)
    }
  • Rationale: Escalating urgency matches exit state severity, MANDATORY_EXIT requires explicit ack
  • Implementation Phase: Phase 2
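
A hedged sketch of how this map could feed the Pushover messages API. One real API constraint worth noting: Pushover's emergency priority (2) requires retry and expire parameters so the alert repeats until acknowledged; the retry/expire values below are illustrative choices, not prescribed ones:

```python
import requests


def build_pushover_payload(token: str, user: str, message: str, priority: int) -> dict:
    payload = {
        "token": token,
        "user": user,
        "message": message,
        "priority": priority,
    }
    if priority == 2:
        # Emergency priority: Pushover re-alerts every `retry` seconds
        # until acknowledged, or until `expire` seconds have passed.
        payload["retry"] = 60
        payload["expire"] = 3600
    return payload


def send_pushover(token: str, user: str, message: str, priority: int) -> dict:
    resp = requests.post(
        "https://api.pushover.net/1/messages.json",
        data=build_pushover_payload(token, user, message, priority),
    )
    resp.raise_for_status()
    return resp.json()
```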

Decision: Multi-Channel Strategy

  • Choice: Direct Pushover integration for MVP, optional n8n webhooks for future
  • Pattern:
    class NotificationDispatcher:
        def __init__(self, pushover_client, webhook_url=None):
            self.pushover = pushover_client
            self.webhook_url = webhook_url  # Optional
        
        def send(self, notification):
            self.pushover.send(notification)  # Always send to Pushover
            if self.webhook_url:
                requests.post(self.webhook_url, json=notification.dict())  # Optional webhook
  • Rationale: Pushover sufficient for MVP (direct, reliable), n8n adds flexibility post-MVP
  • Already Implemented: ✅ Pushover integration (scripts/send_regime_notifications.py)
  • Future Enhancement: Phase 5 (n8n integration for multi-channel routing)

Decision: Notification Templating

  • Choice: Templates in code (not config files)
  • Pattern:
    def build_exit_state_message(state: ExitState, regime: Dict, grid: Dict) -> str:
        if state == ExitState.MANDATORY_EXIT:
            return f"🚨 MANDATORY EXIT: {grid['id']}\n" \
                   f"Regime: {regime['verdict']}\n" \
                   f"Stop grid immediately."
        elif state == ExitState.LATEST_ACCEPTABLE_EXIT:
            return f"⚠️ LATEST ACCEPTABLE EXIT: {grid['id']}\n" \
                   f"Exit within 12-24 hours to preserve capital."
        # ... etc
  • Rationale: Templates are code logic (conditional rendering), not configuration
  • Deferred to Config: Phase 5 if user wants customizable templates
  • Implementation Phase: Phase 2

Testing & Validation Architecture

Decision: Backtesting Data Format

  • Choice: Replay actual Git history (no synthetic data for backtesting)
  • Pattern:
    class BacktestRunner:
        def run(self, start_date: date, end_date: date):
            # Load actual metrics files from Git history
            for metrics_file in self.load_metrics_range(start_date, end_date):
                # Re-evaluate exit strategy against historical data
                exit_state = self.evaluator.evaluate(metrics_file)
                # Compare against actual actions taken (from decision records)
  • Rationale:
    • Real data (no synthetic data bias)
    • Tests actual Git loading code
    • Validates against real market conditions
  • Implementation Phase: Phase 4

Decision: KPI Calculation Frequency

  • Choice: Batch calculation (daily aggregation)
  • Pattern:
    # Run as separate CronJob (daily at midnight)
    class KPICalculator:
        def calculate_daily_kpis(self, date: date):
            # Load all exit state transitions for date
            # Load all decision records for date
            # Calculate KPIs (SLAR, PRR, TTDR, etc.)
            # Write to kpis/{YYYY-MM-DD}.yaml
  • Rationale: KPIs are lagging indicators (don’t need real-time), daily batch sufficient
  • Implementation Phase: Phase 5

Decision: Test Data Generation

  • Choice: Replay actual Git history (no mocking for integration tests)
  • Pattern:
    @pytest.fixture
    def last_7_days_metrics(git_repo):
        # Load actual metrics files from last 7 days
        return HistoryLoader(git_repo).load_last_n_days("ETH-USDT", 7)
     
    def test_exit_state_progression_real_data(last_7_days_metrics):
        # Test against real historical data
        for metrics in last_7_days_metrics:
            exit_state = evaluator.evaluate(metrics)
            # Assert reasonable exit states (no wild oscillations)
  • Rationale: Real data validates production behavior, catches edge cases mocks miss
  • Already Implemented: ⚠️ Partial (some tests use real data, not standardized)
  • Standardization Phase: Phase 4

Infrastructure & Deployment

Decision: Hosting Strategy

  • Choice: Self-hosted Kubernetes cluster (not cloud provider)
  • Rationale:
    • Full control over infrastructure
    • No cloud provider costs
    • Already deployed and working (Phase 1 production)
  • Components:
    • Kubernetes CronJob (evaluation scheduling)
    • PVC (Git repository persistence)
    • ExternalSecrets (API key injection)
    • Grafana Loki (logging)
  • Already Implemented: ✅ Yes (Phase 1 complete)

Decision: CI/CD Pipeline

  • Choice: GitHub Actions → Docker build → GHCR → ArgoCD GitOps
  • Pipeline:
    1. GitHub Actions: Run tests, lint, build Docker image
    2. Push to GHCR (GitHub Container Registry)
    3. Update image tag in infra/ manifests
    4. ArgoCD detects change, deploys to Kubernetes
  • Rationale:
    • GitHub Actions free for public repos
    • GHCR integrated with GitHub
    • ArgoCD provides GitOps deployment
    • Blue-green deployment (rollback on invalid config)
  • Already Implemented: ✅ Yes (Phase 1 complete)

Decision: Environment Configuration

  • Choice: YAML base + environment variable overrides
  • Pattern:
    # config/environment.yaml
    kucoin:
      api_key: "${KUCOIN_API_KEY}"  # Injected via ExternalSecrets
      api_secret: "${KUCOIN_API_SECRET}"
     
    repository:
      base_path: "${MARKET_MAKER_DATA_REPOSITORY_BASE_PATH}"  # Overridable
  • Convention: MARKET_MAKER_<NESTED_KEY> for env var names
  • Rationale:
    • YAML provides defaults
    • Env vars allow Kubernetes overrides
    • ExternalSecrets injects secrets securely
  • Already Implemented: ✅ Yes (Phase 1 complete, Requirement 15)
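
One possible sketch of the ${VAR} substitution step; the actual loader may differ (here, unset variables are left untouched rather than raising):

```python
import os
import re
from typing import Any

_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")


def resolve_env(value: Any) -> Any:
    """Recursively substitute ${VAR} placeholders from the environment,
    so YAML provides defaults and Kubernetes env vars override them."""
    if isinstance(value, dict):
        return {k: resolve_env(v) for k, v in value.items()}
    if isinstance(value, str):
        # Keep the placeholder as-is when the variable is unset
        return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    return value
```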

Decision: Monitoring and Logging

  • Choice: Grafana Loki for structured logging (no separate APM tool)
  • Pattern:
    logger.info(
        "Regime evaluation complete",
        extra={
            "symbol": symbol,
            "regime": regime.verdict,
            "confidence": regime.confidence,
            "duration_ms": duration,
        }
    )
  • Rationale:
    • Structured logging sufficient for batch processing
    • No need for distributed tracing (single monolith)
    • Loki already deployed (Phase 1)
  • Alert Thresholds:
    • Evaluation duration >1 minute: WARNING
    • Evaluation duration >5 minutes: ERROR
    • Git push failure: WARNING
    • 3+ consecutive API failures: ERROR
  • Already Implemented: ✅ Yes (Grafana Loki integration complete)

Decision: Scaling Strategy

  • Choice: Vertical scaling only (no horizontal scaling for MVP)
  • Rationale:
    • Single CronJob instance (no concurrency needed)
    • Hourly evaluation cadence (no performance bottleneck)
    • Git-as-database limits horizontal scaling (conflict management complexity)
  • Future: Phase 5+ could explore sharding by symbol (separate CronJobs per grid)
  • Current Resources: Sufficient for single grid, hourly evaluation

Decision Impact Analysis

Implementation Sequence for Phases 2-5:

  1. Phase 2 (Exit Strategy):

    • Extend state management patterns (exit state transitions YAML files)
    • Implement historical data loading (in-memory caching pattern)
    • Add notification priority mapping (Pushover priority levels)
    • Follow testing pattern (8+ tests per trigger module)
  2. Phase 3 (Position Risk):

    • Extend API integration pattern (KuCoin position tracking)
    • Add circuit breaker implementation (standardize API resilience)
    • Implement tenacity retry library (consistent across all API calls)
    • Follow interface pattern (PositionTrackerInterface → KuCoinPositionTracker)
  3. Phase 4 (Testing & Validation):

    • Implement backtesting (replay Git history pattern)
    • Add KPI calculation (daily batch processing CronJob)
    • Standardize test data generation (real data replay)
    • Validate all RAIA assumptions (12-24h warning window, <30% FER)
  4. Phase 5 (Operational):

    • Enhance logging (structured logging consistency)
    • Add audit logger (append-only decision tracking)
    • Implement KPI dashboard (static HTML + Chart.js pattern)
    • Optional n8n integration (webhook dispatcher)

Cross-Component Dependencies:

  • Exit Strategy → Historical Data Loading: Phase 2 implements pattern, Phase 3+ reuses for position tracking
  • State Management → Notification: Exit state transitions drive notification priority/content
  • Error Handling → All Phases: Circuit breaker + retry patterns established in Phase 3, applied retroactively to Phases 1-2
  • Schema Validation → All Phases: Pydantic models ensure consistency across all data structures
  • Testing Patterns → All Phases: 100% coverage + real data validation established in Phase 1, maintained throughout

Key Principle: Extend Phase 1 Patterns, Don’t Reinvent

Every new feature should ask:

  1. Does Phase 1 have a pattern for this? (Yes → follow it)
  2. Is this genuinely new? (Yes → create pattern consistent with Phase 1 principles)
  3. Does this conflict with Phase 1? (Yes → document exception in RAIA.md with rationale)