Architecture Decision Document
This document is built collaboratively through step-by-step discovery. Sections are appended as we work through each architectural decision together.
Project Context Analysis
Requirements Overview
System Purpose: The Market-Making System is a comprehensive algorithmic trading decision support platform that enables profitable grid trading at scale through regime-aware monitoring, tiered exit protection, and systematic capital management. The system prioritizes capital preservation over profit maximization, providing 24/7 market monitoring with human-in-loop execution to maintain operator control while benefiting from automated risk detection.
Functional Requirements:
Core Capabilities (20 major requirements identified):
- Market Regime Classification (Req 1): Four-regime taxonomy (RANGE_OK, RANGE_WEAK, TRANSITION, TREND) with multi-timeframe analysis (1h primary + 4h confirmation) using 6 independent metrics
- Recommendation Engine (Req 2, 6, 7): Two recommendation types (GRID_MANAGE, GRID_SETUP) with constrained action sets and confidence-based escalation
- Confidence Scoring (Req 3): Conservative calibration with multiple penalties (time-based <36h, position-based, maturity, spacing) capped at 0.95
- Decision Record Management (Req 4): Git-backed immutable YAML files tracking recommendations, actions, and evaluations at 24h/72h/7d horizons
- Grid Configuration Management (Req 5, 16): YAML-based configurations with detailed parameters (price bounds, grid levels, amounts per grid, profit percentages) and version history
- Capital Management (Req 8): SINGLE_GRID mode with global reserve enforcement and unlocked balance calculations
- Automation Controls (Req 9): Asymmetric automation philosophy - auto-reduce risk (STOP_GRID), manual capital deployment (CREATE_GRID requires approval)
- Timeout & Cooldown Management (Req 10): Verdict-based timeouts (30min TRANSITION, 15min TREND) and action-based cooldowns (60min after stop, 120min after declined setup)
- API Security (Req 11): No withdrawal permissions, IP whitelist enforcement, trade permissions validation
- Exchange Integration (Req 12): Abstract Exchange_Interface with KuCoin as first implementation
- Trade Monitoring & Notifications (Req 13): Pushover API integration (direct), optional webhook support, rate-limited context-rich alerts
- Performance Evaluation (Req 14): Multi-horizon assessment (24h/72h/7d) with USD-denominated economic impact tracking
- Metrics Collection (Req 15): Hourly snapshots with minute-level price granularity stored in Git
- Grid Restart Gates (Req 17): Sequential three-gate evaluation (Directional Energy Decay → Mean Reversion Return → Tradable Volatility) required after stops
- Probationary Grid Management (Req 18): Conservative parameters (50-60% allocation, wider spacing) for confidence 0.60-0.80 ranges
- Grid State Management (Req 19): History array as single source of truth (not separate enabled field)
- Historical Data Management (Req 20): Backfill tools and automated cleanup with retention policy enforcement
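The confidence-scoring rule in Req 3 (multiple penalties, score capped at 0.95) can be sketched as a pure function. This is a minimal illustration; the penalty values and function name are hypothetical, not the production calibration:

```python
def calibrate_confidence(base_score: float, penalties: list[float]) -> float:
    """Conservative calibration sketch: subtract every applicable penalty
    (time-based, position-based, maturity, spacing, ...), then cap the
    result at 0.95 and floor it at 0.0."""
    score = base_score - sum(penalties)
    return max(0.0, min(score, 0.95))
```

The cap means even a perfect raw score never reports full certainty, which matches the capital-preservation-first posture.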
Grid Exit Strategy Feature (Current Development Focus - Phases 2-5):
- Tiered exit states: NORMAL → WARNING → LATEST_ACCEPTABLE_EXIT → MANDATORY_EXIT
- Multi-metric consensus: WARNING requires 2+ conditions (prevents false alarms)
- State transition tracking with rate limiting
- Historical data loading for persistence checks
- Position risk quantification with KuCoin integration
- KPI tracking framework (7 metrics: SLAR, PRR, TTDR, FER, ERT, EAW, MEC)
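The tiered exit states and the 2+ condition consensus rule above can be sketched as follows; `ExitState` and `should_warn` are hypothetical names for illustration, not the Phase 2 implementation:

```python
from enum import Enum

class ExitState(Enum):
    # Tiered exit states, ordered from benign to critical
    NORMAL = "NORMAL"
    WARNING = "WARNING"
    LATEST_ACCEPTABLE_EXIT = "LATEST_ACCEPTABLE_EXIT"
    MANDATORY_EXIT = "MANDATORY_EXIT"

def should_warn(conditions_met: list[bool]) -> bool:
    """Multi-metric consensus: escalate NORMAL -> WARNING only when at
    least two independent conditions agree (prevents false alarms)."""
    return sum(conditions_met) >= 2
```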
Non-Functional Requirements:
Performance:
- Evaluation cadence: 1-hour (current), 15-minute (planned post-validation)
- Target response time: <30 seconds per evaluation cycle
- Warning threshold: >1 minute evaluation time triggers logging
- Error threshold: >5 minutes evaluation time triggers alerts
Reliability:
- Graceful degradation on API failures (retry 2-3x with exponential backoff)
- Git push failures: log locally, continue operation, retry on next cycle
- Circuit breaker patterns for persistent failures (3+ consecutive)
- 24/7 availability via Kubernetes CronJob scheduling
Security:
- KuCoin API keys: No withdrawal permissions, IP whitelist mandatory
- Kubernetes secrets via ExternalSecrets integration
- Private Git repository for decision records (SSH authentication)
- VPN-only access to decision interface (no public exposure for MVP)
Audit & Compliance:
- Immutable decision records (Git-backed, no retroactive editing)
- Configuration versioning (Git commit hash tracked in every decision)
- Separation of recommendation quality vs action quality tracking
- Investor-grade audit trail for external capital readiness (£100K+ scaling target)
Scalability:
- Current: Single grid, single symbol (ETH-USDT), £1K capital
- Near-term: £10K capital (3-4 months post-Phase 5 validation)
- Long-term: Multi-grid, multi-symbol, £100K+ external capital
- Architecture designed for multi-exchange extension
Data Management:
- Git-as-database pattern (no traditional database required)
- PVC-backed Git repository in Kubernetes (persists across pod restarts)
- Minute-level price granularity (up to 540 days 4h data)
- Automated cleanup with retention policy (to be defined post-MVP)
Observability:
- Grafana Loki integration for all logs and metrics
- Structured logging with error context
- Performance metrics (API response times, evaluation duration, error rates)
- State transition tracking (all exit state changes logged)
Scale & Complexity
Complexity Assessment: HIGH
Justification:
- Multi-subsystem platform with 12+ architectural components
- Complex state management (regime classification, exit states, gate sequencing)
- Financial system integration with real capital at risk
- Multi-metric consensus logic requiring 2+ conditions
- Sequential gate evaluation with forced progression
- Multi-timeframe coordination (1h/4h synchronization)
- Immutable audit trail requirements for investor credibility
- ~180-250 hours remaining development effort across Phases 2-5
Primary Technical Domain: Financial Technology - Algorithmic Trading Decision Support
Architecture Style: Event-driven batch processing with Git-based immutable audit logging
Project Context: Brownfield - extending existing regime detection system (Phase 1 complete) with exit strategy, position risk, testing, and operational capabilities
Current Maturity: ~40-50% complete toward MVP
- ✅ Phase 1: Regime detection with 6 real metrics (60 tests passing)
- 🚀 Phase 2: Exit strategy ready to start (50-70h effort)
- 📋 Phase 3: Position risk planned (30-40h effort)
- 📋 Phase 4: Testing & validation planned (40-50h effort)
- 📋 Phase 5: Operational improvements planned (20-30h effort)
Estimated Architectural Components: 40+ modules across 12 major subsystems
Component Categories:
- Data Ingestion (3 components): API clients, interface abstraction, backfill tools
- Regime Analysis (4 components): 6 metric calculators, classifier, confidence scorer, multi-timeframe coordinator
- Exit Strategy (4 components): 3 trigger modules, state tracker, rate limiter, historical loader
- Grid Management (4 components): Config manager, state determiner, probationary recommender, capital allocator
- Restart Gates (4 components): 3 gate evaluators, sequential orchestrator
- Decision & Audit (4 components): Record creator, action appender, evaluation appender, version tracker
- Position & Risk (4 components): Position tracker, risk calculator, profit estimator, stop-loss monitor
- Notification (4 components): Pushover client, webhook dispatcher, rate limiter, message builder
- Metrics Collection (4 components): Snapshot collector, price aggregator, Git persistence, PVC handler
- Dashboard (4 components): Visualization generator, HTML builder, dashboard packager, trend analyzer
- Testing & Validation (4 components): Backtesting framework, KPI calculator, test suite, integration orchestrator
- Infrastructure (5 components): CronJob manager, ExternalSecrets, ArgoCD, Loki logging, Docker builder
Technical Constraints & Dependencies
Hard Constraints:
- KuCoin API Limitation: Spot grids cannot be managed via the API
- Impact: Manual UI execution required (human-in-loop by necessity)
- Benefit: Turned into design strength (regulatory simplicity, operator control)
- Mitigation: System generates recommendations, human executes in KuCoin UI
- No Database Requirement: Git-as-database pattern enforced
- Rationale: Simplicity, immutable audit trail, version control built-in
- Impact: All state must serialize to YAML/JSON files in Git
- Challenge: Historical data loading requires file I/O from Git PVC
- Personal Capital Only (MVP): No external investor funds until post-Phase 5
- Current: £1K validation capital
- Near-term: £10K personal capital (3-4 month target)
- Long-term: £100K+ external capital (requires completed audit trail)
- Impact: No regulatory compliance requirements for MVP
- API Security Requirements:
- No withdrawal permissions (hard requirement)
- IP whitelist enforcement (hard requirement)
- Kubernetes secrets only (no config file credentials)
- Stateless Execution: Each CronJob run must be independent
- Load state from Git PVC
- Evaluate regime and exit conditions
- Commit results to Git
- Exit cleanly (no persistent processes)
Soft Constraints (Assumptions to Validate):
- 1-Hour Evaluation Cadence: Assumption that 12-24 hour warning windows exist for regime transitions
- RAIA A001, A004: To be validated in Phase 4 backtesting
- Fallback: Switch to 15-minute cadence if <80% transitions provide >2h warning
- 2+ Condition WARNING Trigger: Assumption that requiring 2+ metrics prevents false positives without missing true transitions
- RAIA A005: Target False Exit Rate <30%
- Tunable: Can increase to 3+ conditions if FER >30%
- False Positive Tolerance: Assumption that a <30% false exit rate is acceptable
- Economic impact: Missed ranging periods vs avoided stop-losses
- To be measured via KPI framework in Phase 4-5
- Single Grid Sufficient: Assumption that one active grid at a time is sufficient for the validation phase
- Future: Multi-grid support for £10K+ capital scaling
- Architecture designed for extension (grid_id tracking everywhere)
External Dependencies:
- KuCoin Exchange API:
- Market data (OHLCV at multiple timeframes)
- Account data (balance, positions)
- Trade history (fills, orders)
- Rate limits: 1-hour cadence well within limits
- Availability: System must handle API outages gracefully
- Git Repository (market-maker-data):
- Private repository for decision records and metrics
- SSH authentication from Kubernetes pods
- PVC-backed clone (persists across restarts)
- Push failures acceptable (retry on next cycle)
- Pushover API:
- Direct notification delivery (no n8n dependency for MVP)
- Rate limiting: Built into application logic
- Fallback: Log notifications if API unavailable
- Kubernetes Infrastructure:
- CronJob scheduling (hourly execution)
- PVC provisioning (Git repository storage)
- ExternalSecrets integration (KuCoin API keys)
- Grafana Loki (logging and observability)
- ArgoCD (deployment automation)
- Optional: n8n Integration (Post-MVP):
- Webhook endpoints for advanced orchestration
- Multi-channel notification routing (Email, Slack, SMS)
- Not required for core functionality
Technology Stack (Current):
- Python 3.11+
- Pydantic (schema validation)
- GitPython (Git operations)
- PyYAML (YAML parsing)
- Requests (KuCoin API calls)
- Chart.js (dashboard visualizations)
- NumPy (metric calculations)
- Pytest (testing framework - 60 tests passing)
Cross-Cutting Concerns Identified
1. Configuration Management
- YAML-based configuration with environment variable overrides for Kubernetes
- Schema validation on load (fail fast if invalid)
- Git commit hash versioning (every decision references config version)
- Convention-based naming for env vars (e.g., MARKET_MAKER_DATA_REPOSITORY_BASE_PATH)
- Blue-green deployment support (invalid config keeps previous version running)
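The environment-variable override convention can be sketched as below. This simplified version only overrides top-level keys; the real loader presumably resolves nested keys such as `MARKET_MAKER_DATA_REPOSITORY_BASE_PATH`, and the function name is illustrative:

```python
import os

ENV_PREFIX = "MARKET_MAKER_"

def apply_env_overrides(config: dict, environ=None) -> dict:
    """Return a copy of config with keys overridden by MARKET_MAKER_*
    environment variables (suffix lowercased to match YAML keys)."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for name, value in environ.items():
        if name.startswith(ENV_PREFIX):
            merged[name[len(ENV_PREFIX):].lower()] = value
    return merged
```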
2. Error Handling & Resilience
- API retry logic: 2-3 attempts with exponential backoff
- Circuit breakers: 3+ consecutive failures trigger alerts
- Git push failures: Log locally, continue operation, retry on next cycle
- Metric calculation errors: Continue with remaining metrics, include error context in notifications
- Graceful degradation: Partial metrics acceptable if confidence calculable
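The retry policy above (2-3 attempts with exponential backoff) might look like this; the helper name and delay base are illustrative, not the system's actual API client code:

```python
import time

def with_retries(call, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Invoke call(); on failure, wait base_delay * 2**attempt and retry.
    The last failure is re-raised so the caller can degrade gracefully."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the helper testable without real delays.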
3. Logging & Observability
- Structured logging to Grafana Loki
- Performance metrics: API response times, evaluation duration, error rates
- State transition logging: All exit state changes captured
- Alert thresholds: >1min evaluation (WARNING), >5min evaluation (ERROR)
- Git push success/failure tracking
4. Schema Validation
- Pydantic models for all data structures:
- Metrics files (regime, confidence, detailed_analysis, gate_evaluation)
- Exit state transitions (transitions array, last_notification timestamps)
- Decision records (recommendation, action_records, evaluation_records)
- Configuration (exit_rules, notifications, gates)
- Pre-commit validation (reject invalid data before Git commit)
- Runtime validation (load-time validation with clear error messages)
5. Testing Strategy
- Unit tests: 90%+ coverage target (60 tests passing for Phase 1)
- Integration tests: 5+ scenarios for state progression
- Real data validation: Run against last 7 days of market-maker-data
- Backtesting: 3-6 months historical validation in Phase 4
- KPI tracking: 7 metrics (SLAR, PRR, TTDR, FER, ERT, EAW, MEC)
6. Data Retention & Cleanup
- Current policy: Keep all data forever (MVP)
- Estimated growth: 10-50 KB per evaluation × 24 hours × 365 days ≈ 87-438 MB/year
- Action item: Revisit retention policy at end of MVP (RAIA log entry needed)
- Monthly maintenance: Automated cleanup procedures (Req 20)
7. Security
- API key management: Kubernetes ExternalSecrets integration
- No withdrawal permissions: Enforced at API key level
- IP whitelist: Required for KuCoin API access
- Private Git repository: SSH authentication, no public access
- Decision interface: VPN-only access (no OAuth for MVP)
8. Cooldown & Timeout Management
- Per-grid cooldowns (not global):
- 60 minutes after STOP_GRID action
- 120 minutes after declined/expired Grid_Setup
- Verdict-based timeouts:
- 30 minutes for TRANSITION regime
- 15 minutes for TREND regime
- No timeout for RANGE_OK/RANGE_WEAK
- Enforcement: State tracking in exit_states/{symbol}/{date}.json files
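A minimal sketch of the enforcement check, using the durations listed above; the dictionary and helper names are hypothetical:

```python
from datetime import datetime, timedelta

# Durations taken from the requirements above
ACTION_COOLDOWNS = {
    "STOP_GRID": timedelta(minutes=60),
    "DECLINED_SETUP": timedelta(minutes=120),
}
VERDICT_TIMEOUTS = {
    "TRANSITION": timedelta(minutes=30),
    "TREND": timedelta(minutes=15),
    # no timeout for RANGE_OK / RANGE_WEAK
}

def cooldown_active(last_action: str, last_action_at: datetime, now: datetime) -> bool:
    """Per-grid cooldown: True while the window after the last action is open."""
    window = ACTION_COOLDOWNS.get(last_action)
    return window is not None and now - last_action_at < window
```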
9. Rate Limiting
- Notification rate limits per exit state:
- WARNING: 4 hours minimum between same-state notifications
- LATEST_ACCEPTABLE_EXIT: 2 hours minimum
- MANDATORY_EXIT: 1 hour minimum
- Purpose: Prevent notification fatigue while ensuring critical alerts break through
- Implementation: last_notification timestamps in state transition files
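Given the `last_notification` timestamps, the rate-limit check reduces to comparing elapsed time against a per-state minimum gap; the function name is illustrative:

```python
from datetime import datetime, timedelta

# Minimum gaps between same-state notifications, per the limits above
MIN_GAP = {
    "WARNING": timedelta(hours=4),
    "LATEST_ACCEPTABLE_EXIT": timedelta(hours=2),
    "MANDATORY_EXIT": timedelta(hours=1),
}

def may_notify(state: str, last_sent, now: datetime) -> bool:
    """Allow a notification if none has been sent for this state recently
    enough; a missing timestamp (None) always allows sending."""
    return last_sent is None or now - last_sent >= MIN_GAP[state]
```

The tighter gap for MANDATORY_EXIT reflects the design goal that critical alerts break through while routine warnings are throttled.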
10. Multi-Timeframe Coordination
- Primary timeframe: 1 hour (decision timeframe)
- Confirmation timeframe: 4 hours (structural validation)
- Early warning context: 5min, 15min (not used for decisions)
- Data requirements: 21d/1m, 120d/15m, 270d/1h, 540d/4h
- Synchronization: Historical data loading must handle timeframe alignment
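One piece of that alignment is mapping each 1h evaluation timestamp to its enclosing 4h confirmation candle. A sketch, assuming 4h candles open at 00:00, 04:00, ... UTC (the helper name is hypothetical):

```python
from datetime import datetime

def confirmation_window_start(ts_1h: datetime) -> datetime:
    """Floor a 1h evaluation timestamp to the opening time of the 4h
    candle that contains it (assumes candles aligned to midnight UTC)."""
    return ts_1h.replace(hour=ts_1h.hour - ts_1h.hour % 4,
                         minute=0, second=0, microsecond=0)
```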
Starter Architecture Analysis
Project Foundation Context
This is a brownfield project with Phase 1 complete (~40-50% toward MVP). Rather than selecting a starter template, we’re documenting the established architecture that Phase 1 created, which serves as the “effective starter” for all future development (Phases 2-5).
Primary Technology Domain
Backend Batch Processing System - Financial Technology Decision Support
Not a web application, API service, or interactive system. This is a scheduled batch processor that:
- Runs as Kubernetes CronJob (hourly execution currently)
- Loads state from Git repository on PVC
- Performs regime analysis and exit state evaluation
- Commits results back to Git
- Exits cleanly (stateless execution model)
- Generates static HTML dashboards (no server required)
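The stateless execution model reduces to one skeleton function. The callables are injected here purely for illustration; the real entry point wires up concrete modules:

```python
def run_once(load_state, evaluate, commit_results):
    """One CronJob cycle: load -> evaluate -> commit -> exit.
    A failed Git push is swallowed (the real system logs it locally)
    so the run still exits cleanly and retries on the next cycle."""
    state = load_state()          # read YAML/JSON from the Git PVC
    results = evaluate(state)     # regime + exit-state evaluation
    try:
        commit_results(results)   # git add / commit / push
    except Exception:
        pass                      # logged locally; retried next cycle
    return results
```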
Established Architecture Pattern: “Git-Backed Batch Processor”
Phase 1 established a unique architectural pattern optimized for financial decision support with immutable audit trails:
Core Pattern: Event-driven batch processing + Git-as-database + Stateless execution
Why this pattern:
- Immutability: Git provides version-controlled, tamper-evident audit trail (investor credibility requirement)
- Simplicity: No database to operate, backup, or maintain
- Reproducibility: Every evaluation can be replayed from Git history
- Transparency: Decision records are human-readable YAML/JSON files
- Resilience: Git push failures don’t crash system (retry on next cycle)
Technology Stack (Phase 1 Established)
Language & Runtime:
- Python 3.11+ (strict version requirement)
- Type hints throughout (enforced in code reviews)
- No async/await (synchronous batch processing sufficient)
Core Dependencies:
# Data Validation & Serialization
pydantic>=2.0 # Schema validation for all data structures
pyyaml>=6.0 # YAML parsing for config and decision records
# Git Operations
GitPython>=3.1 # Git repository management
# API Integration
requests>=2.31 # KuCoin API calls (synchronous, retry logic built-in)
# Numerical Computing
numpy>=1.24 # Metric calculations (ADX, OU process, etc.)
# Testing
pytest>=7.4 # Test framework (60 tests passing)
pytest-cov>=4.1 # Coverage reporting (100% for Phase 1 metrics)
Visualization (Dashboard Generation):
- Chart.js (JavaScript library, embedded in static HTML)
- No server-side rendering
- Self-contained HTML files with embedded data
Infrastructure:
- Kubernetes CronJob (scheduling)
- PVC (Persistent Volume Claim) for Git repository storage
- ExternalSecrets (KuCoin API key injection)
- Grafana Loki (structured logging)
- ArgoCD (GitOps deployment)
- Docker (containerization via GitHub Actions)
Project Structure & Organization
Established by Phase 1 (60 tests passing, production-ready):
metrics-service/
├── src/
│ ├── config/ # Configuration management
│ │ ├── loader.py # YAML config loading + env var overrides
│ │ └── validator.py # Schema validation (fail fast on invalid)
│ │
│ ├── exchanges/ # Exchange integration layer
│ │ ├── kucoin_client.py # KuCoin API wrapper
│ │ └── exchange_interface.py # Abstract interface (future multi-exchange)
│ │
│ ├── regime/ # Regime analysis engine (Phase 1 COMPLETE)
│ │ ├── engine.py # Main regime classifier
│ │ ├── classifier.py # 4-regime taxonomy logic
│ │ ├── confidence.py # Conservative confidence scoring
│ │ ├── metrics/ # 6 metric calculators
│ │ │ ├── __init__.py
│ │ │ ├── adx.py # Average Directional Index
│ │ │ ├── efficiency_ratio.py
│ │ │ ├── autocorrelation.py
│ │ │ ├── ou_process.py # Ornstein-Uhlenbeck half-life
│ │ │ ├── slope.py # Normalized slope
│ │ │ └── bollinger.py # Bollinger Band bandwidth
│ │ └── gates/ # Restart gate evaluators
│ │ ├── gate1_energy.py # Directional Energy Decay
│ │ ├── gate2_reversion.py # Mean Reversion Return
│ │ └── gate3_volatility.py # Tradable Volatility
│ │
│ ├── exit_strategy/ # Exit state engine (Phase 2 - IN PROGRESS)
│ │ ├── evaluator.py # Main exit state evaluator (~30% complete)
│ │ ├── state_tracker.py # State transition tracking
│ │ ├── triggers/ # Trigger modules (to be implemented)
│ │ │ ├── mandatory.py
│ │ │ ├── latest_acceptable.py
│ │ │ └── warning.py
│ │ └── history_loader.py # Historical data loading
│ │
│ ├── grid/ # Grid configuration management
│ │ ├── config_manager.py # YAML-based grid definitions
│ │ ├── state_determiner.py # History array-based state logic
│ │ ├── capital_allocator.py # SINGLE_GRID mode + reserve enforcement
│ │ └── probationary.py # Conservative grid recommender
│ │
│ ├── metrics/ # Metrics collection & storage
│ │ ├── collector.py # Hourly snapshot collection
│ │ ├── aggregator.py # Minute-level price aggregation
│ │ └── persistence.py # Git commit/push operations
│ │
│ ├── interfaces/ # Abstract interfaces
│ │ ├── exchange.py # Exchange abstraction
│ │ └── notification.py # Notification abstraction
│ │
│ ├── spotcheck/ # Utility modules
│ │ └── validators.py
│ │
│ ├── git_manager.py # Git operations wrapper
│ └── __init__.py # Entry point orchestration
│
├── tests/ # Test suite (60 tests passing Phase 1)
│ ├── regime/
│ │ ├── metrics/ # 8 tests per metric × 6 metrics = 48 tests
│ │ │ ├── test_adx.py
│ │ │ ├── test_efficiency_ratio.py
│ │ │ ├── test_autocorrelation.py
│ │ │ ├── test_ou_process.py
│ │ │ ├── test_slope.py
│ │ │ └── test_bollinger.py
│ │ └── test_classifier.py # Integration tests
│ ├── grid/
│ │ └── test_state_determiner.py
│ └── conftest.py # Shared fixtures
│
├── config/ # Configuration files
│ ├── environment.yaml # Default configuration
│ └── exit_strategy_config.yaml # Exit rules (Phase 2)
│
├── scripts/ # Operational scripts
│ ├── send_regime_notifications.py # Pushover integration
│ └── collect_metrics.py # Metrics collection orchestrator
│
├── infra/ # Kubernetes manifests
│ └── metrics-service/
│ ├── cronjob.yaml # CronJob definition
│ ├── pvc.yaml # Git repository PVC
│ └── externalsecret.yaml # API key injection
│
├── .venv/ # Virtual environment (local dev)
├── pyproject.toml # Dependencies + pytest config
├── Taskfile.yml # Task automation (local + CI/CD)
└── Dockerfile # Container image build
Key Organizational Patterns:
- Flat module structure: No deep nesting (max 2-3 levels)
- Feature-based organization: `regime/`, `exit_strategy/`, `grid/` (not layered like `models/`, `services/`)
- Tests mirror src structure: `tests/regime/metrics/test_adx.py` ↔ `src/regime/metrics/adx.py`
- Interfaces separate from implementations: `interfaces/` contains abstractions, `exchanges/` contains implementations
- Single-responsibility modules: each `.py` file has one clear purpose
Architectural Decisions Provided by Phase 1 Foundation
Language & Type Safety:
- Python 3.11+ with strict type hints (enforced)
- Pydantic models for all data structures (runtime validation)
- No dynamic typing or `Any` types (except when unavoidable)
- Docstrings required for public functions
Configuration Management:
- YAML-based configuration (human-readable, version-controlled)
- Environment variable overrides with convention: `MARKET_MAKER_<NESTED_KEY>`
- Schema validation on load (fail fast, never run with invalid config)
- Git commit hash versioning (every decision references config version)
Data Persistence:
- Git as primary data store (no database)
- YAML for structured data (decision records, metrics)
- JSON for state transitions (exit states, gate tracking)
- One file per evaluation/decision (atomic commits)
- PVC-backed Git repository in Kubernetes
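The one-file-per-evaluation rule pairs naturally with an atomic write before the Git commit. A sketch using a temp file plus `os.replace`; the real persistence lives in `metrics/persistence.py`, so this helper is only illustrative:

```python
import os
import tempfile
from pathlib import Path

def write_atomically(path: Path, content: str) -> None:
    """Write an evaluation file atomically: write to a temp file in the
    same directory, then os.replace() it into place so a crashed run
    never leaves a half-written YAML for the next cycle to commit."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.replace(tmp, path)
```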
Testing Strategy:
- Pytest as test framework
- 100% coverage target for new code (achieved in Phase 1)
- Test file naming: `test_<module>.py`
- Shared fixtures in `conftest.py`
- Integration tests in separate `test_integration_*.py` files
Error Handling:
- Explicit exception handling (no bare `except:` clauses)
- Retry logic with exponential backoff for API calls
- Graceful degradation (partial metrics acceptable)
- Structured logging with error context
- Circuit breakers for persistent failures
Code Style:
- Black formatting (enforced in CI/CD)
- Isort for import organization
- Pylint for code quality
- Maximum line length: 100 characters
- Docstring format: Google style
Git Workflow:
- Feature branches: `feature/<phase>-<description>`
- Conventional commits: `[Phase N] <type>: <description>`
- Squash merges to main
- No direct commits to main
Deployment Pattern:
- Docker multi-stage builds (small final image)
- GitHub Actions for CI/CD
- ArgoCD for GitOps deployment
- Blue-green deployment (invalid config keeps previous version)
- Image tagging: `<version>` (semantic versioning)
Development Experience Features
Local Development:
- Taskfile.yml for common operations:
  - `task install` - Set up virtual environment
  - `task test` - Run test suite
  - `task lint` - Run linters
  - `task format` - Auto-format code
  - `task evaluate-regime` - Run regime evaluation locally
  - `task collect-metrics` - Simulate CronJob execution
Hot Reloading:
- Not applicable (batch processing, not web server)
- Use `task evaluate-regime` to test changes locally
Debugging:
- Standard Python debugger (pdb)
- VSCode launch configurations provided
- Logging to console in development mode
Testing Infrastructure:
- `pytest -v` for verbose output
- `pytest --cov` for coverage reporting
- `pytest -k <pattern>` for selective test execution
- Fixtures for mocked KuCoin API responses
Documentation:
- Docstrings in code (Google style)
- README.md with quick start guide
- DEVELOPER-HANDOFF.md for new developers
- Architecture decisions in this document
Extension Points for Future Development
For Phase 2 (Exit Strategy):
- Add trigger modules in `src/exit_strategy/triggers/`
- Follow the established pattern: one trigger type per file
- Add tests in `tests/exit_strategy/triggers/`
- 8+ test cases per trigger (boundary conditions, edge cases)
For Phase 3 (Position Risk):
- Add position tracker in `src/position/`
- Implement `PositionTrackerInterface` (create new interface)
- Add risk calculator in `src/risk/`
- Follow the KuCoin client pattern for API integration
For Phase 4 (Testing & Validation):
- Add backtesting framework in `src/backtest/`
- Add KPI calculator in `src/kpi/`
- Create integration test scenarios in `tests/integration/`
For Phase 5 (Operational):
- Enhance logging in existing modules
- Add audit logger in `src/audit/`
- Add KPI dashboard generator in `src/dashboard/`
Migration Notes for New Developers
Coming from Web Development:
- This is NOT a web server - it’s a batch job
- No HTTP requests to handle (except outbound API calls)
- No real-time state - everything loads from Git each run
- “Deploy” means update CronJob, not restart server
Coming from Database-Heavy Systems:
- No database queries - read YAML/JSON files from Git
- No migrations - schema changes via Pydantic model updates
- No transactions - Git commits are atomic operations
- No indexing - file-based lookups are sufficient at current scale
Coming from Microservices:
- This is a monolith by design (simplicity > distributed complexity)
- No service-to-service calls (except KuCoin API)
- No message queues - Git repository is the “queue”
- No service discovery - fixed CronJob schedule
Key Principles Established by Phase 1
- Simplicity over Cleverness: Straightforward code beats clever abstractions
- Immutability over Mutability: Append-only decision records, never edit
- Explicit over Implicit: Configuration visible, no magic defaults
- Type Safety over Dynamic: Pydantic validation, no `Any` types
- Testability over Speed: 100% coverage more important than micro-optimizations
- Capital Preservation over Profit Optimization: Safety-first decision logic
- Human-in-Loop over Full Automation: System recommends, human decides
What Phase 1 Proves
Technical Viability:
- ✅ Git-as-database works at hourly evaluation scale
- ✅ Pydantic validation catches config errors before deployment
- ✅ Kubernetes CronJob scheduling reliable (no missed evaluations)
- ✅ PVC-backed Git repository persists across pod restarts
- ✅ 6 metric calculators produce trustworthy values (replaced hardcoded dummies)
Development Velocity:
- ✅ 60 tests passing with 100% coverage (40-60 hour effort)
- ✅ Clean module boundaries enable parallel development
- ✅ Taskfile commands work identically in local + CI/CD
- ✅ Type hints catch errors at development time
Operational Readiness:
- ✅ Blue-green deployment prevents bad config from reaching production
- ✅ Grafana Loki logging provides visibility into evaluation runs
- ✅ Git commit history serves as complete audit trail
- ✅ ExternalSecrets integration keeps API keys secure
What This Means for Phases 2-5:
Follow the patterns established in Phase 1. Don’t reinvent:
- Module organization (feature-based, flat structure)
- Testing approach (8+ tests per module, 100% coverage)
- Configuration management (YAML + env var overrides)
- Git operations (load → evaluate → commit → exit)
- Error handling (retry logic, graceful degradation)
Add new capabilities by extending existing patterns:
- New trigger modules → `src/exit_strategy/triggers/`
- New gate evaluators → `src/regime/gates/`
- New metric calculators → `src/regime/metrics/`
- New integrations → `src/<integration>/` with interface in `src/interfaces/`
Core Architectural Decisions
Decision Context
This section documents architectural decisions established by Phase 1 (60 tests passing, production-ready). These patterns are definitive for Phases 2-5 - extend them, don’t reinvent them.
Phase 1 proved these decisions work in production with real capital at risk. New development should follow these established patterns unless there’s a compelling reason to diverge (document exceptions in RAIA.md).
Decision Priority Analysis
Critical Decisions (Already Established by Phase 1):
- Data persistence via Git (no traditional database)
- Pydantic schema validation (fail fast on invalid data)
- Stateless batch execution (load → evaluate → commit → exit)
- Python 3.11+ with strict type hints
- Kubernetes CronJob deployment model
- 100% test coverage for new code
Important Decisions (Already Established by Phase 1):
- YAML for configuration, decisions, and state
- JSON only for raw data arrays
- ExternalSecrets for API key management
- Grafana Loki for observability
- ArgoCD for GitOps deployment
- Black + Isort + Pylint for code quality
Deferred Decisions (Post-MVP):
- Data retention policy (currently keep everything forever)
- Multi-exchange support (KuCoin only for MVP)
- Multi-grid concurrent execution (single grid only for MVP)
- 15-minute evaluation cadence (hourly for MVP, validate 12-24h warning window assumption first)
- Automated grid creation (manual approval required for MVP)
Data Architecture
Decision: Git-as-Database Pattern
- Choice: Git repository as primary data store (no PostgreSQL, MongoDB, etc.)
- Version: GitPython 3.1+
- Rationale:
- Immutable audit trail (version-controlled, tamper-evident)
- Simplicity (no database to operate, backup, or maintain)
- Reproducibility (every evaluation can be replayed from Git history)
- Transparency (decision records are human-readable YAML files)
- Affects: All subsystems (regime, exit strategy, grid management, metrics collection)
- Provided by: Phase 1 design decision
- Production Proven: ✅ Yes (Phase 1 complete)
Decision: Pydantic Schema Validation
- Choice: Pydantic 2.0+ for all data structure validation
- Version: pydantic>=2.0
- Rationale:
- Runtime validation (catch config errors before Git commit)
- Type safety (strict typing with validation)
- Schema evolution (versioned models)
- Developer experience (clear error messages)
- Affects: Configuration loading, metrics files, decision records, exit state transitions
- Pattern: One Pydantic model per file type
  # src/schemas/metrics.py
  class MetricsFile(BaseModel):
      symbol: str
      timestamp: datetime
      regime: RegimeType
      confidence: float = Field(ge=0.0, le=1.0)
      # ... rest of schema
- Production Proven: ✅ Yes (Phase 1 complete)
Decision: YAML for Recommendations/State, JSON ONLY for Raw Data
- Choice: YAML for all recommendations, decisions, and state tracking. JSON ONLY for raw data arrays.
- Version: pyyaml>=6.0
- Rationale:
- YAML: Human-readable, supports comments, better for ALL state and recommendations
- JSON: Machine-readable, better ONLY for raw data arrays (price data, etc.)
- Both validated via Pydantic models (no raw dict manipulation)
- File Structure:
  market-maker-data/
  ├── metrics/{symbol}/{YYYY-MM-DD}-{HH}.yaml           # YAML (regime, confidence, analysis)
  ├── decisions/{YYYY-MM-DD}/dec-{symbol}-{HHMMSS}.yaml # YAML (recommendations)
  ├── exit_states/{symbol}/{YYYY-MM-DD}.yaml            # YAML (state transitions)
  └── raw_data/{symbol}/{YYYY-MM-DD}.json               # JSON (only for raw price arrays if needed)
- Production Proven: ✅ Yes (Phase 1 complete)
Decision: Schema Evolution Strategy
- Choice: Pydantic model versioning with backward-compatible reads
- Pattern:
  class MetricsFileV2(BaseModel):
      schema_version: str = "2.0"
      # ... new fields with defaults

  def load_metrics_file(path: Path) -> MetricsFileV2:
      raw_data = yaml.safe_load(path.read_text())
      if raw_data.get("schema_version") == "1.0":
          raw_data = migrate_v1_to_v2(raw_data)
      return MetricsFileV2(**raw_data)
- Migration Strategy: Write migration functions, no automated backfill (lazy migration on read)
- Rationale: Preserves immutability of historical records while allowing schema evolution
- Deferred to: Phase 4 (first real schema change expected)
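The `migrate_v1_to_v2` function referenced in the pattern above is left undefined; a minimal sketch of what such a lazy, read-time migration might look like is shown below. The `gate_evaluation` default is an assumed example field for illustration, not the actual v2 schema.

```python
# Hypothetical v1 -> v2 migration: a pure function on the raw dict,
# applied lazily on read so historical files stay untouched on disk.
def migrate_v1_to_v2(raw: dict) -> dict:
    migrated = dict(raw)  # Copy; never mutate the loaded original
    migrated["schema_version"] = "2.0"
    # Fields new in 2.0 get safe defaults so old records still validate.
    migrated.setdefault("gate_evaluation", None)
    return migrated
```

Keeping migrations as pure dict-to-dict functions makes them trivially testable and chainable (v1→v2→v3) without touching Git history.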
Decision: Historical Data Loading & Caching
- Choice: Load from Git on demand, in-memory caching within single evaluation run
- Pattern:
```python
class HistoryLoader:
    def __init__(self, git_repo_path: Path):
        self._cache: Dict[str, MetricsFile] = {}

    def load_last_n_hours(self, symbol: str, n: int) -> List[MetricsFile]:
        # Load from Git, cache in memory for this run
        # No persistent cache (stateless execution)
        ...
```
- Rationale: Stateless execution model means no persistent cache; an in-memory cache is sufficient for a single run
- Performance: Loading 24 hours × 50KB ≈ 1.2MB (acceptable for hourly evaluation)
- Implementation Phase: Phase 2 (exit strategy needs historical data)
Decision: Multi-Timeframe Data Synchronization
- Choice: Primary timeframe (1h) loads 4h confirmation data as needed
- Pattern: Each metric calculator specifies required timeframes, engine loads all required data upfront
- Rationale: Simpler than streaming/incremental loading, sufficient for batch processing
- Already Implemented: ✅ Yes (Phase 1 loads 1h + 4h data for regime classification)
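The pattern above can be sketched as follows. This is a minimal illustration assuming hypothetical `MetricCalculator` and loader names (not the actual Phase 1 classes): each calculator declares the timeframes it needs, and the engine loads the union of all required data once, upfront, before any metric runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MetricCalculator:
    name: str
    required_timeframes: List[str]  # e.g. ["1h"] or ["1h", "4h"]
    compute: Callable[[Dict[str, list]], float]

def collect_required_timeframes(calculators: List[MetricCalculator]) -> List[str]:
    """Union of all timeframes the calculators need, in first-seen order."""
    seen: List[str] = []
    for calc in calculators:
        for tf in calc.required_timeframes:
            if tf not in seen:
                seen.append(tf)
    return seen

def run_calculators(calculators: List[MetricCalculator],
                    loader: Callable[[str], list]) -> Dict[str, float]:
    # Load every required timeframe exactly once, then hand each
    # calculator the full data map.
    data = {tf: loader(tf) for tf in collect_required_timeframes(calculators)}
    return {calc.name: calc.compute(data) for calc in calculators}
```

Because the data map is loaded before any calculator runs, adding a new 4h-confirmed metric only requires declaring `["1h", "4h"]`; no loading code changes.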
State Management
Decision: Exit State Persistence
- Choice: Daily YAML files per symbol with transitions array
- File Structure: `exit_states/{symbol}/{YYYY-MM-DD}.yaml`
- Schema:
```yaml
symbol: ETH-USDT
grid_id: eth-grid-1
date: 2026-02-02
transitions:
  - timestamp: "2026-02-02T14:00:00Z"
    from_state: NORMAL
    to_state: WARNING
    reasons:
      - "Condition 1"
      - "Condition 2"
    metrics:
      adx: 28.5
      efficiency_ratio: 0.62
last_notification:
  WARNING: "2026-02-02T14:00:00Z"
  LATEST_ACCEPTABLE_EXIT: null
  MANDATORY_EXIT: null
```
- Rationale: YAML for state transitions (human-readable audit trail); daily files keep file sizes manageable
- Implementation Phase: Phase 2
Decision: Gate Evaluation Tracking
- Choice: Embedded in metrics YAML files (not separate state)
- Location: `metrics/{symbol}/{YYYY-MM-DD}-{HH}.yaml` → `gate_evaluation` section
- Rationale: Gate status is part of the regime analysis output; co-locating it with metrics simplifies loading
- Already Implemented: ⚠️ Partial (structure defined in SCHEMA.md, implementation pending Phase 2)
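For illustration, an hourly metrics file with an embedded `gate_evaluation` section might look like the fragment below. The field and gate names here are assumptions; the authoritative structure is defined in SCHEMA.md.

```yaml
# metrics/ETH-USDT/2026-02-02-14.yaml (illustrative fields only)
symbol: ETH-USDT
regime: RANGE_OK
confidence: 0.81
gate_evaluation:
  passed: true
  gates:
    - name: trend_strength
      status: PASS
    - name: volatility_band
      status: PASS
```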
Decision: Rate Limiting State
- Choice: Store in the exit state transitions file (`last_notification` timestamps)
- Pattern: Check the `last_notification[state]` timestamp; notify only if the current time is at least the rate-limit threshold past it
- Rationale: Co-locating with exit state transitions keeps related data together
- Rate Limits:
- WARNING: 4 hours minimum
- LATEST_ACCEPTABLE_EXIT: 2 hours minimum
- MANDATORY_EXIT: 1 hour minimum
- Implementation Phase: Phase 2
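The rate-limit check above can be sketched as follows, assuming the `last_notification` timestamps have already been parsed from YAML into `datetime` objects. The function and constant names are illustrative, not the actual Phase 2 API.

```python
from datetime import datetime, timedelta, timezone

# Minimum interval between repeat notifications per exit state,
# taken from the rate limits listed above.
RATE_LIMITS = {
    "WARNING": timedelta(hours=4),
    "LATEST_ACCEPTABLE_EXIT": timedelta(hours=2),
    "MANDATORY_EXIT": timedelta(hours=1),
}

def should_notify(last_notification: dict, state: str, now: datetime) -> bool:
    """True if there is no prior notification for this state, or the
    rate-limit window since the last one has fully elapsed."""
    last = last_notification.get(state)
    if last is None:
        return True
    return now - last >= RATE_LIMITS[state]
```

Since the timestamps live in the same daily file as the transitions, the check needs no extra I/O beyond the exit-state load already performed each cycle.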
Decision: Grid State Determination
- Choice: History array as single source of truth (not a separate `enabled` field)
- Pattern:
```python
def is_grid_running(grid_config: Dict) -> bool:
    history = grid_config.get("history", [])
    if not history:
        return False
    last_entry = history[-1]
    return "enabled" in last_entry and "disabled" not in last_entry
```
- Rationale: Single source of truth, no conflicting state
- Already Implemented: ✅ Yes (Phase 1 complete, Requirement 19)
Error Recovery & Resilience
Decision: Partial Metric Failure Handling
- Choice: Continue with N/6 metrics if ≥4 available, abort if <4
- Pattern:
```python
try:
    adx = calculate_adx(data)
except MetricCalculationError as e:
    logger.error(f"ADX calculation failed: {e}")
    adx = None  # Continue with None; the confidence scorer handles missing metrics

if available_metrics < 4:
    raise InsufficientMetricsError("Cannot calculate confidence with <4 metrics")
```
- Rationale: Graceful degradation; regime classification remains possible with partial metrics
- Already Implemented: ✅ Yes (Phase 1 handles missing metrics gracefully)
Decision: Git Conflict Resolution
- Choice: Retry with pull + merge (automatic for non-conflicting)
- Pattern:
```python
try:
    repo.git.push()
except GitCommandError:
    repo.git.pull(rebase=True)  # Rebase our commit on top
    repo.git.push()
```
- Rationale: Conflicts are unlikely (single CronJob instance, hourly execution); automatic retry is sufficient
- Escalation: If retry fails, log error, continue operation (commit on next cycle)
- Already Implemented: ✅ Yes (Phase 1 git_manager.py)
Decision: API Circuit Breaker
- Choice: 3 consecutive failures → circuit OPEN (stop trying for 30 minutes)
- Pattern:
```python
class KuCoinClient:
    def __init__(self):
        self._failure_count = 0
        self._circuit_open_until = None

    def fetch_ohlcv(self, symbol, timeframe):
        if self._circuit_open_until and now() < self._circuit_open_until:
            raise CircuitBreakerOpenError()
        try:
            result = self._api_call(...)
            self._failure_count = 0  # Reset on success
            return result
        except APIError:
            self._failure_count += 1
            if self._failure_count >= 3:
                self._circuit_open_until = now() + timedelta(minutes=30)
            raise
```
- Rationale: Prevents hammering a failed API; the 30-minute timeout allows temporary outages to clear
- Implementation Phase: Phase 3 (position tracking adds more API calls)
Decision: API Retry Logic
- Choice: Exponential backoff, 2-3 attempts max
- Pattern:
```python
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_exception_type(APIError),
)
def fetch_with_retry(self, endpoint, params):
    return self._http_get(endpoint, params)
```
- Library: tenacity (add to dependencies)
- Rationale: Handles transient network issues without manual retry logic
- Already Implemented: ⚠️ Partial (manual retry in some places, not consistent)
- Standardization Phase: Phase 3 (apply tenacity library consistently)
Notification Architecture
Decision: Notification Priority Mapping
- Choice: Exit states map to Pushover priority levels
- Mapping:
```python
PRIORITY_MAP = {
    ExitState.NORMAL: -1,                 # Low priority (quiet notification)
    ExitState.WARNING: 0,                 # Normal priority
    ExitState.LATEST_ACCEPTABLE_EXIT: 1,  # High priority (bypasses quiet hours)
    ExitState.MANDATORY_EXIT: 2,          # Emergency priority (requires acknowledgment)
}
```
- Rationale: Escalating urgency matches exit-state severity; MANDATORY_EXIT requires an explicit acknowledgment
- Implementation Phase: Phase 2
Decision: Multi-Channel Strategy
- Choice: Direct Pushover integration for MVP, optional n8n webhooks for future
- Pattern:
```python
class NotificationDispatcher:
    def __init__(self, pushover_client, webhook_url=None):
        self.pushover = pushover_client
        self.webhook_url = webhook_url  # Optional

    def send(self, notification):
        self.pushover.send(notification)  # Always send to Pushover
        if self.webhook_url:
            requests.post(self.webhook_url, json=notification.dict())  # Optional webhook
```
- Rationale: Pushover is sufficient for the MVP (direct, reliable); n8n adds flexibility post-MVP
- Already Implemented: ✅ Pushover integration (scripts/send_regime_notifications.py)
- Future Enhancement: Phase 5 (n8n integration for multi-channel routing)
Decision: Notification Templating
- Choice: Templates in code (not config files)
- Pattern:
```python
def build_exit_state_message(state: ExitState, regime: Dict, grid: Dict) -> str:
    if state == ExitState.MANDATORY_EXIT:
        return (
            f"🚨 MANDATORY EXIT: {grid['id']}\n"
            f"Regime: {regime['verdict']}\n"
            f"Stop grid immediately."
        )
    elif state == ExitState.LATEST_ACCEPTABLE_EXIT:
        return (
            f"⚠️ LATEST ACCEPTABLE EXIT: {grid['id']}\n"
            f"Exit within 12-24 hours to preserve capital."
        )
    # ... etc
```
- Rationale: Templates are code logic (conditional rendering), not configuration
- Deferred to Config: Phase 5 if user wants customizable templates
- Implementation Phase: Phase 2
Testing & Validation Architecture
Decision: Backtesting Data Format
- Choice: Replay actual Git history (no synthetic data for backtesting)
- Pattern:
```python
class BacktestRunner:
    def run(self, start_date: date, end_date: date):
        # Load actual metrics files from Git history
        for metrics_file in self.load_metrics_range(start_date, end_date):
            # Re-evaluate the exit strategy against historical data
            exit_state = self.evaluator.evaluate(metrics_file)
            # Compare against actual actions taken (from decision records)
            ...
```
- Rationale:
- Real data (no synthetic data bias)
- Tests actual Git loading code
- Validates against real market conditions
- Implementation Phase: Phase 4
Decision: KPI Calculation Frequency
- Choice: Batch calculation (daily aggregation)
- Pattern:
```python
# Run as a separate CronJob (daily at midnight)
class KPICalculator:
    def calculate_daily_kpis(self, date: date):
        # Load all exit state transitions for the date
        # Load all decision records for the date
        # Calculate KPIs (SLAR, PRR, TTDR, etc.)
        # Write to kpis/{YYYY-MM-DD}.yaml
        ...
```
- Rationale: KPIs are lagging indicators (they do not need real-time calculation); a daily batch is sufficient
- Implementation Phase: Phase 5
Decision: Test Data Generation
- Choice: Replay actual Git history (no mocking for integration tests)
- Pattern:
```python
@pytest.fixture
def last_7_days_metrics(git_repo):
    # Load actual metrics files from the last 7 days
    return HistoryLoader(git_repo).load_last_n_days("ETH-USDT", 7)

def test_exit_state_progression_real_data(last_7_days_metrics):
    # Test against real historical data
    for metrics in last_7_days_metrics:
        exit_state = evaluator.evaluate(metrics)
        # Assert reasonable exit states (no wild oscillations)
```
- Rationale: Real data validates production behavior and catches edge cases that mocks miss
- Already Implemented: ⚠️ Partial (some tests use real data, not standardized)
- Standardization Phase: Phase 4
Infrastructure & Deployment
Decision: Hosting Strategy
- Choice: Self-hosted Kubernetes cluster (not cloud provider)
- Rationale:
- Full control over infrastructure
- No cloud provider costs
- Already deployed and working (Phase 1 production)
- Components:
- Kubernetes CronJob (evaluation scheduling)
- PVC (Git repository persistence)
- ExternalSecrets (API key injection)
- Grafana Loki (logging)
- Already Implemented: ✅ Yes (Phase 1 complete)
Decision: CI/CD Pipeline
- Choice: GitHub Actions → Docker build → GHCR → ArgoCD GitOps
- Pipeline:
- GitHub Actions: Run tests, lint, build Docker image
- Push to GHCR (GitHub Container Registry)
- Update image tag in infra/ manifests
- ArgoCD detects change, deploys to Kubernetes
- Rationale:
- GitHub Actions free for public repos
- GHCR integrated with GitHub
- ArgoCD provides GitOps deployment
- Blue-green deployment (rollback on invalid config)
- Already Implemented: ✅ Yes (Phase 1 complete)
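As a rough illustration only, the first steps of the pipeline might resemble the workflow fragment below. The job names, action versions, and commands are assumptions for the sketch, not the repository's actual workflow file.

```yaml
# .github/workflows/ci.yaml (illustrative sketch)
name: ci
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt && pytest   # tests + lint gate
      - run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .
      - run: docker push ghcr.io/${{ github.repository }}:${{ github.sha }}
# A follow-up step updates the image tag in the infra/ manifests;
# ArgoCD detects that change and syncs the Kubernetes deployment.
```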
Decision: Environment Configuration
- Choice: YAML base + environment variable overrides
- Pattern:
```yaml
# config/environment.yaml
kucoin:
  api_key: "${KUCOIN_API_KEY}"        # Injected via ExternalSecrets
  api_secret: "${KUCOIN_API_SECRET}"
repository:
  base_path: "${MARKET_MAKER_DATA_REPOSITORY_BASE_PATH}"  # Overridable
```
- Convention: `MARKET_MAKER_<NESTED_KEY>` for env var names
- Rationale:
- YAML provides defaults
- Env vars allow Kubernetes overrides
- ExternalSecrets injects secrets securely
- Already Implemented: ✅ Yes (Phase 1 complete, Requirement 15)
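The `MARKET_MAKER_<NESTED_KEY>` convention can be sketched as a small recursive merge over the parsed YAML, where each nested key path joins into one underscore-separated env var name. The function name and exact behavior are illustrative, not the actual Phase 1 loader.

```python
import os

def apply_env_overrides(config: dict, prefix: str = "MARKET_MAKER",
                        env: dict = None) -> dict:
    """Return a copy of config where any leaf value is replaced by the
    env var named <prefix>_<KEY_PATH> (keys upper-cased, joined by '_')."""
    env = os.environ if env is None else env
    result = {}
    for key, value in config.items():
        var_name = f"{prefix}_{key.upper()}"
        if isinstance(value, dict):
            result[key] = apply_env_overrides(value, var_name, env)
        else:
            result[key] = env.get(var_name, value)
    return result
```

This keeps YAML as the single source of defaults while letting Kubernetes manifests override any leaf without code changes.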
Decision: Monitoring and Logging
- Choice: Grafana Loki for structured logging (no separate APM tool)
- Pattern:
```python
logger.info(
    "Regime evaluation complete",
    extra={
        "symbol": symbol,
        "regime": regime.verdict,
        "confidence": regime.confidence,
        "duration_ms": duration,
    },
)
```
- Rationale:
- Structured logging sufficient for batch processing
- No need for distributed tracing (single monolith)
- Loki already deployed (Phase 1)
- Alert Thresholds:
- Evaluation duration >1 minute: WARNING
- Evaluation duration >5 minutes: ERROR
- Git push failure: WARNING
- 3+ consecutive API failures: ERROR
- Already Implemented: ✅ Yes (Grafana Loki integration complete)
Decision: Scaling Strategy
- Choice: Vertical scaling only (no horizontal scaling for MVP)
- Rationale:
- Single CronJob instance (no concurrency needed)
- Hourly evaluation cadence (no performance bottleneck)
- Git-as-database limits horizontal scaling (conflict management complexity)
- Future: Phase 5+ could explore sharding by symbol (separate CronJobs per grid)
- Current Resources: Sufficient for single grid, hourly evaluation
Decision Impact Analysis
Implementation Sequence for Phases 2-5:
- Phase 2 (Exit Strategy):
- Extend state management patterns (exit state transitions YAML files)
- Implement historical data loading (in-memory caching pattern)
- Add notification priority mapping (Pushover priority levels)
- Follow testing pattern (8+ tests per trigger module)
- Phase 3 (Position Risk):
- Extend API integration pattern (KuCoin position tracking)
- Add circuit breaker implementation (standardize API resilience)
- Implement tenacity retry library (consistent across all API calls)
- Follow interface pattern (PositionTrackerInterface → KuCoinPositionTracker)
- Phase 4 (Testing & Validation):
- Implement backtesting (replay Git history pattern)
- Add KPI calculation (daily batch processing CronJob)
- Standardize test data generation (real data replay)
- Validate all RAIA assumptions (12-24h warning window, <30% FER)
- Phase 5 (Operational):
- Enhance logging (structured logging consistency)
- Add audit logger (append-only decision tracking)
- Implement KPI dashboard (static HTML + Chart.js pattern)
- Optional n8n integration (webhook dispatcher)
Cross-Component Dependencies:
- Exit Strategy → Historical Data Loading: Phase 2 implements pattern, Phase 3+ reuses for position tracking
- State Management → Notification: Exit state transitions drive notification priority/content
- Error Handling → All Phases: Circuit breaker + retry patterns established in Phase 3, applied retroactively to Phase 1-2
- Schema Validation → All Phases: Pydantic models ensure consistency across all data structures
- Testing Patterns → All Phases: 100% coverage + real data validation established in Phase 1, maintained throughout
Key Principle: Extend Phase 1 Patterns, Don’t Reinvent
Every new feature should ask:
- Does Phase 1 have a pattern for this? (Yes → follow it)
- Is this genuinely new? (Yes → create pattern consistent with Phase 1 principles)
- Does this conflict with Phase 1? (Yes → document exception in RAIA.md with rationale)