Image Factory - Requirements
Problem Statement
When public base images are updated, our internal images that depend on them become stale and potentially vulnerable. We need an automated system that:
- Monitors upstream base images for updates
- Waits a configurable period (default: 7 days) to allow the community to discover vulnerabilities
- Rebuilds our internal images that depend on the updated base image
- Cascades rebuilds through the dependency chain
- Operates in a federated model where image repositories are distributed across multiple repos/teams
Functional Requirements
FR1: Image Enrollment
- Developers can enroll images by adding them to
images.yaml - Two types of images:
- Managed Images: Have source repo and Dockerfile (we build them)
- External Images: No source repo (third-party images we monitor)
- Enrollment specifies:
- Image registry and repository
- Source code location (optional - only for managed images)
- Dockerfile path (optional - only for managed images)
- Build workflow/pipeline (optional - only for managed images)
- Rebuild policies (delay, auto-rebuild)
- Warehouse configuration (repoURL, allowTags) for external images
FR2: Dependency Discovery
- System automatically discovers base images from Dockerfiles (managed images only)
- Parses all FROM statements
- Handles multi-stage builds
- Tracks dependency relationships
- Base images are initially treated as external
- Base images can be promoted to managed by adding source info to images.yaml
FR3: Base Image Monitoring
- Monitors upstream base images for digest changes
- Detects new versions in registries
- Tracks update history
- No polling - event-driven via Kargo Warehouses
FR4: Rebuild Orchestration
- Waits configurable delay period after base image update
- Triggers rebuilds of dependent images
- Cascades through dependency chain
- Updates state with rebuild attempts and results
FR5: State Management
- Maintains state files for all images and base images
- Tracks current versions, digests, and update history
- Preserves runtime data across updates
- Configuration (images.yaml) takes precedence over state files
- State files contain warehouse configuration (repoURL, allowTags) for CDK8s
- Analysis tool output must align with CDK8s input requirements
FR6: Manifest Generation
- CDK8s app reads both images.yaml and state files
- Generates Kargo Warehouse resources for:
- Base images (discovered from Dockerfiles)
- External images (enrolled without source)
- Managed images that become external (source removed)
- Does NOT generate warehouses for managed images (they’re built, not monitored)
- Merges images.yaml with state (images.yaml takes precedence)
- Output must be valid Kargo Warehouse YAML
FR7: GitOps Integration
- All configuration in git
- State changes committed to git
- Kargo resources generated from config and state
- ArgoCD applies resources automatically
Non-Functional Requirements
NFR1: Event-Driven Architecture
- No polling or CronJobs
- React to Kargo Freight creation
- Efficient resource usage
NFR2: Pure Kargo Implementation
- All monitoring through Kargo Warehouses
- All analysis through Kargo AnalysisTemplates
- All orchestration through Kargo Stages
- Unified UI and monitoring
NFR3: Security
- Credentials stored in Kubernetes Secrets
- Minimal permission scopes
- Audit trail in git
- Support for image signature verification
NFR4: Scalability
- Add images by editing configuration
- No infrastructure to manage
- Distributed across repos/teams
NFR5: Testability
- Unit tests for analysis tool (apps/image-factory/test_app.py)
- Unit tests for CDK8s manifest generation (cdk8s/image-factory/test_main.py)
- Integration tests verifying tool → state → CDK8s workflow (image-factory/test_integration.py)
- Tests verify data alignment between tool output and CDK8s input
Open Questions
-
How do we handle breaking changes in base images?
- Should we test images before promoting?
- Need a rollback mechanism?
-
What if a base image has a critical CVE?
- Should we rebuild immediately (skip waiting period)?
- How do we get notified of CVEs?
-
How do we handle multi-stage builds with multiple base images?
- Track all FROM statements?
- Prioritize by stage?
-
How do we handle rate limiting?
- Docker Hub has strict rate limits
- Need caching strategy?
- Should we use a pull-through cache?
-
How do we handle image lifecycle transitions?
- External → Managed: Add source info to images.yaml
- Managed → External: Remove source info from images.yaml
- Base image promoted to managed: Add to images.yaml with source
- How do we clean up old state files?