Decision Record: Ad-Hoc Load Testing Framework
Date: 2026-02-04
Status: Proposed
Category: Infrastructure
Decision Makers: Platform Engineering Team
Context
We run an integration platform that federates APIs, allowing producers to surface their APIs for self-service consumption. The platform operates large sandbox instances, and we need a flexible, easy-to-configure system for running ad-hoc load tests.
Current State
- Infrastructure: ArgoCD and GitLab available
- Test Generation: Can run in GitLab CI or sandbox environment
- Target System Under Test (SUT): Different sandbox environment from test generation
- Scale: Need to test multiple federated APIs with varying load profiles
- Use Cases:
- Ad-hoc performance validation
- Pre-production load testing
- API capacity planning
- Performance regression detection
Requirements
- Flexibility: Easy to configure different test scenarios and targets
- Self-Service: Teams should be able to trigger tests with minimal friction
- Isolation: Test generation and SUT should be separate environments
- Observability: Clear metrics and reporting
- Reproducibility: Tests should be version-controlled and repeatable
- Resource Efficiency: Don’t consume unnecessary sandbox resources
Decision
We will implement a k6-based load testing framework with the following architecture:
Tool Selection: k6 over Gatling
Chosen: k6
Alternatives Considered: Gatling, Locust, JMeter
Rationale:
- Kubernetes-native: k6 operator enables distributed testing in K8s clusters
- Lightweight: Smaller container footprint suitable for sandbox constraints
- Developer-friendly: JavaScript/TypeScript tests are easier to write and maintain
- GitLab Integration: Excellent CI/CD support with native performance reporting
- Flexible Execution: CLI, K8s operator, or cloud-based execution modes
- Modern Metrics: Built-in Prometheus/InfluxDB support
Trade-offs:
- Gatling has better GUI for test recording (not critical for API testing)
- k6's JavaScript runtime has a learning curve for Java/JVM teams (acceptable given broader JS adoption)
Operator vs Non-Operator Deployment Comparison
A critical decision in implementing load testing is whether to use a Kubernetes operator, K8s Jobs, or simpler container-based execution. This affects architecture, scalability, operational complexity, and time-to-value.
Our Approach: We plan to use K8s Jobs triggered from GitLab CI, running on a separate cluster from the SUT. This offloads work from GitLab servers and avoids potential network bottlenecks between GitLab and the SUT.
Priority Concerns: Decision Matrix
These dimensions are critical to our decision-making process.
| Priority Dimension | k6 + Operator | k6 + K8s Job | k6 + GitLab Runner | Gatling + Operator | Gatling + K8s Job | Gatling + GitLab Runner |
|---|---|---|---|---|---|---|
| Quality of Reporting (Out-of-Box) | ⭐⭐⭐ JSON/text summary (needs Grafana for visual) | ⭐⭐⭐ JSON/text summary (needs Grafana for visual) | ⭐⭐⭐ JSON/text summary (needs Grafana for visual) | ⭐⭐⭐⭐⭐ Rich HTML reports built-in, detailed drill-downs, charts | ⭐⭐⭐⭐⭐ Rich HTML reports built-in, detailed drill-downs | ⭐⭐⭐⭐⭐ Rich HTML reports built-in |
| Quality with Tooling | ⭐⭐⭐⭐⭐ Excellent with Grafana/InfluxDB | ⭐⭐⭐⭐⭐ Excellent with Grafana/InfluxDB | ⭐⭐⭐⭐ Good with Grafana/InfluxDB | ⭐⭐⭐⭐⭐ Built-in + optional Grafana | ⭐⭐⭐⭐⭐ Built-in + optional Grafana | ⭐⭐⭐⭐⭐ Built-in + optional Grafana |
| Ease of Reporting | ⭐⭐⭐⭐ Automated via CRD, requires Grafana setup | ⭐⭐⭐⭐ Simple artifact collection, requires Grafana/report gen | ⭐⭐⭐⭐ GitLab artifacts, requires Grafana/report gen | ⭐⭐⭐⭐ Custom collection, HTML ready | ⭐⭐⭐⭐⭐ HTML reports work immediately | ⭐⭐⭐⭐⭐ HTML reports work immediately |
| Time to First Test | 1-2 days | 2-4 hours | 1-2 hours | 2-3 days | 3-5 hours | 1-2 hours |
| Time to MVP | 1-2 weeks | 1-3 days | 1-2 days | 2-3 weeks | 3-5 days | 1-2 days |
| Maturity | ⭐⭐⭐⭐⭐ Official Grafana Labs operator, production-ready | ⭐⭐⭐⭐⭐ Standard K8s Job pattern, rock solid | ⭐⭐⭐⭐⭐ Standard Docker execution | ⭐⭐⭐ Community operators, less mature | ⭐⭐⭐⭐⭐ Standard K8s Job pattern | ⭐⭐⭐⭐⭐ Standard Docker execution |
| Ease of Use | ⭐⭐⭐ Requires CRD knowledge, K8s expertise | ⭐⭐⭐⭐ Standard K8s Job, familiar to teams | ⭐⭐⭐⭐⭐ Simple Docker run command | ⭐⭐ Custom CRDs or complex Helm charts | ⭐⭐⭐⭐ Standard K8s Job | ⭐⭐⭐⭐⭐ Simple Docker run |
| Ease of Horizontal Scaling | ⭐⭐⭐⭐⭐ Built-in parallelism parameter | ⭐⭐⭐⭐ Job completions: N + manual coordination | ⭐⭐ Manual multi-runner orchestration | ⭐⭐⭐⭐ Operator-managed or manual | ⭐⭐⭐⭐ Job completions: N + coordination scripts | ⭐⭐ Manual orchestration |
Rating Key: ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐ Good | ⭐⭐⭐ Acceptable | ⭐⭐ Limited | ⭐ Poor
Reporting Deep Dive: k6 vs Gatling
This is a critical differentiator that deserves detailed explanation.
Gatling Reporting (Out-of-the-Box Winner)
What You Get Immediately:
- Rich HTML Reports: Beautiful, interactive reports generated automatically after each test
- Visual Charts: Response time distribution, requests/second, response time percentiles over time
- Drill-Down Capability: Click into specific requests, see detailed stats per endpoint
- Statistical Analysis: Min/max/mean/percentiles, standard deviation
- Error Analysis: Detailed breakdown of failures with counts and percentages
- Self-Contained: Single HTML file (or folder) you can share, no server required
Example Gatling Report Sections:
1. Global Information: Total requests, OK/KO counts, min/max/mean/percentiles
2. Statistics Table: Per-request breakdown with all metrics
3. Active Users Over Time: Graph showing VU ramp-up/down
4. Response Time Distribution: Histogram of latencies
5. Response Time Percentiles: P50/P75/P95/P99 over time
6. Requests Per Second: Throughput over time
7. Responses Per Second: Success/failure rates
Artifact Collection:
```sh
# Gatling generates reports to target/gatling/<timestamp>/
# Contains: index.html + js/ + style/ folders
kubectl cp <pod>:/results/gatling ./gatling-report
# Open index.html in a browser - fully functional report
```
Verdict: ⭐⭐⭐⭐⭐ Production-ready reports with zero additional tooling
Gatling → Grafana Integration Options:
Since you already have Prometheus and JMX monitoring infrastructure, Gatling has several options:
Option 1: Prometheus + JMX Exporter ⭐⭐⭐⭐ (Best for your setup)
- How: Gatling exposes JMX metrics → JMX Exporter → Prometheus → Grafana
- Setup:
  - Run Gatling with JMX enabled: `-Dgatling.jmx.enabled=true`
  - Deploy JMX Exporter as a sidecar in the K8s Job pod
  - Configure Prometheus to scrape the JMX Exporter endpoint
- Pros:
- ✅ Leverages your existing Prometheus infrastructure
- ✅ Same pattern as other Java apps you monitor
- ✅ Real-time metrics during test execution
- Cons:
- ⚠️ JMX Exporter sidecar adds complexity to Job manifest
- ⚠️ Need to configure JMX metric mappings
- ⚠️ Community Grafana dashboards (not official)
- Setup Time: 2-3 hours (sidecar config + Prometheus PodMonitor + dashboard)
Option 2: Prometheus Pushgateway ⭐⭐⭐⭐
- How: Gatling pushes metrics to Pushgateway → Prometheus scrapes → Grafana
- Plugin: use the `gatling-prometheus` plugin:

```scala
// build.sbt
libraryDependencies += "com.github.lkishalmi.gatling" % "gatling-prometheus" % "3.11.1"
```

```hocon
# gatling.conf
data { writers = [console, file, prometheus] }
prometheus { pushgateway { url = "http://pushgateway.monitoring:9091" } }
```

- Pros:
- ✅ Works well for batch jobs (like K8s Jobs)
- ✅ Simpler than JMX Exporter (no sidecar)
- ✅ Designed for short-lived processes
- Cons:
- ⚠️ Requires plugin installation (not built-in)
- ⚠️ Pushgateway required (may already have it)
- ⚠️ Metrics persist in Pushgateway after test (need cleanup)
- Setup Time: 1-2 hours (plugin + pushgateway config)
Option 3: InfluxDB Export ⭐⭐⭐
- How: Gatling → InfluxDB → Grafana
- Plugin: `gatling-influxdb`:

```scala
// build.sbt
libraryDependencies += "com.github.gatling" % "gatling-influxdb" % "1.1.4"
```

- Pros:
- ✅ Direct time-series storage
- ✅ Good for historical trending
- Cons:
- ⚠️ Requires InfluxDB (if you don’t have it)
- ⚠️ Separate from your Prometheus infrastructure
- ⚠️ Plugin required
- Setup Time: 2-4 hours (deploy InfluxDB if needed + plugin config)
Option 4: Graphite Export ⭐⭐
- How: Built-in Gatling Graphite support → Grafana Graphite datasource
- Configuration: built into Gatling (no plugin required):

```hocon
# gatling.conf
data { writers = [console, file, graphite] }
graphite { host = "graphite.monitoring" port = 2003 }
```

- Pros:
- ✅ No plugin required
- ✅ Built-in support
- Cons:
- ⚠️ Requires Graphite (probably don’t have it)
- ⚠️ Less common than Prometheus
- Setup Time: 2-4 hours (deploy Graphite + configure)
Detailed Setup: Option 1 (JMX Exporter - Recommended for you)
K8s Job manifest with JMX Exporter sidecar:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gatling-test-${CI_PIPELINE_ID}
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9404"
        prometheus.io/path: "/metrics"
    spec:
      restartPolicy: Never
      containers:
        # Main Gatling container
        - name: gatling
          image: denvazh/gatling:latest
          command:
            - /opt/gatling/bin/gatling.sh
            - -sf=/simulations
            - -s=com.example.ApiSimulation
          env:
            - name: JAVA_OPTS
              value: "-Dgatling.jmx.enabled=true -Dcom.sun.management.jmxremote.port=1099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
          resources:
            requests:
              memory: 2Gi
              cpu: 1
            limits:
              memory: 4Gi
              cpu: 2
        # JMX Exporter sidecar
        - name: jmx-exporter
          image: bitnami/jmx-exporter:latest
          ports:
            - containerPort: 9404
              name: metrics
          volumeMounts:
            - name: jmx-config
              mountPath: /etc/jmx-exporter
          command:
            - java
            - -jar
            - /opt/bitnami/jmx-exporter/jmx_prometheus_httpserver.jar
            - "9404"
            - /etc/jmx-exporter/config.yaml
          resources:
            requests:
              memory: 128Mi
              cpu: 100m
      volumes:
        - name: jmx-config
          configMap:
            name: gatling-jmx-config
```

JMX Exporter config:
```yaml
# ConfigMap: gatling-jmx-config
apiVersion: v1
kind: ConfigMap
metadata:
  name: gatling-jmx-config
data:
  config.yaml: |
    hostPort: localhost:1099
    rules:
      - pattern: "io.gatling.core<type=AllRequests><>(.+)"
        name: gatling_all_requests_$1
      - pattern: "io.gatling.core<type=Simulation><>(.+)"
        name: gatling_simulation_$1
      - pattern: "io.gatling.core<type=Request, name=(.+)><>(.+)"
        name: gatling_request_$2
        labels:
          request: "$1"
```

Prometheus PodMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gatling-tests
  namespace: load-testing
spec:
  selector:
    matchLabels:
      job-type: load-test
  podMetricsEndpoints:
    - port: metrics
      interval: 10s
```

Setup Time Breakdown:
- JMX Exporter sidecar config: 30 minutes
- JMX metric mapping config: 1 hour
- Prometheus PodMonitor: 15 minutes
- Grafana dashboard: 1 hour
- Total: ~2-3 hours
Detailed Setup: Option 2 (Pushgateway - Simpler)
Gatling with Prometheus plugin:
```yaml
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gatling
          image: custom-gatling-with-prometheus-plugin:latest  # Custom image with plugin
          command:
            - /opt/gatling/bin/gatling.sh
            - -sf=/simulations
            - -s=com.example.ApiSimulation
          env:
            - name: PUSHGATEWAY_URL
              value: "http://pushgateway.monitoring:9091"
          resources:
            requests:
              memory: 2Gi
              cpu: 1
```

Gatling config (baked into the custom image):
```hocon
# gatling.conf
data {
  writers = [console, file, prometheus]
}
prometheus {
  pushgateway {
    url = ${?PUSHGATEWAY_URL}
    jobName = "gatling-load-test"
  }
}
```

Setup Time Breakdown:
- Build custom image with plugin: 1 hour
- Pushgateway deployment (if needed): 30 minutes
- Prometheus scrape config: 15 minutes
- Grafana dashboard: 1 hour
- Total: ~2-3 hours (first time), 30 min (subsequent)
Comparison for Your Infrastructure:
| Method | Fits Existing Setup | Real-time | Setup Time | Complexity |
|---|---|---|---|---|
| JMX Exporter | ⭐⭐⭐⭐⭐ Uses existing Prometheus + JMX pattern | ✅ Yes | 2-3 hours | ⭐⭐⭐ Moderate |
| Pushgateway | ⭐⭐⭐⭐⭐ Uses existing Prometheus | ✅ Yes | 1-2 hours | ⭐⭐⭐⭐ Simpler |
| InfluxDB | ⭐⭐ Requires separate stack | ✅ Yes | 2-4 hours | ⭐⭐⭐ Moderate |
| Graphite | ⭐ Requires Graphite | ✅ Yes | 2-4 hours | ⭐⭐⭐ Moderate |
Recommendation for Your Setup:
- Prometheus Pushgateway (if you have it) - simplest
- JMX Exporter (if you don’t) - uses familiar pattern
Result: ⭐⭐⭐⭐ Gatling can integrate with your Prometheus/Grafana stack, but requires 1-3 hours additional setup vs k6’s native support
Comparison for Your Use Case (Existing Prometheus/Grafana):
| Aspect | k6 | Gatling |
|---|---|---|
| Prometheus Export | ⭐⭐⭐⭐⭐ Built-in experimental, or easy plugin | ⭐⭐⭐⭐ Via Pushgateway plugin or JMX Exporter |
| InfluxDB Export | ⭐⭐⭐⭐⭐ Built-in --out influxdb=... | ⭐⭐⭐ Requires plugin |
| Setup Effort (Prometheus) | 1 flag or small config | Plugin + custom image OR JMX sidecar (2-3 hours) |
| Setup Effort (InfluxDB) | 1 line in command (if you have InfluxDB) | Plugin + config file |
| Grafana Dashboards | ⭐⭐⭐⭐⭐ Official, well-maintained | ⭐⭐⭐ Community, requires customization |
| Dashboard Availability | Multiple official options | Limited community options |
| Data Schema | Standardized, well-documented | Less standardized, varies by export method |
| Real-time Monitoring | ⭐⭐⭐⭐⭐ Seamless | ⭐⭐⭐⭐ Works with setup |
| JMX Pattern Fit | N/A (not JVM) | ⭐⭐⭐⭐⭐ Perfect fit (you already monitor JMX) |
Since you already have Prometheus/Grafana:
- ✅ k6’s advantage remains strong (simpler Prometheus export)
- ✅ Official k6 Grafana dashboards work out-of-box
- ✅ Gatling can integrate via JMX Exporter (familiar pattern for your Java apps)
- ⚠️ Gatling requires 1-3 hours additional setup vs k6’s minutes
- ⚠️ Gatling dashboards are community-maintained (less polished)
k6 Reporting (Trivial with Existing Grafana)
What You Get Immediately:
- Text Summary: console output with basic stats:

```text
execution: local
   script: test.js
   output: -

scenarios: (100.00%) 1 scenario, 100 max VUs, 5m30s max duration

✓ status is 200
✓ response time OK

checks.........................: 100.00% ✓ 50000 ✗ 0
data_received..................: 150 MB  500 kB/s
data_sent......................: 5.0 MB  17 kB/s
http_req_blocked...............: avg=1ms   min=0s   med=1ms  max=10ms  p(90)=2ms   p(95)=3ms
http_req_duration..............: avg=100ms min=50ms med=95ms max=500ms p(90)=150ms p(95)=200ms
http_reqs......................: 50000   166.666667/s
```
- JSON Output: machine-readable metrics for parsing:

```json
{
  "metrics": {
    "http_req_duration": {
      "type": "trend",
      "contains": "time",
      "values": {
        "min": 50.123,
        "max": 500.456,
        "avg": 100.789,
        "med": 95.234,
        "p(90)": 150.567,
        "p(95)": 200.89,
        "p(99)": 450.123
      }
    }
  }
}
```
What You DON’T Get:
- ❌ No visual charts/graphs
- ❌ No drill-down HTML interface
- ❌ No time-series graphs (response time over duration)
- ❌ No distribution histograms
Options to Get Visual Reports:
Option 1: Grafana + InfluxDB (Best, Production-Grade)
- Export metrics: `k6 run --out influxdb=http://influxdb:8086/k6 test.js`
- Real-time dashboards during test execution
- Historical trending across test runs
- Requires: InfluxDB deployed, Grafana dashboards configured
- Setup Time: 2-4 hours for first-time setup
- Result: ⭐⭐⭐⭐⭐ Production-grade observability
Option 2: k6 HTML Report Generator (Third-Party)
- Tool: `k6-reporter` (npm package) or `k6-html-reporter`
- Generate HTML from JSON: `k6-reporter summary.json`
- Creates a basic HTML page with charts
- Requires: Node.js, external package
- Result: ⭐⭐⭐ Basic HTML, not as rich as Gatling
Option 3: k6 Cloud (Commercial)
- Export to Grafana Cloud k6
- Beautiful reports, no infrastructure
- Requires: Subscription, data egress to cloud
- Result: ⭐⭐⭐⭐⭐ Excellent but costs $$
GitLab Performance Widget:
```yaml
# k6 can output to GitLab's performance format
artifacts:
  reports:
    performance: performance.json  # GitLab shows a trend graph
```
- Shows a basic trend line in merge requests
- Result: ⭐⭐⭐ Useful for CI/CD gates, not detailed analysis
Verdict:
- Out-of-box: ⭐⭐⭐ Text/JSON only, requires tooling for visuals
- With Grafana: ⭐⭐⭐⭐⭐ Excellent real-time + historical analysis
- Trade-off: Setup overhead vs immediate gratification
Recommendation Based on Your Priorities
Since “Quality and Ease of Reporting” is your Priority #1, consider:
IMPORTANT: You Already Have Grafana 🎯
This significantly changes the evaluation in k6’s favor:
k6 with Existing Grafana (⭐⭐⭐⭐⭐ Recommended):
- ✅ Trivial setup: add a single flag, `--out influxdb=http://influxdb:8086/k6`
- ✅ Official dashboards: import the Grafana k6 dashboard in 5 minutes
- ✅ Real-time monitoring: Watch tests execute live in Grafana
- ✅ Unified observability: Monitor both load tests AND SUT in same Grafana instance
- ✅ Setup time: ~30 minutes (vs 2-4 hours if deploying Grafana from scratch)
- ✅ Result: Best of both worlds - HTML for ad-hoc sharing, Grafana for analysis
Gatling with Existing Grafana (⭐⭐⭐ Possible but more work):
- ⚠️ Requires the `gatling-influxdb` plugin (not built-in)
- ⚠️ Community dashboards (less polished than k6’s official ones)
- ⚠️ Additional build configuration (Maven/sbt dependency)
- ⚠️ Still get HTML reports, but Grafana integration is secondary
- ⚠️ Setup time: ~2-3 hours (plugin + dashboard customization)
Our Recommendation (With Existing Grafana):
Phase 1 (Day 1): k6 + text/JSON
- Get first test working in 2-4 hours
- Text output sufficient to validate approach
Phase 1.5 (Day 2): Connect to Grafana (30 minutes)
- Add `--out influxdb=...` to the Job manifest
- Import the official k6 dashboard into Grafana
- Now have real-time monitoring + historical trending
Optional: k6-reporter for HTML reports
- Use for sharing results with stakeholders who don’t have Grafana access
- 2 hours setup time
Result:
- ✅ Gatling’s advantage (HTML reports) becomes less critical
- ✅ k6’s Grafana integration is simpler and better supported
- ✅ You get best of both: Grafana for analysis, optionally HTML for sharing
- ✅ All observability in one place (load tests + SUT metrics in same Grafana)
Why k6 wins with existing Grafana:
- Setup: 1-line config vs plugin installation
- Dashboard quality: Official vs community
- Unified monitoring: Load test + SUT metrics side-by-side
- Lower resources: 512Mi vs 2Gi memory
- Faster setup: 2-4 hrs vs 3-5 hrs total
- More accessible: JS vs Scala
Gatling only makes sense if:
- Team is already JVM/Scala-proficient
- Need Gatling-specific features (recorder, complex DSL)
- HTML reports are critical and Grafana access is restricted
- Willing to invest in plugin setup + custom dashboards
Decision Table: With Existing Prometheus/Grafana 🎯
| Factor | k6 | Gatling |
|---|---|---|
| Prometheus Integration | ⭐⭐⭐⭐⭐ Built-in experimental or xk6 plugin | ⭐⭐⭐⭐ Via Pushgateway plugin or JMX Exporter |
| InfluxDB Integration | ⭐⭐⭐⭐⭐ Built-in, 1-line flag | ⭐⭐⭐ Plugin required |
| Setup Time (Prometheus) | 30 min - 1 hour | 1-3 hours (JMX sidecar or custom image) |
| Setup Time (InfluxDB) | 30 minutes | 2-3 hours (plugin + config) |
| Dashboard Quality | ⭐⭐⭐⭐⭐ Official, well-maintained | ⭐⭐⭐ Community dashboards |
| HTML Reports | ⭐⭐⭐ Optional (k6-reporter) | ⭐⭐⭐⭐⭐ Built-in, excellent |
| Real-time Monitoring | ⭐⭐⭐⭐⭐ Seamless | ⭐⭐⭐⭐ Works with setup |
| Fits JMX Pattern | N/A (not JVM) | ⭐⭐⭐⭐⭐ Perfect (like your other Java apps) |
| Unified Monitoring | ⭐⭐⭐⭐⭐ Load tests + SUT in same Grafana | ⭐⭐⭐⭐⭐ Load tests + SUT in same Grafana |
| Total Setup (Day 1) | 2-4 hours (test) + 30-60 min (metrics) | 3-5 hours (test) + 1-3 hours (metrics) |
| Resource Footprint | 120MB image, 512Mi RAM | 500MB image, 2Gi RAM (+ JMX sidecar if used) |
| Language Accessibility | ⭐⭐⭐⭐⭐ JavaScript | ⭐⭐⭐ Scala/Java |
| Time to MVP with Observability | 1.5-2 days | 4-6 days |
Verdict with Existing Prometheus/Grafana: ⭐⭐⭐⭐⭐ k6 still wins
k6 Advantages:
- Simpler Prometheus/InfluxDB integration (minutes vs hours)
- Official Grafana dashboards work immediately
- Lower resource footprint (no JMX sidecar needed)
- More accessible language
- Faster time to production-quality observability
Gatling Advantages:
- HTML reports for sharing with non-Grafana users
- JMX pattern matches your existing Java app monitoring (familiar)
- Scala DSL if team is JVM-proficient
Key Insight: While Gatling can integrate with your Prometheus setup via JMX Exporter (same pattern as your other Java apps), the additional 1-3 hours of setup + community dashboards don’t offset k6’s speed and simplicity advantages.
Other Concerns: Supporting Dimensions
| Other Dimension | k6 + Operator | k6 + K8s Job | k6 + GitLab Runner | Gatling + Operator | Gatling + K8s Job | Gatling + GitLab Runner |
|---|---|---|---|---|---|---|
| Setup Complexity | ⭐⭐ Operator + CRD installation | ⭐⭐⭐⭐ Job manifest + kubectl | ⭐⭐⭐⭐⭐ Just Docker image | ⭐⭐ Custom operator or Helm | ⭐⭐⭐⭐ Job manifest + kubectl | ⭐⭐⭐⭐⭐ Just Docker image |
| Max Load Capacity | Very High (100k+ RPS) | High (50k+ RPS with multiple Jobs) | Medium (10k RPS per runner) | Very High (100k+ RPS) | High (50k+ RPS) | Medium (10k RPS per runner) |
| Resource Isolation | ⭐⭐⭐⭐⭐ Namespaces, quotas, limits | ⭐⭐⭐⭐⭐ Namespaces, quotas, limits | ⭐⭐⭐ Runner-level isolation | ⭐⭐⭐⭐⭐ Namespaces, quotas, limits | ⭐⭐⭐⭐⭐ Namespaces, quotas, limits | ⭐⭐⭐ Runner-level isolation |
| Network Policies | ⭐⭐⭐⭐⭐ Full NetworkPolicy support | ⭐⭐⭐⭐⭐ Full NetworkPolicy support | ⭐⭐ Limited to runner config | ⭐⭐⭐⭐⭐ Full NetworkPolicy support | ⭐⭐⭐⭐⭐ Full NetworkPolicy support | ⭐⭐ Limited to runner config |
| Network Bottleneck | ⭐⭐⭐⭐⭐ Separate cluster avoids GitLab bottleneck | ⭐⭐⭐⭐⭐ Separate cluster avoids GitLab bottleneck | ⭐⭐ Limited by GitLab network | ⭐⭐⭐⭐⭐ Separate cluster | ⭐⭐⭐⭐⭐ Separate cluster avoids bottleneck | ⭐⭐ Limited by GitLab network |
| Operational Overhead | ⭐⭐ Operator maintenance, upgrades | ⭐⭐⭐⭐ Minimal (standard Jobs) | ⭐⭐⭐⭐⭐ Minimal | ⭐⭐ Custom operator maintenance | ⭐⭐⭐⭐ Minimal (standard Jobs) | ⭐⭐⭐⭐⭐ Minimal |
| Observability | ⭐⭐⭐⭐⭐ K8s metrics, logs, events | ⭐⭐⭐⭐⭐ K8s logs, easy metric export | ⭐⭐⭐⭐ GitLab logs, artifacts | ⭐⭐⭐⭐ Custom dashboards | ⭐⭐⭐⭐⭐ K8s logs, metric export | ⭐⭐⭐⭐ GitLab logs, artifacts |
| Test Lifecycle | ⭐⭐⭐⭐⭐ Declarative, auto-cleanup | ⭐⭐⭐⭐ TTL for cleanup, simple | ⭐⭐⭐ Script-based management | ⭐⭐⭐ Custom scripts | ⭐⭐⭐⭐ TTL for cleanup | ⭐⭐⭐ Script-based |
| Multi-tenancy | ⭐⭐⭐⭐⭐ Namespace isolation, RBAC | ⭐⭐⭐⭐⭐ Namespace isolation, RBAC | ⭐⭐ Shared runner pool | ⭐⭐⭐⭐⭐ Namespace isolation | ⭐⭐⭐⭐⭐ Namespace isolation, RBAC | ⭐⭐ Shared runner pool |
| Community Support | ⭐⭐⭐⭐⭐ Active Grafana Labs | ⭐⭐⭐⭐⭐ Well-documented pattern | ⭐⭐⭐⭐⭐ Well documented | ⭐⭐⭐ Limited operator support | ⭐⭐⭐⭐⭐ Well-documented | ⭐⭐⭐⭐⭐ Well documented |
| ArgoCD Integration | ⭐⭐⭐⭐⭐ Native GitOps | ⭐⭐⭐⭐ CronJob or manual trigger | N/A (ephemeral) | ⭐⭐⭐⭐ Custom ArgoCD app | ⭐⭐⭐⭐ CronJob or manual | N/A (ephemeral) |
| Debugging | ⭐⭐⭐ K8s pod logs, exec | ⭐⭐⭐⭐ kubectl logs, local Docker test | ⭐⭐⭐⭐⭐ Local Docker run | ⭐⭐⭐ K8s pod logs, exec | ⭐⭐⭐⭐ kubectl logs, local test | ⭐⭐⭐⭐⭐ Local Docker run |
| Image Size | ~120MB | ~120MB | ~120MB | ~500MB (JVM) | ~500MB (JVM) | ~500MB (JVM) |
| Language/Ecosystem | JavaScript/TypeScript | JavaScript/TypeScript | JavaScript/TypeScript | Scala/Java | Scala/Java | Scala/Java |
| Best For | Production-grade, multi-team, high-scale, GitOps | Our use case: Offload from GitLab, avoid network bottleneck, good balance | Quick start, low-scale, simple tests | JVM shops, high-scale | JVM shops, offload from GitLab | JVM shops, quick start |
Detailed Comparison
k6 with K8s Job ⭐ RECOMMENDED
Architecture: GitLab CI triggers K8s Job on separate test cluster → Job executes k6 → Collect results
Strengths:
- Network Isolation: Runs on separate cluster from GitLab, avoiding network bottlenecks to SUT
- Resource Offloading: GitLab server doesn’t bear the load generation workload
- Standard K8s Pattern: Jobs are well-understood, mature, and widely used
- Fast Setup: Standard K8s manifest + kubectl, no operator installation required (2-4 hours to first test)
- Clean Reporting: k6’s excellent JSON/summary output easily collected as artifacts
- Horizontal Scaling: use `completions: N` with coordination for distributed load
- Resource Management: full K8s quotas, limits, and NetworkPolicy support
- Debugging: Can test Jobs locally with same Docker image
- Low Overhead: No operator to maintain, Jobs auto-cleanup with TTL
Weaknesses:
- Manual Coordination: distributed tests need custom coordination logic (vs the operator’s built-in `parallelism`)
- Less Declarative: requires scripting for the test lifecycle (vs operator CRDs)
- No GitOps: Jobs are ephemeral, not continuously reconciled by ArgoCD
Example Usage:
```yaml
# GitLab CI triggers this Job
apiVersion: batch/v1
kind: Job
metadata:
  name: load-test-${CI_PIPELINE_ID}
  namespace: load-testing
spec:
  ttlSecondsAfterFinished: 3600  # Auto-cleanup after 1 hour
  completions: 5                 # Run 5 parallel pods for distributed load
  parallelism: 5
  completionMode: Indexed        # Exposes JOB_COMPLETION_INDEX to each pod
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: k6
          image: grafana/k6:latest
          command:
            - k6
            - run
            - --out=json=/results/output.json
            - --vus=100
            - --duration=5m
            - /scripts/test.js
          volumeMounts:
            - name: test-script
              mountPath: /scripts
            - name: results
              mountPath: /results
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1
              memory: 1Gi
      volumes:
        - name: test-script
          configMap:
            name: test-script-${CI_PIPELINE_ID}
        - name: results
          emptyDir: {}
```

GitLab CI Integration:
```yaml
execute-load-test:
  stage: test
  image: bitnami/kubectl:latest
  script:
    # Create ConfigMap with the test script
    - kubectl create configmap test-script-${CI_PIPELINE_ID}
      --from-file=test.js -n load-testing
    # Create and run the Job
    - envsubst < k8s/job-template.yaml | kubectl apply -f -
    # Wait for completion
    - kubectl wait --for=condition=complete --timeout=30m
      job/load-test-${CI_PIPELINE_ID} -n load-testing
    # Collect results from the Job pods
    - kubectl logs job/load-test-${CI_PIPELINE_ID} -n load-testing > results.log
    # Cleanup
    - kubectl delete configmap test-script-${CI_PIPELINE_ID} -n load-testing
  artifacts:
    paths:
      - results.log
    reports:
      performance: performance.json
```

Distributed Load Pattern:
```sh
# For distributed load, coordinate via env vars.
# With an Indexed Job, each pod receives JOB_COMPLETION_INDEX: 0, 1, 2, 3, 4
export INSTANCE_INDEX=$JOB_COMPLETION_INDEX
export TOTAL_INSTANCES=5
export TOTAL_VUS=500
export VUS_PER_INSTANCE=$((TOTAL_VUS / TOTAL_INSTANCES))
# Give any remainder VUs to the last instance so the totals add up
if [ "$INSTANCE_INDEX" -eq $((TOTAL_INSTANCES - 1)) ]; then
  VUS_PER_INSTANCE=$((VUS_PER_INSTANCE + TOTAL_VUS % TOTAL_INSTANCES))
fi

k6 run \
  --vus=${VUS_PER_INSTANCE} \
  --duration=5m \
  --out=json=/results/output-${INSTANCE_INDEX}.json \
  test.js
```

When to Choose:
- Our use case: Need to offload from GitLab, avoid network bottlenecks
- Want K8s benefits (isolation, quotas, NetworkPolicies) without operator complexity
- Need faster MVP (days vs weeks)
- Team comfortable with K8s but wants simpler lifecycle than operator
- Don’t need GitOps continuous reconciliation
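The distributed pattern leaves one result file per instance (`output-0.json` … `output-4.json`). If each instance exports a JSON summary (shaped like the JSON example earlier; the file names and metric fields are assumptions), the files can be merged with `jq`. A sketch using fabricated sample data:

```sh
# Fabricated per-instance summaries, standing in for real k6 exports
echo '{"metrics":{"http_reqs":{"values":{"count":25000}}}}' > output-0.json
echo '{"metrics":{"http_reqs":{"values":{"count":25000}}}}' > output-1.json

# Sum request counts across all instances
TOTAL=$(jq -s 'map(.metrics.http_reqs.values.count) | add' output-*.json)
echo "total http_reqs: ${TOTAL}"
```

Note that raw `--out json` output is a stream of metric points rather than a summary, so in practice aggregation is often easier via InfluxDB/Grafana than via file merging.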
k6 with Operator
Strengths:
- Official Support: Grafana Labs maintains the operator, ensuring compatibility and updates
- Declarative: define tests as Kubernetes CRDs (`TestRun` resources)
- Horizontal Scaling: set `parallelism: 10` to distribute load across 10 pods automatically
- Resource Management: leverage K8s resource quotas, limits, and autoscaling
- Network Control: Fine-grained NetworkPolicies to restrict test traffic
- GitOps Ready: Deploy via ArgoCD alongside application infrastructure
- Cloud Native: Integrates with service meshes, observability stacks
Weaknesses:
- Setup Time: Requires operator installation, namespace setup, RBAC configuration
- Learning Curve: Team needs to understand CRDs, K8s resource management
- Debugging Complexity: Failures require K8s troubleshooting skills
- Overhead: Operator consumes cluster resources even when idle
Example Usage:
```yaml
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: api-load-test
spec:
  parallelism: 5  # 5 distributed pods
  script:
    configMap:
      name: test-script
  runner:
    resources:
      limits:
        cpu: 1
        memory: 1Gi
```

When to Choose:
- Running tests regularly (daily/weekly regression tests)
- Need to generate >50k RPS
- Multiple teams using shared infrastructure
- Strong K8s skills in team
- Security/isolation requirements (NetworkPolicies)
k6 with Docker (GitLab Runner)
Strengths:
- Simplicity: just run `docker run grafana/k6:latest run test.js`
- Fast Setup: working in under an hour
- Easy Debugging: Run tests locally with same Docker image
- Low Overhead: No persistent cluster resources
- Familiar: Standard GitLab CI patterns
Weaknesses:
- Scale Limits: Single runner caps at ~10k RPS (CPU bound)
- No Distribution: Can’t easily split load across multiple executors
- Resource Contention: Shares resources with other CI jobs
- Limited Isolation: Relies on runner network configuration
- Manual Orchestration: Need custom scripts for distributed tests
Example Usage:
```yaml
# .gitlab-ci.yml
load-test:
  image: grafana/k6:latest
  script:
    - k6 run --vus 100 --duration 5m test.js
  artifacts:
    reports:
      performance: summary.json
```

When to Choose:
- Getting started quickly (proof of concept)
- Infrequent ad-hoc testing
- Low-to-medium load requirements (<10k RPS)
- Small team with limited K8s expertise
- Want to validate approach before operator investment
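The manifests and CI jobs above all mount or reference a `test.js`, but no example script is shown. A minimal k6 script might look like the sketch below; the target URL, stage shape, and threshold values are placeholders, and the script runs only under the k6 runtime (not Node.js):

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 100 },  // ramp up to 100 VUs
    { duration: '3m', target: 100 },  // hold
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    // Fail the run if p(95) latency exceeds 200ms or checks drop below 99%
    http_req_duration: ['p(95)<200'],
    checks: ['rate>0.99'],
  },
};

export default function () {
  // Placeholder endpoint; replace with the federated API under test
  const res = http.get('https://sut.sandbox.example.com/api/v1/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time OK': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
```

The `thresholds` block is what makes the Job exit non-zero on regressions, which is how the K8s Job and GitLab CI patterns above turn a load test into a pass/fail gate.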
Gatling with K8s Job
Architecture: GitLab CI triggers K8s Job → Job executes Gatling simulation → Collect HTML reports
Strengths:
- Network Isolation: Same benefits as k6 - separate cluster from GitLab
- Rich Reports: Gatling’s HTML reports are comprehensive and visual
- Standard Pattern: K8s Jobs are well-understood
- JVM Performance: Excellent for very high load scenarios
- Full Feature Set: All Gatling features available (feeders, checks, DSL)
Weaknesses:
- Larger Images: ~500MB JVM-based images (vs k6’s 120MB)
- Slower Startup: JVM warmup time adds latency
- Resource Intensive: Requires more memory per pod (typically 2Gi vs k6’s 512Mi)
- Coordination Complexity: Distributed Gatling requires Gatling Enterprise or custom scripts
- Language Barrier: Scala/Java less accessible than JavaScript
Example Usage:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gatling-test-${CI_PIPELINE_ID}
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gatling
          image: denvazh/gatling:latest
          command:
            - /opt/gatling/bin/gatling.sh
            - -sf
            - /simulations
            - -s
            - com.example.ApiSimulation
            - -rf
            - /results
          volumeMounts:
            - name: simulations
              mountPath: /simulations
            - name: results
              mountPath: /results
          resources:
            requests:
              cpu: 1
              memory: 2Gi
            limits:
              cpu: 2
              memory: 4Gi
      volumes:
        - name: simulations
          configMap:
            name: gatling-simulation-${CI_PIPELINE_ID}
        - name: results
          emptyDir: {}
```

When to Choose:
- JVM/Scala-based team
- Need Gatling-specific features (recorder, advanced DSL)
- Want to offload from GitLab with JVM tooling
- Very high load requirements (>100k RPS)
- Willing to accept larger resource footprint
Gatling with Operator
Strengths:
- High Performance: JVM-based, excellent for very high loads
- Scala DSL: Powerful test scripting for complex scenarios
- Detailed Reports: Rich HTML reports with drill-down metrics
- Enterprise Features: Commercial support available
Weaknesses:
- Less Mature Operators: No official operator; community solutions vary in quality
- Setup Complexity: May require custom Helm charts or operator development
- Larger Footprint: JVM + dependencies = ~500MB images
- JVM Overhead: Longer startup times, higher memory usage
- Smaller Community: Less K8s-native ecosystem than k6
Example Custom CRD:
```yaml
apiVersion: loadtest.io/v1
kind: GatlingTest
metadata:
  name: api-test
spec:
  simulation: com.example.ApiSimulation
  replicas: 5
  resources:
    requests:
      memory: 2Gi
      cpu: 1
```

When to Choose:
- Team has strong JVM/Scala skills
- Need Gatling’s advanced features (feeders, checks, protocols)
- Willing to maintain custom operator
- Very high scale requirements (>100k RPS)
Gatling with Docker (GitLab Runner)
Strengths:
- Standard Approach: Well-documented Docker execution
- Quick Start: Run without operator complexity
- Flexible: Easy to customize with scripts
- Powerful: Full Gatling feature set available
Weaknesses:
- Large Images: 500MB+ (vs k6’s 120MB)
- Resource Intensive: JVM requires more memory
- Slower Startup: JVM warmup time
- Scala/Java Required: Higher barrier to entry for non-JVM teams
- Manual Scaling: Hard to distribute load
Example Usage:
```yaml
load-test:
  image: denvazh/gatling:latest
  script:
    - gatling.sh -s com.example.ApiSimulation
  artifacts:
    paths:
      - target/gatling/
```

When to Choose:
- JVM-based organization
- Need Gatling-specific features
- Ad-hoc testing without operator investment
- Small-to-medium scale (<20k RPS)
Recommended Decision Path
┌─────────────────────────────────────────────────┐
│ Start: Need load testing framework │
└──────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────┐
│ Need to offload from │ ──Yes──▶ K8s Job (k6 or Gatling)
│ GitLab + avoid network │ ↓
│ bottlenecks? │ Best balance: speed + isolation
└──────┬───────────────────┘
│
No
│
▼
┌──────────────────────────┐
│ Quick PoC only? │ ──Yes──▶ GitLab Runner (k6 or Gatling)
│ (<1 day setup) │ ↓
└──────┬───────────────────┘ Fastest start, limited scale
│
No
│
▼
┌──────────────────────────┐
│ Need GitOps reconciliation│ ──Yes──▶ k6 Operator
│ + max automation? │ ↓
└──────┬───────────────────┘ Production-grade, most overhead
│
No
│
▼
┌──────────────────────────┐
│ JVM-based organization? │ ──Yes──▶ Gatling + K8s Job
└──────┬───────────────────┘
│
No
│
▼
k6 + K8s Job (recommended default)
Our Decision: k6 + K8s Job
Selected Approach: k6 with Kubernetes Jobs
Implementation:
- GitLab CI orchestrates K8s Jobs on separate test cluster
- Jobs execute k6 load tests against SUT in different cluster
- Results collected as GitLab artifacts and exported to InfluxDB
- Horizontal scaling via the Job `completions` parameter with coordination
Rationale Based on Priority Concerns:
- Quality & Ease of Reporting (Priority 1):
- 🎯 GAME CHANGER: We already have Grafana deployed for SUT monitoring
- ✅ k6 wins decisively with existing Grafana:
  - Built-in InfluxDB export: `--out influxdb=...` (one-line config)
  - Official Grafana dashboards: import in 5 minutes
  - Real-time monitoring during test execution
  - Historical trending across test runs
  - Unified observability: monitor load tests AND SUT in the same Grafana instance
- ⚠️ Gatling HTML reports still superior for ad-hoc sharing, BUT:
- Requires plugin for Grafana integration (not built-in)
- Community dashboards (less mature than k6’s official ones)
- Setup time: ~2-3 hours vs k6’s ~30 minutes
- ✅ Implementation Plan:
- Phase 1 (Day 1): k6 text/JSON (2-4 hours to first test)
- Phase 1.5 (Day 2): Connect to existing Grafana (30 minutes)
- Optional: Add k6-reporter for HTML sharing (2 hours)
- ✅ Result: Best of both worlds - Grafana for analysis, optionally HTML for sharing
- Speed to MVP (Priority 2):
- ✅ 2-4 hours to first test (vs 1-2 days for operator)
- ✅ 1-3 days to MVP (vs 1-2 weeks for operator)
- ✅ Standard K8s pattern, no operator installation required
- ✅ Team already familiar with K8s Jobs
- Maturity (Priority 3):
- ✅ K8s Jobs are rock-solid, production-proven pattern
- ✅ k6 is mature, well-supported by Grafana Labs
- ✅ No reliance on less mature operator code paths
- Ease of Use (Priority 4):
- ✅ Standard K8s Job manifests (familiar to team)
- ✅ Simple kubectl commands for management
- ✅ JavaScript test scripts (accessible to team)
- ⚠️ Slight complexity for distributed coordination (acceptable trade-off)
- Ease of Horizontal Scaling (Priority 5):
- ✅ Job `completions: N` for parallel execution
- ✅ Coordination via environment variables (JOB_COMPLETION_INDEX)
- ⚠️ Not as seamless as the operator's `parallelism`, but sufficient for our needs
Additional Benefits:
- Network Isolation: Separate cluster avoids GitLab→SUT network bottleneck
- Resource Offloading: GitLab servers don’t bear load generation workload
- Cost-Effective: No operator overhead, Jobs auto-cleanup with TTL
- Security: Full NetworkPolicy support for SUT access control
Why Not Operator?:
- Operator setup takes 1-2 weeks vs 1-3 days for Jobs
- We don’t need GitOps continuous reconciliation (tests are ephemeral)
- Jobs provide 80% of benefits with 20% of complexity
- Can migrate to operator later if needs evolve
Why Not Gatling Despite Better Out-of-Box Reports?:
- Report quality alone doesn’t offset other factors:
- Larger images (~500MB vs 120MB) → slower startup, more cluster resources
- Higher resource requirements (2Gi+ memory vs 512Mi) → higher costs
- Scala/Java less accessible than JavaScript for our team
- Slower setup time (3-5 hours vs 2-4 hours)
- k6’s reporting story is better long-term:
- Grafana/InfluxDB provides real-time monitoring (not just post-test reports)
- Historical trending across test runs
- Integration with existing observability platform
- Gatling’s HTML reports are static snapshots
- Mitigation for initial reporting gap:
- k6 text/JSON output sufficient for MVP validation
- Optional: k6-reporter for basic HTML reports (2 hours setup)
- Phase 3: Grafana deployment provides production-grade observability
Why Not GitLab Runner Only?:
- Network bottleneck between GitLab and SUT
- Runner resource constraints limit scale (~10k RPS)
- No K8s NetworkPolicy support for SUT isolation
Architecture Overview
┌─────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ GitLab CI/CD │ │ Test Cluster │ │ SUT Cluster │
│ (Orchestration) │ │ (Separate from GL) │ │ (Sandbox SUT) │
│ │ │ │ │ │
│ - Test Repo │ ───▶ │ K8s Jobs (k6) │ ───▶ │ Federated APIs │
│ - Generation │ │ Distributed Pods │ │ (Target System) │
│ - kubectl trigger │ │ TTL Cleanup │ │ │
│ - Artifact collect │ │ │ │ │
└─────────────────────┘ └──────────────────────┘ └──────────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ Test Library │ │ K8s Resources │ │ API Catalog │
│ - Scenarios │ │ - Namespace │ │ - Service Registry │
│ - Templates │ │ - Resource Quotas │ │ - SLO Definitions │
│ - Helpers │ │ - Network Policies │ │ - Auth Configs │
│ - Job manifests │ │ - ConfigMaps │ │ │
└─────────────────────┘ └──────────────────────┘ └──────────────────────┘
Key Benefits:
✓ Offloads load from GitLab servers
✓ Avoids GitLab→SUT network bottleneck
✓ Full K8s isolation and quotas
✓ Fast setup (2-4 hours to first test)
Key Architectural Decisions
1. Test Generation Location: GitLab CI
Decision: Generate tests in GitLab CI pipeline
Alternative: In-cluster Job in sandbox environment
Rationale:
- Better secrets management (GitLab CI variables)
- No cluster resource consumption during generation
- Easier debugging and iteration
- Native GitLab artifact management
- Clear separation of concerns
Trade-offs:
- Requires GitLab runner with kubectl/API access
- Less suitable for very large test suite generation (acceptable for our scale)
2. Test Execution: K8s Jobs on Separate Cluster
Decision: Use Kubernetes Jobs on a dedicated test cluster (separate from GitLab and the SUT)
Alternatives: GitLab Runner execution, k6 Operator
Rationale:
- Network Isolation: Avoids GitLab→SUT network bottleneck by running tests in separate cluster
- Resource Offloading: GitLab servers don’t bear load generation workload
- Fast Setup: Standard K8s Jobs require 2-4 hours vs 1-2 days for operator
- Maturity: K8s Jobs are production-proven, no operator dependencies
- Simplicity: Familiar pattern for team, less operational overhead
- Scaling: Job `completions` parameter enables horizontal scaling
- Security: Full NetworkPolicy support for SUT access control
Trade-offs:
- Manual coordination for distributed tests (vs the operator's built-in `parallelism`)
- No GitOps continuous reconciliation (tests are ephemeral anyway)
- Mitigation: coordination via the JOB_COMPLETION_INDEX environment variable is straightforward
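The coordination logic amounts to simple index arithmetic. A minimal sketch (Python for illustration only; the Job template itself does this in shell, and the function name is ours):

```python
import os

def split_vus(total_vus: int, instances: int) -> list[int]:
    """Distribute total VUs across parallel Job pods.

    Plain integer division drops the remainder (100 VUs / 3 pods covers
    only 99), so the first `total_vus % instances` pods take one extra VU.
    """
    base, remainder = divmod(total_vus, instances)
    return [base + 1 if i < remainder else base for i in range(instances)]

# Each pod picks its own share using the index Kubernetes injects
# (requires completionMode: Indexed on the Job):
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
my_vus = split_vus(100, 3)[index]
```

Note that JOB_COMPLETION_INDEX is only injected when the Job runs with `completionMode: Indexed`.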
3. Test Storage: Hybrid Model
Decision:
- Reusable Components: GitLab repository (version-controlled)
- Generated Tests: Dynamic generation from API catalog
- Results:
- Short-term: GitLab artifacts (30 days)
- Long-term: InfluxDB for trending
- Reports: S3/MinIO for historical analysis
Rationale:
- Version control for test logic and scenarios
- Dynamic generation reduces maintenance burden
- Multiple retention strategies optimize cost and utility
4. Self-Service Pattern: GitLab CI Variables
Decision: Use GitLab CI manual triggers with pipeline variables
Variables:
TEST_SUITE: "api-federation"         # Which test suite
TARGET_ENVIRONMENT: "sandbox-sut-1"  # Target SUT
VIRTUAL_USERS: "100"                 # Concurrent users
TEST_DURATION: "5m"                  # Test duration
RAMP_UP_TIME: "30s"                  # Ramp-up period
TEST_PROFILE: "load"                 # smoke|load|stress|spike

Rationale:
- No custom UI required
- GitLab’s existing RBAC and audit logging
- Easy to trigger via UI, API, or CLI
- Pipeline history provides audit trail
Trade-offs:
- Less user-friendly than dedicated UI
- Mitigation: Good documentation + optional wrapper API for non-technical users
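Triggering "via UI, API, or CLI" means a thin wrapper can reuse GitLab's create-pipeline endpoint with the same variables. A sketch (host, project ID, and token are placeholders, not real values):

```python
import json
import urllib.request

GITLAB_HOST = "https://gitlab.example.com"  # placeholder
PROJECT_ID = 1234                           # placeholder
TOKEN = "..."                               # supply via env var / CI secret in practice

def build_trigger_request(ref: str, variables: dict) -> urllib.request.Request:
    """Build a POST to GitLab's create-pipeline API, passing pipeline variables."""
    payload = {
        "ref": ref,
        "variables": [{"key": k, "value": v} for k, v in variables.items()],
    }
    return urllib.request.Request(
        f"{GITLAB_HOST}/api/v4/projects/{PROJECT_ID}/pipeline",
        data=json.dumps(payload).encode(),
        headers={"PRIVATE-TOKEN": TOKEN, "Content-Type": "application/json"},
        method="POST",
    )

req = build_trigger_request("main", {"TEST_PROFILE": "smoke", "VIRTUAL_USERS": "10"})
# urllib.request.urlopen(req) would start the pipeline
```

This keeps GitLab's RBAC and audit trail in the loop even for programmatic callers.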
5. Network Isolation
Decision: Enforce network policies restricting k6 pods to SUT environment only
NetworkPolicy:
- Allow: DNS resolution
- Allow: Traffic to sandbox-sut namespace only
- Deny: All other egress

Rationale:
- Prevent accidental load testing of non-target systems
- Security isolation between sandbox environments
- Clear blast radius containment
Implementation Plan
Phase 1: Foundation (1-3 days)
Goal: Basic working load test with K8s Jobs
Deliverables:
- K8s namespace configured on test cluster with resource quotas
- Basic GitLab CI pipeline triggering K8s Jobs
- Simple parameterized k6 test example
- Documentation for running first test
Tasks:
- Create `load-testing` namespace on test cluster with resource quotas and NetworkPolicies
- Configure GitLab runner with kubectl access to test cluster
- Create K8s Job manifest template for k6
- Create GitLab CI pipeline that triggers Jobs via kubectl
- Write example k6 test script
- Implement Job result collection (logs → GitLab artifacts)
- (Optional) Set up k6-reporter for basic HTML reports (2 hours)
- Document execution workflow and reporting options
Acceptance Criteria:
- Team member can trigger load test via GitLab UI
- K8s Job executes on test cluster (not GitLab runner)
- Test targets sandbox SUT successfully
- Results collected in GitLab artifacts as JSON/text summary
- (Optionally) Basic HTML report generated
- Job auto-cleans up via TTL (1 hour after completion)
- Documentation explains reporting trade-offs and future Grafana setup
Phase 2: Self-Service & Generation (1-2 weeks)
Goal: Flexible, catalog-driven test generation
Deliverables:
- GitLab CI variables for test customization
- Test generation scripts (template-based)
- Integration with API catalog for dynamic test creation
- Multiple test profiles (smoke, load, stress, spike)
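When defining staged profiles, a quick sanity check is to derive peak VUs and total wall-clock time from the stage list (helper name is ours; the stage values mirror the spike profile defined later in this document):

```python
def profile_summary(stages: list[dict]) -> tuple[int, int]:
    """Return (peak target VUs, total duration in seconds) for k6-style stages."""
    def to_seconds(d: str) -> int:
        units = {"s": 1, "m": 60, "h": 3600}
        return int(d[:-1]) * units[d[-1]]
    peak = max(s["target"] for s in stages)
    total = sum(to_seconds(s["duration"]) for s in stages)
    return peak, total

spike_stages = [
    {"duration": "1m", "target": 50},
    {"duration": "10s", "target": 500},  # spike
    {"duration": "1m", "target": 50},
    {"duration": "10s", "target": 500},  # second spike
    {"duration": "1m", "target": 0},
]
print(profile_summary(spike_stages))  # → (500, 200)
```

Peak VUs feeds the cluster resource-quota sizing; total duration feeds the pipeline timeout.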
Tasks:
- Implement test generation scripts
- Create test scenario library
- Integrate API catalog discovery
- Add test profile configurations
- Create test templates
Acceptance Criteria:
- Tests can be generated from API catalog
- Multiple test profiles selectable
- No code changes required for common scenarios
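Catalog-driven generation can start as a simple mapping from each catalog entry's SLO to a k6 threshold on the tagged sub-metric (the catalog shape and helper name are illustrative assumptions; the `metric{tag:value}` threshold syntax is k6's):

```python
import json

# Hypothetical catalog entries matching what the test template tags requests with
catalog = [
    {"name": "orders", "path": "/v1/orders", "slo_ms": 300},
    {"name": "catalog", "path": "/v1/catalog", "slo_ms": 500},
]

def generate_thresholds(apis: list[dict]) -> dict:
    """Emit one k6 threshold per API: p(95) of http_req_duration, filtered
    by the api_name tag, must stay below that API's SLO."""
    return {
        f"http_req_duration{{api_name:{api['name']}}}": [f"p(95)<{api['slo_ms']}"]
        for api in apis
    }

print(json.dumps(generate_thresholds(catalog), indent=2))
```

The generated dict drops straight into the `thresholds` field of k6's `options`, so adding an API to the catalog adds its SLO check with no test-code change.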
Phase 3: Enhanced Observability & Alerting (3-5 days)
Goal: Leverage existing Grafana for production-grade observability
Deliverables:
- ~~InfluxDB integration~~ (already exists for SUT monitoring)
- k6 → existing Grafana integration (30 minutes)
- Enhanced Grafana dashboards with custom views
- Alerting and notification system
- Baseline comparison and regression detection
Tasks:
- ~~Deploy InfluxDB~~ (already exists)
- Configure k6 → existing InfluxDB in Job manifest (`--out influxdb=...`)
- Import official k6 Grafana dashboard (5 minutes)
- Customize dashboard for your API federation use case
- Create unified dashboard showing load test + SUT metrics side-by-side
- Set up GitLab performance reports for merge request widgets
- Configure Grafana alerts for test failures or SLO breaches
- Implement notification webhooks (Slack/email via Grafana alerting)
- Create baseline metrics storage for regression detection
Acceptance Criteria:
- Real-time metrics visible in existing Grafana during test (not just post-test like Gatling)
- Historical trend data available in existing InfluxDB across multiple test runs
- Grafana dashboards show P50/P75/P95/P99 latencies, throughput, error rates
- Unified view: Load test metrics AND SUT metrics in same dashboard
- GitLab shows performance regression indicators in merge requests
- Grafana alerts team of test failures or performance degradations
- Reporting quality now exceeds Gatling (dynamic vs static, real-time vs post-test, unified observability)
Phase 4: Advanced Features (2-3 weeks)
Goal: Production-ready testing framework
Deliverables:
- Multi-scenario testing (mixed workloads)
- Baseline comparison and regression detection
- Scheduled regression test suite
- SLO-based pass/fail criteria
- Advanced reporting and analytics
Tasks:
- Implement multi-scenario orchestration
- Build baseline metrics storage
- Create regression detection logic
- Set up scheduled test pipelines
- Implement SLO validation
- Build comprehensive report generator
Acceptance Criteria:
- Can run mixed workload tests (multiple APIs concurrently)
- Automatic detection of performance regressions
- Scheduled tests run nightly against main branch
- Tests pass/fail based on SLO thresholds
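The regression-detection logic can start as a tolerance comparison against stored baseline metrics (a sketch; metric names and the 10% tolerance are illustrative choices, not part of the pipeline yet):

```python
def detect_regression(baseline: dict, current: dict, tolerance: float = 0.10) -> list[str]:
    """Flag metrics that degraded more than `tolerance` relative to baseline.

    For latencies and error rates alike, higher is worse, so a single
    'current > baseline * (1 + tolerance)' check covers both.
    """
    regressions = []
    for metric, base_value in baseline.items():
        cur = current.get(metric)
        if cur is not None and cur > base_value * (1 + tolerance):
            regressions.append(f"{metric}: {base_value} -> {cur}")
    return regressions

baseline = {"p95_ms": 420.0, "p99_ms": 880.0, "error_rate": 0.004}
current = {"p95_ms": 510.0, "p99_ms": 900.0, "error_rate": 0.003}
print(detect_regression(baseline, current))  # only p95_ms is >10% worse
```

Run against the InfluxDB-stored results of the previous main-branch run, this is enough to gate a scheduled pipeline pass/fail.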
Technical Specifications
Repository Structure
load-testing-framework/
├── .gitlab-ci.yml # Main CI/CD pipeline
├── README.md # User documentation
├── scripts/
│ ├── generate-test.sh # Test generation from templates
│ ├── generate-from-catalog.js # API catalog integration
│ ├── wait-for-completion.sh # Test monitoring
│ ├── generate-report.sh # Results processing
│ └── validate-config.sh # Configuration validation
├── templates/
│ ├── test-template.js.tpl # k6 test template
│ ├── k6-job.yaml # K8s Job manifest template
│ └── scenarios.yaml.tpl # Scenario configurations
├── tests/
│ ├── scenarios/ # Pre-built test scenarios
│ │ ├── smoke-test.js # Quick sanity check
│ │ ├── load-test.js # Sustained load
│ │ ├── stress-test.js # Breaking point
│ │ └── spike-test.js # Sudden traffic spike
│ ├── helpers/
│ │ ├── auth.js # Authentication helpers
│ │ ├── checks.js # Common assertions
│ │ └── utils.js # Utilities
│ └── api-catalog.json # API definitions (generated)
├── config/
│ ├── environments.yaml # Environment configurations
│ ├── test-profiles.yaml # Load profiles (VUs, duration, etc.)
│ └── slo-thresholds.yaml # Performance SLO definitions
├── k8s/
│ ├── namespace.yaml # load-testing namespace
│ ├── resource-quota.yaml # Resource limits
│ ├── network-policy.yaml # Network isolation
│ └── job-template.yaml # K8s Job template (with envsubst vars)
└── monitoring/
├── grafana-dashboards/ # Grafana dashboard JSON
└── alerting-rules.yaml # Prometheus alerting rules
K8s Job Configuration
Namespace Configuration:
apiVersion: v1
kind: Namespace
metadata:
  name: load-testing
  labels:
    environment: test-cluster
    purpose: performance-testing
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: load-testing-quota
  namespace: load-testing
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"  # Allow multiple concurrent test Jobs
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: k6-job-egress-restriction
  namespace: load-testing
spec:
  podSelector:
    matchLabels:
      job-type: load-test  # Applied to all k6 Job pods
  policyTypes:
    - Egress
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow InfluxDB for metrics export
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8086
    # Allow ONLY SUT traffic. An empty podSelector matches pods in THIS
    # namespace only; for a SUT in a different cluster, replace with an
    # ipBlock covering the SUT's LoadBalancer/ingress IPs.
    - to:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 80
# Note: Adjust based on actual SUT cluster connectivity pattern
# (LoadBalancer IP, cross-cluster mesh, etc.)

K8s Job Template:
# templates/k6-job.yaml
# Uses envsubst-style ${VAR} placeholders to match the pipeline's envsubst step.
# Substitute with an explicit variable list (see the execute stage) so runtime
# shell expressions such as ${JOB_COMPLETION_INDEX:-0} survive untouched.
apiVersion: batch/v1
kind: Job
metadata:
  name: load-test-${CI_PIPELINE_ID}
  namespace: load-testing
  labels:
    app: k6
    job-type: load-test
    pipeline-id: "${CI_PIPELINE_ID}"
    test-suite: "${TEST_SUITE}"
spec:
  ttlSecondsAfterFinished: 3600  # Cleanup after 1 hour
  completions: ${PARALLELISM}    # Number of parallel pods (default from CI variable)
  parallelism: ${PARALLELISM}
  completionMode: Indexed        # Required for pods to receive JOB_COMPLETION_INDEX
  backoffLimit: 0                # Don't retry failed tests
  template:
    metadata:
      labels:
        app: k6
        job-type: load-test
        pipeline-id: "${CI_PIPELINE_ID}"
    spec:
      restartPolicy: Never
      containers:
        - name: k6
          image: grafana/k6:0.48.0  # Pin version for reproducibility
          command:
            - sh
            - -c
            - |
              # Calculate this instance's share of VUs
              TOTAL_VUS=${VIRTUAL_USERS}
              INSTANCE_INDEX=${JOB_COMPLETION_INDEX:-0}
              TOTAL_INSTANCES=${PARALLELISM}
              VUS_PER_INSTANCE=$((TOTAL_VUS / TOTAL_INSTANCES))
              # Pass VUs/duration via --env so the profile stages defined in the
              # script stay in control (CLI --vus/--duration would override them).
              # --summary-export writes the end-of-test summary the report stage parses.
              k6 run \
                --env VIRTUAL_USERS=${VUS_PER_INSTANCE} \
                --env TEST_DURATION=${TEST_DURATION} \
                --env RAMP_UP_TIME=${RAMP_UP_TIME} \
                --summary-export=/results/summary.json \
                --out influxdb=http://influxdb.monitoring:8086/k6 \
                --tag testrun=${CI_PIPELINE_ID} \
                --tag instance=${INSTANCE_INDEX} \
                /scripts/test.js
              # Emit the summary to stdout (emptyDir contents vanish with the pod,
              # so the pipeline recovers results via kubectl logs)
              echo "=== Test Instance ${INSTANCE_INDEX} Complete ==="
              echo "===SUMMARY==="
              cat /results/summary.json
          env:
            - name: TARGET_BASE_URL
              value: "${TARGET_BASE_URL}"
            - name: TEST_PROFILE
              value: "${TEST_PROFILE}"
          volumeMounts:
            - name: test-script
              mountPath: /scripts
              readOnly: true
            - name: results
              mountPath: /results
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1
              memory: 1Gi
      volumes:
        - name: test-script
          configMap:
            name: test-script-${CI_PIPELINE_ID}
        - name: results
          emptyDir: {}

GitLab CI Pipeline
Core Pipeline (.gitlab-ci.yml):
stages:
  - validate
  - generate
  - execute
  - report

variables:
  K6_NAMESPACE: load-testing
  K8S_CLUSTER: test-cluster  # Separate cluster from GitLab
  TARGET_SUT_BASE_URL: "https://api.sandbox-sut-1.example.com"
  # Pipeline variables for self-service
  TEST_SUITE:
    value: "api-federation"
    description: "Test suite to run"
  TARGET_ENVIRONMENT:
    value: "sandbox-sut-1"
    description: "Target SUT environment"
  VIRTUAL_USERS:
    value: "100"
    description: "Total virtual users across all instances"
  PARALLELISM:
    value: "1"
    description: "Number of parallel Job instances (for distributed load)"
  TEST_DURATION:
    value: "5m"
    description: "Test duration (30s, 5m, 1h)"
  RAMP_UP_TIME:
    value: "30s"
    description: "Ramp-up duration"
  TEST_PROFILE:
    value: "load"
    description: "Test profile: smoke|load|stress|spike"

# Allow manual triggers with parameters
workflow:
  rules:
    - if: $CI_PIPELINE_SOURCE == "web"
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

# Validate configuration
validate:
  stage: validate
  image: grafana/k6:latest
  script:
    - echo "Validating test configuration..."
    - ./scripts/validate-config.sh
    - k6 inspect tests/scenarios/${TEST_PROFILE}-test.js
  rules:
    - if: $CI_PIPELINE_SOURCE == "web" || $CI_PIPELINE_SOURCE == "schedule"

# Generate test script and Job manifest
generate:
  stage: generate
  image: alpine:latest
  before_script:
    - apk add --no-cache curl jq gettext
  script:
    - echo "Generating tests for ${TEST_SUITE}..."
    - mkdir -p generated
    - |
      # Fetch API catalog
      curl -s https://api-catalog.example.com/apis \
        -H "Authorization: Bearer ${API_CATALOG_TOKEN}" \
        > tests/api-catalog.json
    - |
      # Export environment variables for template substitution
      export TEST_NAME="load-test-${CI_PIPELINE_ID}"
      export TARGET_BASE_URL="https://api.${TARGET_ENVIRONMENT}.example.com"
      export TIMESTAMP=$(date +%s)
    - |
      # Generate k6 test script. Pass an explicit variable list so JS template
      # literals such as `${api.name}` inside the script are left untouched.
      envsubst '${TEST_SUITE} ${TARGET_ENVIRONMENT} ${CI_PIPELINE_ID}' \
        < templates/test-template.js.tpl > generated/test.js
    - |
      # Generate the K8s Job manifest (Jobs, not the operator's TestRun CRD)
      envsubst '${CI_PIPELINE_ID} ${TEST_SUITE} ${PARALLELISM} ${VIRTUAL_USERS} ${TEST_DURATION} ${RAMP_UP_TIME} ${TARGET_BASE_URL} ${TEST_PROFILE}' \
        < templates/k6-job.yaml > generated/k6-job.yaml
    - echo "Generated test configuration:"
    - cat generated/k6-job.yaml
  artifacts:
    paths:
      - generated/
      - tests/api-catalog.json
    expire_in: 7 days

# Execute load test via K8s Job
execute:
  stage: execute
  image: bitnami/kubectl:latest
  before_script:
    # Configure kubectl to access test cluster (separate from GitLab)
    - kubectl config use-context ${K8S_CLUSTER}
  script:
    - echo "Creating k6 test resources on test cluster..."
    - echo "Job will generate load from ${K8S_CLUSTER} targeting ${TARGET_ENVIRONMENT}"
    - |
      # Create ConfigMap with test script and API catalog
      kubectl create configmap test-script-${CI_PIPELINE_ID} \
        --from-file=test.js=generated/test.js \
        --from-file=api-catalog.json=tests/api-catalog.json \
        -n ${K6_NAMESPACE} \
        --dry-run=client -o yaml | kubectl apply -f -
    - |
      # Apply the Job manifest rendered in the generate stage
      kubectl apply -f generated/k6-job.yaml -n ${K6_NAMESPACE}
    - echo "K8s Job 'load-test-${CI_PIPELINE_ID}' created with ${PARALLELISM} parallel instances"
    - echo "Each instance will run ${VIRTUAL_USERS}/${PARALLELISM} virtual users"
    - |
      # Wait for Job completion (all pods must succeed)
      echo "Waiting for test completion (timeout: 30m)..."
      kubectl wait --for=condition=complete \
        --timeout=30m \
        job/load-test-${CI_PIPELINE_ID} \
        -n ${K6_NAMESPACE}
    - echo "Collecting test results from all Job instances..."
    - mkdir -p results
    - |
      # Collect logs from all Job pods
      kubectl logs \
        -l job-name=load-test-${CI_PIPELINE_ID} \
        -n ${K6_NAMESPACE} \
        --all-containers=true \
        --prefix=true \
        > results/k6-full-output.log
    - |
      # Extract the human-readable summary from each pod, and recover the first
      # pod's summary JSON (printed after a ===SUMMARY=== marker by the Job)
      # for the report stage
      for pod in $(kubectl get pods -l job-name=load-test-${CI_PIPELINE_ID} -n ${K6_NAMESPACE} -o name); do
        echo "=== Results from $pod ===" >> results/k6-summary.log
        kubectl logs $pod -n ${K6_NAMESPACE} | grep -A 50 "execution:" >> results/k6-summary.log || true
      done
      first_pod=$(kubectl get pods -l job-name=load-test-${CI_PIPELINE_ID} -n ${K6_NAMESPACE} -o name | head -n1)
      kubectl logs $first_pod -n ${K6_NAMESPACE} | sed -n '/^===SUMMARY===$/,$p' | tail -n +2 > results/summary.json || true
    - echo "Test execution complete. Results collected."
  after_script:
    # Job auto-cleans via ttlSecondsAfterFinished (1 hour); remove the ConfigMap now
    - kubectl delete configmap test-script-${CI_PIPELINE_ID} -n ${K6_NAMESPACE} || true
  artifacts:
    when: always
    paths:
      - results/
    expire_in: 30 days
  environment:
    name: test-cluster
    url: https://api.${TARGET_ENVIRONMENT}.example.com
  timeout: 35m  # Slightly longer than the Job wait timeout

# Generate and publish reports
report:
  stage: report
  image: python:3.11-slim
  before_script:
    - pip install -q k6-report-generator
  script:
    - echo "Generating performance reports..."
    - ./scripts/generate-report.sh results/k6-full-output.log
    - |
      # Parse the k6 --summary-export JSON into GitLab's performance-report format
      python -c "
      import json
      with open('results/summary.json') as f:
          data = json.load(f)
      m = data['metrics']
      gitlab_perf = {
          'metrics': [
              {'name': 'http_req_duration_p95', 'value': m['http_req_duration']['p(95)']},
              # p(99) is present only if summaryTrendStats includes it in the k6 options
              {'name': 'http_req_duration_p99', 'value': m['http_req_duration'].get('p(99)', 0)},
              {'name': 'http_req_failed_rate', 'value': m['http_req_failed']['value']},
              {'name': 'http_reqs_total', 'value': m['http_reqs']['count']},
              {'name': 'vus_max', 'value': m['vus_max']['value']},
          ]
      }
      with open('performance.json', 'w') as f:
          json.dump(gitlab_perf, f, indent=2)
      "
    - echo "Performance summary:"
    - cat performance.json
  artifacts:
    when: always
    reports:
      performance: performance.json
    paths:
      - results/report.html
      - results/summary.json
      - performance.json
    expire_in: 30 days
  dependencies:
    - execute

Test Script Template
Template (templates/test-template.js.tpl):
import http from 'k6/http';
import { check, group, sleep } from 'k6';
import { Rate, Trend, Counter } from 'k6/metrics';
import { SharedArray } from 'k6/data';

// NOTE: only ${TEST_SUITE}, ${TARGET_ENVIRONMENT} and ${CI_PIPELINE_ID} are
// replaced by envsubst at generation time (explicit variable list), so JS
// template literals elsewhere in this file are left untouched.

// Custom metrics
const errorRate = new Rate('errors');
const apiDuration = new Trend('api_duration');
const apiCalls = new Counter('api_calls');

// Load API catalog (open() is only available in the init context)
const apis = new SharedArray('apis', function () {
  return JSON.parse(open('./api-catalog.json'));
});

// Configuration from environment
const BASE_URL = __ENV.TARGET_BASE_URL || 'https://api.sandbox-sut-1.example.com';
const VUS = parseInt(__ENV.VIRTUAL_USERS || '10');
const DURATION = __ENV.TEST_DURATION || '1m';
const RAMP_UP = __ENV.RAMP_UP_TIME || '30s';
const PROFILE = __ENV.TEST_PROFILE || 'load';

// Load profile configurations
const profiles = {
  smoke: {
    stages: [
      { duration: '1m', target: 5 },
    ],
    thresholds: {
      'http_req_duration': ['p(95)<1000'],
      'http_req_failed': ['rate<0.05'],
    },
  },
  load: {
    stages: [
      { duration: RAMP_UP, target: VUS * 0.5 },
      { duration: DURATION, target: VUS },
      { duration: '30s', target: 0 },
    ],
    thresholds: {
      'http_req_duration': ['p(95)<500', 'p(99)<1000'],
      'http_req_failed': ['rate<0.01'],
    },
  },
  stress: {
    stages: [
      { duration: '2m', target: VUS },
      { duration: '5m', target: VUS * 2 },
      { duration: '2m', target: VUS * 3 },
      { duration: '5m', target: VUS },
      { duration: '2m', target: 0 },
    ],
    thresholds: {
      'http_req_duration': ['p(95)<1000', 'p(99)<2000'],
      'http_req_failed': ['rate<0.05'],
    },
  },
  spike: {
    stages: [
      { duration: '1m', target: VUS },
      { duration: '10s', target: VUS * 5 }, // Spike
      { duration: '1m', target: VUS },
      { duration: '10s', target: VUS * 5 }, // Second spike
      { duration: '1m', target: 0 },
    ],
    thresholds: {
      'http_req_duration': ['p(95)<1500', 'p(99)<3000'],
      'http_req_failed': ['rate<0.10'],
    },
  },
};

// Apply selected profile
export const options = {
  ...profiles[PROFILE],
  tags: {
    test_suite: '${TEST_SUITE}',
    environment: '${TARGET_ENVIRONMENT}',
    pipeline_id: '${CI_PIPELINE_ID}',
  },
  noConnectionReuse: false,
  userAgent: 'k6-load-test/${CI_PIPELINE_ID}',
};

// Setup function (runs once before the test starts, not once per VU)
export function setup() {
  console.log(`Starting ${PROFILE} test with ${VUS} VUs for ${DURATION}`);
  console.log(`Target: ${BASE_URL}`);
  console.log(`APIs under test: ${apis.length}`);
  return {
    apis: apis,
    baseUrl: BASE_URL,
  };
}

// Main test function (one iteration per VU loop)
export default function (data) {
  const api = data.apis[Math.floor(Math.random() * data.apis.length)];

  group(`API: ${api.name}`, () => {
    const url = `${data.baseUrl}${api.path}`;
    const params = {
      headers: {
        'Content-Type': 'application/json',
        'X-Test-Pipeline': '${CI_PIPELINE_ID}',
        ...(api.headers || {}),
      },
      tags: {
        api_name: api.name,
        api_path: api.path,
      },
      timeout: api.timeout_ms || '30s',
    };

    const response = http.get(url, params);

    // Record metrics
    apiCalls.add(1);
    apiDuration.add(response.timings.duration, { api: api.name });

    // Validate response
    const checkResults = check(response, {
      'status is 200': (r) => r.status === 200,
      'response time OK': (r) => r.timings.duration < (api.slo_ms || 500),
      'has valid body': (r) => r.body && r.body.length > 0,
      'no errors in response': (r) => !r.json('error'),
    });
    errorRate.add(!checkResults);

    // Log failures
    if (!checkResults) {
      console.error(`API ${api.name} failed: status=${response.status}, duration=${response.timings.duration}ms`);
    }
  });

  // Think time
  sleep(Math.random() * 2 + 1);
}

// Teardown function (runs once after the test)
export function teardown(data) {
  console.log('Test completed');
}

Test Profile Configurations
File: config/test-profiles.yaml
profiles:
  smoke:
    description: "Quick sanity check with minimal load"
    virtualUsers: 5
    duration: 1m
    rampUp: 10s
    thresholds:
      p95: 1000ms
      p99: 2000ms
      errorRate: 5%
  load:
    description: "Sustained load test at expected traffic levels"
    virtualUsers: 100
    duration: 5m
    rampUp: 30s
    thresholds:
      p95: 500ms
      p99: 1000ms
      errorRate: 1%
  stress:
    description: "Push beyond normal load to find breaking point"
    virtualUsers: 200
    duration: 10m
    rampUp: 2m
    stages:
      - duration: 2m
        target: 100
      - duration: 5m
        target: 200
      - duration: 2m
        target: 300
      - duration: 1m
        target: 0
    thresholds:
      p95: 1000ms
      p99: 2000ms
      errorRate: 5%
  spike:
    description: "Sudden traffic spikes to test auto-scaling"
    virtualUsers: 150
    duration: 5m
    stages:
      - duration: 1m
        target: 50
      - duration: 10s
        target: 500  # Spike
      - duration: 1m
        target: 50
      - duration: 10s
        target: 500  # Second spike
      - duration: 1m
        target: 0
    thresholds:
      p95: 1500ms
      p99: 3000ms
      errorRate: 10%
  soak:
    description: "Extended duration test for stability and memory leaks"
    virtualUsers: 50
    duration: 2h
    rampUp: 5m
    thresholds:
      p95: 500ms
      p99: 1000ms
      errorRate: 1%

Monitoring Integration
InfluxDB Export Configuration (shown as the k6 operator's TestRun CRD for reference; with our chosen K8s Job approach, the same `--out` arguments and `K6_INFLUXDB_*` env vars go directly on the Job container):
apiVersion: k6.io/v1alpha1
kind: TestRun
spec:
  script:
    configMap:
      name: test-script
  arguments: |
    --out influxdb=http://influxdb.monitoring:8086/k6
    --tag testrun=${CI_PIPELINE_ID}
    --tag suite=${TEST_SUITE}
    --tag environment=${TARGET_ENVIRONMENT}
    --tag branch=${CI_COMMIT_BRANCH}
  runner:
    env:
      - name: K6_INFLUXDB_INSECURE
        value: "false"
      - name: K6_INFLUXDB_USERNAME
        valueFrom:
          secretKeyRef:
            name: influxdb-credentials
            key: username
      - name: K6_INFLUXDB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: influxdb-credentials
            key: password

Grafana Dashboard JSON (excerpt):
{
  "dashboard": {
    "title": "k6 Load Test Dashboard",
    "panels": [
      {
        "title": "HTTP Request Duration (p95/p99)",
        "type": "graph",
        "targets": [
          { "query": "SELECT percentile(\"value\", 95) FROM \"http_req_duration\" WHERE \"testrun\"='$testrun' GROUP BY time(10s)" },
          { "query": "SELECT percentile(\"value\", 99) FROM \"http_req_duration\" WHERE \"testrun\"='$testrun' GROUP BY time(10s)" }
        ]
      },
      {
        "title": "Requests Per Second",
        "type": "graph",
        "targets": [
          { "query": "SELECT derivative(mean(\"value\"), 1s) FROM \"http_reqs\" WHERE \"testrun\"='$testrun' GROUP BY time(10s)" }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          { "query": "SELECT mean(\"value\") FROM \"http_req_failed\" WHERE \"testrun\"='$testrun' GROUP BY time(10s)" }
        ]
      },
      {
        "title": "Virtual Users",
        "type": "graph",
        "targets": [
          { "query": "SELECT max(\"value\") FROM \"vus\" WHERE \"testrun\"='$testrun' GROUP BY time(10s)" }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "testrun",
          "type": "query",
          "query": "SHOW TAG VALUES WITH KEY = \"testrun\"",
          "current": {
            "text": "auto",
            "value": "$__auto_interval_testrun"
          }
        }
      ]
    }
  }
}

Alternative Approaches Considered
Note: See the “Operator vs Non-Operator Deployment Comparison” section above for a comprehensive decision matrix comparing all execution approaches.
Alternative 1: Simple Docker-Based Execution
Approach: Run k6 directly in GitLab runner containers without K8s operator
Pros:
- Simpler initial setup (no operator required)
- Faster to implement (1-2 hours vs 1-2 days)
- Lower operational overhead (no operator maintenance)
- Easy local testing and debugging
Cons:
- Limited scaling (~10k RPS per runner, CPU bound)
- Less resource isolation (shared runner resources)
- No distributed load generation (manual orchestration required)
- Harder to implement network policies (runner-level only)
Decision: Use this as an optional Phase 0 quick-start, then move to K8s Jobs on the separate test cluster (our selected approach) for scale
Rationale: Provides immediate value for validation while the test-cluster infrastructure is being established. See the decision matrix above for a detailed comparison.
Alternative 2: Locust (Python-based)
Approach: Use Locust for Python-native load testing
Pros:
- Python-friendly (good for teams with Python expertise)
- Web UI for monitoring
- Distributed mode available
Cons:
- Less Kubernetes-native
- Heavier resource footprint
- Less modern metrics/observability
- Smaller community compared to k6
Decision: Rejected in favor of k6’s better K8s integration
Alternative 3: Managed Service (k6 Cloud, Grafana Cloud)
Approach: Use commercial k6 Cloud service
Pros:
- Zero infrastructure management
- Excellent reporting and analytics
- Global load generation locations
Cons:
- Cost per test run
- External dependency
- Data egress concerns (API catalog, secrets)
- Less control over execution environment
Decision: Rejected for initial implementation; revisit for global load testing needs
Alternative 4: On-Demand REST API Wrapper
Approach: Build REST API service that wraps k6 execution
Pros:
- More user-friendly than GitLab UI
- Custom UI possibilities
- Better programmatic integration
Cons:
- Additional service to maintain
- Reinvents GitLab’s workflow orchestration
- Requires authentication/authorization implementation
Decision: Defer to Phase 5 if self-service adoption is insufficient
Success Metrics
Adoption Metrics
- Target: 80% of teams use load testing before production deployments
- Measure: GitLab pipeline executions, unique user count
Performance Metrics
- Test Execution Time: <10 minutes for standard load tests
- Test Setup Time: <5 minutes from trigger to execution start
- Resource Utilization: <50% of sandbox-test cluster capacity
Quality Metrics
- Test Reliability: >95% successful test runs (not counting legitimate failures)
- False Positive Rate: <5% of test failures are infrastructure-related
Efficiency Metrics
- Time to Create New Test: <30 minutes for catalog-based tests
- Test Maintenance Burden: <2 hours/week team-wide
Risk Assessment
Technical Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| K8s Job failures | Medium | Low | Standard pattern, use backoffLimit: 0, log all failures |
| Test cluster resource exhaustion | High | Medium | Strict resource quotas, Job TTL cleanup, monitoring |
| Network bottleneck (test cluster → SUT) | Medium | Low | Use separate cluster, monitor bandwidth, tune parallelism |
| Network policy misconfiguration | High | Low | Thorough testing, clear documentation, dry-run validation |
| Test generation failures | Medium | Medium | Validation stage, dry-run mode, schema validation |
| Metric collection failures | Medium | Low | Multiple collection methods (logs + InfluxDB), retry logic |
| Job coordination errors (distributed tests) | Medium | Low | Test coordination logic thoroughly, use JOB_COMPLETION_INDEX |
Operational Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Accidental production testing | Critical | Low | Network policies, namespace restrictions, clear naming |
| Test maintenance burden | Medium | High | Catalog-driven generation, reusable components |
| Low adoption | Medium | Medium | Good documentation, training, easy onboarding |
| Cost overrun (compute resources) | Medium | Low | Resource quotas, time limits, monitoring |
Security Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Credential exposure in tests | High | Medium | GitLab secrets, vault integration, no hardcoded secrets |
| Unauthorized access to SUT | High | Low | GitLab RBAC, K8s RBAC, audit logging |
| DDoS-like impact on SUT | Medium | Medium | Rate limiting, circuit breakers, clear communication |
Open Questions
- InfluxDB: Do we have an existing InfluxDB instance, or do we need to deploy one?
  - Action: Check with platform team
- API Catalog Integration: What format is the API catalog in? REST API, config file, service mesh?
  - Action: Review API catalog documentation
- Authentication: How should tests authenticate to federated APIs? OAuth2, API keys, mTLS?
  - Action: Align with security team on test account strategy
- Scheduled Tests: Should we run nightly regression tests? Which APIs?
  - Action: Define with product team
- SLO Definitions: Do we have formal SLOs for federated APIs?
  - Action: Work with API producers to define/document
- Cross-Sandbox Communication: Are there existing network policies between sandbox environments?
  - Action: Review with network team
- Cost Allocation: Should we track and charge back load testing costs per team?
  - Action: Discuss with finance/platform teams
References
Documentation
Internal Resources
- .ai/steering/argocd-development-workflow.md - ArgoCD patterns
- .ai/steering/docker-image-workflow.md - Container build patterns
- .ai/steering/testing-standards.md - Testing guidelines
- API Catalog documentation (TBD)
- Sandbox environment inventory (TBD)
Example Projects (TBD)
Next Steps
- Immediate (This Week):
  - Review and approve this decision record
  - Answer open questions
  - Assign owner for implementation
- Short Term (Next Sprint):
  - Create implementation project (BMAD or Codev format)
  - Set up development environment
  - Begin Phase 1 implementation
- Medium Term (Next Month):
  - Complete Phase 1 foundation
  - Conduct pilot with 2-3 teams
  - Gather feedback and iterate
- Long Term (Next Quarter):
  - Complete all phases
  - Full team rollout
  - Integration with CI/CD pipeline standards
Approval
Proposed By: Platform Engineering Team
Date: 2026-02-04
Reviewers:
- Platform Architecture Lead
- API Federation Team Lead
- Security Team
- SRE Team
Status: Awaiting Review
Last Updated: 2026-02-04
Version: 1.0