Multi-Pod Metrics Aggregation Analysis
Date: 2026-02-24
Status: Investigation
Priority: HIGH (Blocks final decision between k6 and Gatling)
Executive Summary
Critical Finding: Both k6 and Gatling have multi-pod metrics aggregation challenges, BUT:
- Gatling: Apparently solves this by combining metrics into single reports
- k6: No built-in multi-pod aggregation; requires custom solution
- Trust Gap: Current understanding of both approaches is insufficient for confidence
Decision Point: Depends on whether Gatling’s multi-pod aggregation is mathematically correct
The Multi-Pod Aggregation Problem
What Happens With Multiple Load Generator Pods
When distributing load across 3 identical pods:
Architecture:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │
│ 100 VUs │ │ 100 VUs │ │ 100 VUs │
│ k6/Gatl │ │ k6/Gatl │ │ k6/Gatl │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
[SUT: Single system under test - receives 300 VU worth of load]
The Metrics Problem
Each pod generates its own JSON summary with aggregated metrics:

```
// Pod 1
{
  "metrics": {
    "http_req_duration": {
      "avg": 100,
      "min": 50,
      "max": 500,
      "p(95)": 150,
      "p(99)": 250,
      "count": 50000
    },
    "http_reqs": 50000
  }
}

// Pod 2
{
  "metrics": {
    "http_req_duration": {
      "avg": 105,
      "min": 55,
      "max": 510,
      "p(95)": 155,
      "p(99)": 255,
      "count": 50000
    },
    "http_reqs": 50000
  }
}

// Pod 3
{
  "metrics": {
    "http_req_duration": {
      "avg": 98,
      "min": 48,
      "max": 495,
      "p(95)": 145,
      "p(99)": 245,
      "count": 50000
    },
    "http_reqs": 50000
  }
}
```

The Naive Merge (WRONG ❌)
```
// INCORRECT MERGE
{
  "http_req_duration": {
    "avg": (100 + 105 + 98) / 3 = 101,    // ❌ Averaging averages!
    "p(95)": (150 + 155 + 145) / 3 = 150  // ❌ Averaging percentiles!
  },
  "http_reqs": 150000  // ✅ Correct
}
```

Why This Is Wrong:
- Averaging averages loses weight information: an unweighted mean of pod means is only correct when every pod served exactly the same number of requests; with unequal counts, each pod's mean must be weighted by its request count
- Averaging percentiles is meaningless: the true p95 must be calculated from all 150,000 raw request times, not from the three p95 values
- Example: If Pod 1 has p95=150ms and Pods 2 & 3 each have p95=300ms, the averaged value is 250ms, but the true p95 of the combined data will sit much closer to 300ms, because two thirds of all samples come from the slower pods
The Correct Merge (RIGHT ✅)
```
// CORRECT MERGE (requires raw data)
// Collect ALL 150,000 raw request times
ALL_REQUESTS = [
  ...Pod1_50000_requests,  // [100, 102, 101, 99, ...]
  ...Pod2_50000_requests,  // [105, 107, 104, 103, ...]
  ...Pod3_50000_requests   // [98, 100, 99, 97, ...]
]

// Recalculate from raw data
{
  "http_req_duration": {
    "avg": sum(ALL_REQUESTS) / 150000,      // ✅ Correct weighted average
    "p(95)": quantile(ALL_REQUESTS, 0.95),  // ✅ True 95th percentile
    "p(99)": quantile(ALL_REQUESTS, 0.99)   // ✅ True 99th percentile
  },
  "http_reqs": 150000  // ✅ Correct
}
```

Current State: k6 Operator
How k6 Operator Handles Multi-Pod
Current Approach (you’re using this):
- k6 Operator runs test across N pods
- Each pod generates a JSON summary to `/tmp/k6-summary.json` or similar
- Results are collected from each pod
- Merging happens externally (you manage this)
Current Problem:
- ❌ No built-in aggregation of raw request-level data
- ❌ Only summary-level metrics available per pod
- ❌ Naive merging of summaries is mathematically unsound
- ❌ You’re “not confident” in the approach → Rightfully so
What Would Be Needed to Fix k6:
- Modify k6 operator to stream raw request data to shared database (InfluxDB/Prometheus)
- Post-test aggregation logic collects all raw data and recalculates metrics
- OR: Use xk6 extension to push raw data in real-time
- OR: Modify k6 jobs to write raw request JSON, merge those, then aggregate
Effort: 3-5 hours to implement + testing
Current State: Gatling
How Gatling Handles Multi-Pod
Claimed Behavior (from your testing):
- Gatling combines metrics into single report
- Shows totals/aggregates, not pod-by-pod breakdown
- Professional HTML report
Critical Questions NOT YET ANSWERED:
1. Does Gatling collect raw request data?
   - If YES → Can recalculate percentiles correctly ✅
   - If NO → Gatling also has the averaging-averages problem ❌
2. How does Gatling aggregate across pods?
   - Does each pod write to a shared database? (Good approach)
   - Does a lead pod collect results from all pods? (Potential single point of failure)
   - Post-test aggregation or real-time?
3. What are the limitations?
   - Maximum pod count before aggregation breaks?
   - Data loss scenarios?
   - Correctness validation?
Next Steps to Validate: Need to investigate Gatling’s actual multi-pod architecture
Mathematical Correctness Framework
Key Principle: You Cannot Calculate Percentiles from Percentiles
Given: p95 values from 3 pods = [150ms, 155ms, 145ms]
Question: What is true p95 for merged data?
Possible answers:
A) Average: (150 + 155 + 145) / 3 = 150ms ❌ WRONG
B) Maximum: max(150, 155, 145) = 155ms ❌ Often wrong
C) Minimum: min(150, 155, 145) = 145ms ❌ Often wrong
D) Must recalculate from ALL 150,000 raw times ✅ CORRECT
Why? Because the true p95 depends on the distribution shape, not just the pod p95 values.
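The principle can be demonstrated with a quick simulation. The sketch below (pure Python, using made-up latency distributions: one fast pod and two slower pods) compares the naive average-of-p95s against the true p95 recalculated from the pooled raw samples:

```python
import math
import random

def quantile(samples, q):
    """Nearest-rank quantile: smallest value with at least a fraction q of samples at or below it."""
    s = sorted(samples)
    return s[max(0, math.ceil(q * len(s)) - 1)]

random.seed(42)
# Hypothetical per-pod latency samples (ms): pod 1 is fast, pods 2-3 are slower.
pod1 = [random.gauss(100, 20) for _ in range(50_000)]
pod2 = [random.gauss(200, 40) for _ in range(50_000)]
pod3 = [random.gauss(200, 40) for _ in range(50_000)]

per_pod_p95 = [quantile(p, 0.95) for p in (pod1, pod2, pod3)]
averaged_p95 = sum(per_pod_p95) / 3            # answer A: the naive merge
true_p95 = quantile(pod1 + pod2 + pod3, 0.95)  # answer D: recalculated from raw data

print(f"averaged p95: {averaged_p95:.0f} ms")
print(f"true p95:     {true_p95:.0f} ms")
```

With these invented distributions the averaged value understates the true p95 by tens of milliseconds, because two thirds of the pooled samples come from the slower pods.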
Rules for Correct Multi-Pod Aggregation
| Metric Type | Can Aggregate From Summaries? | Method |
|---|---|---|
| Count/Total | ✅ YES | Sum pod counts |
| Average/Mean | ⚠️ MAYBE | Weighted average (count-weighted) |
| Min/Max | ✅ YES | Min of mins, max of maxes |
| Percentiles (p50/p95/p99) | ❌ NO | Must use raw data |
| Standard Deviation | ❌ NO | Need raw data to recalculate |
| Error Rates | ✅ YES | Sum failures / sum total |
| Throughput (req/s) | ✅ YES | Sum pod rates (valid when pods run over the same time window) |
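A minimal merge routine following these rules might look like the sketch below (field names are assumptions, not any tool's actual schema). Percentiles and standard deviation are deliberately absent because, per the table, they cannot be derived from summaries:

```python
def merge_summaries(summaries):
    """Merge per-pod summary dicts using only the operations the table allows."""
    total = sum(s["count"] for s in summaries)
    return {
        "count": total,                                                # sum pod counts
        "avg": sum(s["avg"] * s["count"] for s in summaries) / total,  # count-weighted mean
        "min": min(s["min"] for s in summaries),                       # min of mins
        "max": max(s["max"] for s in summaries),                       # max of maxes
        "error_rate": sum(s["errors"] for s in summaries) / total,     # sum failures / sum total
        # p95/p99/stddev intentionally omitted: they require the raw data
    }

pods = [
    {"count": 50_000, "avg": 100, "min": 50, "max": 500, "errors": 100},
    {"count": 50_000, "avg": 105, "min": 55, "max": 510, "errors": 150},
    {"count": 50_000, "avg": 98,  "min": 48, "max": 495, "errors": 50},
]
print(merge_summaries(pods))
```

With equal pod counts, the weighted mean reduces to the simple mean (101 here); with unequal counts, the weighting is what keeps it correct.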
Tools Capable of Correct Multi-Pod Aggregation
Option 1: InfluxDB + Real-Time Streaming (⭐ RECOMMENDED if possible)
Architecture:
Pod 1 ──┐
Pod 2 ──┼─→ InfluxDB (continuous time-series) ──→ Grafana (live dashboard)
Pod 3 ──┘ Post-test: Grafana query for aggregates
How It Works:
- Each pod streams INDIVIDUAL REQUEST metrics to InfluxDB (not summaries)
- InfluxDB stores with tags: `pod="1", request_type="GET /api"`
- Post-test: Grafana queries and aggregates across all pods
- Example Grafana query: `avg(http_request_duration{pod=~"1|2|3"})`

Correctness: ✅ 100% correct (using raw data)
Tools That Support This:
- k6 with `--out influxdb=...`
- Gatling with influxdb plugin + custom agent
Option 2: Prometheus Pushgateway + Real-Time Metrics
Architecture:
Pod 1 ──┐
Pod 2 ──┼─→ Prometheus PushGateway ──→ Prometheus ──→ Grafana
Pod 3 ──┘
How It Works:
- Each pod pushes prometheus-formatted metrics
- Prometheus scrapes and stores time-series
- Grafana aggregates across pods
Correctness: ✅ 100% correct (if using raw request data)
Tools That Support This:
- k6 via xk6-output-prometheus
- Gatling via gatling-prometheus plugin
Option 3: Raw Data Collection + Post-Test Aggregation
Architecture:
Pod 1 ──┐
Pod 2 ──┼─→ SharedStorage (e.g., S3/PVC) ──→ Aggregation Script ──→ Report
Pod 3 ──┘
(each pod: all request JSON) (merge + recalculate)
How It Works:
- Each pod writes ALL raw request data to shared location (not just summary)
- Post-test script: Load all raw data, recalculate statistics
- Generate unified report
Correctness: ✅ 100% correct (using raw data)
Effort to Implement:
- k6: High - needs custom logging of every raw request
- Gatling: Medium - Gatling already supports CSV export of all requests
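As a sketch of what Option 3's post-test step could look like, assuming each pod writes a JSON Lines file of raw request records to the shared location (paths and field names here are hypothetical):

```python
import json
import pathlib
import statistics
import tempfile

def aggregate(paths):
    """Pool raw per-request durations from all pod files and recompute statistics."""
    durations = []
    for path in paths:
        with open(path) as f:
            durations.extend(json.loads(line)["duration_ms"] for line in f if line.strip())
    durations.sort()
    n = len(durations)
    # nearest-rank p95 computed on the pooled data, never on per-pod summaries
    return {
        "count": n,
        "avg": statistics.fmean(durations),
        "p95": durations[max(0, -(-95 * n // 100) - 1)],
    }

# Demo with three tiny stand-in pod files in a temp dir.
tmp = pathlib.Path(tempfile.mkdtemp())
for i, base in enumerate((100, 105, 98), start=1):
    (tmp / f"pod-{i}.jsonl").write_text(
        "\n".join(json.dumps({"duration_ms": base + j}) for j in range(100))
    )
result = aggregate(sorted(tmp.glob("pod-*.jsonl")))
print(result)
```

The demo files stand in for whatever the pods actually upload to S3 or a PVC; the key point is that the script sees every individual duration, so the recalculated average and p95 are exact.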
Option 4: Single Large Pod (Avoid Multi-Pod Entirely)
Architecture:
Pod 1
(VUs for entire load)
└─→ k6 or Gatling generates all requests
└─→ Single JSON summary (no merging needed)
Correctness: ✅ 100% correct (single source of truth)
Trade-offs:
- ⚠️ Single pod limits resource efficiency
- ⚠️ Can’t scale horizontally
- ⚠️ But: Pod can have larger resource requests (e.g., 2Gi memory, 2 CPU)
When This Works: <50k RPS or <500 VU tests
Decision: k6 vs Gatling for Multi-Pod
Head-to-Head Comparison
| Factor | k6 | Gatling |
|---|---|---|
| Multi-Pod Support | ⭐⭐ Works but needs custom aggregation | ⭐⭐⭐ Has built-in aggregation (TBD: correctness) |
| Raw Data Availability | ⭐⭐⭐⭐ Easy with InfluxDB streaming | ⭐⭐⭐ CSV export available |
| Prometheus Integration | ⭐⭐⭐⭐⭐ Built-in or simple plugin | ⭐⭐⭐ Via Pushgateway plugin |
| InfluxDB Integration | ⭐⭐⭐⭐⭐ Built-in --out influxdb | ⭐⭐⭐ Plugin available |
| Grafana Dashboards | ⭐⭐⭐⭐⭐ Official, high-quality | ⭐⭐⭐ Community, variable quality |
| Setup Time (InfluxDB streaming) | 2-3 hours | 3-4 hours |
| Aggregation Confidence | ⚠️ Needs custom validation | ⚠️ Needs verification of Gatling’s approach |
| Mathematical Correctness | ✅ If using InfluxDB correctly | ⚠️ To be determined |
Current Situation Analysis
Your Statement: “Trust in the report is critical”
Current Problem with Both Tools:
- ❌ k6 Operator: No clear aggregation strategy (you feel uncertain)
- ❌ Gatling: Unclear HOW aggregation works (need to verify correctness)
The Blocker: You don’t have confidence in either tool’s multi-pod approach yet.
Recommended Next Steps
Immediate (Today)
Option A: Validate Gatling’s Correctness (1-2 hours)
- Run Gatling with 3 pods
- Inspect the combined report:
- Does it show individual pod metrics? (Bad sign)
- Or does it show merged totals? (Good sign)
- Check the math: avg of metrics should use weighted averages
- Verify: Run same test on 1 pod vs 3 pods
- Do throughput numbers match? (should be ~3x)
- Do response times look reasonable? (similar p95)
- Does error rate calculation make sense?
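These checks can be automated with a small script. The sketch below (summary field names are assumptions, and the tolerances are illustrative) flags a 3-pod run whose numbers diverge too far from the 1-pod baseline:

```python
def looks_consistent(single, multi, pods=3, tol=0.15):
    """Compare a multi-pod run against a single-pod baseline, per the checklist above."""
    checks = {
        # throughput should scale roughly linearly with pod count
        "throughput": abs(multi["rps"] - pods * single["rps"]) <= tol * pods * single["rps"],
        # p95 should be in the same ballpark (the SUT is under higher load, so be generous)
        "p95": multi["p95"] <= (1 + 2 * tol) * single["p95"],
        # error rate should not explode
        "errors": multi["error_rate"] <= single["error_rate"] + 0.01,
    }
    return all(checks.values()), checks

ok, details = looks_consistent(
    {"rps": 1_000, "p95": 150, "error_rate": 0.001},  # hypothetical 1-pod summary
    {"rps": 2_950, "p95": 160, "error_rate": 0.002},  # hypothetical 3-pod summary
)
print(ok, details)
```

A failure here does not pinpoint the cause, but it cheaply signals that the multi-pod run (or its aggregation) deserves a closer look before you trust the report.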
Option B: Commit to k6 + InfluxDB (2-3 hours implementation)
- Instead of merging JSON summaries, stream raw metrics to InfluxDB
- Each pod: `k6 run --out influxdb=http://influxdb:8086/k6 test.js`
- All raw request data stored in InfluxDB
- Grafana aggregates automatically
- Post-test: Export report from Grafana (consistent, trustworthy)
Short-term (Next 1-2 Days)
Based on Validation Results:
If Gatling’s aggregation is mathematically sound ✅:
- Decision: Switch to Gatling + K8s Jobs
- Rationale: Built-in, validated multi-pod aggregation + professional HTML reports
- Setup: 3-5 hours for initial Gatling template + first test
If Gatling’s aggregation has same issues ❌:
- Decision: Stick with k6 + InfluxDB streaming
- Rationale: At least you understand the aggregation, Grafana handles the math
- Setup: 2-3 hours to switch from JSON summary to InfluxDB streaming
- Benefit: Real-time monitoring + historically accurate trending
If Both Have Issues ⚠️:
- Decision: Use Option 4 (single large pod) until you can invest in proper solution
- Rationale: Fewer VUs but 100% mathematical correctness
- Timeline: 0 hours (use immediately), revisit when you have time for InfluxDB setup
Recommended Implementation Path
Path 1: Verify & Keep k6 (Safest Short-term)
Phase 1: Validate Current Approach (2-3 hours)
- Run current k6 Operator setup with 3 pods
- Verify throughput approximately 3x single pod
- Check percentile math (should be within reasonable bounds)
- Document what you THINK is happening
Phase 2: Move to InfluxDB (2-3 hours additional)
- Modify k6 Job manifest to use `--out influxdb=http://influxdb:8086/k6`
- Set up Grafana datasource pointed at InfluxDB
- Import official k6 Grafana dashboard
- Run test and verify metrics appear in Grafana
- Verify aggregation is correct (same test 1 pod vs 3 pods)
Phase 3: Generate Reports
- Live Grafana dashboards for real-time monitoring
- Post-test: Export Grafana dashboard PNG/PDF for distribution
- Optional: Add k6-reporter for basic HTML (additional 1-2 hours)
Confidence: ⭐⭐⭐⭐⭐ High (you control and understand every step)
Path 2: Verify & Potentially Switch to Gatling (Higher Risk)
Phase 1: Deep Dive into Gatling’s Aggregation (2-3 hours)
- Create test scenario with 3 Gatling pods
- Analyze combined HTML report:
- How does it merge metrics?
- Are percentiles recalculated or averaged?
- Does math check out?
- Run validation test: 1 pod vs 3 pods, verify consistency
- Contact Gatling community/docs to confirm approach
Phase 2: Implement Gatling + K8s Jobs (3-5 hours)
- Similar to current decision record (already planned)
- Focus on multi-pod validation
Phase 3: Generate Reports
- Built-in HTML reports (excellent)
- Optional: Grafana integration via InfluxDB plugin (additional 2-3 hours)
Confidence: ⭐⭐⭐ Medium (depends on Gatling’s validation being correct)
Path 3: Single Pod + Wait (Lowest Risk)
Immediate Action (0 hours):
- Modify k6 Operator: Use single large pod instead of 3 pods
- Set resource requests: 2Gi memory, 2 CPU (most systems can handle)
- Keep using current JSON summary approach
- Generate reports as before
Why This Works:
- ✅ Zero aggregation problems (single source of truth)
- ✅ Simpler troubleshooting
- ✅ Can handle 50-100k RPS with good pod size
When to Upgrade: After you’ve validated either k6+InfluxDB or Gatling’s correctness
Confidence: ⭐⭐⭐⭐⭐ Very High (simplest, most reliable, but less scalable)
REVISED PATHS FOR YOUR SITUATION
Since you can’t directly integrate with the observability team’s InfluxDB, here are your realistic options:
Option A: Single Large Pod (SAFEST, 0 Setup Hours) ⭐ RECOMMENDED
What This Means:
- Use k6 Operator with `parallelism: 1` (single pod, no distribution)
- Larger pod size: 2Gi memory, 2 CPU
- Can generate 10k-50k RPS from single pod
- All metrics in single JSON summary (no aggregation needed)
Pros:
- ✅ Zero setup complexity
- ✅ 100% mathematically correct (single source of truth)
- ✅ Can start today
- ✅ Most reliable approach
- ✅ Simple to troubleshoot
Cons:
- ⚠️ Not horizontally scalable (limited to single pod resources)
- ⚠️ If you outgrow to 50k+ RPS, need redesign
When to Use:
- Perfect for: Your 10k-50k RPS target (single pod can handle 50k+ with good sizing)
- Do this if: You want to get started quickly and confidence matters more than scale
Setup Time: 0 hours (use immediately with existing k6 Operator)
Option B: Gatling + K8s Jobs (INVESTIGATE, 3-5 Hours)
What This Means:
- Switch from k6 to Gatling
- Use K8s Jobs (not Operator) as per original decision record
- Gatling handles multi-pod aggregation internally
- Gets you beautiful HTML reports
Before Deciding: Must validate HOW Gatling aggregates:
- Does it stream raw request data to shared location?
- Or does it have the same averaging-percentiles problem as k6?
- How does it handle pod failures mid-test?
Pros (if Gatling aggregation is sound):
- ✅ Built-in HTML reporting (professional, standalone)
- ✅ Apparently solves multi-pod aggregation
- ✅ Can use 3 pods for better resource distribution
Cons:
- ⚠️ Larger container images (500MB vs 120MB)
- ⚠️ Higher per-pod memory needs (2Gi+ vs 512Mi)
- ⚠️ Need to validate correctness first
- ⚠️ Switching from k6 requires learning Scala/Gatling DSL
When to Use:
- If Gatling’s multi-pod aggregation validates as mathematically sound
- If you want professional HTML reports for stakeholder sharing
- If you’re willing to verify correctness before committing
Setup Time: 1-2 hours validation + 3-5 hours implementation (if proceeding)
Option C: k6 with Local Prometheus/InfluxDB (DIY, 4-6 Hours)
What This Means:
- Deploy your own isolated InfluxDB in the test cluster (different from observability team’s)
- k6 streams to your local InfluxDB: `--out influxdb=http://influxdb:8086/k6` (use the in-cluster service URL; pods cannot reach it via localhost)
- Grafana instance in test cluster (not shared with observability team)
- Works but means maintaining separate observability stack
Pros:
- ✅ Mathematically correct aggregation (InfluxDB stores raw data)
- ✅ Can use multi-pod without aggregation concerns
- ✅ Real-time monitoring during tests
- ✅ Historical trending of test runs
Cons:
- ⚠️ Requires managing separate InfluxDB + Grafana (operational overhead)
- ⚠️ Not integrated with main observability stack
- ⚠️ More complex than single-pod
- ⚠️ Setup takes 4-6 hours
When to Use:
- If you need multi-pod scaling AND are willing to run a separate stack for it
- If InfluxDB deployment is trivial for your team
- If you want to learn the “right way” to do distributed load testing
Setup Time: 4-6 hours (InfluxDB deployment + Grafana config + k6 integration)
RECOMMENDATION: Option A (Single Pod) + Option B Investigation (2-3 Days)
Immediate Action (Today)
Use Option A now:
- Modify k6 Operator to use single pod with larger resources
- Run your first performance test TODAY
- Get baseline understanding of your system
- Generate JSON summary reports (mathematically sound, no aggregation issues)
Parallel: Validate Gatling (Tomorrow, 1-2 hours)
While Option A is running:
- Quick Gatling multi-pod test (3 pods)
- Inspect combined HTML report
- Validate: Does the math check out?
- Decision: Is Gatling worth switching to?
Follow-up (Day 3)
If Gatling validates:
- Consider switching for better reporting
- But Option A is sufficient for needs
If Gatling doesn’t validate:
- Stick with Option A (single pod)
- Move to Option C later only if you need >50k RPS
Why This Approach:
- ✅ Get working solution today (Option A)
- ✅ Validate alternative in parallel (Option B investigation)
- ✅ No blockers, no waiting
- ✅ Low risk (can always stick with Option A)
- ✅ Build confidence in metrics immediately
Bottom Line
Your target is 10k-50k RPS. A single k6 pod with 2Gi memory and 2 CPU can easily generate this load. You probably don’t need multi-pod distribution at all.
Start with Option A, get working tests, build confidence in your metrics. Then invest in Option B or C only if you actually outgrow single-pod capacity.
Action Items
Immediate (Today)
- Validate: Run k6 Operator test with 3 pods, collect metrics
- Check: Throughput ≈ 3x single pod?
- Check: p95 response time in reasonable range?
- Check: Error counts and percentages make sense?
- Document: Write down what you observe in results
Short-term (Next 2-3 Days)
- Choose Path: Decide between Path 1 (k6+InfluxDB), Path 2 (Gatling validate), or Path 3 (single pod)
- Implement: Set up chosen solution
- Validate: Run test, verify confidence in metrics
Medium-term (Next 1-2 Weeks)
- Formalize: Update decision-record.md with multi-pod aggregation approach
- Document: Create runbook for future performance tests
- Automation: Integrate into CI/CD pipeline
YOUR SPECIFIC SITUATION
Constraints Discovered:
- ✅ Target load: 10k-50k RPS (single pod CAN handle this)
- ⚠️ InfluxDB available but owned by different team (can’t integrate directly)
- ✅ You need working solution that “fits” within your constraints
- ❓ Gatling worth investigating (built-in aggregation)
Impact: Cannot use Path 1 (k6 + InfluxDB) directly due to observability team ownership
Revised Recommendation: See “REVISED PATHS FOR YOUR SITUATION” above
References
- k6 InfluxDB Output Documentation
- k6 Operator GitHub
- Gatling Multi-Pod Patterns (TBD: need to verify their approach)
- Statistical Correctness in Load Testing - classic reference on the averaging-averages problem