Multi-Pod Metrics Aggregation Analysis
Date: 2026-02-24
Status: Investigation
Priority: HIGH (Blocks final decision between k6 and Gatling)
Executive Summary
Critical Finding: Both k6 and Gatling have multi-pod metrics aggregation challenges, BUT:
- Gatling: Apparently solves this by combining metrics into single reports
- k6: No built-in multi-pod aggregation; requires custom solution
- Trust Gap: Current understanding of both approaches is insufficient for confidence
Decision Point: Depends on whether Gatling’s multi-pod aggregation is mathematically correct
The Multi-Pod Aggregation Problem
What Happens With Multiple Load Generator Pods
When distributing load across 3 identical pods:
Architecture:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │
│ 100 VUs │ │ 100 VUs │ │ 100 VUs │
│ k6/Gatl │ │ k6/Gatl │ │ k6/Gatl │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
[SUT: Single system under test - receives 300 VU worth of load]
The Metrics Problem
Each pod generates its own JSON summary with aggregated metrics:

```
// Pod 1
{
  "metrics": {
    "http_req_duration": {
      "avg": 100,
      "min": 50,
      "max": 500,
      "p(95)": 150,
      "p(99)": 250,
      "count": 50000
    },
    "http_reqs": 50000
  }
}

// Pod 2
{
  "metrics": {
    "http_req_duration": {
      "avg": 105,
      "min": 55,
      "max": 510,
      "p(95)": 155,
      "p(99)": 255,
      "count": 50000
    },
    "http_reqs": 50000
  }
}

// Pod 3
{
  "metrics": {
    "http_req_duration": {
      "avg": 98,
      "min": 48,
      "max": 495,
      "p(95)": 145,
      "p(99)": 245,
      "count": 50000
    },
    "http_reqs": 50000
  }
}
```

The Naive Merge (WRONG ❌)
```
// INCORRECT MERGE
{
  "http_req_duration": {
    "avg": (100 + 105 + 98) / 3 = 101,    // ❌ Averaging averages!
    "p(95)": (150 + 155 + 145) / 3 = 150  // ❌ Averaging percentiles!
  },
  "http_reqs": 150000  // ✅ Correct
}
```

Why This Is Wrong:
- Averaging averages loses weight information: an unweighted mean of pod means is only correct when every pod served exactly the same number of requests; with unequal counts, each pod's mean must be weighted by its request count
- Averaging percentiles is meaningless: the true p95 must be calculated from all 150,000 raw request times, not from the three p95 values
- Example: If Pod 1 has p95=150ms and Pods 2 & 3 each have p95=300ms, the averaged value is 250ms, but the true p95 of the combined data will sit much closer to 300ms, because two thirds of all samples come from the slower pods
The Correct Merge (RIGHT ✅)
```
// CORRECT MERGE (requires raw data)
// Collect ALL 150,000 raw request times
ALL_REQUESTS = [
  ...Pod1_50000_requests,  // [100, 102, 101, 99, ...]
  ...Pod2_50000_requests,  // [105, 107, 104, 103, ...]
  ...Pod3_50000_requests   // [98, 100, 99, 97, ...]
]

// Recalculate from raw data
{
  "http_req_duration": {
    "avg": sum(ALL_REQUESTS) / 150000,      // ✅ Correct weighted average
    "p(95)": quantile(ALL_REQUESTS, 0.95),  // ✅ True 95th percentile
    "p(99)": quantile(ALL_REQUESTS, 0.99)   // ✅ True 99th percentile
  },
  "http_reqs": 150000  // ✅ Correct
}
```

Current State: k6 Operator
How k6 Operator Handles Multi-Pod
Current Approach (you’re using this):
- k6 Operator runs test across N pods
- Each pod generates a JSON summary to `/tmp/k6-summary.json` or similar
- Results are collected from each pod
- Merging happens externally (you manage this)
Current Problem:
- ❌ No built-in aggregation of raw request-level data
- ❌ Only summary-level metrics available per pod
- ❌ Naive merging of summaries is mathematically unsound
- ❌ You’re “not confident” in the approach → Rightfully so
What Would Be Needed to Fix k6:
- Modify k6 operator to stream raw request data to shared database (InfluxDB/Prometheus)
- Post-test aggregation logic collects all raw data and recalculates metrics
- OR: Use xk6 extension to push raw data in real-time
- OR: Modify k6 jobs to write raw request JSON, merge those, then aggregate
Effort: 3-5 hours to implement + testing
Current State: Gatling
How Gatling Handles Multi-Pod
Claimed Behavior (from your testing):
- Gatling combines metrics into single report
- Shows totals/aggregates, not pod-by-pod breakdown
- Professional HTML report
Critical Questions NOT YET ANSWERED:
1. Does Gatling collect raw request data?
   - If YES → Can recalculate percentiles correctly ✅
   - If NO → Gatling also has the averaging-averages problem ❌
2. How does Gatling aggregate across pods?
   - Does each pod write to a shared database? (Good approach)
   - Does a lead pod collect results from all pods? (Potential single point of failure)
   - Post-test aggregation or real-time?
3. What are the limitations?
   - Maximum pod count before aggregation breaks?
   - Data loss scenarios?
   - Correctness validation?
Next Steps to Validate: Need to investigate Gatling’s actual multi-pod architecture
Mathematical Correctness Framework
Key Principle: You Cannot Calculate Percentiles from Percentiles
Given: p95 values from 3 pods = [150ms, 155ms, 145ms]
Question: What is true p95 for merged data?
Possible answers:
A) Average: (150 + 155 + 145) / 3 = 150ms ❌ WRONG
B) Maximum: max(150, 155, 145) = 155ms ❌ Often wrong
C) Minimum: min(150, 155, 145) = 145ms ❌ Often wrong
D) Must recalculate from ALL 150,000 raw times ✅ CORRECT
Why? Because the true p95 depends on the distribution shape, not just the pod p95 values.
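The principle can be demonstrated with a quick simulation. The sketch below (pure Python, using made-up latency distributions: one fast pod and two slower pods) compares the naive average-of-p95s against the true p95 recalculated from the pooled raw samples:

```python
import math
import random

def quantile(samples, q):
    """Nearest-rank quantile: smallest value with at least a fraction q of samples at or below it."""
    s = sorted(samples)
    return s[max(0, math.ceil(q * len(s)) - 1)]

random.seed(42)
# Hypothetical per-pod latency samples (ms): pod 1 is fast, pods 2-3 are slower.
pod1 = [random.gauss(100, 20) for _ in range(50_000)]
pod2 = [random.gauss(200, 40) for _ in range(50_000)]
pod3 = [random.gauss(200, 40) for _ in range(50_000)]

per_pod_p95 = [quantile(p, 0.95) for p in (pod1, pod2, pod3)]
averaged_p95 = sum(per_pod_p95) / 3            # answer A: the naive merge
true_p95 = quantile(pod1 + pod2 + pod3, 0.95)  # answer D: recalculated from raw data

print(f"averaged p95: {averaged_p95:.0f} ms")
print(f"true p95:     {true_p95:.0f} ms")
```

With these invented distributions the averaged value understates the true p95 by tens of milliseconds, because two thirds of the pooled samples come from the slower pods.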
Rules for Correct Multi-Pod Aggregation
| Metric Type | Can Aggregate From Summaries? | Method |
|---|---|---|
| Count/Total | ✅ YES | Sum pod counts |
| Average/Mean | ⚠️ MAYBE | Weighted average (count-weighted) |
| Min/Max | ✅ YES | Min of mins, max of maxes |
| Percentiles (p50/p95/p99) | ❌ NO | Must use raw data |
| Standard Deviation | ❌ NO | Need raw data to recalculate |
| Error Rates | ✅ YES | Sum failures / sum total |
| Throughput (req/s) | ✅ YES | Sum pod rates (valid when pods run over the same time window) |
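A minimal merge routine following these rules might look like the sketch below (field names are assumptions, not any tool's actual schema). Percentiles and standard deviation are deliberately absent because, per the table, they cannot be derived from summaries:

```python
def merge_summaries(summaries):
    """Merge per-pod summary dicts using only the operations the table allows."""
    total = sum(s["count"] for s in summaries)
    return {
        "count": total,                                                # sum pod counts
        "avg": sum(s["avg"] * s["count"] for s in summaries) / total,  # count-weighted mean
        "min": min(s["min"] for s in summaries),                       # min of mins
        "max": max(s["max"] for s in summaries),                       # max of maxes
        "error_rate": sum(s["errors"] for s in summaries) / total,     # sum failures / sum total
        # p95/p99/stddev intentionally omitted: they require the raw data
    }

pods = [
    {"count": 50_000, "avg": 100, "min": 50, "max": 500, "errors": 100},
    {"count": 50_000, "avg": 105, "min": 55, "max": 510, "errors": 150},
    {"count": 50_000, "avg": 98,  "min": 48, "max": 495, "errors": 50},
]
print(merge_summaries(pods))
```

With equal pod counts, the weighted mean reduces to the simple mean (101 here); with unequal counts, the weighting is what keeps it correct.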
Tools Capable of Correct Multi-Pod Aggregation
Option 1: InfluxDB + Real-Time Streaming (⭐ RECOMMENDED if possible)
Architecture:
Pod 1 ──┐
Pod 2 ──┼─→ InfluxDB (continuous time-series) ──→ Grafana (live dashboard)
Pod 3 ──┘ Post-test: Grafana query for aggregates
How It Works:
- Each pod streams INDIVIDUAL REQUEST metrics to InfluxDB (not summaries)
- InfluxDB stores with tags: `pod="1", request_type="GET /api"`
- Post-test: Grafana queries and aggregates across all pods
- Example Grafana query: `avg(http_request_duration{pod=~"1|2|3"})`

Correctness: ✅ 100% correct (using raw data)
Tools That Support This:
- k6 with `--out influxdb=...`
- Gatling with influxdb plugin + custom agent
Option 2: Prometheus Pushgateway + Real-Time Metrics
Architecture:
Pod 1 ──┐
Pod 2 ──┼─→ Prometheus PushGateway ──→ Prometheus ──→ Grafana
Pod 3 ──┘
How It Works:
- Each pod pushes prometheus-formatted metrics
- Prometheus scrapes and stores time-series
- Grafana aggregates across pods
Correctness: ✅ 100% correct (if using raw request data)
Tools That Support This:
- k6 via xk6-output-prometheus
- Gatling via gatling-prometheus plugin
Option 3: Raw Data Collection + Post-Test Aggregation
Architecture:
Pod 1 ──┐
Pod 2 ──┼─→ SharedStorage (e.g., S3/PVC) ──→ Aggregation Script ──→ Report
Pod 3 ──┘
(each pod: all request JSON) (merge + recalculate)
How It Works:
- Each pod writes ALL raw request data to shared location (not just summary)
- Post-test script: Load all raw data, recalculate statistics
- Generate unified report
Correctness: ✅ 100% correct (using raw data)
Effort to Implement:
- k6: High - needs custom logging of every raw request
- Gatling: Medium - Gatling already supports CSV export of all requests
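As a sketch of what Option 3's post-test step could look like, assuming each pod writes a JSON Lines file of raw request records to the shared location (paths and field names here are hypothetical):

```python
import json
import pathlib
import statistics
import tempfile

def aggregate(paths):
    """Pool raw per-request durations from all pod files and recompute statistics."""
    durations = []
    for path in paths:
        with open(path) as f:
            durations.extend(json.loads(line)["duration_ms"] for line in f if line.strip())
    durations.sort()
    n = len(durations)
    # nearest-rank p95 computed on the pooled data, never on per-pod summaries
    return {
        "count": n,
        "avg": statistics.fmean(durations),
        "p95": durations[max(0, -(-95 * n // 100) - 1)],
    }

# Demo with three tiny stand-in pod files in a temp dir.
tmp = pathlib.Path(tempfile.mkdtemp())
for i, base in enumerate((100, 105, 98), start=1):
    (tmp / f"pod-{i}.jsonl").write_text(
        "\n".join(json.dumps({"duration_ms": base + j}) for j in range(100))
    )
result = aggregate(sorted(tmp.glob("pod-*.jsonl")))
print(result)
```

The demo files stand in for whatever the pods actually upload to S3 or a PVC; the key point is that the script sees every individual duration, so the recalculated average and p95 are exact.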
Option 4: Single Large Pod (Avoid Multi-Pod Entirely)
Architecture:
Pod 1
(VUs for entire load)
└─→ k6 or Gatling generates all requests
└─→ Single JSON summary (no merging needed)
Correctness: ✅ 100% correct (single source of truth)
Trade-offs:
- ⚠️ Single pod limits resource efficiency
- ⚠️ Can’t scale horizontally
- ⚠️ But: Pod can have larger resource requests (e.g., 2Gi memory, 2 CPU)
When This Works: <50k RPS or <500 VU tests
Decision: k6 vs Gatling for Multi-Pod
Head-to-Head Comparison
| Factor | k6 | Gatling |
|---|---|---|
| Multi-Pod Support | ⭐⭐ Works but needs custom aggregation | ⭐⭐⭐ Has built-in aggregation (TBD: correctness) |
| Raw Data Availability | ⭐⭐⭐⭐ Easy with InfluxDB streaming | ⭐⭐⭐ CSV export available |
| Prometheus Integration | ⭐⭐⭐⭐⭐ Built-in or simple plugin | ⭐⭐⭐ Via Pushgateway plugin |
| InfluxDB Integration | ⭐⭐⭐⭐⭐ Built-in --out influxdb | ⭐⭐⭐ Plugin available |
| Grafana Dashboards | ⭐⭐⭐⭐⭐ Official, high-quality | ⭐⭐⭐ Community, variable quality |
| Setup Time (InfluxDB streaming) | 2-3 hours | 3-4 hours |
| Aggregation Confidence | ⚠️ Needs custom validation | ⚠️ Needs verification of Gatling’s approach |
| Mathematical Correctness | ✅ If using InfluxDB correctly | ⚠️ To be determined |
Current Situation Analysis
Your Statement: “Trust in the report is critical”
Current Problem with Both Tools:
- ❌ k6 Operator: No clear aggregation strategy (you feel uncertain)
- ❌ Gatling: Unclear HOW aggregation works (need to verify correctness)
The Blocker: You don’t have confidence in either tool’s multi-pod approach yet.
Recommended Next Steps
Immediate (Today)
Option A: Validate Gatling’s Correctness (1-2 hours)
- Run Gatling with 3 pods
- Inspect the combined report:
- Does it show individual pod metrics? (Bad sign)
- Or does it show merged totals? (Good sign)
- Check the math: avg of metrics should use weighted averages
- Verify: Run same test on 1 pod vs 3 pods
- Do throughput numbers match? (should be ~3x)
- Do response times look reasonable? (similar p95)
- Does error rate calculation make sense?
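These checks can be automated with a small script. The sketch below (summary field names are assumptions, and the tolerances are illustrative) flags a 3-pod run whose numbers diverge too far from the 1-pod baseline:

```python
def looks_consistent(single, multi, pods=3, tol=0.15):
    """Compare a multi-pod run against a single-pod baseline, per the checklist above."""
    checks = {
        # throughput should scale roughly linearly with pod count
        "throughput": abs(multi["rps"] - pods * single["rps"]) <= tol * pods * single["rps"],
        # p95 should be in the same ballpark (the SUT is under higher load, so be generous)
        "p95": multi["p95"] <= (1 + 2 * tol) * single["p95"],
        # error rate should not explode
        "errors": multi["error_rate"] <= single["error_rate"] + 0.01,
    }
    return all(checks.values()), checks

ok, details = looks_consistent(
    {"rps": 1_000, "p95": 150, "error_rate": 0.001},  # hypothetical 1-pod summary
    {"rps": 2_950, "p95": 160, "error_rate": 0.002},  # hypothetical 3-pod summary
)
print(ok, details)
```

A failure here does not pinpoint the cause, but it cheaply signals that the multi-pod run (or its aggregation) deserves a closer look before you trust the report.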
Option B: Commit to k6 + InfluxDB (2-3 hours implementation)
- Instead of merging JSON summaries, stream raw metrics to InfluxDB
- Each pod: `k6 run --out influxdb=http://influxdb:8086/k6 test.js`
- All raw request data stored in InfluxDB
- Grafana aggregates automatically
- Post-test: Export report from Grafana (consistent, trustworthy)
Short-term (Next 1-2 Days)
Based on Validation Results:
If Gatling’s aggregation is mathematically sound ✅:
- Decision: Switch to Gatling + K8s Jobs
- Rationale: Built-in, validated multi-pod aggregation + professional HTML reports
- Setup: 3-5 hours for initial Gatling template + first test
If Gatling’s aggregation has same issues ❌:
- Decision: Stick with k6 + InfluxDB streaming
- Rationale: At least you understand the aggregation, Grafana handles the math
- Setup: 2-3 hours to switch from JSON summary to InfluxDB streaming
- Benefit: Real-time monitoring + historically accurate trending
If Both Have Issues ⚠️:
- Decision: Use Option 4 (single large pod) until you can invest in proper solution
- Rationale: Fewer VUs but 100% mathematical correctness
- Timeline: 0 hours (use immediately), revisit when you have time for InfluxDB setup
Recommended Implementation Path
Path 1: Verify & Keep k6 (Safest Short-term)
Phase 1: Validate Current Approach (2-3 hours)
- Run current k6 Operator setup with 3 pods
- Verify throughput approximately 3x single pod
- Check percentile math (should be within reasonable bounds)
- Document what you THINK is happening
Phase 2: Move to InfluxDB (2-3 hours additional)
- Modify k6 Job manifest to use `--out influxdb=http://influxdb:8086/k6`
- Set up Grafana datasource pointed at InfluxDB
- Import official k6 Grafana dashboard
- Run test and verify metrics appear in Grafana
- Verify aggregation is correct (same test 1 pod vs 3 pods)
Phase 3: Generate Reports
- Live Grafana dashboards for real-time monitoring
- Post-test: Export Grafana dashboard PNG/PDF for distribution
- Optional: Add k6-reporter for basic HTML (additional 1-2 hours)
Confidence: ⭐⭐⭐⭐⭐ High (you control and understand every step)
Path 2: Verify & Potentially Switch to Gatling (Higher Risk)
Phase 1: Deep Dive into Gatling’s Aggregation (2-3 hours)
- Create test scenario with 3 Gatling pods
- Analyze combined HTML report:
- How does it merge metrics?
- Are percentiles recalculated or averaged?
- Does math check out?
- Run validation test: 1 pod vs 3 pods, verify consistency
- Contact Gatling community/docs to confirm approach
Phase 2: Implement Gatling + K8s Jobs (3-5 hours)
- Similar to current decision record (already planned)
- Focus on multi-pod validation
Phase 3: Generate Reports
- Built-in HTML reports (excellent)
- Optional: Grafana integration via InfluxDB plugin (additional 2-3 hours)
Confidence: ⭐⭐⭐ Medium (depends on Gatling’s validation being correct)
Path 3: Single Pod + Wait (Lowest Risk)
Immediate Action (0 hours):
- Modify k6 Operator: Use single large pod instead of 3 pods
- Set resource requests: 2Gi memory, 2 CPU (most systems can handle)
- Keep using current JSON summary approach
- Generate reports as before
Why This Works:
- ✅ Zero aggregation problems (single source of truth)
- ✅ Simpler troubleshooting
- ✅ Can handle 50-100k RPS with good pod size
When to Upgrade: After you’ve validated either k6+InfluxDB or Gatling’s correctness
Confidence: ⭐⭐⭐⭐⭐ Very High (simplest, most reliable, but less scalable)
REVISED PATHS FOR YOUR SITUATION
Since you can’t directly integrate with the observability team’s InfluxDB, here are your realistic options:
Option A: Single Large Pod (SAFEST, 0 Setup Hours) ⭐ RECOMMENDED
What This Means:
- Use k6 Operator with `parallelism: 1` (single pod, no distribution)
- Larger pod size: 2Gi memory, 2 CPU
- Can generate 10k-50k RPS from single pod
- All metrics in single JSON summary (no aggregation needed)
Pros:
- ✅ Zero setup complexity
- ✅ 100% mathematically correct (single source of truth)
- ✅ Can start today
- ✅ Most reliable approach
- ✅ Simple to troubleshoot
Cons:
- ⚠️ Not horizontally scalable (limited to single pod resources)
- ⚠️ If you outgrow to 50k+ RPS, need redesign
When to Use:
- Perfect for: Your 10k-50k RPS target (single pod can handle 50k+ with good sizing)
- Do this if: You want to get started quickly and confidence matters more than scale
Setup Time: 0 hours (use immediately with existing k6 Operator)
Option B: Gatling + K8s Jobs (INVESTIGATE, 3-5 Hours)
What This Means:
- Switch from k6 to Gatling
- Use K8s Jobs (not Operator) as per original decision record
- Gatling handles multi-pod aggregation internally
- Gets you beautiful HTML reports
Before Deciding: Must validate HOW Gatling aggregates:
- Does it stream raw request data to shared location?
- Or does it have the same averaging-percentiles problem as k6?
- How does it handle pod failures mid-test?
Pros (if Gatling aggregation is sound):
- ✅ Built-in HTML reporting (professional, standalone)
- ✅ Apparently solves multi-pod aggregation
- ✅ Can use 3 pods for better resource distribution
Cons:
- ⚠️ Larger container images (500MB vs 120MB)
- ⚠️ Higher per-pod memory needs (2Gi+ vs 512Mi)
- ⚠️ Need to validate correctness first
- ⚠️ Switching from k6 requires learning Scala/Gatling DSL
When to Use:
- If Gatling’s multi-pod aggregation validates as mathematically sound
- If you want professional HTML reports for stakeholder sharing
- If you’re willing to verify correctness before committing
Setup Time: 1-2 hours validation + 3-5 hours implementation (if proceeding)
Option C: k6 with Local Prometheus/InfluxDB (DIY, 4-6 Hours)
What This Means:
- Deploy your own isolated InfluxDB in the test cluster (different from observability team’s)
- k6 streams to your local InfluxDB: `--out influxdb=http://influxdb:8086/k6` (use the in-cluster service URL; pods cannot reach it via localhost)
- Grafana instance in test cluster (not shared with observability team)
- Works but means maintaining separate observability stack
Pros:
- ✅ Mathematically correct aggregation (InfluxDB stores raw data)
- ✅ Can use multi-pod without aggregation concerns
- ✅ Real-time monitoring during tests
- ✅ Historical trending of test runs
Cons:
- ⚠️ Requires managing separate InfluxDB + Grafana (operational overhead)
- ⚠️ Not integrated with main observability stack
- ⚠️ More complex than single-pod
- ⚠️ Setup takes 4-6 hours
When to Use:
- If you need multi-pod scaling AND are willing to run a separate stack for it
- If InfluxDB deployment is trivial for your team
- If you want to learn the “right way” to do distributed load testing
Setup Time: 4-6 hours (InfluxDB deployment + Grafana config + k6 integration)
RECOMMENDATION: Option A (Single Pod) + Option B Investigation (2-3 Days)
Immediate Action (Today)
Use Option A now:
- Modify k6 Operator to use single pod with larger resources
- Run your first performance test TODAY
- Get baseline understanding of your system
- Generate JSON summary reports (mathematically sound, no aggregation issues)
Parallel: Validate Gatling (Tomorrow, 1-2 hours)
While Option A is running:
- Quick Gatling multi-pod test (3 pods)
- Inspect combined HTML report
- Validate: Does the math check out?
- Decision: Is Gatling worth switching to?
Follow-up (Day 3)
If Gatling validates:
- Consider switching for better reporting
- But Option A is sufficient for needs
If Gatling doesn’t validate:
- Stick with Option A (single pod)
- Move to Option C later only if you need >50k RPS
Why This Approach:
- ✅ Get working solution today (Option A)
- ✅ Validate alternative in parallel (Option B investigation)
- ✅ No blockers, no waiting
- ✅ Low risk (can always stick with Option A)
- ✅ Build confidence in metrics immediately
Bottom Line
Your target is 10k-50k RPS. A single k6 pod with 2Gi memory and 2 CPU can easily generate this load. You probably don’t need multi-pod distribution at all.
Start with Option A, get working tests, build confidence in your metrics. Then invest in Option B or C only if you actually outgrow single-pod capacity.
Action Items
Immediate (Today)
- Validate: Run k6 Operator test with 3 pods, collect metrics
- Check: Throughput ≈ 3x single pod?
- Check: p95 response time in reasonable range?
- Check: Error counts and percentages make sense?
- Document: Write down what you observe in results
Short-term (Next 2-3 Days)
- Choose Path: Decide between Path 1 (k6+InfluxDB), Path 2 (Gatling validate), or Path 3 (single pod)
- Implement: Set up chosen solution
- Validate: Run test, verify confidence in metrics
Medium-term (Next 1-2 Weeks)
- Formalize: Update decision-record.md with multi-pod aggregation approach
- Document: Create runbook for future performance tests
- Automation: Integrate into CI/CD pipeline
YOUR SPECIFIC SITUATION
Constraints Discovered:
- ✅ Target load: 10k-50k RPS (single pod CAN handle this)
- ⚠️ InfluxDB available but owned by different team (can’t integrate directly)
- ✅ You need working solution that “fits” within your constraints
- ❓ Gatling worth investigating (built-in aggregation)
Impact: Cannot use Path 1 (k6 + InfluxDB) directly due to observability team ownership
Revised Recommendation: See “REVISED PATHS FOR YOUR SITUATION” above
References
- k6 InfluxDB Output Documentation
- k6 Operator GitHub
- Gatling Multi-Pod Patterns (TBD: need to verify their approach)
- Statistical Correctness in Load Testing - classic reference on the averaging-averages problem