Multi-Pod Metrics Aggregation Analysis

Date: 2026-02-24
Status: Investigation
Priority: HIGH (blocks the final decision between k6 and Gatling)

Executive Summary

Critical Finding: Both k6 and Gatling have multi-pod metrics aggregation challenges, BUT:

  • Gatling: Apparently solves this by combining metrics into single reports
  • k6: No built-in multi-pod aggregation; requires custom solution
  • Trust Gap: Current understanding of both approaches is insufficient for confidence

Decision Point: Depends on whether Gatling’s multi-pod aggregation is mathematically correct


The Multi-Pod Aggregation Problem

What Happens With Multiple Load Generator Pods

When distributing load across 3 identical pods:

Architecture:
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Pod 1   │  │ Pod 2   │  │ Pod 3   │
│ 100 VUs │  │ 100 VUs │  │ 100 VUs │
│ k6/Gatl │  │ k6/Gatl │  │ k6/Gatl │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     ▼            ▼            ▼
  [SUT: Single system under test - receives 300 VU worth of load]

The Metrics Problem

Each pod generates its own JSON/summary with aggregated metrics:

// Pod 1
{
  "metrics": {
    "http_req_duration": {
      "avg": 100,
      "min": 50,
      "max": 500,
      "p(95)": 150,
      "p(99)": 250,
      "count": 50000
    },
    "http_reqs": 50000
  }
}
 
// Pod 2
{
  "metrics": {
    "http_req_duration": {
      "avg": 105,
      "min": 55,
      "max": 510,
      "p(95)": 155,
      "p(99)": 255,
      "count": 50000
    },
    "http_reqs": 50000
  }
}
 
// Pod 3
{
  "metrics": {
    "http_req_duration": {
      "avg": 98,
      "min": 48,
      "max": 495,
      "p(95)": 145,
      "p(99)": 245,
      "count": 50000
    },
    "http_reqs": 50000
  }
}

The Naive Merge (WRONG ❌)

// INCORRECT MERGE
{
  "http_req_duration": {
    "avg": (100 + 105 + 98) / 3 = 101,    // ❌ Averaging averages!
    "p(95)": (150 + 155 + 145) / 3 = 150  // ❌ Averaging percentiles!
  },
  "http_reqs": 150000  // ✅ Correct
}

Why This Is Wrong:

  • Averaging averages ignores weights: it only works here because each pod happens to have exactly 50,000 requests; with unequal counts, the merged average must be weighted by each pod’s request count
  • Averaging percentiles is meaningless: the true p95 must be calculated from all 150,000 raw request times, not from the three p95 values
  • Example: if Pod 1 has p95=150ms and Pods 2 & 3 each have p95=300ms, the averaged value is 250ms, but the true p95 of the combined data depends on the shape of each pod’s distribution and can fall anywhere between 150ms and 300ms

The Correct Merge (RIGHT ✅)

// CORRECT MERGE (requires raw data)
// Collect ALL 150,000 raw request times
ALL_REQUESTS = [
  ...Pod1_50000_requests,  // [100, 102, 101, 99, ...]
  ...Pod2_50000_requests,  // [105, 107, 104, 103, ...]
  ...Pod3_50000_requests   // [98, 100, 99, 97, ...]
]
 
// Recalculate from raw data
{
  "http_req_duration": {
    "avg": sum(ALL_REQUESTS) / 150000,  // ✅ Correct weighted average
    "p(95)": quantile(ALL_REQUESTS, 0.95),  // ✅ True 95th percentile
    "p(99)": quantile(ALL_REQUESTS, 0.99)   // ✅ True 99th percentile
  },
  "http_reqs": 150000  // ✅ Correct
}
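The pseudocode above translates into a short runnable sketch (plain Python, nearest-rank quantiles; the tiny sample values stand in for the three pods' 50k requests each):

```python
def merge_raw(pods):
    """Merge raw per-request durations from all pods, then recompute stats."""
    all_requests = sorted(t for pod in pods for t in pod)
    n = len(all_requests)

    def quantile(p):
        # Nearest-rank style quantile; a real tool may interpolate instead.
        return all_requests[min(n - 1, int(p * n))]

    return {
        "http_req_duration": {
            "avg": sum(all_requests) / n,   # correct weighted average
            "min": all_requests[0],
            "max": all_requests[-1],
            "p(95)": quantile(0.95),        # true percentile of merged data
            "p(99)": quantile(0.99),
            "count": n,
        },
        "http_reqs": n,
    }

# Illustrative inputs only.
merged = merge_raw([[100, 102, 101, 99], [105, 107, 104], [98, 100, 97]])
print(merged)
```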

Current State: k6 Operator

How k6 Operator Handles Multi-Pod

Current Approach (you’re using this):

  • k6 Operator runs test across N pods
  • Each pod generates JSON summary to /tmp/k6-summary.json or similar
  • Results are collected from each pod
  • Merging happens externally (you manage this)

Current Problem:

  • ❌ No built-in aggregation of raw request-level data
  • ❌ Only summary-level metrics available per pod
  • ❌ Naive merging of summaries is mathematically unsound
  • ❌ You’re “not confident” in the approach, and rightfully so

What Would Be Needed to Fix k6:

  1. Modify k6 operator to stream raw request data to shared database (InfluxDB/Prometheus)
  2. Post-test aggregation logic collects all raw data and recalculates metrics
  3. OR: Use xk6 extension to push raw data in real-time
  4. OR: Modify k6 jobs to write raw request JSON, merge those, then aggregate

Effort: 3-5 hours to implement + testing
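Option 4 in the list above could start from k6’s JSON output, which emits one NDJSON record per metric sample. A sketch of the merge step (file contents are inlined here for illustration; in practice each pod would write its own file, and the record shape shown assumes k6’s Point-record JSON output):

```python
import json

def durations_from_ndjson(lines):
    """Extract raw http_req_duration samples from k6 JSON output lines."""
    values = []
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "Point" and record.get("metric") == "http_req_duration":
            values.append(record["data"]["value"])
    return values

# Hypothetical NDJSON outputs from two pods, a few lines each.
pod_a = [
    '{"type":"Point","metric":"http_req_duration","data":{"value":101.2}}',
    '{"type":"Point","metric":"http_reqs","data":{"value":1}}',
    '{"type":"Point","metric":"http_req_duration","data":{"value":99.8}}',
]
pod_b = [
    '{"type":"Point","metric":"http_req_duration","data":{"value":140.5}}',
]

all_durations = durations_from_ndjson(pod_a) + durations_from_ndjson(pod_b)
print(sorted(all_durations))
```

Once all pods' samples are pooled like this, percentiles can be recalculated correctly from the combined list.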


Current State: Gatling

How Gatling Handles Multi-Pod

Claimed Behavior (from your testing):

  • Gatling combines metrics into single report
  • Shows totals/aggregates, not pod-by-pod breakdown
  • Professional HTML report

Critical Questions NOT YET ANSWERED:

  1. Does Gatling collect raw request data?

    • If YES → Can recalculate percentiles correctly ✅
    • If NO → Gatling also has the averaging-averages problem ❌
  2. How does Gatling aggregate across pods?

    • Does each pod write to shared database? (Good approach)
    • Does lead pod collect results from all pods? (Potential single point of failure)
    • Post-test aggregation or real-time?
  3. What are the limitations?

    • Maximum pod count before aggregation breaks?
    • Data loss scenarios?
    • Correctness validation?

Next Steps to Validate: Need to investigate Gatling’s actual multi-pod architecture


Mathematical Correctness Framework

Key Principle: You Cannot Calculate Percentiles from Percentiles

Given: p95 values from 3 pods = [150ms, 155ms, 145ms]
Question: What is true p95 for merged data?

Possible answers:
A) Average: (150 + 155 + 145) / 3 = 150ms      ❌ WRONG
B) Maximum: max(150, 155, 145) = 155ms         ❌ Often wrong
C) Minimum: min(150, 155, 145) = 145ms         ❌ Often wrong
D) Must recalculate from ALL 150,000 raw times ✅ CORRECT

Why? Because the true p95 depends on the distribution shape, not just the pod p95 values.

Rules for Correct Multi-Pod Aggregation

| Metric Type | Can Aggregate From Summaries? | Method |
|---|---|---|
| Count/Total | ✅ YES | Sum pod counts |
| Average/Mean | ⚠️ MAYBE | Weighted average (count-weighted) |
| Min/Max | ✅ YES | Min of mins, max of maxes |
| Percentiles (p50/p95/p99) | ❌ NO | Must use raw data |
| Standard Deviation | ⚠️ MAYBE | Pooled variance, but only if per-pod count, mean, AND stddev are all reported |
| Error Rates | ✅ YES | Sum failures / sum total |
| Throughput (req/s) | ✅ YES | Sum per-pod req/s over the same time window |
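The safe rules above can be applied mechanically. A sketch of a summary-level merger that deliberately omits percentiles (the field names are assumptions for illustration, not a real k6 schema):

```python
def merge_summaries(summaries):
    """Merge per-pod summaries using only the mathematically safe rules.

    Percentiles are deliberately absent: they cannot be derived from
    summaries and must come from raw data.
    """
    total = sum(s["count"] for s in summaries)
    return {
        "count": total,                                                # sum pod counts
        "avg": sum(s["avg"] * s["count"] for s in summaries) / total,  # count-weighted
        "min": min(s["min"] for s in summaries),                       # min of mins
        "max": max(s["max"] for s in summaries),                       # max of maxes
        "error_rate": sum(s["failures"] for s in summaries) / total,   # sum / sum
    }

# Unequal pod counts: exactly the case where averaging averages goes wrong.
pods = [
    {"count": 50_000, "avg": 100, "min": 50, "max": 500, "failures": 100},
    {"count": 25_000, "avg": 105, "min": 55, "max": 510, "failures": 25},
]
merged_summary = merge_summaries(pods)
print(merged_summary)
```

A naive average of the two averages would give 102.5ms; the count-weighted result is about 101.7ms, because the first pod carries twice the requests.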

Tools Capable of Correct Multi-Pod Aggregation

Option 1: InfluxDB + Grafana (Continuous Streaming)

Architecture:

Pod 1 ──┐
Pod 2 ──┼─→ InfluxDB (continuous time-series) ──→ Grafana (live dashboard)
Pod 3 ──┘                                              Post-test: Grafana query for aggregates

How It Works:

  • Each pod streams INDIVIDUAL REQUEST metrics to InfluxDB (not summaries)
  • InfluxDB stores with tags: pod="1", request_type="GET /api"
  • Post-test: Grafana queries and aggregates across all pods
  • Example Grafana query (PromQL-style shorthand; exact syntax depends on the datasource): avg(http_request_duration{pod=~"1|2|3"})

Correctness: ✅ 100% correct (using raw data)

Tools That Support This:

  • k6 with --out influxdb=...
  • Gatling with influxdb plugin + custom agent

Option 2: Prometheus Pushgateway + Real-Time Metrics

Architecture:

Pod 1 ──┐
Pod 2 ──┼─→ Prometheus PushGateway ──→ Prometheus ──→ Grafana
Pod 3 ──┘

How It Works:

  • Each pod pushes prometheus-formatted metrics
  • Prometheus scrapes and stores time-series
  • Grafana aggregates across pods

Correctness: ✅ 100% correct (if using raw request data)

Tools That Support This:

  • k6 via xk6-output-prometheus
  • Gatling via gatling-prometheus plugin

Option 3: Raw Data Collection + Post-Test Aggregation

Architecture:

Pod 1 ──┐
Pod 2 ──┼─→ SharedStorage (e.g., S3/PVC) ──→ Aggregation Script ──→ Report
Pod 3 ──┘
         (each pod: all request JSON)           (merge + recalculate)

How It Works:

  • Each pod writes ALL raw request data to shared location (not just summary)
  • Post-test script: Load all raw data, recalculate statistics
  • Generate unified report

Correctness: ✅ 100% correct (using raw data)

Effort to Implement:

  • k6: Medium: --out json or --out csv can dump every raw metric sample per pod, though the merge-and-recalculate script is yours to write and the files get large
  • Gatling: Medium: Gatling already supports exporting all requests (e.g. via its raw simulation logs)

Option 4: Single Large Pod (Avoid Multi-Pod Entirely)

Architecture:

Pod 1
(VUs for entire load)
└─→ k6 or Gatling generates all requests
└─→ Single JSON summary (no merging needed)

Correctness: ✅ 100% correct (single source of truth)

Trade-offs:

  • ⚠️ Single pod limits resource efficiency
  • ⚠️ Can’t scale horizontally
  • ⚠️ But: Pod can have larger resource requests (e.g., 2Gi memory, 2 CPU)

When This Works: <50k RPS or <500 VU tests


Decision: k6 vs Gatling for Multi-Pod

Head-to-Head Comparison

| Factor | k6 | Gatling |
|---|---|---|
| Multi-Pod Support | ⭐⭐ Works but needs custom aggregation | ⭐⭐⭐ Built-in aggregation (correctness TBD) |
| Raw Data Availability | ⭐⭐⭐⭐ Easy with InfluxDB streaming | ⭐⭐⭐ CSV export available |
| Prometheus Integration | ⭐⭐⭐⭐⭐ Built-in or simple plugin | ⭐⭐⭐ Via Pushgateway plugin |
| InfluxDB Integration | ⭐⭐⭐⭐⭐ Built-in --out influxdb | ⭐⭐⭐ Plugin available |
| Grafana Dashboards | ⭐⭐⭐⭐⭐ Official, high-quality | ⭐⭐⭐ Community, variable quality |
| Setup Time (InfluxDB streaming) | 2-3 hours | 3-4 hours |
| Aggregation Confidence | ⚠️ Needs custom validation | ⚠️ Needs verification of Gatling’s approach |
| Mathematical Correctness | ✅ If using InfluxDB correctly | ⚠️ To be determined |

Current Situation Analysis

Your Statement: “Trust in the report is critical”

Current Problem with Both Tools:

  • ❌ k6 Operator: No clear aggregation strategy (you feel uncertain)
  • ❌ Gatling: Unclear HOW aggregation works (need to verify correctness)

The Blocker: You don’t have confidence in either tool’s multi-pod approach yet.


Recommended Next Steps

Immediate (Today)

Option A: Validate Gatling’s Correctness (1-2 hours)

  1. Run Gatling with 3 pods
  2. Inspect the combined report:
    • Does it show individual pod metrics? (Bad sign)
    • Or does it show merged totals? (Good sign)
    • Check the math: avg of metrics should use weighted averages
  3. Verify: Run same test on 1 pod vs 3 pods
    • Do throughput numbers match? (should be ~3x)
    • Do response times look reasonable? (similar p95)
    • Does error rate calculation make sense?
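Step 3’s consistency check can be automated with a few ratio tests. A hedged sketch (the dict field names are placeholders for whatever your report extraction produces, and the 15% tolerance is an arbitrary starting point):

```python
def sanity_check(single_pod, multi_pod, pods=3, tolerance=0.15):
    """Rough consistency checks between a 1-pod run and an N-pod run."""
    return {
        # Throughput should scale roughly linearly with pod count.
        "throughput_scales": abs(
            multi_pod["throughput_rps"] / (single_pod["throughput_rps"] * pods) - 1
        ) <= tolerance,
        # p95 should stay in the same ballpark unless the SUT saturates.
        "p95_comparable": abs(multi_pod["p95_ms"] / single_pod["p95_ms"] - 1)
        <= tolerance,
        # Error rate should not explode under the higher load.
        "errors_stable": multi_pod["error_rate"] <= single_pod["error_rate"] + 0.01,
    }

# Illustrative numbers only.
result = sanity_check(
    {"throughput_rps": 5_000, "p95_ms": 150, "error_rate": 0.001},
    {"throughput_rps": 14_600, "p95_ms": 158, "error_rate": 0.002},
)
print(result)
```

Any check coming back False is a signal to distrust the multi-pod numbers until the aggregation path is understood.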

Option B: Commit to k6 + InfluxDB (2-3 hours implementation)

  1. Instead of merging JSON summaries, stream raw metrics to InfluxDB
  2. Each pod: k6 run --out influxdb=http://influxdb:8086/k6 test.js
  3. All raw request data stored in InfluxDB
  4. Grafana aggregates automatically
  5. Post-test: Export report from Grafana (consistent, trustworthy)

Short-term (Next 1-2 Days)

Based on Validation Results:

If Gatling’s aggregation is mathematically sound ✅:

  • Decision: Switch to Gatling + K8s Jobs
  • Rationale: Built-in, validated multi-pod aggregation + professional HTML reports
  • Setup: 3-5 hours for initial Gatling template + first test

If Gatling’s aggregation has same issues ❌:

  • Decision: Stick with k6 + InfluxDB streaming
  • Rationale: At least you understand the aggregation, Grafana handles the math
  • Setup: 2-3 hours to switch from JSON summary to InfluxDB streaming
  • Benefit: Real-time monitoring + historical trending across test runs

If Both Have Issues ⚠️:

  • Decision: Use Option 4 (single large pod) until you can invest in proper solution
  • Rationale: Fewer VUs but 100% mathematical correctness
  • Timeline: 0 hours (use immediately), revisit when you have time for InfluxDB setup

Implementation Paths

Path 1: Verify & Keep k6 (Safest Short-term)

Phase 1: Validate Current Approach (2-3 hours)

  1. Run current k6 Operator setup with 3 pods
  2. Verify throughput approximately 3x single pod
  3. Check percentile math (should be within reasonable bounds)
  4. Document what you THINK is happening

Phase 2: Move to InfluxDB (2-3 hours additional)

  1. Modify k6 Job manifest to use --out influxdb=http://influxdb:8086/k6
  2. Set up Grafana datasource pointed at InfluxDB
  3. Import official k6 Grafana dashboard
  4. Run test and verify metrics appear in Grafana
  5. Verify aggregation is correct (same test 1 pod vs 3 pods)

Phase 3: Generate Reports

  • Live Grafana dashboards for real-time monitoring
  • Post-test: Export Grafana dashboard PNG/PDF for distribution
  • Optional: Add k6-reporter for basic HTML (additional 1-2 hours)

Confidence: ⭐⭐⭐⭐⭐ High (you control and understand every step)

Path 2: Verify & Potentially Switch to Gatling (Higher Risk)

Phase 1: Deep Dive into Gatling’s Aggregation (2-3 hours)

  1. Create test scenario with 3 Gatling pods
  2. Analyze combined HTML report:
    • How does it merge metrics?
    • Are percentiles recalculated or averaged?
    • Does math check out?
  3. Run validation test: 1 pod vs 3 pods, verify consistency
  4. Contact Gatling community/docs to confirm approach

Phase 2: Implement Gatling + K8s Jobs (3-5 hours)

  • Similar to current decision record (already planned)
  • Focus on multi-pod validation

Phase 3: Generate Reports

  • Built-in HTML reports (excellent)
  • Optional: Grafana integration via InfluxDB plugin (additional 2-3 hours)

Confidence: ⭐⭐⭐ Medium (depends on Gatling’s validation being correct)

Path 3: Single Pod + Wait (Lowest Risk)

Immediate Action (0 hours):

  1. Modify k6 Operator: Use single large pod instead of 3 pods
  2. Set resource requests: 2Gi memory, 2 CPU (most systems can handle)
  3. Keep using current JSON summary approach
  4. Generate reports as before

Why This Works:

  • ✅ Zero aggregation problems (single source of truth)
  • ✅ Simpler troubleshooting
  • ✅ Can handle 50-100k RPS with good pod size

When to Upgrade: After you’ve validated either k6+InfluxDB or Gatling’s correctness

Confidence: ⭐⭐⭐⭐⭐ Very High (simplest, most reliable, but less scalable)


REVISED PATHS FOR YOUR SITUATION

Since you can’t directly integrate with the observability team’s InfluxDB, here are your realistic options:

Option A: k6 Single Large Pod (RECOMMENDED, 0 Hours)

What This Means:

  • Use k6 Operator with parallelism: 1 (single pod, no distribution)
  • Larger pod size: 2Gi memory, 2 CPU
  • Can generate 10k-50k RPS from single pod
  • All metrics in single JSON summary (no aggregation needed)

Pros:

  • ✅ Zero setup complexity
  • ✅ 100% mathematically correct (single source of truth)
  • ✅ Can start today
  • ✅ Most reliable approach
  • ✅ Simple to troubleshoot

Cons:

  • ⚠️ Not horizontally scalable (limited to single pod resources)
  • ⚠️ If you outgrow to 50k+ RPS, need redesign

When to Use:

  • Perfect for: Your 10k-50k RPS target (single pod can handle 50k+ with good sizing)
  • Do this if: You want to get started quickly and confidence matters more than scale

Setup Time: 0 hours (use immediately with existing k6 Operator)


Option B: Gatling + K8s Jobs (INVESTIGATE, 3-5 Hours)

What This Means:

  • Switch from k6 to Gatling
  • Use K8s Jobs (not Operator) as per original decision record
  • Gatling handles multi-pod aggregation internally
  • Gets you beautiful HTML reports

Before Deciding: Must validate HOW Gatling aggregates:

  1. Does it stream raw request data to shared location?
  2. Or does it have the same averaging-percentiles problem as k6?
  3. How does it handle pod failures mid-test?

Pros (if Gatling aggregation is sound):

  • ✅ Built-in HTML reporting (professional, standalone)
  • ✅ Apparently solves multi-pod aggregation
  • ✅ Can use 3 pods for better resource distribution

Cons:

  • ⚠️ Larger container images (500MB vs 120MB)
  • ⚠️ Higher per-pod memory needs (2Gi+ vs 512Mi)
  • ⚠️ Need to validate correctness first
  • ⚠️ Switching from k6 requires learning Scala/Gatling DSL

When to Use:

  • If Gatling’s multi-pod aggregation validates as mathematically sound
  • If you want professional HTML reports for stakeholder sharing
  • If you’re willing to verify correctness before committing

Setup Time: 1-2 hours validation + 3-5 hours implementation (if proceeding)


Option C: k6 with Local Prometheus/InfluxDB (DIY, 4-6 Hours)

What This Means:

  • Deploy your own isolated InfluxDB in the test cluster (different from observability team’s)
  • k6 streams to your local InfluxDB: --out influxdb=http://localhost:8086/k6
  • Grafana instance in test cluster (not shared with observability team)
  • Works but means maintaining separate observability stack

Pros:

  • ✅ Mathematically correct aggregation (InfluxDB stores raw data)
  • ✅ Can use multi-pod without aggregation concerns
  • ✅ Real-time monitoring during tests
  • ✅ Historical trending of test runs

Cons:

  • ⚠️ Requires managing separate InfluxDB + Grafana (operational overhead)
  • ⚠️ Not integrated with main observability stack
  • ⚠️ More complex than single-pod
  • ⚠️ Setup takes 4-6 hours

When to Use:

  • If you need multi-pod scaling and are willing to maintain a separate observability stack
  • If InfluxDB deployment is trivial for your team
  • If you want to learn the “right way” to do distributed load testing

Setup Time: 4-6 hours (InfluxDB deployment + Grafana config + k6 integration)


RECOMMENDATION: Option A (Single Pod) + Option B Investigation (2-3 Days)

Immediate Action (Today)

Use Option A now:

  1. Modify k6 Operator to use single pod with larger resources
  2. Run your first performance test TODAY
  3. Get baseline understanding of your system
  4. Generate JSON summary reports (mathematically sound, no aggregation issues)

Parallel: Validate Gatling (Tomorrow, 1-2 hours)

While Option A is running:

  1. Quick Gatling multi-pod test (3 pods)
  2. Inspect combined HTML report
  3. Validate: Does the math check out?
  4. Decision: Is Gatling worth switching to?

Follow-up (Day 3)

If Gatling validates:

  • Consider switching for better reporting
  • But Option A is sufficient for your needs

If Gatling doesn’t validate:

  • Stick with Option A (single pod)
  • Move to Option C later only if you need >50k RPS

Why This Approach:

  • ✅ Get working solution today (Option A)
  • ✅ Validate alternative in parallel (Option B investigation)
  • ✅ No blockers, no waiting
  • ✅ Low risk (can always stick with Option A)
  • ✅ Build confidence in metrics immediately

Bottom Line

Your target is 10k-50k RPS. A single k6 pod with 2Gi memory and 2 CPU can easily generate this load. You probably don’t need multi-pod distribution at all.

Start with Option A, get working tests, build confidence in your metrics. Then invest in Option B or C only if you actually outgrow single-pod capacity.


Action Items

Immediate (Today)

  • Validate: Run k6 Operator test with 3 pods, collect metrics
    • Check: Throughput ≈ 3x single pod?
    • Check: p95 response time in reasonable range?
    • Check: Error counts and percentages make sense?
  • Document: Write down what you observe in results

Short-term (Next 2-3 Days)

  • Choose Path: Decide between Path 1 (k6+InfluxDB), Path 2 (Gatling validate), or Path 3 (single pod)
  • Implement: Set up chosen solution
  • Validate: Run test, verify confidence in metrics

Medium-term (Next 1-2 Weeks)

  • Formalize: Update decision-record.md with multi-pod aggregation approach
  • Document: Create runbook for future performance tests
  • Automation: Integrate into CI/CD pipeline

YOUR SPECIFIC SITUATION

Constraints Discovered:

  1. ✅ Target load: 10k-50k RPS (single pod CAN handle this)
  2. ⚠️ InfluxDB available but owned by different team (can’t integrate directly)
  3. ✅ You need working solution that “fits” within your constraints
  4. ❓ Gatling worth investigating (built-in aggregation)

Impact: Cannot use Path 1 (k6 + InfluxDB) directly due to observability team ownership

Revised Recommendation: See “REVISED PATHS FOR YOUR SITUATION” above

