HIP Platform - Non-Functional Requirements (NFR) Strawman

Version: 0.1 (Strawman - Under Review)
Created: 2026-02-10
Status: Draft - Seeking Stakeholder Feedback
Target Approval: Q1 2026


1. Executive Summary

This document presents a strawman of Non-Functional Requirements (NFRs) for the HIP enterprise API integration platform. It is intentionally incomplete and designed to spark discussion with stakeholders.

Key Points for Discussion:

  • NFRs are performance, reliability, security, and operational targets
  • This strawman captures initial thinking and known constraints
  • Open questions and assumptions are explicitly called out
  • Governance framework for measuring and tracking NFRs
  • Goal: Iterate with stakeholders to reach consensus and publish final NFR targets

Scope: Covers Kong ingress, microservices, managed egress, and platform APIs. Excludes backend systems and AWS infrastructure.


2. How to Use This Document

2.1 How to Read This Strawman

The NFR Strawman is organized by topic, not by reading order. You don’t need to read it front-to-back.

Quickest Review (30 min):

  1. Read Section 1: Executive Summary
  2. Read Section 8: Assumptions (what are we assuming?)
  3. Read Section 9: Open Questions (what needs decisions?)
  4. Skip to your area of concern

Thorough Review (2 hours):

  1. Executive Summary (5 min)
  2. Sections 3-7: Your specific domain (Performance? Security? Ops?)
  3. Section 8: Assumptions (15 min)
  4. Section 9: Open Questions (30 min)
  5. Take notes on disagreements

Complete Deep Dive (4 hours):

  • Read entire document
  • Cross-reference between sections
  • Identify interdependencies
  • Prepare detailed feedback

2.2 Key Sections by Role

If you’re a Developer/Producer:

  • Section 3.2 (Response Time) - What latency can you expect?
  • Section 4.2 (Error Rates) - How reliable is the platform?
  • Section 6.2 (Observability) - What can you see/debug?
  • Section 9.1, Q2 (Differentiated SLOs?) - Should your APIs have different targets?

If you’re an Operator/SRE:

  • Section 6.1 (Maintainability) - How much maintenance work?
  • Section 6.3 (Incident Response) - How do we handle failures?
  • Section 6.4 (Change Management) - How do we deploy safely?
  • Section 9.4 (Operational Questions) - On-call coverage? Rollback strategy?

If you’re in Security/Compliance:

  • Section 5 (Security NFRs) - Encryption, auth, audit
  • Section 8.3 (Data & Security Assumptions) - What are we assuming?
  • Section 9.3 (Security Questions) - Audit requirements? Key management?
  • Section 5.3 (CNI-Specific Requirements) - What about critical workloads?

If you’re Finance/Operations:

  • Section 3.3 (Resource Efficiency) - Cost per request?
  • Section 6.1 (Maintenance Window) - How much operational overhead?
  • Section 7 (Scalability) - How does this grow?
  • Section 9.5 (Growth Questions) - When do we add regions? Increase capacity?

If you’re an API Consumer:

  • Section 3.2 (Response Time) - What’s the latency?
  • Section 4.1-4.2 (Availability) - Is it reliable?
  • Section 5 (Security) - Is my data safe?
  • Section 9 (Open Questions) - Is there anything that concerns you?

2.3 How to Provide Feedback

Feedback Template (Copy & Paste):

**Your Name/Team**: [e.g., "Alice Smith, Finance Team"]

**Section**: [e.g., "Section 3.2 Response Time"]
**Specific Item**: [e.g., "P95 Latency target of 0.5 sec"]

**Type of Feedback**: 
- [ ] Question (need clarification)
- [ ] Disagreement (target too aggressive/loose)
- [ ] Missing Requirement (something not covered)
- [ ] Suggestion (alternative approach)
- [ ] Concern (worried about implications)

**Current Strawman**: 
[Quote the specific text from the document you're commenting on]

**Your Input**:
[What do you think? Be specific.]

**Rationale** (Why does this matter?):
[Business impact? Technical impact? Risk?]

**Suggested Alternative** (if applicable):
[What would you change it to? What would you measure instead?]

**Questions for Discussion**:
[Anything you want to discuss in the synthesis session?]

Example Good Feedback:

**Your Name/Team**: Bob Johnson, Security Team

**Section**: Section 6.2 Access Control & Authentication
**Specific Item**: "RBAC Granularity" marked as OPEN

**Type of Feedback**: 
- [x] Missing Requirement (something not covered)

**Current Strawman**: 
"RBAC Granularity | TBD | Role-based consumer access | OPEN"

**Your Input**:
For CNI workloads, we need fine-grained RBAC that restricts access to specific endpoints 
and methods. A consumer key shouldn't be able to call sensitive endpoints like admin 
operations. We should differentiate between "read", "write", "admin" permissions per endpoint.

**Rationale**:
CNI workloads are critical infrastructure. If a consumer key is compromised, we need to 
limit the blast radius. Currently, an API key has access to the entire API - this violates 
least-privilege principle.

**Suggested Alternative**:
Implement RBAC at Kong level with:
- Consumer key scopes (read, write, admin, etc.)
- Per-endpoint ACLs
- Time-bound permissions
- Audit log of all RBAC decisions

**Questions for Discussion**:
1. Should this be mandatory for CNI, optional for others?
2. Is Kong's RBAC sufficient or do we need custom solution?

3. Performance NFRs

3.1 Throughput (Scalability)

| Metric | Target | Notes | Status |
|--------|--------|-------|--------|
| Peak Burst Throughput | 20,000 req/s | Short-term spike capacity | STRAWMAN |
| Sustained Throughput | TBD | Normal operating load - needs definition | OPEN |
| Test Coverage | Tested & published up to target | Must demonstrate capability | GOAL |

Strawman Interpretation: HIP must handle bursts to 20k req/s without cascading failures. Normal load is expected to be significantly lower until workloads have been migrated off the legacy integration platforms.

Questions for Stakeholders:

  1. What is realistic “normal load” (P50 daily peak)?
  2. How often do 20k req/s bursts occur?
  3. Should we publish test results publicly or internally?
  4. What’s acceptable behavior when exceeding 20k req/s? (reject, queue, degrade?)
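The last question (reject, queue, or degrade?) can be grounded with a token-bucket sketch, the usual model behind gateway rate limiting: sustained rate and burst capacity are the two knobs, and anything past the bucket must be rejected, queued, or degraded. The 10k/20k figures below are illustrative placeholders, not agreed targets, and this is a simplification of what a Kong rate-limiting plugin actually does.

```python
import time

class TokenBucket:
    """Admits bursts up to `capacity`, refills at `rate` tokens/sec.
    Requests that find the bucket empty are rejected here; the
    alternatives discussed above are queueing or degrading them."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # sustained req/s (hypothetical: 10k)
        self.capacity = capacity  # burst headroom (strawman: 20k)
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10_000, capacity=20_000)
admitted = sum(bucket.allow() for _ in range(25_000))  # instantaneous 25k burst
print(f"admitted {admitted} of 25,000")  # roughly 20k pass; the excess is rejected
```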

3.2 Response Time (Latency)

| Metric | Target | Platform Components | Status |
|--------|--------|---------------------|--------|
| P50 Latency | TBD | Kong + Microservice + Egress | OPEN |
| P95 Latency | ~0.5 sec | Platform overhead acceptable range | STRAWMAN |
| P99 Latency | ~0.5 sec | Tail latency target | STRAWMAN |

Strawman Interpretation:

  • Platform components (Kong routing, microservice processing, egress gateway) add ~0.5 sec overhead
  • P95 and P99 should stay within this budget for sync APIs
  • Excludes backend system response time (that’s producer responsibility)

Scope Clarification:

  • ✅ Includes: Kong processing, microservice execution, egress auth & routing
  • ❌ Excludes: Backend system latency, client network latency, DNS resolution
  • ⚠️ Variable: Request/response payload size, transformation complexity
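To make the P95/P99 wording concrete, here is a nearest-rank percentile sketch over synthetic platform-overhead samples. In practice the SLIs would come from the O11Y metrics stack; the sample values below are invented for illustration.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value such that p% of samples
    are at or below it."""
    ranked = sorted(samples)
    k = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[k]

# Hypothetical platform-overhead samples in seconds (Kong + microservice + egress).
latencies = [0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.30, 0.35, 0.48, 0.95]

p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
print(f"P50={p50}s P95={p95}s within 0.5s budget: {p95 <= 0.5}")
# A single slow outlier is enough to push P95 past the ~0.5 sec strawman budget.
```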

Questions for Stakeholders:

  1. Is 0.5 sec acceptable for all API types, or should we differentiate?
  2. Should we measure from ALB ingress or Kong ingress?
  3. What about async APIs (fire-and-forget)? Different SLOs?
  4. Is P95/P99 the right percentile, or should we track P90 as well?
  5. What production latency baseline do we have today?

3.3 Resource Efficiency

| Metric | Target | Notes | Status |
|--------|--------|-------|--------|
| CPU Utilization | TBD | Under 20k req/s burst load | OPEN |
| Memory Footprint | TBD | Per-pod and cluster-wide | OPEN |
| Cost per Request | TBD | $ per req at 20k req/s | OPEN |

Questions for Stakeholders:

  1. Are there cost per request targets from Finance/Operations?
  2. Should we optimize for spot instances vs on-demand?
  3. What’s the acceptable cost variance between normal and burst load?

4. Reliability & Availability NFRs

4.1 Availability (Uptime)

| Metric | Target | Interpretation | Status |
|--------|--------|----------------|--------|
| Availability Target | Track vs AWS SLA | Relative to AWS service limits | STRAWMAN |
| Failure Mode | Degrade gracefully | Partial failures preferred over cascading | PRINCIPLE |
| Error Budget | TBD | Errors per month allowed | OPEN |

Strawman Rationale:

  • A 99.9% uptime target (~43 min downtime/month) is aggressive for a single-region deployment
  • AWS itself has occasional outages, which undercuts any absolute availability claim
  • Better approach: Define platform availability relative to AWS availability
  • Example: “Platform must maintain 99.5% uptime when AWS service is available”

Current Operating Model:

  • Single AWS region (no multi-region redundancy)
  • 1 hour/month out-of-hours maintenance window budgeted
  • Leaves roughly 22 min/month of budget for unplanned issues (≈99.95% excluding planned maintenance)
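The arithmetic behind these figures is worth making explicit. The sketch below converts candidate availability targets into monthly downtime budgets and subtracts the 1 hr planned window; note that at 99.9% and above the window alone exceeds the budget, which is why planned maintenance is normally excluded from the SLO. The targets shown are illustrative, not commitments.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 in a 30-day month

def downtime_budget(availability_pct: float) -> float:
    """Minutes of allowed downtime per 30-day month at a given target."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)

for target in (99.0, 99.5, 99.9, 99.95):
    total = downtime_budget(target)
    # What remains if the 1 hr planned window counts against the budget:
    unplanned = total - 60
    print(f"{target}% -> {total:.1f} min/month total, {unplanned:.1f} min unplanned")
```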

Questions for Stakeholders:

  1. Is “relative to AWS” acceptable or do you need absolute availability target?
  2. How do you handle AWS regional outages currently?
  3. Should we commit to specific AWS availability tiers (99.9 / 99.95 / 99.99)?
  4. What’s acceptable impact during the 1hr/month maintenance window?

4.2 Error Rates & Success Criteria

| Metric | Target | Definition | Status |
|--------|--------|------------|--------|
| Successful Request Rate | TBD | 2xx/3xx responses vs total | OPEN |
| Client Error Rate (4xx) | TBD | Invalid requests, auth failures | OPEN |
| Server Error Rate (5xx) | TBD | Platform errors only | OPEN |
| Error Budget | TBD | Errors allowed per month | OPEN |

Strawman Thinking:

  • Most 4xx errors are producer/consumer mistakes (not platform failures)
  • Platform should focus on minimizing 5xx errors
  • Transient errors (network blips, brief service restarts) are acceptable
  • CNI workloads may require stricter error rates than other APIs

Questions for Stakeholders:

  1. What error rate is acceptable? (1% = 99% success, 0.5% = 99.5% success)
  2. Should we differentiate by API category (CNI vs general)?
  3. How do we handle “consumer timeout” vs “platform timeout”?
  4. What’s the acceptable rate of duplicate/lost messages?

4.3 Data Consistency

| Metric | Target | Consistency Model | Status |
|--------|--------|-------------------|--------|
| API Definitions | Eventual consistency | Config propagates within seconds | STRAWMAN |
| Consumer Keys | Eventual consistency | Key revocation propagates within 1 min | STRAWMAN |
| Platform State | Eventual consistency | Acceptable across cluster | STRAWMAN |

Strawman Rationale:

  • HIP is stateless (no data persistence between requests)
  • Eventual consistency appropriate for configuration and keys
  • Not suitable for mission-critical transactional data (not HIP’s role)

Data Sensitivity Principle:

  • ✅ No sensitive data in logs
  • ✅ No PII in request/response payloads (producer responsibility)
  • ✅ No persistence of API payloads
  • ✅ Secrets management via vault/secrets store

Questions for Stakeholders:

  1. Is eventual consistency acceptable for all use cases?
  2. Are there APIs that require strict ordering/no duplicates?
  3. What’s maximum acceptable propagation delay for key changes?

5. Security NFRs

5.1 Data Protection

| Metric | Target | Scope | Status |
|--------|--------|-------|--------|
| Encryption in Transit | TLS 1.2+ | All network traffic | REQUIREMENT |
| Encryption at Rest | N/A (in-transit only) | Platform does not persist data | STATEMENT |
| Log Sanitization | No sensitive data | Logs must not contain secrets, PII | REQUIREMENT |

Important Clarification:

  • HIP does not persist API payloads (stateless pass-through)
  • Logs collected by O11Y team (no sensitive data permitted)
  • Secrets management via Kubernetes secrets + vault (implementation TBD)
  • Producers responsible for encrypting sensitive data in payloads

5.2 Access Control & Authentication

| Metric | Target | Implementation | Status |
|--------|--------|----------------|--------|
| API Producer Auth | Keycloak integration | Kong ← Keycloak ← SSO | IMPLEMENTED |
| API Consumer Auth | API Keys | Kong validates consumer keys | IMPLEMENTED |
| Network Policies | Namespace-level | Kyverno enforcement | IMPLEMENTED |
| RBAC Granularity | TBD | Role-based consumer access | OPEN |

Questions for Stakeholders:

  1. Do API consumers need fine-grained RBAC (per-endpoint access)?
  2. Should we support mTLS between platform components?
  3. What audit trail is required for CNI workloads?
  4. How often should API keys be rotated? Mandatory expiration?

5.3 CNI-Specific Requirements

| Requirement | Target | Rationale | Status |
|-------------|--------|-----------|--------|
| Audit Logging | All requests logged | Incident response, compliance | REQUIREMENT |
| Network Isolation | Zero-trust (Kyverno) | CNI workloads isolated | IMPLEMENTED |
| Secret Management | Centralized, versioned | No embedded secrets | REQUIREMENT |
| Incident Response | Documented playbooks | CNI incidents require rapid response | GOAL |

Questions for Stakeholders:

  1. What’s the required retention period for CNI audit logs?
  2. How quickly must we detect and respond to CNI security incidents?
  3. Are there compliance frameworks (NIST, etc.) that apply?

6. Operational NFRs

6.1 Maintainability & Upgrades

| Metric | Target | Implementation | Status |
|--------|--------|----------------|--------|
| Planned Maintenance | 1 hr/month out-of-hours | K8s, Kong, app upgrades | BUDGET |
| Zero-Planned-Downtime | Goal | Blue-green, canary deploys | GOAL |
| Unplanned Downtime | ~22 min/month budget | After 1 hr maintenance window | CALCULATED |

Strawman Maintenance Model:

  • 1 hour window scheduled monthly (e.g., 2am Saturday UTC)
  • All upgrades (K8s, Kong, microservices) occur within this window
  • Zero-downtime during normal business hours (goal)
  • Automatic health checks, rollback on failure

Deployment Strategy:

  • Kong upgrades: Rolling restart on Kong node group
  • Microservice upgrades: Rolling restart on API Microservices node group
  • Zero-downtime strategy: Traffic drained before restart

Questions for Stakeholders:

  1. Is 1 hr/month sufficient for all planned upgrades?
  2. Can we achieve zero-planned-downtime for all components?
  3. What’s the rollback SLA if an upgrade fails?
  4. Should we test upgrades in staging first (extends timeline)?

6.2 Observability & Monitoring

| Metric | Target | Ownership | Status |
|--------|--------|-----------|--------|
| Metrics Collection | 100% of requests | O11Y team infrastructure | IMPLEMENTED |
| Distributed Tracing | Sampled (% TBD) | Jaeger integration | IN PROGRESS |
| Log Aggregation | All platform logs | ELK/Loki stack | IMPLEMENTED |
| Alert Coverage | TBD | PagerDuty / similar | OPEN |
| Dashboards | Real-time platform health | Grafana | IMPLEMENTED |

Measurement & SLI/SLO:

  • Must be able to measure all NFR targets
  • SLI (Service Level Indicator) = actual measured value
  • SLO (Service Level Objective) = target we commit to

Questions for Stakeholders:

  1. What % of requests should we trace (all vs sampled)?
  2. What metrics are critical for alerting (P95 latency, error rate, etc.)?
  3. How long should we retain detailed metrics/logs?
  4. Should we publish SLO dashboards to consumers?

6.3 Incident Response & Resilience

| Metric | Target | Process | Status |
|--------|--------|---------|--------|
| MTTR (Mean Time To Recover) | TBD | Depends on incident type | OPEN |
| Graceful Degradation | Drop low-priority traffic | Shed load before cascade | PRINCIPLE |
| Circuit Breaker | Enabled | Prevent cascading failures | REQUIREMENT |
| Retry Logic | Exponential backoff | Avoid overwhelming backends | REQUIREMENT |

Failure Modes to Address:

  1. Kong unavailable → Requests fail (no redundancy in single-region)
  2. One microservice down → Route around it (other replicas available)
  3. Egress gateway down → Backend calls fail (need circuit breaker)
  4. Backend slow → Don’t overwhelm with retries (backoff + timeout)
  5. Cert expiration → Services fail (need automated renewal)
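Failure mode 4 (slow backend) motivates the exponential-backoff requirement. A minimal sketch with full jitter, under illustrative parameters (the attempt count, base delay, and cap are not platform policy):

```python
import random
import time

def call_with_backoff(call, max_attempts: int = 4, base: float = 0.2, cap: float = 5.0):
    """Retry `call` on connection errors, sleeping a jittered, exponentially
    growing delay between attempts so a struggling backend is not hammered."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random amount up to the exponential bound.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

attempts = 0
def flaky():
    """Simulated backend that fails twice, then recovers."""
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("backend timeout")
    return "ok"

print(call_with_backoff(flaky))  # succeeds on the third attempt
```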

Questions for Stakeholders:

  1. Should Kong have redundancy on Kong node group?
  2. What’s acceptable failure domain (namespace, node, AZ)?
  3. How should we handle partial backend failures?
  4. What’s the circuit breaker timeout policy?
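The circuit-breaker requirement can be sketched as a small state machine: consecutive failures open the circuit, calls then fail fast until a cool-down elapses, and a successful probe closes it again. The threshold and cool-down values are illustrative, not policy.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, calls fail
    fast for `reset_after` seconds instead of hitting the backend."""

    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```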

6.4 Change Management

| Process | Target | Requirement | Status |
|---------|--------|-------------|--------|
| Configuration Changes | GitOps tracked | All changes via Git | REQUIREMENT |
| Code Review | All PRs reviewed | Two-approval minimum | POLICY |
| Rollback Capability | 1-click or git revert | Rapid rollback on issues | REQUIREMENT |
| Change Log | Automated from Git | Audit trail of all changes | REQUIREMENT |

Questions for Stakeholders:

  1. Who has authority to approve changes? (architecture, security, etc.)
  2. What’s the change window policy? (business hours only, etc.)
  3. Should we gate changes based on error budget?

7. Scalability & Growth NFRs

7.1 Horizontal Scalability

| Dimension | Target | Current | Status |
|-----------|--------|---------|--------|
| Request Throughput | 20,000 req/s burst | Unknown baseline | TARGET |
| API Catalog Size | TBD APIs | ~100s of APIs | OPEN |
| Concurrent Consumers | TBD teams | 20+ producing teams | OPEN |
| Data Volume | TBD | Logs, metrics, configs | OPEN |

Scalability Model:

  • Stateless services can scale horizontally (add more pods)
  • Kong, microservices, egress gateways all scalable
  • Database/stateful components (Keycloak) may be limiting factor

Questions for Stakeholders:

  1. What’s the expected growth rate (APIs/month, teams/month)?
  2. Should catalog size affect platform performance?
  3. Are there bottlenecks we haven’t identified?
  4. What’s max acceptable API catalog size?

8. Assumptions

8.1 Architecture Assumptions

  • Single Region Adequate: Single AWS region acceptable without multi-region HA
  • Node Group Isolation: Kong and microservices on separate node groups sufficient for isolation
  • Kong Redundancy: Kong node group has built-in HA (multiple pods)
  • Stateless Services: Platform doesn’t need to persist API payloads (confirmed)
  • Eventually Consistent: Configuration consistency across cluster in seconds is acceptable
  • Keycloak Availability: Keycloak in-cluster outage causes authentication failures (no fallback)

8.2 Operational Assumptions

  • 1 Hr/Month Sufficient: 1 hour monthly maintenance window adequate for all upgrades
  • AWS Reliability: Assume AWS availability zones don’t all fail simultaneously
  • Network Stability: Assume network partitions between nodes are rare/short-lived
  • Cert Auto-Renewal: TLS cert expiration automated (no manual renewal)
  • Log Retention: 30 days log retention sufficient (O11Y team policy)
  • Secrets Rotation: Automated secrets rotation for service accounts

8.3 Data & Security Assumptions

  • No Sensitive Data in Logs: Producers responsible for not logging secrets (we enforce filters)
  • No Data Persistence: Platform doesn’t store API payloads beyond request processing
  • Eventual Consistency OK for Configs: API definition changes propagate within seconds
  • API Keys Don’t Expire: Consumer keys valid until manually revoked
  • TLS Everywhere: All traffic between components encrypted (at least internally)

8.4 Growth Assumptions

  • 20k req/s is Peak: Burst load, not sustained load
  • API Growth Manageable: Catalog size won’t cause performance degradation
  • Team Scaling Linear: Adding producers/consumers doesn’t require architecture changes
  • Cost Scales Linearly: Cost per request remains constant at scale

9. Open Questions & Decisions Needed

9.1 Performance Questions

Q1: What is “normal” sustained throughput?

  • Strawman: 5-10k req/s? (Needs validation)
  • Impact: Affects baseline resource allocation, auto-scaling thresholds

Q2: Should response time targets vary by API type?

  • Real-time APIs (< 100ms)?
  • Standard APIs (0.5s)?
  • Batch APIs (multiple seconds)?
  • Decision needed: Differentiated SLOs?

Q3: Are there high-priority APIs that need stricter SLOs?

  • CNI workloads?
  • Finance APIs?
  • High-volume producer APIs?
  • Decision needed: Tiered SLOs?

9.2 Reliability Questions

Q4: What constitutes platform success?

  • Does “success” include backend timeouts? (Platform issue or backend issue?)
  • How do we differentiate platform errors from consumer errors?
  • Decision needed: Define “platform error” vs “consumer error”

Q5: Should Kong have redundancy?

  • Current: Single Kong node group (single point of failure)
  • Option A: Multiple replicas on same node group (resilient to pod failure)
  • Option B: Multiple Kong pods spread across multiple nodes (resilient to node failure)
  • Decision needed: Kong redundancy strategy

Q6: What error rate is acceptable?

  • 1% (99% success rate)?
  • 0.5% (99.5% success rate)?
  • 0.1% (99.9% success rate)?
  • Context-dependent?
  • Decision needed: Error budget

9.3 Security Questions

Q7: What audit trail is required for CNI?

  • All requests logged? (High volume)
  • Just authentication events?
  • Requests that modify resources?
  • Decision needed: CNI audit scope and retention

Q8: Should we implement rate limiting per consumer?

  • Fair-use protection?
  • Prevent noisy neighbor?
  • Plan: Keycloak/Kong level?
  • Decision needed: Rate limiting policy

Q9: How should we handle key compromise?

  • Immediate revocation?
  • Grace period for consumers to rotate?
  • Logging of rotated keys?
  • Decision needed: Key compromise response

9.4 Operational Questions

Q10: Can we achieve zero-planned-downtime?

  • Kong: Blue-green deployment?
  • Microservices: Rolling restart?
  • State coordination needed?
  • Decision needed: Zero-downtime deployment strategy

Q11: What’s the rollback SLA?

  • Automatic rollback on failed deployment? (How long detection?)
  • Manual rollback request? (How quickly can ops respond?)
  • Decision needed: Rollback automation level

Q12: Should we have a 24/7 on-call rotation?

  • Only for CNI incidents?
  • For all platform incidents?
  • Coverage model?
  • Decision needed: On-call requirements

9.5 Growth & Scale Questions

Q13: When do we revisit these NFRs?

  • Annually?
  • When reaching 50% of targets?
  • When receiving customer complaints?
  • Decision needed: NFR review cadence

Q14: What’s the multi-region trigger?

  • Customer demand?
  • Regulatory requirement?
  • Cost threshold?
  • Decision needed: Multi-region criteria

Q15: Should we support other regions proactively?

  • Design for multi-region now?
  • Single-region design, migrate later?
  • Decision needed: Future-proofing vs pragmatism

10. Governance & Measurement Framework

10.1 How We’ll Track NFRs

SLI (Service Level Indicators) - What we measure:

SLI = (Successful Requests) / (Total Requests)
SLI = P95 Latency from metrics
SLI = Requests per second from metrics
SLI = Uptime (not down) percentage

SLO (Service Level Objectives) - What we commit to:

SLO: SLI >= 99.0% success rate
SLO: P95 latency <= 0.5 sec
SLO: Throughput >= 20k req/s (burst)
SLO: Uptime >= "relative to AWS"

Error Budget - How much we can fail:

If SLO = 99%, error budget = 1% = 864 min downtime/month
We can tolerate some failures to meet this budget
When error budget exhausted, freeze all non-critical changes
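The error budget can also be expressed per-request rather than as downtime. The sketch below computes the budget and a burn rate for an illustrative 99% success SLO; a burn rate above 1.0 means the budget will run out before the window ends. All figures are hypothetical.

```python
# Request-count error budget and burn rate for an illustrative 99% success SLO.
SLO = 0.99
window_requests = 10_000_000                # requests expected in the 30-day window
budget = int(window_requests * (1 - SLO))   # allowed failed requests

failed_so_far = 55_000
elapsed_fraction = 0.5                      # halfway through the window
# Fraction of budget consumed, normalized by elapsed time:
burn_rate = (failed_so_far / budget) / elapsed_fraction

print(f"budget={budget} failures, burn rate={burn_rate:.2f}x")
# burn rate > 1.0: on track to exhaust the budget before the window ends
```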

10.2 Measurement & Reporting

Who Measures:

  • O11Y team: Collects metrics, logs, traces
  • API Management team: Interprets metrics, tracks SLO status
  • Core Infra team: Measures availability, reliability metrics

Frequency:

  • Real-time dashboards: Grafana (continuous)
  • Daily reports: SLO status, error budget burn rate
  • Weekly: Team review of metrics vs targets
  • Monthly: Full NFR review and reporting

Stakeholder Communication:

  • Producer teams: API-specific latency, error rates
  • Consumers: Availability, response time
  • Leadership: Overall platform health, error budget status
  • Board: Strategic NFR progress, roadmap impact

10.3 Escalation & Response

When SLO is Breached:

  1. Alert triggered automatically (P95 > 0.6s, success < 99%, etc.)
  2. On-call engineer investigates
  3. Incident commander engaged for major breaches
  4. Root cause documented, change review if needed

Error Budget Exhaustion:

  • All non-critical changes frozen
  • Focus shifts to stability and cost reduction
  • New features on hold until budget recovers
  • Stakeholder notification of change freeze

11. Strawman Review Process

11.1 How to Use This Document

For Platform Team:

  1. Read through and add any missing dimensions
  2. Identify assumptions you disagree with
  3. Flag decisions you can make now vs need stakeholder input

For Stakeholders:

  1. Review assumptions - do they match your understanding?
  2. Answer open questions - provide your requirements
  3. Challenge targets that seem unrealistic
  4. Identify missing NFR dimensions

For Producers & Consumers:

  1. Review SLOs - do they match expectations?
  2. Identify stricter requirements for your use case
  3. Provide feedback on real-world latency/throughput needs

11.2 Feedback Template

When providing feedback, use this format:

**Section**: [e.g., "3.2 Response Time"]
**Topic**: [e.g., "P95 Latency target"]
**Current Strawman**: [Quote the strawman text]
**Feedback**: [Your thoughts]
**Rationale**: [Why this matters]
**Suggested Change**: [Alternative target or approach]
**Question**: [Clarification needed?]

11.3 Approval Process

  1. Team Review (2 weeks): Platform teams review internally
  2. Stakeholder Consultation (2-3 weeks): Async feedback via document
  3. Synthesis Session (2 hours): Team discusses major disagreements
  4. Revision (1 week): Incorporate agreed changes
  5. Leadership Sign-off (1 week): Final approval
  6. Publication (immediate): Publish final NFRs

12. Next Steps

12.1 Immediate (Week 1)

  • Distribute this strawman to stakeholders
  • Request feedback using the template (Section 11.2)
  • Feedback deadline: 2 weeks
  • Create dedicated Slack channel for questions

12.2 Short Term (Weeks 2-4)

  • Compile all feedback
  • Identify consensus vs disagreement
  • Schedule synthesis session for major disagreements
  • Create baseline measurement for current state

12.3 Medium Term (Weeks 4-8)

  • Publish revised NFR document (v1.0)
  • Create measurement/dashboard for each NFR
  • Establish SLO monitoring and alerting
  • Begin tracking against targets

13. Reference: Current Known Metrics

Establish Baseline:

Current state (to be measured):
- Average request latency: _____ ms
- P95 latency: _____ ms
- P99 latency: _____ ms
- Current peak throughput: _____ req/s
- Error rate: _____% 
- Typical monthly downtime: _____ minutes
- Kong availability: _____% uptime

Appendix A: Glossary

| Term | Definition |
|------|------------|
| SLA | Service Level Agreement - contractual commitment to customers |
| SLO | Service Level Objective - internal target for performance |
| SLI | Service Level Indicator - actual measured metric |
| P95/P99 | 95th/99th percentile latency (95%/99% of requests are faster than this) |
| Throughput | Requests per second the system can handle |
| MTTR | Mean Time To Recover - average time to fix an incident |
| RTO | Recovery Time Objective - max acceptable downtime after failure |
| RPO | Recovery Point Objective - max acceptable data loss |
| Error Budget | Amount of failure allowed while still meeting SLO |
| Zero-Downtime | Deployment without any user-facing outage |

Appendix B: Industry Benchmarks

For reference, typical targets:

| Metric | Startups | Established | Enterprise | Notes |
|--------|----------|-------------|------------|-------|
| Availability | 99% | 99.5% | 99.9%+ | Higher = more expensive |
| P95 Latency | 1+ sec | 500 ms | 100-200 ms | Backend dependent |
| Error Rate | 1-2% | 0.5-1% | <0.1% | Better = higher cost |
| Throughput | Variable | 1-10k req/s | 10k+ req/s | Depends on use case |
| Maintenance Window | None | 1-4 hr/month | Unplanned only | Higher cost = less maintenance |

Note: HIP targets put us in “Established” to “Enterprise” range - which drives infrastructure/operational investment


14. Coverage Analysis: What Else Should We Consider?

Beyond the NFRs documented above, there are additional dimensions worth considering. This section identifies gaps and future considerations.

14.1 Dimensions Fully Covered ✅

The strawman thoroughly addresses:

  • Throughput & Scalability: 20k req/s burst, horizontal scaling
  • Latency Targets: P95/P99 response time
  • Availability Model: Relative to AWS SLA
  • Security Basics: Encryption, authentication, CNI-specific security
  • Operational: Maintenance windows, observability, incident response
  • Reliability Principles: Graceful degradation, circuit breakers

14.2 Dimensions Partially Covered (Open Questions)

Areas with initial thinking but needing stakeholder input:

  • Error Rates (Section 4.2): What success rate is acceptable? (1% vs 0.5% vs 0.1%)
  • Sustained Load (Section 3.1): What’s “normal” throughput vs “burst”?
  • Cost per Request (Section 3.3): Should we optimize for cost?
  • Tiered SLOs (Section 9.1, Q2): Different targets for different API types?
  • On-Call Model (Section 9.4, Q12): 24/7 coverage or business hours?
  • Rate Limiting (Section 9.3, Q8): Per-consumer limits? Per-endpoint?

14.3 Dimensions Not Covered - Consider for v1.1 or Future ⚠️

These topics may warrant addition to NFR document after stakeholder feedback:

Deployment & Release Management

  • How frequently can we deploy? (daily, weekly, on-demand?)
  • How fast should deployments complete? (5 min, 30 min?)
  • Canary/progressive rollout strategy?
  • Rollback time SLO?
  • When to discuss: If producers need faster deployment cycles

Consumer/External SLAs

  • What do we promise externally to API consumers?
  • Are external SLAs different from internal SLOs?
  • Any contractual SLA commitments?
  • When to discuss: If selling platform as service or have customers

Capacity Planning Model

  • How much headroom do we need? (70% utilization? 50%?)
  • Can we handle 2x or 10x normal load?
  • Forecasting horizon for capacity?
  • When to discuss: If rapid growth expected

API Lifecycle Management

  • How long should we support old API versions?
  • Backward compatibility policy?
  • Deprecation timeline?
  • When to discuss: As API catalog matures

Data Retention & Privacy

  • Log retention duration? (30 days, 90 days, 1 year?)
  • GDPR/PII handling?
  • Audit log immutability?
  • When to discuss: If compliance requirements identified

Multi-Tenancy Isolation Depth

  • Noisy neighbor protection (rate limit per producer)?
  • Resource quotas per API?
  • Cost attribution?
  • Failure blast radius control?
  • When to discuss: As teams scale and resource contention increases

Async/Event-Driven APIs

  • Do we support pub/sub or event patterns?
  • Message ordering requirements?
  • Deduplication guarantees?
  • When to discuss: If producers request async capabilities

Schema Management & Evolution

  • How should we manage API schema changes?
  • Breaking change detection?
  • Multiple schema version support?
  • When to discuss: As API catalog grows and versioning becomes critical

Disaster Recovery & Business Continuity

  • RTO (Recovery Time Objective)?
  • RPO (Recovery Point Objective)?
  • Backup strategy and testing?
  • Multi-region requirements?
  • When to discuss: If high-availability becomes critical

Performance Testing & Validation

  • How do we validate the 20k req/s target? (load testing, soak testing?)
  • Regression testing for performance?
  • Baseline measurement of current state?
  • When to discuss: Before claiming compliance with targets

Cost Model & Chargeback

  • Fixed vs variable cost model?
  • Chargeback to teams?
  • Cost transparency?
  • Cost forecasting?
  • When to discuss: If financial accountability required

Compliance & Audit Frameworks

  • Which compliance frameworks apply? (SOC2, ISO 27001, HIPAA, PCI-DSS?)
  • Audit trail immutability?
  • Compliance reporting automation?
  • When to discuss: If regulatory requirements identified

Customer Support SLA

  • How quickly should we respond to issues?
  • Support hours (24/7 or business hours)?
  • Support channels?
  • Self-service troubleshooting guides?
  • When to discuss: If external customers to support

14.4 Assessment: Should Any Move to v1.0?

Ask stakeholders:

“Looking at this list of not-yet-covered dimensions, which ones are critical for v1.0? Which can wait until v1.1?”

Likely candidates for v1.0 if raised:

  1. Deployment Frequency - If producers need rapid iteration
  2. Canary/Progressive Rollout - If safety is critical concern
  3. Rate Limiting - If noisy neighbor issues expected
  4. Cost Model - If chargebacks/accountability required
  5. Async API Support - If requested by producers

Likely candidates to defer to v1.1 or future:

  • API lifecycle/deprecation policy
  • Data retention specifics
  • Schema management
  • Disaster recovery
  • Compliance frameworks (can start with documentation)

Document Metadata

Version History:

  • v0.1 (2026-02-10): Initial strawman - seeking feedback

Authors: API Management Team, Core Infra Team

Stakeholders for Review:

  • Platform teams (O11Y, DevEx, Core Infra, API Mgmt)
  • Producer team representatives (2-3 teams)
  • Security/Compliance team
  • Finance/Operations team

Contact: [Slack channel to be created]

Next Review Date: 2026-04-01 (after stakeholder feedback incorporated)


END OF STRAWMAN


How to Proceed

This strawman is intentionally incomplete and designed to be discussed.

Next Actions:

  1. Share with your stakeholder group
  2. Use the feedback template to gather input
  3. Schedule a synthesis discussion for week 2-3
  4. Iterate toward consensus
  5. Publish v1.0 once approved
