HIP Platform - Non-Functional Requirements (NFR) Strawman
Version: 0.1 (Strawman - Under Review)
Created: 2026-02-10
Status: Draft - Seeking Stakeholder Feedback
Target Approval: Q1 2026
1. Executive Summary
This document presents a strawman of Non-Functional Requirements (NFRs) for the HIP enterprise API integration platform. It is intentionally incomplete and designed to spark discussion with stakeholders.
Key Points for Discussion:
- NFRs are performance, reliability, security, and operational targets
- This strawman captures initial thinking and known constraints
- Open questions and assumptions are explicitly called out
- Governance framework for measuring and tracking NFRs
- Goal: Iterate with stakeholders to reach consensus and publish final NFR targets
Scope: Covers Kong ingress, microservices, managed egress, and platform APIs. Excludes backend systems and AWS infrastructure.
2. How to Use This Document
2.1 How to Read This Strawman
The NFR Strawman is organized by topic, not by reading order. You don’t need to read it front-to-back.
Quickest Review (30 min):
- Read Section 1: Executive Summary
- Read Section 8: Assumptions (what are we assuming?)
- Read Section 9: Open Questions (what needs decisions?)
- Skip to your area of concern
Thorough Review (2 hours):
- Executive Summary (5 min)
- Sections 3-6: Your specific domain (Performance? Security? Ops?)
- Section 8: Assumptions (15 min)
- Section 9: Open Questions (30 min)
- Take notes on disagreements
Complete Deep Dive (4 hours):
- Read entire document
- Cross-reference between sections
- Identify interdependencies
- Prepare detailed feedback
2.2 Key Sections by Role
If you’re a Developer/Producer:
- Section 3.2 (Response Time) - What latency can you expect?
- Section 4.2 (Error Rates) - How reliable is the platform?
- Section 6.2 (Observability) - What can you see/debug?
- Section 9.1, Q2 (Differentiated SLOs?) - Should your APIs have different targets?
If you’re an Operator/SRE:
- Section 6.1 (Maintainability) - How much maintenance work?
- Section 6.3 (Incident Response) - How do we handle failures?
- Section 6.4 (Change Management) - How do we deploy safely?
- Section 9.4 (Operational Questions) - On-call coverage? Rollback strategy?
If you’re in Security/Compliance:
- Section 5 (Security NFRs) - Encryption, auth, audit
- Section 8.3 (Data & Security Assumptions) - What are we assuming?
- Section 9.3 (Security Questions) - Audit requirements? Key management?
- Section 5.3 (CNI-Specific Requirements) - What about critical workloads?
If you’re Finance/Operations:
- Section 3.3 (Resource Efficiency) - Cost per request?
- Section 6.1 (Maintenance Window) - How much operational overhead?
- Section 7 (Scalability) - How does this grow?
- Section 9.5 (Growth Questions) - When do we add regions? Increase capacity?
If you’re an API Consumer:
- Section 3.2 (Response Time) - What’s the latency?
- Section 4.1-4.2 (Availability) - Is it reliable?
- Section 5 (Security) - Is my data safe?
- Section 9 (Open Questions) - Is there anything that concerns you?
2.3 How to Provide Feedback
Feedback Template (Copy & Paste):
**Your Name/Team**: [e.g., "Alice Smith, Finance Team"]
**Section**: [e.g., "Section 3.2 Response Time"]
**Specific Item**: [e.g., "P95 Latency target of 0.5 sec"]
**Type of Feedback**:
- [ ] Question (need clarification)
- [ ] Disagreement (target too aggressive/loose)
- [ ] Missing Requirement (something not covered)
- [ ] Suggestion (alternative approach)
- [ ] Concern (worried about implications)
**Current Strawman**:
[Quote the specific text from the document you're commenting on]
**Your Input**:
[What do you think? Be specific.]
**Rationale** (Why does this matter?):
[Business impact? Technical impact? Risk?]
**Suggested Alternative** (if applicable):
[What would you change it to? What would you measure instead?]
**Questions for Discussion**:
[Anything you want to discuss in the synthesis session?]
Example Good Feedback:
**Your Name/Team**: Bob Johnson, Security Team
**Section**: Section 5.2 Access Control & Authentication
**Specific Item**: "RBAC Granularity" marked as OPEN
**Type of Feedback**:
- [x] Missing Requirement (something not covered)
**Current Strawman**:
"RBAC Granularity | TBD | Role-based consumer access | OPEN"
**Your Input**:
For CNI workloads, we need fine-grained RBAC that restricts access to specific endpoints
and methods. A consumer key shouldn't be able to call sensitive endpoints like admin
operations. We should differentiate between "read", "write", "admin" permissions per endpoint.
**Rationale**:
CNI workloads are critical infrastructure. If a consumer key is compromised, we need to
limit the blast radius. Currently, an API key has access to the entire API - this violates
least-privilege principle.
**Suggested Alternative**:
Implement RBAC at Kong level with:
- Consumer key scopes (read, write, admin, etc.)
- Per-endpoint ACLs
- Time-bound permissions
- Audit log of all RBAC decisions
**Questions for Discussion**:
1. Should this be mandatory for CNI, optional for others?
2. Is Kong's built-in RBAC sufficient, or do we need a custom solution?
3. Performance NFRs
3.1 Throughput (Scalability)
| Metric | Target | Notes | Status |
|---|---|---|---|
| Peak Burst Throughput | 20,000 req/s | Short-term spike capacity | Strawman |
| Sustained Throughput | TBD | Normal operating load - needs definition | OPEN |
| Test Coverage | Tested & published up to target | Must demonstrate capability | GOAL |
Strawman Interpretation: HIP must handle bursts to 20k req/s without cascading failures. Normal load is expected to be significantly lower until workloads have been migrated off the legacy integration platforms.
Questions for Stakeholders:
- What is realistic “normal load” (P50 daily peak)?
- How often do 20k req/s bursts occur?
- Should we publish test results publicly or internally?
- What’s acceptable behavior when exceeding 20k req/s? (reject, queue, degrade?)
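One possible answer to the last question above is admission control at the edge: a token bucket admits short bursts up to a fixed capacity and rejects traffic beyond it (e.g., with HTTP 429). A minimal sketch with illustrative numbers, not agreed policy:

```python
class TokenBucket:
    """Admit requests at `rate` per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is admitted
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would map this to HTTP 429


# A burst of 11 simultaneous requests against a 10-token bucket:
bucket = TokenBucket(rate=5.0, capacity=10.0)
results = [bucket.allow(0.0) for _ in range(11)]
assert results[:10] == [True] * 10 and results[10] is False
```

Queueing or degrading instead of rejecting would replace the `return False` branch; the trade-off between the three is exactly the open question above.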
3.2 Response Time (Latency)
| Metric | Target | Platform Components | Status |
|---|---|---|---|
| P50 Latency | TBD | Kong + Microservice + Egress | OPEN |
| P95 Latency | ~0.5 sec | Platform overhead acceptable range | STRAWMAN |
| P99 Latency | ~0.5 sec | Tail latency target | STRAWMAN |
Strawman Interpretation:
- Platform components (Kong routing, microservice processing, egress gateway) add ~0.5 sec overhead
- P95 and P99 should stay within this budget for sync APIs
- Excludes backend system response time (that’s producer responsibility)
Scope Clarification:
- ✅ Includes: Kong processing, microservice execution, egress auth & routing
- ❌ Excludes: Backend system latency, client network latency, DNS resolution
- ⚠️ Variable: Request/response payload size, transformation complexity
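For clarity on what P95/P99 mean in the table above: the latency value that 95% (or 99%) of requests fall at or under. A minimal nearest-rank computation over a synthetic latency sample (not platform measurements):

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic latency sample in milliseconds.
latencies_ms = [120, 180, 200, 250, 300, 320, 400, 450, 480, 900]
assert percentile(latencies_ms, 50) == 300
assert percentile(latencies_ms, 95) == 900  # tail dominated by the one slow request
```

Note how a single slow request dominates P95 in a small sample; this is why the question below about also tracking P90 matters.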
Questions for Stakeholders:
- Is 0.5 sec acceptable for all API types, or should we differentiate?
- Should we measure from ALB ingress or Kong ingress?
- What about async APIs (fire-and-forget)? Different SLOs?
- Is P95/P99 the right percentile, or should we track P90 as well?
- What production latency baseline do we have today?
3.3 Resource Efficiency
| Metric | Target | Notes | Status |
|---|---|---|---|
| CPU Utilization | TBD | Under 20k req/s burst load | OPEN |
| Memory Footprint | TBD | Per-pod and cluster-wide | OPEN |
| Cost per Request | TBD | $ per req at 20k req/s | OPEN |
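Cost per request, once measured, is simple division of monthly platform cost by monthly request volume; a worked sketch with placeholder figures (not actual HIP costs):

```python
def cost_per_million(monthly_cost_usd: float, avg_req_per_s: float) -> float:
    """USD per one million requests, assuming a 30-day month."""
    monthly_requests = avg_req_per_s * 60 * 60 * 24 * 30
    return monthly_cost_usd / monthly_requests * 1_000_000


# Hypothetical: $20,000/month of infrastructure at a sustained 5,000 req/s.
assert round(cost_per_million(20_000, 5_000), 4) == 1.5432  # ~$1.54 per million requests
```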
Questions for Stakeholders:
- Are there cost per request targets from Finance/Operations?
- Should we optimize for spot instances vs on-demand?
- What’s the acceptable cost variance between normal and burst load?
4. Reliability & Availability NFRs
4.1 Availability (Uptime)
| Metric | Target | Interpretation | Status |
|---|---|---|---|
| Availability Target | Track vs AWS SLA | Relative to AWS service limits | STRAWMAN |
| Failure Mode | Degrade gracefully | Partial failures preferred over cascading | PRINCIPLE |
| Error Budget | TBD | Errors per month allowed | OPEN |
Strawman Rationale:
- 99.9% uptime (~43 min downtime/month) is aggressive for single-region deployment
- AWS itself has occasional outages, which undercuts absolute uptime claims
- Better approach: Define platform availability relative to AWS availability
- Example: “Platform must maintain 99.5% uptime when AWS service is available”
Current Operating Model:
- Single AWS region (no multi-region redundancy)
- 1 hour/month out-of-hours maintenance window budgeted
- Reduces real downtime budget to ~22 min/month for unplanned issues
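The "relative to AWS" approach amounts to measuring platform availability only over periods when AWS itself was up; a sketch of that calculation on a hypothetical minute-by-minute timeline:

```python
def relative_availability(platform_up: list[bool], aws_up: list[bool]) -> float:
    """Platform availability measured only over minutes when AWS was available."""
    eligible = [p for p, a in zip(platform_up, aws_up) if a]
    if not eligible:
        return 1.0  # no eligible minutes: nothing to hold the platform to
    return sum(eligible) / len(eligible)


# 10-minute toy timeline: AWS down min 4-5; platform down min 4-6,
# but two of its three down minutes overlap the AWS outage and are excluded.
aws_up      = [True] * 4 + [False] * 2 + [True] * 4
platform_up = [True] * 4 + [False] * 3 + [True] * 3
assert relative_availability(platform_up, aws_up) == 7 / 8
```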
Questions for Stakeholders:
- Is “relative to AWS” acceptable or do you need absolute availability target?
- How do you handle AWS regional outages currently?
- Should we commit to specific AWS availability tiers (99.9 / 99.95 / 99.99)?
- What’s acceptable impact during the 1hr/month maintenance window?
4.2 Error Rates & Success Criteria
| Metric | Target | Definition | Status |
|---|---|---|---|
| Successful Request Rate | TBD | 2xx/3xx responses vs total | OPEN |
| Client Error Rate (4xx) | TBD | Invalid requests, auth failures | OPEN |
| Server Error Rate (5xx) | TBD | Platform errors only | OPEN |
| Error Budget | TBD | Errors allowed per month | OPEN |
Strawman Thinking:
- Most 4xx errors are producer/consumer mistakes (not platform failures)
- Platform should focus on minimizing 5xx errors
- Transient errors (network blips, transient service restarts) acceptable
- CNI workloads may require stricter error rates than other APIs
Questions for Stakeholders:
- What error rate is acceptable? (1% = 99% success, 0.5% = 99.5% success)
- Should we differentiate by API category (CNI vs general)?
- How do we handle “consumer timeout” vs “platform timeout”?
- What’s the acceptable rate of duplicate/lost messages?
4.3 Data Consistency
| Metric | Target | Consistency Model | Status |
|---|---|---|---|
| API Definitions | Eventual consistency | Config propagates within seconds | STRAWMAN |
| Consumer Keys | Eventual consistency | Key revocation propagates within 1 min | STRAWMAN |
| Platform State | Eventual consistency | Acceptable across cluster | STRAWMAN |
Strawman Rationale:
- HIP is stateless (no data persistence between requests)
- Eventual consistency appropriate for configuration and keys
- Not suitable for mission-critical transactional data (not HIP’s role)
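Under a cache-based propagation model (an assumption here; actual Kong/config sync behavior is TBD), the worst-case delay before a revoked key stops working is roughly the config sync interval plus any per-node key cache TTL:

```python
def worst_case_propagation_s(sync_interval_s: float, cache_ttl_s: float) -> float:
    """Upper bound on how long a change (e.g., key revocation) may take to apply everywhere."""
    return sync_interval_s + cache_ttl_s


# Illustrative values only: 5 s config sync + 30 s per-node key cache.
assert worst_case_propagation_s(5, 30) == 35  # comfortably inside the 1-minute target
```

Measuring the real intervals would turn the 1-minute revocation target above into a verifiable SLO.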
Data Sensitivity Principle:
- ✅ No sensitive data in logs
- ✅ No PII in request/response payloads (producer responsibility)
- ✅ No persistence of API payloads
- ✅ Secrets management via vault/secrets store
Questions for Stakeholders:
- Is eventual consistency acceptable for all use cases?
- Are there APIs that require strict ordering/no duplicates?
- What’s maximum acceptable propagation delay for key changes?
5. Security NFRs
5.1 Data Protection
| Metric | Target | Scope | Status |
|---|---|---|---|
| Encryption in Transit | TLS 1.2+ | All network traffic | REQUIREMENT |
| Encryption at Rest | N/A - no data persisted | Platform does not persist data | STATEMENT |
| Log Sanitization | No sensitive data | Logs must not contain secrets, PII | REQUIREMENT |
Important Clarification:
- HIP does not persist API payloads (stateless pass-through)
- Logs collected by O11Y team (no sensitive data permitted)
- Secrets management via Kubernetes secrets + vault (implementation TBD)
- Producers responsible for encrypting sensitive data in payloads
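The log sanitization requirement is typically enforced by a scrubbing filter applied before log emission; a minimal sketch (patterns are illustrative only, and real coverage would need security review):

```python
import re

# Illustrative patterns only: bearer tokens and apikey=... query parameters.
SECRET_PATTERNS = [
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
    re.compile(r"(?i)(apikey=)[^&\s]+"),
]


def scrub(line: str) -> str:
    """Replace likely secret values with a redaction marker before logging."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(r"\1[REDACTED]", line)
    return line


assert scrub("GET /v1/pay?apikey=abc123 HTTP/1.1") == "GET /v1/pay?apikey=[REDACTED] HTTP/1.1"
```

Pattern-based scrubbing is best-effort; the producer-side "no sensitive data in payloads" rule above remains the primary control.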
5.2 Access Control & Authentication
| Metric | Target | Implementation | Status |
|---|---|---|---|
| API Producer Auth | Keycloak integration | Kong → Keycloak → SSO | IMPLEMENTED |
| API Consumer Auth | API Keys | Kong validates consumer keys | IMPLEMENTED |
| Network Policies | Namespace-level | Kyverno enforcement | IMPLEMENTED |
| RBAC Granularity | TBD | Role-based consumer access | OPEN |
Questions for Stakeholders:
- Do API consumers need fine-grained RBAC (per-endpoint access)?
- Should we support mTLS between platform components?
- What audit trail is required for CNI workloads?
- How often should API keys be rotated? Mandatory expiration?
5.3 CNI-Specific Requirements
| Requirement | Target | Rationale | Status |
|---|---|---|---|
| Audit Logging | All requests logged | Incident response, compliance | REQUIREMENT |
| Network Isolation | Zero-trust (Kyverno) | CNI workloads isolated | IMPLEMENTED |
| Secret Management | Centralized, versioned | No embedded secrets | REQUIREMENT |
| Incident Response | Documented playbooks | CNI incidents require rapid response | GOAL |
Questions for Stakeholders:
- What’s the required retention period for CNI audit logs?
- How quickly must we detect and respond to CNI security incidents?
- Are there compliance frameworks (NIST, etc.) that apply?
6. Operational NFRs
6.1 Maintainability & Upgrades
| Metric | Target | Implementation | Status |
|---|---|---|---|
| Planned Maintenance | 1 hr/month out-of-hours | K8s, Kong, app upgrades | BUDGET |
| Zero-Planned-Downtime | Goal | Blue-green, canary deploys | GOAL |
| Unplanned Downtime | ~22 min/month budget | After 1hr maintenance window | CALCULATED |
Strawman Maintenance Model:
- 1 hour window scheduled monthly (e.g., 2am Saturday UTC)
- All upgrades (K8s, Kong, microservices) occur within this window
- Zero-downtime during normal business hours (goal)
- Automatic health checks, rollback on failure
Deployment Strategy:
- Kong upgrades: Rolling restart on Kong node group
- Microservice upgrades: Rolling restart on API Microservices node group
- Zero-downtime strategy: Traffic drained before restart
Questions for Stakeholders:
- Is 1 hr/month sufficient for all planned upgrades?
- Can we achieve zero-planned-downtime for all components?
- What’s the rollback SLA if an upgrade fails?
- Should we test upgrades in staging first (extends timeline)?
6.2 Observability & Monitoring
| Metric | Target | Ownership | Status |
|---|---|---|---|
| Metrics Collection | 100% of requests | O11Y team infrastructure | IMPLEMENTED |
| Distributed Tracing | Sampled (% TBD) | Jaeger integration | IN PROGRESS |
| Log Aggregation | All platform logs | ELK/Loki stack | IMPLEMENTED |
| Alert Coverage | TBD | Pagerduty / similar | OPEN |
| Dashboards | Real-time platform health | Grafana | IMPLEMENTED |
Measurement & SLI/SLO:
- Must be able to measure all NFR targets
- SLI (Service Level Indicator) = actual measured value
- SLO (Service Level Objective) = target we commit to
Questions for Stakeholders:
- What % of requests should we trace (all vs sampled)?
- What metrics are critical for alerting (P95 latency, error rate, etc.)?
- How long should we retain detailed metrics/logs?
- Should we publish SLO dashboards to consumers?
6.3 Incident Response & Resilience
| Metric | Target | Process | Status |
|---|---|---|---|
| MTTR (Mean Time To Recover) | TBD | Depends on incident type | OPEN |
| Graceful Degradation | Drop low-priority traffic | Shed load before cascade | PRINCIPLE |
| Circuit Breaker | Enabled | Prevent cascading failures | REQUIREMENT |
| Retry Logic | Exponential backoff | Avoid overwhelming backends | REQUIREMENT |
Failure Modes to Address:
- Kong unavailable → Requests fail (no redundancy in single-region)
- One microservice down → Route around it (other replicas available)
- Egress gateway down → Backend calls fail (need circuit breaker)
- Backend slow → Don’t overwhelm with retries (backoff + timeout)
- Cert expiration → Services fail (need automated renewal)
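The "Retry Logic" requirement above is commonly implemented as capped exponential backoff with full jitter, so retries from many clients do not arrive in synchronized waves; a sketch with illustrative base/cap values, not agreed targets:

```python
import random


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0):
    """Capped exponential backoff with full jitter."""
    delays = []
    for attempt in range(attempts):
        # Ceiling doubles each attempt (0.1 s, 0.2 s, 0.4 s, ...) up to the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))  # full jitter spreads retries out
    return delays


# Every delay stays within its attempt's ceiling and the 5 s cap.
for attempt, delay in enumerate(backoff_delays(8)):
    assert 0 <= delay <= min(5.0, 0.1 * 2 ** attempt)
```

A circuit breaker would sit in front of this: after N consecutive failures, stop retrying entirely for a cooldown period.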
Questions for Stakeholders:
- Should Kong have redundancy on Kong node group?
- What’s acceptable failure domain (namespace, node, AZ)?
- How should we handle partial backend failures?
- What’s the circuit breaker timeout policy?
6.4 Change Management
| Process | Target | Requirement | Status |
|---|---|---|---|
| Configuration Changes | GitOps tracked | All changes via Git | REQUIREMENT |
| Code Review | All PRs reviewed | Two-approval minimum | POLICY |
| Rollback Capability | 1-click or git revert | Rapid rollback on issues | REQUIREMENT |
| Change Log | Automated from Git | Audit trail of all changes | REQUIREMENT |
Questions for Stakeholders:
- Who has authority to approve changes? (architecture, security, etc.)
- What’s the change window policy? (business hours only, etc.)
- Should we gate changes based on error budget?
7. Scalability & Growth NFRs
7.1 Horizontal Scalability
| Dimension | Target | Current | Status |
|---|---|---|---|
| Request Throughput | 20,000 req/s burst | Unknown baseline | TARGET |
| API Catalog Size | TBD APIs | ~100s of APIs | OPEN |
| Concurrent Consumers | TBD teams | 20+ producing teams | OPEN |
| Data Volume | TBD | Logs, metrics, configs | OPEN |
Scalability Model:
- Stateless services can scale horizontally (add more pods)
- Kong, microservices, egress gateways all scalable
- Database-backed/stateful components (e.g., Keycloak) may be the limiting factor
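A back-of-envelope sizing model for the stateless tiers (per-pod capacity and headroom here are assumptions that load testing would need to validate):

```python
import math


def pods_needed(target_rps: float, per_pod_rps: float, headroom: float = 0.7) -> int:
    """Replicas required to serve target_rps while keeping each pod at `headroom` utilization."""
    return math.ceil(target_rps / (per_pod_rps * headroom))


# Hypothetical: 20k req/s burst, 500 req/s per pod, pods run at 70% utilization.
assert pods_needed(20_000, 500) == 58
```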
Questions for Stakeholders:
- What’s the expected growth rate (APIs/month, teams/month)?
- Should catalog size affect platform performance?
- Are there bottlenecks we haven’t identified?
- What’s max acceptable API catalog size?
8. Assumptions
8.1 Architecture Assumptions
- Single Region Adequate: Single AWS region acceptable without multi-region HA
- Node Group Isolation: Kong and microservices on separate node groups sufficient for isolation
- Kong Redundancy: Kong node group has built-in HA (multiple pods)
- Stateless Services: Platform doesn’t need to persist API payloads (confirmed)
- Eventually Consistent: Configuration consistency across cluster in seconds is acceptable
- Keycloak Availability: Keycloak in-cluster outage causes authentication failures (no fallback)
8.2 Operational Assumptions
- 1 Hr/Month Sufficient: 1 hour monthly maintenance window adequate for all upgrades
- AWS Reliability: Assume AWS availability zones don’t all fail simultaneously
- Network Stability: Assume network partitions between nodes are rare/short-lived
- Cert Auto-Renewal: TLS cert expiration automated (no manual renewal)
- Log Retention: 30 days log retention sufficient (O11Y team policy)
- Secrets Rotation: Automated secrets rotation for service accounts
8.3 Data & Security Assumptions
- No Sensitive Data in Logs: Producers responsible for not logging secrets (we enforce filters)
- No Data Persistence: Platform doesn’t store API payloads beyond request processing
- Eventual Consistency OK for Configs: API definition changes propagate within seconds
- API Keys Don’t Expire: Consumer keys valid until manually revoked
- TLS Everywhere: All traffic between components encrypted (at least internally)
8.4 Growth Assumptions
- 20k req/s is Peak: Burst load, not sustained load
- API Growth Manageable: Catalog size won’t cause performance degradation
- Team Scaling Linear: Adding producers/consumers doesn’t require architecture changes
- Cost Scales Linearly: Cost per request remains constant at scale
9. Open Questions & Decisions Needed
9.1 Performance Questions
Q1: What is “normal” sustained throughput?
- Strawman: 5-10k req/s? (Needs validation)
- Impact: Affects baseline resource allocation, auto-scaling thresholds
Q2: Should response time targets vary by API type?
- Real-time APIs (< 100ms)?
- Standard APIs (0.5s)?
- Batch APIs (multiple seconds)?
- Decision needed: Differentiated SLOs?
Q3: Are there high-priority APIs that need stricter SLOs?
- CNI workloads?
- Finance APIs?
- High-volume producer APIs?
- Decision needed: Tiered SLOs?
9.2 Reliability Questions
Q4: What constitutes platform success?
- Does “success” include backend timeouts? (Platform issue or backend issue?)
- How do we differentiate platform errors from consumer errors?
- **Decision needed**: Define “platform error” vs “consumer error”
Q5: Should Kong have redundancy?
- Current: Single Kong node group (single point of failure)
- Option A: Multiple replicas on same node group (resilient to pod failure)
- Option B: Multiple Kong pods spread across multiple nodes
- **Decision needed**: Kong redundancy strategy
Q6: What error rate is acceptable?
- 1% (99% success rate)?
- 0.5% (99.5% success rate)?
- 0.1% (99.9% success rate)?
- Context-dependent?
- **Decision needed**: Error budget
9.3 Security Questions
Q7: What audit trail is required for CNI?
- All requests logged? (High volume)
- Just authentication events?
- Requests that modify resources?
- **Decision needed**: CNI audit scope and retention
Q8: Should we implement rate limiting per consumer?
- Fair-use protection?
- Prevent noisy neighbor?
- Plan: Keycloak/Kong level?
- **Decision needed**: Rate limiting policy
Q9: How should we handle key compromise?
- Immediate revocation?
- Grace period for consumers to rotate?
- Logging of rotated keys?
- **Decision needed**: Key compromise response
9.4 Operational Questions
Q10: Can we achieve zero-planned-downtime?
- Kong: Blue-green deployment?
- Microservices: Rolling restart?
- State coordination needed?
- **Decision needed**: Zero-downtime deployment strategy
Q11: What’s the rollback SLA?
- Automatic rollback on failed deployment? (How long detection?)
- Manual rollback request? (How quickly can ops respond?)
- **Decision needed**: Rollback automation level
Q12: Should we have a 24/7 on-call rotation?
- Only for CNI incidents?
- For all platform incidents?
- Coverage model?
- **Decision needed**: On-call requirements
9.5 Growth & Scale Questions
Q13: When do we revisit these NFRs?
- Annually?
- When reaching 50% of targets?
- When receiving customer complaints?
- **Decision needed**: NFR review cadence
Q14: What’s the multi-region trigger?
- Customer demand?
- Regulatory requirement?
- Cost threshold?
- **Decision needed**: Multi-region criteria
Q15: Should we support other regions proactively?
- Design for multi-region now?
- Single-region design, migrate later?
- **Decision needed**: Future-proofing vs pragmatism
10. Governance & Measurement Framework
10.1 How We’ll Track NFRs
SLI (Service Level Indicators) - What we measure:
SLI = (Successful Requests) / (Total Requests)
SLI = P95 Latency from metrics
SLI = Requests per second from metrics
SLI = Uptime percentage (time not down)
SLO (Service Level Objectives) - What we commit to:
SLO: SLI >= 99.0% success rate
SLO: P95 latency <= 0.5 sec
SLO: Throughput >= 20k req/s (burst)
SLO: Uptime >= "relative to AWS"
Error Budget - How much we can fail:
If SLO = 99%, error budget = 1% ≈ 432 min downtime/month
We can tolerate some failures to meet this budget
When error budget exhausted, freeze all non-critical changes
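The error-budget arithmetic can be sanity-checked with a short script (a 30-day month is assumed):

```python
def downtime_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime per period for a given availability SLO."""
    return period_days * 24 * 60 * (1.0 - slo)


assert round(downtime_budget_minutes(0.99), 1) == 432.0   # 99% -> ~7.2 hours/month
assert round(downtime_budget_minutes(0.999), 1) == 43.2   # 99.9%
assert round(downtime_budget_minutes(0.9995), 1) == 21.6  # 99.95% -> the ~22 min figure
```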
10.2 Measurement & Reporting
Who Measures:
- O11Y team: Collects metrics, logs, traces
- API Management team: Interprets metrics, tracks SLO status
- Core Infra team: Measures availability, reliability metrics
Frequency:
- Real-time dashboards: Grafana (continuous)
- Daily reports: SLO status, error budget burn rate
- Weekly: Team review of metrics vs targets
- Monthly: Full NFR review and reporting
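The daily burn-rate figure mentioned above is the observed error rate divided by the budgeted error rate (1 − SLO); a value above 1.0 means the monthly budget will run out early. A minimal sketch:

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate relative to the budgeted error rate (1 - SLO)."""
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo)


# 2,000 failed out of 1,000,000 requests against a 99.9% success SLO:
assert round(burn_rate(2_000, 1_000_000, 0.999), 3) == 2.0  # budget burning 2x too fast
```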
Stakeholder Communication:
- Producer teams: API-specific latency, error rates
- Consumers: Availability, response time
- Leadership: Overall platform health, error budget status
- Board: Strategic NFR progress, roadmap impact
10.3 Escalation & Response
When SLO is Breached:
- Alert triggered automatically (P95 > 0.6s, success < 99%, etc.)
- On-call engineer investigates
- Incident commander engaged for major breaches
- Root cause documented, change review if needed
Error Budget Exhaustion:
- All non-critical changes frozen
- Focus shifts to stability and cost reduction
- New features on hold until budget recovers
- Stakeholder notification of change freeze
11. Strawman Review Process
11.1 How to Use This Document
For Platform Team:
- Read through and add any missing dimensions
- Identify assumptions you disagree with
- Flag decisions you can make now vs need stakeholder input
For Stakeholders:
- Review assumptions - do they match your understanding?
- Answer open questions - provide your requirements
- Challenge targets that seem unrealistic
- Identify missing NFR dimensions
For Producers & Consumers:
- Review SLOs - do they match expectations?
- Identify stricter requirements for your use case
- Provide feedback on real-world latency/throughput needs
11.2 Feedback Template
When providing feedback, use this format:
**Section**: [e.g., "3.2 Response Time"]
**Topic**: [e.g., "P95 Latency target"]
**Current Strawman**: [Quote the strawman text]
**Feedback**: [Your thoughts]
**Rationale**: [Why this matters]
**Suggested Change**: [Alternative target or approach]
**Question**: [Clarification needed?]
11.3 Approval Process
- Team Review (2 weeks): Platform teams review internally
- Stakeholder Consultation (2-3 weeks): Async feedback via document
- Synthesis Session (2 hours): Team discusses major disagreements
- Revision (1 week): Incorporate agreed changes
- Leadership Sign-off (1 week): Final approval
- Publication (immediate): Publish final NFRs
12. Next Steps
12.1 Immediate (Week 1)
- Distribute this strawman to stakeholders
- Request feedback using template (Section 11.2)
- Feedback deadline: 2 weeks
- Create dedicated Slack channel for questions
12.2 Short Term (Weeks 2-4)
- Compile all feedback
- Identify consensus vs disagreement
- Schedule synthesis session for major disagreements
- Create baseline measurement for current state
12.3 Medium Term (Weeks 4-8)
- Publish revised NFR document (v1.0)
- Create measurement/dashboard for each NFR
- Establish SLO monitoring and alerting
- Begin tracking against targets
13. Reference: Current Known Metrics
Establish Baseline:
Current state (to be measured):
- Average request latency: _____ ms
- P95 latency: _____ ms
- P99 latency: _____ ms
- Current peak throughput: _____ req/s
- Error rate: _____%
- Typical monthly downtime: _____ minutes
- Kong availability: _____% uptime
Appendix A: Glossary
| Term | Definition |
|---|---|
| SLA | Service Level Agreement - contractual commitment to customers |
| SLO | Service Level Objective - internal target for performance |
| SLI | Service Level Indicator - actual measured metric |
| P95/P99 | 95th/99th percentile latency (95%/99% of requests faster than this) |
| Throughput | Requests per second the system can handle |
| MTTR | Mean Time To Recover - average time to fix an incident |
| RTO | Recovery Time Objective - max acceptable downtime after failure |
| RPO | Recovery Point Objective - max acceptable data loss |
| Error Budget | Amount of failure allowed while still meeting SLO |
| Zero-Downtime | Deployment without any user-facing outage |
Appendix B: Industry Benchmarks
For reference, typical targets:
| Metric | Startups | Established | Enterprise | Notes |
|---|---|---|---|---|
| Availability | 99% | 99.5% | 99.9%+ | Higher = more expensive |
| P95 Latency | 1+ sec | 500ms | 100-200ms | Backend dependent |
| Error Rate | 1-2% | 0.5-1% | <0.1% | Better = higher cost |
| Throughput | Variable | 1-10k req/s | 10k+ req/s | Depends on use case |
| Maintenance Window | None | 1-4 hr/month | Unplanned only | Higher cost = less maintenance |
Note: HIP targets put us in the “Established” to “Enterprise” range, which drives infrastructure/operational investment
14. Coverage Analysis: What Else Should We Consider?
Beyond the NFRs documented above, there are additional dimensions worth considering. This section identifies gaps and future considerations.
14.1 Dimensions Fully Covered ✅
The strawman thoroughly addresses:
- Throughput & Scalability: 20k req/s burst, horizontal scaling
- Latency Targets: P95/P99 response time
- Availability Model: Relative to AWS SLA
- Security Basics: Encryption, authentication, CNI-specific security
- Operational: Maintenance windows, observability, incident response
- Reliability Principles: Graceful degradation, circuit breakers
14.2 Dimensions Partially Covered (Open Questions)
Areas with initial thinking but needing stakeholder input:
- Error Rates (Section 4.2): What success rate is acceptable? (1% vs 0.5% vs 0.1%)
- Sustained Load (Section 3.1): What’s “normal” throughput vs “burst”?
- Cost per Request (Section 3.3): Should we optimize for cost?
- Tiered SLOs (Section 9.1, Q2): Different targets for different API types?
- On-Call Model (Section 9.4, Q12): 24/7 coverage or business hours?
- Rate Limiting (Section 9.3, Q8): Per-consumer limits? Per-endpoint?
14.3 Dimensions Not Covered - Consider for v1.1 or Future ⚠️
These topics may warrant addition to NFR document after stakeholder feedback:
Deployment & Release Management
- How frequently can we deploy? (daily, weekly, on-demand?)
- How fast should deployments complete? (5 min, 30 min?)
- Canary/progressive rollout strategy?
- Rollback time SLO?
- When to discuss: If producers need faster deployment cycles
Consumer/External SLAs
- What do we promise externally to API consumers?
- Are external SLAs different from internal SLOs?
- Any contractual SLA commitments?
- When to discuss: If selling platform as service or have customers
Capacity Planning Model
- How much headroom do we need? (70% utilization? 50%?)
- Can we handle 2x or 10x normal load?
- Forecasting horizon for capacity?
- When to discuss: If rapid growth expected
API Lifecycle Management
- How long do we support old API versions?
- Backward compatibility policy?
- Deprecation timeline?
- When to discuss: As API catalog matures
Data Retention & Privacy
- Log retention duration? (30 days, 90 days, 1 year?)
- GDPR/PII handling?
- Audit log immutability?
- When to discuss: If compliance requirements identified
Multi-Tenancy Isolation Depth
- Noisy neighbor protection (rate limit per producer)?
- Resource quotas per API?
- Cost attribution?
- Failure blast radius control?
- When to discuss: As teams scale and resource contention increases
Async/Event-Driven APIs
- Do we support pub/sub or event patterns?
- Message ordering requirements?
- Deduplication guarantees?
- When to discuss: If producers request async capabilities
Schema Management & Evolution
- How do we manage API schema changes?
- Breaking change detection?
- Multiple schema version support?
- When to discuss: As API catalog grows and versioning becomes critical
Disaster Recovery & Business Continuity
- RTO (Recovery Time Objective)?
- RPO (Recovery Point Objective)?
- Backup strategy and testing?
- Multi-region requirements?
- When to discuss: If high-availability becomes critical
Performance Testing & Validation
- How do we validate the 20k req/s target? (load testing, soak testing?)
- Regression testing for performance?
- Baseline measurement of current state?
- When to discuss: Before claiming compliance with targets
Cost Model & Chargeback
- Fixed vs variable cost model?
- Chargeback to teams?
- Cost transparency?
- Cost forecasting?
- When to discuss: If financial accountability required
Compliance & Audit Frameworks
- Which compliance frameworks apply? (SOC2, ISO 27001, HIPAA, PCI-DSS?)
- Audit trail immutability?
- Compliance reporting automation?
- When to discuss: If regulatory requirements identified
Customer Support SLA
- How quickly do we respond to issues?
- Support hours (24/7 or business hours)?
- Support channels?
- Self-service troubleshooting guides?
- When to discuss: If external customers to support
14.4 Assessment: Should Any Move to v1.0?
Ask stakeholders:
“Looking at this list of not-yet-covered dimensions, which ones are critical for v1.0? Which can wait until v1.1?”
Likely candidates for v1.0 if raised:
- Deployment Frequency - If producers need rapid iteration
- Canary/Progressive Rollout - If safety is critical concern
- Rate Limiting - If noisy neighbor issues expected
- Cost Model - If chargebacks/accountability required
- Async API Support - If requested by producers
Likely candidates to defer to v1.1 or future:
- API lifecycle/deprecation policy
- Data retention specifics
- Schema management
- Disaster recovery
- Compliance frameworks (can start with documentation)
Document Metadata
Version History:
- v0.1 (2026-02-10): Initial strawman - seeking feedback
Authors: API Management Team, Core Infra Team
Stakeholders for Review:
- Platform teams (O11Y, DevEx, Core Infra, API Mgmt)
- Producer team representatives (2-3 teams)
- Security/Compliance team
- Finance/Operations team
Contact: [Slack channel to be created]
Next Review Date: 2026-04-01 (after stakeholder feedback incorporated)
END OF STRAWMAN
How to Proceed
This strawman is intentionally incomplete and designed to be discussed.
Next Actions:
- Share with your stakeholder group
- Use the feedback template to gather input
- Schedule a synthesis discussion for week 2-3
- Iterate toward consensus
- Publish v1.0 once approved