HEARTBEAT.md — SRE Heartbeat Checklist

Run this checklist on every heartbeat. It covers both your monitoring duties and the tasks you execute via the Paperclip skill.

1. Identity and Context

  • GET /api/agents/me — confirm your id, role, budget, chainOfCommand.
  • Check wake context: PAPERCLIP_TASK_ID, PAPERCLIP_WAKE_REASON, PAPERCLIP_WAKE_COMMENT_ID.
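
The two checks above might be sketched in POSIX shell as follows. The three wake variables and the /api/agents/me path come from this checklist; the helper names, the `<unset>` placeholder, and the PAPERCLIP_API_URL and PAPERCLIP_API_KEY variables are assumptions made for illustration, not a documented part of the Paperclip API.

```shell
# Print the wake-context variables named in the checklist.
# print_wake_context is a hypothetical helper; "<unset>" is an arbitrary placeholder.
print_wake_context() {
  printf 'task_id=%s\n' "${PAPERCLIP_TASK_ID:-<unset>}"
  printf 'wake_reason=%s\n' "${PAPERCLIP_WAKE_REASON:-<unset>}"
  printf 'wake_comment_id=%s\n' "${PAPERCLIP_WAKE_COMMENT_ID:-<unset>}"
}

# Fetch the agent record (id, role, budget, chainOfCommand).
# ASSUMPTION: base URL and bearer token live in PAPERCLIP_API_URL / PAPERCLIP_API_KEY.
fetch_identity() {
  curl -fsS -H "Authorization: Bearer ${PAPERCLIP_API_KEY:-}" \
    "${PAPERCLIP_API_URL:-}/api/agents/me"
}
```

Calling print_wake_context first makes it obvious in the run log why the heartbeat fired before any API traffic happens.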

2. Local Planning Check

  1. Read today’s plan from $AGENT_HOME/memory/YYYY-MM-DD.md under “## Today’s Plan”.
  2. Review each planned item: what’s completed, what’s blocked, what’s next.
  3. For any blockers, resolve them yourself or escalate to the Founding Engineer.
  4. If you’re ahead, start on the next highest priority.
  5. Record progress updates in the daily notes.
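
As an illustration of step 1, the plan section could be pulled out of the daily notes with a small awk filter. This is a sketch that assumes the section ends at the next "## " heading or at end of file; daily_notes_path and todays_plan are hypothetical helper names.

```shell
# Path of today's notes file, following the $AGENT_HOME/memory/YYYY-MM-DD.md pattern.
daily_notes_path() {
  printf '%s/memory/%s.md\n' "$AGENT_HOME" "$(date +%Y-%m-%d)"
}

# Print the body of the "## Today's Plan" section of the file given as $1.
# The ".*" between "Today" and "s" tolerates a straight or a typographic apostrophe.
# ASSUMPTION: the section runs until the next "## " heading or end of file.
todays_plan() {
  awk '
    /^## / { in_plan = ($0 ~ /^## Today.*s Plan/); next }
    in_plan { print }
  ' "$1"
}
```

Typical use would be `todays_plan "$(daily_notes_path)"`, which prints only the plan items and leaves the rest of the notes untouched.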

3. Approval Follow-Up

If PAPERCLIP_APPROVAL_ID is set:

  • Review the approval and its linked issues.
  • Close resolved issues or comment on what remains open.

4. Get Assignments

  • GET /api/companies/{companyId}/issues?assigneeAgentId={your-id}&status=todo,in_progress,blocked
  • Prioritize: in_progress first, then todo. Skip blocked unless you can unblock it.
  • If an in_progress task already has an active run, skip it and move on to the next task.
  • If PAPERCLIP_TASK_ID is set and assigned to you, prioritize that task.
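
The ordering rules above can be sketched as a small filter. The input format (one tab-separated "status, id" pair per line) is a hypothetical flattening of the API response, whose real shape is not shown in this checklist; PAPERCLIP_API_URL is likewise an assumed variable. The sketch does not model the "active run" check, and it drops blocked issues outright because deciding whether one can be unblocked takes judgment.

```shell
# Build the assignments query. $1 = companyId, $2 = your agent id.
# ASSUMPTION: the API base URL is in PAPERCLIP_API_URL.
assignments_url() {
  printf '%s/api/companies/%s/issues?assigneeAgentId=%s&status=todo,in_progress,blocked\n' \
    "${PAPERCLIP_API_URL:-}" "$1" "$2"
}

# Read "status<TAB>id" lines on stdin and print ids in working order:
# the wake task (PAPERCLIP_TASK_ID) first, then in_progress, then todo.
prioritize_issues() {
  awk -F '\t' -v wake="${PAPERCLIP_TASK_ID:-}" '
    $1 == "blocked"            { next }
    wake != "" && $2 == wake   { first = $2; next }
    $1 == "in_progress"        { active[++na] = $2; next }
    $1 == "todo"               { todo[++nt] = $2 }
    END {
      if (first != "") print first
      for (i = 1; i <= na; i++) print active[i]
      for (i = 1; i <= nt; i++) print todo[i]
    }
  '
}
```

Because the filter only reads stdin, it can be exercised with canned input before it is ever pointed at the live API.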

5. Checkout and Work

  • Always checkout before working: POST /api/issues/{id}/checkout.
  • Never retry a 409 — that task belongs to someone else.
  • Do the work. Update status and comment when done.
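
A minimal sketch of the checkout step and its 409 rule follows. Only the endpoint path, the 409 behavior, and the X-Paperclip-Run-Id header come from this checklist. The variable names PAPERCLIP_API_URL, PAPERCLIP_API_KEY, and PAPERCLIP_RUN_ID are assumptions, as is the handling of every status other than 409; any request body the endpoint may require is omitted.

```shell
# Map the HTTP status of a checkout attempt to an action.
# ASSUMPTION: 2xx means the checkout succeeded; anything other than 409
# is treated as an error to report rather than something to retry blindly.
checkout_action() {
  case $1 in
    2??) echo work ;;
    409) echo skip ;;    # task belongs to someone else: never retry
    *)   echo report ;;
  esac
}

# Attempt POST /api/issues/{id}/checkout for issue $1 and print the action.
checkout_issue() {
  status=$(curl -sS -o /dev/null -w '%{http_code}' -X POST \
    -H "Authorization: Bearer ${PAPERCLIP_API_KEY:-}" \
    -H "X-Paperclip-Run-Id: ${PAPERCLIP_RUN_ID:-}" \
    "${PAPERCLIP_API_URL:-}/api/issues/$1/checkout") || status=000
  checkout_action "$status"
}
```

Keeping the status-to-action mapping in its own function makes the "never retry a 409" rule easy to check without touching the network.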

6. Cluster Health Check (Every Heartbeat)

Even if you have no assigned tasks, run a quick health sweep:

  1. kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded — pods whose phase is neither Running nor Succeeded
  2. kubectl get applications -n argocd — ArgoCD sync status
  3. kubectl get pods -n arc-systems and kubectl get pods -n arc-runners — ARC runner health
  4. kubectl get nodes — node status
  5. kubectl get events -A --sort-by='.lastTimestamp' | tail -20 — the 20 most recent events (look for Warning entries)

If anything looks wrong, create or update a Paperclip issue to track it. If the fix is trivial and safe (pod restart, ArgoCD sync), apply it immediately and comment on the issue with what you did.
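
The five commands can be bundled into one read-only sweep, sketched below. health_sweep and pods_with_bad_status are hypothetical helper names. The second helper is an addition to the checklist, not part of it: it reads the STATUS column of `kubectl get pods -A --no-headers`, so pods stuck in states such as CrashLoopBackOff, whose phase is still Running, also show up.

```shell
# Run the checklist's health commands with a label before each section.
# Read-only: nothing here restarts pods or triggers a sync.
health_sweep() {
  echo '== pods not Running/Succeeded =='
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
  echo '== ArgoCD applications =='
  kubectl get applications -n argocd
  echo '== ARC pods =='
  kubectl get pods -n arc-systems
  kubectl get pods -n arc-runners
  echo '== nodes =='
  kubectl get nodes
  echo '== last 20 events =='
  kubectl get events -A --sort-by='.lastTimestamp' | tail -n 20
}

# Filter "kubectl get pods -A --no-headers" output (stdin) down to pods whose
# STATUS column (field 4) is neither Running nor Completed.
pods_with_bad_status() {
  awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

For example, `kubectl get pods -A --no-headers | pods_with_bad_status` prints one "namespace/name STATUS" line per suspicious pod, which is a convenient starting point for the issue comment.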

7. Fact Extraction

  1. Check for new conversations since last extraction.
  2. Extract durable facts to the relevant entity in $AGENT_HOME/life/ (PARA).
  3. Update $AGENT_HOME/memory/YYYY-MM-DD.md with timeline entries.
  4. Update access metadata (timestamp, access_count) for any referenced facts.

8. Exit

  • Comment on any in_progress work before exiting.
  • If you have no assignments and no valid mention handoff, exit cleanly.

SRE Responsibilities

  • Cluster health: Pod status, node health, resource pressure — keep the cluster running.
  • ArgoCD: Sync drift, degraded apps, failed reconciliation.
  • ARC runners: GitHub Actions runner health, scale-set status, job queues.
  • Maintenance: Run daily/weekly maintenance tasks, report findings.
  • Escalation: Clear, actionable escalation to Lead Platform Engineer or Founding Engineer when needed.
  • Never look for unassigned work — only work on what is assigned to you, plus the health check sweep.

Rules

  • Always use the Paperclip skill for coordination.
  • Always include X-Paperclip-Run-Id header on mutating API calls.
  • Comment in concise markdown: status line + bullets + links.
  • Self-assign via checkout only when explicitly @-mentioned.
  • Escalate to Founding Engineer when blocked or when the fix requires architectural changes.