ADR-0001: Five cloud-domain agents own placement, cost, and security across the MGH stack
- Status: Accepted
- Date: 2026-05-17
- Deciders: cloud-architect, marnissi.investments
Context
The MGH workspace already has stack-shaped agents (infra-agent, backend-agent, frontend-agent, website-agent, devops-agent, docs-agent, labs-agent) that own paths. As the cloud estate grows — Cloudflare at the edge, OCI for compute and data, Google Workspace for identity, GitHub for CI — purely path-shaped ownership stops being enough. Several questions cut across infra/, devops/, and every app repo:
- Where should new capability X live? CF Worker, OCI container, Autonomous DB, SaaS?
- What does this change cost? Does it consume free-tier headroom we needed for something else?
- Is this safe? Right-scoped IAM, no exposed ports, mail-auth still aligned, no leaked secrets?
- Is this OCI design correct? Is this Cloudflare design correct?
infra-agent alone cannot carry all four dimensions credibly while also writing HCL/YAML. Without explicit specialists, those questions get answered by whoever happens to be looking — inconsistently, and usually after the fact.
The stack is also explicitly first-iteration: today it's CF Tunnel → ARM A1 with Docker Postgres/Redis; tomorrow some of that moves to OCI Autonomous DB or CF Pages. Decisions about what moves and why need a durable record, not Slack scrollback.
Decision
Add five workspace-level cloud-domain agents to .claude/agents/. Three are read-only (two advisors and one auditor); one (cloud-architect) writes ADRs; one (finops-agent) writes the cost ledger. None edit source code: infra-agent and the app-repo agents stay the only authors of HCL, YAML, and application code.
| Agent | Role | Writable surface |
|---|---|---|
| oci-expert | Deep OCI specialist: services, free-tier limits, IAM, region/AD, ARM A1 capacity, OCIR, Email Delivery, Object Storage S3-compat backend, Autonomous DB | None (read-only advisor) |
| cloudflare-expert | Deep Cloudflare specialist: Zones, DNS, Tunnel, Access (Zero Trust), Email Routing, WAF, Workers/Pages | None (read-only advisor) |
| cloud-architect | Solution architecture across CF + OCI + GH + Google Workspace. Decides placement, draws call graphs, writes ADRs | docs/docs/adr/ |
| finops-agent | Cost guard. Estimates monthly spend against free-tier limits, blocks apply when cost is unjustified | infra/finops/ |
| secops-agent | Security review: IAM, NSG, CF Access, WAF, DNS hardening, secret hygiene, rotation cadence | None (read-only auditor) |
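The writable-surface column lends itself to a mechanical check. The following is a minimal sketch, not an existing MGH tool: it assumes some pre-merge step already knows which agent produced a change and which paths it touched. The `WRITABLE` mapping mirrors the table; the function name `violations` is hypothetical.

```python
from pathlib import PurePosixPath

# Writable surfaces copied from the table above; an empty tuple means read-only.
WRITABLE = {
    "oci-expert": (),
    "cloudflare-expert": (),
    "cloud-architect": ("docs/docs/adr/",),
    "finops-agent": ("infra/finops/",),
    "secops-agent": (),
}


def violations(agent: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall outside the agent's writable surface.

    Agents not listed in WRITABLE are treated as read-only (an assumption).
    """
    allowed = [PurePosixPath(root) for root in WRITABLE.get(agent, ())]
    bad = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # PurePosixPath does not collapse "..", so reject it explicitly
        # instead of letting a traversal slip through the prefix check.
        escapes = ".." in path.parts
        inside = any(path.is_relative_to(root) for root in allowed)
        if escapes or not inside:
            bad.append(raw)
    return bad
```

An empty result means the change stays within the agent's surface; any returned path is a reason to reject the change or hand it to infra-agent or an app-repo agent.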
The ADR gate becomes load-bearing: infra-agent does not write a Tofu module for a new long-lived service until an Accepted ADR from cloud-architect exists. This prevents drift between "what the stack actually is" and "what we said it would be."
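One way the gate could be checked before infra-agent starts a module is sketched below. It rests on two assumptions that this ADR does not state: every ADR is a markdown file whose first line is its title and which carries a `- Status: ...` line like the header of this one, and the service name appears in that title. The helper names `adr_status` and `has_accepted_adr` are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a header line such as "- Status: Accepted" (format assumed from this ADR).
STATUS_RE = re.compile(r"^\s*-\s*Status:\s*(\S+)", re.MULTILINE)


def adr_status(text: str) -> Optional[str]:
    """Return the first Status value found in an ADR, or None if absent."""
    match = STATUS_RE.search(text)
    return match.group(1) if match else None


def has_accepted_adr(adr_dir: Path, service: str) -> bool:
    """True if some ADR in adr_dir names the service in its title and is Accepted."""
    for adr in sorted(adr_dir.glob("*.md")):
        text = adr.read_text(encoding="utf-8")
        title = text.splitlines()[0] if text else ""
        if service.lower() in title.lower() and adr_status(text) == "Accepted":
            return True
    return False
```

Matching on the title is deliberately crude. A real gate would more likely key on an explicit service identifier in the ADR, which this sketch does not assume exists.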
Consequences
- Cost (delta vs free tier): $0. All five agents are advisory and run inside Claude Code; no SaaS, no new infra.
- Operational surface: Five new agent files to maintain; one new file (infra/finops/ledger.md) under FinOps ownership; ADRs now mandatory for new services. Operator burden: read the verdicts.
- Security posture: Strictly improved. secops-agent is a mandatory reviewer on every infra PR; currently security review is opportunistic. IAM scope, NSG rules, mail-auth alignment, and rotation cadence get checked every time.
- Decision quality: Architecturally durable. Every new service has an ADR with alternatives, cost delta, and reversibility, instead of "we decided this in chat last Tuesday."
- Cost discipline: Free-tier breaches caught before apply, not at billing time.
- Risk: Agent sprawl. We address this with a hard cap (see "When to add a sixth agent"): a sixth agent requires its own ADR justifying the gap the five don't cover.
- Migration path if we revisit: Agents are markdown files. Collapsing or splitting later is trivial — there's no infrastructure or external dependency to undo.
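The "caught before apply" consequence amounts to a headroom comparison. A minimal sketch follows, assuming finops-agent can obtain per-resource free-tier limits, current usage, and the delta a proposed change would add. The resource names and numbers in the usage example are illustrative placeholders, not real provider limits, and `free_tier_breaches` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breach:
    resource: str
    limit: float
    projected: float


def free_tier_breaches(
    limits: dict[str, float],
    current: dict[str, float],
    delta: dict[str, float],
) -> list[Breach]:
    """Return every resource whose projected usage would exceed its free-tier limit.

    A resource with no known limit is treated as having a limit of 0,
    so any use of it is reported (an assumption of this sketch).
    """
    breaches = []
    for resource, extra in delta.items():
        limit = limits.get(resource, 0.0)
        projected = current.get(resource, 0.0) + extra
        if projected > limit:
            breaches.append(Breach(resource, limit, projected))
    return breaches


# Placeholder figures for illustration only.
limits = {"cpu_units": 10.0, "storage_gb": 100.0}
current = {"cpu_units": 6.0, "storage_gb": 40.0}
delta = {"cpu_units": 3.0, "storage_gb": 70.0}
breaches = free_tier_breaches(limits, current, delta)
```

With these placeholder inputs only `storage_gb` is reported, since 40 + 70 exceeds 100 while 6 + 3 stays under 10. A non-empty result is what would block the apply until the cost is justified in an ADR.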
Alternatives considered
| Option | Why rejected |
|---|---|
| Fold all cloud knowledge into infra-agent | Already the largest agent. Conflating "advise on architecture" with "write the HCL" loses the separation that catches mistakes. |
| Per-service agents (DNS, network, mail, registry, …) | Sprawl. Each new service would warrant a new agent; routing decisions get complicated. Five domain-shaped agents fit the actual decision boundaries. |
| Use plugins instead of workspace agents | Plugins are global to the user, not workspace-specific. These agents encode MGH-specific conventions (free-tier strategy, single-domain ZT, deny-list overlap) — they belong in the workspace. |
| Skip finops-agent (run cost checks manually) | Manual cost review is the failure mode this avoids. Free-tier breaches are silent until the bill arrives. |
| Skip secops-agent (rely on cavecrew-reviewer) | cavecrew-reviewer covers correctness and style. Security review requires a specific checklist and posture knowledge, which is a different skill. |
| Add network-agent, data-agent, observability-agent from day one | Premature. Networking is folded into cloudflare-expert + oci-expert; there's no telemetry surface to observe yet; data lifecycle is one Postgres role. Each can be split out later with its own ADR if the load justifies it. |
When to add a sixth agent
Open a new ADR (0002-…) that demonstrates:
- A specific domain none of the five existing agents cover well.
- A pattern of decisions in that domain that recur often enough to merit a specialist.
- Why folding into an existing agent would make that agent unwieldy.
Examples of plausible future agents (none justified today):
- network-agent: when VCN topology spans multiple peered networks or load-balancing decisions become non-trivial.
- observability-agent: when the telemetry surface exceeds one node-exporter and one log-stream.
- data-agent: when persistence spans Postgres + Redis + Autonomous DB + Object Storage with non-trivial migration / restore drills.
- ml-agent: if/when MGH ships AI features that need ops-side support beyond pydantic-ai.
Until then: five.