ADR-0001: Five cloud-domain agents own placement, cost, and security across the MGH stack

Status: Accepted
Date: 2026-05-17
Deciders: cloud-architect, marnissi.investments

Context

The MGH workspace already has stack-shaped agents (infra-agent, backend-agent, frontend-agent, website-agent, devops-agent, docs-agent, labs-agent) that own paths. As the cloud estate grows — Cloudflare at the edge, OCI for compute and data, Google Workspace for identity, GitHub for CI — purely path-shaped ownership stops being enough. Several questions cut across infra/, devops/, and every app repo:

Where should new capability X live? CF Worker, OCI container, Autonomous DB, SaaS?
What does this change cost? Does it consume free-tier headroom we needed for something else?
Is this safe? Right-scoped IAM, no exposed ports, mail-auth still aligned, no leaked secrets?
Is this OCI design correct? Is this Cloudflare design correct?

infra-agent alone cannot carry all four dimensions credibly while also writing HCL/YAML. Without explicit specialists, those questions get answered by whoever happens to be looking — inconsistently, and usually after the fact.

The stack is also explicitly first-iteration: today it's CF Tunnel → ARM A1 with Docker Postgres/Redis; tomorrow some of that moves to OCI Autonomous DB or CF Pages. Decisions about what moves and why need a durable record, not Slack scrollback.

Decision

Add five workspace-level cloud-domain agents to .claude/agents/. Three are read-only (two advisors and one auditor); one (cloud-architect) writes ADRs; one (finops-agent) writes the cost ledger. None edit source code: infra-agent and the app-repo agents stay the only authors of HCL, YAML, and application code.

| Agent | Role | Writable surface |
| --- | --- | --- |
| `oci-expert` | Deep OCI specialist: services, free-tier limits, IAM, region/AD, ARM A1 capacity, OCIR, Email Delivery, Object Storage S3-compat backend, Autonomous DB | None (read-only advisor) |
| `cloudflare-expert` | Deep Cloudflare specialist: Zones, DNS, Tunnel, Access (Zero Trust), Email Routing, WAF, Workers/Pages | None (read-only advisor) |
| `cloud-architect` | Solution architecture across CF + OCI + GH + Google Workspace. Decides placement, draws call graphs, writes ADRs | `docs/docs/adr/` |
| `finops-agent` | Cost guard. Estimates monthly spend against free-tier limits, blocks `apply` when cost is unjustified | `infra/finops/` |
| `secops-agent` | Security review: IAM, NSG, CF Access, WAF, DNS hardening, secret hygiene, rotation cadence | None (read-only auditor) |
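
The writable surfaces in the table can be read as a small path policy. The following is a minimal sketch of such a check, not part of the agent definitions themselves: the function name `may_write` and the example file `0002-example.md` are hypothetical, and the sketch assumes all paths are given relative to the workspace root.

```python
from pathlib import PurePosixPath

# Writable surfaces as listed in the table; an empty tuple means read-only.
WRITABLE_SURFACES = {
    "oci-expert": (),
    "cloudflare-expert": (),
    "cloud-architect": ("docs/docs/adr",),
    "finops-agent": ("infra/finops",),
    "secops-agent": (),
}


def may_write(agent: str, path: str) -> bool:
    """Return True if `path` is a file strictly inside one of the agent's surfaces.

    Hypothetical helper: assumes `path` is relative to the workspace root.
    Unknown agents are treated as read-only.
    """
    target = PurePosixPath(path)
    # Reject absolute paths and parent-directory escapes outright.
    if target.is_absolute() or ".." in target.parts:
        return False
    for surface in WRITABLE_SURFACES.get(agent, ()):
        root = PurePosixPath(surface).parts
        # Compare whole path components so "docs/docs/adr-old" does not match.
        if len(target.parts) > len(root) and target.parts[: len(root)] == root:
            return True
    return False
```

Under these assumptions, `may_write("finops-agent", "infra/finops/ledger.md")` is allowed, while `may_write("secops-agent", "infra/finops/ledger.md")` and `may_write("cloud-architect", "infra/main.tf")` are not.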

The ADR gate becomes load-bearing: infra-agent does not write a Tofu module for a new long-lived service until an Accepted ADR from cloud-architect exists. This prevents drift between "what the stack actually is" and "what we said it would be."
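
One way the gate could be checked mechanically is sketched below. This is an illustration under stated assumptions, not the mechanism this ADR prescribes: it assumes ADR files live in one directory, are named `<number>-<slug>.md`, and carry a `Status:` line like the one at the top of this record. The names `adr_status` and `gate_open` are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a header line such as "Status: Accepted" anywhere in the file.
STATUS_LINE = re.compile(r"^Status:\s*(\w+)", re.IGNORECASE | re.MULTILINE)


def adr_status(text: str) -> Optional[str]:
    """Return the lower-cased status word of an ADR, or None if no Status line exists."""
    match = STATUS_LINE.search(text)
    return match.group(1).lower() if match else None


def gate_open(adr_dir: Path, adr_number: str) -> bool:
    """Return True only if an ADR with this number exists and is Accepted.

    Assumption: files are named "<number>-<slug>.md" inside `adr_dir`.
    """
    for candidate in sorted(Path(adr_dir).glob(f"{adr_number}-*.md")):
        if adr_status(candidate.read_text(encoding="utf-8")) == "accepted":
            return True
    return False
```

A caller would run something like `gate_open(Path("docs/docs/adr"), "0002")` before any module work starts; a missing file or any status other than Accepted keeps the gate closed.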

Consequences

Cost (delta vs free tier): $0. All five agents are advisory and run inside Claude Code; no SaaS, no new infra.
Operational surface: Five new agent files to maintain; one new file (infra/finops/ledger.md) under FinOps ownership; ADRs now mandatory for new services. Operator burden: read the verdicts.
Security posture: Strictly improved. secops-agent is a mandatory reviewer on every infra PR — currently security review is opportunistic. IAM scope, NSG rules, mail-auth alignment, and rotation cadence get checked every time.
Decision quality: Architecturally durable. Every new service has an ADR with alternatives, cost delta, and reversibility — instead of "we decided this in chat last Tuesday."
Cost discipline: Free-tier breaches caught before apply, not at billing time.
Risk: Agent sprawl. We address this with a hard cap (see "When to add a sixth agent"): a sixth agent requires its own ADR justifying the gap the five don't cover.
Migration path if we revisit: Agents are markdown files. Collapsing or splitting later is trivial — there's no infrastructure or external dependency to undo.
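
The "Cost discipline" consequence amounts to comparing projected usage against free-tier limits before an apply. A minimal sketch of that arithmetic follows. The resource names and every number in the usage note are placeholders chosen for illustration, not actual OCI or Cloudflare limits, and `check_headroom` and `verdict` are hypothetical names rather than the finops-agent's real procedure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Headroom:
    resource: str
    limit: float
    used_after: float  # current usage plus the proposed change

    @property
    def remaining(self) -> float:
        return self.limit - self.used_after

    @property
    def breached(self) -> bool:
        # Landing exactly on the limit is still within the free tier.
        return self.used_after > self.limit


def check_headroom(
    limits: Dict[str, float],
    current: Dict[str, float],
    delta: Dict[str, float],
) -> List[Headroom]:
    """Project usage after a change and compare it with each recorded limit."""
    unknown = (set(current) | set(delta)) - set(limits)
    if unknown:
        # A resource with no recorded limit cannot be judged; fail loudly.
        raise ValueError(f"no free-tier limit recorded for: {sorted(unknown)}")
    return [
        Headroom(name, limit, current.get(name, 0.0) + delta.get(name, 0.0))
        for name, limit in sorted(limits.items())
    ]


def verdict(report: List[Headroom]) -> str:
    """Summarise the report as a single line suitable for a review comment."""
    breaches = [h.resource for h in report if h.breached]
    if breaches:
        return "BLOCK: exceeds free tier for " + ", ".join(breaches)
    return "OK"
```

With placeholder limits `{"compute_units": 10.0, "storage_gb": 100.0}`, current usage `{"compute_units": 6.0, "storage_gb": 40.0}`, and a proposed change `{"storage_gb": 70.0}`, the projected storage is 110 against a limit of 100, so `verdict` returns a BLOCK line naming `storage_gb`.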

Alternatives considered

| Option | Why rejected |
| --- | --- |
| Fold all cloud knowledge into `infra-agent` | Already the largest agent. Conflating "advise on architecture" with "write the HCL" loses the separation that catches mistakes. |
| Per-service agents (DNS, network, mail, registry, …) | Sprawl. Each new service would warrant a new agent; routing decisions get complicated. Five domain-shaped agents fit the actual decision boundaries. |
| Use plugins instead of workspace agents | Plugins are global to the user, not workspace-specific. These agents encode MGH-specific conventions (free-tier strategy, single-domain ZT, deny-list overlap), so they belong in the workspace. |
| Skip `finops-agent` (run cost checks manually) | Manual cost review is the failure mode this avoids. Free-tier breaches are silent until the bill arrives. |
| Skip `secops-agent` (rely on `cavecrew-reviewer`) | `cavecrew-reviewer` is correctness/style. Security review requires a specific checklist and posture knowledge, which is a different skill. |
| Add `network-agent`, `data-agent`, `observability-agent` from day one | Premature. Networking is folded into `cloudflare-expert` + `oci-expert`; there's no telemetry surface to observe yet; data lifecycle is one Postgres role. Each can be split out later with its own ADR if the load justifies. |

When to add a sixth agent

Open a new ADR (0002-…) that demonstrates:

A specific domain none of the five existing agents cover well.
A pattern of decisions in that domain that recur often enough to merit a specialist.
Why folding into an existing agent would make that agent unwieldy.

Examples of plausible future agents (none justified today):

network-agent — when VCN topology spans multiple peered networks or load-balancing decisions become non-trivial.
observability-agent — when telemetry surface exceeds one node-exporter and one log-stream.
data-agent — when persistence spans Postgres + Redis + Autonomous DB + Object Storage with non-trivial migration / restore drills.
ml-agent — if/when MGH ships AI features that need ops-side support beyond pydantic-ai.

Until then: five.