ADR-0001: Five cloud-domain agents own placement, cost, and security across the MGH stack

  • Status: Accepted
  • Date: 2026-05-17
  • Deciders: cloud-architect, marnissi.investments

Context

The MGH workspace already has stack-shaped agents (infra-agent, backend-agent, frontend-agent, website-agent, devops-agent, docs-agent, labs-agent) that own paths. As the cloud estate grows — Cloudflare at the edge, OCI for compute and data, Google Workspace for identity, GitHub for CI — purely path-shaped ownership stops being enough. Several questions cut across infra/, devops/, and every app repo:

  • Where should new capability X live? CF Worker, OCI container, Autonomous DB, SaaS?
  • What does this change cost? Does it consume free-tier headroom we needed for something else?
  • Is this safe? Right-scoped IAM, no exposed ports, mail-auth still aligned, no leaked secrets?
  • Is this OCI design correct? Is this Cloudflare design correct?

infra-agent alone cannot carry all four dimensions credibly while also writing HCL/YAML. Without explicit specialists, those questions get answered by whoever happens to be looking — inconsistently, and usually after the fact.

The stack is also explicitly first-iteration: today it's CF Tunnel → ARM A1 with Docker Postgres/Redis; tomorrow some of that moves to OCI Autonomous DB or CF Pages. Decisions about what moves and why need a durable record, not Slack scrollback.

Decision

Add five workspace-level cloud-domain agents to .claude/agents/. Three are read-only advisors or auditors (oci-expert, cloudflare-expert, secops-agent); one (cloud-architect) writes ADRs; one (finops-agent) writes the cost ledger. None edits source code: infra-agent and the app-repo agents stay the only authors of HCL, YAML, and application code.

| Agent | Role | Writable surface |
| --- | --- | --- |
| oci-expert | Deep OCI specialist: services, free-tier limits, IAM, region/AD, ARM A1 capacity, OCIR, Email Delivery, Object Storage S3-compat backend, Autonomous DB | None (read-only advisor) |
| cloudflare-expert | Deep Cloudflare specialist: Zones, DNS, Tunnel, Access (Zero Trust), Email Routing, WAF, Workers/Pages | None (read-only advisor) |
| cloud-architect | Solution architecture across CF + OCI + GH + Google Workspace. Decides placement, draws call graphs, writes ADRs | docs/docs/adr/ |
| finops-agent | Cost guard. Estimates monthly spend against free-tier limits, blocks apply when cost is unjustified | infra/finops/ |
| secops-agent | Security review: IAM, NSG, CF Access, WAF, DNS hardening, secret hygiene, rotation cadence | None (read-only auditor) |
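
The writable surfaces in the table above can be read as a small allow-list. The following is a minimal sketch, not the actual enforcement mechanism: the `may_write` helper and the idea of checking paths in code are assumptions made for illustration; only the agent names and directories come from this ADR.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Writable surfaces copied from the agent table; None means read-only.
WRITABLE_SURFACE: Dict[str, Optional[str]] = {
    "oci-expert": None,
    "cloudflare-expert": None,
    "cloud-architect": "docs/docs/adr/",
    "finops-agent": "infra/finops/",
    "secops-agent": None,
}


def may_write(agent: str, path: str) -> bool:
    """Return True if `path` lies inside the agent's writable surface.

    Hypothetical helper: unknown agents and read-only agents are denied.
    """
    surface = WRITABLE_SURFACE.get(agent)
    if surface is None:
        return False
    target = PurePosixPath(path)
    if ".." in target.parts:
        return False  # refuse paths that try to climb out of the surface
    root = PurePosixPath(surface)
    # Compare leading path components rather than raw string prefixes,
    # so "infra/finops-old/x" does not match "infra/finops/".
    return target.parts[: len(root.parts)] == root.parts
```

For example, `may_write("finops-agent", "infra/finops/ledger.md")` is true, while the same call for `secops-agent` is false because its surface is `None`.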

The ADR gate becomes load-bearing: infra-agent does not write a Tofu module for a new long-lived service until an Accepted ADR from cloud-architect exists. This prevents drift between "what the stack actually is" and "what we said it would be."
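
One way to picture the gate is as a status check over ADR text. This is a sketch under assumptions: it supposes each ADR carries a line of the form `Status: <value>` (as this one does), that the caller has already selected the ADRs covering the service in question, and that statuses other than Accepted (such as the `Proposed` used below) exist. The function names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches a line such as "Status: Accepted", optionally preceded by a bullet.
STATUS_LINE = re.compile(
    r"^\s*(?:[-*•]\s+)?Status:\s*(\w+)",
    re.IGNORECASE | re.MULTILINE,
)


def adr_status(adr_text: str) -> Optional[str]:
    """Return the lower-cased status of an ADR, or None if no status line exists."""
    match = STATUS_LINE.search(adr_text)
    return match.group(1).lower() if match else None


def gate_allows_module(adr_texts: Iterable[str]) -> bool:
    """True only if at least one of the given ADRs is Accepted.

    An empty collection means no ADR exists yet, so the gate stays closed.
    """
    return any(adr_status(text) == "accepted" for text in adr_texts)
```

With no ADR, or only a proposed one, `gate_allows_module` returns false and infra-agent would hold off on the Tofu module.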

Consequences

  • Cost (delta vs free tier): $0. All five agents are advisory and run inside Claude Code; no SaaS, no new infra.
  • Operational surface: Five new agent files to maintain; one new file (infra/finops/ledger.md) under FinOps ownership; ADRs now mandatory for new services. Operator burden: read the verdicts.
  • Security posture: Strictly improved. secops-agent is a mandatory reviewer on every infra PR — currently security review is opportunistic. IAM scope, NSG rules, mail-auth alignment, and rotation cadence get checked every time.
  • Decision quality: Architecturally durable. Every new service has an ADR with alternatives, cost delta, and reversibility — instead of "we decided this in chat last Tuesday."
  • Cost discipline: Free-tier breaches caught before apply, not at billing time.
  • Risk: Agent sprawl. We address this with a hard cap (see "When to add a sixth agent"): a sixth agent requires its own ADR justifying the gap the five don't cover.
  • Migration path if we revisit: Agents are markdown files. Collapsing or splitting later is trivial — there's no infrastructure or external dependency to undo.
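
The cost-discipline point (free-tier breaches caught before apply) comes down to simple headroom arithmetic. The sketch below is illustrative only: the `Allowance` type, the resource names, and all numbers are placeholders, not real OCI or Cloudflare limits and not the actual format of infra/finops/ledger.md.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Allowance:
    name: str      # hypothetical resource key
    limit: float   # free-tier allowance, in the resource's own unit
    in_use: float  # what the stack already consumes


def headroom_after(allowance: Allowance, requested: float) -> float:
    """Remaining free-tier headroom if `requested` more units are consumed."""
    return allowance.limit - allowance.in_use - requested


def breaches(allowances: Iterable[Allowance], requested: Dict[str, float]) -> List[str]:
    """Names of resources that a proposed change would push past the free tier.

    An unknown resource name raises KeyError on purpose, so an untracked
    resource is never silently approved.
    """
    by_name = {a.name: a for a in allowances}
    return [
        name
        for name, amount in requested.items()
        if headroom_after(by_name[name], amount) < 0
    ]


# Placeholder figures for illustration only.
ledger = [
    Allowance("compute_units", limit=10, in_use=8),
    Allowance("storage_units", limit=100, in_use=40),
]
over = breaches(ledger, {"compute_units": 3})
```

Here `over` names `compute_units`, because 8 units in use plus 3 requested exceeds the placeholder limit of 10; a non-empty result is the case in which finops-agent would block the apply.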

Alternatives considered

| Option | Why rejected |
| --- | --- |
| Fold all cloud knowledge into infra-agent | Already the largest agent. Conflating "advise on architecture" with "write the HCL" loses the separation that catches mistakes. |
| Per-service agents (DNS, network, mail, registry, …) | Sprawl. Each new service would warrant a new agent; routing decisions get complicated. Five domain-shaped agents fit the actual decision boundaries. |
| Use plugins instead of workspace agents | Plugins are global to the user, not workspace-specific. These agents encode MGH-specific conventions (free-tier strategy, single-domain ZT, deny-list overlap), so they belong in the workspace. |
| Skip finops-agent (run cost checks manually) | Manual cost review is the failure mode this avoids. Free-tier breaches are silent until the bill arrives. |
| Skip secops-agent (rely on cavecrew-reviewer) | cavecrew-reviewer is correctness/style. Security review requires a specific checklist and posture knowledge, which is a different skill. |
| Add network-agent, data-agent, observability-agent from day one | Premature. Networking is folded into cloudflare-expert + oci-expert; there's no telemetry surface to observe yet; data lifecycle is one Postgres role. Each can be split out later with its own ADR if the load justifies. |

When to add a sixth agent

Open a new ADR (0002-…) that demonstrates:

  1. A specific domain none of the five existing agents cover well.
  2. A pattern of decisions in that domain that recur often enough to merit a specialist.
  3. Why folding into an existing agent would make that agent unwieldy.

Examples of plausible future agents (none justified today):

  • network-agent — when VCN topology spans multiple peered networks or load-balancing decisions become non-trivial.
  • observability-agent — when telemetry surface exceeds one node-exporter and one log-stream.
  • data-agent — when persistence spans Postgres + Redis + Autonomous DB + Object Storage with non-trivial migration / restore drills.
  • ml-agent — if/when MGH ships AI features that need ops-side support beyond pydantic-ai.

Until then: five.