Vibe Coding Terraform Configs The Hard Way

Using Claude Code to write Terraform config for a production GKE + Consul + HCP Vault stack turned out to be quite the adventure. This post is a follow-up to Taming Rogue Agents: Defeating ASI03 with SPIFFE Identity Federation - specifically, the lessons learned getting the infrastructure right: what to put in your claude.md, how to structure prompts, and why treating Claude Code as a module orchestrator rather than a free-range code generator makes the whole process significantly less painful.

For reference purposes, this is the repo I eventually ended up with.

The Target Architecture

This is the target architecture. The plan was to have Claude Code write the Terraform for it - which is where things got interesting.

Note
For the purposes of this blog post, all the .md files and settings.json reside under the project folder, consul-mcp-agents, which is the top-level folder of the repo.

The Starting Point: A One-Shot Prompt

The starting point is a set of simple Python-based agents and an MCP server from this repo, kept deliberately demo-grade to make the concepts easy to follow. These are packaged as container images and deployed as pods onto a GKE cluster, which runs a Consul dataplane - with the Consul control plane hosted on an external GCP compute instance, reusing elements of this repo for creating the base GKE and Consul infrastructure.

Security is handled entirely through HCP Vault (used for ease of provisioning purposes), which serves three distinct roles:

  • certificate authority for Consul's mTLS

  • a GCP credentials engine for provisioning short-lived cloud credentials

  • a KV2 secrets engine for storing the mappings between users and their permitted tools or capabilities.

In summary, the initial prompt was to reuse the Python source code from one repo and the base infrastructure from another, add HCP Vault as Consul's certificate authority, and store the user-to-tool/capability mappings in Vault's KV2 secrets engine. The end goal was a single coherent mono-repo for deploying the target infrastructure.

Project Level claude.md: Context Is Key

This is the claude.md I settled on. To recap, this file provides context about your project when a Claude Code session starts. I built it iteratively — at the end of every significant refactoring or debugging cycle, I asked Claude to update it. A good rule of thumb: keep your project-level claude.md under 200 lines. This file loads at the start of every session, so the longer it is, the more tokens you burn.

# CLAUDE.md — consul-mcp-agents

GKE + Consul Dataplane + HCP Vault Dedicated + MCP AI Agents. Vault PKI CA + dynamic GCP credentials, Consul mTLS mesh, LangChain agents on GKE.

## Commands

task --list              # show all tasks
task tf:plan             # plan with var file
task tf:apply            # docker rebuild + full apply
task consul:bootstrap-acl       # bootstrap Consul ACLs (idempotent); always syncs K8s secret
task consul:verify-auth-method  # confirm consul-k8s-component-auth-method exists on server
task consul:helm-clean          # uninstall Consul Helm + delete namespace (clean retry)
task consul:refresh-tls         # re-issue Consul server TLS cert; sync consul-ca-cert K8s secret
tflint --init --config .tflint.hcl  # install tflint plugins (first run)
tflint --config .tflint.hcl --chdir tf/scenarios/consul-mcp-gke  # local lint

## Terraform conventions

- **Module inputs**: all modules expose project_id, region, environment
- **Naming**: {environment}-{component}-{resource}
- **No default credentials**: vault_users, mcp_ttyd_credential, allowed_ingress_cidrs must be set in tfvars. Validation rejects 0.0.0.0/0.
- **Sensitive vars need taint**: Terraform won't detect changes to sensitive=true fields. Use terraform taint <resource> then apply.
- Always run terraform fmt -recursive before committing 
- GKE uses native google_container_cluster / google_container_node_pool — do not migrate to community module 
- CI runs fmt -check, validate, tfsec, and tflint on every PR touching tf/. Fix locally before pushing.

## Conventions

- Never hardcode GCP project IDs — always var.project_id 
- Vault policies: least-privilege, add paths explicitly
- PKI TTLs: root 10yr, intermediate 5yr, leaf 72h (auto-rotate)
- Consul Connect mTLS always on — no plaintext service communication

## Service mesh architecture 

- **Local/stdio**: MCP servers as subprocesses (transport: stdio). For local dev.
- **GKE/SSE**: Separate K8s Deployments. Agent connects via Consul upstreams on localhost:20000 (data), localhost:20001 (compute).

| Deployment | Purpose | Port |
| ---|---|---|
| `mcp-agent` | CLI + ttyd web terminal | 7681 |
| `mcp-data-server` | GCS + BigQuery MCP (SSE) | 8080 |
| `mcp-compute-server` | GCE MCP (SSE) | 8080 |

**Adding a new MCP server**: add entry to `local.mcp_servers` in `deployment.tf`, add upstream to agent annotation, add to `mcp_servers` map in `kv.tf` `yamlencode()` block. The `for_each` creates deployment, service, PDB, SA, ConfigMap, and intention automatically.

## Critical operational rules

### Vault K8s auth

- vault write auth/kubernetes/config replaces ALL fields — never omit kubernetes_ca_cert or issuer
- Never set audience on Vault K8s roles — causes "invalid audience (aud) claim" errors
- vault-reviewer SA/CRB/Secret are top-level in vault-config.tf. phase3:apply explicitly targets them alongside module.mcp_agents. If pods show Init:0/2 with 403: check kubectl get clusterrolebinding vault-reviewer, run task vault:configure-k8s-auth

### Consul TLS

- vault-agent writes TLS as vault:vault 0640. Consul reads via: (1) Packer usermod -aG vault consul, (2) vault-agent template chgrp consul post-render. Both required — losing either causes crash-loops.
- Do NOT systemctl reload consul to renew certs — use task consul:refresh-tls or vault-agent-cert-refresh.sh directly. SIGHUP causes Consul to re-initialize the Vault CA provider → new Connect intermediate CA → all pods crash.
- consul-ca-cert K8s secret has lifecycle { ignore_changes = [data] }. Terraform seeds it; consul:refresh-tls manages live data.

### Phase-gated apply

1. **Phase 1** (gke_cluster_ready = false): VPC, GKE, Consul VMs, HCP Vault
2. **Phase 2** (auto-detected by gke:ensure-ready): Consul Helm, VSO
3. **Phase 3**: MCP deployments + vault-reviewer resources
4. **Phase 4**: Full reconcile

Never set gke_cluster_ready = true before GKE exists — data source lookup will fail.

**Phase 3 prerequisite**: The Docker image must exist in Artifact Registry before phase3:apply. If pods show ImagePullBackOff, run task docker:build && task docker:push then kubectl rollout restart deployment -n mcp-agents. 

### Consul Helm ACL init

The consul-server-acl-init Job must complete before dataplane pods authenticate. wait_for_jobs = true in helm_release ensures this. Consul VMs must use the **private** GKE endpoint for TokenReview (gke:ensure-ready auto-populates it).

**Recovery** (auth method missing): task consul:helm-clean then task phase2:apply

### Recovery: cert expired, pods in CrashLoopBackOff
```
task consul:refresh-tls
kubectl rollout restart deployment -n consul
```
### Vault-agent HCL heredoc bug
- **Do NOT use <<-TMPL heredocs in vault-agent templates to render YAML/config files**. Vault-agent's HCL parser corrupts indentation: lines after the first get spurious extra whitespace, producing invalid YAML. Use yamlencode() in Terraform to store the complete YAML as a single KV string, and a one-liner vault-agent template to read/write it: "{{ with secret \"path\" }}{{ .Data.data.settings_yaml }}{{ end }}".
- The KV secrets mcp-agents/config and mcp-agents/policies each store a single pre-rendered YAML field (settings_yaml / policies_yaml), not individual config fields.

### Provider Gotchas
- **HCP Vault admin token expires in 6h** — re-run terraform apply -target=module.hcp_vault to regenerate
- **Consul provider needs IAP tunnel** — task consul:tunnel first, set consul_address_override = "https://localhost:18501" with insecure_https = true 
- **K8s provider identity bug**: terraform apply/destroy may hit Unexpected Identity Change. The fix script (tf-fix-k8s-identity.sh) unconditionally removes all matched resources from state before apply/destroy — Terraform recreates them cleanly. Runs automatically in phase3:apply (scoped to consul_ resources only); for destroy use task tf:fix-k8s-identity. IMPORTANT: When passing custom patterns to the script, scope to the specific resource type affected (e.g. consul_), not the entire module — removing all K8s resources from state forces re-import of every deployment, service, configmap, PDB, and SA.

### Python patterns

- **RBAC at LLM level** — tool filtering in agent.py. MCP servers trust the caller. Always add new tools to capabilities.yaml role definitions.
- **SQL validation** — _is_read_only_query() uses sqlglot AST. Falls back to regex.
- **Adding a new MCP tool**: add to server's list_tools() + call_tool(), add to capabilities.yaml, rebuild: task docker:build && task docker:push && task mcp:restart

### Security rules

- BigQuery: DML/DDL blocked via sqlglot AST. Cost capped by BQ_MAX_BYTES_BILLED (1 GB default).
- Compute: `ALLOWED_MACHINE_TYPES` whitelist (e2/n2 types)
- All tool handlers validate required args before GCP calls
- Error sanitization: only `type(exc).__name__` returned to LLM, details logged server-side
- `allowed_ingress_cidrs` has no default, rejects `0.0.0.0/0`
- IAM: least-privilege SAs (`storage.objectAdmin` not `storage.admin`, `compute.instanceAdmin.v1` not `compute.admin`)
- Docker: Chainguard Wolfi-based images (`cgr.dev/chainguard/wolfi-base`, `cgr.dev/chainguard/python:latest-dev`) — zero/near-zero CVEs vs. dozens in `python:3.11-slim`. Runtime python-3.14 apk version must match builder `latest-dev` Python version.
- Docker binary verification: Vault uses GPG-signed `SHA256SUMS` from HashiCorp (key `C874011F0AB405110D02105534365D9472D7468F`); ttyd uses upstream `SHA256SUMS` from GitHub release. No hardcoded checksums — version bumps only require changing the `ARG` version.

### CI/CD

**GitHub Actions pipeline** (.github/workflows/terraform.yml) runs on PRs and pushes to main that touch tf/ files. Four jobs, no cloud credentials required:

| Job | Tool | Purpose |
|---|---|---|
| `fmt` | `terraform fmt -check -recursive` | Catch formatting drift |
| `validate` | `terraform init -backend=false && validate` | Syntax + internal consistency |
| `tfsec` | `aquasecurity/tfsec-action` | Static security analysis (CIS, OWASP) |
| `tflint` | `tflint` + google plugin | Provider-aware linting (invalid types, deprecated args) |

- **No cloud credentials needed** — validation uses `-backend=false`
- **Applies remain manual** via Taskfile (task `tf:plan` / `task tf:apply`) — the phase-gated workflow is too complex for automated apply
- **tfsec runs with `soft_fail: true`** initially — tighten to hard fail after baseline findings are resolved
- **tflint config**: `.tflint.hcl` at repo root, uses the google provider ruleset
- To run the same checks locally before pushing: `terraform fmt -check -recursive tf/ && tflint --init --config .tflint.hcl && tflint --config .tflint.hcl --chdir tf/scenarios/consul-mcp-gke`

## Deployment discipline

All fixes must be automated. No "run this manually after deploy" — encode in Taskfile tasks, Terraform, or Packer.

**Every change must**: update docs if applicable, state the apply commands:

- Terraform: `task tf:plan` then `task tf:apply`
- Docker/Python: `task docker:build && task docker:push && task mcp:restart`
- Packer: `task packer:build` then `task tf:apply`
- Full: `task all`

## Keeping CLAUDE.md current

Update this file whenever you:

- Add or remove a Taskfile task (update Commands)
- Discover a new failure mode or operational pitfall (add to Critical operational rules)
- Change a Terraform convention, naming scheme, or module interface
- Add a new security control or change an existing one
- Resolve a recurring mistake — capture the prevention rule here

Do not wait to be asked. If a change would have made this session easier had it been written down before, write it down now.

Parts of this file are specific to GCP and to the architecture in the diagram at the top of this post, but the overall structure can serve as a template to customize for other projects.
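
As an illustration of the "no default, rejects 0.0.0.0/0" convention listed in the file above, the following is a minimal sketch of how such a variable might be declared. Only the variable name allowed_ingress_cidrs comes from the file; the description, the extra CIDR syntax check, and the error messages are assumptions and may not match the repo.

```hcl
# Hypothetical declaration; only the variable name is taken from claude.md.
variable "allowed_ingress_cidrs" {
  description = "CIDR ranges permitted to reach the ingress. No default, so tfvars must set it."
  type        = list(string)

  # Assumed extra check: every entry must parse as a CIDR block.
  validation {
    condition = alltrue([
      for cidr in var.allowed_ingress_cidrs : can(cidrhost(cidr, 0))
    ])
    error_message = "Every entry must be a valid CIDR block, for example 203.0.113.0/24."
  }

  # The rule from claude.md: never open ingress to the whole internet.
  validation {
    condition     = !contains(var.allowed_ingress_cidrs, "0.0.0.0/0")
    error_message = "The range 0.0.0.0/0 is not allowed; list specific source ranges instead."
  }
}
```

Because the variable has no default, a plan fails until tfvars supplies a value, and a value containing 0.0.0.0/0 is rejected before any resource is evaluated.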

terraform.md: Complementing claude.md

This is a rules file that works in conjunction with claude.md. Again, parts of it are bespoke to the project, such as the references to GCP project IDs and to project-specific elements like kubernetes_deployment.mcp_server, but it too can serve as a template for other projects.

---
paths:
  - "tf/**/*.tf"
  - "tf/**/*.tf.tpl"
---

# Terraform Development Rules

## Variables and outputs
- Always include `description` on variables and outputs
- Add `type` constraints — prefer specific types (`string`, `number`, `list(string)`) over `any`
- Add `validation {}` blocks for variables with format requirements (CIDRs, SA emails, zones)
- Mark outputs containing secrets with `sensitive = true`

## Resource patterns
- Prefer `for_each` over `count` for named resources — enables targeted operations and readable state addresses
- Use `depends_on` only for hidden dependencies that Terraform cannot infer from references
- Use `locals {}` to define maps that drive `for_each` — keep resource definitions DRY

## Naming
- Resources: `{component}_{purpose}` (e.g. `kubernetes_deployment.mcp_server`)
- Variables: snake_case, descriptive (e.g. `vault_k8s_agent_role`, not `role`)
- Outputs: match the resource attribute they expose (e.g. `server_deployment_names`)

## Module conventions
- Standard inputs: `project_id`, `region`, `environment` where applicable
- Resource naming pattern: `{environment}-{component}-{resource}`
- Outputs should expose stable identifiers that downstream modules can reference
- Use `templatefile()` for multi-line configs — keep templates in `templates/` subdirectory

## Safety
- Never hardcode GCP project IDs — use `var.project_id`
- Never set defaults for credential variables (`sensitive = true` vars should have no default)
- Validate CIDR inputs — reject `0.0.0.0/0` for ingress rules
- Use `lifecycle { prevent_destroy = true }` on stateful resources (databases, buckets with data)

## Style
- Run `terraform fmt -recursive` before committing (automated by PostToolUse hook)
- Group related resources in the same file by function (e.g. `rbac.tf`, `service.tf`)
- Keep `locals {}` at the top of the file that uses them
- Use comments to explain "why", not "what" — the HCL is the "what"
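
To make the "Resource patterns" and "Variables and outputs" rules above concrete, here is a minimal sketch of a locals map driving for_each. The map contents, the service account resource, and all names are hypothetical and chosen only for illustration; they are not taken from the repo.

```hcl
locals {
  # Hypothetical map: one entry per MCP server.
  mcp_servers = {
    data    = { display_name = "MCP data server" }
    compute = { display_name = "MCP compute server" }
  }
}

variable "project_id" {
  description = "GCP project that owns the service accounts."
  type        = string
}

variable "environment" {
  description = "Deployment environment, used as a name prefix (e.g. dev)."
  type        = string
}

# for_each yields stable addresses such as
# google_service_account.mcp_server["data"], so one entry can be
# targeted or removed without disturbing the others.
resource "google_service_account" "mcp_server" {
  for_each = local.mcp_servers

  project      = var.project_id
  account_id   = "${var.environment}-mcp-${each.key}"
  display_name = each.value.display_name
}

output "mcp_server_service_account_emails" {
  description = "Service account email per MCP server, keyed by server name."
  value       = { for name, sa in google_service_account.mcp_server : name => sa.email }
}
```

With count, removing the first entry would shift every later index and force replacements; with for_each keyed by name, only the removed key changes.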

Context Windows: Keep Them Clean

The context window is Claude's working memory: it holds your entire conversation, including every message, every file Claude reads, and every command output. LLM performance degrades as the context fills, and Claude may start "forgetting" earlier instructions or making more mistakes. At session startup, only the names and metadata of skills are loaded. In addition, sub agents are used to execute skills, meaning that executing a skill does not add anything to the current context window.

There are also tools such as Chunk Hound that claim to index your repository's code more efficiently, reducing the number of tokens an AI coding assistant burns through when traversing it. This is on my TODO list.

Skills: What I Use

This is tf-infra, the main skill that I currently use when creating Terraform configs:

---
name: tf-infra
description: Plan, apply, and validate Terraform infrastructure across the GCP + HashiCorp stack (Vault, Consul, Nomad, GKE, HCP). Handles module authoring, scenario wiring, tfvars management, state bootstrapping, and post-apply validation. Use when working on Terraform modules or scenarios, troubleshooting infrastructure state, or scaffolding new environments.
argument-hint: "[scenario:path] [module:name] [action:plan|apply|destroy|validate] [env:dev|staging|prod]"
allowed-tools: Bash, Read, Write, Edit, Glob, Grep
model: claude-sonnet-4-6
---

# Terraform Infrastructure Assistant

You are working on a GCP + HashiCorp infrastructure codebase using Terraform modules and scenarios.

## Arguments

$ARGUMENTS

Parse from `$ARGUMENTS`:
- `scenario:PATH` — path to the scenario directory (relative to repo root or absolute)
- `module:NAME` — name of a module under `tf/modules/`
- `action:VERB` — what to do: `plan`, `apply`, `destroy`, `validate`, `fmt`, `scaffold`
- `env:NAME` — environment name (default: `dev`)

---

## Step 0: Pre-flight checks

Before doing anything else, verify the required environment variables are present for the action being performed.

**Always required:**
```bash
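# Passes if either variable is set; fails with the message only if both are missing.
: "${GOOGLE_CREDENTIALS:-${GOOGLE_APPLICATION_CREDENTIALS:?Required: set GOOGLE_CREDENTIALS or GOOGLE_APPLICATION_CREDENTIALS}}"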
```

**Required for HCP resources:**
```bash
: "${HCP_CLIENT_ID:?Required for HCP resources}"
: "${HCP_CLIENT_SECRET:?Required for HCP resources}"
```

If any required variable is missing, stop immediately and tell the user which variable is absent and where to obtain it. Do not proceed.

Also check that `.terraform.lock.hcl` is committed in the target directory. If it is absent or gitignored, warn the user — the lock file must be committed to ensure provider version consistency across runs.

---

## Step 1: Locate and understand the target

If a `scenario:` or `module:` argument was given, read its directory structure and key files (`main.tf`, `variables.tf`, `outputs.tf`) to understand current state before making any changes.

If no argument was given, list available scenarios and modules:

```bash
find . -name "main.tf" | sed 's|/main.tf||' | sort
```

**Workspace selection:** When an `env:` argument is provided, select the matching workspace before any Terraform command:
```bash
terraform -chdir=<TARGET_DIR> workspace select <ENV> || terraform -chdir=<TARGET_DIR> workspace new <ENV>
```
If no `env:` is given, default to `dev`. Always confirm the active workspace before running plan or apply:
```bash
terraform -chdir=<TARGET_DIR> workspace show
```

---

## Step 2: Validate conventions before any edit

Before editing any `.tf` file, check:

1. **Formatting** — run `terraform fmt -check -recursive` in the relevant directory. Fix any formatting issues first.
2. **Standard variables** — every module must expose `project_id`, `region`, and `environment`. Add them if absent.
3. **Naming pattern** — resources follow `{environment}-{component}-{resource}`. Flag deviations.
4. **No hardcoded values** — no GCP project IDs, regions, or credentials in `.tf` files. All must come from variables.

---

## Step 3: Execute the requested action

### `validate` (default if no action specified)
```bash
terraform -chdir=<TARGET_DIR> init -backend=false
terraform -chdir=<TARGET_DIR> validate
terraform fmt -check -recursive <TARGET_DIR>
```

### `plan`
First verify the tfvars file exists:
```bash
test -f <TARGET_DIR>/<ENV>.tfvars || { echo "ERROR: <ENV>.tfvars not found in <TARGET_DIR>"; exit 1; }
```
Then init and plan:
```bash
terraform -chdir=<TARGET_DIR> init
terraform -chdir=<TARGET_DIR> plan -var-file=<ENV>.tfvars -out=tfplan
```
If you suspect state drift (resources modified outside Terraform), run a refresh first and surface the diff:
```bash
terraform -chdir=<TARGET_DIR> apply -refresh-only -var-file=<ENV>.tfvars
```
Only proceed to plan once the user has reviewed any drift. Use `-refresh=false` only when the user explicitly confirms the state is known-clean and they want a faster plan.

### `apply`
Show a human-readable summary of the plan output first:
- List resources to **add**, **change**, and **destroy** by name
- Highlight any destroys or replacements in the output — these require extra attention

Then present the following confirmation prompt to the user:
```
Workspace : <ENV>
Target    : <TARGET_DIR>
Add       : N  Change: N  Destroy: N

Type 'yes' to apply or anything else to cancel.
```
Only run apply after receiving an explicit `yes`:
```bash
terraform -chdir=<TARGET_DIR> apply tfplan
```

### `destroy`
Always confirm with the user. Show what will be destroyed:
```bash
terraform -chdir=<TARGET_DIR> plan -destroy -var-file=<ENV>.tfvars
```
Present the same confirmation prompt as apply (with Destroy count highlighted). Wait for explicit `yes` before running destroy.

### `fmt`
```bash
terraform fmt -recursive <TARGET_DIR>
```

### `scaffold`

**Registry check (must run first):** Before creating any new module, search for existing modules in this order: private registry first, then public registry. Only scaffold custom if neither has a suitable match.

#### Phase 1: Private registry (via HashiCorp Terraform MCP server)

Check whether the `terraform-mcp-server` tools are available by attempting to use the `ToolSearch` tool with query `+terraform`. If MCP tools like `mcp__terraform__searchPrivateModules` or `mcp__terraform__searchModules` are available, use them:

```
mcp__terraform__searchPrivateModules(query: "<MODULE_NAME>", provider: "google")
```

If the MCP server is not configured (tools not found), skip to Phase 2 and note to the user:
> Private registry search skipped — HashiCorp Terraform MCP server not configured. Add it to `.claude/mcp.json` to enable private module discovery.

If a private module is found, present it with:
1. Module name, namespace/organization, description, and version
2. An example `module` block using the private registry source:
   ```hcl
   module "<MODULE_NAME>" {
     source  = "app.terraform.io/<ORGANIZATION>/<NAME>/<PROVIDER>"
     version = "~> <LATEST_MAJOR>"

     project_id  = var.project_id
     region      = var.region
     # ... module-specific inputs
   }
   ```
3. Ask the user whether to adopt it, check the public registry instead, or scaffold custom

If the user accepts a private module, stop here — do not scaffold.

#### Phase 2: Public registry

Only run this if no suitable private module was found (or the MCP server is unavailable):

```bash
curl -s "https://registry.terraform.io/v1/modules/search?q=<MODULE_NAME>&provider=google&limit=5" | jq '.modules[] | {name: .name, namespace: .namespace, description: .description, downloads: .downloads, verified: .verified}'
```

Also check for provider-official modules (e.g. `terraform-google-modules`):
```bash
curl -s "https://registry.terraform.io/v1/modules/search?q=<MODULE_NAME>&namespace=terraform-google-modules&limit=5" | jq '.modules[] | {name: .name, namespace: .namespace, description: .description, downloads: .downloads, verified: .verified}'
```

Evaluate results by: verified status, download count, description match, and whether inputs/outputs align with the user's needs. If a suitable module is found:

1. Present the match to the user with name, source, description, and download count
2. Show an example `module` block wiring it into the project with standard variables:
   ```hcl
   module "<MODULE_NAME>" {
     source  = "<NAMESPACE>/<NAME>/<PROVIDER>"
     version = "~> <LATEST_MAJOR>"

     project_id  = var.project_id
     region      = var.region
     # ... module-specific inputs
   }
   ```
3. Ask the user whether to adopt the registry module or proceed with a custom scaffold

#### Phase 3: Custom scaffold (fallback)

Only create a new module if no suitable registry module exists in either registry **or** the user explicitly chooses to scaffold a custom one.

**Custom scaffold file structure** (when proceeding):
- `main.tf` — resource definitions with a header comment block
- `variables.tf` — standard vars (`project_id`, `region`, `environment`) plus module-specific vars
- `outputs.tf` — outputs relevant to callers
- `README.md` — module name, description, inputs table, outputs table, example usage

---

## Step 4: HashiCorp-specific context

### Vault PKI pattern (vault-pki-consul module)
- Root CA → 10yr validity, never exported
- Intermediate CA → 5yr validity
- Leaf certs → 72h, auto-rotated by Consul Connect
- Always set `allow_any_name = false` on leaf issuers

### Vault GCP secrets engine
- Use `impersonated_account` type, not `access_token` directly
- TTL should be `300s` (5 minutes) for agent credentials
- Service account naming: `{environment}-{agent-name}@{project}.iam.gserviceaccount.com`

### Consul Dataplane mode
- Consul VMs run the control plane externally; GKE pods use the dataplane sidecar
- gRPC port 8502 must be open between GKE nodes and Consul VMs
- mTLS is always enabled — never set `tls.enabled = false`

### Vault Kubernetes auth — critical rules
- **`vault write` replaces all fields** — `vault write auth/kubernetes/config` sets every unspecified field to its default (empty). Always include `kubernetes_ca_cert`, `issuer`, `token_reviewer_jwt`, and `kubernetes_host` in a single write. Partial writes silently wipe existing values.
- **Never set `audience` on Vault K8s roles** — projected SA tokens carry a cluster-specific audience. Setting `audience=vault` on the role causes Vault to validate the JWT `aud` claim locally, producing "invalid audience (aud) claim" errors. Leave the field unset.
- **vault-reviewer SA** — Vault requires a dedicated SA bound to `system:auth-delegator` to call the K8s TokenReview API. Manage it in Terraform (`vault-config.tf`); if deleted out-of-band, `terraform apply` recreates it and wires the JWT automatically.

### HCP Vault
- Use the `hcp` provider, not `vault` provider, for HCP-managed resources
- Cluster tier: `dev` for development, `plus` for production
- HVAC connection uses the cluster public endpoint

### Phase-gated Terraform applies
Some scenarios use a boolean gate variable (e.g. `gke_cluster_ready`) to prevent Kubernetes/Helm resources from being created before the cluster exists:
- **Phase 1** — gate variable `false`: creates base infrastructure (VPC, cluster, VMs, Vault). No Helm or K8s provider resources.
- **Phase 2** — set gate variable `true` in tfvars, then apply: enables Helm releases, K8s resources, and Vault K8s auth config.

Never set the gate to `true` before the cluster exists — data source lookups will fail and corrupt the plan.

### Nomad (terraform-gcp-nomad)
- Sentinel policies live in `tf/modules/nomad-sentinel-policy/`
- Job specs are separate from infrastructure Terraform
- Use `nomad_job` data source to reference running jobs, not hardcoded IDs

---

## Step 5: Post-action validation

After any apply or destructive change:
1. Run `terraform output` to confirm expected outputs are present
2. Run any smoke tests defined in `Taskfile.yml` or `scripts/`
3. Report: what was created/modified/destroyed, any outputs, and next steps

---

## Refactoring resources safely

When renaming or moving resources without destroying them, use a `moved` block instead of deleting and recreating:

```hcl
moved {
  from = google_compute_instance.old_name
  to   = google_compute_instance.new_name
}
```

Rules:
- Always add `moved` blocks when renaming resources or extracting them into modules
- Run `plan` after adding a `moved` block — the plan should show 0 destroys
- Remove `moved` blocks only after the apply has succeeded and state is clean
- Never use `terraform state mv` when a `moved` block is the correct tool

---

## Common pitfalls to avoid

- Never run `terraform apply` without showing the plan first
- Never commit `.terraform/` directories or `*.tfstate` files
- Always commit `.terraform.lock.hcl` — it pins provider versions across runs
- HCP provider requires `HCP_CLIENT_ID` and `HCP_CLIENT_SECRET` env vars (validated in Step 0)
- GCP provider requires `GOOGLE_CREDENTIALS` or `GOOGLE_APPLICATION_CREDENTIALS` (validated in Step 0)
- If `terraform init` fails on backend, check GCS bucket exists and IAM permissions
- Never use `-refresh=false` unless the user explicitly confirms state is known-clean
- Always check the active workspace with `terraform workspace show` before plan or apply
- **Sensitive variables are invisible to Terraform** — changes to `sensitive = true` fields (e.g. tokens, passwords, Vault config) are never detected by plan. If a sensitive value changes, force a re-apply with:
  ```bash
  terraform -chdir=<TARGET_DIR> taint '<resource_address>'
  terraform -chdir=<TARGET_DIR> apply -var-file=<ENV>.tfvars
  ```
  Use `terraform state list | grep <module>` to find the exact resource address.
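
The phase gate described in the skill above can be sketched in HCL as follows. This is a minimal illustration under stated assumptions, not the repo's actual implementation: only the gate variable name gke_cluster_ready and the mcp-agents namespace name appear in the files above, while cluster_name, the resource labels, and the choice of count as the gating mechanism are assumptions. Provider configuration is omitted.

```hcl
variable "gke_cluster_ready" {
  description = "Set to true only after the phase 1 apply has created the GKE cluster."
  type        = bool
  default     = false
}

variable "project_id" {
  description = "GCP project that hosts the cluster."
  type        = string
}

variable "region" {
  description = "Region of the GKE cluster."
  type        = string
}

# Hypothetical input, not named in the source files.
variable "cluster_name" {
  description = "Name of the GKE cluster created in phase 1."
  type        = string
}

# While the gate is false this lookup is skipped entirely, so phase 1
# can plan without a cluster to query.
data "google_container_cluster" "this" {
  count = var.gke_cluster_ready ? 1 : 0

  name     = var.cluster_name
  location = var.region
  project  = var.project_id
}

# Kubernetes resources follow the same gate and only appear in phase 2+.
resource "kubernetes_namespace_v1" "mcp_agents" {
  count = var.gke_cluster_ready ? 1 : 0

  metadata {
    name = "mcp-agents"
  }
}
```

Using count here is a deliberate exception to the "prefer for_each" rule: a boolean toggle has no meaningful key, so a zero-or-one count is the simpler fit.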

settings.json: What I Use

This is the settings.json file that I currently use:

{
  "permissions": {
    "allow": [
      "Bash(terraform init:*)",
      "Bash(terraform plan:*)",
      "Bash(terraform fmt:*)",
      "Bash(terraform validate:*)",
      "Bash(terraform state:*)",
      "Bash(terraform output:*)",
      "Bash(terraform taint:*)",
      "Bash(terraform console:*)",
      "Bash(task:*)",
      "Bash(kubectl get:*)",
      "Bash(kubectl describe:*)",
      "Bash(kubectl logs:*)",
      "Bash(helm get:*)",
      "Bash(helm list:*)",
      "Bash(gcloud container:*)",
      "Bash(gcloud compute:*)",
      "Bash(gcloud iam:*)",
      "Bash(gcloud artifacts:*)",
      "Bash(python3:*)",
      "Bash(d2:*)",
      "Bash(wc:*)",
      "Bash(ls:*)",
      "Bash(cd:*)",
      "WebSearch"
    ],
    "ask": [
      "Bash(terraform apply:*)",
      "Bash(terraform destroy:*)",
      "Bash(helm uninstall:*)",
      "Bash(kubectl delete:*)",
      "Bash(rm:*)"
    ]
  },
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "if echo \"$TOOL_INPUT\" | grep -q '\"file_path\".*tf/.*\\.tf'; then cd /Users/chris.adkin/Projects/consul-mcp-agents && terraform fmt -recursive tf/ 2>/dev/null && echo 'terraform fmt: ok' || true; fi"
          }
        ]
      },
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "if echo \"$TOOL_INPUT\" | grep -q '\"file_path\".*tf/.*\\.tf'; then MODULE_DIR=$(echo \"$TOOL_INPUT\" | grep -o '\"file_path\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | sed 's/.*\"\\(.*\\)\"/\\1/' | xargs dirname); cd \"$MODULE_DIR\" 2>/dev/null && terraform validate 2>&1 | tail -5 || true; fi"
          }
        ]
      }
    ]
  },
  "mcpServers": {
    "terraform": {
      "command": "npx",
      "args": ["-y", "terraform-mcp-server"],
      "env": {}
    }
  }
}

Note that this forces Claude Code to ask for confirmation before running certain commands. The terraform destroy entry is particularly topical, given the post currently doing the rounds on social media about Claude dropping a database via Terraform.
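
The permission prompt is one guard against an unwanted destroy. The Safety section of terraform.md names a second one that lives in the config itself: prevent_destroy on stateful resources. A minimal sketch follows, assuming a hypothetical data bucket; the resource and its name are illustrative and not taken from the repo.

```hcl
variable "project_id" {
  description = "GCP project that owns the bucket."
  type        = string
}

variable "region" {
  description = "Bucket location."
  type        = string
}

variable "environment" {
  description = "Deployment environment, used as a name prefix."
  type        = string
}

# Hypothetical stateful resource.
resource "google_storage_bucket" "agent_data" {
  project                     = var.project_id
  name                        = "${var.environment}-mcp-agent-data"
  location                    = var.region
  uniform_bucket_level_access = true

  lifecycle {
    # Any plan that would destroy this bucket, including one produced
    # by terraform destroy, fails with an error instead of proceeding.
    prevent_destroy = true
  }
}
```

The guard only holds while the resource block remains in the configuration: deleting the block removes the lifecycle setting with it, so it complements the confirmation prompt and does not replace it.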

Note the hooks in the settings.json file for running terraform fmt and terraform validate, and finally the section for the official HashiCorp Terraform MCP server.

All of these files (the claude.md, the rules, the settings.json) represent the accumulated context from building the original repo. That raised an obvious question: how much of that repo could Claude Code reconstruct if you handed it a sufficiently detailed specification and nothing else?

Prompts: An Experiment

When tackling the initial config creation task in Claude Code, it quickly became clear that a significant amount of friction came down to prompt quality. So what does a good prompt actually look like? This was the missing piece of the puzzle.

Claude Code's answer, once it had analyzed the existing repo, was the prompt.md file below, all 711 glorious lines of it. It is included in its entirety to give a proper sense of the scope and depth of a well-crafted prompt.

# prompt.md — Recreation Prompt for consul-mcp-agents

This prompt contains enough detail to recreate the entire consul-mcp-agents repository from scratch. It is organized by component, with implementation-level specifics for each file.

---

## 1. Project Overview

Build a production-grade infrastructure project that deploys LangChain-based AI agents on GKE, secured by Consul service mesh (mTLS) and HCP Vault Dedicated (PKI CA + dynamic GCP credentials). The agents interact with GCP services (GCS, BigQuery, GCE) via MCP (Model Context Protocol) servers. Access control is enforced at three layers: Vault policies (per-user), Consul intentions (per-service), and LLM tool filtering (per-role).

### High-level architecture

```
User (browser) → ttyd web terminal → vault-mcp-agents CLI
  → Vault userpass login → role determination (operator/analyst/viewer)
  → Agent selection (data_agent or compute_agent)
  → Vault GCP token generation (5-min OAuth2 via impersonation)
  → MCP connection (stdio subprocess or SSE via Consul Connect)
  → LLM REPL (Anthropic Claude or OpenAI) with filtered tool list
  → MCP tool calls → GCP API operations → results back to user
```

### Infrastructure layers

| Layer | Component | Implementation |
|---|---|---|
| Secrets + PKI | HCP Vault Dedicated | Root CA (EC P-384, 10yr) → Intermediate CA (EC P-256, 5yr) → 72h leaf certs |
| Service mesh | Consul (external VMs, dataplane mode) | mTLS between all services, intentions for access control |
| Compute | GKE (private cluster, Calico, Workload Identity) | 3 deployments: agent, data-server, compute-server |
| Credentials | Vault GCP secrets engine | 5-min OAuth2 tokens via SA impersonation |
| Application | Python 3.11 CLI + MCP servers | Click CLI, asyncio, Starlette/uvicorn SSE transport |
| Access control | Consul intentions + Vault policies + capabilities.yaml | Three roles: operator (full), analyst (read+query), viewer (read-only) |

---

## 2. Repository Structure

```
consul-mcp-agents/
├── CLAUDE.md                    # AI assistant instructions
├── README.md                    # User-facing documentation
├── Taskfile.yml                 # Task orchestration (replaces Makefile)
├── pyproject.toml               # Python package definition
├── .gitignore
├── tf/
│   ├── modules/
│   │   ├── network/             # VPC, subnets, Cloud NAT, firewall
│   │   ├── hcp-vault/           # HVN, Vault Dedicated cluster, admin token
│   │   ├── vault-pki-consul/    # Root/Intermediate PKI, GCP auth, policies
│   │   ├── vault-config/        # K8s auth, GCP secrets engine, KV, userpass
│   │   ├── consul/              # Consul server GCE VMs
│   │   ├── gke-consul-dataplane/ # GKE cluster, node pool, Consul Helm
│   │   └── mcp-agents-k8s/     # 3 K8s deployments, RBAC, intentions
│   └── scenarios/
│       └── consul-mcp-gke/      # Main scenario wiring all modules
├── src/vault_mcp_agents/
│   ├── __init__.py
│   ├── main.py                  # Click CLI entry point
│   ├── config.py                # Dataclass config loaders
│   ├── vault_client.py          # hvac wrapper
│   ├── agent.py                 # Session orchestrator + LLM REPLs
│   └── mcp/
│       ├── __init__.py
│       ├── data_server.py       # GCS + BigQuery MCP server (7 tools)
│       └── compute_server.py    # GCE MCP server (6 tools)
├── config/
│   └── settings.yaml            # Default config for local dev
├── policies/
│   └── capabilities.yaml        # Role-based tool access matrix
├── docker/
│   ├── Dockerfile               # Multi-stage Python + Vault + ttyd
│   └── entrypoint.sh            # Container startup script
├── packer/
│   ├── gcp-almalinux-consul-server.pkr.hcl
│   ├── configs/
│   │   ├── consul-server.hcl
│   │   └── vault-agent-consul.hcl.tmpl
│   └── scripts/
│       ├── provision-vault-agent.sh
│       └── provision-consul.sh
├── docs/diagrams/
│   └── generate_diagrams.py     # Matplotlib-based architecture diagrams
└── scripts/                     # (optional helper scripts)
```

---

## 3. Terraform Modules

### 3.1 Module: network (`tf/modules/network/`)

**Purpose**: VPC infrastructure for Consul VMs and GKE.

**Files**: `main.tf`, `variables.tf`, `outputs.tf`

**Resources**:
- `google_compute_network` — custom-mode VPC (no auto subnets)
- `google_compute_subnetwork` — primary subnet with two secondary IP ranges:
  - `pods` — for GKE pod IPs (e.g., `/22`)
  - `services` — for GKE service IPs (e.g., `/22`)
  - Private Google Access enabled, flow logs (10-min aggregation, 50% sampling)
- `google_compute_router` + `google_compute_router_nat` — Cloud NAT for private instance egress (no public IPs)
- Firewall rules:
  - `consul-internal` — ports 8300-8503 TCP+UDP between consul-server tagged instances
  - `iap-ssh` — SSH (22) + HTTPS (8501) from IAP range `35.235.240.0/20`
  - `gke-master-to-nodes` — ports 443, 8443, 10250 from GKE master CIDR to nodes

**Standard inputs**: `project_id`, `name_prefix`, `region`, `subnet_cidr`, `pods_cidr`, `services_cidr`, `gke_master_cidr`

**Outputs**: `network_name`, `network_self_link`, `subnet_self_link`, `pods_range_name`, `services_range_name`, `router_name`

### 3.2 Module: hcp-vault (`tf/modules/hcp-vault/`)

**Purpose**: HCP Vault Dedicated cluster with HVN.

**Files**: `main.tf`, `variables.tf`, `outputs.tf`

**Resources**:
- `hcp_hvn` — HashiCorp Virtual Network (cloud_provider = "aws", HCP requirement)
- `hcp_vault_cluster` — Vault Dedicated cluster (tier validated: plus_small/plus_medium/plus_large/starter_small)
- `hcp_vault_cluster_admin_token` — ephemeral admin token (6h expiry, used only during Terraform apply)

**Critical notes**:
- HCP Vault runs on AWS — no GCP VPC peering possible. All connectivity via public HTTPS endpoint.
- Admin token expires after 6h. Subsequent applies >6h apart need `terraform apply -target=module.hcp_vault` to regenerate.

**Outputs**: `vault_public_endpoint_url`, `vault_private_endpoint_url` (same as public), `admin_token` (sensitive), `cluster_id`, `hvn_id`

### 3.3 Module: vault-pki-consul (`tf/modules/vault-pki-consul/`)

**Purpose**: Vault PKI certificate hierarchy for Consul Connect mTLS.

**Files**: `pki.tf`, `auth-gcp.tf`, `policy.tf`, `variables.tf`, `outputs.tf`

**Certificate hierarchy**:
```
Root CA (vault_mount "connect-root")
  - Type: pki, max_lease_ttl = 87600h (10 years)
  - Root cert: internal (key_type=ec, key_bits=384), self-signed
  - Subject: CN=Consul Connect Root CA

Intermediate CA (vault_mount "connect-intermediate")
  - Type: pki, max_lease_ttl = 43800h (5 years)
  - CSR signed by root CA
  - Subject: CN=Consul Connect Intermediate CA

Roles:
  - "consul-connect": allow_any_name=true, allow_subdomains=true, generate_lease=true,
    max_ttl=72h, key_type=ec, key_bits=256 — for service mesh leaf certs
  - "consul-server-tls": allowed_domains=["{datacenter}.consul"], allow_subdomains=true,
    max_ttl=72h, key_type=ec, key_bits=256 — for Consul server RPC/HTTPS TLS
```

**GCP auth backend**:
- `vault_gcp_auth_backend` — mount at `auth/gcp`, credentials from SA key JSON
- `vault_gcp_auth_backend_role` — type=iam, bound to consul-server SA email
- Required because HCP Vault has no GCP ADC; explicit credentials needed for JWT verification

**Policies**:
- `consul-server-policy` — full PKI access (issue, sign, read CA chain)
- `consul-connect-ca-policy` — minimal for Consul Connect CA provider

**Inputs**: `vault_addr`, `root_pki_path`, `intermediate_pki_path`, `datacenter`, `org_name`, `consul_server_sa_email`, `vault_gcp_auth_credentials_json`

### 3.4 Module: vault-config (`tf/modules/vault-config/`)

**Purpose**: Application-level Vault configuration — auth methods, secrets engines, policies, demo users.

**Files**: `auth-k8s.tf`, `gcp-engine.tf`, `auth-userpass.tf`, `kv.tf`, `variables.tf`, `outputs.tf`

**Kubernetes auth backend** (`auth-k8s.tf`):
- Gated on `var.gke_endpoint != ""` (count = 0 before GKE exists)
- Config requires: `kubernetes_host`, `kubernetes_ca_cert`, `token_reviewer_jwt`, `issuer`, `disable_iss_validation=true`, `disable_local_ca_jwt=true`
- Roles:
  - `mcp-server` — bound SAs: mcp-server, mcp-data-server, mcp-compute-server (namespace: mcp-agents)
  - `mcp-agent` — bound SA: mcp-agent (namespace: mcp-agents)
  - Token TTL: 3600s, max: 86400s
- NEVER set `audience` on roles — causes "invalid audience claim" errors with projected tokens

**GCP secrets engine** (`gcp-engine.tf`):
- Mount at `gcp`, credentials from SA key JSON
- Default and max lease: 300s (5 minutes)
- Impersonated accounts:
  - `data-agent-gcp` → data-agent SA (roles: storage.objectAdmin, bigquery.dataEditor, bigquery.jobUser)
  - `compute-agent-gcp` → compute-agent SA (roles: compute.instanceAdmin.v1)
- Impersonator SA needs: serviceAccountKeyAdmin, serviceAccountTokenCreator

**KV v2 secrets** (`kv.tf`):
```
secret/
  mcp-agents/
    config        → settings.yaml (vault addr, auth, LLM config, agent/server defs)
    policies      → capabilities.yaml (role→tool matrix)
    llm-keys      → {anthropic_api_key, openai_api_key}
  consul/
    acl-token     → {token: <bootstrap-token>}
```

**Userpass auth** (`auth-userpass.tf`):
- Three demo users (alice=operator, bob=analyst, carol=viewer)
- Each mapped to corresponding Vault policy
- Policies grant: KV read, GCP secrets engine token generation for allowed rolesets

### 3.5 Module: consul (`tf/modules/consul/`)

**Purpose**: External Consul control plane on GCE VMs (no Consul servers in GKE).

**Files**: `main.tf`, `variables.tf`, `outputs.tf`, `templates/consul-server-startup.sh.tpl`

**Resources**:
- `data.google_compute_image` — latest from `almalinux-consul-server-vault` family
- `google_compute_instance` (count = var.instance_count, validated to 1/3/5):
  - No public IP, Shielded VM, tags: consul-server, vault-client
  - GCP metadata attributes: vault-address, vault-gcp-auth-role, vault-root-pki-path, vault-inter-pki-path, consul-datacenter, consul-bootstrap-expect, consul-retry-join
  - Startup script: renders runtime-config.hcl (node_name, bootstrap_expect, retry_join via GCP internal DNS)
  - `lifecycle { ignore_changes = [boot_disk.image] }` — prevents recreation on Packer rebuild

**SA is NOT created in this module** — accepted as `consul_server_sa_email` input. Created in the scenario's `vault-pki.tf` to allow Vault GCP auth binding before VMs boot.

### 3.6 Module: gke-consul-dataplane (`tf/modules/gke-consul-dataplane/`)

**Purpose**: Private GKE cluster configured for Consul dataplane mode.

**Files**: `cluster.tf`, `consul-helm.tf`, `variables.tf`, `outputs.tf`

**GKE cluster** (`cluster.tf`):
- Uses native `google_container_cluster` + `google_container_node_pool` — NOT the community module
- Private cluster: private nodes, public endpoint (for kubectl access)
- Calico network policy (required for Consul Connect intentions)
- Workload Identity enabled
- Release channel: STABLE (auto-managed versions)
- Node pool: e2-standard-4, max_pods_per_node=32, UBUNTU_CONTAINERD, 100GB SSD
- Shielded nodes, auto-repair, auto-upgrade, max_surge=1, max_unavailable=0
- Workload Identity bindings for data-agent and compute-agent SAs

**Consul Helm** (`consul-helm.tf`, gated on `var.gke_endpoint != ""`):
- `kubernetes_namespace.consul` + bootstrap ACL token secret + CA cert secret
- `helm_release.consul` — chart version 1.9.2 (default)
- Config: TLS enabled, HTTPS only, external servers (Consul VM IPs), port 8501/8503
- Server/client disabled (dataplane mode — only connect-inject + sidecar proxies)
- Connect inject: enabled (explicit opt-in via pod annotation)
- Sync catalog: K8s→Consul, **excludes mcp-agents namespace** (uses connect-inject instead; without exclusion, sync-catalog registers duplicate services → connect-init fails)

### 3.7 Module: mcp-agents-k8s (`tf/modules/mcp-agents-k8s/`)

**Purpose**: MCP AI agent Kubernetes deployments (3 services with Consul sidecars).

**Files**: `namespace.tf`, `rbac.tf`, `deployment.tf`, `service.tf`, `consul.tf`, `vault-agent-config.tf`, `variables.tf`, `outputs.tf`

**Three deployments** (all gated on `var.gke_cluster_ready`):

1. **mcp-agent** — CLI + ttyd web terminal
   - Init container: vault-agent (exit-after-auth)
   - Main container: Python CLI on port 7681
   - Consul annotation: `connect-service-upstreams = "mcp-data-server:20000,mcp-compute-server:20001"`
   - vault-agent renders: settings.yaml (SSE URLs), capabilities.yaml, LLM API keys
   - Service: LoadBalancer (port 80 → 7681)

2. **mcp-data-server** — GCS + BigQuery MCP server
   - Init container: vault-agent (exit-after-auth)
   - Main container: Python SSE server on port 8080 (`MCP_TRANSPORT=sse`)
   - vault-agent renders: GCP access token
   - Service: ClusterIP (port 8080)

3. **mcp-compute-server** — GCE MCP server
   - Same pattern as data-server
   - Service: ClusterIP (port 8080)

**vault-agent ConfigMaps** (3 separate, in `vault-agent-config.tf`):
- Agent config: templates for settings.yaml, capabilities.yaml, anthropic-key, openai-key, .ready sentinel
- Data-server config: template for GCP access token
- Compute-server config: template for GCP access token

**Consul intentions** (`consul.tf`):
- `mcp-agent → mcp-data-server`: allow
- `mcp-agent → mcp-compute-server`: allow
- Default: deny

**PodDisruptionBudget**: min_available=1 for mcp-agent and each mcp-server

### 3.8 Scenario: consul-mcp-gke (`tf/scenarios/consul-mcp-gke/`)

**Purpose**: Wires all 7 modules together with phased apply and explicit dependencies.

**Files**: `versions.tf`, `locals.tf`, `variables.tf`, `network.tf`, `hcp-vault.tf`, `vault-pki.tf`, `vault-config.tf`, `consul.tf`, `gke.tf`, `mcp-agents.tf`, `artifact-registry.tf`, `outputs.tf`

**Provider configuration** (`versions.tf`):
- Terraform >= 1.9.0
- Providers: google (~> 5.0), hcp (~> 0.94), vault (~> 4.0), kubernetes (~> 2.0), helm (~> 2.0), consul (~> 2.0)
- Vault provider: uses HCP admin token
- Kubernetes/Helm providers: use `data.google_client_config.default.access_token` + cluster endpoint from `data.google_container_cluster.main` (gated on `gke_cluster_ready`)
- Consul provider: uses first Consul server internal IP + Vault CA + bootstrap token

**Naming** (`locals.tf`):
```
name_prefix = "{random_pet}-{datacenter}"    # e.g., "happy-panda-dc1"
short_prefix = "{random_pet}"
sa_prefix = trimsuffix(substr(short_prefix, 0, 14), "-")  # GCP SA ID ≤ 30 chars
common_labels = {project, environment, managed_by, scenario, datacenter}
```
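
To make the 30-character budget concrete, the `sa_prefix` expression can be mirrored in Python. This is only an illustration of the arithmetic: the `-consul-server` suffix mentioned below is a hypothetical example, since the actual account ID suffixes are not listed in this specification.

```python
GCP_SA_ID_MAX = 30  # service account ID limit noted in the locals above


def sa_prefix(short_prefix: str, length: int = 14) -> str:
    """Python equivalent of trimsuffix(substr(short_prefix, 0, length), "-")."""
    # Slice first, then drop a single trailing hyphen left behind by the cut.
    return short_prefix[:length].removesuffix("-")
```

With a 14-character prefix cap, a suffix of up to 16 characters still fits inside the limit. A hypothetical 14-character suffix such as `-consul-server` gives a worst case of 28 characters.
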

**Phase gating**:
- `gke_cluster_ready = false` → `module.mcp_agents` count=0, `data.google_container_cluster` count=0
- Set to `true` in tfvars after GKE cluster exists (automated by `task gke:enable-phase-gate`)
- `gke_endpoint = ""` → Consul Helm count=0, Vault K8s auth count=0

**Dependency order (enforced by Terraform targets in Taskfile)**:
```
Phase 1a: network, hcp_vault, SAs
Phase 1b: vault_pki, vault_config
Phase 1c: consul (VMs)
Phase 2a: gke (cluster only, no K8s resources)
Phase 2b: gke (Consul Helm + K8s resources after kubeconfig + phase gate)
Phase 3:  vault K8s auth, mcp_agents
Phase 4:  full apply (reconcile)
```

**Scenario-level resources** (not in modules):
- `google_service_account.consul_server` — Consul VM identity (bound to Vault GCP auth)
- `google_service_account.vault_gcp_verifier` + key — Vault GCP auth credentials
- `kubernetes_service_account.vault_reviewer` + CRB + Secret — Vault K8s auth TokenReview
- `google_artifact_registry_repository.mcp` — Docker image repo

---

## 4. Python Application

### 4.1 pyproject.toml

```toml
[project]
name = "vault-mcp-agents"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "anthropic>=0.40.0,<1.0",
    "openai>=1.0.0,<3.0",
    "mcp>=1.6.0,<2.0",
    "hvac>=2.1.0,<3.0",
    "google-cloud-storage>=2.0.0,<3.0",
    "google-cloud-bigquery>=3.0.0,<4.0",
    "google-cloud-compute>=1.14.0,<2.0",
    "google-auth>=2.0.0,<3.0",
    "pyyaml>=6.0,<7.0",
    "click>=8.0,<9.0",
    "rich>=13.0,<14.0",
    "starlette>=0.27.0,<1.0",
    "uvicorn>=0.24.0,<1.0",
    "httpx-sse>=0.4.0,<1.0",
    "sqlglot>=25.0.0,<27.0",
]

[project.scripts]
vault-mcp-agents = "vault_mcp_agents.main:cli"
```

### 4.2 main.py — CLI Entry Point

Click command with two required options:
- `--config` (Path, must exist) → path to settings.yaml
- `--policies` (Path, must exist) → path to capabilities.yaml

Calls `asyncio.run(run_agent_session(config, policies))`. Catches `KeyboardInterrupt` for clean exit.

### 4.3 config.py — Configuration Loading

Dataclass-based configuration with YAML loader:

**Dataclasses**:
- `VaultConfig` — address, namespace, auth_method, gcp_secrets_mount, agent_approle_mount
- `GcpConfig` — project_id, region
- `LlmConfig` — provider (anthropic/openai), model, temperature
- `AgentDef` — description, mcp_server (reference key), gcp_impersonated_account (Vault roleset name)
- `McpServerDef` — transport (stdio/sse), command, args, url
- `Settings` — vault, gcp, llm, agents (dict), mcp_servers (dict)
- `AgentRolePolicy` — allowed_tools (list), max_gcp_token_ttl
- `RolePolicy` — vault_policy, agents (dict of AgentRolePolicy)
- `Policies` — roles (dict of RolePolicy)

**Functions**:
- `load_settings(path: Path) -> Settings` — yaml.safe_load → map to nested dataclasses
- `load_policies(path: Path) -> Policies` — yaml.safe_load → map to nested dataclasses
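
A minimal sketch of the "map to nested dataclasses" step for the policies side, under stated assumptions: the top-level `roles` key and the per-role `agents` mapping are inferred from the dataclass field names above, and `policies_from_dict` is a hypothetical helper that would receive the result of `yaml.safe_load`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentRolePolicy:
    allowed_tools: list[str] = field(default_factory=list)
    max_gcp_token_ttl: str = "5m"  # assumed default, matching capabilities.yaml


@dataclass
class RolePolicy:
    vault_policy: str
    agents: dict[str, AgentRolePolicy] = field(default_factory=dict)


@dataclass
class Policies:
    roles: dict[str, RolePolicy] = field(default_factory=dict)


def policies_from_dict(raw: dict[str, Any]) -> Policies:
    """Map a parsed YAML document (a plain dict) onto the nested dataclasses."""
    roles = {}
    for role_name, role in raw.get("roles", {}).items():
        agents = {
            agent_name: AgentRolePolicy(**agent)
            for agent_name, agent in role.get("agents", {}).items()
        }
        roles[role_name] = RolePolicy(vault_policy=role["vault_policy"], agents=agents)
    return Policies(roles=roles)
```

Passing unknown keys to `AgentRolePolicy(**agent)` raises a `TypeError`, which doubles as a cheap schema check on the YAML.
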

### 4.4 vault_client.py — Vault Integration

Class `VaultClient` wrapping `hvac.Client`:

- `__init__(address, namespace="")` → creates hvac.Client
- `login_userpass(username, password) -> dict` → hvac auth, stores token, returns auth dict
- `token -> Optional[str]` — property returning current token
- `get_policies() -> list[str]` — token lookup, extract policies
- `determine_role(policies) -> Optional[str]` — checks for "operator-policy", "analyst-policy", "viewer-policy" in order
- `generate_gcp_token(gcp_mount, roleset) -> Optional[str]` — reads `{mount}/token/{roleset}`, returns OAuth2 token string. Retries up to 3 times with exponential backoff (`_MAX_RETRIES=3`, `_RETRY_BACKOFF_BASE=1.0s`). Returns None after all attempts fail.
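
The retry behaviour described for `generate_gcp_token` could look roughly like the sketch below. The constant names come from the description above; the doubling schedule, the broad `except`, and the `with_retries` helper are assumptions, and `fetch` stands in for the hvac read of `{mount}/token/{roleset}`.

```python
import time
from typing import Callable, Optional

_MAX_RETRIES = 3
_RETRY_BACKOFF_BASE = 1.0  # seconds


def with_retries(
    fetch: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() up to _MAX_RETRIES times; return None if every attempt fails."""
    for attempt in range(_MAX_RETRIES):
        try:
            return fetch()
        except Exception:
            # No point sleeping after the final attempt.
            if attempt < _MAX_RETRIES - 1:
                # Delay doubles per attempt: base * 2**attempt.
                sleep(_RETRY_BACKOFF_BASE * (2 ** attempt))
    return None
```

Injecting `sleep` keeps the helper testable without real delays.
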

### 4.5 agent.py — Session Orchestrator

Main function: `async def run_agent_session(config: Path, policies: Path) -> None`

**Flow**:
1. Load settings + policies from YAML
2. `_prompt_login(vault)` → Rich console prompts for username/password
3. `vault.login_userpass()` → get policies, display to user
4. `vault.determine_role(policies)` → operator/analyst/viewer (exit if unrecognized)
5. `_select_agent(settings, role, policies)` → numbered menu of available agents for role
6. Look up `AgentRolePolicy.allowed_tools` for selected agent+role
7. `vault.generate_gcp_token(gcp_mount, gcp_impersonated_account)` → set `GOOGLE_ACCESS_TOKEN` + `GCP_PROJECT_ID` env vars
8. `_connect()` — asynccontextmanager:
   - If `transport == "sse"` and url set → `sse_client(url)` (HTTP to remote server)
   - Else → `stdio_client(StdioServerParameters(command, args, env))` (subprocess)
9. `ClientSession(read, write)` → `session.initialize()`
10. Select REPL based on `settings.llm.provider`:
    - `_run_anthropic_repl(session, allowed_tools, model, temperature, api_key)`
    - `_run_openai_repl(session, allowed_tools, model, temperature, api_key)`

**LLM REPL loop** (both providers follow same pattern):
1. Fetch tools via `session.list_tools()`, filter to `allowed_tools`
2. Build provider-specific tool schema
3. Loop: Rich prompt for user input → API call → check for tool_use/tool_calls
4. If tool call: `session.call_tool(name, arguments)` → append result → continue API loop
5. If text response: print to console, back to user input
6. Special commands: "exit", "quit" to break

### 4.6 data_server.py — GCS + BigQuery MCP Server

`mcp.server.Server` instance with `@server.list_tools()` and `@server.call_tool()` handlers.

**Transport detection**:
```python
if __name__ == "__main__":
    transport = os.environ.get("MCP_TRANSPORT", "stdio")
    if transport == "sse":
        _run_sse()    # Starlette app on 0.0.0.0:{MCP_PORT}
    else:
        asyncio.run(_run_stdio())  # mcp stdio_server
```

**SSE mode**: Starlette app with routes:
- `GET /sse` → SSE connection endpoint
- `POST /messages/` → message handling (SseServerTransport)
- uvicorn server on `0.0.0.0:{MCP_PORT}` (default 8080)

**GCP credential flow**: `_storage_client()` / `_bq_client()` check `GOOGLE_ACCESS_TOKEN` env var → `OAuthCredentials(token=...)` if set, else ADC fallback.

**7 tools**:

| Tool | Args | Returns |
|---|---|---|
| `list_buckets` | (none) | JSON array of bucket names |
| `read_object` | bucket, object | Text content of object |
| `write_object` | bucket, object, content, [content_type] | "Written N bytes to gs://..." |
| `delete_object` | bucket, object | "Deleted gs://..." |
| `list_datasets` | (none) | JSON array of dataset IDs |
| `create_dataset` | dataset_id, [location] | "Dataset 'X' created in Y" |
| `query_bigquery` | query, [max_results=100] | JSON array of result rows (DML/DDL blocked via sqlglot AST) |

**Security**: `_validate_required()` validates args before GCP calls. `_is_read_only_query()` uses sqlglot AST parsing to block DML/DDL (immune to comment/string bypass). `maximum_bytes_billed` cost cap via `BQ_MAX_BYTES_BILLED` env var (default 1GB). All sync GCP calls wrapped in `asyncio.to_thread()`. Errors return `type(exc).__name__` only; details logged server-side.

### 4.7 compute_server.py — GCE MCP Server

Same SSE/stdio transport pattern as data_server.

**GCP credential flow**: `_instance_client()` checks `GOOGLE_ACCESS_TOKEN` → `compute_v1.InstancesClient(credentials=...)` or default.

**6 tools**:

| Tool | Args | Returns |
|---|---|---|
| `list_instances` | [zone] | JSON array of instance summaries |
| `get_instance` | instance, [zone] | JSON instance detail |
| `start_instance` | instance, [zone] | "Start operation submitted: ..." |
| `stop_instance` | instance, [zone] | "Stop operation submitted: ..." |
| `create_instance` | instance, [machine_type=e2-micro], [zone] | "Create operation submitted: ..." |
| `delete_instance` | instance, [zone] | "Delete operation submitted: ..." |

`create_instance` uses Debian 12 cloud image, default VPC network. Machine type validated against `ALLOWED_MACHINE_TYPES` whitelist (env var, default e2/n2 types). Returns GCP operation name (async).

**Security**: `_validate_required()` validates args before GCP calls. `ALLOWED_MACHINE_TYPES` whitelist blocks unauthorized instance types. All sync GCP calls wrapped in `asyncio.to_thread()`. Errors return `type(exc).__name__` only; details logged server-side.
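
As an illustration of the `ALLOWED_MACHINE_TYPES` check, a sketch might look like this. The comma-separated format and the specific default list are assumptions (the text only says "e2/n2 types"), and both function names are hypothetical.

```python
import os

# Assumed default list; the specification only states "e2/n2 types".
_DEFAULT_ALLOWED = "e2-micro,e2-small,e2-medium,n2-standard-2"


def allowed_machine_types() -> frozenset[str]:
    """Parse the comma-separated ALLOWED_MACHINE_TYPES environment variable."""
    raw = os.environ.get("ALLOWED_MACHINE_TYPES", _DEFAULT_ALLOWED)
    return frozenset(t.strip() for t in raw.split(",") if t.strip())


def validate_machine_type(machine_type: str) -> None:
    """Raise ValueError before any GCP call if the type is not in the allowed set."""
    if machine_type not in allowed_machine_types():
        raise ValueError(f"machine type not allowed: {machine_type}")
```

Reading the variable on each call, rather than once at import time, means a changed environment takes effect without reloading the module.
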

Helper `_instance_summary(inst)` extracts: name, status, machine_type (short), zone (short), network_ip, creation_timestamp.

---

## 5. Configuration Files

### 5.1 config/settings.yaml (local dev default)

```yaml
vault:
  address: "http://127.0.0.1:8200"
  namespace: ""
  auth_method: "userpass"
  gcp_secrets_mount: "gcp"
  agent_approle_mount: "approle"

gcp:
  project_id: ""
  region: "us-central1"

llm:
  provider: "anthropic"
  model: "claude-sonnet-4-6"
  temperature: 0

agents:
  data_agent:
    description: "Handles GCS and BigQuery operations"
    mcp_server: "data_server"
    gcp_impersonated_account: "data-agent-gcp"
  compute_agent:
    description: "Handles GCE instance and infrastructure operations"
    mcp_server: "compute_server"
    gcp_impersonated_account: "compute-agent-gcp"

mcp_servers:
  data_server:
    transport: stdio
    command: python
    args: ["-m", "vault_mcp_agents.mcp.data_server"]
  compute_server:
    transport: stdio
    command: python
    args: ["-m", "vault_mcp_agents.mcp.compute_server"]
```

GKE variant (rendered by vault-agent): `transport: sse`, `url: "http://localhost:20000/sse"` (data) / `"http://localhost:20001/sse"` (compute).

### 5.2 policies/capabilities.yaml

Three roles with per-agent tool access:

- **operator**: all 7 data tools + all 6 compute tools
- **analyst**: list_buckets, read_object, query_bigquery, list_datasets + list_instances, get_instance
- **viewer**: list_buckets, read_object, list_datasets + list_instances

All roles: `max_gcp_token_ttl: "5m"` (dual-layer TTL enforcement with Vault's 300s lease).

Each role maps to a `vault_policy` name: operator-policy, analyst-policy, viewer-policy.
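
The role matrix above can be written out as data to check the counts and show how filtering would work. The tool names are transcribed from the tool tables in section 4. The `data_agent` and `compute_agent` keys are borrowed from settings.yaml, and `filter_tools` is a hypothetical helper, not necessarily how agent.py does it.

```python
# Role -> agent -> allowed tools, transcribed from the role list above.
CAPABILITIES = {
    "operator": {
        "data_agent": ["list_buckets", "read_object", "write_object", "delete_object",
                       "list_datasets", "create_dataset", "query_bigquery"],
        "compute_agent": ["list_instances", "get_instance", "start_instance",
                          "stop_instance", "create_instance", "delete_instance"],
    },
    "analyst": {
        "data_agent": ["list_buckets", "read_object", "query_bigquery", "list_datasets"],
        "compute_agent": ["list_instances", "get_instance"],
    },
    "viewer": {
        "data_agent": ["list_buckets", "read_object", "list_datasets"],
        "compute_agent": ["list_instances"],
    },
}


def filter_tools(role: str, agent: str, server_tools: list[str]) -> list[str]:
    """Keep only the server-advertised tools that the role may use with this agent."""
    allowed = set(CAPABILITIES.get(role, {}).get(agent, []))
    return [name for name in server_tools if name in allowed]
```

An unknown role or agent yields an empty list, so the failure mode is "no tools" rather than "all tools".
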

---

## 6. Docker

### 6.1 Dockerfile (multi-stage)

**Stage 1 — builder**:
- Base: `python:3.11-slim`
- Install: build-essential, git
- Create venv at `/opt/venv`, install package

**Stage 2 — runtime**:
- Base: `python:3.11-slim`
- Install: ca-certificates, curl, unzip, tini, bash
- Download Vault binary (ARG `VAULT_VERSION=1.19.0`) → `/usr/local/bin/vault` (SHA256 verified)
- Download ttyd binary (ARG `TTYD_VERSION=1.7.7`) → `/usr/local/bin/ttyd` (SHA256 verified)
- User: mcpuser (UID 1000)
- Copy venv from builder
- Copy app source to `/app`
- Volumes: `/vault/secrets`, `/tmp`
- Env: `MCP_CONFIG_PATH`, `MCP_POLICIES_PATH`, `MCP_TRANSPORT=stdio`, `MCP_PORT=8080`, `PYTHONDONTWRITEBYTECODE=1`, `PYTHONUNBUFFERED=1`
- Expose: 7681
- Entrypoint: `tini --`
- CMD: `/app/docker/entrypoint.sh`

### 6.2 entrypoint.sh

Four-step startup:

1. **Wait for vault-agent** — polls `/vault/secrets/.ready` up to `VAULT_AGENT_TIMEOUT` seconds (default 120). If timeout: exits with error unless `ALLOW_BUNDLED_FALLBACK=true`, in which case falls back to bundled `/app/config/` files.
2. **Source LLM API keys** — reads raw key values from `/vault/secrets/anthropic-key` and `/vault/secrets/openai-key` via `cat` (not `source`).
3. **Validate config** — exits with error if config file not found.
4. **Start ttyd** — wraps `vault-mcp-agents` CLI on port 7681 (`--writable --once`). Optional `TTYD_CREDENTIAL` for basic auth. Env vars exported to child shell (avoids interpolating secrets into `bash -c` strings — shell injection prevention). Displays ASCII login banner before starting CLI.

---

## 7. Packer Image

### 7.1 gcp-almalinux-consul-server.pkr.hcl

- Plugin: `github.com/hashicorp/googlecompute >= 1.1.0`
- Source: AlmaLinux 8, e2-standard-2, 30GB pd-ssd
- Variables: `consul_version` (default 1.20.2), `vault_version` (default 1.19.0)
- Output image family: `almalinux-consul-server-vault`

Build steps:
1. Upload `consul-server.hcl` and `vault-agent-consul.hcl.tmpl` to `/tmp/`
2. Run `provision-vault-agent.sh` (install vault binary, systemd services, first-boot renderer)
3. Run `provision-consul.sh` (install consul binary, systemd services, user/group setup)
4. Validation script (check versions, systemd enabled)

### 7.2 consul-server.hcl

Consul server config baked into image:
- `server = true`, `client_addr = "0.0.0.0"`, `bind_addr = "{{ GetInterfaceIP \"eth0\" }}"`
- TLS: enabled (verify_incoming=false for clients, verify_outgoing=true, verify_server_hostname=true for RPC)
- ACL: enabled, default_policy=deny, enable_token_persistence=true
- Connect: enabled (CA provider config rendered separately by vault-agent)
- UI: enabled
- Performance: raft_multiplier=1
- Telemetry: prometheus_retention_time=60s

### 7.3 vault-agent-consul.hcl.tmpl

Template with `__PLACEHOLDER__` values replaced at first boot from GCP instance metadata.

**Auto-auth**: GCP IAM method → writes Vault token to `/opt/vault/vault-token`

**Templates rendered**:
1. `/etc/consul.d/connect-ca.hcl` — Consul Connect CA provider config pointing to Vault PKI (root + intermediate paths, token from file, leaf_cert_ttl=72h)
2. `/etc/consul.d/tls-certs.hcl` — uses `pkiCert` to issue server TLS cert (CN=server.{dc}.consul), writes cert/key/CA-chain to `/etc/consul.d/tls/` with `chgrp consul` post-render + `systemctl reload consul`
3. `/tmp/vault-agent-ready` — sentinel file signaling all configs rendered

### 7.4 provision-vault-agent.sh

- Downloads Vault binary, installs to `/usr/local/bin`, sets `cap_ipc_lock=+ep`
- Creates vault user (system, home=/opt/vault)
- Creates `/usr/local/bin/render-vault-agent-config.sh` — reads GCP instance metadata, `sed` replaces __PLACEHOLDERS__ in template
- Creates systemd services:
  - `vault-agent-config.service` (Type=oneshot, runs renderer)
  - `vault-agent.service` (runs vault agent, Requires vault-agent-config, Before consul)

### 7.5 provision-consul.sh

- Downloads Consul binary, installs to `/usr/local/bin`, sets `cap_ipc_lock=+ep`
- Creates consul user (system, home=/opt/consul)
- Group membership: `usermod -aG consul vault` + `usermod -aG vault consul` (bidirectional access for rendered TLS files)
- Directory permissions: `/etc/consul.d/` mode 0770, `/etc/consul.d/tls/` owned consul:vault mode 2770 (setgid)
- Creates systemd service: `consul.service` (Requires vault-agent, After vault-agent)

---

## 8. Taskfile.yml — Orchestration

Task-based orchestration replacing Makefile. Key tasks:

| Task | Purpose |
|---|---|
| `all` | Full deployment: preflight → token → backend → packer → phase1 → ACL bootstrap → phase2 → post-apply → phase4 → summary |
| `destroy` | tf:destroy + packer:destroy |
| `post-apply` | kubeconfig → vault K8s auth → docker build+push → phase3 → URL |
| `preflight` | Check CLI tools + GCP auth + Docker credential helper |
| `phase1:apply` | Network + HCP Vault + PKI + Vault Config + Consul VMs |
| `phase2:apply` | GKE cluster + Consul Helm |
| `phase3:apply` | MCP agent K8s resources |
| `phase4:apply` | Full reconcile apply |
| `tf:fix-k8s-identity` | Workaround for K8s provider identity bug |
| `consul:bootstrap-acl` | Bootstrap ACLs, update tfvars + K8s secret |
| `gke:enable-phase-gate` | Verify GKE RUNNING + reachable, set gke_cluster_ready=true |
| `docker:build` | Multi-platform Docker build (linux/amd64) |
| `docker:push` | Auth to Artifact Registry + push |

**Helper**: `tfvars:set` (internal) — idempotent key=value writer for terraform.tfvars, used by multiple tasks.
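
"Idempotent" here means that running the writer twice with the same key and value leaves the file unchanged. The real helper is a Taskfile task; the Python sketch below only illustrates that property. The `tfvars_set` name is hypothetical, and the value is assumed to arrive already formatted as an HCL literal (for example `true` or `"abc"`).

```python
import re
from pathlib import Path


def tfvars_set(path: Path, key: str, value: str) -> None:
    """Set key = value in a tfvars file, replacing an existing assignment if present."""
    line = f"{key} = {value}"
    # Match only a single-line assignment of exactly this key.
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=.*$")
    lines = path.read_text().splitlines() if path.exists() else []
    replaced = False
    for i, existing in enumerate(lines):
        if pattern.match(existing):
            lines[i] = line
            replaced = True
    if not replaced:
        lines.append(line)
    path.write_text("\n".join(lines) + "\n")
```

This sketch handles single-line assignments only; multi-line maps or lists in a tfvars file would need a real HCL parser.
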

---

## 9. Architecture Diagrams

`docs/diagrams/generate_diagrams.py` generates PNG diagrams at 300 DPI with dark background (#0D1117):

1. **overall-architecture.png** — Full stack from HCP Vault → GKE → users
2. **vault-pki-chain.png** — Root CA → Intermediate CA → leaf certs with TTLs
3. **architecture-gke.png** — GKE-focused architecture view
4. **deployment-sequence.png** — 6 deployment phases with task commands

Color palette: Vault purple, HCP teal, Consul pink, GCP blue, GKE green, MCP orange, User gold.

---

## 10. Key Design Decisions

1. **External Consul control plane** (VMs, not in GKE) — decouples mesh control plane from compute; Consul servers can manage multiple clusters; dataplane mode minimizes GKE resource usage.

2. **Vault PKI as Consul Connect CA** (not Consul built-in) — centralized certificate management, CRL/OCSP revocation, consistent PKI hierarchy across services, audit logging.

3. **Three separate deployments** (not one pod) — SSE transport requires separate network endpoints for Consul service mesh intentions. Stdio mode uses subprocess within agent pod; SSE mode uses Consul Connect upstreams to separate pods.

4. **RBAC at LLM level** (not MCP API level) — simpler implementation, MCP servers are generic. Tool list filtered before passing to LLM. Trust model: Consul intentions prevent unauthorized service-to-service access; within an authorized connection, tool filtering is advisory (enforced by LLM prompt, not API).

5. **GCP SA impersonation** (not direct credentials) — Vault generates short-lived OAuth2 tokens via impersonation chain. No long-lived GCP credentials in pods. 5-minute TTL limits blast radius.

6. **Native GKE resources** (not community module) — the cluster is Consul-dataplane-opinionated (Calico, pod density, per-namespace Workload Identity). ~200 lines, fully readable. Migration cost exceeds benefit.

7. **Phase-gated Terraform** — GKE cluster must exist before K8s/Helm resources can be created. `gke_cluster_ready` boolean gates `data.google_container_cluster` lookup and `module.mcp_agents` count.

---

## 11. Critical Implementation Notes

- **HCP Vault runs on AWS** — no GCP VPC peering. All traffic via public HTTPS endpoint.
- **Admin token expires in 6h** — re-apply from HCP module to regenerate.
- **Consul provider needs IAP tunnel** from outside VPC — `task consul:tunnel` on localhost:18501.
- **SA IDs capped at 30 chars** — `sa_prefix = trimsuffix(substr(short_prefix, 0, 14), "-")`.
- **Never set `audience` on Vault K8s roles** — projected SA tokens use cluster-specific audience.
- **`vault write auth/kubernetes/config` replaces ALL fields** — never omit kubernetes_ca_cert or issuer.
- **Terraform can't detect sensitive value changes** — use `terraform taint` after manual updates.
- **Consul sync-catalog excludes mcp-agents namespace** — prevents duplicate service registration.
- **vault-agent writes as vault:vault 0640** — Consul reads via group membership + post-render chgrp.
- **GKE lifecycle ignores** node_version, initial_node_count, min_master_version (auto-managed by release channel).
- **Consul VM lifecycle ignores** boot_disk image (prevents recreation on Packer rebuild).

## 12. Required GCP APIs

Enable before deploying: compute, container, iam, storage, artifactregistry, cloudresourcemanager, servicenetworking.

## 13. Testing & CI/CD

The repo currently has **no automated tests or CI/CD pipeline**. Verification is manual:
- `terraform validate` + `terraform fmt -recursive` for Terraform
- `task preflight` for environment readiness
- `task consul:check-acl` for Consul health
- `kubectl get pods -n mcp-agents` for deployment status
- Manual agent session test via ttyd web terminal

This raised another interesting question: if Claude Code used this prompt to create the repo from scratch, how would the result compare to the original repo?

With the prompt file, claude.md, settings.json and terraform.md as the basis of a test environment, it was time to run an experiment. For reference purposes, the original repo was also produced by Claude Code, but it had undergone considerable testing and deep analysis using Opus 4.6.

use the prompt.md file in the current working directory and the settings.json and claude.md file in ~/Projects/mcp-consul-agents/.claude to create a new repo => ~/Projects/mcp-consul-agents-prompt and populate it with packer files and terraform configs to reflect the specification in the prompt.md file

Once the new consul-mcp-agents-prompt repo was complete, Claude Code was asked to:

perform a deep code analysis comparison between ~/Projects/consul-mcp-agents and ~/Projects/consul-mcp-agents-prompt and recommend which one is the most production grade repo ? 

These are the criteria by which Claude Code compared the two repos, copied and pasted from Claude Code verbatim:

  1. Terraform correctness — do the configs parse, plan, and apply without errors?

  2. Security posture — are IAM chains complete? Are blast radiuses appropriate?

  3. Completeness — does the repository represent a full, deployable system?

  4. Operational runability — can an operator actually deploy this?

  5. Code quality — naming, variable consistency, safety, readability

  6. Known pitfall handling — does the AI reproduce the lessons that took months to learn?

  7. Packer image quality — build security, reproducibility, validation

This is the detailed gap analysis table Claude Code produced:

| Area | Original | AI-Generated |
| --- | --- | --- |
| Backend configuration | String placeholder (works) | `var.*` reference (fails init) |
| GCP secrets engine | `vault_gcp_secret_backend` only | Double-mount conflict |
| GCP secrets engine | Created with IAM bindings | Never created |
| Vault policy paths | Parameterised via variable | Hardcoded mount path |
| Issuer URL | Computed and passed | Missing (empty string) |
| Consul TLS scope | Precisely scoped to server subdomain | Overly broad (all subdomains) |
| Intermediate CA TTL | 5 years (correct) | 5 years (correct) |
| Binary verification | No checksum | SHA256 verified |
| Packer network | Public build VM | IAP-only, no public IP |
| substr safety | Safe `min(length(...))` guard | Potential panic |
| Python application | Complete | Absent |
| Dockerfile | Present | Absent |
| Deployment automation | `task all` | No Taskfile |
| Input validation | Minimal | Additional validation blocks |
| tfvars example | Present | Present |
| vault-reviewer SA placement | Correct (scenario root) | Correct (scenario root) |
| TLS chgrp | Correct | Correct |
| audience omission | Correct | Correct |
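
The "Backend configuration" row deserves a concrete illustration, because Terraform does not allow a backend block to reference input variables. The following is a minimal sketch only: the GCS backend type, the bucket placeholder and the prefix are assumptions made for illustration and are not taken from either repo.

```hcl
# Rejected at `terraform init`: the backend block is processed before
# input variables are available, so var.* references are not allowed here.
#
# terraform {
#   backend "gcs" {
#     bucket = var.state_bucket # hypothetical variable name
#   }
# }

# Accepted: literal values only. Both values below are placeholders.
terraform {
  backend "gcs" {
    bucket = "REPLACE_WITH_STATE_BUCKET"
    prefix = "consul-mcp-agents/phase1"
  }
}
```

If the bucket needs to differ per environment, partial backend configuration keeps the block free of variables: omit `bucket` from the block and supply it at init time with `terraform init -backend-config="bucket=<name>"`.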

Finally, this is the conclusion Claude Code gave:

The honest takeaway is not that AI-generated infrastructure is bad. It is that current AI code generation is strong at producing plausibly correct structure and weak at producing certainly correct behaviour under operational conditions.

The AI's Terraform is structurally coherent. It uses the right resource types, reasonable variable design, and good for_each patterns. A Terraform practitioner reading it would recognise it as competent work. But it fails to deploy — not because of one oversight, but because of five independent blockers that compound. The backend syntax error alone is something that any engineer who has actually run terraform init would not make, because they have seen this error message.

The known pitfall handling is the most revealing dimension. The AI correctly reproduced three non-obvious operational lessons (vault-reviewer SA placement, TLS chgrp, audience omission) but regressed on the issuer URL — which is arguably the most operationally painful of the four, because the failure mode (pods stuck in Init:0/1) is hard to connect to the root cause without reading Vault audit logs.

The completeness gap is fundamental. A specification document can describe architecture. It cannot substitute for the feedback loop of actually building, deploying, debugging, and iterating on a running system. The Python application, the Dockerfile, the entrypoint script, the Taskfile — these are not artefacts that exist before the infrastructure is deployed. They emerge from the act of making the infrastructure work.

Where the AI won — SHA256 verification, IAP-only builds, correct CA TTLs, input validations — these are patterns with clear prior art in documentation and examples. They are things the original developer should have done and did not, possibly because operational urgency crowded out security hygiene. The AI, unburdened by a 2am Consul restart, wrote the stricter version.

The practical model: AI-generated infrastructure is useful for bootstrapping structure and surfacing patterns, but requires rigorous review against the specific failure modes of the target platform. For a stack this complex — PKI hierarchies, Workload Identity, projected token audiences, vault-agent write permissions — that review is not a skim. It is an audit.

The diff does not lie. Five deployment blockers, one missing application, one issuer URL. Also: SHA256 verification the original author forgot to write.

Both of those things can be true at the same time.

Large Language Models: Do They Have a Technology Sweet Spot?

A working assumption of mine is that most large language models have a sweet spot - a set of programming languages and technology stacks they handle particularly well. Aside from the fact that my original prompt lacked sufficient structure and detail, I suspect my work fell somewhat outside that sweet spot. To illustrate the point, here are some of the debugging issues captured in Claude Code's project-level MEMORY.md file:

  • vault-agent v1.19.x pkiCert renewal bug: never re-renders after initial boot. Fixed by upgrading to v1.21.3 (matches Vault server). Belt-and-suspenders: systemd timer calls vault-agent-cert-refresh.sh every 60h.

  • SA prefix mismatch bug: scenario vault-pki.tf SA uses sa_prefix (capped at 14 from short_prefix); consul module SA used its own prefix (16 from name_prefix). Fix: remove SA creation from consul module; accept consul_server_sa_email as input var. Vault GCP role binds to scenario-level SA; VM uses same SA.

  • vault-reviewer SA deleted / JWT stale: vault-agent-init gets 403 permission denied → check kubectl get sa vault-reviewer -n kube-system. If missing: recreate SA, create kubernetes.io/service-account-token Secret for it, bind to system:auth-delegator, get new JWT, update terraform.tfvars:token_reviewer_jwt, run vault write auth/kubernetes/config ... token_reviewer_jwt=<new> (see below).

  • Vault-agent HCL heredoc indentation bug: (resolved 2026-03-14) - Symptom: vault-agent <<-TMPL heredoc templates render YAML with spurious extra indentation (+2 to +4 spaces on all lines after the first), causing yaml.parser.ParserError in the Python config loader.

Claude Code As A Module Orchestrator

The prompt.md experiment settled the question I started with, but perhaps not in the way I expected. Claude Code can reconstruct a repo from a sufficiently detailed specification. The difference was that the repo constructed from the prompt lacked the operational knowledge baked into the original: the phase-gating logic, the vault-agent TLS permission chain, the sync-catalog exclusion that prevents connect-init failures at deploy time. Those aren't things you derive from a spec. They're things you learn from a crash loop at 11pm.

That gap points to where AI coding tools actually create value in infrastructure work. The productivity gains don't come from handing the AI a blank canvas - they come from giving it a well-constrained problem space.

Modules are that constraint. A module with clear inputs, outputs, and a single responsibility is something Claude Code can reason about reliably. A sprawling multi-provider scenario with implicit dependencies and phase-gated applies is not. Working module-by-module also makes the following more predictable:

  • Cloud provider resource costs

  • Blast radius

  • Security posture

  • Robustness of deployment processes and deployed infrastructure

You are also significantly less likely to end up in long, convoluted debugging loops.
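
To make "clear inputs, outputs, and a single responsibility" concrete, here is a hedged sketch based on the SA prefix mismatch fix recorded in MEMORY.md: the scenario root owns the service account and the Consul module only accepts its email. The `sa_prefix` expression and the `consul_server_sa_email` input name come from the material above; the module path, the `short_prefix` variable declaration, the account ID suffix and the output are hypothetical.

```hcl
# --- scenario root (hypothetical layout) ---

variable "short_prefix" {
  type        = string
  description = "Short naming prefix for resources in this scenario."
}

locals {
  # GCP service account IDs are capped at 30 characters, so the prefix is
  # truncated to 14 and any trailing hyphen is stripped.
  sa_prefix = trimsuffix(substr(var.short_prefix, 0, 14), "-")
}

# The service account is created once, at scenario level.
resource "google_service_account" "consul_server" {
  account_id   = "${local.sa_prefix}-consul-server" # suffix is an assumption
  display_name = "Consul server"
}

module "consul_server" {
  source = "./modules/consul-server" # hypothetical path

  # The module receives the identity instead of creating its own.
  consul_server_sa_email = google_service_account.consul_server.email
}

# --- modules/consul-server/variables.tf (hypothetical) ---

variable "consul_server_sa_email" {
  type        = string
  description = "Email of the service account the Consul VMs run as."

  validation {
    condition     = can(regex("@.+\\.iam\\.gserviceaccount\\.com$", var.consul_server_sa_email))
    error_message = "The value must be a GCP service account email address."
  }
}

# --- modules/consul-server/outputs.tf (hypothetical) ---

output "consul_server_sa_email" {
  description = "Service account email in use, echoed back for downstream bindings."
  value       = var.consul_server_sa_email
}
```

With a contract like this there is a single place where the service account name is computed, so the Vault GCP role and the VM cannot drift onto different identities.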

My working model going forward is to treat Claude Code as a Terraform module orchestrator rather than an infrastructure architect. In practice this means:

  • Templatize and reuse claude.md, rules files, and settings.json permission boundaries - these compound in value across projects

  • Prefer modules that already exist in a registry over generating new ones from scratch

  • For regulated industries, a private module registry isn't optional - it's the only way to control what the AI is building on top of

  • When new modules do need to be written from scratch, require a tight specification before writing a line of code

  • Prompt and validate per phase. The final repo deployed infrastructure across three phases, listed below. In retrospect, a single prompt covering the entire workflow was probably insufficient - a detailed prompt per phase would likely have yielded better results. Equally, fully validating each phase before moving to the next would have been a sounder approach.

    • phase1:apply: Phase 1: Network + HCP Vault + PKI + Vault Config + Consul VMs

    • phase2:apply: Phase 2: GKE cluster + Consul Helm

    • phase2:apply: Phase 2: GKE cluster + Consul Helm
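
For context, the prompt file describes the phase gate behind these applies as a `gke_cluster_ready` boolean that gates the `data.google_container_cluster` lookup and the `module.mcp_agents` count. A minimal sketch of that pattern follows. Those three names come from the prompt file; the `cluster_name` and `region` variables, the module path and the `cluster_endpoint` input are hypothetical.

```hcl
variable "gke_cluster_ready" {
  type        = bool
  default     = false
  description = "Set to true only after the phase that creates the GKE cluster has been applied."
}

variable "cluster_name" { # hypothetical
  type = string
}

variable "region" { # hypothetical
  type = string
}

# The lookup is skipped entirely until the cluster exists, so earlier
# phases can plan and apply without a reachable cluster.
data "google_container_cluster" "this" {
  count    = var.gke_cluster_ready ? 1 : 0
  name     = var.cluster_name
  location = var.region
}

# Kubernetes and Helm resources live behind the same gate.
module "mcp_agents" {
  source = "./modules/mcp-agents" # hypothetical path
  count  = var.gke_cluster_ready ? 1 : 0

  # one() returns null when the gated data source has zero instances.
  cluster_endpoint = one(data.google_container_cluster.this[*].endpoint) # hypothetical input
}
```

A later phase would then be applied with something like `terraform apply -var="gke_cluster_ready=true"`, which is also a natural checkpoint for validating the previous phase before moving on.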

The broader lesson isn't really about Claude Code - it applies to any tool: the more clearly you define your constraints, context, and expected outcomes upfront, the better the result.