openova/.github
e3mrah 3a5d9fc102
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min
of provision-cycle waste per regression event documented in Fix #101.

## Fix A — CI guard against unescaped tftpl shell expansion

Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml
that scans every infra/hetzner/*.tftpl for unescaped ${VAR:-default}
inside YAML comment lines. Uses a PCRE negative lookbehind so that
correctly escaped $${VAR:-default} (templatefile() literal-dollar)
does not trip the guard.
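
The matching rule can be illustrated in Go. This is only a sketch of the logic, not the workflow step itself (which uses grep with PCRE). Go's regexp package has no lookbehind, so the character before the dollar sign is matched explicitly. The function name `offendingCommentLines` is hypothetical, and the sketch assumes a "comment line" is any line whose first non-blank character is `#`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// unescapedExpansion matches ${NAME:-...} when the '$' is not itself preceded
// by another '$'. RE2 has no lookbehind, so the preceding character (or start
// of line) is consumed as part of the match instead.
var unescapedExpansion = regexp.MustCompile(`(^|[^$])\$\{[A-Za-z_][A-Za-z0-9_]*:-`)

// offendingCommentLines returns the 1-based numbers of comment lines that
// contain an unescaped shell-style default expansion.
// Assumption: a comment line is one whose first non-blank character is '#'.
func offendingCommentLines(content string) []int {
	var hits []int
	for i, line := range strings.Split(content, "\n") {
		if !strings.HasPrefix(strings.TrimSpace(line), "#") {
			continue // not a comment line; out of scope for this guard
		}
		if unescapedExpansion.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	sample := "# bad: ${QA_FIXTURES_ENABLED:-false}\n" +
		"# ok: $${QA_FIXTURES_ENABLED:-false}\n" +
		"key: value\n"
	fmt.Println(offendingCommentLines(sample)) // prints [1]
}
```

Only the first sample line is reported: the second uses the `$$` escape, and the third is not a comment.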

Background: PR #1311 (Fix #73) added a YAML comment with bare
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL
${...} sequences regardless of YAML/HCL/shell context; the colon
in the interpolation hits HCL's reserved conditional grammar and
crashes 'tofu plan' with "Template interpolation doesn't expect
a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted
~30 min before PR #1328 fixed the one offender. Without the guard,
the next operator who adds a similar comment repeats the incident.

Documented in infra/hetzner/README.md so editors learn the $$
escape pattern before they trip the CI gate.

## Fix B — bucket-name suffix to escape global Hetzner namespace

Hetzner Object Storage bucket names share a GLOBAL namespace
across every tenant. The previous BucketNameForSovereign(fqdn)
derivation 'catalyst-<fqdn-with-dashes>' would collide on the
second CreateDeployment for the same FQDN (re-provision after
wipe, two operators on adjacent pools, race conditions) and the
second 'tofu apply' would fail with BucketAlreadyExists.

Change BucketNameForSovereign signature to (fqdn, deploymentID)
and append the first 8 chars of the deployment-id as a suffix:

  catalyst-omantel-omani-works-b3b837a2

newID() already returns 16 random hex characters; the leading 8 carry
32 bits of fresh entropy, enough to make a collision between two
deployments of the same FQDN negligible in practice. Backward-compat:
an empty deploymentID (legacy on-disk records) falls back to the first
8 hex chars of sha256(fqdn), so wipes of pre-Fix-111 Sovereigns remain
deterministic.
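
The derivation described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in hetzner/buckets.go: the function bodies, the `isHex` helper, the `bucketSuffix` parameter list, lower-casing of both inputs, and the trailing characters of the sample deployment ID are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isHex reports whether s is non-empty and made only of lowercase hex digits.
func isHex(s string) bool {
	for _, r := range s {
		if !strings.ContainsRune("0123456789abcdef", r) {
			return false
		}
	}
	return s != ""
}

// bucketSuffix returns the first 8 hex chars of deploymentID. If the ID is
// empty, too short, or not hex, it falls back to the first 8 hex chars of
// sha256(fqdn) so legacy records still map to a deterministic name.
func bucketSuffix(fqdn, deploymentID string) string {
	id := strings.ToLower(deploymentID) // assumption: IDs are case-normalised
	if len(id) >= 8 && isHex(id[:8]) {
		return id[:8]
	}
	sum := sha256.Sum256([]byte(fqdn))
	return hex.EncodeToString(sum[:])[:8]
}

// BucketNameForSovereign builds catalyst-<fqdn-with-dashes>-<suffix>.
func BucketNameForSovereign(fqdn, deploymentID string) string {
	fqdn = strings.ToLower(fqdn) // assumption: FQDNs are case-normalised
	slug := strings.ReplaceAll(fqdn, ".", "-")
	return "catalyst-" + slug + "-" + bucketSuffix(fqdn, deploymentID)
}

func main() {
	// Prints catalyst-omantel-omani-works-b3b837a2
	fmt.Println(BucketNameForSovereign("omantel.omani.works", "b3b837a2c4d5e6f7"))
}
```

Two deployments of the same FQDN get different names as long as their IDs differ in the first 8 characters, while an empty ID always yields the same hash-derived name.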

Call-sites updated:
  - handler/deployments.go: id := newID() moved before
    bucket-name derivation; uses hetzner.BucketNameForSovereign
  - handler/wipe.go: passes dep.ID to PurgeBuckets and to
    BucketNameForSovereign in the report
  - hetzner/buckets.go: PurgeBuckets signature now takes
    deploymentID; bucketSuffix() handles the fallback

Tests:
  - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign
    table covers canonical newID() shape, collision avoidance,
    uppercase normalisation, empty + non-hex fallback paths.
    New TestBucketNameForSovereign_CollisionAvoidance asserts
    the Fix #111 invariant directly.
  - handler/deployments_test.go:
    TestCreateDeployment_DerivesObjectStorageBucketFromFQDN
    now asserts the suffixed shape against the actual dep.ID.
  - All produced names re-validated against the S3 bucket-naming
    rules (mirrored regex from provisioner.s3BucketNamePattern).
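
For illustration, an S3-style bucket-name check can be sketched as below. The pattern is an assumption based on the general S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). It is not copied from provisioner.s3BucketNamePattern, which may be stricter, and the names `bucketNamePattern` and `validBucketName` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// bucketNamePattern approximates the general S3 naming rules: 3-63 chars,
// lowercase alphanumerics plus '.' and '-', alphanumeric at both ends.
// Assumption: the project's real pattern may differ from this one.
var bucketNamePattern = regexp.MustCompile(`^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$`)

// validBucketName reports whether name satisfies the approximate rules above.
func validBucketName(name string) bool {
	return bucketNamePattern.MatchString(name)
}

func main() {
	fmt.Println(validBucketName("catalyst-omantel-omani-works-b3b837a2")) // prints true
	fmt.Println(validBucketName("Catalyst-Upper"))                        // prints false
}
```

Because the prefix and suffix together add 18 characters, a check like this also catches FQDNs long enough to push the full name past the 63-character limit.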

## Claimed TCs

_None directly. Infrastructure hardening: eliminates the ~30 min
wasted per cycle by regressions like PR #1311 and bucket-name collisions._

## Verification

- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover)
  are unrelated and present on origin/main

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:31:56 +04:00
workflows fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331) 2026-05-10 23:31:56 +04:00
dependabot.yml chore(ci): add Dependabot for npm and GitHub Actions dependency updates 2026-03-19 13:42:02 +01:00