5bd68ae0f6
75 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
22855e62d8
|
feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396)
Final integration piece for the OpenovaFlow infrastructure path: catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits distinct region tags on every FlowNode and the snapshot returns 2× per HR on a multi-region Sovereign.
Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst-ui temporary revert until npm workspaces land), PR #1395 (chart no-op).
## Scope vs original Agent #3 brief
The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire + runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred: PR #1394 reverted Agent #1's UI wiring because the Docker UI build has no node_modules for the cross-workspace canvas source. Founder note on #1394: "Agent #3 (or a follow-up) will re-wire them properly once npm workspaces are configured at repo root."
This PR ships the infrastructure half (proxy + cloud-init + runbook). The canvas-side rewire is a separate follow-up PR that needs npm workspaces, not surgical edits to FlowPage.
## What ships
### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events}
products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go:
- GET /snapshot: JSON pass-through, headers + status forwarded
- GET /stream: unbuffered SSE pass-through using http.Flusher (NOT httputil.ReverseProxy; that buffers and breaks text/event-stream)
- POST /events: body forwarded byte-for-byte
- Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign in-cluster Service DNS)
Routes registered in cmd/api/main.go inside the auth-gated chi.Group. 11 table-driven tests cover snapshot/events/stream pass-through, upstream 404/400/unreachable propagation, empty-deploymentId guard, SSE frames arrive AS EMITTED, and env-default fallback.
### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY
- infra/hetzner/cloudinit-control-plane.tftpl: two new postBuild.substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP
- infra/hetzner/main.tf: primary CP renders var.region as region key; secondary CP renders each.key (e.g. "hel1-1") from for_each over local.secondary_regions
- infra/hetzner/variables.tf: new sovereign_deployment_id var (string, default "" for tofu mocks)
- provisioner.go writeTfvars: writes vars["sovereign_deployment_id"] = req.DeploymentID
- bootstrap-kit slot 57: swap placeholder ${SOVEREIGN_FQDN} / literal "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY} envsubst keys
### 3. Deployment record flag
handler/deployments.go State() emits `openovaFlowEnabled: true` on every deployment. The catalyst-ui rewire (follow-up PR) will read this to enable the openova-flow-server adapter; legacy provisions without the flag will keep the bridge once the rewire lands.
### 4. Verification runbook
docs/runbooks/openova-flow-multi-region-verify.md: prov #34 POST body (multi-region cpx42 fsn1+hel1, qaTestEnabled=true, sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual canvas checks (gated on the follow-up UI rewire), and a failure-class triage table.
## Canonical-seam citations
1. SSE pattern: products/catalyst/bootstrap/api/internal/handler/deployments.go:1244-1287 (StreamLogs): identical Content-Type + Cache-Control + X-Accel-Buffering header set; identical http.Flusher.Flush() after each write; identical r.Context().Done() cancel path.
2. postBuild.substitute pattern: infra/hetzner/cloudinit-control-plane.tftpl:884-893 (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var} form, dual emission at primary + secondary CP for_each in main.tf.
## Verification
```
$ go build ./...   (clean)
$ go vet ./...     (clean)
$ go test ./internal/handler/ -run TestFlowProxy -count=1 -race
ok  github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler  1.410s
$ go test ./internal/provisioner/... -count=1
ok  github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner  0.025s
```
3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields, TestHandleWhoami_PinSessionRBACClaims, TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on main HEAD without this PR and are unrelated baseline state.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
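The unbuffered SSE pass-through described for GET /stream can be sketched roughly as follows. This is a minimal illustration and not the code in openova_flow_proxy.go: the names `sseProxy` and `relayOnce`, the header values, and the 502 status on an unreachable upstream are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// sseProxy (hypothetical name) relays an upstream event stream to the
// client and flushes after every chunk so frames are not held in a buffer.
func sseProxy(upstreamURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		// Bind the upstream request to the client's context so that a
		// client disconnect cancels the upstream read.
		req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, upstreamURL, nil)
		if err != nil {
			http.Error(w, "bad upstream URL", http.StatusInternalServerError)
			return
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			// Assumption: an unreachable upstream is reported as 502.
			http.Error(w, "upstream unreachable", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		// Assumed header values; the commit names the headers only.
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		w.Header().Set("X-Accel-Buffering", "no")
		w.WriteHeader(resp.StatusCode)

		buf := make([]byte, 4096)
		for {
			n, readErr := resp.Body.Read(buf)
			if n > 0 {
				if _, err := w.Write(buf[:n]); err != nil {
					return
				}
				flusher.Flush()
			}
			if readErr != nil {
				return // io.EOF or a cancelled context ends the relay
			}
		}
	}
}

// relayOnce starts a fake upstream that emits frames, puts sseProxy in
// front of it, and returns what a client reads through the proxy.
func relayOnce(frames string) (string, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		io.WriteString(w, frames)
	}))
	defer upstream.Close()
	proxy := httptest.NewServer(sseProxy(upstream.URL))
	defer proxy.Close()

	resp, err := http.Get(proxy.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	got, err := relayOnce("data: hello\n\ndata: world\n\n")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", got)
}
```

Copying raw bytes instead of parsing events keeps the proxy agnostic to frame boundaries, which is one way to satisfy the "frames arrive AS EMITTED" property the tests are said to check.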
||
|
|
4e6bec7022
|
fix(infra): body-supplied SKUs win over QA defaults (Fix #183) (#1386)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)
Fix #180 PR #1383 merged with sed -i error: produced `import type from 'react'` (empty import binding) which is a syntax error. Main build broken. This PR removes the malformed line entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)
Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork: attach server to network: IP not available" on hcloud_server.control_plane[0]: hcloud_load_balancer_network.{main,secondary} both attached to the shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates the first free IP from the first matching-zone subnet. In the multi-region prov #32 the secondary LB-network (hel1) completed first at t+16s and took 10.0.1.2 from the only eu-central subnet existing at that moment (`main` = 10.0.1.0/24), stealing the IP the primary CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.
Fix: pin LB anchors to top-of-subnet (.254) so they live outside the CP/worker IP range (.2..N for CPs, .10+ for workers).
Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused prov #32's secondary subnet to fail with `invalid input in field 'network_zone' [network zone does not exist]`. The original prov #29/#30 "IP not available on secondary[hel1-1]" was the same LB-IP collision; this PR resolves both.
Multi-region apply now lands cleanly:
10.0.1.2 -> primary CP (cp1)
10.0.1.254 -> primary LB anchor
10.0.10.2 -> secondary CP (hel1-1)
10.0.10.254 -> secondary LB anchor (hel1-1)
Refs: openova-private prov-loop session 2026-05-11 Wave 26
* fix(infra): body-supplied SKUs win over QA defaults (Fix #183)
Fix #157 introduced `effective_cp_size = coalesce(var.qa_control_plane_size, var.control_plane_size)` when qa_fixtures_enabled='true'. Because qa_control_plane_size has a non-empty default (cpx32), coalesce always returned the QA default and silently overrode whatever the body supplied in `controlPlaneSize`.
Founder-supplied body for prov #32 specified `controlPlaneSize: "cpx42"` explicitly (cheapest viable for the founder's collapsed-CP+worker single-node-per-region topology with workerCount=0). The QA-default override downgraded that to cpx32 at plan time, so the explicit choice never made it onto the hardware.
Fix #183: invert the coalesce so body wins:
effective_cp_size = local.qa_mode ? coalesce(var.control_plane_size, var.qa_control_plane_size) : var.control_plane_size
`provisioner.go` writeTfvars already emits control_plane_size / worker_size only when the body's field is non-empty (so `var.control_plane_size` inherits variables.tf's cost-optimised default when the body left it blank). That means `coalesce(var.control_plane_size, var.qa_*)` always has a non-empty first arg in normal flow; the QA-default fallback only fires on a zero-override QA call that intentionally leaves the SKU empty.
No change to customer-Sovereign behaviour (qa_fixtures_enabled='false' branch already used `var.control_plane_size` verbatim).
Refs: openova-private prov-loop session 2026-05-11 Wave 26
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
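To make the argument-order bug in Fix #183 concrete, here is a small Go model of the precedence rule. It is only an illustration: the real logic is the HCL expression quoted in the commit, and `coalesce`, `sizeBeforeFix`, and `sizeAfterFix` are hypothetical helpers that mimic the behaviour of returning the first non-empty string.

```go
package main

import (
	"errors"
	"fmt"
)

// coalesce mimics the HCL function for strings: it returns the first
// non-empty argument and fails when every argument is empty.
func coalesce(vals ...string) (string, error) {
	for _, v := range vals {
		if v != "" {
			return v, nil
		}
	}
	return "", errors.New("coalesce: no non-empty argument")
}

// sizeBeforeFix models the Fix #157 ordering: the QA default comes first,
// so it always wins because it is never empty.
func sizeBeforeFix(qaMode bool, bodySize, qaDefault string) (string, error) {
	if !qaMode {
		return bodySize, nil
	}
	return coalesce(qaDefault, bodySize)
}

// sizeAfterFix models the Fix #183 ordering: the body-supplied size comes
// first and the QA default is only a fallback.
func sizeAfterFix(qaMode bool, bodySize, qaDefault string) (string, error) {
	if !qaMode {
		return bodySize, nil
	}
	return coalesce(bodySize, qaDefault)
}

func main() {
	before, _ := sizeBeforeFix(true, "cpx42", "cpx32")
	after, _ := sizeAfterFix(true, "cpx42", "cpx32")
	fmt.Println("before fix:", before)
	fmt.Println("after fix: ", after)
}
```

With the body asking for cpx42 and the QA default at cpx32, the old ordering yields cpx32 and the new ordering yields cpx42, which matches the downgrade described for prov #32.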
||
|
|
515c3cf38d
|
fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) (#1385)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)
Fix #180 PR #1383 merged with sed -i error: produced `import type from 'react'` (empty import binding) which is a syntax error. Main build broken. This PR removes the malformed line entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)
Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork: attach server to network: IP not available" on hcloud_server.control_plane[0]: hcloud_load_balancer_network.{main,secondary} both attached to the shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates the first free IP from the first matching-zone subnet. In the multi-region prov #32 the secondary LB-network (hel1) completed first at t+16s and took 10.0.1.2 from the only eu-central subnet existing at that moment (`main` = 10.0.1.0/24), stealing the IP the primary CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.
Fix: pin LB anchors to top-of-subnet (.254) so they live outside the CP/worker IP range (.2..N for CPs, .10+ for workers).
Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused prov #32's secondary subnet to fail with `invalid input in field 'network_zone' [network zone does not exist]`. The original prov #29/#30 "IP not available on secondary[hel1-1]" was the same LB-IP collision; this PR resolves both.
Multi-region apply now lands cleanly:
10.0.1.2 -> primary CP (cp1)
10.0.1.254 -> primary LB anchor
10.0.10.2 -> secondary CP (hel1-1)
10.0.10.254 -> secondary LB anchor (hel1-1)
Refs: openova-private prov-loop session 2026-05-11 Wave 26
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7aa1b24c0d
|
fix(infra/hetzner): hel1 network_zone is eu-north not eu-central (#179) (#1381)
prov #29 + prov #30 both failed at +90s with: Error: hcloud/inlineAttachServerToNetwork: attach server to network: IP not available (ip_not_available, ...) with hcloud_server.secondary_control_plane["hel1-1"] Root cause: `local.hetzner_network_zones` hardcoded `hel1 = "eu-central"`. Helsinki is physically in Hetzner's eu-north zone (Finland), not eu-central (Falkenstein/Nuremberg). Hetzner subnets are zone-bound: when the secondary hel1 subnet is created with network_zone=eu-central, the subnet exists but attaching a server in location=hel1 (physical eu-north) returns ip_not_available because cross-zone attach isn't supported. Fix: hel1 -> eu-north. Caught live on prov #29 + #30 (omantel.biz 2-region fsn1+hel1 reprov, both failed at the same line 872 secondary CP attach). Per CLAUDE.md ARCHITECT-FIRST: Hetzner publishes zone-region mapping at https://docs.hetzner.com/cloud/general/locations/; hel1 is unambiguously listed under eu-north. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
8308f53e32
|
fix(infra/hetzner): auto-flip QA Sovereigns to cpx32/cpx42 nodes (Fix #157) (#1360)
12 of 12 fresh Sovereign provisions in the 2026-05-10 bounded-cycle session wedged on the production cpx22 CP / cpx32 worker defaults (memory entry: "provision #5 cpx22 OOM" + handover doc).
Root cause: the CP's documented ~3.5GB k3s+cilium+flux+cert-manager+sealed-secrets working set leaves zero RAM headroom for Flux source-controller's ~700MB burst during the 44-slot bootstrap-kit apply, while two cpx32 workers (8GB each) cannot satisfy the simultaneous request set from bp-keycloak (2Gi JVM) + bp-harbor (~2.5Gi across 6 sub-components) + bp-cnpg primary + bp-openbao 3-replica Raft once the qaFixtures Continuum + CNPGPair + status-seeder Jobs queue.
Mirrors the Fix #123 pattern (wildcard_cert_use_staging): auto-flips ONLY when qa_fixtures_enabled='true'. Customer-facing Sovereigns (SME / marketplace / admin / console) provision with qa_fixtures_enabled='false' so coalesce() in main.tf falls back to the existing cpx22/cpx32 defaults; the production code path is untouched.
- variables.tf: qa_control_plane_size (default cpx32), qa_worker_size (default cpx42) with the same Hetzner SKU regex validation as the production size variables.
- main.tf: locals.qa_mode + locals.effective_cp_size + locals.effective_worker_size; hcloud_server.control_plane and .worker read the effective locals so QA Sovereigns auto-flip and customer Sovereigns plan-clean unchanged.
- tests/multi_region.tftest.hcl: three new run blocks pin the contract: qa_mode=false keeps cpx22/cpx32, qa_mode=true flips to cpx32/cpx42 defaults, qa_mode=true respects explicit operator overrides (no hardcoded SKU per docs/INVIOLABLE-PRINCIPLES.md #4).
Per principle 17 (isolated worktree) shipped from .claude/worktrees/qa-node-sizing-157. Per principle 4 (target-state) attacks the systemic OOM-cascade root cause rather than another per-blueprint timeout bandaid. Per principle 16 (canonical seam) the SKU choice lives in variables.tf defaults + per-resource selection in main.tf; no other path mutates server_type. Per principle 18 no SKU is hardcoded: every value is operator-overridable.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
901afa2a95
|
fix(infra/hetzner): add skip_region_validation=true to aws provider for Hetzner regions (#135) (#1344)
Fix #133 (PR #1343) swapped aminueza/minio for hashicorp/aws to bypass DeleteBucketPolicy AccessDenied. Worked for the bucket creation API, but the aws provider's region validator runs at provider-init time and rejects Hetzner regions (fsn1/nbg1/hel1) before any S3 call: Error: invalid AWS Region: fsn1 provider["registry.opentofu.org/hashicorp/aws"] Reproduced on prov #19 (02c23fc20df90629) — failed at `tofu plan` in 96s. Companion to the existing skip_credentials_validation + skip_metadata_api_check + skip_requesting_account_id flags that already disable the other AWS-specific preflight checks the Hetzner endpoint can't satisfy. skip_region_validation=true tells the provider not to compare the region string against AWS's hardcoded region list; the region is still passed through to the S3 SDK (used as the SigV4 signing region) which is what Hetzner expects. Per CLAUDE.md principle 16: same canonical seam as the other skip_* flags in the same provider block — this is the missing fourth flag in the standard "non-AWS S3-compatible backend" pattern. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5d43cf7b53
|
fix(infra/hetzner): swap aminueza/minio for hashicorp/aws to escape AccessDenied wedge (#133) (#1343)
Root cause of provisions #13 / #17 failing in <2 min at `tofu apply` with:
[FATAL] [ACL] Unable to create bucket (catalyst-omantel-biz-<id>): unable to remove bucket policy: Access Denied.
`aminueza/minio v3.34.0`'s `minio_s3_bucket` Create handler calls `DeleteBucketPolicy` post-create as part of state normalization (the provider treats "no policy" as the canonical zero state and forcibly clears any inherited policy). Hetzner Object Storage's standard read/write credentials don't grant `s3:DeleteBucketPolicy`, so the call fails AccessDenied EVERY TIME -- the bucket IS created on Hetzner's side but tofu marks the resource as failed and rolls back the apply, blocking every fresh Sovereign provision from reaching Phase 1. The wedge is deterministic, not flaky.
Provider swap rationale -- `hashicorp/aws` configured against Hetzner's S3 endpoint speaks vanilla S3 and does NOT do any post-create policy normalization. A successful CreateBucket is the terminal state for `aws_s3_bucket` Create. Hetzner officially documents AWS CLI / SDK as a supported S3 client (see https://docs.hetzner.com/storage/object-storage/getting-started/using-s3-api-tools/), so this is the canonical-vendor path, not a workaround.
Changes:
* `versions.tf` -- drop `aminueza/minio`, add `hashicorp/aws ~> 5.0` pointed at `https://<region>.your-objectstorage.com` with `s3_use_path_style = true` and the four `skip_*` flags that disable AWS-specific preflight calls (STS, IMDS) Hetzner doesn't implement.
* `main.tf` -- `minio_s3_bucket.main` -> `aws_s3_bucket.main` (no force_destroy preserved). Add `aws_s3_bucket_acl.main` for `private` (the bucket-level acl arg was removed in aws-provider 5.x). Updated comment block explains the AccessDenied root cause inline so future readers don't repeat the journey.
* `outputs.tf` -- `minio_s3_bucket.main.bucket` -> `aws_s3_bucket.main.bucket`.
* `variables.tf` -- prose-only updates pointing at the new provider + the fix-#133 root-cause note.
* `tests/multi_region.tftest.hcl` -- override_resource swap from `minio_s3_bucket.main` to `aws_s3_bucket.main` + `aws_s3_bucket_acl.main` so the offline tftest mock path still bypasses provider validation.
* `cloudinit-control-plane.tftpl` -- two comment lines updated to reference the new resource name (no behavioural change).
* `.terraform.lock.hcl` -- removed (regenerated by `tofu init` against the new provider set; CI's `tofu init -backend=false` step relocks deterministically).
Idempotency / state migration:
* Fresh-provision-only path -- existing prov state lives in PDM and is recycled per provision. New provs: `tofu init` pulls the aws provider, `tofu apply` creates `aws_s3_bucket` with the same name Hetzner already owns and gets BucketAlreadyOwnedByYou (200, no-op in the AWS SDK). Idempotent.
* Long-lived Sovereigns (sme/marketplace/admin/console -- protected per ADR-0001 §9.4) are NOT re-applied; their tofu state is stable. No `state mv` runbook is required.
Test plan:
* `tofu fmt -check -recursive` -- expected pass (manual indent matches fmt output).
* `tofu validate` (CI's infra-hetzner-tofu workflow) -- expected pass.
* `tofu test` against `tests/multi_region.tftest.hcl` -- expected pass on all 5 scenarios (mock_provider for hcloud + override_resource for the two new aws resources).
* `tofu apply` is NOT runnable from this env (no Hetzner creds); CI's test-hetzner-e2e workflow exercises the live path on PR merge.
Refs #133.
Co-authored-by: Claude (e3mrah) <noreply@anthropic.com>
|
||
|
|
90aa2767da
|
fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123, LE rate-limit bypass) (#1339)
Root cause (qa-loop iter-1 wedge, 2026-05-10):
Let's Encrypt production hit the 5-certs/168h rate limit on
*.omantel.biz (retry after 2026-05-11 22:08 UTC). Cilium-envoy
could not get a wildcard cert -> console.omantel.biz TLS handshake
failed -> iter-1 Test Executor could not run. Customer Sovereigns
are unaffected (one cert per registered domain in their lifetime),
but QA Sovereigns wipe + re-provision dozens of times in a session
and exhaust the production ceiling within hours.
Fix (target-state, NOT workaround):
- bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer
(letsencrypt-dns01-staging-powerdns) alongside the existing
production one. Same DNS-01 webhook config (same PowerDNS endpoint,
same API key) -> only the ACME directory URL + account key differ.
Both ClusterIssuers are real cert-manager resources; LE treats them
as wholly independent issuers so a rate-limit hit on production
does NOT block staging issuance.
- bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool,
default false). When true, sovereign-wildcard-certs.yaml renders
Certificate(s) with issuerRef.name pointing at the staging issuer
instead of production.
- bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst,
same passthrough pattern as QA_FIXTURES_ENABLED.
- catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA
Sovereigns (Request.QATestEnabled=true) so the per-Sovereign
overlay flips both QA fixtures + staging certs from one wizard
toggle.
- tofu var wildcard_cert_use_staging propagates through main.tf
into the cloudinit postBuild.substitute block on both primary +
secondary regions.
Result:
cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard
cert in <2min (no production rate limit). curl -sk + Playwright
(ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run
within minutes of provision. Customer Sovereigns (QATestEnabled=
false) keep getting real-trusted production certs.
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL
+ issuer name is values-overridable. Operators wiring a private
staging ACME (e.g. internal Smallstep CA) override via per-Sovereign
overlay without rebuilding any Blueprint. Staging is the documented
LE pattern (https://letsencrypt.org/docs/staging-environment/), not a
band-aid.
_None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_
Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3a5d9fc102
|
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min of provision-cycle waste per regression event documented in Fix #101.
## Fix A: CI guard against unescaped tftpl shell expansion
Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml that scans every infra/hetzner/*.tftpl for unescaped ${VAR:-default} inside YAML comment lines. Uses PCRE negative-lookbehind so correctly escaped $${VAR:-default} (templatefile() literal-dollar) does not trip the guard.
Background: PR #1311 (Fix #73) added a YAML comment with bare ${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL ${...} sequences regardless of YAML/HCL/shell context; the colon in the interpolation hits HCL's reserved conditional grammar and crashes 'tofu plan' with "Template interpolation doesn't expect a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted ~30 min before PR #1328 fixed the one offender. Without the guard, the next operator who adds a similar comment repeats the incident.
Documented in infra/hetzner/README.md so editors learn the $$ escape pattern before they trip the CI gate.
## Fix B: bucket-name suffix to escape global Hetzner namespace
Hetzner Object Storage bucket names share a GLOBAL namespace across every tenant. The previous BucketNameForSovereign(fqdn) derivation 'catalyst-<fqdn-with-dashes>' would collide on the second CreateDeployment for the same FQDN (re-provision after wipe, two operators on adjacent pools, race conditions) and the second 'tofu apply' would fail with BucketAlreadyExists.
Change BucketNameForSovereign signature to (fqdn, deploymentID) and append the first 8 chars of the deployment-id as a suffix:
catalyst-omantel-omani-works-b3b837a2
newID() already returns 16-hex random; the leading 8 chars are 32 bits of fresh entropy, enough to make collisions cryptographically negligible.
Backward-compat: empty deploymentID (legacy on-disk records) falls back to first-8-hex of sha256(fqdn) so wipes of pre-Fix-111 Sovereigns remain deterministic.
Call-sites updated:
- handler/deployments.go: id := newID() moved before bucket-name derivation; uses hetzner.BucketNameForSovereign
- handler/wipe.go: passes dep.ID to PurgeBuckets and to BucketNameForSovereign in the report
- hetzner/buckets.go: PurgeBuckets signature now takes deploymentID; bucketSuffix() handles the fallback
Tests:
- hetzner/buckets_test.go: 6-case TestBucketNameForSovereign table covers canonical newID() shape, collision avoidance, uppercase normalisation, empty + non-hex fallback paths. New TestBucketNameForSovereign_CollisionAvoidance asserts the Fix #111 invariant directly.
- handler/deployments_test.go: TestCreateDeployment_DerivesObjectStorageBucketFromFQDN now asserts the suffixed shape against the actual dep.ID.
- All produced names re-validated against the S3 bucket-naming RFC (mirrored regex from provisioner.s3BucketNamePattern).
## Claimed TCs
_None directly: infrastructure hardening; eliminates 30+ min wasted per cycle from regressions like PR #1311 + bucket-collision_
## Verification
- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover) are unrelated and present on origin/main
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
0843f02269
|
fix(infra/hetzner): escape ${VAR:-default} in tftpl comment (PROV-9 BLOCKER) (#1328)
PR #1311 (Fix #73) added a YAML comment in cloudinit-control-plane.tftpl
line 933 that referenced the envsubst placeholder
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL ${...}
sequences regardless of YAML/HCL/shell context, and the colon inside
the interpolation makes it choke with:
Extra characters after interpolation expression; Template
interpolation doesn't expect a colon at this location.
Result: every prov-* attempt since #1311 merged has failed `tofu plan`
with exit code 1 in ~2 seconds. Prov #9 (4204f0b0c5e37a80) failed at
18:51 UTC with this error before any Hetzner resource was created.
Fix: change ${QA_FIXTURES_ENABLED:-false} to $${QA_FIXTURES_ENABLED:-false}
(HCL escape: $$ renders as a literal $ in the cloud-init output, which
envsubst then interprets at apply time). Same precedent: commit
|
||
|
|
b22975cb4b
|
fix(catalyst-api provisioner): qaTestEnabled flag auto-sets QA_FIXTURES_ENABLED for QA Sovereigns (qa-loop bounded-cycle Fix #73) (#1311)
Provision #7 came up zero-touch but the bp-catalyst-platform qaFixtures stack stayed off because the chart template defaults to ${QA_FIXTURES_ENABLED:-false} and the catalyst-api provisioner never threaded the toggle. Result: ~140 of the qa-loop matrix's TCs were inherently fixture-blocked on every QA Sovereign.
Canonical seam: provisioner.Request struct. New fields:
- QATestEnabled bool `json:"qaTestEnabled"` (default false)
- QAFixturesNamespace string `json:"qaFixturesNamespace,...` (default derived)
- QAOrganization string `json:"qaOrganization,...` (default derived)
When QATestEnabled=true, writeTfvars emits qa_fixtures_enabled="true" + qa_test_session_enabled="true" plus qa_fixtures_namespace + qa_organization derived from SovereignFQDN's first label per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode):
omantel.biz -> qa-omantel / omantel-platform
qa.example.com -> qa-qa / qa-platform
demo.openova.io -> qa-demo / demo-platform
Customer Sovereigns provision with QATestEnabled=false (default) -> no qa-fixture artifacts on production tenants.
Wiring:
1. internal/provisioner/provisioner.go: Request struct + writeTfvars() + deriveQAFixturesNamespace + deriveQAOrganization + firstFQDNLabel
2. infra/hetzner/variables.tf: 4 new tofu vars (string, true|false validated)
3. infra/hetzner/cloudinit-control-plane.tftpl: QA_FIXTURES_ENABLED / QA_TEST_SESSION_ENABLED / QA_FIXTURES_NAMESPACE / QA_ORGANIZATION substitute envvars on bootstrap-kit Kustomization
4. infra/hetzner/main.tf: pass new vars into both templatefile invocations (primary + per-secondary-region)
5. internal/provisioner/provisioner_test.go: 3 new tests:
- default-disabled invariant
- enabled derivation matrix
- operator-override-wins
QA Sovereign provision command (catalyst-api):
POST /api/v1/deployments { "sovereignFQDN": "omantel.biz", "qaTestEnabled": true, ... }
Verified: go test ./products/catalyst/bootstrap/api/internal/provisioner/... ok (0.019s)
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
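The derivation matrix above (first FQDN label mapped to a namespace and an organization) is simple enough to sketch in Go. The function names are the ones the commit lists, but the bodies here are guesses that merely reproduce the three documented examples; the lowercasing and trimming are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// firstFQDNLabel returns everything before the first dot, lowercased.
func firstFQDNLabel(fqdn string) string {
	label, _, _ := strings.Cut(strings.ToLower(strings.TrimSpace(fqdn)), ".")
	return label
}

// deriveQAFixturesNamespace maps omantel.biz to qa-omantel.
func deriveQAFixturesNamespace(fqdn string) string {
	return "qa-" + firstFQDNLabel(fqdn)
}

// deriveQAOrganization maps omantel.biz to omantel-platform.
func deriveQAOrganization(fqdn string) string {
	return firstFQDNLabel(fqdn) + "-platform"
}

func main() {
	for _, fqdn := range []string{"omantel.biz", "qa.example.com", "demo.openova.io"} {
		fmt.Printf("%s -> %s / %s\n", fqdn,
			deriveQAFixturesNamespace(fqdn), deriveQAOrganization(fqdn))
	}
}
```

The operator-override-wins test in the commit implies that these derived values are only defaults, so a caller would presumably apply them only when the corresponding request fields are empty.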
||
|
|
fcfed6408c
|
feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101) (#1226)
* feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101)
Follow-up to #1223. The Flux Kustomization on every Sovereign points at clusters/_template/bootstrap-kit/ and post-build-substitutes per-Sovereign vars (SOVEREIGN_FQDN, MARKETPLACE_ENABLED, ...). The per-Sovereign overlay file at clusters/<sov>/bootstrap-kit/01-cilium.yaml that #1223 added is therefore dead code (Flux doesn't read that path). The canonical mechanism is to extend the template with envsubst placeholders + thread the values through tofu vars.
Wires four layers end-to-end:
1. clusters/_template/bootstrap-kit/01-cilium.yaml: adds `cluster.name: ${CLUSTER_MESH_NAME:=}` and `cluster.id: ${CLUSTER_MESH_ID:=0}` plus `clustermesh.useAPIServer: true` + NodePort 32379. Empty defaults = single-cluster Sovereign (no peer connects); the cilium subchart accepts empty cluster.name when id=0.
2. infra/hetzner/cloudinit-control-plane.tftpl: adds CLUSTER_MESH_NAME / CLUSTER_MESH_ID to the bootstrap-kit Kustomization's postBuild.substitute block (alongside SOVEREIGN_FQDN, MARKETPLACE_ENABLED, PARENT_DOMAINS_YAML).
3. infra/hetzner/variables.tf: declares cluster_mesh_name (string, default "") and cluster_mesh_id (number, default 0, validated 0-255).
4. infra/hetzner/main.tf: primary cloud-init passes var.cluster_mesh_{name,id} verbatim. Secondary regions (when var.regions[i>0] is non-empty per slice G3) auto-derive each peer's name as `<sovereign-stem>-<region-code-no-digits>` and increment id from var.cluster_mesh_id+1. Per-region override via the new RegionSpec.ClusterMeshName field.
5. products/catalyst/bootstrap/api/internal/provisioner/provisioner.go: adds ClusterMeshName + ClusterMeshID to Request and threads them into writeTfvars(); RegionSpec gains ClusterMeshName for per-peer override.
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the chart-side default is intentionally empty: operator request OR per-Sovereign overlay must supply the values when ClusterMesh is enabled. The allocation registry lives at docs/CLUSTERMESH-CLUSTER-IDS.md (introduced in #1223).
Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33 follow-up to #1223
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(infra): escape $ in tftpl comments referencing envsubst placeholders
`tofu validate` reads `${CLUSTER_MESH_NAME}` inside YAML comments as a template variable reference; the comment was meant to refer to the Flux envsubst placeholder consumed downstream by the bootstrap-kit cilium HelmRelease. Escaped both refs with `$$` per Terraform's templatefile escape syntax so the comment renders verbatim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(infra): replace coalesce with conditional in secondary_region_cluster_mesh_name
coalesce errors when every arg is empty (the not-in-mesh path). Switch to a conditional that yields '' when both the per-region override AND var.cluster_mesh_name are empty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7ca4abddd2
|
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) Implements the server side of the Cloudflare KV lease-witness pattern that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/ witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare Workers KV namespace with read-then-CAS-write semantics enforced via the If-Match header — exact contract per K-Cont-3 #1158 report (item d) and the canonical-seams "Cloudflare KV Worker contract" entry. Routes: GET /lease/<slot-url-encoded> → 200 + LeaseState | 404 | 401 PUT /lease/<slot> → 200 + LeaseState | 412 + state | 401 DELETE /lease/<slot> → 204 | 412 | 401 All 7 K-Cont-3 trap behaviors verified by 46 vitest tests: 1. If-Match: 0 = first-acquire-on-empty-slot 2. Generation increments unconditionally (incl. Release) 3. 412 includes current state body 4. TTL eviction is server-authoritative in stamping (Worker doesn't auto-evict — controller's IsHeldBy decides) 5. X-Holder mismatch on DELETE returns 412 (stale region can't evict new primary) 6. Bearer token validation against env-bound allow-list 7. Optional X-Lease-Slot header logged for KV granularity Files: products/continuum/cloudflare-worker/{package.json, tsconfig.json, wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore, DESIGN.md, src/{index,auth,kv,types}.ts, src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts} infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf + README.md .github/workflows/cloudflare-worker-leases-build.yaml (event-driven, NO cron — push-on-paths + PR + workflow_dispatch) Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean. tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle. Per the brief: tofu module ships ready for operator action — no auto-deploy. Operator runbook in DESIGN.md §"Operator runbook — deploy a new Sovereign". 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource) `tofu validate` failed on `cloudflare_workers_secret` — that resource was REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline `bindings = [...]` array on `cloudflare_workers_script` with `type = "secret_text"`). Same security guarantee — encrypted at rest in CF, never visible via dashboard read API once written. `tofu fmt` also wanted versions.tf alignment + the .terraform.lock.hcl pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/ which commits its lock file). Per Inviolable Principle #5 the bearer token value still flows from TF_VAR_bearer_tokens_csv extracted at apply time from a K8s SealedSecret — never inlined here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
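The read-then-CAS-write contract that #1159 describes for PUT can be modelled in a few lines. This is a minimal in-memory sketch, not the Worker's source: the `LeaseState` field names (`holder`, `generation`) and the `LeaseStore` class are hypothetical, and only the behaviors stated above are modelled (If-Match: 0 acquires an empty slot, a mismatch returns 412 with the current state, and every successful write increments the generation).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LeaseState:
    holder: str      # hypothetical field name
    generation: int  # hypothetical field name


class LeaseStore:
    """In-memory stand-in for the KV namespace behind the Worker."""

    def __init__(self) -> None:
        self._slots: Dict[str, LeaseState] = {}

    def get(self, slot: str) -> Tuple[int, Optional[LeaseState]]:
        state = self._slots.get(slot)
        return (200, state) if state else (404, None)

    def put(self, slot: str, holder: str, if_match: int) -> Tuple[int, Optional[LeaseState]]:
        current = self._slots.get(slot)
        current_gen = current.generation if current else 0  # empty slot matches If-Match: 0
        if if_match != current_gen:
            return 412, current  # the 412 carries the current state body
        new_state = LeaseState(holder=holder, generation=current_gen + 1)
        self._slots[slot] = new_state
        return 200, new_state
```

A caller that loses the race gets the winner's state back in the 412 response and can decide whether to retry with the newer generation.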
||
|
|
8988cd9e4f
|
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end: every wizard payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap.
Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched: every existing Sovereign state (omantel, otech*) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The hcloud_firewall + hcloud_ssh_key stay shared (one tenant boundary per Sovereign).
Why hybrid and not a full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive: secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook is required. Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR.
Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/.
Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner): - legacy_no_regions_payload (var.regions=[]) - single_region_entry_does_not_double_provision (len==1) - three_region_mgmt_fsn_hel (EPIC-6 shape) - same_region_duplicates_produce_distinct_keys - non_hetzner_regions_are_filtered_out (oci entries skipped) All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/**. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron. Validation: $ tofu validate Success! The configuration is valid. $ tofu fmt -check -recursive exit=0 $ tofu test tests/multi_region.tftest.hcl... pass run "legacy_no_regions_payload"... pass run "single_region_entry_does_not_double_provision"... pass run "three_region_mgmt_fsn_hel"... pass run "same_region_duplicates_produce_distinct_keys"... pass run "non_hetzner_regions_are_filtered_out"... pass Success! 5 passed, 0 failed. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
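The "{cloudRegion}-{index}" keying from #1131 can be illustrated outside of HCL. The sketch below is an assumption-laden model, not the module's code: the function name is hypothetical, the `provider` and `cloudRegion` field names are inferred from the message, and it assumes the index is the entry's position in the original `regions` list (before non-Hetzner entries are filtered out).

```python
from typing import Dict, List


def secondary_region_keys(regions: List[Dict[str, str]]) -> List[str]:
    """Return for_each-style keys for regions[1+] that target Hetzner."""
    keys = []
    for index, region in enumerate(regions):
        if index == 0:
            continue  # regions[0] stays on the legacy singular path
        if region.get("provider") != "hetzner":
            continue  # non-Hetzner entries (e.g. oci) are skipped
        # Assumption: index is the position in the unfiltered list.
        keys.append(f"{region['cloudRegion']}-{index}")
    return keys
```

Because the index is part of the key, two entries for the same cloud region still produce distinct keys, which is the property the `same_region_duplicates_produce_distinct_keys` scenario names.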
||
|
|
8e312cd244
|
fix(infra/hetzner): strip any-indent comments, gate user_data ≤ 30 KiB at plan-time (#966) (#967)
Live blocker. Provisioning otech114 (deployment 5c3eea37d3aacda6, fsn1) failed at `tofu apply` with: Error: invalid input in field 'user_data' (invalid_input): [user_data => [Length must be between 0 and 32768.]] with hcloud_server.control_plane[0] on main.tf line 309 Hetzner Cloud's HARD 32 KiB cap on user_data was breached after #921 inlined a base64-encoded worker cloud-init (~4.8 KB) into the CP cloud- init for cluster-autoscaler's HCLOUD_CLOUD_INIT key, on top of #827's multi-domain substitutions. Rendered size: ~37 KB. Root cause: the prior strip regex `(?m)^[ ]{0,2}# .*\n` was scoped to indent-0/2 comments only — leaving ~14 KB of indent-6+ comments INSIDE write_files content blocks (e.g. flux-bootstrap.yaml's triplicate Kustomization documentation). Those comments are inert: every write_files entry is YAML / JSON / key=value config (no shell scripts), and parsers ignore `#`-prefixed lines entirely. Changes: 1. New strip regex `(?m)^[ ]*#( |$).*\n` strips ANY-indent comment lines that start with `#` followed by space or EOL. Preserves: - `#cloud-config` line 1 (no space after `#`) - `#!`-shebangs (no space after `#`) - `#pragma`-style directives (`#` followed by non-space non-EOL) Applied to both `local.control_plane_cloud_init` and `local.worker_cloud_init`. 2. Plan-time guardrail via `lifecycle.precondition` on `hcloud_server.control_plane` and `hcloud_server.worker`. Fails plan (not apply) when `length(local.<*>_cloud_init) > 30720` bytes (30 KiB = 32 KiB hard cap minus 10% future-additions buffer). Future bloat- creep that silently re-eats the headroom now fails fast at plan-time BEFORE the network/LB/firewall/SSH-key resources get created. 
Verified rendered sizes (Python simulation of templatefile + strip, substitutions match real otech114 inputs): CP cloud-init: 79404 bytes raw → 21144 bytes stripped (margin: 11624 under hard cap, 9576 under guardrail) Worker cloud-init: 3254 bytes raw → 2410 bytes stripped (b64-encoded for HCLOUD_CLOUD_INIT: 3216 bytes) `#cloud-config` first-line preserved. All 18 write_files entries and 43 runcmd entries parse intact. YAML/JSON/conf contents valid post-strip (comments are documentation only at the file-format level). Closes #966 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
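The any-indent strip regex and the 30 KiB guardrail from #967 can be reproduced in Python for a quick offline check. This is a sketch under assumptions: the function names are hypothetical, Python's `re` stands in for the Tofu regex engine, and the size is measured in UTF-8 bytes.

```python
import re

# Same pattern as quoted in the commit message: any-indent "# ..." or bare "#" lines.
COMMENT_LINE = re.compile(r"(?m)^[ ]*#( |$).*\n")
GUARDRAIL_BYTES = 30720  # 30 KiB, below the 32 KiB hard cap


def strip_comments(rendered: str) -> str:
    """Drop comment-only lines; '#cloud-config', shebangs and '#pragma' survive."""
    return COMMENT_LINE.sub("", rendered)


def check_user_data_size(rendered: str) -> int:
    """Raise if the rendered cloud-init exceeds the guardrail; return its size."""
    size = len(rendered.encode("utf-8"))
    if size > GUARDRAIL_BYTES:
        raise ValueError(f"user_data is {size} bytes, over the {GUARDRAIL_BYTES}-byte guardrail")
    return size


sample = (
    "#cloud-config\n"
    "# top-level comment\n"
    "write_files:\n"
    "  - path: /etc/example.conf\n"
    "    content: |\n"
    "      # deeply indented comment\n"
    "      #\n"
    "      key=value\n"
    "      #!/bin/sh\n"
    "      #pragma keep\n"
)
stripped = strip_comments(sample)
```

Only lines where `#` is followed by a space or end of line are removed, so directive-style lines remain intact.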
||
|
|
d1431bed09
|
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is not specified" on every Sovereign (otech112 evidence). HelmRelease reports Ready=True (Helm install succeeded) but the Pod CrashLoopBackOffs invisibly behind the False-positive condition. Closes #916 — wizard let operators dispatch unbuildable topologies (otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not encode regional orderability. Hetzner rejected the worker creation 41s into `tofu apply` after Phase-0 had already created the CP + network + LB + firewall. Chart fix (issue #921): - Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the umbrella chart (base64-encoded per upstream contract). - Render `hetzner-node-config` Secret unconditionally with both keys so the upstream Deployment's secretKeyRef references resolve cleanly during `helm template` AND in the live cluster regardless of overlay state. - Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto the upstream chart's deployment. - Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps it under `flux-system/cloud-credentials.hcloud-cloud-init`; the bootstrap-kit overlay lifts that key via Flux `valuesFrom` into `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus receive the IDENTICAL bootstrap as the Phase-0 worker fleet. - Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0. - Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's blueprint-release "Run chart integration tests" step. Wizard fix (issue #916): - Add `availableRegions?: string[]` to NodeSize interface; encode cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere new) per Hetzner /v1/server_types vs POST /v1/servers gap. 
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers. - StepProvider filters SKU dropdowns by selected region; auto-swaps current SKU to recommended default when region change drops it out of orderability. - Mirror the matrix Go-side in sku_availability.go; gate `provisioner.Request.Validate()` with same predicate so a stale wizard build OR direct API caller bypassing the UI cannot dispatch otech109's failure mode. - Two-sided enforcement covers both r.Regions[] (multi-region) and the legacy singular path. Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API side. Chart smoke renders + helm template gates the env wiring at publish time. Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
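The orderability predicate from #965 is small enough to sketch. The matrix values below come from the commit message; the Python names are hypothetical, and treating a SKU without an entry as unrestricted is an assumption based on `availableRegions` being optional.

```python
from typing import Dict, List, Optional

# Values quoted in the commit message; an empty list means "orderable nowhere new".
AVAILABLE_REGIONS: Dict[str, List[str]] = {
    "cpx32": ["fsn1", "nbg1", "hel1"],
    "cpx21": [],
    "cpx31": [],
}


def is_sku_available_in_region(sku: str, region: str) -> bool:
    """Return True when the SKU can be ordered in the region."""
    regions: Optional[List[str]] = AVAILABLE_REGIONS.get(sku)
    if regions is None:
        return True  # assumption: no entry means no known restriction
    return region in regions
```

Running the same predicate in both the wizard and the API is what lets a stale UI build or a direct API caller be rejected before `tofu apply` starts.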
||
|
|
2ff50f0591
|
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign): #952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden. Fix: - Templatize spec.imagePullSecrets on Deployment + channel-seed Job. - Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`. - Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign. - Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay. #953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain). Fix: - After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field). - Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug. Tests: - platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4). 
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly. Refs: #952 #953 (umbrella #915 — alice signup gate 5). Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
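The "strict match or refuse" behavior of the smeTag bump in #955 can be sketched as follows. The workflow itself uses sed; this Python version is only an illustration, the function name is hypothetical, and the exact indentation of the real line is an assumption (the pattern accepts one or more leading spaces).

```python
import re

# Strict shape: indented `smeTag: "<hex sha>"` on its own line.
SME_TAG_LINE = re.compile(r'^(?P<indent> +)smeTag: "(?P<sha>[0-9a-f]+)"$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Rewrite the smeTag line, refusing when the line shape is not as expected."""
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        raise ValueError("refusing to auto-bump: expected exactly one smeTag line")
    return SME_TAG_LINE.sub(
        lambda m: f'{m.group("indent")}smeTag: "{new_sha}"', values_yaml
    )
```

Failing loudly when the field is renamed is the point: a silent no-op is what left the live Pods on stale images.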
||
|
|
e08d8721e1
|
fix(pdm/dynadot): pre-register glue records before set_ns (#900) (#906)
Multi-domain Day-2 add-domain on a Sovereign was failing with Dynadot's "'ns1.<sov>.omani.works' needs to be registered with an ip address before it can be used" error. Dynadot rejects set_ns whenever the NS hostnames aren't registered as account-level "host records" first. This change wires the glue pre-registration into the PDM dynadot adapter as an optional registrar.GlueRegistrar interface, threads the Sovereign's load-balancer IPv4 from cloud-init through Flux postBuild into the chart's `global.sovereignLBIP`, and forwards it via catalyst-api's pdmFlipNS to PDM's /set-ns endpoint as a new `glueIP` field. PDM's SetNS handler calls RegisterGlueRecord for each out-of-bailiwick NS before SetNameservers, with idempotent get_ns → register_ns / set_ns_ip semantics so retries are free. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7bfd6df588
|
fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
5 stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a fresh post-handover Sovereign — surfaced live on otech103, 2026-05-05 — plus a 6th gap (ghcr-pull reflector for catalyst-system). All six fixed in one PR so a single chart bump + cloud-init re-render closes the gap end-to-end. Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL= https://pool.openova.io. The in-cluster Service default only resolves on contabo; on Sovereigns every Day-2 POST died with NXDOMAIN. Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env from a new pdm-basicauth Secret, and have pdmFlipNS SetBasicAuth from those envs. The PDM public ingress at pool.openova.io is gated by Traefik basicAuth; calls without Authorization: Basic returned 401. optional=true so contabo + CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable Principle #10, the credentials only ever live in Pod env + are read once per call by pdmFlipNS — never enter a logged struct or persisted record. Bug 3 (code, parent_domains.go): pdmFlipNS body now includes the required nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema requires it; the previous body got 422 missing-nameservers. Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover Sovereign no Deployment record is persisted, so without this fallback GET /parent-domains returned {"items":[]} and the propagation panel showed expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml from the sovereign-fqdn ConfigMap. Bug 5 (chart, httproute.yaml): catalyst-ui /auth/* PathPrefix narrowed to Exact /auth/handover. The previous PathPrefix collided with OIDC PKCE redirect_uri /auth/callback — catalyst-api 404s on that path because it only registers /api/v1/auth/callback, breaking login post-handover-JWT- cookie expiry. 
Exact match keeps /auth/handover routed to catalyst-api while every other /auth/* path falls through to catalyst-ui's React Router for client-side OIDC. Bug 6 (cloud-init): ghcr-pull + harbor-robot-token + new pdm-basicauth Reflector annotations enumerate explicit allowed/auto-namespaces (sme, catalyst, catalyst-system, gitea, harbor) instead of empty-string. The ambiguous empty-string interpretation caused otech103 to require a manual catalyst-system mirror creation; explicit list back-ports the verified working state. Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields + tfvars emission so the contabo catalyst-api can stamp the credentials onto every Sovereign provision request. variables.tf adds matching pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default empty) so older provisioner builds that pre-date this change keep rendering valid cloud-init (the Secret renders with empty values and Pod start is unaffected). Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes the architectural blockers tracked in #879; the catalyst-api image rebuild + chart republish run via the existing CI pipelines (services- build.yaml + blueprint-release.yaml) on this commit's SHA. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e96741a0ca
|
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The operator brings 1+ parent domains at signup (`omani.works` for own use, `omani.trade` for the SME pool, etc.) and may add more post-handover via the admin console (#829). bp-powerdns 1.2.0 (platform/powerdns/chart): - New `zones: []` values key listing parent domains to bootstrap - New Helm post-install/post-upgrade hook Job (templates/zone-bootstrap-job.yaml) that POSTs each entry to /api/v1/servers/localhost/zones at install time. Idempotent on HTTP 409 — re-runs after upgrades or chart bumps never fail. - Default-values render skips when zones is empty (legacy behavior). bp-catalyst-platform 1.4.0 (products/catalyst/chart): - New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}` values - New templates/sovereign-wildcard-certs.yaml renders one cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex) via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert renews independently. Skips entirely when parentZones is empty so the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml retains ownership of `sovereign-wildcard-tls` (avoids helm-vs-kustomize ownership flap). - New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded into the catalyst-api Pod as CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_SERVER_ID env vars. catalyst-api (products/catalyst/bootstrap/api): - New internal/powerdns package with typed Client (CreateZone, ZoneExists). Idempotent on HTTP 409/412. - handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the typed client when wired via SetPowerDNSZoneClient — the admin-console "Add another parent domain" flow now creates real zones in the Sovereign's PowerDNS at runtime. - main.go wires the client when CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_API_KEY are set. - Comprehensive unit tests (client_test.go: 9 cases incl. 201/409/412/500 + custom NS + custom serverID). 
Bootstrap-kit slot integration: - clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from Flux postBuild.substitute. - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bumps to bp-catalyst-platform 1.4.0 and threads `parentZones: ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two slots stay in lockstep). - infra/hetzner: new `parent_domains_yaml` Terraform variable (defaults to single-zone array derived from sovereign_fqdn) → cloud-init renders the PARENT_DOMAINS_YAML Flux substitute. DoD verified end-to-end with helm template + envsubst: - Multi-zone overlay (omani.works + omani.trade) renders 2 PowerDNS zone-create API calls in the bootstrap Job AND 2 Certificate resources (`*.omani.works`, `*.omani.trade`) in bp-catalyst-platform. - Single-zone fallback (PARENT_DOMAINS_YAML defaults to `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy provisioning paths working without per-overlay edits. Closes #827. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
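The idempotent zone bootstrap from #838 boils down to treating "already exists" responses as success. The sketch below assumes a hypothetical `post_zone` callable that returns the HTTP status code for one zone-create call; it does not reproduce the real Job or the typed Go client, and it folds the client's 409/412 handling into one rule.

```python
from typing import Callable, Dict, List


def bootstrap_zones(zones: List[str], post_zone: Callable[[str], int]) -> Dict[str, str]:
    """Create each zone once; re-runs after upgrades must not fail."""
    outcome = {}
    for zone in zones:
        status = post_zone(zone)  # hypothetical: performs the zone-create POST
        if status == 201:
            outcome[zone] = "created"
        elif status in (409, 412):
            outcome[zone] = "exists"  # idempotent: zone already present
        else:
            raise RuntimeError(f"zone {zone}: unexpected HTTP {status}")
    return outcome
```

With an empty `zones` list the loop does nothing, which mirrors the legacy render that skips the Job entirely.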
||
|
|
05065b66d6
|
fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04. GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in those DCs with: {"error":{"code":"invalid_input", "message":"unsupported location for server type"}} Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate DELETE. cpx22 + cpx32 were also probed as a sanity check and returned ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises prices for every (SKU, location) pair regardless of orderability. Conclusion: NO SKU bump-back. cpx22 + cpx32 (PR #744) remain the floor. README + variables.tf docstrings now carry the durable reproducer so future engineers don't re-attempt cpx21/cpx31. #753 — kubectl retry / LKG observer reliability /tmp/autopilot.sh updated (script lives outside the repo, on the VPS): • Every kubectl call carries --request-timeout=8s so a hung TLS handshake surfaces as a fast empty rather than a 30s+ stall. • Last-known-good (LKG) state held across transient flakes: hr/cert/nodes no longer flip to "0/0 nodes=0" on a single failed poll. • Only 3 consecutive transients count as a real failure; below the threshold the observer prints "hr=<LKG> (transient N/3)". UI side: the wizard's StatusPill / ApplicationPage drive off SSE from catalyst-api (useDeploymentEvents.ts), not direct kubectl polling, so no UI change needed. catalyst-api itself uses client-go (helmwatch / phase1_watch), not exec kubectl, so its observer is not subject to the same shell-out flake. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
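The last-known-good behavior described for #753 can be sketched as a small state machine. The real logic lives in a shell script outside the repo, so everything here (class name, method name, the `None`-means-failed-poll convention, the "FAILED" marker) is hypothetical; only the threshold of 3 and the "(transient N/3)" output shape come from the message.

```python
from typing import Optional


class LkgObserver:
    """Hold the last good sample across transient poll failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.last_known_good: Optional[str] = None
        self.transients = 0

    def observe(self, sample: Optional[str]) -> str:
        """Feed one poll result; None stands for a failed or timed-out poll."""
        if sample is not None:
            self.last_known_good = sample
            self.transients = 0
            return sample
        self.transients += 1
        if self.transients >= self.threshold:
            return "FAILED"  # three consecutive transients count as a real failure
        return f"{self.last_known_good} (transient {self.transients}/{self.threshold})"
```

A single successful poll resets the counter, so isolated flakes never surface as a zeroed status.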
||
|
|
e855ab0dfe
|
fix(k3s): taint CP node-role.kubernetes.io/control-plane:NoSchedule when workers exist (#751) (#755)
Root cause of the "apiserver flake / cpx22 too small / 8 stuck HRs"
chain: the k3s server install in cloudinit-control-plane.tftpl set
--node-label but no --node-taint. By k3s default the server node is
fully schedulable, so on a 1-CP + N-worker Sovereign with the
37-HelmRelease bootstrap-kit + guest workloads (bp-keycloak / bp-cnpg /
bp-harbor / bp-catalyst-platform / SME microservices), the scheduler
distributes guest pods onto the CP. They eat its memory and crowd
kubelet/etcd/apiserver; as a result kubectl flakes, Helm post-install
hooks time out, and HelmReleases get stuck mid-reconcile.
Fix: add --node-taint node-role.kubernetes.io/control-plane=true:NoSchedule
to the INSTALL_K3S_EXEC string, so the CP is reserved for system +
bootstrap controllers. cilium agent (DaemonSet) and cilium-operator
default to {operator: Exists} tolerations upstream — they tolerate
the taint and continue to run on the CP. cert-manager and flux2 default
to tolerations: [] — on multi-node Sovereigns they correctly land on
workers, which is the desired separation. Guest workloads do not
tolerate the taint and are pushed to workers where they belong.
Conditional on worker_count > 0: a Catalyst-Zero / solo Sovereign has
only the CP, so tainting NoSchedule there leaves no schedulable node
and the cluster never becomes ready. The Tofu inline ternary
"\${worker_count > 0 ? \"--node-taint ...\" : \"\"}" omits the flag
entirely in solo mode — k3s default (CP fully schedulable) carries
everything.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
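The conditional taint from #755 mirrors a simple ternary. The sketch below models it in Python rather than in the Tofu template; the function name and the example base flags are hypothetical, while the taint flag itself is the one quoted in the message.

```python
TAINT_FLAG = "--node-taint node-role.kubernetes.io/control-plane=true:NoSchedule"


def k3s_exec_flags(base_flags: str, worker_count: int) -> str:
    """Append the CP taint only when at least one worker exists."""
    taint = TAINT_FLAG if worker_count > 0 else ""
    # Solo mode omits the flag entirely so the CP stays schedulable.
    return " ".join(part for part in (base_flags, taint) if part)
```

For example, `k3s_exec_flags("server --node-label example=cp", 0)` returns the base flags unchanged, while any positive worker count appends the taint.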
||
|
|
ceeefd7829
|
fix(cloud-init): quote MARKETPLACE_ENABLED so postBuild.substitute is map[string]string (#746)
ROOT CAUSE FOUND for the post-PR-#710 zero-touch handover stall (otech85
through otech89). Cloud-init template emitted:
postBuild:
substitute:
SOVEREIGN_FQDN: otech89.omani.works
MARKETPLACE_ENABLED: false ← UNQUOTED YAML BOOL
Tofu interpolates `${marketplace_enabled}` (a string variable holding
"true"|"false") into the rendered cloud-init. Without quotes, kubectl's
YAML parser converts `false`/`true` into a bool, so the rendered
Kustomization manifest violates the kustomize.toolkit.fluxcd.io/v1
postBuild.substitute schema (map[string]string).
Live evidence on otech89 (and earlier otech85-88 with same SHA):
GitRepository CRD apply → succeeds (no postBuild, no schema issue)
3× Kustomization apply → silently rejected by validator
flux-system kustomize-controller has 0 reconciliable Kustomizations
bootstrap-kit never lands → 0 HRs ever Ready → wizard stalls forever
Quote the value: `MARKETPLACE_ENABLED: "${marketplace_enabled}"` so it
renders as `MARKETPLACE_ENABLED: "false"` (string) and passes the CRD
validator.
This is the bug that has been blocking the 2-cycle zero-touch
verification since PR #719 introduced MARKETPLACE_ENABLED. Six
provisioning cycles were burned (otech85-89 + retries) chasing it.
Closes #733 cycle-verification (the SKU work itself was correct
end-to-end).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
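The map[string]string constraint behind #746 can be checked mechanically once the YAML is parsed. In the sketch below the dict literals stand in for what a YAML parser would yield for the unquoted and quoted forms; the function name is hypothetical and no YAML library is involved.

```python
from typing import Any, Dict, List


def non_string_substitutes(substitute: Dict[str, Any]) -> List[str]:
    """Return the keys whose values would violate map[string]string."""
    return sorted(key for key, value in substitute.items() if not isinstance(value, str))


# Unquoted `MARKETPLACE_ENABLED: false` parses to a bool.
unquoted = {"SOVEREIGN_FQDN": "otech89.omani.works", "MARKETPLACE_ENABLED": False}
# Quoted `MARKETPLACE_ENABLED: "false"` parses to a string.
quoted = {"SOVEREIGN_FQDN": "otech89.omani.works", "MARKETPLACE_ENABLED": "false"}
```

A check like this on the rendered manifest would flag the offending key before the apply is attempted.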
||
|
|
468c3badf8
|
fix(cloud-init): tolerate Crossplane Provider apply failure + retry in background (#745)
Live observation on otech88 (DID b2c528023b50ec45, 2026-05-04
11:40:42Z): the new Sovereign's flux-system reaches Ready (GitRepository
artifact stored, all 6 Flux deployments Available) but no Kustomization
CRs appear — kustomize-controller has nothing to reconcile and
hr=True=0/0 forever.
The cloud-init runcmd applies in this order:
1. cloud-credentials-secret.yaml
2. crossplane-provider-hcloud.yaml — `pkg.crossplane.io/v1 Provider`
CRD doesn't exist yet (bp-crossplane is installed by Flux below),
so this apply errors with "no matches for kind Provider in version
pkg.crossplane.io/v1"
3. flux-bootstrap.yaml — should apply 1× GitRepository + 4×
Kustomization
Empirically, only the GitRepository lands. The four Kustomization
documents in the same multi-doc YAML are not created. The exact
mechanism of failure is on-host (cloud-init runcmd output is at
/var/log/cloud-init-output.log on the Sovereign — out of reach per
"no SSH" rule), but the symptom is consistent across otech87 and
otech88 reprovisions on the new cost-optimised SKUs.
This patch is a belt-and-braces hardening:
1. Tolerate the Crossplane Provider apply's failure (`|| true`) so
the runcmd cannot propagate a non-zero exit through to whatever
downstream step is failing.
2. Add a background retry for the Crossplane Provider CR. Polls
every 30s up to 30m for the Provider CRD to appear (i.e.
bp-crossplane reconciled by Flux), then `kubectl apply` succeeds
and the loop exits. Detached via `&` so cloud-init runcmd
completes without waiting for Crossplane to be Ready.
The intent is to remove any chance the Provider apply blocks Flux
bootstrap. If Kustomizations still don't appear after this fix, the
root cause is elsewhere and a follow-up patch will land.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
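The retry loop that #745 adds (poll every 30s for up to 30m until the apply succeeds) has roughly the following shape. This is a sketch, not the cloud-init shell: the function name and the `apply` callable are hypothetical, the clock and sleep are injectable so the loop can be exercised without waiting, and the sketch runs in the foreground whereas the real step is detached with `&`.

```python
import time
from typing import Callable


def retry_until(
    apply: Callable[[], bool],
    interval_s: float = 30.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call apply() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if apply():  # hypothetical: True once the Provider CR applies cleanly
            return True
        if clock() + interval_s > deadline:
            return False  # give up; the next attempt would land past the deadline
        sleep(interval_s)
```

Keeping the first apply tolerant of failure and moving the retries out of the critical path is what stops this step from blocking the Flux bootstrap.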
||
|
|
b02fc3788a
|
fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z). After PR #742 fixed the empty SKU strings in tfvars, the next blocker appeared: writeTfvars was emitting `"regions": null` (Go nil slice marshals to JSON null) when the request had no per-region overrides. OpenTofu's variables.tf carries a validation block: validation { condition = alltrue([ for r in var.regions : contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider) ]) } The `for r in var.regions` iteration fails on null with: Error: Iteration over null value on variables.tf line 217, in variable "regions": The variables.tf default `[]` is what the validator expects; emit that shape explicitly via a coalesceRegions(req.Regions) helper that turns nil into an empty slice. Operator overrides round-trip unchanged. Tests: - TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions serialises as JSON `[]`, never `null`, when the request has no per-region overrides. Builds on PR #742. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving) Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the cpx21 CP default from PR #741 fell apart at apply time — Error: Server Type "cpx21" is unavailable in "fsn1" and can no longer be ordered Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog (`/v1/server_types`) but are NOT in the per-DC orderable list (`available_for_migration` on `/v1/datacenters`) for any EU DC (fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on for new Sovereigns in those regions. 
Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
  • cpx11 (2 vCPU / 2 GB): too small for the CP working set
  • cpx22 (2 vCPU / 4 GB): fits the CP working set, ~€9.49/mo fsn1
  • cpx32 (4 vCPU / 8 GB): smallest 8 GB worker, ~€16.49/mo fsn1
  • cpx42, cpx52, cpx62: bigger and more expensive

New default per Sovereign:

| Component     | Old             | New             | Savings |
|---------------|-----------------|-----------------|---------|
| Control plane | CPX32 (€16.49)  | CPX22 (€9.49)   | €7.00   |
| Worker × 2    | CPX32 × 2 (€33) | CPX32 × 2 (€33) | €0      |
| TOTAL         | €49.47/mo       | €42.47/mo       | 14%     |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo) assumed those SKUs were orderable. They aren't in EU DCs. The 14% saving from cpx22 CP is the largest concrete optimisation that ships TODAY without compromising the multi-node horizontal-scale agreement (issue #733): still 1 CP + 2 workers from day one.

Files changed:
- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size default cpx31 → cpx32 (back to the prior orderable choice)
- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49). Mark both as "listed but NOT orderable in EU DCs" so the wizard surfaces the constraint instead of letting operators pick a non-orderable SKU. Move recommended:true from CPX21 → CPX22. defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').
- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh: names the new orderable defaults.
- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
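The `regions: null` versus `regions: []` distinction fixed in #744 has a direct analogue in any JSON encoder. The sketch below uses Python's `None` in place of a Go nil slice; `coalesce_regions` mirrors the helper named in the message, while `render_tfvars` is a hypothetical stand-in for the tfvars writer.

```python
import json
from typing import List, Optional


def coalesce_regions(regions: Optional[List[dict]]) -> List[dict]:
    """Turn a missing regions value into an empty list; pass overrides through."""
    return [] if regions is None else regions


def render_tfvars(regions: Optional[List[dict]]) -> str:
    """Serialise the regions field the way the validator expects."""
    return json.dumps({"regions": coalesce_regions(regions)})
```

Without the coalescing step the encoder emits `null`, and the `for r in var.regions` validation then fails on iteration over a null value.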
||
|
|
994c2d1c2a
|
fix(provisioner): cost-optimized default sizes — cpx21 CP + cpx31 workers (38% saving) (#741)
The new Sovereign default after PR #736 / #738 / #739 was 1× CPX32 control plane + 2× CPX32 workers: €33/mo per Sovereign. CPX32 is over-provisioned for the CP working set: the CP carries only k3s (apiserver/etcd/scheduler/controller-manager) + cilium-operator + flux controllers + cert-manager + sealed-secrets, NOT the heavy bp-keycloak/cnpg/harbor/openbao/grafana stack (those land on workers because the bootstrap-kit explicitly schedules them off the CP taint).

CP RAM budget: etcd ~512 MB + control plane ~1.5 GB + cilium/flux/cert-manager/sealed-secrets ~1 GB + OS ~512 MB ≈ 3.5 GB, which fits CPX21's 4 GB. Workers stay at 8 GB on CPX31 since RAM is the binding constraint for the bootstrap-kit's worker pods, not vCPU.

New default per Sovereign:

| Component     | Old             | New             | Savings |
|---------------|-----------------|-----------------|---------|
| Control plane | CPX32 (€11/mo)  | CPX21 (€5.5/mo) | €5.5    |
| Worker × 2    | CPX32 × 2 (€22) | CPX31 × 2 (€15) | €7      |
| TOTAL         | €33/mo          | €20.5/mo        | 38%     |

Multi-node horizontal-scale agreement (issue #733) preserved: still 1 CP + 2 workers minimum from day one.

Files changed:
- infra/hetzner/variables.tf
  control_plane_size default cpx32 → cpx21
  worker_size default cpx32 → cpx31
  Validation regex unchanged (cxNN | cpxNN | ccxNN | caxNN).
- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Add CPX11, CPX21, CPX31 catalog entries. Move recommended:true from CPX32 → CPX21 (control-plane default). Add defaultWorkerSizeId(): Hetzner returns 'cpx31', other providers fall through to defaultNodeSizeId() symmetric default.
- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  First-visit useEffect + handleSelectProvider now call defaultWorkerSizeId(provider) for the worker SKU instead of mirroring the CP SKU. Comment updated naming the cost-optimised pair.
- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx32'] → ['cpx21'].
If a Sovereign exhibits CP RAM pressure with this default, the next safe stop UP is cpx31 (4 vCPU / 8 GB, ~€7.5/mo), never back to cpx32.

Closes #740.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e085a68585
|
fix(k3s): add 10.0.1.2 to --tls-san so Cilium can verify CP cert from workers (#739)
Issue #733 follow-up #2. After #738 changed Cilium's k8sServiceHost
from 127.0.0.1 to the CP private IP 10.0.1.2, Cilium's TLS verification
fails with:
Get "https://10.0.1.2:6443/api/v1/namespaces/kube-system":
tls: failed to verify certificate: x509: certificate is valid for
10.43.0.1, 127.0.0.1, 178.104.211.206, 2a01:..., ::1, not 10.0.1.2
k3s auto-generates the apiserver TLS cert with SANs covering the public
IP, the cluster service IP (10.43.0.1), and localhost — but NOT the
private subnet IP 10.0.1.2. Adding `--tls-san=10.0.1.2` to the k3s
server install command makes the cert valid for the address Cilium
(and any other in-cluster client) reaches the apiserver via.
The sovereign FQDN is already in --tls-san; this just adds the
private subnet anchor that the multi-node Cilium config in #738
introduced.
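The SAN mismatch can be illustrated with a small check. This is a simplified sketch, not how Go's x509 verifier is implemented: it only compares IP literals, and the SAN list is copied from the error above (the IPv6 entry is elided in the source and is kept as-is, so it is skipped as unparseable). The function name is hypothetical.

```python
import ipaddress


def covered_by_sans(address, sans):
    """Return True if `address` equals one of the IP entries in `sans`."""
    target = ipaddress.ip_address(address)
    for san in sans:
        try:
            if ipaddress.ip_address(san) == target:
                return True
        except ValueError:
            # Not a parseable IP literal (DNS name or elided entry): skip it.
            continue
    return False


# SANs reported in the x509 error before the fix.
SANS_BEFORE = ["10.43.0.1", "127.0.0.1", "178.104.211.206", "2a01:...", "::1"]
# Effect of adding --tls-san=10.0.1.2.
SANS_AFTER = SANS_BEFORE + ["10.0.1.2"]

if __name__ == "__main__":
    print(covered_by_sans("10.0.1.2", SANS_BEFORE))
    print(covered_by_sans("10.0.1.2", SANS_AFTER))
```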
Verified live on otech51 (deploy SHA
|
||
|
|
69de64ba19
|
fix(cilium): k8sServiceHost 127.0.0.1 → 10.0.1.2 so workers' Cilium can reach apiserver (#738)
Issue #733 follow-up. The default cpx32 multi-node Sovereign (1 CP + 2 workers) provisioned successfully, but worker nodes stuck NotReady because cilium-agent on workers crashloop'd:

Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system":
dial tcp 127.0.0.1:6443: connect: connection refused

Root cause: `k8sServiceHost: 127.0.0.1` works on the k3s SERVER node (supervisor binds localhost:6443) but FAILS on every k3s AGENT node (agent does NOT expose apiserver on localhost; only the supervisor on :6444). Pre-#733 every Sovereign was solo (worker_count=0), so this never fired.

Fix: point Cilium at `10.0.1.2`, the CP's stable private IP on the Sovereign's 10.0.1.0/24 subnet (cp1=10.0.1.2 per main.tf network block). No-op on the CP (10.0.1.2 IS its own private IP) and works on workers (which already join the cluster via the same address per cloudinit-worker.tftpl `K3S_URL=https://${cp_private_ip}:6443`).

Files:
- infra/hetzner/cloudinit-control-plane.tftpl: bootstrap helm install values file written to /var/lib/catalyst/cilium-values.yaml
- platform/cilium/chart/values.yaml: Flux bp-cilium HelmRelease values (cilium_values_parity_test.go enforces the two stay aligned)

Verified live on otech50: 3× CPX32 servers running, 1 CP Ready, 2 workers registered with k3s but NotReady due to cilium init failure. After this fix workers should reach Ready, and the Phase-1 watcher sees all components Ready=True across the multi-node cluster.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7ec25b9736
|
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single CPX52 control-plane and zero workers, completely discarding horizontal scalability. Restore the originally agreed shape: 1 CPX32 control plane + 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total; same aggregate footprint as a CPX52 vertical-scale, but with multi-node fault tolerance and the architectural shape clusters/_template/ was designed for).

Changes:
- infra/hetzner/variables.tf: defaults control_plane_size cx42→cpx32, worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf: add hcloud_load_balancer_target.workers so the Hetzner LB targets every node (CP + workers); Cilium Gateway DaemonSet on every node serves ingress on its NodePort, so any node can absorb traffic for genuine horizontal scale.
- infra/hetzner/README.md: sizing rationale rewritten around horizontal scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model: INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider: first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes: `recommended: true` flag moves cpx52→cpx32; CPX52 description updated to "solo dev when worker_count=0".

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit workerCount: 0 keep working; only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path and is unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
4946ccd125
|
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.
Changes
=======
products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
/ → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
*.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
{{ if .Values.ingress.marketplace.enabled }} so non-marketplace
Sovereigns render the chart unchanged
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}
infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations
products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"
core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
resolves via PDM at zone-commit time (PR #710 explicit record so
caches don't depend on the *.<sov> wildcard alone)
DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
resources: 13 sme-services workloads + 2 marketplace-api + 1
HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
6f3e15b1ec
|
fix(handover): provision JWK Secret on Sovereign + inject SOVEREIGN_FQDN env (Phase-8b followup) (#692)
Two handover bugs caught live on otech48 (2026-05-03):

1. Sovereign-side catalyst-api responded to GET /auth/handover with "server misconfiguration: public key unavailable". Root cause: the K8s Secret `catalyst-handover-jwt-public` (referenced by the chart's optional Secret-volume) was never materialised on the Sovereign, so the optional volume mount fell through and the JWK file was absent inside the container. 1.2.0 wired the mount but no provisioning step created the Secret.

   Fix mirrors the canonical pattern from PR #543 (ghcr-pull) and PR #680 (harbor-robot-token): cloud-init now writes the Secret manifest into catalyst-system NS and runcmd applies it BEFORE flux-bootstrap, so the Secret exists by the time bp-catalyst-platform reconciles.

   Also moves the chart volume mount off the catalyst-api PVC (mountPath /etc/catalyst/handover-jwt-public, no subPath) so a leftover empty directory in the PVC from pre-#606 installs cannot collide with the re-provisioned Secret mount.

2. /auth/handover validator rejected every valid JWT with 401 "invalid audience" because SOVEREIGN_FQDN was unset on Sovereigns; the audience check collapsed to the literal "https://console." prefix. The bp-catalyst-platform HelmRelease overlay was already setting `global.sovereignFQDN` but the chart template never plumbed it through to the Pod env. Added a SOVEREIGN_FQDN env reading `.Values.global.sovereignFQDN` (default "" so Catalyst-Zero installs, where catalyst-api is the SIGNER not the validator, stay clean).

Bumps:
- bp-catalyst-platform 1.2.4 -> 1.2.5
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease pin

Will be verified live on otech49: fresh provision should reach https://console.otech49.omani.works/auth/handover?token=... and exchange to a Keycloak session WITHOUT manual Secret creation.

Issue #606 followup.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d0b574bd68
|
fix(hetzner-tofu): add powerdns_api_key to templatefile() vars (#687)
PR #686 added var.powerdns_api_key to variables.tf and referenced it as ${powerdns_api_key} in cloudinit-control-plane.tftpl, but missed wiring it into the templatefile() vars dict in main.tf. Result on otech48:

Invalid value for "vars" parameter: vars map does not contain key
"powerdns_api_key", referenced at ./cloudinit-control-plane.tftpl:273

This commit closes the gap: powerdns_api_key now flows from var -> templatefile vars -> cloud-init -> Secret manifest.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
684759564e
|
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns even with all of:
- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed to bind or apply socket options: cannot bind '0.0.0.0:80': Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB program in a way that does not honour container capabilities. Even PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets "Permission denied". Cilium 1.19.3 → 1.16.5 made no difference (F1, PR #684 still ships: the version bump is sound for other reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets the Hetzner LB do the public-facing port translation:

HCLB :80 → CP node :30080 (cilium-gateway HTTP listener)
HCLB :443 → CP node :30443 (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover` on port 443; the high port is invisible. High-port bind succeeds without NET_BIND_SERVICE because the kernel only gates ports below `net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve console.otech48/auth/handover end-to-end without the 502/timeout chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup.

The new bp-cert-manager-powerdns-webhook (PR #681) calls contabo's authoritative PowerDNS at pdns.openova.io to write DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook needs an X-API-Key Secret in the Sovereign's cert-manager namespace; PR #681 didn't ship the materialization seam, so on otech43..otech47 the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on openova-system/powerdns-api-credentials extended from "external-dns" to "external-dns,catalyst" so contabo catalyst-api can mount the API key.
2. bp-powerdns: api.basicAuth.enabled flips default true to false. Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that blocked machine-to-machine API access from Sovereigns. The X-API-Key contract is unchanged.
3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key secret (optional=true so Sovereign-side catalyst-api Pods that don't reflect this still start clean).
4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every Request before Validate(). Forwards as tofu var powerdns_api_key.
5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive, default "").
6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct dynadot-api-credentials Secret block (PR #681 dropped bp-cert-manager-dynadot-webhook) with a new cert-manager/powerdns-api-credentials Secret block. runcmd applies it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
369c229408
|
fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget (#685)
cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns even with all of:
- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed to bind or apply socket options: cannot bind '0.0.0.0:80': Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB program in a way that does not honour container capabilities. Even PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets "Permission denied". Cilium 1.19.3 → 1.16.5 made no difference (F1, PR #684 still ships: the version bump is sound for other reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets the Hetzner LB do the public-facing port translation:

HCLB :80 → CP node :30080 (cilium-gateway HTTP listener)
HCLB :443 → CP node :30443 (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover` on port 443; the high port is invisible. High-port bind succeeds without NET_BIND_SERVICE because the kernel only gates ports below `net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve console.otech48/auth/handover end-to-end without the 502/timeout chain seen on otech45–47.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
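As a rough illustration of the port translation described in this message, the sketch below models only the kernel's unprivileged-port threshold and the LB mapping. It does not model the BPF socket-LB interception the message blames for the failure, and the names are hypothetical.

```python
# Default value of net.ipv4.ip_unprivileged_port_start, per the commit message.
UNPRIVILEGED_PORT_START = 1024

# Public Hetzner LB port -> node listener port, as listed in the message.
LB_TO_NODE_PORT = {80: 30080, 443: 30443}


def is_privileged(port, unprivileged_start=UNPRIVILEGED_PORT_START):
    """True if binding `port` is gated by the kernel's privileged-port check."""
    return port < unprivileged_start


def node_port_for(public_port):
    """Node port the LB forwards a public port to."""
    return LB_TO_NODE_PORT[public_port]


if __name__ == "__main__":
    for public in sorted(LB_TO_NODE_PORT):
        node = node_port_for(public)
        print(f"HCLB :{public} -> node :{node} (privileged: {is_privileged(node)})")
```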
||
|
|
affcf37923
|
fix(bp-catalyst-platform): provision harbor-robot-token automatically on Sovereign install (RCA + permanent fix) (#680)
Caught live on otech43–46: manual placeholder Secret was being created each iteration.

RCA: The catalyst-api Pod template references the `harbor-robot-token` Secret via a REQUIRED (non-optional) secretKeyRef. On Sovereign clusters that Secret was never materialised; only `ghcr-pull` had the canonical cloud-init + Reflector auto-mirror seam (PR #543). The chart's old comment said "Reflector mirrors from openova-harbor namespace into catalyst" but `openova-harbor` doesn't exist on Sovereigns; that namespace lives only on contabo where the central Harbor source Secret is administered. Result: every fresh Sovereign's catalyst-api Pod stuck in CreateContainerConfigError until the operator hand-created a placeholder Secret.

The token VALUE was already arriving on the Sovereign: Tofu var.harbor_robot_token is interpolated into /etc/rancher/k3s/registries.yaml at cloud-init time so containerd can authenticate against harbor.openova.io. We just never materialised the same value as a Kubernetes Secret for catalyst-api to mount.

Permanent fix mirrors the canonical `ghcr-pull` seam:

1. infra/hetzner/cloudinit-control-plane.tftpl write_files block emits /var/lib/catalyst/harbor-robot-token-secret.yaml, a Secret in flux-system ns with auto-mirror Reflector annotations (`reflection-auto-enabled: "true"`).
2. runcmd applies it BEFORE flux-bootstrap, so the Secret exists before any Helm release reconciles.
3. bp-reflector (slot 05a, already deployed) propagates the Secret into every namespace (including catalyst-system) on first reconcile tick. catalyst-api's secretKeyRef resolves cleanly, Pod starts.
4. Token rotation flows through `var.harbor_robot_token` → re-render Tofu → re-apply cloud-init; Reflector propagates the rotation to all mirrored copies on the next watch tick.

`harbor-robot-token` stays NOT optional in the chart: the architecture mandate is every Sovereign image pull goes through harbor.openova.io; falling through to docker.io is forbidden (anonymous rate-limit makes a fresh Hetzner IP unbootable). A missing token must surface immediately as Pod start failure, never silently mid-provision.

Bumps:
- bp-catalyst-platform 1.2.2 → 1.2.3 (chart-side change is a comment-only update on the secretKeyRef explaining the new seam; the Pod spec still references the same Secret name and key).
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease version pin → 1.2.3.

No bootstrap-kit dependency changes: bp-reflector's slot-05a position is unchanged and was already a dependency for ghcr-pull. No expected-bootstrap-deps.yaml edits needed.

Issue #557 follow-up. Closes the per-Sovereign manual workaround.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
dd4148acb6
|
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on otech45: TCP to LB:443 succeeds, but TLS handshake never completes.

Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match what cilium-envoy actually listens on (verified via /proc/net/tcp on the cilium-envoy pod: port 12869 not in listening sockets). The nodePort indirection (31443→envoy:12869) is broken at the redirect step.

Fix: bind cilium-envoy directly to the host's :80 and :443 via gatewayAPI.hostNetwork.enabled=true. Hetzner LB forwards public 80→private:80 and 443→private:443 directly (no nodePort indirection).

Two coordinated changes:
1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)

bp-cilium chart bumped to 1.1.5.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
1734979d74
|
fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)
* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)
The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):
componentGroups.ts Flux HelmRelease.dependsOn
---------------------- ---------------------------
keycloak: [cnpg] keycloak: [cert-manager, gateway-api]
openbao: [] openbao: [spire, gateway-api, cnpg]
harbor: [cnpg, seaweedfs, harbor: [cnpg, cert-manager,
valkey] gateway-api]
Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03
This commit:
1. Adds scripts/generate-blueprint-deps.sh that parses every
bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
keyed by bare component id (bp- prefix stripped on both source
and target side).
2. Commits the generated JSON.
3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
4. Patches componentGroups.ts so every RAW_COMPONENT's
`dependencies` field is OVERRIDDEN at module load with the
Flux-canonical list (the inline `dependencies: [...]` literals
are now ignored — Flux is canonical).
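The mapping from HelmReleases to the generated dependency table could look roughly like the sketch below. The real generator is the shell script named above; this Python version is an illustration that assumes the HelmRelease manifests are already parsed into dicts, and the function names and the sorting of dependency lists are assumptions.

```python
def strip_bp(name):
    """Drop the 'bp-' prefix so ids match the wizard's bare component ids."""
    return name[3:] if name.startswith("bp-") else name


def blueprint_deps(helm_releases):
    """Map bare component id -> bare dependency ids taken from spec.dependsOn."""
    deps = {}
    for release in helm_releases:
        component = strip_bp(release["metadata"]["name"])
        depends_on = release.get("spec", {}).get("dependsOn", [])
        deps[component] = sorted(strip_bp(dep["name"]) for dep in depends_on)
    return deps


if __name__ == "__main__":
    # Example input mirroring the keycloak row in the comparison above.
    releases = [
        {
            "metadata": {"name": "bp-keycloak"},
            "spec": {"dependsOn": [{"name": "bp-cert-manager"},
                                   {"name": "bp-gateway-api"}]},
        },
        {"metadata": {"name": "bp-cert-manager"}, "spec": {}},
    ]
    print(blueprint_deps(releases))
```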
Follow-ups (not in this PR):
- CI drift check that re-runs the script and diffs the JSON.
- Strip the inline `dependencies: [...]` arrays entirely once the
drift check is green.
- Wire the FlowPage edge-rendering to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT
PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
keycloak: ['cert-manager', 'openbao'] ← FALSE; Flux says no openbao
This is why the founder kept seeing the spurious arrow on the Flow page.
Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): don't regress status to pending after exec started
helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).
Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.
Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
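A minimal sketch of the monotonic guard, assuming string statuses and a per-component execution id. The actual fix lives in the Go bridge (helmwatch_bridge.go); the names below are illustrative and not taken from that file.

```python
PENDING = "pending"
RUNNING = "running"
SUCCEEDED = "succeeded"
FAILED = "failed"


def next_job_status(proposed, active_exec_id):
    """Refuse to regress to pending once an Execution has been allocated.

    `proposed` is the status derived from the latest HelmRelease event;
    `active_exec_id` is empty until an Execution exists for the component.
    """
    if proposed == PENDING and active_exec_id:
        # Flux is oscillating, but logs are already streaming: show Running.
        return RUNNING
    return proposed


if __name__ == "__main__":
    print(next_job_status(PENDING, ""))        # no execution yet
    print(next_job_status(PENDING, "exec-1"))  # execution already started
```

Terminal states pass through untouched, so a later Failed or Succeeded event still updates the row.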
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): cascade Failed status through dependsOn (fail-fast)
Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.
Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.
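The sweep can be sketched as a bounded fixed-point pass over the dependency graph. This is an illustration under assumptions: statuses are plain strings, only pending jobs are overwritten, and the function name is hypothetical rather than the code in deriveTreeView.

```python
def cascade_failed(status, depends_on, max_sweeps=8):
    """Return a copy of `status` where pending jobs with a failed upstream fail.

    `status` maps job -> status string; `depends_on` maps job -> upstream jobs.
    The input is never mutated, so the derivation can be re-run on each read.
    """
    result = dict(status)
    for _ in range(max_sweeps):
        changed = False
        for job, upstreams in depends_on.items():
            blocked = any(result.get(up) == "failed" for up in upstreams)
            if result.get(job) == "pending" and blocked:
                result[job] = "failed"
                changed = True
        if not changed:
            break  # fixed point reached before the sweep limit
    return result


if __name__ == "__main__":
    live = {"openbao": "failed",
            "external-secrets": "pending",
            "external-secrets-stores": "pending"}
    graph = {"external-secrets": ["openbao"],
             "external-secrets-stores": ["external-secrets"]}
    print(cascade_failed(live, graph))
```

Because the result is derived from the live snapshot each time, a recovered upstream simply stops triggering the cascade on the next read.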
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'
Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
failed to create fsnotify watcher: too many open files
Flux mapped this to InstallFailed → RetriesExceeded, which cascaded
through external-secrets and external-secrets-stores. The wizard masked
the OS-level root cause behind a generic InstallFailed.
Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.
Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576
fs.inotify.max_queued_events = 16384
Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.
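For illustration, the three limits can be rendered into sysctl.d syntax and compared against a node's current values as sketched below. The limit values come from this message; the helper names are hypothetical, and reading the live values (for example from /proc/sys/fs/inotify/) is left out.

```python
# Limits named in the commit message.
REQUIRED_INOTIFY = {
    "fs.inotify.max_user_instances": 8192,
    "fs.inotify.max_user_watches": 1048576,
    "fs.inotify.max_queued_events": 16384,
}


def render_sysctl_conf(values=REQUIRED_INOTIFY):
    """Render `key = value` lines in the format sysctl.d files use."""
    return "".join(f"{key} = {value}\n" for key, value in values.items())


def below_required(current, required=REQUIRED_INOTIFY):
    """List the keys whose current value is lower than the required one."""
    return sorted(key for key, minimum in required.items()
                  if current.get(key, 0) < minimum)


if __name__ == "__main__":
    print(render_sysctl_conf(), end="")
    # 128 is the max_user_instances default the message reports for the image.
    stock = dict(REQUIRED_INOTIFY, **{"fs.inotify.max_user_instances": 128})
    print(below_required(stock))
```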
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
40ca4e4d50
|
fix(infra): registries.yaml mirror needs rewrite — Harbor proxy is /v2/proj/, not /proj/v2/ (#640)
* fix(infra): break tofu cycle, resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:

hcloud_server.control_plane.user_data → local.control_plane_cloud_init
local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:

Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:

curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP; preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:

- main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with:

Error: Invalid function argument
on main.tf line 170, in locals:
170: control_plane_cloud_init = replace(templatefile(...
Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
Fix:
- main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign; degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie; it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth: the operator's "Download kubeconfig" button needs the session cookie.

Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.
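The bearer-hash check that gates the kubeconfig postback can be sketched as follows. The message only states that the server verifies a SHA-256-hashed bearer token; the hex encoding, the `Authorization: Bearer` header shape, and the function names below are assumptions made for illustration, not the PutKubeconfig implementation.

```python
import hashlib
import hmac


def hash_token(token: str) -> str:
    """SHA-256 of the token, hex-encoded (the encoding is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def bearer_matches(authorization_header: str, stored_hash: str) -> bool:
    """Check a presented bearer token against the stored hash.

    The server keeps only the hash; the comparison is constant-time.
    """
    prefix = "Bearer "
    if not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    return hmac.compare_digest(hash_token(presented), stored_hash)


if __name__ == "__main__":
    stored = hash_token("example-postback-token")  # hypothetical token value
    print(bearer_matches("Bearer example-postback-token", stored))
    print(bearer_matches("Bearer something-else", stored))
```

With a check like this inside the handler, the route needs no session middleware in front of it, which is the point of moving the PUT out of the auth group.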
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu, never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml; containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:

Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch.

Caught live during otech24.

Wiring (mirrors the GHCRPullToken pattern):

1. Provisioner.HarborRobotToken: read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
3. Request.HarborRobotToken: server-stamped (json:"-"); never accepted from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor: Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up.

variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision(); Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned

Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite, Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like

https://harbor.openova.io/proxy-dockerhub

hoping containerd would build URLs like

https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at

https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6

(project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with:

"unexpected media type text/html for sha256:... not found"

and pause-image / cilium / coredns pulls fail forever; caught live during otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed.
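The two URL shapes can be compared side by side with a small sketch. It only builds strings from the examples quoted above; the function names are hypothetical, and this is not the registries.yaml `rewrite` configuration itself.

```python
HARBOR_HOST = "harbor.openova.io"


def proxy_manifest_url(project, repository, reference, host=HARBOR_HOST):
    """Shape Harbor proxy-cache projects serve: /v2/<proj>/<repo>/manifests/<ref>."""
    return f"https://{host}/v2/{project}/{repository}/manifests/{reference}"


def path_prefixed_url(project, repository, reference, host=HARBOR_HOST):
    """Shape produced by a path-prefixed mirror endpoint: /<proj>/v2/<repo>/..."""
    return f"https://{host}/{project}/v2/{repository}/manifests/{reference}"


if __name__ == "__main__":
    args = ("proxy-dockerhub", "rancher/mirrored-pause", "3.6")
    print("works:", proxy_manifest_url(*args))
    print("fails:", path_prefixed_url(*args))
```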
Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
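The two URL shapes contrasted in the registries.yaml fix can be reproduced with a small Go sketch. It is illustrative only: the function names are hypothetical and nothing here is taken from the repository; the host, project, and image values are the ones quoted in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// harborProxyManifestURL builds the manifest URL in the shape the commit
// message reports Harbor proxy-cache projects actually serve:
//
//	https://<host>/v2/<project>/<repository>/manifests/<reference>
//
// Hypothetical helper, for illustration only.
func harborProxyManifestURL(host, project, repository, reference string) string {
	return fmt.Sprintf("https://%s/v2/%s/%s/manifests/%s",
		host, strings.Trim(project, "/"), strings.Trim(repository, "/"), reference)
}

// prefixedMirrorManifestURL builds the shape that results when the mirror
// endpoint itself carries the project as a path prefix, which is the layout
// the commit message describes as returning Harbor's UI HTML instead of a
// manifest. Hypothetical helper, for illustration only.
func prefixedMirrorManifestURL(endpoint, repository, reference string) string {
	return fmt.Sprintf("%s/v2/%s/manifests/%s",
		strings.TrimRight(endpoint, "/"), strings.Trim(repository, "/"), reference)
}
```

Calling `harborProxyManifestURL("harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6")` yields the URL that the commit verified with curl, while `prefixedMirrorManifestURL("https://harbor.openova.io/proxy-dockerhub", "rancher/mirrored-pause", "3.6")` yields the shape that failed.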
||
|
|
0ee309aa8b
|
fix(infra+api): wire handover_jwt_public_key end-to-end through tofu provisioning (#636)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. 
Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
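The "server-stamped, never accepted from client JSON" behavior of a `json:"-"` field can be sketched as follows. This is a minimal illustration, not the repository's code: `Request` here is a trimmed-down hypothetical stand-in for provisioner.Request, and `SovereignName` and `decodeAndStamp` are invented for the example.

```go
package main

import "encoding/json"

// Request is a hypothetical, trimmed-down stand-in for provisioner.Request.
type Request struct {
	// SovereignName is an assumed client-supplied field, for illustration.
	SovereignName string `json:"sovereignName"`
	// HandoverJWTPublicKey is tagged json:"-", so encoding/json never reads
	// it from (or writes it to) a JSON payload.
	HandoverJWTPublicKey string `json:"-"`
}

// decodeAndStamp decodes the client payload and then sets the server-owned
// field from a server-side value. An empty serverKey models the no-signer
// path described in the commit message.
func decodeAndStamp(payload []byte, serverKey string) (Request, error) {
	var req Request
	if err := json.Unmarshal(payload, &req); err != nil {
		return Request{}, err
	}
	req.HandoverJWTPublicKey = serverKey
	return req, nil
}
```

Even if a client sends a `HandoverJWTPublicKey` key in its payload, the decoder ignores it, so the only value that can reach the tfvars writer is the one stamped on the server side.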
||
|
|
96a5e3a20e
|
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
hcloud_server.control_plane.user_data → local.control_plane_cloud_init
local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during otech23 first-end-to-end provisioning attempt.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP, which preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
169ba2f20a
|
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
PR #611 squash accidentally reverted the Phase-8b infra additions from PR #615
(
|
||
|
|
b5c9839da7
|
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605)
Phase-8b deliverables:
UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect, authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token before redirecting operator to Sovereign console (falls back to plain URL on error)
API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow
Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)
Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars
Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard
Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).
Co-authored-by: e3mrah <e3mrah@openova.io> |
||
|
|
92fdda42d7
|
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Merge via self-merge per CLAUDE.md. Playwright UI smoke passes; cosmetic guards pre-existing failure on main (unrelated to this PR). Resolves #605. |
||
|
|
5a403e66b1
|
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:
FATAL: database "registry" does not exist (SQLSTATE 3D000)
Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.
Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.
Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix
Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:
1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
- values.yaml: `webhook.solverName: powerdns` → `pdns`
- The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
"powerdns" cert-manager gets 404 → "server could not find the resource".
2. cert-manager-dynadot-webhook solver_test.go mock format:
- writeOK() and error injection used old ResponseHeader-wrapped format
- Real api3.json returns ResponseCode/Status directly in SetDnsResponse
- This caused the image build to fail at
|
||
|
|
73ae746637
|
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API CRDs at startup and disables its gateway controller if they are absent, a static, one-shot decision. Cloud-init installs k3s+Cilium first, then Flux reconciles bp-gateway-api minutes later, so the operator always starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.
Three-part permanent fix:
1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl. TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires TLSRoute CRD to be present; without it the operator's capability check fails entirely and disables the gateway controller.
2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true" to force GatewayClass creation regardless of CRD presence at Helm render time. Upstream default "auto" skips GatewayClass when the gateway API CRDs are absent at install time (Capabilities check).
3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0 and ship experimental channel (TLSRoute, TCPRoute, UDPRoute, BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed status.supportedFeatures from string[] to object[]; Cilium 1.16.5 writes the old string format and the v1.2.0 CRD rejects the status patch with "must be of type object: string", leaving GatewayClass permanently Unknown/Pending. v1.1.0 retains string schema.
Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17 adopts the v1.2.0 object schema for supportedFeatures.
Closes #503
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
9e53d9e127
|
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade Per founder corrective: existing diagram missed the real blockers surfaced during otech10..otech22 burns. The image-pull-through gap (#557) and the cross-namespace secret gap (#543, #544) gate every workload pull from a public registry — without them, Sovereign hits DockerHub anonymous rate-limit on first provision and 30+ HRs are ImagePullBackOff/CreateContainerConfigError. Adds: - Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap + #557C charts global.imageRegistry templating). Edges to NATS / Gitea / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane / cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao - Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544 powerdns-api-credentials reflect). Edges to bp-catalyst-platform and bp-cert-manager-powerdns-webhook - Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch 38-HR threshold both gate Phase 8a integration test - Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is what makes "zero contabo dependency" DoD-met possible WBS now reflects the cascade observed live, not the pre-Phase-8a model. * feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560) - bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium subchart does not expose a single registry knob — per-Sovereign overlays wire specific image.repository fields alongside this value. - bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream chart exposes per-component image.registry knobs documented in the comment. - bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub added + deployment.yaml templated to prefix the webhook image repository when the value is non-empty. 
Verified: helm template with --set global.imageRegistry=harbor.openova.io produces harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>. - bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream subchart exposes sealed-secrets.image.registry for overlay wiring. All four charts render clean with default values (empty imageRegistry). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557) Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd transparently routes all five public-registry pulls through the central harbor.openova.io pull-through proxy (Option A of #557). - cloudinit-control-plane.tftpl: new write_files entry for /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so containerd reads the mirror config at startup). Mirrors docker.io, quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective harbor.openova.io/proxy-* projects. Auth via robot$openova-bot. - variables.tf: new harbor_robot_token variable (sensitive, default "") for the robot account token stored in openova-harbor/harbor-robot-token K8s Secret on contabo and forwarded by catalyst-api at provision time. - main.tf: wire harbor_robot_token into the templatefile() call. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
ccc38987c2
|
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.
Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
API returns SetDnsResponse); change ResponseCode to json.Number (API returns
integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
- rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
- values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
and privateKeySecretRefName; add rbac.create comment for domain-solver
- certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
- clusterissuer.yaml: new template (skip-render default, enabled via overlay)
- deployment.yaml: add imagePullSecrets support (required for private GHCR)
- Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
- 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
- kustomization.yaml: add 49b entry
- infra/hetzner:
- variables.tf: add dynadot_managed_domains variable
- main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
- cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
Secret + apply it before Flux reconciles bootstrap-kit
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
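The dynadot-client JSON fix (envelope key `SetDnsResponse`, `ResponseCode` decoded as `json.Number`) can be sketched as below. This is an assumption-laden illustration rather than the code in core/pkg/dynadot-client: the type and function names are hypothetical, and the sample `Status` strings used with it are placeholders, not documented API values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setDNSEnvelope mirrors the response shape named in the commit message:
// a top-level "SetDnsResponse" object carrying ResponseCode and Status.
// The Go type name is hypothetical.
type setDNSEnvelope struct {
	SetDnsResponse struct {
		// json.Number keeps the raw numeric literal, so an integer 0 in the
		// payload decodes without a type error.
		ResponseCode json.Number `json:"ResponseCode"`
		Status       string      `json:"Status"`
	} `json:"SetDnsResponse"`
}

// parseSetDNS returns nil only when the envelope is present and its
// ResponseCode is 0. A missing envelope leaves ResponseCode empty, which
// fails the integer conversion and is reported as an error.
func parseSetDNS(body []byte) error {
	var env setDNSEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return fmt.Errorf("decode set_dns response: %w", err)
	}
	code, err := env.SetDnsResponse.ResponseCode.Int64()
	if err != nil {
		return fmt.Errorf("read ResponseCode %q: %w", env.SetDnsResponse.ResponseCode, err)
	}
	if code != 0 {
		return fmt.Errorf("set_dns failed: code=%d status=%q", code, env.SetDnsResponse.Status)
	}
	return nil
}
```

With this shape, a payload keyed under the old `SetDns2Response` name is rejected instead of being silently treated as success, which is the failure mode the key rename addresses.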
|
||
|
|
b2307e290d
|
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A (bp-reflector blueprint):
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a, dependsOn bp-cert-manager); installs emberstack/reflector v7.1.288 via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml): single replica, 32Mi memory, ServiceMonitor off by default.
Part B (annotate flux-system/ghcr-pull + rename in charts):
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector annotations to the ghcr-pull Secret written at cloud-init time so Reflector auto-mirrors it to every namespace on first boot.
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in: api-deployment.yaml, ui-deployment.yaml, marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit HelmRelease version reference to match.
Root cause: the canonical secret name is ghcr-pull (written by cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff on all Catalyst pods on every new Sovereign.
Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret propagated to 33 namespaces via kubectl; non-Running pods bounced.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
5b55d65461
|
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is exposed directly on the CP node via firewall rule (main.tf:51-56, 0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to the LB's public IPv4, which silently failed with "connect: connection refused". catalyst-api helmwatch could never observe HelmReleases on the new Sovereign, so the wizard jobs page stayed PENDING for every install-* job for 50+ minutes after the cluster was actually healthy.
Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address) through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly on the CP, so this is reachable from contabo without any LB / firewall changes.
Permanent: every otechN provisioning from this commit forward will PUT back a kubeconfig that catalyst-api can actually connect to.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
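The kubeconfig rewrite described in this commit happens in cloud-init; the Go sketch below only illustrates the same substitution and is not the repository's implementation. The function name is hypothetical, and it assumes the generated k3s.yaml contains the default `https://127.0.0.1:6443` server entry.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// loopbackServer is the server URL k3s writes into its generated kubeconfig.
const loopbackServer = "https://127.0.0.1:6443"

// rewriteKubeconfigServer replaces the loopback API server address with the
// control-plane node's public IPv4 so the kubeconfig is usable off-node.
// Hypothetical helper, for illustration only.
func rewriteKubeconfigServer(kubeconfig, controlPlaneIPv4 string) (string, error) {
	addr, err := netip.ParseAddr(controlPlaneIPv4)
	if err != nil {
		return "", fmt.Errorf("parse control-plane IP %q: %w", controlPlaneIPv4, err)
	}
	if !addr.Is4() {
		return "", fmt.Errorf("control-plane IP %q is not IPv4", controlPlaneIPv4)
	}
	if !strings.Contains(kubeconfig, loopbackServer) {
		return "", fmt.Errorf("kubeconfig has no %s server entry", loopbackServer)
	}
	target := "https://" + addr.String() + ":6443"
	return strings.ReplaceAll(kubeconfig, loopbackServer, target), nil
}
```

Validating the address before substituting keeps an empty or malformed value from producing a kubeconfig that fails only later, at connection time.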
||
|
|
66ff717fbc
|
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01): when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop from issue #491), kustomize-controller held the revision lock for the full 30m health-check timeout and refused to pick up new GitRepository revisions. Even though Flux fetched fix `66ea39f0` from main within 1 minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait would never finish, no new revision would ever apply, and the operator was forced to wipe + reprovision from scratch. The same pathology would repeat on every iteration unless the timeout shape changed.
Approach: Option A (timeout reduction). Drops `spec.timeout` on all three Flux Kustomizations in the cloud-init template (bootstrap-kit, sovereign-tls, infrastructure-config) from 30m to 5m. We KEEP `wait: true` so downstream `dependsOn: bootstrap-kit` declarations still get a consolidated "every HR Ready=True" signal. We do NOT adjust `interval` (5m is correct).
Why 5m specifically: matches the GitRepository poll interval. Failed reconciles release the revision lock within ~6m worst case so a fresh fix on main gets applied on the next poll. Anything shorter risks tripping legitimately-slow CRD installs; anything longer re-introduces the iteration-stall pathology #492 documents.
Why not Option B (wait: false): would break the dependsOn chain. The infrastructure-config Kustomization needs bootstrap-kit's HRs Ready before it applies Provider/ProviderConfig manifests that talk to Hetzner. Flipping wait: false would let infra-config apply prematurely.
Why not Option C (tighter retryInterval): doesn't address the root cause. retryInterval governs how often to retry AFTER a failure; spec.timeout is what holds the revision lock during a failed wait.
Test: kustomization_timeout_test.go (new) locks all three timeouts at exactly 5m AND blocks any operative `timeout: 30m` regression AND asserts wait: true is retained. Three assertions, one for each failure mode (regression to 30m, accidental 4th Kustomization without test update, drive-by flip to wait: false).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
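The three guard conditions listed for kustomization_timeout_test.go could be checked along the following lines. This is a sketch of the idea, not the actual test file: the function name, the regular expressions, and the line-based scan are assumptions, and "operative" is interpreted here as "not on a commented-out line".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// timeoutLine matches a YAML "timeout: <value>" line and captures the value.
	timeoutLine = regexp.MustCompile(`^\s*timeout:\s*(\S+)\s*$`)
	// waitTrueLine matches a YAML "wait: true" line.
	waitTrueLine = regexp.MustCompile(`^\s*wait:\s*true\s*$`)
)

// checkKustomizationTimeouts scans template text and enforces three rules:
// every operative timeout is exactly 5m, there are exactly `want` of them,
// and there are exactly `want` "wait: true" lines. Commented lines are skipped.
// Hypothetical helper, for illustration only.
func checkKustomizationTimeouts(tmpl string, want int) error {
	timeouts, waits := 0, 0
	for _, line := range strings.Split(tmpl, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "#") {
			continue // comments are not operative configuration
		}
		if m := timeoutLine.FindStringSubmatch(line); m != nil {
			if m[1] != "5m" {
				return fmt.Errorf("found timeout %q, want 5m", m[1])
			}
			timeouts++
		}
		if waitTrueLine.MatchString(line) {
			waits++
		}
	}
	if timeouts != want {
		return fmt.Errorf("found %d timeout entries, want %d", timeouts, want)
	}
	if waits != want {
		return fmt.Errorf("found %d wait: true entries, want %d", waits, want)
	}
	return nil
}
```

Counting entries as well as checking values is what catches a fourth Kustomization added without updating the guard, and a `wait: false` flip shows up as a shortfall in the `wait: true` count.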