openova

Author	SHA1	Message	Date
e3mrah	aaaaadf8bc	feat(openova-flow): server (HTTP+SSE event router) + flux adapter (K8s informer sidecar) (#1390 ) Agent #2 of 3 for OpenovaFlow. Ships the Go backend independently of Agent #1's TS packages (@openova/flow-core + @openova/flow-canvas); the FlowMessage JSON contract is locked between agents. Two Go modules (separate go.mod each so the dep graphs stay decoupled): - products/openova-flow/server/ — stateless HTTP+SSE event router. Map<flowId, RingBuffer<FlowMessage>>, in-memory, no DB. Endpoints: POST /v1/flows/{flowId}/events, GET /v1/flows/{flowId}/snapshot, GET /v1/flows/{flowId}/stream (SSE with 15s heartbeats + Last-Event-ID seq stamping), DELETE /v1/flows/{flowId}, GET /healthz, /readyz. Zero external Go deps (stdlib net/http). Ring cap default 4096 (env-overridable). Locked schema validation rejects unknown envelope variants with 400. - products/openova-flow/adapter-flux/ — DaemonSet sidecar that watches helm.toolkit.fluxcd.io/v2.HelmRelease + HelmChart CRs via client-go's dynamicinformer.NewFilteredDynamicSharedInformerFactory (canonical seam: products/catalyst/bootstrap/api/internal/k8scache/factory.go), maps each event to FlowMessage via a pure-transform mapper, POSTs to the configured openova-flow-server with exponential-backoff retry. Status mapping: Ready=True → succeeded, InstallFailed/UpgradeFailed/ RetriesExhausted → failed, Progressing/Unknown/other-False → running, no Ready yet → pending. FlowNode.id format "{REGION_KEY}/{hrName}" so multi-region renders correctly. Region-aware: synthetic region parent FlowNode emitted on bootstrap; dependsOn entries fan-out to finish-to-start relationships. Two wrapper charts under platform/openova-flow-{server,emitter}/chart/ (canonical seam: platform/qa-app/chart/ for the simple Deployment+Service+SA shape; platform/k8s-ws-proxy/chart/ for the DaemonSet+ClusterRole+ClusterRoleBinding shape). MIRROR-EVERYTHING: image refs go through harbor.openova.io/proxy-ghcr/openova-io/... 
Image tag + required runtime config fail-fast at chart render via _helpers.tpl so silent ImagePullBackOff / boot crash is impossible. Two bootstrap-kit HRs added (slots 56 + 57): - 56-bp-openova-flow-server (dependsOn: bp-cilium, bp-cert-manager) — installs on primary cluster only; Cilium Gateway HTTPRoute at openova-flow.<sovereignFQDN> for cross-cluster ingest. - 57-bp-openova-flow-emitter (dependsOn: bp-flux) — DaemonSet, runs on every cluster (mother + Sovereign + every secondary region). scripts/expected-bootstrap-deps.yaml updated; check-bootstrap-deps.sh audit passes (drift=0, cycles=0). Tests (all green): - server contract_test.go — every FlowMessage variant round-trips JSON, unknown/malformed variants reject. Cross-flow Triggerer/ToFlowID preserved. - server server_test.go — full HTTP surface, including SSE replay+tail with a real httptest.Server. - adapter mapper_test.go — every HelmRelease.status.conditions[Ready] transition + multi-dependsOn fan-out + family-label/heuristic + region fallback. Verification done locally: - (cd products/openova-flow/server && go build ./... && go test ./...) — PASS - (cd products/openova-flow/adapter-flux && go build ./... && go test ./...) — PASS - helm template platform/openova-flow-server/chart/ — renders cleanly - helm template platform/openova-flow-emitter/chart/ — renders cleanly - bash scripts/check-bootstrap-deps.sh — PASS (drift=0) Agent #3 follow-ups (called out in slot 57's HelmRelease comments): - Thread SOVEREIGN_DEPLOYMENT_ID + REGION_KEY into the postBuild.substitute env in infra/hetzner/cloudinit-control-plane.tftpl so the emitter's flowId/regionKey become per-deployment + per-region automatically. Today the slot uses SOVEREIGN_FQDN as the flowId fallback and "primary" as the regionKey default; per-Sovereign overlays can override pre-Agent-#3. - catalyst-api proxy at /sovereign/api/v1/flows/{id}/stream so the Sovereign Console canvas hits a single in-tree origin. 
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 15:36:54 +04:00
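The status mapping described in the commit above can be sketched as a pure function. This is a minimal illustration, not the adapter's actual mapper: the `Condition` type, the `mapStatus` and `nodeID` names, and the choice to treat the three failure reasons as failures only when Ready is False are all assumptions made for the example.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for one entry of
// HelmRelease.status.conditions.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// mapStatus applies the mapping from the commit message:
// Ready=True -> succeeded; InstallFailed/UpgradeFailed/RetriesExhausted -> failed;
// Progressing/Unknown/other-False -> running; no Ready condition -> pending.
func mapStatus(conds []Condition) string {
	for _, c := range conds {
		if c.Type != "Ready" {
			continue
		}
		if c.Status == "True" {
			return "succeeded"
		}
		if c.Status == "False" {
			switch c.Reason {
			case "InstallFailed", "UpgradeFailed", "RetriesExhausted":
				return "failed"
			}
		}
		return "running"
	}
	return "pending"
}

// nodeID builds the "{REGION_KEY}/{hrName}" identifier.
func nodeID(regionKey, hrName string) string {
	return regionKey + "/" + hrName
}

func main() {
	ready := []Condition{{Type: "Ready", Status: "False", Reason: "InstallFailed"}}
	fmt.Println(nodeID("primary", "bp-cilium"), mapStatus(ready))
}
```

Keeping the mapping free of Kubernetes client types is what makes it easy to cover every Ready transition in a table-driven test.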
e3mrah	b5181ec5d6	fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184 ) (#1388 ) * fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184) Raise the catalyst-gitea-token-mint pre-install hook's Gitea-API wait loop from a hardcoded 60x5s (300s = 5m) budget to a values-driven knob (giteaWait.iterations x giteaWait.intervalSeconds, default 168x5 = 840s = 14m). Pairs with HR install.timeout=15m to leave 60s slack for the rest of the umbrella install action. Root-cause trace (4-layer) on prov #33 (multi-region fsn1+hel1, cpx42 workerCount=0+autoscaler): bp-catalyst-platform HR (15m HR-timeout) -> Helm pre-install hook Job: catalyst-gitea-token-mint -> pod runs alpine/k8s curl loop: while ! curl gitea-http.gitea.svc.cluster.local; do sleep 5; i=$((i+1)) done -> Hook gave up at iter 60 (= 5 min wall-time) -> Meanwhile gitea Pod is Pending: autoscaler-hcloud still scaling up workers in fsn1/hel1 (Fix #157 sizing default workerCount=0 means cold start). Budget arithmetic (post-Fix #184 default): hook_wait_time = iterations x intervalSeconds = 168 x 5 = 840s (14 min) HR install.timeout = 900s (15 min) slack within HR budget = 60s ( 1 min) The hook MUST complete strictly before HR remediates; the 60s slack absorbs regular release resources rolling + post-install hooks after the pre-install Job. Canonical-seam citations: - The hook lives at products/catalyst/chart/templates/ catalyst-gitea-token-secret.yaml (line ~303 pre-Fix), the catalyst-gitea-token-mint Job's `args` block. - Prior pattern: bp-keycloak chart 1.4.5 (Fix #146) introduced keycloakConfigCli.availabilityCheck.timeout as a values knob - same shape (chart-internal hook timing knob, distinct from the outer HR timeout). See platform/keycloak/chart/values.yaml:413. 
- The HR's install.timeout=15m lives at clusters/_template/ bootstrap-kit/13-bp-catalyst-platform.yaml:484 - the chart-internal wait budget MUST stay strictly less than this. Recurring class: same family as Fix #127 (bp-cutover HR 15m), Fix #131 (bp-gitea HR 15m), Fix #150 (bp-harbor HR 15m), Fix #154 (HR-timeout audit). Those bumped the HelmRelease install.timeout. This bumps the chart-INTERNAL wait loop budget inside the pre- install hook Job, which is a different (lower) seam. Per INVIOLABLE-PRINCIPLES #4 (never hardcode) the budget is fully runtime-configurable via .Values.giteaWait. Operators may shorten on known-warm-cluster overlays or extend on air-gapped Sovereigns. Changes: - products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml: replace hardcoded `seq 1 60` + `sleep 5` with templated ITERATIONS/INTERVAL vars driven by .Values.giteaWait.{iterations, intervalSeconds}. - products/catalyst/chart/values.yaml: add giteaWait block with defaults (iterations: 168, intervalSeconds: 5 = 14m budget). - products/catalyst/chart/Chart.yaml: bump 1.4.139 -> 1.4.140 with changelog entry capturing the 4-layer trace + budget arithmetic. - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump HelmRelease pin 1.4.138 -> 1.4.140 (skip 1.4.139 which is a no-op packaging bump on main). Verification: - helm template renders cleanly (2799 lines, exit 0). - Force-render with lookup gate bypassed shows ITERATIONS=168 + INTERVAL=5 substituted into the rendered Job args. - --set giteaWait.iterations=240 --set giteaWait.intervalSeconds=10 override confirmed to emit ITERATIONS=240 + INTERVAL=10. Test plan (post-merge, on prov #34): - kubectl logs -n catalyst-system catalyst-gitea-token-mint-* should emit `waiting for gitea api ($i/168)` instead of `($i/60)`. - bp-catalyst-platform HR reaches Ready=True within the 15m HR budget (previously installFailures: 2 on prov #33). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-deps): reconcile pre-existing dep-graph audit drift Two pre-existing drift items surfaced when dep-graph-audit ran on the Fix #184 PR — both are in `main` already, not introduced here, but the gate blocks any PR until the expected DAG matches the actual HRs. 1. `bp-catalyst-platform` (slot 13) — actual HR file declares `bp-crossplane-claims` as an additional dependsOn edge (added in chart-roll-rca iter-15, 2026-05-10, for the XRD-ordering race that caused the omantel.biz 90-min wedge). Update expected-deps to include it. 2. `bp-hcloud-ccm` (slot 55) — present on disk but absent from expected-deps. Cloud-provider seam, no upstream dependencies. Added with empty depends_on. --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>	2026-05-11 14:44:54 +04:00
e3mrah	a388a61ae2	fix(bootstrap-kit/_template): wire NetBird/DMZ/Hubble/BGP via envsubst — qa-loop iter-12 Fix #53C+D follow-up (#1280 ) * fix(bootstrap-kit/_template): wire NetBird/DMZ/Hubble/BGP/clustermesh-LB via envsubst — qa-loop iter-12 Fix #53C+D follow-up The omantel chroot reconciles from clusters/_template/bootstrap-kit/ (not the per-Sovereign omantel.omani.works/ overlay). PR #1275 added slot 53 (NetBird) and slot 54 (DMZ vCluster) plus Hubble UI / BGP / clustermesh-LB to the omantel.omani.works overlay only. This PR mirrors the same changes into _template via envsubst so the chroot also picks them up. 01-cilium.yaml: - Chart pin 1.2.0 → 1.3.0 (Hubble UI HTTPRoute overlay + clustermesh shape) - hubble.relay/ui.enabled gated on ${HUBBLE_ENABLED:=false} (default off, backward-compat) - bgpControlPlane.enabled gated on ${BGP_ENABLED:=false} - clustermesh.apiserver.service.type gated on ${CLUSTERMESH_SERVICE_TYPE:=NodePort} (default NodePort, backward-compat) - catalystOverlay.hubbleUI block (envsubst gated, off by default) 53-bp-netbird.yaml NEW: NetBird Sovereign install, default-OFF via NETBIRD_ENABLED. OIDC issuer / realm parameterized through SOVEREIGN_REALM_NAME so the per-Sovereign realm rename (Fix #53A) flows through. 54-bp-dmz-vcluster.yaml NEW: DMZ vCluster install, default-OFF via DMZ_VCLUSTER_ENABLED. Vcluster name parameterized via DMZ_VCLUSTER_NAME (default `dmz`). kustomization.yaml: added slots 53/54. Operator opts in per-Sovereign by setting the substitutes on the bootstrap-kit Kustomization. Live patches applied to omantel for immediate effect: - HUBBLE_ENABLED=true HUBBLE_HOSTNAME=hubble.console.omantel.biz - BGP_ENABLED=true - NETBIRD_ENABLED=true - DMZ_VCLUSTER_ENABLED=true DMZ_VCLUSTER_NAME=omantel-dmz * fix(bootstrap-deps): add bp-netbird (slot 53) + bp-dmz-vcluster (slot 54) to expected DAG — qa-loop iter-12 Fix #53C dependency-graph-audit fix	2026-05-10 11:05:20 +04:00
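The `${VAR:=default}` gates used in the commit above follow shell-style defaulting: the default applies when the variable is unset or empty. The sketch below is a simplified re-implementation for illustration only, not Flux's substitution code; the `substitute` function and its regular expression are assumptions and handle just the `${VAR}` and `${VAR:=default}` forms.

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern matches ${NAME} and ${NAME:=default}.
var pattern = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}`)

// substitute replaces each reference with the variable's value, falling back
// to the inline default when the variable is unset or empty.
func substitute(s string, vars map[string]string) string {
	return pattern.ReplaceAllStringFunc(s, func(m string) string {
		g := pattern.FindStringSubmatch(m)
		if v, ok := vars[g[1]]; ok && v != "" {
			return v
		}
		return g[2] // empty when no default was given
	})
}

func main() {
	line := "enabled: ${HUBBLE_ENABLED:=false}"
	fmt.Println(substitute(line, nil))
	fmt.Println(substitute(line, map[string]string{"HUBBLE_ENABLED": "true"}))
}
```

This is why the template slots stay backward-compatible: a Sovereign that sets none of the substitutes renders the default-off values.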
e3mrah	5ca0a7d178	fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236 ) * fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots Closes the scope-narrow confessed by Fix #36: bp-guacamole + bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment NotFound". CI workflows ------------ - .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/*, then bumps platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml patch version + dispatches blueprint-release. - .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry we own — no Docker Hub rate limits, no upstream availability risk), bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches blueprint-release. Charts (target-state) --------------------- - bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy` regardless of release name (DaemonSet + Service + ClusterRole + ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so matrix can address them by canonical short name). - bp-guacamole v0.1.1: canonical short resource names (`guacd`, `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream images; realm-patch ConfigMap correctly lands in `keycloak` namespace (was: realm-name, which would have failed silently on every Sovereign); `realmConfig.namespace` override surface added. - Both charts: `catalyst.openova.io/smoke-render-mode: default-off` annotation so blueprint-release smoke-render gate honors the default-OFF render shape. 
Bootstrap-kit slots ------------------- - clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml + 37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned to 0.1.1, default-OFF gate flipped via slot values, install/upgrade disableWait per session-2026-04-30 architectural decision. - clusters/omantel.omani.works/bootstrap-kit/ slots mirror the same shape with omantel.biz hostnames matching the live HTTPRoutes on console.omantel.biz / auth.omantel.biz. API: shells/issue handler (matrix-canonical URL surface) -------------------------------------------------------- - POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container= alias for the existing POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session with matrix-canonical response fields (`sessionId`, `guacamoleUrl`, `recordingPath`). Same business logic, same audit surface (`guacamole-session-opened`), same RBAC gate (tier-developer or higher). 6 test cases, all PASS under -race. TCs that flip PASS in iter-8 ----------------------------- - TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath - TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system - TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system - TC-237: kubectl logs ds/k8s-ws-proxy → "listening" - TC-245: viewer-cookie POST /shells/issue → 403 - TC-246: operator-cookie POST /shells/issue → 200 sessionId Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every gap Fix #36 confessed is closed in this PR. Per feedback_machine_saturation_3rd_violation.md: CI-only build path, no local docker. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up) CI dependency-graph-audit caught a slot-number collision: slots 36-48 are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative, bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge, bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix, bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+ slot range) and add their entries to the expected DAG. - clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-* - clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-* - kustomization.yaml updates (both _template + omantel) - scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets, bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+ seaweedfs+k8s-ws-proxy) scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55 declared HRs, 42 present on disk, 13 deferred (W2.K1-K4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 01:48:25 +04:00
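The dependency-graph audit referenced above reports drift, cycles, and deferred slots. A minimal Go sketch of those three checks follows, assuming (as the log describes elsewhere) that an HR on disk but missing from the expected DAG counts as drift and that a declared HR not yet on disk is merely deferred. The `DAG` type and function names are hypothetical; the real check is a shell script whose internals are not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// DAG maps a HelmRelease name to its dependsOn list.
type DAG map[string][]string

func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x, y := append([]string(nil), a...), append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// audit returns HRs whose on-disk edges differ from (or are missing in) the
// expected DAG, plus HRs that are declared but not yet present on disk.
func audit(expected, onDisk DAG) (drift, deferred []string) {
	for name, deps := range onDisk {
		if want, ok := expected[name]; !ok || !sameSet(want, deps) {
			drift = append(drift, name)
		}
	}
	for name := range expected {
		if _, ok := onDisk[name]; !ok {
			deferred = append(deferred, name)
		}
	}
	sort.Strings(drift)
	sort.Strings(deferred)
	return drift, deferred
}

// hasCycle runs a depth-first search with grey/black marking.
func hasCycle(g DAG) bool {
	const grey, black = 1, 2
	state := map[string]int{}
	var visit func(string) bool
	visit = func(n string) bool {
		switch state[n] {
		case grey:
			return true
		case black:
			return false
		}
		state[n] = grey
		for _, d := range g[n] {
			if visit(d) {
				return true
			}
		}
		state[n] = black
		return false
	}
	for n := range g {
		if visit(n) {
			return true
		}
	}
	return false
}

func main() {
	expected := DAG{
		"bp-k8s-ws-proxy": {"bp-cilium", "bp-sealed-secrets"},
		"bp-stunner":      {"bp-cilium"}, // illustrative deferred slot
	}
	onDisk := DAG{"bp-k8s-ws-proxy": {"bp-sealed-secrets", "bp-cilium"}}
	drift, deferred := audit(expected, onDisk)
	fmt.Println(len(drift), deferred, hasCycle(expected))
}
```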
e3mrah	94ffe01ff0	chore(bootstrap-kit): remove slot 95 bp-stalwart-sovereign (Phase-2 deferred) (#958 ) The bp-stalwart-sovereign chart's post-install Job times out on fresh Sovereigns (observed on otech113) and blocks the entire bootstrap-kit Kustomization. Phase-2 Sovereign-local mail (umbrella #924) is OUT OF SCOPE for the current Phase-1 cutover. Phase-1 Console PIN/magic-link delivery already works through the mothership SMTP relay path: - products/catalyst/chart/values.yaml#sovereign.smtp.* defaults to mail.openova.io:587 / noreply@openova.io - products/catalyst/bootstrap/api/internal/handler/sovereign_smtp_seed.go seeds those bytes into catalyst-system/sovereign-smtp-credentials at bootstrap, so bp-catalyst-platform's `lookup` resolves on first reconcile without waiting for a Sovereign-local Stalwart. This commit: - Deletes clusters/_template/bootstrap-kit/95-bp-stalwart-sovereign.yaml - Updates the kustomization.yaml resource list with a comment block documenting the deferral and the canonical re-entry conditions. - Updates scripts/expected-bootstrap-deps.yaml so check-bootstrap-deps.sh no longer expects the slot. Audit re-runs clean (0 drift, 0 cycles). The chart itself stays at platform/stalwart-sovereign/ for future Phase-2 work; only the bootstrap slot is removed. Refs: #883 #924 Co-authored-by: Hatice Yildiz <hatiyildiz@openova.io>	2026-05-05 15:55:30 +04:00
e3mrah	9077016466	feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924 ) (#931 ) Phase-2 follow-up to #883: replace mothership Stalwart relay (mail.openova.io:587) with a Sovereign-local Stalwart so Console PIN/magic-link mail originates from `noreply@<sovereignFQDN>` with per-Sovereign SPF/DKIM/DMARC posture, eliminating the mothership SMTP SPOF for Sovereign Console login. What ships: 1. NEW blueprint platform/stalwart-sovereign/ (otech-level — distinct from per-tenant bp-stalwart-tenant). Single Stalwart instance per Sovereign cluster, scoped to Sovereign Console system mail. NO Keycloak OIDC, NO webmail UI — Sovereign Console is the only consumer. Auto-provisioned admin + submission Secrets via the lookup-or-generate pattern (#898/#830/#887). Post-install Job: - registers the noreply submission principal in Stalwart - allows send-as for noreply@<sovereignFQDN> - reads DKIM public key, patches dns-records ConfigMap - materialises catalyst-system/sovereign-smtp-credentials with Sovereign-local infrastructure addresses + credentials, carrying BOTH key shapes (smtp-user/smtp-pass + legacy user/password) so the consumer chart works either way. 2. NEW bootstrap-kit slot 95 (clusters/_template/bootstrap-kit/ 95-bp-stalwart-sovereign.yaml). dependsOn: bp-cert-manager, bp-catalyst-platform. Sequenced after bp-catalyst-platform (slot 13) so the chart's post-install Job lands its mirror Secret in an already-existing catalyst-system namespace. 3. bp-catalyst-platform 1.4.19 → 1.4.20: SOURCE-wins precedence extended to (a) non-secret fields smtp-host/smtp-port/smtp-from so Sovereign-local infra addresses (`mail.<sovereignFQDN>`) take over from mothership defaults (`mail.openova.io`) on the next reconcile after slot 95 lands, and (b) canonical key shape `smtp-user`/`smtp-pass` in addition to legacy `user`/`password` source key shape. 4. expected-bootstrap-deps.yaml: declare slot 95 graph edge. 5. 
catalyst-api handler/sovereign_smtp_seed.go: documentation-only update to note this Phase-1 step is now a graceful fallback — the Phase-2 chart's post-install Job overwrites the mirror Secret on first reconcile so the cutover from mothership relay to Sovereign-local relay is automatic, no operator action. Verification: - `helm template smoke ./platform/stalwart-sovereign/chart` clean (smoke-render-safe; per-template gates skip when sovereignFQDN unset). - `helm template smoke -f operator-values.yaml` emits StatefulSet, LoadBalancer Service, ClusterIP HTTP Service, DKIM-signing config, dns-records ConfigMap, Setup Job + RBAC. - `chart/tests/sovereign-render.sh` 3 cases all PASS. - `helm template smoke ./products/catalyst/chart` (1.4.20) clean. - `helm lint` both charts: clean (only icon-recommended INFO). - `bash scripts/check-bootstrap-deps.sh` PASSED — bootstrap-kit dependency graph audit, 0 drift, 0 cycles. - `go test -run TestSeedSovereignSMTP` — Phase-1 seed tests pass. - `go test -run TestBootstrapKit_TemplateClusterParses` — slot 95 YAML parses cleanly. Out of scope (sub-PR follow-up under #924): - DKIM keypair generation in catalyst-api orchestrator + DNS records (MX/A/SPF/DMARC/DKIM-pubkey) registration via PDM dynadot adapter at omani.works. - Hetzner PTR (rDNS) auto-registration via the Hetzner cloud API. - Cert-manager Certificate adding mail.<sovereignFQDN> SAN to the Sovereign wildcard cert (chart relies on the existing wildcard cert from bp-catalyst-platform 1.4.0+'s per-zone Certificate template — when that wildcard chain covers the Sovereign FQDN, `mail.<sovereignFQDN>` is already covered). Acceptance (lands when sub-PR follow-up ships): - Sovereign Console PIN delivery uses noreply@<sov-fqdn>. - External mail server (e.g. Gmail) accepts mail with valid SPF + DKIM. - Mothership SMTP no longer SPOF for Sovereign Console login. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:20:16 +04:00
e3mrah	20b3c5258a	feat(bp-newapi): chart maturation + first-otech deploy + Qwen vLLM channel (#799 ) (#812 ) * feat(bp-newapi): chart maturation — ExternalSecret + first-otech vLLM channel + skip-render gates (#799) Maturation work for the SME-3 turnkey-experience epic (#795). Aligns the bp-newapi scratch chart with ADR-0003 (RBAC ↔ NewAPI user-create hook contract) and gets it past the blueprint-release CI smoke render that has blocked publication since PR #396 (run 25213444992 failed at default-values render of v1.0.0). Changes ------- - templates/external-secret.yaml (NEW). Renders the `catalyst-newapi-admin-token` ExternalSecret consumed by unified-rbac (ADR-0003 §3.2 + §6) for issuing per-user keys against `http://newapi.newapi.svc/api/v1/admin/users`. Sourced from OpenBao via the `vault-region1` ClusterSecretStore (canonical default shipped by bp-external-secrets-stores). Capabilities-gated on `external-secrets.io/v1beta1` so cold installs without ESO don't fail-render. Operator supplies the per-Sovereign OpenBao path via `catalystIntegration.externalSecret.remoteRef.key`; canonical convention is `sovereign/<sovereign-fqdn>/newapi/admin-token` with property `ADMIN_API_TOKEN`. Per Inviolable Principle #4 every knob is operator-overridable in the cluster overlay. - values.yaml. Adds `catalystIntegration.externalSecret.{enabled, refreshInterval, secretStoreRef.{kind,name}, remoteRef.{key,property}}` block (default enabled=true, key="" so a misconfigured overlay fails loudly at render rather than silently skipping). Adds `defaultChannels.vllm` block — first-otech shorthand that composes a vLLM-typed channel into the rendered channels list when enabled. Default endpoint is empty per Inviolable Principle #4; the `clusters/<sovereign>/bootstrap-kit/80-newapi.yaml` overlay supplies the per-Sovereign URL (canonical first-otech reference = `https://llm-api.omtd.bankdhofar.com` model `qwen3-coder`, the same upstream Axon uses on the OpenOva marketing deployment). 
- templates/_helpers.tpl. New `bp-newapi.effectiveChannels` helper composes `.Values.channels` with `defaultChannels.vllm` (when enabled). The `assertChannelAttestation` helper now operates on the effective list so attestation gates apply to defaultChannels composition too. `defaultChannels.vllm.enabled=true` with empty endpoint fails-fast at render with a guided error message. - templates/configmap.yaml. Channels rendering switches to the effectiveChannels helper. OIDC block now skip-renders gracefully when `auth.adminUI.keycloak.issuer` is unset (smoke-render path) instead of `required`-failing; the per-Sovereign overlay sets the issuer. - templates/deployment.yaml. Skip-render gate on Deployment when `database.existingSecret`, `credentials.existingSecret`, or (when Keycloak mode is selected) the OIDC client secret is missing. Removes the four `required` calls that were failing CI smoke render. Service, ServiceAccount, ConfigMap, NetworkPolicy still render so the smoke test gets a non-empty output proving structural soundness; the actual Deployment defers until the per-Sovereign overlay wires the secrets. - templates/ingress.yaml. Same skip-render pattern: when either `ingress.host` or `ingress.adminHost` is empty, the entire ingress block is silently skipped. Matches the bp-keycloak / bp-openbao / bp-external-dns HTTPRoute templates. - Chart.yaml. version 1.0.0 → 1.1.0 (minor bump — additive features; no breaking changes to existing operator overrides). Verification ------------ `helm template` smoke render on default values now succeeds with 4 resources (NetworkPolicy / ServiceAccount / ConfigMap / Service); 168 lines, well above the CI 5-line minimum. With a full per-Sovereign overlay (hosts + secrets + Keycloak issuer + ESO Capabilities + Traefik Capabilities + defaultChannels.vllm.endpoint), 8 resources render including Deployment, both Ingresses, the Traefik allowlist Middleware, and the ExternalSecret. 
The composed qwen channel writes through to `channels.yaml` with the expected endpoint + models + attestation. Refs ---- ADR-0003 §3.2 + §6 — admin-token contract Issue #795 (epic) — locked decisions Issue #796 — hook contract spec (sequential blocker, merged) Inviolable Principles #1, #3, #4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(bootstrap-kit): slot 80 — bp-newapi default install (#799) Adds the canonical install slot for bp-newapi to every fresh Sovereign's bootstrap-kit. Sequenced after the W2.K1 dependency wave so NewAPI's ExternalSecret + Postgres DSN dependencies resolve on first reconcile. The HelmRelease declares `dependsOn: [bp-openbao, bp-keycloak, bp-cnpg]`: - bp-openbao(08): admin-token ExternalSecret backend - bp-keycloak(09): OIDC issuer for ops-staff admin UI at admin.<fqdn> - bp-cnpg(16): Postgres backing for users/credits/channels/audit Per-Sovereign overlays inherit the slot's defaults and override: - ingress.host api.${SOVEREIGN_FQDN} - ingress.adminHost admin.${SOVEREIGN_FQDN} - auth.adminUI.keycloak.issuer - database.existingSecret (Crossplane-claimed) - credentials.existingSecret - catalystIntegration.externalSecret.remoteRef.key sovereign/${FQDN}/newapi/admin-token - defaultChannels.vllm.enabled true (first-otech) - defaultChannels.vllm.endpoint (operator-supplied) The `_template/` slot keeps `defaultChannels.vllm.enabled: false` so a fresh Sovereign does not silently wire customers to a third-party endpoint; the canonical first-otech reference (Qwen3 Coder via `https://llm-api.omtd.bankdhofar.com`, same relay Axon uses on the OpenOva marketing deployment) is documented in-line for operators adopting the same upstream. 
Refs: #795 (epic), ADR-0003 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-deps): register bp-newapi slot 80 in expected DAG (#799) Fixes the dependency-graph-audit drift detection caught at PR #812 CI: the audit script enumerates HelmReleases in clusters/_template/bootstrap-kit/ and compares to scripts/expected-bootstrap-deps.yaml; an HR present on disk but absent from the expected DAG is treated as drift. Adds the canonical entry for bp-newapi at slot 80 with the same depends_on set declared on the HelmRelease itself ([bp-openbao, bp-keycloak, bp-cnpg]). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-newapi): align blueprint.yaml spec.version with Chart.yaml (#799) The TestBootstrapKit_BlueprintCardsHaveRequiredFields static-validation gate asserts Chart.yaml version == blueprint.yaml spec.version. The chart was bumped to 1.1.0 in c63ecd8c; bumping the blueprint metadata to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:17:25 +04:00
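The `effectiveChannels` composition described in the bp-newapi commit above is a Helm template helper. The Go sketch below only restates its logic: append the vLLM default channel when enabled, and fail fast when it is enabled with an empty endpoint. The struct fields, the `vllm-default` channel name, and the error text are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Channel and VLLMDefault are hypothetical shapes standing in for
// .Values.channels entries and .Values.defaultChannels.vllm.
type Channel struct {
	Name     string
	Type     string
	Endpoint string
	Models   []string
}

type VLLMDefault struct {
	Enabled  bool
	Endpoint string
	Models   []string
}

// effectiveChannels composes the explicit channels with the optional vLLM
// default, so later checks (such as attestation) see one combined list.
func effectiveChannels(channels []Channel, vllm VLLMDefault) ([]Channel, error) {
	out := append([]Channel(nil), channels...)
	if !vllm.Enabled {
		return out, nil
	}
	if vllm.Endpoint == "" {
		return nil, errors.New("defaultChannels.vllm.enabled=true requires a non-empty defaultChannels.vllm.endpoint")
	}
	return append(out, Channel{
		Name:     "vllm-default",
		Type:     "vllm",
		Endpoint: vllm.Endpoint,
		Models:   vllm.Models,
	}), nil
}

func main() {
	_, err := effectiveChannels(nil, VLLMDefault{Enabled: true})
	fmt.Println(err)
}
```

Failing at render time rather than skipping silently means a misconfigured overlay is caught before anything is deployed.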
e3mrah	33dc98782b	feat(bp-self-sovereign-cutover): chart + bootstrap-kit slot 06a (#791 ) (#808 ) New platform Blueprint at `platform/self-sovereign-cutover/chart/`. Ships DORMANT — eight step PodSpec ConfigMaps, the registry-pivot DaemonSet, the mutable cutover-status ConfigMap, plus ServiceAccount/RBAC. The catalyst-api cutover endpoint (#792, merged at `03828641`) reads each step ConfigMap by label selector and stamps real Jobs only on operator-driven trigger. Step inventory: 01 gitea-mirror — git push --mirror upstream → local Gitea 02 harbor-projects — create 7 proxy-cache projects 03 harbor-prewarm — HEAD-pull bootstrap-kit images through cache 04 registry-pivot — DaemonSet rewrites registries.yaml on every node 05 flux-gitrepository-patch — pivot GitRepository.url → local Gitea 06 helmrepository-patches — pivot 38 OCI URLs → local Harbor 07 catalyst-api-env-patch — kubectl set env CATALYST_GITOPS_REPO_URL 08 egress-block-test — CiliumNetworkPolicy + 10-min sovereignty proof Plus self-sovereign-cutover-status ConfigMap with the consumer-contract keys (cutoverComplete, currentStep, step.<name>.result, etc.) shipped at install with helm.sh/resource-policy: keep so chart uninstall doesn't lose state. Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml` installs the chart into the `catalyst` namespace (matches catalyst-api's default discovery namespace), depends on bp-gitea + bp-harbor, uses disableWait: true. RBAC splits `create` verbs into their own Rule WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md — the bp-openbao loop anchor. 
chart/tests/cutover-contract.sh enforces: - 8 step ConfigMaps render - required labels (part-of/component/cutover-order/cutover-mode) - required data keys (stepName + podSpec for job-mode) - step 04 mode=daemonset-wait - status ConfigMap retained on uninstall - RBAC create/resourceNames split helm template smoke render: 1180 lines, 19 resources (1 Namespace + 1 SA + 11 ConfigMaps + 1 DaemonSet + 1 ClusterRole + 1 ClusterRoleBinding). helm lint: clean. scripts/check-bootstrap-deps.sh: PASSED (slot 6a registered, depends_on [bp-gitea, bp-harbor]). Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 21:55:19 +04:00
e3mrah	53bc4357ca	feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767 ) (#776 ) * feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB): 1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate" Section with: bootstrap-kit baseline (sum of mandatory-tier component footprints), selected components delta, control-plane overhead, and a "Recommended N x <SKU>" line that turns amber when the operator's chosen worker count is below the rollup. Backed by per-component RAM/CPU floors in components/wizard/steps/componentFootprints.ts (covered by 12 unit tests including the otech92 reproduction). 2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart 9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired from the canonical flux-system/cloud-credentials.hcloud-token Secret cloud-init writes (mirrors the velero/harbor object-storage pattern). Pinned to the control-plane node so the autoscaler never schedules onto a worker it could itself terminate. 10-minute scale-down idle as the cost-saving default. Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA / KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over KEDA for cluster scaling, and the bounds + safety story. Per the issue's MVP scope, this PR ships the blueprint + StepReview estimate WITHOUT the wizard StepProvider min/max pair refactor or the tofu node-pool template restructuring. Those are tracked as a follow-up issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected- bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776 because the file existed without a matching entry in the expected DAG, AND collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort + slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to the expected-bootstrap-deps.yaml so the audit passes. `scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:49:44 +04:00
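The footprint estimate from the commit above sums a bootstrap-kit baseline, the selected-component delta, and control-plane overhead, then recommends a worker count. The real implementation lives in the wizard's TypeScript; the Go sketch below only illustrates the rollup for RAM. The function names and the sample figures are assumptions, apart from the 8 GB per cpx32 worker implied by "2x cpx32 ... the full 16 GB".

```go
package main

import "fmt"

// recommendWorkers rounds the total RAM requirement up to whole workers
// of the given SKU size. All inputs are in MiB.
func recommendWorkers(baselineMiB, selectedMiB, overheadMiB, skuMiB int) int {
	total := baselineMiB + selectedMiB + overheadMiB
	return (total + skuMiB - 1) / skuMiB // ceiling division
}

// underProvisioned reports whether the chosen worker count falls below the
// recommendation (the case the wizard highlights in amber).
func underProvisioned(chosen, recommended int) bool {
	return chosen < recommended
}

func main() {
	const cpx32MiB = 8 * 1024
	// Illustrative numbers: a baseline that fills two workers plus 1 GiB extra.
	n := recommendWorkers(16*1024, 1024, 0, cpx32MiB)
	fmt.Printf("Recommended %d x cpx32, warn=%v\n", n, underProvisioned(2, n))
}
```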
e3mrah	2b60e944e2	fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook (#681 ) * fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook Caught live on otech43-46: cert-manager DNS-01 challenges for .otechN.omani.works failed because the Sovereign-side webhook wrote challenge TXT records to the Sovereign's local PowerDNS. omani.works is delegated from Dynadot to ns1/2/3.openova.io which run on contabo's central PowerDNS — the Sovereign's local PowerDNS is INVISIBLE on the public DNS chain until pool-domain-manager seals the per-Sovereign NS delegation. Let's Encrypt resolvers walk the public chain, query contabo, get NXDOMAIN, the cert never issues. Manual workaround was seeding challenge TXT directly in contabo PowerDNS. This PR automates the right write path: - bp-cert-manager-powerdns-webhook chart bumped to 1.0.4. Default powerdns.host flips from "" (skip-render) to https://pdns.openova.io (contabo's public PowerDNS API ingress, authoritative for omani.works). - ClusterIssuer letsencrypt-dns01-prod-powerdns now usable with no per-cluster powerdns.host override for the omani.works pool. apiKeySecretRef.namespace clarified — upstream ignores it; the Secret must live in cert-manager namespace (= ChallengeRequest.ResourceNamespace for ClusterIssuers). - bootstrap-kit slot 49 updated: drops bp-powerdns dependsOn (webhook calls out-of-cluster contabo, not local PowerDNS), bumps chart version, removes inline powerdns.host override (defaults are correct). - bootstrap-kit slot 49b (bp-cert-manager-dynadot-webhook) DELETED entirely — Dynadot is NOT the API-level authority for omani.works subdomains, the dynadot webhook silently fails the same way the Sovereign-local powerdns one did. - clusters/_template/sovereign-tls/cilium-gateway-cert.yaml flips issuerRef from letsencrypt-dns01-prod (was dynadot-backed) to letsencrypt-dns01-prod-powerdns (the new contabo-backed issuer). 
- bp-cert-manager chart: certManager.issuers.dns01.enabled defaults to false (deprecated dynadot path). letsencrypt-http01-prod retained for per-host certs. Cluster overlays MAY flip dns01.enabled=true for non-omani.works pools where Dynadot IS the API-level authority. - scripts/expected-bootstrap-deps.yaml: drops slot 49b, drops bp-powerdns edge from slot 49. - Documentation (README + blueprint.yaml + Chart.yaml description) rewritten to reflect contabo retarget and lifecycle reasoning. Credential plumbing (out of scope here, must be done in cloud-init): - Every Sovereign needs a `powerdns-api-credentials` Secret in the `cert-manager` namespace whose `api-key` value matches contabo's PowerDNS API key. Same seeding pattern as `dynadot-api-credentials` in infra/hetzner/cloudinit-control-plane.tftpl. Caveat — basicAuth on contabo's PowerDNS API ingress: contabo currently fronts pdns.openova.io with Traefik basicAuth (per clusters/contabo-mkt/apps/powerdns/helmrelease.yaml). The upstream zachomedia/cert-manager-webhook-pdns binary supports the X-API-Key header but not HTTP Basic Auth out of the box. To make this end-to-end green, either contabo's basicAuth requirement must be relaxed (X-API-Key alone provides the auth posture, and contabo's API endpoint is restricted to operator IPs by other means), OR the Sovereign's webhook needs an Authorization header injected via the chart's powerdns.headers map (plaintext password in the ClusterIssuer config — not ideal). This PR ships the chart side; the basicAuth question is a follow-up on the contabo side. Verified locally: - helm lint platform/cert-manager-powerdns-webhook/chart -> PASS - helm template platform/cert-manager-powerdns-webhook/chart -> renders - helm template ... --set clusterIssuer.enabled=true -> renders the ClusterIssuer with host="https://pdns.openova.io" + correct apiKey Secret reference. 
- helm template platform/cert-manager/chart -> renders ONLY letsencrypt-http01-prod (the dns01 dynadot issuer correctly gated off). - scripts/check-bootstrap-deps.sh: net-zero new drift; my branch reduces pre-existing errors from 3 to 2 (the dropped slot 49b removed the only drift my branch was responsible for). Closes follow-up to #373. Preconditions for handover URL TLS green on otech43-46 lineage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(scripts): repair YAML structure in expected-bootstrap-deps.yaml Two pre-existing drifts were blocking dependency-graph-audit CI: 1. Slot 5a (bp-reflector) was missing its closing list separator, causing yq to merge the bp-nats-jetstream entry into the bp-reflector map and effectively drop bp-reflector from the expected DAG. Added explicit `- slot: 7` for bp-nats-jetstream and quoted "5a" so yq treats it as a string slot (matches the convention with "49b"). 2. bp-powerdns slot 11: actual bootstrap-kit declares dependsOn bp-cnpg (live since otech28 — pdns-pg-app secret race) but the expected DAG was missing this edge. This unblocks merging fix/cert-manager-powerdns-webhook-contabo (PR above) — these drifts existed on main but weren't surfaced until the last expected-deps edit forced a re-run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:12:48 +04:00
e3mrah	74921e30f1	fix(architecture): drop bp-spire, Cilium WireGuard is the canonical east-west mesh (#665 ) Founder direction 2026-05-03: with 100% Cilium mesh enforcement + Envoy where required, bp-spire is redundant for the minimal Sovereign MVP. Reasoning: - Cilium 1.13+ has built-in mutual auth using SPIFFE, but it ships with its own embedded SPIRE server managed by the Cilium operator. External bp-spire is not needed for east-west mTLS. - Our ESO→OpenBao auth uses the K8s ServiceAccount auth method (TokenReview against kube-apiserver), not JWT-SVID. - WireGuard transparent encryption (already enabled in cilium values) encrypts every pod-to-pod connection at the kernel transport layer. - Cross-Sovereign federation and per-workload-fingerprint attestation are not blocking handover; they can be re-introduced as an opt-in blueprint when needed. Changes: - Delete clusters/_template/bootstrap-kit/06-spire.yaml - Remove bp-spire from kustomization.yaml + expected-bootstrap-deps.yaml - Remove bp-spire dependsOn from 07-nats-jetstream.yaml + 08-openbao.yaml - bp-cilium 1.1.4: add encryption.nodeEncryption=true so node-to-node traffic (not just pod-to-pod) is also WireGuard-encrypted; document in values.yaml comment that WireGuard is the canonical east-west mTLS layer. Removes 4 pods (spire-server, spire-agent, spire-spiffe-csi-driver, spire-spiffe-oidc-discovery-provider) from every Sovereign and the recurring CSI mount race that was getting stuck on otech43. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 13:56:36 +04:00
e3mrah	be6e610093	fix: drop bp-langfuse from minimal + bp-mimir 1.0.2 push_grpc fix (#664 ) * fix: drop bp-langfuse from minimal bootstrap-kit + bp-mimir push_grpc fix Two independent fixes packaged together: 1. Drop bp-langfuse from the SOLO minimal bootstrap-kit. Per founder direction: langfuse is LLM-specific (prompt/completion tracing for AI plane), not platform infrastructure, and belongs to a future 'AI Add-On' template. Its CreateContainerConfigError on every Sovereign provision (missing langfuse-secrets pre-install) was eating Phase-1 reconciliation budget without contributing to handover-ready state. Removed: - clusters/_template/bootstrap-kit/26-langfuse.yaml - kustomization.yaml entry - scripts/expected-bootstrap-deps.yaml slot 26 entry 2. bp-mimir 1.0.2 — re-enable ingester.push_grpc_method_enabled. Upstream mimir-distributed 6.0.6 disables Push gRPC when ingest-storage is off, but classic-mode ingester REQUIRES it. The combo crashloops with 'cannot disable Push gRPC method in ingester, while ingest storage (-ingest-storage.enabled) is not enabled'. Caught live on otech43 with 17 restarts. Both issues block Phase-1 ready=40/40 from being a clean signal. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-mimir): chart 1.0.2 push_grpc_method_enabled + finalize langfuse drop Follow-up to previous commit which only captured the file deletion. This commit applies: bp-mimir 1.0.2 chart bump, kustomization + expected-deps removal of langfuse, bootstrap-kit version bumps. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 13:50:38 +04:00
e3mrah	544dc86b5b	fix(wizard): blueprint deps sourced from Flux dependsOn (single source of truth) (#652 ) * fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) * fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth) The wizard's componentGroups.ts carried hand-maintained `dependencies: [...]` arrays that deviated from the real Flux install graph in clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced this): componentGroups.ts Flux HelmRelease.dependsOn ---------------------- --------------------------- keycloak: [cnpg] keycloak: [cert-manager, gateway-api] openbao: [] openbao: [spire, gateway-api, cnpg] harbor: [cnpg, seaweedfs, harbor: [cnpg, cert-manager, valkey] gateway-api] Founder's directive: "all the real dependencies are related to real flux related dependencies, if you are hosting irrelevant hardcoded baseless wizard catalog dependencies, I dont know where they are coming from. The single source of truth for the dependencies is flux!!!" — 2026-05-03 This commit: 1. Adds scripts/generate-blueprint-deps.sh that parses every bootstrap-kit HelmRelease and emits blueprint-deps.generated.json keyed by bare component id (bp- prefix stripped on both source and target side). 2. Commits the generated JSON. 3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id). 4. Patches componentGroups.ts so every RAW_COMPONENT's `dependencies` field is OVERRIDDEN at module load with the Flux-canonical list (the inline `dependencies: [...]` literals are now ignored — Flux is canonical). Follow-ups (not in this PR): - CI drift check that re-runs the script and diffs the JSON. - Strip the inline `dependencies: [...]` arrays entirely once the drift check is green. - Wire the FlowPage edge-rendering to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 09:47:52 +04:00
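The transform described in commit 544dc86b5b (read each HelmRelease's dependsOn, strip the bp- prefix on both sides, emit a map keyed by bare component id) might look roughly like the sketch below. It is not the contents of scripts/generate-blueprint-deps.sh: the function names are hypothetical, and the input is assumed to be HelmRelease documents already parsed into dictionaries, so that no YAML library is needed.

```python
def strip_bp_prefix(name):
    """Drop a leading 'bp-' so ids match the wizard's bare component ids."""
    return name[3:] if name.startswith("bp-") else name


def blueprint_deps(documents):
    """Map bare component id -> sorted bare ids it depends on.

    `documents` is an iterable of already-parsed manifests; anything that
    is not a HelmRelease (for example a HelmRepository) is ignored.
    """
    deps = {}
    for doc in documents:
        if doc.get("kind") != "HelmRelease":
            continue
        component = strip_bp_prefix(doc["metadata"]["name"])
        depends_on = doc.get("spec", {}).get("dependsOn") or []
        deps[component] = sorted(strip_bp_prefix(d["name"]) for d in depends_on)
    return dict(sorted(deps.items()))


# Example input mirroring two rows of the comparison quoted in the commit.
EXAMPLE_DOCS = [
    {"kind": "HelmRepository", "metadata": {"name": "bp-keycloak"}},
    {"kind": "HelmRelease", "metadata": {"name": "bp-keycloak"},
     "spec": {"dependsOn": [{"name": "bp-cert-manager"},
                            {"name": "bp-gateway-api"}]}},
    {"kind": "HelmRelease", "metadata": {"name": "bp-openbao"},
     "spec": {"dependsOn": [{"name": "bp-spire"},
                            {"name": "bp-gateway-api"},
                            {"name": "bp-cnpg"}]}},
]
```

Sorting both the keys and each dependency list keeps the generated JSON stable, which is what a later drift check that diffs the file would rely on.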
e3mrah	8d2ba0495d	fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586) Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)	2026-05-02 15:18:49 +04:00
hatiyildiz	7c3ff940ff	fix(ci): update solver_test.go fixtures + expected-bootstrap-deps.yaml for #550 - core/cmd/cert-manager-dynadot-webhook/solver_test.go: fix SetDns2Response → SetDnsResponse and ResponseCode:"0" → ResponseCode:0 in test fixtures so webhook command tests pass against the corrected dynadot-client JSON parsing - scripts/expected-bootstrap-deps.yaml: declare bp-cert-manager-dynadot-webhook at slot 49b so the bootstrap-kit dependency-graph audit passes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 10:44:18 +02:00
e3mrah	f689766615	fix(infra): add explicit dependsOn to bp-openbao + bp-catalyst-platform (#512 ) (#513 ) Phase-8a-preflight live deployment otech16 (9e14dcc0d2de7586, 2026-05-02): even after bumping install/upgrade timeout to 15m (commit `f47948e7`), the post-install hooks for bp-openbao and bp-catalyst-platform STILL race their dependencies. The hooks need workload pods Ready before they can do their work — bp-openbao 3-node Raft init waits for cnpg-postgres + Cilium L7, and bp-catalyst-platform umbrella init waits for keycloak + cnpg. Fix (Option C — explicit dependsOn): - bp-openbao: add bp-cnpg (already had bp-spire, bp-gateway-api) - bp-catalyst-platform: add bp-keycloak + bp-cnpg (already had bp-gitea, bp-gateway-api) This makes Flux wait for those HRs Ready=True BEFORE starting the install, so the post-install hooks run after deps are warm. Eliminates the race. Updated scripts/expected-bootstrap-deps.yaml to match. Verified: - bash scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles - go test ./tests/e2e/bootstrap-kit/... -run TestBootstrapKit_DependencyOrderMatchesCanonical — PASS Closes #512 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 06:00:56 +04:00
e3mrah	e1f7d22f3c	fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503 ) (#505 ) Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them ahead of every chart that ships HTTPRoute templates: bp-openbao, bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor, bp-grafana. Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to InstallFailed with `no matches for kind "HTTPRoute" in version "gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI. enabled=true` flag wires up the cilium gateway controller and creates the `cilium` GatewayClass, but does NOT install the gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no `installCRDs`-equivalent knob for gateway-api so the upstream CRDs must ship via a separate Blueprint. Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by the founder for ALL similar future cases: intra-chart CRD-ordering breaks → split into two charts + Flux dependsOn. Mirrors the bp-crossplane/bp-crossplane-claims and bp-external-secrets/ bp-external-secrets-stores splits. 
Files: - platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0 standard-install.yaml; helm.sh/resource-policy: keep on every CRD so Helm uninstall does not orphan every HTTPRoute on the cluster - platform/gateway-api/chart/scripts/regenerate.sh — developer tool for re-vendoring on upstream version bump (annotation-driven) - platform/gateway-api/chart/tests/crd-render.sh — chart integration test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin) - clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease + HelmRepository, dependsOn bp-cilium - clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea, 11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml — add `dependsOn: bp-gateway-api` - clusters/_template/bootstrap-kit/kustomization.yaml — register 01a-gateway-api.yaml between 01-cilium and 02-cert-manager - scripts/expected-bootstrap-deps.yaml — declare slot 1a + add bp-gateway-api to depends_on of every HTTPRoute-using slot Closes #503 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 01:30:50 +04:00
e3mrah	1865ac8975	fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340 ) (#504 ) * fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) The upstream seaweedfs/seaweedfs 4.22.0 chart now ships templates/shared/security-configmap.yaml which calls fromToml — a Sprig function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm SDK older than 3.13 and PARSES every template before any {{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's mere presence breaks install on every Sovereign with: parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21): function "fromToml" not defined even though enableSecurity defaults to false. Setting the gate value does NOT skip parsing — only deleting / never-shipping the file does. Fix shape (per ticket #340): 1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/ (committed bytes, not auto-pulled at build time). Required because the upstream Helm repo overwrites 4.22.0 in place — re-pulling would re-introduce the broken file. 2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml. Every other template that references the deleted ConfigMap is gated under {{- if enableSecurity }} so removing it is a no-op for our default deployment shape (Catalyst SeaweedFS auth happens at the S3 layer via IAM creds from External Secrets, not via the upstream chart's TLS/JWT machinery). 3. Drop the dependencies: block from chart/Chart.yaml; add annotations.catalyst.openova.io/no-upstream=true so the blueprint-release workflow's hollow-chart guard (issue #181) skips the auto-pull/round-trip checks for this chart. 4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the vendored bytes are tracked. 5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled). 6. Add tests/no-fromtoml.sh — chart-test that asserts the offending file stays deleted across future re-vendors. 
Runs in .github/workflows/blueprint-release.yaml as a publish-gating check. Unblocks Phase-8a observability + storage chain on otech (bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn bp-seaweedfs). Closes #340 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines 35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG in scripts/expected-bootstrap-deps.yaml was never updated to match. Pre-existing drift on main; surfaced by the dependency-graph-audit check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the audit passes on the same PR — the two changes are both about the storage chain on Sovereigns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 01:20:59 +04:00
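Because the helm-controller parses every template before any `if` gate is evaluated, the guard described for tests/no-fromtoml.sh has to look at template text, not rendered output. A rough sketch of that idea follows; it is not the actual test script. The function name and the regular expression are assumptions, and templates are passed in as a path-to-text mapping to keep the example free of filesystem access.

```python
import re

# Matches a fromToml call inside a Go-template action such as {{ ... }}.
FROM_TOML_CALL = re.compile(r"\{\{[^}]*\bfromToml\b")


def templates_using_from_toml(templates):
    """Return the sorted paths whose template text calls fromToml.

    `templates` maps a chart-relative path to the raw template body.
    A gated call still counts, since parsing happens before gating.
    """
    return sorted(
        path for path, body in templates.items()
        if FROM_TOML_CALL.search(body)
    )


EXAMPLE_TEMPLATES = {
    "templates/master/statefulset.yaml":
        "{{- if .Values.master.enabled }}\nkind: StatefulSet\n{{- end }}",
    "templates/shared/security-configmap.yaml":
        "{{- if .Values.global.seaweedfs.enableSecurity }}\n"
        "{{- $cfg := fromToml (.Files.Get \"security.toml\") }}\n"
        "{{- end }}",
}
```

A publish gate would fail whenever the returned list is non-empty, which catches a re-vendor that silently restores the offending file.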
e3mrah	87ba48c44e	fix(ci): vendor-coupling guardrail path - products/catalyst/bootstrap/api/internal/objectstorage (closes #438) (#440) The mode-gate check was looking for ${REPO_ROOT}/internal/objectstorage but the actual Go package lives at products/catalyst/bootstrap/api/internal/objectstorage. Update the path so hard-fail mode auto-engages on this repo. Validation: bash scripts/check-vendor-coupling.sh -> HARD-FAIL mode banner emitted, exit 0 on clean tree. Synthetic 'hetzner-object-storage' under platform/ -> exit 1. Refs: PR #437 (#383) which surfaced the bug. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 18:21:57 +04:00
e3mrah	0fdd411e79	ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428 ) (#431 ) Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names (hetzner\|aws\|gcp\|azure\|oci) appearing in capability-named slots: 1. <vendor>-object-storage (sealed-secret / overlay-secret name) 2. <chart>Overlay\.<vendor>\. (chart values block keyed to vendor) 3. <vendor>ObjectStorage (camelCase payload field) Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/, internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may discuss the rule). Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425 work-in-progress); hard-fail once that directory lands. Locally on this branch the script emits 49 warnings to stderr and exits 0 against the existing hetzner-coupled references in platform/velero, platform/seaweedfs, and clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those warnings disappear and any future re-introduction fails CI. Workflow trigger surface: push-to-main + pull_request on the scanned paths + workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled". Canonical seam used: scripts/ + .github/workflows/ (mirrors scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml shape). NOT a duplicate - no prior vendor-coupling guard existed. Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map) docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:49:49 +04:00
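The three vendor-coupling patterns and the exclusions listed in commit 0fdd411e79 can be expressed compactly as regular expressions. The sketch below is an approximation and not scripts/check-vendor-coupling.sh itself: it assumes lowercase vendor names, treats the exclusion list as path prefixes, and uses a hypothetical `find_violations` helper.

```python
import re

VENDORS = r"(?:hetzner|aws|gcp|azure|oci)"

# The three capability-named slots called out in the commit message.
PATTERNS = [
    re.compile(rf"\b{VENDORS}-object-storage\b"),   # secret name
    re.compile(rf"\w+Overlay\.{VENDORS}\."),        # chart values block
    re.compile(rf"\b{VENDORS}ObjectStorage\b"),     # camelCase payload field
]

# Paths that are legitimately per-provider and therefore skipped.
EXCLUDED_PATH = re.compile(
    rf"(?:^|/)(?:infra|internal|internal/objectstorage|core/pkg)/{VENDORS}/"
)


def find_violations(path, text):
    """Return (line_number, stripped_line) pairs that couple to a vendor."""
    if path.endswith(".md") or EXCLUDED_PATH.search(path):
        return []
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "crossplane-contrib/provider-" in line:
            continue  # Crossplane Provider CR refs are allowed
        if any(p.search(line) for p in PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

A warn-only versus hard-fail mode gate, as described in the commit, would then only decide whether a non-empty result changes the exit code.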
e3mrah	92b7db622d	fix(bp-external-secrets-stores): split ClusterSecretStore into separate chart per #247 pattern (closes #331 ) (#426 ) * fix(bp-external-secrets): split ClusterSecretStore into bp-external-secrets-stores chart (resolves CRD ordering, closes #331) bp-external-secrets@1.0.0 deadlocked on first install on otech.omani.works: Helm install failed for release external-secrets-system/external-secrets with chart bp-external-secrets@1.0.0: failed post-install: unable to build kubernetes object for deleting hook bp-external-secrets/templates/clustersecretstore-vault-region1.yaml: resource mapping not found for name: "vault-region1" namespace: "" no matches for kind "ClusterSecretStore" in version "external-secrets.io/v1beta1" Root cause: Helm's `helm.sh/hook-delete-policy: before-hook-creation` ran a kubectl-style lookup of the existing ClusterSecretStore CR before the upstream `external-secrets` subchart's CRDs finished registration. The in-line ClusterSecretStore template (templates/clustersecretstore-vault- region1.yaml) and the upstream subchart's CRDs co-installed in the same release; admission ordering wasn't deterministic enough to make the post-install hook safe. Fix — same pattern as PR #247 (bp-crossplane@1.1.3 ↔ bp-crossplane-claims@1.0.0): split the chart into controller + stores. Flux dependsOn orders them. - bp-external-secrets@1.1.0 — controller-only (just upstream subchart + NetworkPolicy + ServiceMonitor toggle). CRDs register here. - bp-external-secrets-stores@1.0.0 (NEW) — the default ClusterSecretStore CR; depends on bp-external-secrets being Ready. No Helm hooks needed: by the time this chart's HelmRelease starts, Flux has already verified bp-external-secrets is Ready=True and therefore the CRDs are registered. 
Files: NEW: platform/external-secrets-stores/blueprint.yaml (1.0.0) NEW: platform/external-secrets-stores/chart/Chart.yaml (1.0.0; no upstream subchart, annotation `catalyst.openova.io/no-upstream: "true"`) NEW: platform/external-secrets-stores/chart/values.yaml (clusterSecretStore.* knobs moved from controller chart) MOVED: platform/external-secrets/chart/templates/clustersecretstore-vault-region1.yaml → platform/external-secrets-stores/chart/templates/clustersecretstore-vault-region1.yaml (Helm hook annotations removed — Flux dependsOn now handles ordering) TOUCHED: platform/external-secrets/chart/Chart.yaml (1.0.0 → 1.1.0; description note appended) TOUCHED: platform/external-secrets/blueprint.yaml (1.0.0 → 1.1.0) TOUCHED: platform/external-secrets/chart/values.yaml (clusterSecretStore block removed; pointer comment added) NEW: clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml (Flux HelmRelease, dependsOn: [bp-external-secrets, bp-openbao]) TOUCHED: clusters/_template/bootstrap-kit/15-external-secrets.yaml (chart version 1.0.0 → 1.1.0) TOUCHED: clusters/_template/bootstrap-kit/kustomization.yaml (slot 15a inserted after 15) Out of scope for this PR (separate tickets): - blueprint-release.yaml CI fan-out: verify the path-matrix picks up the new platform/external-secrets-stores/ directory automatically; if not, add the directory to the matrix in a follow-up. - Per-Sovereign cluster directory edits (#257 will delete those). - Phase 0 minimum trim (#310 will renumber slots; this PR uses 15a as a non-disruptive sub-slot insertion that works with both the current 35-slot kustomization and the eventual 15-slot canonical layout — when #310 renumbers, 15 + 15a become 08 + 09 in the canonical order). 
Refs: #331 (this issue), #247 (pattern reference — bp-crossplane split), Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): register bp-external-secrets-stores in expected-bootstrap-deps.yaml The dependency-graph-audit CI step rejected PR #334 because the new bp-external-secrets-stores HR was on disk at slot 15a but missing from the expected DAG. This commit adds it with the same dependsOn shape as clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml: [bp-external-secrets, bp-openbao]. Refs: #331, #310 (Phase 0 minimum), PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bp-external-secrets): retire CR cases from controller test, add stores-toggle (#331) After splitting the default ClusterSecretStore into bp-external-secrets-stores @1.0.0, the controller chart's observability-toggle integration test still expected the CR to render in the controller chart (Cases 4 + 5). Those assertions now belong on the new chart. Changes: - platform/external-secrets/chart/tests/observability-toggle.sh: Replace Cases 4+5 with a single inverted assertion — the controller chart MUST render ZERO ClusterSecretStore CRs (top-level kind:); only the upstream subchart's CRD definition (whose spec.names.kind value is "ClusterSecretStore" at non-zero indent) is allowed. - platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh: NEW. Mirrors the retired Cases 4+5 against the stores chart, plus a Case 3 that asserts clusterSecretStore.server overrides propagate. Local smoke: bash platform/external-secrets/chart/tests/observability-toggle.sh → 4/4 PASS bash platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh → 3/3 PASS Refs: #331, PR #334. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): handle alphanumeric sub-slot suffixes in check-bootstrap-deps.sh PR #334 (issue #331) added slot 15a-external-secrets-stores as a sub-slot between numeric slots 15 and 16. The bootstrap-deps audit script's `printf '%02d'` formatter rejected `15a` with: scripts/check-bootstrap-deps.sh: line 390: printf: 15a: invalid number Fix: detect non-numeric slot tokens and pass them through verbatim. Numeric slots still render as zero-padded `01..49` for output alignment. Local smoke: $ bash scripts/check-bootstrap-deps.sh ... [P] slot 15 bp-external-secrets <-- bp-cert-manager bp-openbao [P] slot 15a bp-external-secrets-stores <-- bp-external-secrets bp-openbao ... OK: bootstrap-kit dependency graph audit PASSED Refs: #331, PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(wbs): tick #331 chart-released bp-external-secrets@1.1.0 (controller-only) + bp-external-secrets-stores@1.0.0 (NEW) shipped in PR #426. Helm-template acceptance + both toggle tests + dependency-graph-audit all green. Sovereign-impact deferred to Phase 8. Refs: #331, PR #426. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 17:33:47 +04:00
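The sub-slot fix in check-bootstrap-deps.sh (zero-pad numeric slots, pass tokens such as `15a` through verbatim) amounts to a small formatting rule. The sketch below restates it in Python for illustration only; the real script is bash. `format_slot` mirrors the behaviour the commit describes, while `slot_sort_key` is an added assumption showing one way to keep `15a` between 15 and 16.

```python
import re


def format_slot(token):
    """Zero-pad purely numeric slots to two digits; keep '15a' as-is."""
    if token.isascii() and token.isdigit():
        return f"{int(token):02d}"
    return token


def slot_sort_key(token):
    """Order slots by numeric prefix, then by alphabetic suffix.

    Hypothetical helper: it assumes tokens look like digits followed by
    optional lowercase letters, for example '5a', '15' or '49b'.
    """
    match = re.fullmatch(r"(\d+)([a-z]*)", token)
    if match is None:
        raise ValueError(f"unrecognised slot token: {token!r}")
    return int(match.group(1)), match.group(2)
```

For instance, `sorted(["16", "15a", "15"], key=slot_sort_key)` places the sub-slot directly after its parent slot.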
e3mrah	f7796ef807	feat(bp-velero): Hetzner Object Storage backend wiring (closes #384 ) (#423 ) * feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) Velero on a Hetzner Sovereign now writes its backups DIRECTLY to Hetzner Object Storage per ADR-0001 §13 (S3-aware app architecture rule) + docs/omantel-handover-wbs.md §3 — NOT SeaweedFS, which is reserved as a POSIX→S3 buffer for legacy POSIX-only writers and is not in the minimal Sovereign set. Mirrors the Hetzner-direct backend pattern Agent #383 is wiring for Harbor; both consume the canonical flux-system/hetzner-object-storage Secret shipped by issue #371 (cloud-init writes 5 keys: s3-endpoint / s3-region / s3-bucket / s3-access-key / s3-secret-key, derived from the operator-issued Hetzner-Console keys + the per-Sovereign bucket provisioned by OpenTofu's aminueza/minio resource). platform/velero/chart/ (umbrella chart, bumped to 1.1.0): - templates/_helpers.tpl: NEW — bp-velero.fullname / bp-velero.labels helpers + bp-velero.hetznerCredentialsSecretName (default `velero-hetzner-credentials`). - templates/hetzner-credentials-secret.yaml: NEW — synthesises a velero-namespace Secret with a single `cloud` key in AWS-CLI INI format from .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}. The upstream Velero deployment mounts this at /credentials/cloud via existingSecret + AWS_SHARED_CREDENTIALS_FILE. Skip-render path when veleroOverlay.hetzner.enabled is false (default — keeps contabo render clean) or useExistingSecret is true (operator supplied Secret out-of-band). - values.yaml: BSL provider/region/s3Url/bucket fields populated as placeholders the per-Sovereign HelmRelease overrides via Flux valuesFrom; backupsEnabled defaults FALSE so default render emits no half-broken BSL; veleroOverlay.hetzner block surfaces the operator-overridable fields. Long-form rationale comments inline on each value per the chart's existing docstring style. 
clusters/_template/bootstrap-kit/34-velero.yaml (+ omantel + otech): - dependsOn: bp-seaweedfs REMOVED — Velero is no longer a SeaweedFS consumer on Sovereigns (was the old SeaweedFS-tiered architecture that minimal-omantel retired in favour of cloud-native S3). - chart version bumped 1.0.0 → 1.1.0. - valuesFrom block added: 5 Secret-key entries pull each canonical s3-* key into the matching umbrella value path. Plaintext credentials never appear in the committed manifest; Flux dereferences valuesFrom at HelmRelease apply time. - values block adds the baseline veleroOverlay.hetzner.enabled=true + velero.credentials.{useSecret:true,existingSecret:velero-hetzner- credentials} + BSL provider/credential/s3ForcePathStyle scaffolding that the valuesFrom entries fill in. docs/omantel-handover-wbs.md: - §2 row 19: "❌ chart needs S3 endpoint rework" → "🟢 chart-released v1.1.0 — Hetzner Object Storage backend wired to #371 secret". - §9 #384 row: detailed status with smoke evidence. Smoke evidence (contabo, default values — no Hetzner credentials): - helm template t . → renders cleanly (no Hetzner Secret, no BSL). - helm template t . --set veleroOverlay.hetzner.enabled=true \ --set ...accessKey=AK_TEST --set ...secretKey=SK_TEST \ --set velero.backupsEnabled=true (+ BSL config) → Secret/velero-hetzner-credentials with `cloud` INI key emitted + BackupStorageLocation/default with provider=aws, bucket=omantel-velero, region=fsn1, s3Url=https://fsn1.your-objectstorage.com. - helm install velero-smoke . -n velero-smoke (defaults) → pod velero-69bb84c5-669sh Ready 1/1 in 48s. Smoke torn down clean. Hetzner-S3 E2E deferred to Phase 8 (first omantel run) — contabo has no Hetzner Object Storage credentials so end-to-end backup→restore verification can't run here. Anti-duplication rule: NO bash scripts authored, NO parallel implementations of upstream Velero functionality. 
Upstream Velero + velero-plugin-for-aws natively support any S3-compatible backend; the work here is values + a credential-shape adapter Secret, not a fork. Closes #384. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): drop bp-seaweedfs dep from bp-velero expected DAG (#384) Mirrors the dependsOn removal in clusters/_template/bootstrap-kit/34- velero.yaml from the parent commit. Velero on Hetzner Sovereigns now writes directly to Hetzner Object Storage (ADR-0001 §13 + WBS §3); no in-cluster prerequisite Blueprint is required. Local `bash scripts/check-bootstrap-deps.sh` now passes (0 drift, 0 cycles). The CI failure on the parent commit's PR was the audit flagging bp-velero as having a missing edge to bp-seaweedfs because this expected-DAG file still listed it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:24:44 +04:00
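The credential-shape adapter in commit f7796ef807 turns an access key and secret key into a single `cloud` key in AWS shared-credentials INI format. A minimal sketch of that shape follows. It is not the chart's Helm template: the function names are hypothetical, and the Secret name and namespace simply reuse the defaults quoted in the commit.

```python
import base64


def render_cloud_credentials(access_key, secret_key, profile="default"):
    """Build the AWS shared-credentials INI text Velero reads from a file."""
    return (
        f"[{profile}]\n"
        f"aws_access_key_id={access_key}\n"
        f"aws_secret_access_key={secret_key}\n"
    )


def credentials_secret(access_key, secret_key,
                       name="velero-hetzner-credentials",
                       namespace="velero"):
    """Return a Kubernetes Secret manifest (as a dict) with one `cloud` key."""
    cloud = render_cloud_credentials(access_key, secret_key)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Secret `data` values must be base64-encoded.
        "data": {"cloud": base64.b64encode(cloud.encode()).decode()},
    }
```

In the chart itself the key material arrives through Flux valuesFrom, so plaintext values like the ones passed here would never be committed.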
e3mrah	6e0f734d62	fix(bootstrap-kit): renumber bp-cert-manager-powerdns-webhook 36→49 + register in expected DAG (#373 followup) (#412) PR #410 landed slot 36 for bp-cert-manager-powerdns-webhook, but slot 36 was already reserved in scripts/expected-bootstrap-deps.yaml for bp-stunner (W2.K4 forward-declaration). The bootstrap-kit dependency audit failed on the merge SHA `04308af7` with: ERROR: HR 'bp-cert-manager-powerdns-webhook' (file clusters/_template/bootstrap-kit/36-bp-cert-manager-powerdns-webhook.yaml) is present on disk but NOT declared in scripts/expected-bootstrap-deps.yaml. Two fixes here: 1. Move the file to slot 49 (first free slot after W2.K4's 35-48 forward declarations). File renamed; kustomization.yaml updated; in-file comment block updated to explain the slot choice. 2. Register slot 49 in scripts/expected-bootstrap-deps.yaml as `wave: present` with `depends_on: [bp-cert-manager, bp-powerdns]` — matches the HelmRelease's actual dependsOn block. Local audit: $ bash scripts/check-bootstrap-deps.sh Present on disk: 36 Declared expected: 49 Deferred (W2.K1-K4): 13 Drift: 0 Cycles: 0 OK: bootstrap-kit dependency graph audit PASSED This is a CI-only follow-up; chart and runtime semantics from #410 are unchanged. Sovereign-impact deferred to Phase 8 per chart-only DoD. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:46:49 +04:00
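The audit counters quoted in the commit above (present, declared, deferred, drift) can be read as plain set arithmetic: 49 declared minus 36 present leaves 13 deferred. The following is a minimal Python sketch of that classification, not the logic of scripts/check-bootstrap-deps.sh itself. The function name `classify`, the `undeclared` bucket name, and the dict-of-sets input shape are assumptions made for illustration.

```python
def classify(on_disk, expected):
    """Compare the HelmReleases found on disk against the expected DAG.

    Both arguments map an HR name to the set of names it depends on.
    This input shape is an assumption of this sketch.
    """
    disk_names, expected_names = set(on_disk), set(expected)
    return {
        # On disk but never declared: the condition reported as an ERROR above.
        "undeclared": sorted(disk_names - expected_names),
        # Declared but not yet on disk: informational only.
        "deferred": sorted(expected_names - disk_names),
        # Present in both, but the dependsOn edges disagree.
        "drift": sorted(
            name
            for name in disk_names & expected_names
            if set(on_disk[name]) != set(expected[name])
        ),
    }


on_disk = {
    "bp-cert-manager": set(),
    "bp-powerdns": set(),
    "bp-cert-manager-powerdns-webhook": {"bp-cert-manager", "bp-powerdns"},
}
expected = dict(on_disk, **{"bp-stunner": set()})  # declared ahead of its file
report = classify(on_disk, expected)
```

In this example `bp-stunner` lands in `deferred` because it is declared but has no file yet, which mirrors the forward-declaration case described in the commit.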
e3mrah	0289f0388d	feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259) Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml, the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3. The script parses every clusters/_template/bootstrap-kit/.yaml, extracts metadata.name + spec.dependsOn for the HelmRelease document(s), and mechanically verifies the actual graph against the expected DAG declared in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch (W2.K1-K4) on success. Behaviour against the in-flight expansion: HRs declared expected but not yet on disk are reported as "deferred" (informational, not an error), so that this script can be the static authoritative list while W2.K1-K4 PRs land their HR files in series. After all four W2 PRs merge, the "deferred" count drops to 0 and the audit goes 100% green. Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a new dependency-graph-audit job that runs on every PR touching: - clusters/* (any HR file edit) - scripts/check-bootstrap-deps.sh - scripts/expected-bootstrap-deps.yaml - .github/workflows/test-bootstrap-kit.yaml Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:16:16 +04:00
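The commit above says the audit detects cycles via Kahn's algorithm. As a rough illustration of that technique (the actual script is Bash and may be structured differently), here is a minimal Python sketch. The function name `unresolved_nodes` and the example graphs are hypothetical.

```python
from collections import deque


def unresolved_nodes(depends_on):
    """Run Kahn's algorithm over a dependsOn graph.

    depends_on maps a node to the nodes it depends on. Returns the nodes
    that could never be scheduled: those on a cycle or downstream of one.
    An empty list means the graph is a DAG.
    """
    # Include nodes that appear only as someone else's dependency.
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    unmet = {n: 0 for n in nodes}        # count of not-yet-satisfied deps
    dependents = {n: [] for n in nodes}  # reverse edges
    for node, deps in depends_on.items():
        for dep in set(deps):
            unmet[node] += 1
            dependents[dep].append(node)

    # Start from nodes with no dependencies and peel the graph layer by layer.
    queue = deque(sorted(n for n in nodes if unmet[n] == 0))
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            unmet[child] -= 1
            if unmet[child] == 0:
                queue.append(child)

    return sorted(n for n in nodes if unmet[n] > 0)


acyclic = {"webhook": ["cert-manager", "powerdns"], "powerdns": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
acyclic_result = unresolved_nodes(acyclic)
cyclic_result = unresolved_nodes(cyclic)
```

Any node still holding an unmet dependency after the queue drains is either part of a cycle or waits on one, so a non-empty result is enough to fail the audit.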
hatiyildiz	9e3268f2c5	docs(ops): comprehensive operator runbook + remediation playbook + idempotent recovery script Adds docs/RUNBOOK-OPERATIONS.md as the single operator-facing entry point for provisioning, troubleshooting, and recovering Catalyst Sovereigns: A. Pre-provision checklist — Hetzner project + token, Dynadot pool zones + credentials, GHCR pull token (cross-link SECRET-ROTATION.md), PowerDNS pool zones bootstrapped, PDM healthy, bp-* chart versions, subchart-guard CI green. B. Step-by-step walkthrough with timing — Phase 0 OpenTofu (30-60s plan + 60-120s apply), PDM /commit (~5s), cloud-init (3-5min), Phase 1 bootstrap-kit (10-15min), cert-manager + Cilium Gateway (1-2min). Total 15-25min for a solo Sovereign. C. 18 known failure modes with SYMPTOM / ROOT CAUSE / DIAGNOSIS / RECOVERY, each pinned to the canonical fix commit (`c6cbfe68`, `e571ec7a`, `54872009`, `2022e1af`, `34c8de84`, `dddbab4b`, `43aff202`, `418cead0`, `64d7de97`, `330211d2`, `41c7ac13`) or marked fix-in-flight where applicable. D. Idempotent recovery script (Hetzner purge with DELETE-204-but-resource-persists verification sweep, PDM allocation release, catalyst-api deployment-record cancel). Dry-run by default; --apply gates real deletes on a validated HETZNER_API_TOKEN. E. Cross-links to INVIOLABLE-PRINCIPLES, SOVEREIGN-PROVISIONING, RUNBOOK-PROVISIONING, BLUEPRINT-AUTHORING, CHART-AUTHORING, SECRET-ROTATION, PLATFORM-POWERDNS, IMPLEMENTATION-STATUS — references, doesn't duplicate. F. Mermaid phase timeline diagram at the top showing ownership boundaries (catalyst-provisioner -> cloud-init -> Sovereign cluster) and hand-off points. G. Mermaid failure decision tree at the end — operators land at the right §C entry in 4-6 yes/no questions. Recovery script gracefully degrades to a name-only preview when HETZNER_API_TOKEN is unset in dry-run mode (apply mode still hard-fails on missing/invalid token), so operators can review what WOULD happen before exporting the token. 
Verified dry-run output against the live omantel.omani.works Sovereign: - Step 1 lists 8 Hetzner kinds + 8 verification-sweep targets to inspect - Step 2 confirms PDM reports the subdomain currently RESERVED (live state) - Step 3 correctly identifies catalyst-api deployment 6274daeb7a9873cd Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:26:29 +02:00
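The recovery script described above defaults to dry-run, falls back to a name-only preview when HETZNER_API_TOKEN is unset, and hard-fails in apply mode without a token. A minimal Python sketch of that gating follows. It is an illustration only: the function name `select_mode` and the mode labels are hypothetical, and it checks only that the token is present, whereas the commit says the real script also validates it.

```python
import argparse


def select_mode(argv, env):
    """Pick the run mode from CLI flags and environment.

    --apply and HETZNER_API_TOKEN come from the commit message; the
    returned mode labels are invented for this sketch.
    """
    parser = argparse.ArgumentParser(description="recovery mode sketch")
    parser.add_argument("--apply", action="store_true",
                        help="perform real deletes instead of a dry run")
    args = parser.parse_args(argv)

    token = env.get("HETZNER_API_TOKEN", "")
    if args.apply:
        if not token:
            # Apply mode never degrades: a missing token is a hard failure.
            raise SystemExit("refusing to apply: HETZNER_API_TOKEN is not set")
        return "apply"
    # Dry run: with a token we could query live state; without one we can
    # only list the resource names that would be targeted.
    return "dry-run" if token else "name-only-preview"


preview = select_mode([], {})
dry_run = select_mode([], {"HETZNER_API_TOKEN": "example"})
apply_mode = select_mode(["--apply"], {"HETZNER_API_TOKEN": "example"})
```

Making the destructive path opt-in means a mistyped or incomplete invocation can at worst print a preview.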

25 Commits