openova

Author	SHA1	Message	Date
hatiyildiz	3b5fca2033	merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189)	2026-04-29 19:43:59 +02:00
hatiyildiz	4f56ae47da	fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs Pre-fix the cloud-init template passed --disable=local-storage to the k3s installer with the design intent that Crossplane would install hcloud-csi day-2 and register a StorageClass after bp-crossplane reconciled. That created a circular dependency on a fresh Sovereign: every PVC-using HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres, bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres) blocks Pending on a StorageClass that would only exist after bp-crossplane finished installing — but they ARE in the bootstrap-kit Kustomization that needs to converge before the day-2 path runs. Verified live on omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0 both stuck Pending for 20+ min with `no persistent volumes available for this claim and no storage class is set`, `kubectl get sc` empty. This change: 1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its built-in local-path-provisioner and registers the `local-path` StorageClass on first boot. 2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap apply that: a. waits for the local-path-provisioner pod Ready b. patches the local-path SC with is-default-class=true c. fails loudly if the SC is missing post-wait (safety gate so a broken cluster doesn't fall through to Flux silently) 3. Adds tests/integration/storageclass.sh — phase 1 render-assertion (regression gate against re-introducing --disable=local-storage, plus positive assertions that the wait/patch/verify steps are present, plus ordering check that the patch precedes the Flux apply); phase 2 kind-cluster proof that a fresh cluster has a default StorageClass that binds a test PVC. 4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom, root cause, and the live-cluster recovery path (apply local-path-storage.yaml + patch default class) for already-provisioned Sovereigns that hit this without reprovisioning. Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target (single CPX21/CPX31 control-plane node) that is the correct shape — the data lives on the node, capacity is bounded by the disk, and there are no other nodes for volumes to migrate to. Operators upgrading to multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate, deliberate operation; that is not part of the cloud-init bootstrap. Live verification on omantel.omani.works (reproduces the production symptom + proves the recovery path): Before: NAMESPACE NAME STATUS AGE keycloak data-keycloak-postgresql-0 Pending 10m spire-system spire-data-spire-server-0 Pending 10m No StorageClass. After (kubectl apply local-path-storage.yaml + patch): NAME PROVISIONER ... AGE local-path (default) rancher.io/local-path ... 34s NAMESPACE NAME STATUS STORAGECLASS keycloak data-keycloak-postgresql-0 Bound local-path spire-system spire-data-spire-server-0 Bound local-path Gates: - tofu validate: Success! The configuration is valid. - tests/integration/storageclass.sh: PASS (phase 1 render-assertion + phase 2 fresh kind cluster default StorageClass binds test PVC). - Regression sanity: re-injecting --disable=local-storage causes phase 1 to FAIL with the documented error message (verified). Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that block); the StorageClass setup runs between healthz-wait and the Flux bootstrap apply so the bootstrap-kit Kustomization sees a default class on its first reconciliation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:43:09 +02:00
hatiyildiz	b0c1c07271	fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction) Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via `https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`. helm-controller's reconcile of bp-flux ran `helm install` on top of the running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver admission with `status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`; Helm rolled back; the rollback DELETED every running Flux controller Deployment (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller). The cluster lost its GitOps engine — no further HelmRelease could progress, and the only recovery was full `tofu destroy` + reprovision. This is OPTION C of the architectural fix proposed in the incident memo: version-align cloud-init's flux2 install with the bp-flux umbrella chart's `flux2` subchart so a single upstream Flux release is installed and helm-controller adopts it on first reconcile rather than reinstalls on top with a different version. Changes: * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml URL pinned at v2.4.0 (deliberate; this is the source of truth) and added the CRITICAL VERSION-PIN INVARIANT comment block documenting the failure mode. * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from 2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion 2.4.0, matching cloud-init exactly. Bumped chart version 1.1.1 -> 1.1.2. * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream.version` mirror of the dep pin moved from 2.13.0 to 2.14.1. * `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped bp-flux HelmRelease to 1.1.2 + added explicit `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than rolling back on ownership conflict. * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2. * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml` — bumped HelmRelease to 1.1.2. * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case catastrophic-failure replay test: Case 1: Chart.yaml declares the flux2 subchart with explicit version. Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag. Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version (the load-bearing invariant). Case 4: values.yaml metadata mirrors the Chart.yaml dep pin. Case 5: helm template renders cleanly + contains the four core Flux controllers. Case 6: replay test rejects a planted mismatched fake Chart.yaml (the gate's own self-test — proves the gate works). All six cases green locally; the new test joins the existing observability-toggle test in tests/. * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install — version-pin invariant" documenting the failure mode, the four pin-sites, the safe bump procedure, and the existing-Sovereign recovery path (full reprovision). Existing Sovereigns running 1.1.1: no in-place recovery is possible once the rollback has fired. Reprovision required against 1.1.2. Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4 (never hardcode) — the version pins remain operator-bumpable via PR, but BOTH cloud-init's URL AND the chart's subchart MUST move together in the same PR; CI gate tests/version-pin-replay.sh enforces this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:38:17 +02:00
hatiyildiz	0106c6e436	fix(helmwatch): wait for first HelmRelease before allowing terminate-on-all-done The Phase-1 watcher misread an empty informer cache as "all observed components terminal" — on the omantel.omani.works run the watch returned finalStatus=ready one second after flux-bootstrap because Flux on the new Sovereign hadn't yet reconciled the bootstrap-kit Kustomization, so zero bp-* HelmReleases existed. Zero is not done. Watcher now refuses to consider termination until BOTH: - firstSeenAt is non-zero (≥1 bp-* HelmRelease has been observed), AND - len(observed) ≥ MinBootstrapKitHRs (default 11, the bootstrap-kit count) A periodic ticker emits a single warn event after FirstSeenTimeout (default 15m) when zero HRs have been observed, naming the operator playbook in docs/RUNBOOK-PROVISIONING.md §"Phase 1 watch shows 0 HelmReleases". The watch CONTINUES — late HRs still flow. Watcher.Outcome() classifies the run as ready / failed / timeout / flux-not-reconciling. The handler copies it onto Deployment.Result.Phase1Outcome so the Sovereign Admin's wizard banner can render the right operator-actionable diagnostic, and on flux-not-reconciling flips Status=failed with an error message naming the runbook section. Both gates configurable per Principle #4: CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS (default 11) CATALYST_PHASE1_FIRST_SEEN_TIMEOUT (default 15m) Tests: 5 new helmwatch tests cover empty-list-doesn't-terminate, zero-HRs-after-1s-doesn't-terminate, 11-installed-terminates-ready, 11-with-1-failed-terminates-failed, 5-below-threshold-doesn't-terminate. All 25 existing helmwatch tests pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:29:07 +02:00
hatiyildiz	3864eef4e7	docs(reconcile-pass-2): align docs with ground truth at `6afdb303` - Wizard step canonical order updated to Org → Topology → Provider → Credentials → Components → Domain → Review (RUNBOOK-PROVISIONING, DEMO-RUNBOOK, IMPLEMENTATION-STATUS); SKU pickers cross-ref the PROVIDER_NODE_SIZES per-provider catalog (#176). - StepComponents UX rewritten: single flat marketplace card grid with family chips + product/family routes, two tabs (Choose Your Stack + Always Included) — replaces the prior "two-tab Mandatory infra/Apps" + "grouped by product header" prose (PRODUCT-FAMILIES, RUNBOOK-PROVISIONING, DEMO-RUNBOOK, COMPONENT-LOGOS). - CORTEX familyDependencies = [] reflected in PRODUCT-FAMILIES; the Specter / BGE cascade narratives rewritten to component-level-only resolution (langfuse → cnpg, librechat → ferretdb → cnpg) — fixes the "selecting Specter pulls entire FABRIC" over-broad claim. - catalyst-api OpenTofu workdir realigned from /var/lib/catalyst/... to /tmp/catalyst/tofu/<fqdn>/ via CATALYST_TOFU_WORKDIR env var (commit `27527e4c`) — fixes runtime drift in RUNBOOK-PROVISIONING, SOVEREIGN-PROVISIONING, DEMO-RUNBOOK; DEMO-RUNBOOK kubectl exec ns corrected from catalyst-system to catalyst. - Logo asset story rewritten: 58 logos (44 SVG + 14 PNG) sourced from CNCF artwork + project repos at #169b1d1c/#30ff318d, replacing the prior 62 stylised in-house marks; CI smoke-test (#6a7d2dd8) cross-referenced. - 12 G2 bootstrap-kit charts (original 11 + bp-powerdns #167) aligned in PROVISIONING-PLAN Group F + blueprint-release.yaml comment + SOVEREIGN-PROVISIONING header; previously stale at 11. - README repo-structure note updated: 12-component bootstrap kit + axon + external-dns leaf chart are built; 45 platform / 4 product folders remain README-only (was: "every folder except axon"). - ORCHESTRATOR-STATE main-tip SHA advanced from `dd578d1c` → `6afdb303` with one-line summary of the post-Pass-1 batch. - VALIDATION-LOG: Reconcile Pass 2 entry appended (drift fixed across 10 files; six-category rubric). Reconcile Pass 2 against main @ `6afdb303` — 10 files patched plus VALIDATION-LOG entry. Doc patches are landing first so the in-flight wizard step-reorder branch will merge into a doc set that already names the canonical order, avoiding a second drift round. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:48:57 +02:00
hatiyildiz	04559e5c37	docs(reconcile-pass-1): align docs with ground truth at `dd578d1c` Reconcile Pass 1 — first holistic LLM-driven reconciliation pass per ~/.claude/skills/reconcile-catalyst-docs/SKILL.md. Skill triggered after the post-Group-M architectural batch (#161, #162, #163, #167, #168, #169, #170, #171, #173, #174, #175). Live ground truth verified against kubectl + ls platform/ + git log + GHCR + componentGroups.ts. Drift categories fixed: - A. Numerical: bp-powerdns 1.0.5 → 1.0.6; component-logos 63 → 62 (powerdns SVG missing, tracked under #173); bootstrap kit 11 → 12 with bp-powerdns added per #167. - B. Service: pool-domain-manager + 5 registrar adapters (Cloudflare/Namecheap/GoDaddy/OVH/Dynadot, #170) added to IMPLEMENTATION-STATUS, ARCHITECTURE, PLATFORM-TECH-STACK, GLOSSARY, and PROVISIONING-PLAN; bp-powerdns added to ARCHITECTURE bootstrap kit + Catalyst-on-Catalyst dependency tree. - C. Architectural: SOVEREIGN-PROVISIONING §3 + DEMO-RUNBOOK Step 4 + ORCHESTRATOR-STATE Step 6 rewritten from Dynadot-direct DNS writes to PowerDNS authoritative + PDM /v1/commit + registrar-adapter NS-flip; PROVISIONING-PLAN Phase 4 paths corrected to products/catalyst/bootstrap/api/ (per INVIOLABLE-PRINCIPLES #3 the Go provisioner does NOT call cloud APIs); Phase 6 retitled and rewritten for the new DNS architecture. - D. Process: RUNBOOK-PROVISIONING §2 wizard-step table + DEMO-RUNBOOK Step 2 wizard-step table updated to canonical 7-step ordering (Org → Domain → Topology → Provider → Credentials → Components → Review per WIZARD_STEPS in WizardLayout.tsx, post #169 + #174); the three-mode StepDomain (pool / byo-manual / byo-api per #169) and two-tab StepComponents (mandatory infra + apps per #161/#162/#175) now documented. - E. Cross-doc: Group G ✅ across PROVISIONING-PLAN + ORCHESTRATOR-STATE (superseded by #167+#163+#170, not by the original Dynadot-multi-domain plan); Group C ✅ in PROVISIONING-PLAN (Flux is reconciling from openova-public today); README Stack-at-a-glance DNS row expanded. - F. Stale terminology: 11-grep banned-terms scan clean — every k8gb residual is a legitimate "removed at #171, replaced by lua-records" reference. VALIDATION-LOG.md gains the Reconcile Pass 1 entry per skill spec. Reconcile-skill numbering is independent of the Audit-skill numbering (which continues at Pass 108+). Files: 13 docs + VALIDATION-LOG entry. Escalations: none.	2026-04-29 09:40:10 +02:00
hatiyildiz	e8c3f6fd05	docs(runbook-provisioning): operator-level guide for sovereign-cloud teams Closes #136. New runbook companion to SOVEREIGN-PROVISIONING.md (the architectural contract) and PROVISIONING-PLAN.md (the Catalyst-Zero waterfall). Audience: a Sovereign cloud team (e.g. omantel-cloud) onboarding their first Sovereign via Catalyst-Zero at console.openova.io/sovereign. Sections: 1. What you get end-to-end 2. Pre-flight checklist (Hetzner project, API token, SSH key, region, domain mode, org name+email, topology) with cost estimate 3. Step-by-step: a. Open the wizard b. Walk the 7 steps with what each captures and why c. Watch the SSE event log (5 phases: tofu-init/plan/apply/output/flux-bootstrap) d. First login + DNS / cert-manager / CNAME caveats e. Day-1 setup checklist linked to SOVEREIGN-PROVISIONING.md §5 4. Troubleshooting matrix with 8 common failure modes mapped to recovery steps (token scope, hcloud quota, regional capacity, Cilium readiness chicken-and-egg, Let's Encrypt rate-limit, DNS propagation, Keycloak SMTP) 5. Re-runs + idempotency notes (tofu apply on existing state is safe) 6. Decommission flow tying back to SOVEREIGN-PROVISIONING.md §10.2 All claims about runtime behaviour cross-link to the canonical artifacts: provisioner.go for the SSE phases, infra/hetzner/main.tf for resource shape, cloudinit-control-plane.tftpl for the k3s+Flux bootstrap. Per INVIOLABLE-PRINCIPLES.md #7 the runbook flags Group M DoD as pending — it is operator-facing documentation of the deployed shape, not a claim of end-to-end runtime verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:54:14 +02:00
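
The version-pin invariant described in commit b0c1c07271 (cloud-init's pinned install.yaml tag must equal the flux2 subchart's appVersion) can be illustrated with a small Go sketch. This is not the repository's `version-pin-replay.sh`; the function names `pinnedFluxVersion` and `pinsAligned` are hypothetical, and the subchart appVersion is assumed to be supplied by the caller rather than parsed from a chart.

```go
package main

import (
	"fmt"
	"regexp"
)

// fluxTagRe captures the semantic version from an explicitly tagged flux2
// release URL such as ".../releases/download/v2.4.0/install.yaml".
var fluxTagRe = regexp.MustCompile(`/releases/download/v(\d+\.\d+\.\d+)/install\.yaml$`)

// pinnedFluxVersion (hypothetical helper) returns the version without the
// leading "v", or an error when the URL carries no explicit v-tag.
func pinnedFluxVersion(installURL string) (string, error) {
	m := fluxTagRe.FindStringSubmatch(installURL)
	if m == nil {
		return "", fmt.Errorf("no explicit v-tag in %q", installURL)
	}
	return m[1], nil
}

// pinsAligned (hypothetical helper) reports whether the cloud-init pin and
// the subchart appVersion name the same upstream Flux release.
func pinsAligned(installURL, subchartAppVersion string) (bool, error) {
	v, err := pinnedFluxVersion(installURL)
	if err != nil {
		return false, err
	}
	return v == subchartAppVersion, nil
}

func main() {
	// URL taken from the commit message; appVersions are the two it mentions.
	const url = "https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml"
	for _, appVersion := range []string{"2.4.0", "2.3.0"} {
		ok, err := pinsAligned(url, appVersion)
		fmt.Println(appVersion, ok, err)
	}
}
```

Rejecting URLs without an explicit tag mirrors the intent of Case 2 in the commit message: an unpinned install source cannot be compared against anything, so the sketch treats it as an error and not as a mismatch.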

7 Commits