f18dd8df19
7 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3b5fca2033 | merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189) | ||
|
|
4f56ae47da |
fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Before this fix, the cloud-init template passed --disable=local-storage to the k3s
installer with the design intent that Crossplane would install hcloud-csi
day-2 and register a StorageClass after bp-crossplane reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
blocks Pending on a StorageClass that would only exist after bp-crossplane
finished installing — but they ARE in the bootstrap-kit Kustomization
that needs to converge before the day-2 path runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.
This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
built-in local-path-provisioner and registers the `local-path`
StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
apply that:
a. waits for the local-path-provisioner pod to be Ready
b. patches the local-path SC with is-default-class=true
c. fails loudly if the SC is missing post-wait (safety gate so a
broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
(regression gate against re-introducing --disable=local-storage,
plus positive assertions that the wait/patch/verify steps are
present, plus ordering check that the patch precedes the Flux
apply); phase 2 kind-cluster proof that a fresh cluster has a
default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
root cause, and the live-cluster recovery path (apply
local-path-storage.yaml + patch default class) for already-provisioned
Sovereigns that hit this without reprovisioning.
Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.
Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):
Before:
NAMESPACE NAME STATUS AGE
keycloak data-keycloak-postgresql-0 Pending 10m
spire-system spire-data-spire-server-0 Pending 10m
No StorageClass.
After (kubectl apply local-path-storage.yaml + patch):
NAME PROVISIONER ... AGE
local-path (default) rancher.io/local-path ... 34s
NAMESPACE NAME STATUS STORAGECLASS
keycloak data-keycloak-postgresql-0 Bound local-path
spire-system spire-data-spire-server-0 Bound local-path
Gates:
- tofu validate: Success! The configuration is valid.
- tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
phase 2 fresh kind cluster default StorageClass binds test PVC).
- Regression sanity: re-injecting --disable=local-storage causes
phase 1 to FAIL with the documented error message (verified).
Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
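The phase 1 render-assertion described in the commit above could look roughly like the following Go sketch. It is not the contents of tests/integration/storageclass.sh (which is a shell script); the function name `CheckRender`, the marker strings, and the `flux-bootstrap.yaml` file name are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// CheckRender inspects rendered cloud-init text. patchMarker and fluxMarker
// are hypothetical substrings that identify the StorageClass patch step and
// the Flux bootstrap apply; the real template's wording is not known here.
func CheckRender(rendered, patchMarker, fluxMarker string) error {
	// Regression gate: the flag that removed local-path-provisioner must stay out.
	if strings.Contains(rendered, "--disable=local-storage") {
		return errors.New("regression: --disable=local-storage is present")
	}
	// Positive assertions: both steps must be present.
	patchAt := strings.Index(rendered, patchMarker)
	if patchAt < 0 {
		return fmt.Errorf("missing StorageClass patch step %q", patchMarker)
	}
	fluxAt := strings.Index(rendered, fluxMarker)
	if fluxAt < 0 {
		return fmt.Errorf("missing Flux bootstrap apply %q", fluxMarker)
	}
	// Ordering check: the default-class patch has to come before Flux runs.
	if patchAt > fluxAt {
		return errors.New("ordering: StorageClass patch must precede the Flux apply")
	}
	return nil
}

func main() {
	// Assumed sample input, not the real rendered template.
	rendered := strings.Join([]string{
		"install k3s",
		"kubectl patch storageclass local-path",
		"kubectl apply -f flux-bootstrap.yaml",
	}, "\n")
	err := CheckRender(rendered,
		"kubectl patch storageclass local-path",
		"kubectl apply -f flux-bootstrap.yaml")
	fmt.Println("render check error:", err)
}
```

Passing the markers as parameters keeps the sketch independent of the exact commands in the template, which the commit message does not quote.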
|
|
b0c1c07271 |
fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via `https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`. helm-controller's reconcile of bp-flux ran `helm install` on top of the running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver admission with `status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`; Helm rolled back; the rollback DELETED every running Flux controller Deployment (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller). The cluster lost its GitOps engine: no further HelmRelease could progress, and the only recovery was full `tofu destroy` + reprovision.
This is OPTION C of the architectural fix proposed in the incident memo: version-align cloud-init's flux2 install with the bp-flux umbrella chart's `flux2` subchart so a single upstream Flux release is installed and helm-controller adopts it on first reconcile rather than reinstalls on top with a different version.
Changes:
* `infra/hetzner/cloudinit-control-plane.tftpl`: kept the install.yaml URL pinned at v2.4.0 (deliberate; this is the source of truth) and added the CRITICAL VERSION-PIN INVARIANT comment block documenting the failure mode.
* `platform/flux/chart/Chart.yaml`: bumped `flux2` subchart dep from 2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion 2.4.0, matching cloud-init exactly. Bumped chart version 1.1.1 -> 1.1.2.
* `platform/flux/chart/values.yaml`: `catalystBlueprint.upstream.version` mirror of the dep pin moved from 2.13.0 to 2.14.1.
* `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml`: bumped bp-flux HelmRelease to 1.1.2 + added explicit `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than rolling back on ownership conflict.
* `products/catalyst/chart/Chart.yaml`: bumped bp-catalyst-platform umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.
* `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`: bumped HelmRelease to 1.1.2.
* `platform/flux/chart/tests/version-pin-replay.sh`: NEW. Six-case catastrophic-failure replay test:
  Case 1: Chart.yaml declares the flux2 subchart with explicit version.
  Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
  Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version (the load-bearing invariant).
  Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
  Case 5: helm template renders cleanly + contains the four core Flux controllers.
  Case 6: replay test rejects a planted mismatched fake Chart.yaml (the gate's own self-test; proves the gate works).
  All six cases green locally; the new test joins the existing observability-toggle test in tests/.
* `docs/RUNBOOK-PROVISIONING.md`: new section "bp-flux double-install — version-pin invariant" documenting the failure mode, the four pin-sites, the safe bump procedure, and the existing-Sovereign recovery path (full reprovision).
Existing Sovereigns running 1.1.1: no in-place recovery is possible once the rollback has fired. Reprovision required against 1.1.2.
Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4 (never hardcode): the version pins remain operator-bumpable via PR, but BOTH cloud-init's URL AND the chart's subchart MUST move together in the same PR; CI gate tests/version-pin-replay.sh enforces this.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
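The load-bearing invariant in the commit above (Case 3: the subchart's appVersion equals the tag pinned in cloud-init's install.yaml URL) can be illustrated with a small Go sketch. This is not the code of version-pin-replay.sh; `CheckVersionPin` and `tagInURL` are hypothetical names, and the check assumes the URL shape quoted in the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagInURL captures an explicit vX.Y.Z tag in a flux2 install.yaml URL of the
// form quoted in the commit message. URLs without an explicit tag do not match.
var tagInURL = regexp.MustCompile(`/releases/download/(v\d+\.\d+\.\d+)/install\.yaml$`)

// CheckVersionPin returns an error unless the tag pinned in installURL equals
// the chart's appVersion (with or without a leading "v").
func CheckVersionPin(installURL, chartAppVersion string) error {
	m := tagInURL.FindStringSubmatch(installURL)
	if m == nil {
		return fmt.Errorf("no explicit v-tag in install URL %q", installURL)
	}
	want := "v" + strings.TrimPrefix(chartAppVersion, "v")
	if m[1] != want {
		return fmt.Errorf("version mismatch: cloud-init pins %s, chart appVersion is %s", m[1], want)
	}
	return nil
}

func main() {
	url := "https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml"
	// Aligned pair from the fix: cloud-init v2.4.0 and appVersion 2.4.0.
	fmt.Println("aligned:", CheckVersionPin(url, "2.4.0"))
	// Pre-fix pair from the incident: cloud-init v2.4.0 and appVersion 2.3.0.
	fmt.Println("pre-fix:", CheckVersionPin(url, "2.3.0"))
}
```

In a real gate the two inputs would be read from the cloud-init template and the chart metadata; here they are passed in directly so the comparison itself stays easy to follow.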
|
|
0106c6e436 |
fix(helmwatch): wait for first HelmRelease before allowing terminate-on-all-done
The Phase-1 watcher misread an empty informer cache as "all observed components terminal": on the omantel.omani.works run the watch returned finalStatus=ready one second after flux-bootstrap because Flux on the new Sovereign hadn't yet reconciled the bootstrap-kit Kustomization, so zero bp-* HelmReleases existed. Zero is not done.
Watcher now refuses to consider termination until BOTH:
- firstSeenAt is non-zero (≥1 bp-* HelmRelease has been observed), AND
- len(observed) ≥ MinBootstrapKitHRs (default 11, the bootstrap-kit count)
A periodic ticker emits a single warn event after FirstSeenTimeout (default 15m) when zero HRs have been observed, naming the operator playbook in docs/RUNBOOK-PROVISIONING.md §"Phase 1 watch shows 0 HelmReleases". The watch CONTINUES: late HRs still flow.
Watcher.Outcome() classifies the run as ready / failed / timeout / flux-not-reconciling. The handler copies it onto Deployment.Result.Phase1Outcome so the Sovereign Admin's wizard banner can render the right operator-actionable diagnostic, and on flux-not-reconciling flips Status=failed with an error message naming the runbook section.
Both gates configurable per Principle #4:
CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS (default 11)
CATALYST_PHASE1_FIRST_SEEN_TIMEOUT (default 15m)
Tests: 5 new helmwatch tests cover:
- empty-list-doesn't-terminate
- zero-HRs-after-1s-doesn't-terminate
- 11-installed-terminates-ready
- 11-with-1-failed-terminates-failed
- 5-below-threshold-doesn't-terminate
All 25 existing helmwatch tests pass unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3864eef4e7 |
docs(reconcile-pass-2): align docs with ground truth at 6afdb303
- Wizard step canonical order updated to Org → Topology → Provider → Credentials → Components → Domain → Review (RUNBOOK-PROVISIONING, DEMO-RUNBOOK, IMPLEMENTATION-STATUS); SKU pickers cross-ref the PROVIDER_NODE_SIZES per-provider catalog (#176).
- StepComponents UX rewritten: single flat marketplace card grid with family chips + product/family routes, two tabs (Choose Your Stack + Always Included); replaces the prior "two-tab Mandatory infra/Apps" + "grouped by product header" prose (PRODUCT-FAMILIES, RUNBOOK-PROVISIONING, DEMO-RUNBOOK, COMPONENT-LOGOS).
- CORTEX familyDependencies = [] reflected in PRODUCT-FAMILIES; the Specter / BGE cascade narratives rewritten to component-level-only resolution (langfuse → cnpg, librechat → ferretdb → cnpg); fixes the "selecting Spector pulls entire FABRIC" over-broad claim.
- catalyst-api OpenTofu workdir realigned from /var/lib/catalyst/... to /tmp/catalyst/tofu/<fqdn>/ via CATALYST_TOFU_WORKDIR env var (commit
||
|
|
04559e5c37 |
docs(reconcile-pass-1): align docs with ground truth at dd578d1c
Reconcile Pass 1: first holistic LLM-driven reconciliation pass per ~/.claude/skills/reconcile-catalyst-docs/SKILL.md. Skill triggered after the post-Group-M architectural batch (#161, #162, #163, #167, #168, #169, #170, #171, #173, #174, #175). Live ground truth verified against kubectl + ls platform/ + git log + GHCR + componentGroups.ts.
Drift categories fixed:
- A. Numerical: bp-powerdns 1.0.5 → 1.0.6; component-logos 63 → 62 (powerdns SVG missing, tracked under #173); bootstrap kit 11 → 12 with bp-powerdns added per #167.
- B. Service: pool-domain-manager + 5 registrar adapters (Cloudflare/Namecheap/GoDaddy/OVH/Dynadot, #170) added to IMPLEMENTATION-STATUS, ARCHITECTURE, PLATFORM-TECH-STACK, GLOSSARY, and PROVISIONING-PLAN; bp-powerdns added to ARCHITECTURE bootstrap kit + Catalyst-on-Catalyst dependency tree.
- C. Architectural: SOVEREIGN-PROVISIONING §3 + DEMO-RUNBOOK Step 4 + ORCHESTRATOR-STATE Step 6 rewritten from Dynadot-direct DNS writes to PowerDNS authoritative + PDM /v1/commit + registrar-adapter NS-flip; PROVISIONING-PLAN Phase 4 paths corrected to products/catalyst/bootstrap/api/ (per INVIOLABLE-PRINCIPLES #3 the Go provisioner does NOT call cloud APIs); Phase 6 retitled and rewritten for the new DNS architecture.
- D. Process: RUNBOOK-PROVISIONING §2 wizard-step table + DEMO-RUNBOOK Step 2 wizard-step table updated to canonical 7-step ordering (Org → Domain → Topology → Provider → Credentials → Components → Review per WIZARD_STEPS in WizardLayout.tsx, post #169 + #174); the three-mode StepDomain (pool / byo-manual / byo-api per #169) and two-tab StepComponents (mandatory infra + apps per #161/#162/#175) now documented.
- E. Cross-doc: Group G ✅ across PROVISIONING-PLAN + ORCHESTRATOR-STATE (superseded by #167+#163+#170, not by the original Dynadot-multi-domain plan); Group C ✅ in PROVISIONING-PLAN (Flux is reconciling from openova-public today); README Stack-at-a-glance DNS row expanded.
- F. Stale terminology: 11-grep banned-terms scan clean; every k8gb residual is a legitimate "removed at #171, replaced by lua-records" reference.
VALIDATION-LOG.md gains the Reconcile Pass 1 entry per skill spec. Reconcile-skill numbering is independent of the Audit-skill numbering (which continues at Pass 108+).
Files: 13 docs + VALIDATION-LOG entry. Escalations: none. |
||
|
|
e8c3f6fd05 |
docs(runbook-provisioning): operator-level guide for sovereign-cloud teams
Closes #136. New runbook companion to SOVEREIGN-PROVISIONING.md (the architectural contract) and PROVISIONING-PLAN.md (the Catalyst-Zero waterfall). Audience: a Sovereign cloud team (e.g. omantel-cloud) onboarding their first Sovereign via Catalyst-Zero at console.openova.io/sovereign.
Sections:
1. What you get end-to-end
2. Pre-flight checklist (Hetzner project, API token, SSH key, region, domain mode, org name+email, topology) with cost estimate
3. Step-by-step:
   a. Open the wizard
   b. Walk the 7 steps with what each captures and why
   c. Watch the SSE event log (5 phases: tofu-init/plan/apply/output/flux-bootstrap)
   d. First login + DNS / cert-manager / CNAME caveats
   e. Day-1 setup checklist linked to SOVEREIGN-PROVISIONING.md §5
4. Troubleshooting matrix with 8 common failure modes mapped to recovery steps (token scope, hcloud quota, regional capacity, Cilium readiness chicken-and-egg, Let's Encrypt rate-limit, DNS propagation, Keycloak SMTP)
5. Re-runs + idempotency notes (tofu apply on existing state is safe)
6. Decommission flow tying back to SOVEREIGN-PROVISIONING.md §10.2
All claims about runtime behaviour cross-link to the canonical artifacts: provisioner.go for the SSE phases, infra/hetzner/main.tf for resource shape, cloudinit-control-plane.tftpl for the k3s+Flux bootstrap.
Per INVIOLABLE-PRINCIPLES.md #7 the runbook flags Group M DoD as pending: it is operator-facing documentation of the deployed shape, not a claim of end-to-end runtime verification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |