Commit Graph

7 Commits

Author SHA1 Message Date
hatiyildiz
3b5fca2033 merge: keep k3s local-path-provisioner; mark StorageClass default before Flux runs (closes #189) 2026-04-29 19:43:59 +02:00
hatiyildiz
4f56ae47da fix(cloudinit): keep k3s local-path-provisioner; mark StorageClass default before Flux runs
Pre-fix, the cloud-init template passed --disable=local-storage to the k3s
installer with the design intent that Crossplane would install hcloud-csi
day-2 and register a StorageClass after bp-crossplane reconciled. That
created a circular dependency on a fresh Sovereign: every PVC-using
HelmRelease in the bootstrap-kit (bp-spire, bp-keycloak postgres,
bp-openbao, bp-nats-jetstream, bp-gitea, bp-catalyst-platform postgres)
blocked in Pending on a StorageClass that would only exist after
bp-crossplane finished installing, yet those HelmReleases are part of
the bootstrap-kit Kustomization that must converge before the day-2 path
runs. Verified live on
omantel.omani.works: data-keycloak-postgresql-0 and spire-data-spire-server-0
both stuck Pending for 20+ min with `no persistent volumes available for
this claim and no storage class is set`, `kubectl get sc` empty.

This change:
1. Drops --disable=local-storage from INSTALL_K3S_EXEC so k3s ships its
   built-in local-path-provisioner and registers the `local-path`
   StorageClass on first boot.
2. Adds a runcmd block AFTER /healthz wait and BEFORE the Flux bootstrap
   apply that:
     a. waits for the local-path-provisioner pod Ready
     b. patches the local-path SC with is-default-class=true
     c. fails loudly if the SC is missing post-wait (safety gate so a
        broken cluster doesn't fall through to Flux silently)
3. Adds tests/integration/storageclass.sh — phase 1 render-assertion
   (regression gate against re-introducing --disable=local-storage,
   plus positive assertions that the wait/patch/verify steps are
   present, plus ordering check that the patch precedes the Flux
   apply); phase 2 kind-cluster proof that a fresh cluster has a
   default StorageClass that binds a test PVC.
4. Adds docs/RUNBOOK-PROVISIONING.md §"StorageClass missing" — symptom,
   root cause, and the live-cluster recovery path (apply
   local-path-storage.yaml + patch default class) for already-provisioned
   Sovereigns that hit this without reprovisioning.
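
The phase 1 render-assertion described in item 3 can be sketched roughly as
follows. This is a minimal Go illustration, not the contents of
tests/integration/storageclass.sh (which is a shell script); the function
name `checkRendered` and the two marker strings are hypothetical stand-ins
for whatever the rendered cloud-init template actually contains.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// checkRendered applies three checks to a rendered cloud-init document:
// the regression gate, the presence of both steps, and their ordering.
// patchMarker and fluxMarker are assumed substrings that identify the
// default-class patch step and the Flux bootstrap apply step.
func checkRendered(rendered, patchMarker, fluxMarker string) error {
	if strings.Contains(rendered, "--disable=local-storage") {
		return errors.New("regression: --disable=local-storage is present")
	}
	patchAt := strings.Index(rendered, patchMarker)
	fluxAt := strings.Index(rendered, fluxMarker)
	switch {
	case patchAt < 0:
		return errors.New("default-class patch step is missing")
	case fluxAt < 0:
		return errors.New("flux bootstrap apply step is missing")
	case patchAt > fluxAt:
		return errors.New("default-class patch must precede the flux apply")
	}
	return nil
}

func main() {
	// Hypothetical markers; the real template wording is not shown here.
	const patch = "storageclass.kubernetes.io/is-default-class"
	const flux = "flux2/releases/download"

	good := "patch sc local-path " + patch + "=true\napply " + flux + "/v2.4.0/install.yaml\n"
	swapped := "apply " + flux + "/v2.4.0/install.yaml\npatch sc local-path " + patch + "=true\n"
	regressed := "INSTALL_K3S_EXEC=--disable=local-storage\n" + good

	fmt.Println(checkRendered(good, patch, flux))      // prints <nil>
	fmt.Println(checkRendered(swapped, patch, flux))   // ordering error
	fmt.Println(checkRendered(regressed, patch, flux)) // regression error
}
```

The ordering check compares byte offsets of the first occurrence of each
marker, which is sufficient only if each marker appears once in the
rendered output.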

Trade-off: local-path PVs are node-pinned. For the solo-Sovereign target
(single CPX21/CPX31 control-plane node) that is the correct shape — the
data lives on the node, capacity is bounded by the disk, and there are
no other nodes for volumes to migrate to. Operators upgrading to
multi-node migrate to hcloud-csi (Hetzner Cloud Volumes) as a separate,
deliberate operation; that is not part of the cloud-init bootstrap.

Live verification on omantel.omani.works (reproduces the production
symptom + proves the recovery path):

  Before:
    NAMESPACE      NAME                         STATUS    AGE
    keycloak       data-keycloak-postgresql-0   Pending   10m
    spire-system   spire-data-spire-server-0    Pending   10m
    No StorageClass.

  After (kubectl apply local-path-storage.yaml + patch):
    NAME                   PROVISIONER             ...   AGE
    local-path (default)   rancher.io/local-path   ...   34s

    NAMESPACE      NAME                         STATUS   STORAGECLASS
    keycloak       data-keycloak-postgresql-0   Bound    local-path
    spire-system   spire-data-spire-server-0    Bound    local-path

Gates:
  - tofu validate: Success! The configuration is valid.
  - tests/integration/storageclass.sh: PASS (phase 1 render-assertion +
    phase 2 fresh kind cluster default StorageClass binds test PVC).
  - Regression sanity: re-injecting --disable=local-storage causes
    phase 1 to FAIL with the documented error message (verified).

Preserves the cloud-init Cilium-pre-Flux ordering (no changes to that
block); the StorageClass setup runs between healthz-wait and the Flux
bootstrap apply so the bootstrap-kit Kustomization sees a default class
on its first reconciliation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:43:09 +02:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.
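
The load-bearing invariant (Case 3 above) amounts to comparing the v-tag in
cloud-init's install.yaml URL with the flux2 subchart's appVersion. A
minimal Go sketch of that comparison follows; it is an illustration only,
not the contents of version-pin-replay.sh, and the names
`pinnedFluxVersion` and `checkVersionPin` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// releaseTag matches an explicit vX.Y.Z tag in a flux2 install.yaml URL.
var releaseTag = regexp.MustCompile(`/releases/download/v(\d+\.\d+\.\d+)/install\.yaml`)

// pinnedFluxVersion extracts the version (without the leading "v") from the
// install.yaml URL, and fails if the URL carries no explicit tag.
func pinnedFluxVersion(installURL string) (string, error) {
	m := releaseTag.FindStringSubmatch(installURL)
	if m == nil {
		return "", fmt.Errorf("no explicit v-tag in %q", installURL)
	}
	return m[1], nil
}

// checkVersionPin returns nil only when the cloud-init pin and the
// subchart appVersion name the same upstream Flux release.
func checkVersionPin(installURL, subchartAppVersion string) error {
	pinned, err := pinnedFluxVersion(installURL)
	if err != nil {
		return err
	}
	app := strings.TrimPrefix(subchartAppVersion, "v")
	if pinned != app {
		return fmt.Errorf("version-pin mismatch: cloud-init installs %s, subchart appVersion is %s", pinned, app)
	}
	return nil
}

func main() {
	url := "https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml"
	fmt.Println(checkVersionPin(url, "2.4.0")) // prints <nil>
	fmt.Println(checkVersionPin(url, "2.3.0")) // mismatch error (the 1.1.1 state)
}
```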

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
0106c6e436 fix(helmwatch): wait for first HelmRelease before allowing terminate-on-all-done
The Phase-1 watcher misread an empty informer cache as "all observed
components terminal" — on the omantel.omani.works run the watch returned
finalStatus=ready one second after flux-bootstrap because Flux on the new
Sovereign hadn't yet reconciled the bootstrap-kit Kustomization, so zero
bp-* HelmReleases existed. Zero is not done.

Watcher now refuses to consider termination until BOTH:
  - firstSeenAt is non-zero (≥1 bp-* HelmRelease has been observed), AND
  - len(observed) ≥ MinBootstrapKitHRs (default 11, the bootstrap-kit count)

A periodic ticker emits a single warn event after FirstSeenTimeout
(default 15m) when zero HRs have been observed, naming the operator
playbook in docs/RUNBOOK-PROVISIONING.md §"Phase 1 watch shows 0
HelmReleases". The watch CONTINUES — late HRs still flow.

Watcher.Outcome() classifies the run as ready / failed / timeout /
flux-not-reconciling. The handler copies it onto
Deployment.Result.Phase1Outcome so the Sovereign Admin's wizard banner
can render the right operator-actionable diagnostic, and on
flux-not-reconciling flips Status=failed with an error message naming
the runbook section.

Both gates configurable per Principle #4:
  CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS  (default 11)
  CATALYST_PHASE1_FIRST_SEEN_TIMEOUT     (default 15m)

Tests: 5 new helmwatch tests cover empty-list-doesn't-terminate,
zero-HRs-after-1s-doesn't-terminate, 11-installed-terminates-ready,
11-with-1-failed-terminates-failed, 5-below-threshold-doesn't-terminate.
All 25 existing helmwatch tests pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:29:07 +02:00
hatiyildiz
3864eef4e7 docs(reconcile-pass-2): align docs with ground truth at 6afdb303
- Wizard step canonical order updated to Org → Topology → Provider →
  Credentials → Components → Domain → Review (RUNBOOK-PROVISIONING,
  DEMO-RUNBOOK, IMPLEMENTATION-STATUS); SKU pickers cross-ref the
  PROVIDER_NODE_SIZES per-provider catalog (#176).
- StepComponents UX rewritten: single flat marketplace card grid with
  family chips + product/family routes, two tabs (Choose Your Stack +
  Always Included) — replaces the prior "two-tab Mandatory infra/Apps"
  + "grouped by product header" prose (PRODUCT-FAMILIES, RUNBOOK-
  PROVISIONING, DEMO-RUNBOOK, COMPONENT-LOGOS).
- CORTEX familyDependencies = [] reflected in PRODUCT-FAMILIES; the
  Specter / BGE cascade narratives rewritten to component-level-only
  resolution (langfuse → cnpg, librechat → ferretdb → cnpg) — fixes
  the "selecting Specter pulls entire FABRIC" over-broad claim.
- catalyst-api OpenTofu workdir realigned from /var/lib/catalyst/...
  to /tmp/catalyst/tofu/<fqdn>/ via CATALYST_TOFU_WORKDIR env var
  (commit 27527e4c) — fixes runtime drift in RUNBOOK-PROVISIONING,
  SOVEREIGN-PROVISIONING, DEMO-RUNBOOK; DEMO-RUNBOOK kubectl exec
  ns corrected from catalyst-system to catalyst.
- Logo asset story rewritten: 58 logos (44 SVG + 14 PNG) sourced from
  CNCF artwork + project repos at #169b1d1c/#30ff318d, replacing the
  prior 62 stylised in-house marks; CI smoke-test (#6a7d2dd8)
  cross-referenced.
- 12 G2 bootstrap-kit charts (original 11 + bp-powerdns #167) aligned
  in PROVISIONING-PLAN Group F + blueprint-release.yaml comment +
  SOVEREIGN-PROVISIONING header; previously stale at 11.
- README repo-structure note updated: 12-component bootstrap kit +
  axon + external-dns leaf chart are built; 45 platform / 4 product
  folders remain README-only (was: "every folder except axon").
- ORCHESTRATOR-STATE main-tip SHA advanced from dd578d1c to 6afdb303
  with one-line summary of the post-Pass-1 batch.
- VALIDATION-LOG: Reconcile Pass 2 entry appended (drift fixed across
  10 files; six-category rubric).

Reconcile Pass 2 against main @ 6afdb303 — 10 files patched plus
VALIDATION-LOG entry. Doc patches are landing first so the in-flight
wizard step-reorder branch will merge into a doc set that already
names the canonical order, avoiding a second drift round.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:48:57 +02:00
hatiyildiz
04559e5c37 docs(reconcile-pass-1): align docs with ground truth at dd578d1c
Reconcile Pass 1 — first holistic LLM-driven reconciliation pass per
~/.claude/skills/reconcile-catalyst-docs/SKILL.md. Skill triggered after
the post-Group-M architectural batch (#161, #162, #163, #167, #168,
#169, #170, #171, #173, #174, #175). Live ground truth verified against
kubectl + ls platform/ + git log + GHCR + componentGroups.ts.

Drift categories fixed:

- A. Numerical: bp-powerdns 1.0.5 → 1.0.6; component-logos 63 → 62
  (powerdns SVG missing, tracked under #173); bootstrap kit 11 → 12
  with bp-powerdns added per #167.
- B. Service: pool-domain-manager + 5 registrar adapters
  (Cloudflare/Namecheap/GoDaddy/OVH/Dynadot, #170) added to
  IMPLEMENTATION-STATUS, ARCHITECTURE, PLATFORM-TECH-STACK, GLOSSARY,
  and PROVISIONING-PLAN; bp-powerdns added to ARCHITECTURE bootstrap
  kit + Catalyst-on-Catalyst dependency tree.
- C. Architectural: SOVEREIGN-PROVISIONING §3 + DEMO-RUNBOOK Step 4
  + ORCHESTRATOR-STATE Step 6 rewritten from Dynadot-direct DNS writes
  to PowerDNS authoritative + PDM /v1/commit + registrar-adapter
  NS-flip; PROVISIONING-PLAN Phase 4 paths corrected to
  products/catalyst/bootstrap/api/ (per INVIOLABLE-PRINCIPLES #3 the
  Go provisioner does NOT call cloud APIs); Phase 6 retitled and
  rewritten for the new DNS architecture.
- D. Process: RUNBOOK-PROVISIONING §2 wizard-step table + DEMO-RUNBOOK
  Step 2 wizard-step table updated to canonical 7-step ordering
  (Org → Domain → Topology → Provider → Credentials → Components →
  Review per WIZARD_STEPS in WizardLayout.tsx, post #169 + #174); the
  three-mode StepDomain (pool / byo-manual / byo-api per #169) and
  two-tab StepComponents (mandatory infra + apps per #161/#162/#175)
  now documented.
- E. Cross-doc: Group G  across PROVISIONING-PLAN +
  ORCHESTRATOR-STATE (superseded by #167+#163+#170, not by the
  original Dynadot-multi-domain plan); Group C  in
  PROVISIONING-PLAN (Flux is reconciling from openova-public today);
  README Stack-at-a-glance DNS row expanded.
- F. Stale terminology: 11-grep banned-terms scan clean — every k8gb
  residual is a legitimate "removed at #171, replaced by lua-records"
  reference.

VALIDATION-LOG.md gains the Reconcile Pass 1 entry per skill spec.
Reconcile-skill numbering is independent of the Audit-skill numbering
(which continues at Pass 108+).

Files: 13 docs + VALIDATION-LOG entry.
Escalations: none.
2026-04-29 09:40:10 +02:00
hatiyildiz
e8c3f6fd05 docs(runbook-provisioning): operator-level guide for sovereign-cloud teams
Closes #136.

New runbook companion to SOVEREIGN-PROVISIONING.md (the architectural
contract) and PROVISIONING-PLAN.md (the Catalyst-Zero waterfall).
Audience: a Sovereign cloud team (e.g. omantel-cloud) onboarding their
first Sovereign via Catalyst-Zero at console.openova.io/sovereign.

Sections:
1. What you get end-to-end
2. Pre-flight checklist (Hetzner project, API token, SSH key, region,
   domain mode, org name+email, topology) with cost estimate
3. Step-by-step:
   a. Open the wizard
   b. Walk the 7 steps with what each captures and why
   c. Watch the SSE event log (5 phases: tofu-init/plan/apply/output/flux-bootstrap)
   d. First login + DNS / cert-manager / CNAME caveats
   e. Day-1 setup checklist linked to SOVEREIGN-PROVISIONING.md §5
4. Troubleshooting matrix with 8 common failure modes mapped to recovery
   steps (token scope, hcloud quota, regional capacity, Cilium readiness
   chicken-and-egg, Let's Encrypt rate-limit, DNS propagation, Keycloak SMTP)
5. Re-runs + idempotency notes (tofu apply on existing state is safe)
6. Decommission flow tying back to SOVEREIGN-PROVISIONING.md §10.2

All claims about runtime behaviour cross-link to the canonical artifacts:
provisioner.go for the SSE phases, infra/hetzner/main.tf for resource
shape, cloudinit-control-plane.tftpl for the k3s+Flux bootstrap. Per
INVIOLABLE-PRINCIPLES.md #7 the runbook flags Group M DoD as pending —
it is operator-facing documentation of the deployed shape, not a claim
of end-to-end runtime verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:54:14 +02:00