478743db17
20 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
238c6d2010
|
fix(bp-flux): mitigate helm-controller leader-election loss + stuck-HR recovery (#925) (#960)
* fix(bp-flux): mitigate helm-controller leader-election loss + recovery CronJob (#925) On otech113.omani.works the bp-vpa HelmRelease became stuck Ready=Unknown forever after a transient kube-apiserver blip caused helm-controller to lose its leader-election lease mid-install. The Helm release secret was already committed (Status=deployed) by the previous leader, but its last write to the HR's Ready condition was Unknown and the new leader's "release in storage?" short-circuit never re-evaluates that. The HR blocked bootstrap-kit → sovereign-tls → cilium-gateway, breaking every HTTPRoute on the Sovereign. Fix is two-pronged: 1) PRIMARY (prevent the trigger). Stretch leader-election lease durations on the three Catalyst-critical controllers (helm/kustomize/source) from the upstream defaults of lease=35s renew=30s retry=5s to lease=60s renew=40s retry=5s, and bump memory limits from 256Mi to 512Mi (helm) / 384Mi (kustomize, source) so OOMKills during 35-HR fan-out installs don't themselves trigger leadership handoffs. Costs ~50s extra failover time on a real controller crash; that's acceptable since CP HA is a Phase 2 concern and we'd much rather avoid spurious flips during transient API pressure. 2) RECOVERY (handle the residual case). New CronJob bp-flux-stuck-hr-recovery runs every 2 minutes, scans every HelmRelease cluster-wide, and for each HR stuck in Ready=Unknown for >5 minutes whose underlying Helm release secret already has status=deployed, force-toggles spec.suspend (the only known workaround per #925). Guardrail: refuses to act if more than 10 HRs would be touched in a single run (signals a cluster-wide outage). Operator-disablable via .Values.catalyst.stuckHelmReleaseRecovery.enabled=false. Lock-in tests: tests/leader-election-and-recovery.sh covers all three flag/memory bumps, CronJob render, RBAC presence, disable-toggle, and threshold operator override. version-pin-replay + observability-toggle still green. Chart bumped 1.1.4 → 1.2.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-flux): bump blueprint.yaml spec.version to 1.2.0 to match Chart.yaml (#925) The bootstrap-kit static validation gate (Chart.yaml version == blueprint.yaml spec.version) caught the missed bump on PR #960. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
ab67a48fe7
|
fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819)
TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for
9 blueprints because their platform/<name>/chart/Chart.yaml version had
been bumped without a matching update to platform/<name>/blueprint.yaml
spec.version. The pre-existing failure forced 7 recent PRs to self-merge
with --admin, masking real CI failures.
Aligned spec.version to match Chart.yaml version on:
cert-manager 1.1.1 -> 1.1.2
flux 1.1.3 -> 1.1.4
crossplane 1.1.3 -> 1.1.4
sealed-secrets 1.1.1 -> 1.1.2
spire 1.1.4 -> 1.1.7
nats-jetstream 1.1.1 -> 1.1.2
openbao 1.2.0 -> 1.2.14
keycloak 1.3.1 -> 1.3.2
gitea 1.2.1 -> 1.2.3
Verified locally:
$ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1
--- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s)
... all 10 sub-tests pass (cilium + the 9 above)
The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself
the drift guardrail: it fails CI whenever Chart.yaml is bumped without a
matching blueprint.yaml bump. No additional script needed.
Closes #817 once verified on main.
Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
|
||
|
|
83ec889f06
|
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)
Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.
Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
|
||
|
|
05cb39c042
|
fix(bp-flux): catalyst-cluster-reconciler ClusterRoleBinding overlay (closes #338) (#393)
PROBLEM ------- On Sovereign-1 (otech.omani.works, 2026-04-30) every HelmRelease that transitioned through pending-install/pending-upgrade got stuck because the helm-controller SA could not UPDATE its own helm-storage Secrets (sh.helm.release.v1.<name>.<n>) in flux-system. Symptom: secrets "sh.helm.release.v1.catalyst-platform.v1" is forbidden: User "system:serviceaccount:flux-system:helm-controller" cannot update resource "secrets" in API group "" in the namespace "flux-system" Runtime workaround on otech (added 2026-04-30): manual ClusterRoleBinding flux-system-helm-controller-admin → cluster-admin → flux-system/helm-controller. Tracked as the permanent fix in #338. FIX --- Add platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml — a Catalyst-managed ClusterRoleBinding (catalyst-cluster-reconciler) that binds cluster-admin to helm-controller AND kustomize-controller in .Values.catalyst.fluxNamespace (default flux-system). Independent from the upstream subchart's cluster-reconciler binding (different name, no ownership conflict), so if the upstream binding ever drifts again the overlay still holds the cluster correct. WHY cluster-admin (not narrower) -------------------------------- helm-controller installs arbitrary user-supplied Helm charts which can ship any K8s resource (CRDs, ClusterRoles, MutatingWebhookConfigurations, etc.). There is no narrower role that satisfies the full install path. The Flux project's own bootstrap install.yaml binds cluster-admin for the same reason (upstream default multitenancy.privileged=true). Multi-tenancy lockdown is a Sovereign Day-2 hardening choice tracked separately. NEVER-HARDCODE COMPLIANCE ------------------------- Per docs/INVIOLABLE-PRINCIPLES.md #4, the namespace is operator-overridable via .Values.catalyst.fluxNamespace. Default is flux-system because that's the canonical Catalyst install namespace (matches cloud-init's flux2 install.yaml + clusters/_template/bootstrap-kit/03-flux.yaml). VERSION ------- - bp-flux 1.1.2 → 1.1.3 (Chart.yaml + blueprint.yaml + 3 bootstrap-kit refs). - The flux2 subchart pin (2.14.1) is unchanged — version-pin replay test remains green (cloud-init v2.4.0 == subchart appVersion 2.4.0). VERIFICATION ------------ - platform/flux/chart/tests/version-pin-replay.sh — all 6 cases PASS. - platform/flux/chart/tests/observability-toggle.sh — all 3 cases PASS. - helm template renders the new ClusterRoleBinding with correct subjects (flux-system by default; verified --set catalyst.fluxNamespace=custom override path). - scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles. FILES ----- - platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml (new) - platform/flux/chart/Chart.yaml (1.1.2 → 1.1.3) - platform/flux/chart/values.yaml (catalyst.fluxNamespace default) - platform/flux/blueprint.yaml (1.1.2 → 1.1.3) - clusters/{_template,otech.omani.works,omantel.omani.works}/bootstrap-kit/03-flux.yaml (chart version) - docs/lessons-learned/helm-controller-rbac.md (permanent-fix note) - docs/omantel-handover-wbs.md (#338 status row) Refs: #43 #369 #338 Lesson: docs/lessons-learned/helm-controller-rbac.md Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> |
||
|
|
1f5c76def1
|
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards
15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic-
guards.spec.ts that fail HARD when each user-flagged defect class
returns:
1. card height drift from canonical 108px
2. reserved right padding eating description width
3. logo tile drift from per-brand LOGO_SURFACE
4. invisible glyph (white-on-white) via luminance proxy
5. wizard step order Org/Topology/Provider/Credentials/Components/
Domain/Review
6. legacy "Choose Your Stack" / "Always Included" tab labels
7. Domain step reachable before Components
8. CPX32 not the recommended Hetzner SKU
9. per-region SKU dropdown shows wrong provider catalog
10. provision page is .html (static) not SPA route
11. legacy bubble/edge DAG SVG markup on provision page
12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
13. AppDetail uses tablist instead of sectioned layout
14. job rows navigate to /job/<id> instead of expand-in-place
15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage
Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.
CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.
Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b0c1c07271 |
fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via `https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`. helm-controller's reconcile of bp-flux ran `helm install` on top of the running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver admission with `status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`; Helm rolled back; the rollback DELETED every running Flux controller Deployment (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller). The cluster lost its GitOps engine — no further HelmRelease could progress, and the only recovery was full `tofu destroy` + reprovision. This is OPTION C of the architectural fix proposed in the incident memo: version-align cloud-init's flux2 install with the bp-flux umbrella chart's `flux2` subchart so a single upstream Flux release is installed and helm-controller adopts it on first reconcile rather than reinstalls on top with a different version. Changes: * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml URL pinned at v2.4.0 (deliberate; this is the source of truth) and added the CRITICAL VERSION-PIN INVARIANT comment block documenting the failure mode. * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from 2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion 2.4.0, matching cloud-init exactly. Bumped chart version 1.1.1 -> 1.1.2. * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream .version` mirror of the dep pin moved from 2.13.0 to 2.14.1. * `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped bp-flux HelmRelease to 1.1.2 + added explicit `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than rolling back on ownership conflict. * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2. * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml` — bumped HelmRelease to 1.1.2. * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case catastrophic-failure replay test: Case 1: Chart.yaml declares the flux2 subchart with explicit version. Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag. Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version (the load-bearing invariant). Case 4: values.yaml metadata mirrors the Chart.yaml dep pin. Case 5: helm template renders cleanly + contains the four core Flux controllers. Case 6: replay test rejects a planted mismatched fake Chart.yaml (the gate's own self-test — proves the gate works). All six cases green locally; the new test joins the existing observability-toggle test in tests/. * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install — version-pin invariant" documenting the failure mode, the four pin-sites, the safe bump procedure, and the existing-Sovereign recovery path (full reprovision). Existing Sovereigns running 1.1.1: no in-place recovery is possible once the rollback has fired. Reprovision required against 1.1.2. Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4 (never hardcode) — the version pins remain operator-bumpable via PR, but BOTH cloud-init's URL AND the chart's subchart MUST move together in the same PR; CI gate tests/version-pin-replay.sh enforces this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1ddd569789 |
fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.
Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.
Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
hubble.metrics.enabled → null (this is the exact value that disables
the upstream metrics ServiceMonitor template branch — verified by
reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
.enabled → false. tests/observability-toggle.sh extended with Case 4
(default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
→ false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
+ metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
→ false (forward-compatibility guard; current upstream
pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
upstream bump cannot silently regress).
Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added
|
||
|
|
43aff20254 |
feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream chart as a dependencies: entry. helm dependency build pulls the upstream payload into the OCI artifact at publish time, so Flux helm install of bp-<name>:1.1.0 actually installs the upstream Helm release alongside the Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer, ExternalSecret) under templates/. Pinned upstream chart versions per platform/<name>/blueprint.yaml: - cilium 1.16.5 https://helm.cilium.io - cert-manager v1.16.2 https://charts.jetstack.io - flux 2.4.0 https://fluxcd-community.github.io/helm-charts - crossplane 1.17.x https://charts.crossplane.io/stable - sealed-secrets 2.16.x https://bitnami-labs.github.io/sealed-secrets - spire ... https://spiffe.github.io/helm-charts-hardened - nats-jetstream ... https://nats-io.github.io/k8s/helm/charts - openbao ... https://openbao.github.io/openbao-helm - keycloak ... https://charts.bitnami.com/bitnami - gitea ... https://dl.gitea.com/charts - catalyst-platform umbrella over the 10 leaf bp-* charts via helm dependency values.yaml in each chart adopts the umbrella convention: catalystBlueprint metadata block (provenance + version) at top level, upstream subchart values namespaced under the dependency name. cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER cert-manager controllers are running and CRDs registered (the previous hollow-chart shape ran the ClusterIssuer at install time when CRDs didn't exist yet, which was the omantel cluster's exact failure mode). Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella conversion is a meaningful structural revision). Cluster manifests in clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/ bootstrap-kit/ updated to reference 1.1.0. The blueprint-release.yaml workflow's helm package step needs an explicit helm dependency build before push so the upstream subchart bytes ship inside the OCI artifact. That CI change is a follow-up commit on this same branch (separate file scope). |
||
|
|
f5daac52af |
refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything k8gb was doing — geo-aware response selection, health-checked failover, weighted round-robin — at the authoritative DNS layer. Eliminates a separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign. Changes: - platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never authored — only README existed) - products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted - componentGroups.ts: remove k8gb component (PowerDNS already there) - componentLogos.tsx: drop logo_k8gb + k8gb map entry - model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns - StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb - provision.html: replace k8gb tile and edges with powerdns - catalog.generated.ts regenerated (now includes bp-powerdns) - docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING- CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY, COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs (cilium, external-dns, failover-controller, litmus, flux, opentofu) rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md. Historical entries in VALIDATION-LOG.md preserved as audit trail. - New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed), Application Placement → lua-record selector mapping, when to add a second Sovereign region, operational checks. Closes #171. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
62d9c7d936 |
fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.
Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.
This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)
Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.
After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
|
||
|
|
441ebaebb8 |
fix(charts): pin upstream chart versions/names to ones that exist in their repos
The first Blueprint Release CI run (commit
|
||
|
|
8c0f76640c |
feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI
Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit |
||
|
|
7cafa3c894 |
docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay
Component-level architectural correction (two changes): 1. MinIO → SeaweedFS as unified S3 encapsulation layer The old design used MinIO for in-cluster S3 plus separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (seaweedfs.storage.svc:8333). SeaweedFS internally handles hot tier (in-cluster NVMe), warm tier (in-cluster bulk), and cold tier (transparent passthrough to cloud archival storage — Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). One audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore — Velero, CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, Harbor blob store, Application buckets all share one S3 surface. 2. Apache Guacamole added as Application Blueprint §4.5 Communication Clientless browser-based RDP/VNC/SSH/kubectl-exec gateway. Keycloak SSO, full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN+native-client distribution for auditable remote access. Component changes: - DELETED: platform/minio/ - CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section) - CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings) Doc updates: PLATFORM-TECH-STACK §1+§3.5+§4.5+§5+§7.4; TECHNOLOGY-FORECAST L11+mandatory+a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5+§7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count. Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero — now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric. UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts — minio entry replaced with seaweedfs; velero+harbor deps updated; new guacamole entry added. VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer. Verification: zero remaining MinIO references in canonical docs (one intentional retention in TECHNOLOGY-FORECAST L37 explaining the swap); 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole. |
||
|
|
76e68e6182 |
docs(pass-36): flux deep-scrutiny + sweep gap-fill (Pass 35 head -10 cutoff)
Pass 35's sweep grep had `head -10` cutoff that produced a false-clean signal. Pass 36 ran the same grep without truncation, finding 6 surviving drift instances: platform/flux/README.md (5 fixes): - Mermaid diagram: Tenant[Tenant Repos] -> Organization[Organization Repos]. - GitRepository url gitea.<domain> -> gitea.<location-code>.<sovereign-domain>. - Bootstrap command --url=https://gitea.<domain>/... -> canonical form. - Key commands `flux reconcile kustomization tenants` -> `organizations` (Pass 34 was uppercase-only and missed lowercase plural). - Gitea Actions example flux-webhook.<domain> -> location-code form. platform/kyverno/README.md (1 fix): - Mermaid subgraph "Tenant Workload" -> "Organization Workload" (the priority class names tenant-high/tenant-default remain — those are deployed K8s PriorityClass objects requiring recreate-not-rename per Pass 9's deferred-migration note). Methodology lesson: convenience shortcuts in validation produce false-clean signals. From Pass 37 forward: drift sweeps use full grep output (no truncation) and case-insensitive banned-term searches. Validation log Pass 36 entry includes detail on each preserved "multi-tenant" generic adjective use that survived (acceptable feature descriptions, not Catalyst entity references). |
||
|
|
b6a374df26 |
docs(pass-15): final banner sweep — 52/52 platform components covered, convergence achieved
Pass 15 swept all 52 platform/*/README.md files for the role-in- Catalyst banner. 3 still lacked one (cnpg, flux, strimzi) and got banners added: - cnpg (§4.1): production Postgres; underlying engine for FerretDB + Gitea metadata. - flux (§3.2): per-vcluster Flux + host-level Flux for Catalyst itself; pulls from single per-Sovereign Gitea. - strimzi (§4.1): Application-tier event streaming; NOT the Catalyst control-plane spine (which uses NATS JetStream). Same upstream- tech-different-tier disambiguation pattern as Valkey. CONVERGENCE: 52 / 52 platform components have role-in-Catalyst banners. All cross-refs resolve. No banned terms. No architectural drift detected on this pass. VALIDATION-LOG: Pass 15 entry + "Convergence achieved (initial banner sweep)" marker added. The validation loop continues per the standing instruction — but subsequent passes will be brief drift-detection sweeps rather than systematic rewrites. Refs #37 |
||
|
|
a5ffa1a716 |
docs(pass-7): align Gitea + Flux multi-region story; fix broken mermaid id
Continuing Pass 7 cleanup after the OpenBao/ESO rewrite (
|
||
|
|
119a1e53a0 |
docs(components): terminology pass across platform and product READMEs
Bring per-component READMEs in line with the canonical glossary
(docs/GLOSSARY.md). Substantive architectural content unchanged —
this is a terminology + reference correctness pass.
Placeholder rename: <tenant> → <org> in YAML / IaC examples across
- platform/cnpg/README.md (Cluster + Pooler + ScheduledBackup)
- platform/debezium/README.md (PostgreSQL connector + topic patterns)
- platform/external-secrets/README.md (ExternalSecret / SecretStore)
- platform/grafana/README.md (Instrumentation namespace)
- platform/k8gb/README.md (Gslb + namespace + kubectl examples)
- platform/keda/README.md (ScaledObject + Kafka triggers + Prometheus)
- platform/opentofu/README.md (server resource example)
- platform/velero/README.md (BackupStorageLocation buckets)
- platform/vpa/README.md (VerticalPodAutoscaler examples)
- platform/flux/README.md (kustomization name + tenants/ → organizations/)
"Catalyst IDP" → "Catalyst console":
- platform/crossplane/README.md (integration section retitled and
rewritten — Crossplane is platform
plumbing, not user-facing)
- platform/gitea/README.md (architecture diagram + integration table)
- platform/kyverno/README.md (rollout tracking surface)
- products/fingate/README.md (TPP onboarding portal)
"Bootstrap wizard" → "Catalyst bootstrap":
- platform/openbao/README.md (bootstrap procedure rewritten —
independent Raft per region clarified;
cross-references docs/SECURITY.md §5)
- platform/opentofu/README.md (Quick Start)
Kyverno labels & prose:
- openova.io/tenant → openova.io/organization (label rename for
consistency; deployed clusters will add new label as a co-label
during migration window)
- "tenant labels" / "tenant namespace" prose updated to
"Organization labels" / "Organization-labeled namespace"
- Priority class names (tenant-high, tenant-default, tenant-batch)
retained as deployed artifact names — rename pending in a
separate migration ticket
No banned-term hits remain in component READMEs (verified by grep
in docs/GLOSSARY.md banned-terms table).
Refs #37
|
||
|
|
435f49738d |
feat: restructure platform to 52 components and 9 products
Technology forecast and strategic review restructure: - Remove 13 components (backstage, mongodb, activemq, vitess, airflow, camel, dapr, superset, searxng, langserve, trino, lago, rabbitmq) - Add 10 components (sigstore, syft-grype, nemo-guardrails, langfuse, reloader, matrix, ferretdb, litmus, livekit, coraza) - Rename product: Synapse → Axon (SaaS LLM Gateway) - Merge products: Titan + Fuse → Fabric (Data & Integration) - New product: Relay (Communication) - Replace Backstage with Catalyst IDP - Replace MongoDB with FerretDB (MongoDB wire protocol on CNPG) - Add supply chain security (Sigstore/Cosign, Syft+Grype) - Add AI safety and observability (NeMo Guardrails, LangFuse) - Add technology forecast 2027-2030 document - Full verification pass: zero stale references across all docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
10245dff98 |
feat: ecosystem expansion to 55 components with license compliance
- Replace BSL-licensed components with open-source alternatives: Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0), Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0) - Add 14 new platform components: activemq, camel, clickhouse, dapr, debezium, falco, flink, iceberg, opensearch, rabbitmq, superset, temporal, trino, vitess - Rename meta-platforms/ to products/ with new product names: Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse), Fuse (Microservices Integration) - Update all documentation, READMEs, and cross-references Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
c9d04a53b4 |
refactor: flatten platform/ structure (41 components)
Remove hierarchical grouping (networking/, security/, etc.) and use flat structure for all 41 platform components. Changes: - All components now directly under platform/ (no subfolders) - AI Hub components moved from meta-platforms/ai-hub/components/ to platform/ - Open Banking components (lago, openmeter) moved to platform/ - meta-platforms/ now only contains README files that reference platform/ - Open Banking custom services remain in meta-platforms/open-banking/services/ Structure: - platform/ (41 components, flat) - meta-platforms/ai-hub/ (README only, references platform/) - meta-platforms/open-banking/ (README + 6 custom services) All documentation links updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> |