openova

Author	SHA1	Message	Date
e3mrah	238c6d2010	fix(bp-flux): mitigate helm-controller leader-election loss + stuck-HR recovery (#925) (#960) * fix(bp-flux): mitigate helm-controller leader-election loss + recovery CronJob (#925) On otech113.omani.works the bp-vpa HelmRelease became stuck Ready=Unknown forever after a transient kube-apiserver blip caused helm-controller to lose its leader-election lease mid-install. The Helm release secret was already committed (Status=deployed) by the previous leader, but its last write to the HR's Ready condition was Unknown and the new leader's "release in storage?" short-circuit never re-evaluates that. The HR blocked bootstrap-kit → sovereign-tls → cilium-gateway, breaking every HTTPRoute on the Sovereign. Fix is two-pronged: 1) PRIMARY (prevent the trigger). Stretch leader-election lease durations on the three Catalyst-critical controllers (helm/kustomize/source) from the upstream defaults of lease=35s renew=30s retry=5s to lease=60s renew=40s retry=5s, and bump memory limits from 256Mi to 512Mi (helm) / 384Mi (kustomize, source) so OOMKills during 35-HR fan-out installs don't themselves trigger leadership handoffs. Costs ~50s extra failover time on a real controller crash; that's acceptable since CP HA is a Phase 2 concern and we'd much rather avoid spurious flips during transient API pressure. 2) RECOVERY (handle the residual case). New CronJob bp-flux-stuck-hr-recovery runs every 2 minutes, scans every HelmRelease cluster-wide, and for each HR stuck in Ready=Unknown for >5 minutes whose underlying Helm release secret already has status=deployed, force-toggles spec.suspend (the only known workaround per #925). Guardrail: refuses to act if more than 10 HRs would be touched in a single run (signals a cluster-wide outage). Operator-disablable via .Values.catalyst.stuckHelmReleaseRecovery.enabled=false. 
Lock-in tests: tests/leader-election-and-recovery.sh covers all three flag/memory bumps, CronJob render, RBAC presence, disable-toggle, and threshold operator override. version-pin-replay + observability-toggle still green. Chart bumped 1.1.4 → 1.2.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-flux): bump blueprint.yaml spec.version to 1.2.0 to match Chart.yaml (#925) The bootstrap-kit static validation gate (Chart.yaml version == blueprint.yaml spec.version) caught the missed bump on PR #960. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 16:05:38 +04:00
e3mrah	ab67a48fe7	fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819) TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for 9 blueprints because their platform/<name>/chart/Chart.yaml version had been bumped without a matching update to platform/<name>/blueprint.yaml spec.version. The pre-existing failure forced 7 recent PRs to self-merge with --admin, masking real CI failures. Aligned spec.version to match Chart.yaml version on: cert-manager 1.1.1 -> 1.1.2 flux 1.1.3 -> 1.1.4 crossplane 1.1.3 -> 1.1.4 sealed-secrets 1.1.1 -> 1.1.2 spire 1.1.4 -> 1.1.7 nats-jetstream 1.1.1 -> 1.1.2 openbao 1.2.0 -> 1.2.14 keycloak 1.3.1 -> 1.3.2 gitea 1.2.1 -> 1.2.3 Verified locally: $ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1 --- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s) ... all 10 sub-tests pass (cilium + the 9 above) The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself the drift guardrail: it fails CI whenever Chart.yaml is bumped without a matching blueprint.yaml bump. No additional script needed. Closes #817 once verified on main. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>	2026-05-04 22:32:49 +04:00
e3mrah	83ec889f06	feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580) Charts bumped: - bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented) - bp-crossplane 1.1.3 -> 1.1.4 (subchart stub) - bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched) - bp-velero 1.2.0 -> 1.2.1 (subchart stub) - bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented) - bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented) - bp-grafana 1.0.0 -> 1.0.1 (subchart stub) - bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented) - bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services) Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache. Closes (partial): issue #560; all 23 bp-* charts now carry global.imageRegistry Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 13:21:53 +04:00
e3mrah	05cb39c042	fix(bp-flux): catalyst-cluster-reconciler ClusterRoleBinding overlay (closes #338) (#393) PROBLEM ------- On Sovereign-1 (otech.omani.works, 2026-04-30) every HelmRelease that transitioned through pending-install/pending-upgrade got stuck because the helm-controller SA could not UPDATE its own helm-storage Secrets (sh.helm.release.v1.<name>.<n>) in flux-system. Symptom: secrets "sh.helm.release.v1.catalyst-platform.v1" is forbidden: User "system:serviceaccount:flux-system:helm-controller" cannot update resource "secrets" in API group "" in the namespace "flux-system" Runtime workaround on otech (added 2026-04-30): manual ClusterRoleBinding flux-system-helm-controller-admin → cluster-admin → flux-system/helm-controller. Tracked as the permanent fix in #338. FIX --- Add platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml, a Catalyst-managed ClusterRoleBinding (catalyst-cluster-reconciler) that binds cluster-admin to helm-controller AND kustomize-controller in .Values.catalyst.fluxNamespace (default flux-system). Independent from the upstream subchart's cluster-reconciler binding (different name, no ownership conflict), so if the upstream binding ever drifts again the overlay still holds the cluster correct. WHY cluster-admin (not narrower) -------------------------------- helm-controller installs arbitrary user-supplied Helm charts which can ship any K8s resource (CRDs, ClusterRoles, MutatingWebhookConfigurations, etc.). There is no narrower role that satisfies the full install path. The Flux project's own bootstrap install.yaml binds cluster-admin for the same reason (upstream default multitenancy.privileged=true). Multi-tenancy lockdown is a Sovereign Day-2 hardening choice tracked separately. NEVER-HARDCODE COMPLIANCE ------------------------- Per docs/INVIOLABLE-PRINCIPLES.md #4, the namespace is operator-overridable via .Values.catalyst.fluxNamespace. 
Default is flux-system because that's the canonical Catalyst install namespace (matches cloud-init's flux2 install.yaml + clusters/_template/bootstrap-kit/03-flux.yaml). VERSION ------- - bp-flux 1.1.2 → 1.1.3 (Chart.yaml + blueprint.yaml + 3 bootstrap-kit refs). - The flux2 subchart pin (2.14.1) is unchanged — version-pin replay test remains green (cloud-init v2.4.0 == subchart appVersion 2.4.0). VERIFICATION ------------ - platform/flux/chart/tests/version-pin-replay.sh — all 6 cases PASS. - platform/flux/chart/tests/observability-toggle.sh — all 3 cases PASS. - helm template renders the new ClusterRoleBinding with correct subjects (flux-system by default; verified --set catalyst.fluxNamespace=custom override path). - scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles. FILES ----- - platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml (new) - platform/flux/chart/Chart.yaml (1.1.2 → 1.1.3) - platform/flux/chart/values.yaml (catalyst.fluxNamespace default) - platform/flux/blueprint.yaml (1.1.2 → 1.1.3) - clusters/{_template,otech.omani.works,omantel.omani.works}/bootstrap-kit/03-flux.yaml (chart version) - docs/lessons-learned/helm-controller-rbac.md (permanent-fix note) - docs/omantel-handover-wbs.md (#338 status row) Refs: #43 #369 #338 Lesson: docs/lessons-learned/helm-controller-rbac.md Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>	2026-05-01 15:56:45 +04:00
e3mrah	1f5c76def1	fix(platform): sync blueprint.yaml versions with Chart.yaml (#199) * feat(ui): Playwright cosmetic + step-flow regression guards 15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts that fail HARD when each user-flagged defect class returns: 1. card height drift from canonical 108px 2. reserved right padding eating description width 3. logo tile drift from per-brand LOGO_SURFACE 4. invisible glyph (white-on-white) via luminance proxy 5. wizard step order Org/Topology/Provider/Credentials/Components/Domain/Review 6. legacy "Choose Your Stack" / "Always Included" tab labels 7. Domain step reachable before Components 8. CPX32 not the recommended Hetzner SKU 9. per-region SKU dropdown shows wrong provider catalog 10. provision page is .html (static) not SPA route 11. legacy bubble/edge DAG SVG markup on provision page 12. admin sidebar drift from canonical core/console (w-56 + 7 labels) 13. AppDetail uses tablist instead of sectioned layout 14. job rows navigate to /job/<id> instead of expand-in-place 15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage Each test prints a failure message naming the canonical reference, the source-of-truth file, and the data-testid PR needed (if any) so the implementing agent has a precise target. No .skip(): per INVIOLABLE-PRINCIPLES #2, missing components fail loud. CI: .github/workflows/cosmetic-guards.yaml runs the suite on every PR that touches products/catalyst/bootstrap/ui/ or core/console/. Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's original complaint, the canonical reference, and the green/red semantics (5 tests intentionally RED on main today; they stay red until the companion-agent's UI work lands). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:07:55 +04:00
hatiyildiz	b0c1c07271	fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction) Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via `https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`. helm-controller's reconcile of bp-flux ran `helm install` on top of the running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver admission with `status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`; Helm rolled back; the rollback DELETED every running Flux controller Deployment (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller). The cluster lost its GitOps engine: no further HelmRelease could progress, and the only recovery was full `tofu destroy` + reprovision. This is OPTION C of the architectural fix proposed in the incident memo: version-align cloud-init's flux2 install with the bp-flux umbrella chart's `flux2` subchart so a single upstream Flux release is installed and helm-controller adopts it on first reconcile rather than reinstalls on top with a different version. Changes: * `infra/hetzner/cloudinit-control-plane.tftpl`: kept the install.yaml URL pinned at v2.4.0 (deliberate; this is the source of truth) and added the CRITICAL VERSION-PIN INVARIANT comment block documenting the failure mode. * `platform/flux/chart/Chart.yaml`: bumped `flux2` subchart dep from 2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion 2.4.0, matching cloud-init exactly. Bumped chart version 1.1.1 -> 1.1.2. * `platform/flux/chart/values.yaml`: `catalystBlueprint.upstream.version` mirror of the dep pin moved from 2.13.0 to 2.14.1. 
* `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped bp-flux HelmRelease to 1.1.2 + added explicit `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than rolling back on ownership conflict. * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2. * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml` — bumped HelmRelease to 1.1.2. * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case catastrophic-failure replay test: Case 1: Chart.yaml declares the flux2 subchart with explicit version. Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag. Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version (the load-bearing invariant). Case 4: values.yaml metadata mirrors the Chart.yaml dep pin. Case 5: helm template renders cleanly + contains the four core Flux controllers. Case 6: replay test rejects a planted mismatched fake Chart.yaml (the gate's own self-test — proves the gate works). All six cases green locally; the new test joins the existing observability-toggle test in tests/. * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install — version-pin invariant" documenting the failure mode, the four pin-sites, the safe bump procedure, and the existing-Sovereign recovery path (full reprovision). Existing Sovereigns running 1.1.1: no in-place recovery is possible once the rollback has fired. Reprovision required against 1.1.2. 
Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4 (never hardcode) — the version pins remain operator-bumpable via PR, but BOTH cloud-init's URL AND the chart's subchart MUST move together in the same PR; CI gate tests/version-pin-replay.sh enforces this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:38:17 +02:00
hatiyildiz	1ddd569789	fix(bp-): observability toggles default false; break circular CRD dependency Extends the v1.1.1 hardening that started with cilium / cert-manager / crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints. Every observability toggle in every Catalyst-curated Blueprint now ships `false`/`null` by default; the operator opts in via a per-cluster values overlay at clusters/<sovereign>/bootstrap-kit/ once bp-kube-prometheus-stack reconciles. Live failure mode that prompted this (omantel.omani.works 2026-04-29): bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor to true. The upstream Cilium 1.16.5 chart renders a monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with kube-prometheus-stack, a tier-2 Application Blueprint that depends on the bootstrap-kit (cilium first). Helm install fails on a fresh Sovereign with "no matches for kind ServiceMonitor in version monitoring.coreos.com/v1; ensure CRDs are installed first" and every downstream HelmRelease reports `dep is not ready`. The earlier trustCRDsExist=true mitigation only suppresses Helm's render-time gate; the apiserver still rejects the resource at install-time. Per-Blueprint changes: - bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false; hubble.metrics.enabled → null (this is the exact value that disables the upstream metrics ServiceMonitor template branch; verified by reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor.enabled → false. tests/observability-toggle.sh extended with Case 4 (default render produces no hubble-relay / hubble-ui Deployments). - bp-flux: flux2.prometheus.podMonitor.create → false. - bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled → false (explicit lock; upstream already defaults false). - bp-spire: spire.global.spire.recommendations.enabled + recommendations.prometheus → false. - bp-nats-jetstream: nats.promExporter.enabled + promExporter.podMonitor.enabled → false. 
- bp-openbao: openbao.injector.metrics.enabled + openbao.serviceMonitor.enabled → false. - bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled + metrics.prometheusRule.enabled → false. - bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.* serviceMonitor + prometheusRule → false. - bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled → false (forward-compatibility guard; current upstream pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future upstream bump cannot silently regress). Each chart ships a tests/observability-toggle.sh that asserts the rule in three cases (default off / explicit on opt-in / explicit off) — runs under blueprint-release.yaml's chart-test gate (added `bdeb0f54` + the existing wiring) before helm push. A regression that re-introduces a hardcoded enabled: true in any chart fails CI before the OCI artifact is published. Versioning: - All 11 leaf charts bumped 1.1.0 → 1.1.1. - products/catalyst/chart (bp-catalyst-platform umbrella) deps updated to 1.1.1 across the board. - clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to 1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror. docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every toggle disabled across all 11 Blueprints. References docs/INVIOLABLE-PRINCIPLES.md #4. GATES (all green): - helm dep build resolves cleanly post-change for every chart whose upstream is published (umbrella waits on per-leaf publish). - helm lint clean on all 11 leaves. - helm template . default render produces zero monitoring.coreos.com references on every leaf (verified locally). - tests/observability-toggle.sh PASS on all 11 leaves. Live verification: with v1.1.1 published the omantel.omani.works HelmRelease can roll forward without a manual values patch — Flux picks up the new chart digest automatically (semver: 1.x in OCIRepository). Refs: issue #182.	2026-04-29 19:23:52 +02:00
hatiyildiz	43aff20254	feat(bp-): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream Each platform/<name>/chart/Chart.yaml now declares the canonical upstream chart as a dependencies: entry. helm dependency build pulls the upstream payload into the OCI artifact at publish time, so Flux helm install of bp-<name>:1.1.0 actually installs the upstream Helm release alongside the Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer, ExternalSecret) under templates/. Pinned upstream chart versions per platform/<name>/blueprint.yaml: - cilium 1.16.5 https://helm.cilium.io - cert-manager v1.16.2 https://charts.jetstack.io - flux 2.4.0 https://fluxcd-community.github.io/helm-charts - crossplane 1.17.x https://charts.crossplane.io/stable - sealed-secrets 2.16.x https://bitnami-labs.github.io/sealed-secrets - spire ... https://spiffe.github.io/helm-charts-hardened - nats-jetstream ... https://nats-io.github.io/k8s/helm/charts - openbao ... https://openbao.github.io/openbao-helm - keycloak ... https://charts.bitnami.com/bitnami - gitea ... https://dl.gitea.com/charts - catalyst-platform umbrella over the 10 leaf bp- charts via helm dependency values.yaml in each chart adopts the umbrella convention: catalystBlueprint metadata block (provenance + version) at top level, upstream subchart values namespaced under the dependency name. cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER cert-manager controllers are running and CRDs registered (the previous hollow-chart shape ran the ClusterIssuer at install time when CRDs didn't exist yet, which was the omantel cluster's exact failure mode). Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella conversion is a meaningful structural revision). Cluster manifests in clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/bootstrap-kit/ updated to reference 1.1.0. 
The blueprint-release.yaml workflow's helm package step needs an explicit helm dependency build before push so the upstream subchart bytes ship inside the OCI artifact. That CI change is a follow-up commit on this same branch (separate file scope).	2026-04-29 17:21:36 +02:00
hatiyildiz	f5daac52af	refactor(platform): remove k8gb, replaced by PowerDNS lua-records (#171) PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything k8gb was doing (geo-aware response selection, health-checked failover, weighted round-robin) at the authoritative DNS layer. Eliminates a separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign. Changes: - platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never authored; only README existed) - products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted - componentGroups.ts: remove k8gb component (PowerDNS already there) - componentLogos.tsx: drop logo_k8gb + k8gb map entry - model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns - StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb - provision.html: replace k8gb tile and edges with powerdns - catalog.generated.ts regenerated (now includes bp-powerdns) - docs sweep: every k8gb reference in PLATFORM-TECH-STACK, NAMING-CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY, COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs (cilium, external-dns, failover-controller, litmus, flux, opentofu) rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md. Historical entries in VALIDATION-LOG.md preserved as audit trail. - New docs/MULTI-REGION-DNS.md: canonical reference for the lua-record patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed), Application Placement → lua-record selector mapping, when to add a second Sovereign region, operational checks. Closes #171. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:51:09 +02:00
hatiyildiz	62d9c7d936	fix(charts): drop dependencies block — wrappers carry values overlay only The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks. Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values. This keeps: - blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd) - the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork) - the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>) Changes: - 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package. - 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values. - products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up. After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). 
The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.	2026-04-28 12:57:29 +02:00
hatiyildiz	441ebaebb8	fix(charts): pin upstream chart versions/names to ones that exist in their repos The first Blueprint Release CI run (commit `8c0f766`) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories: - platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0. - platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally). - platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1. - platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed. Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability. Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions. After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.	2026-04-28 12:55:21 +02:00
hatiyildiz	8c0f76640c	feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit `07b4bcf`) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule. 11 charts created with Chart.yaml + values.yaml + blueprint.yaml each: Network + GitOps: - platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API - platform/flux/chart — wraps flux 2.4.0 - platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest Security: - platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor - platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only) - platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation) Catalyst control-plane services: - platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV) - platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5) - platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode) - platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream) New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55): - platform/spire/README.md — workload identity Catalyst control plane component - platform/nats-jetstream/README.md — control-plane event spine - platform/sealed-secrets/README.md — transient bootstrap-only Each blueprint.yaml declares: - catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3) - visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card) - manifests.chart: ./chart pointer - depends: [] 
(foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends) .github/workflows/blueprint-release.yaml: - New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder) - Triggers on push to main touching platform//chart/* or products//chart/* - detect job: emits matrix of changed Blueprint folders via git diff - build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation - Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live. After this commit lands, the bootstrap-kit installer in commit `07b4bcf` has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR. Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.	2026-04-28 12:51:06 +02:00
hatiyildiz	7cafa3c894	docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay Component-level architectural correction (two changes): 1. MinIO → SeaweedFS as unified S3 encapsulation layer The old design used MinIO for in-cluster S3 plus separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (seaweedfs.storage.svc:8333). SeaweedFS internally handles hot tier (in-cluster NVMe), warm tier (in-cluster bulk), and cold tier (transparent passthrough to cloud archival storage — Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). One audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore — Velero, CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, Harbor blob store, Application buckets all share one S3 surface. 2. Apache Guacamole added as Application Blueprint §4.5 Communication Clientless browser-based RDP/VNC/SSH/kubectl-exec gateway. Keycloak SSO, full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN+native-client distribution for auditable remote access. 
Component changes: - DELETED: platform/minio/ - CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section) - CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings) Doc updates: PLATFORM-TECH-STACK §1+§3.5+§4.5+§5+§7.4; TECHNOLOGY-FORECAST L11+mandatory+a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5+§7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count. Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero — now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric. UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts — minio entry replaced with seaweedfs; velero+harbor deps updated; new guacamole entry added. VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer. Verification: zero remaining MinIO references in canonical docs (one intentional retention in TECHNOLOGY-FORECAST L37 explaining the swap); 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole.	2026-04-28 10:23:46 +02:00
hatiyildiz	76e68e6182	docs(pass-36): flux deep-scrutiny + sweep gap-fill (Pass 35 head -10 cutoff) Pass 35's sweep grep had `head -10` cutoff that produced a false-clean signal. Pass 36 ran the same grep without truncation, finding 6 surviving drift instances: platform/flux/README.md (5 fixes): - Mermaid diagram: Tenant[Tenant Repos] -> Organization[Organization Repos]. - GitRepository url gitea.<domain> -> gitea.<location-code>.<sovereign-domain>. - Bootstrap command --url=https://gitea.<domain>/... -> canonical form. - Key commands `flux reconcile kustomization tenants` -> `organizations` (Pass 34 was uppercase-only and missed lowercase plural). - Gitea Actions example flux-webhook.<domain> -> location-code form. platform/kyverno/README.md (1 fix): - Mermaid subgraph "Tenant Workload" -> "Organization Workload" (the priority class names tenant-high/tenant-default remain — those are deployed K8s PriorityClass objects requiring recreate-not-rename per Pass 9's deferred-migration note). Methodology lesson: convenience shortcuts in validation produce false-clean signals. From Pass 37 forward: drift sweeps use full grep output (no truncation) and case-insensitive banned-term searches. Validation log Pass 36 entry includes detail on each preserved "multi-tenant" generic adjective use that survived (acceptable feature descriptions, not Catalyst entity references).	2026-04-27 22:49:05 +02:00
hatiyildiz	b6a374df26	docs(pass-15): final banner sweep — 52/52 platform components covered, convergence achieved Pass 15 swept all 52 platform/*/README.md files for the role-in-Catalyst banner. 3 still lacked one (cnpg, flux, strimzi) and got banners added: - cnpg (§4.1): production Postgres; underlying engine for FerretDB + Gitea metadata. - flux (§3.2): per-vcluster Flux + host-level Flux for Catalyst itself; pulls from single per-Sovereign Gitea. - strimzi (§4.1): Application-tier event streaming; NOT the Catalyst control-plane spine (which uses NATS JetStream). Same upstream-tech-different-tier disambiguation pattern as Valkey. CONVERGENCE: 52 / 52 platform components have role-in-Catalyst banners. All cross-refs resolve. No banned terms. No architectural drift detected on this pass. VALIDATION-LOG: Pass 15 entry + "Convergence achieved (initial banner sweep)" marker added. The validation loop continues per the standing instruction — but subsequent passes will be brief drift-detection sweeps rather than systematic rewrites. Refs #37	2026-04-27 21:53:27 +02:00
hatiyildiz	a5ffa1a716	docs(pass-7): align Gitea + Flux multi-region story; fix broken mermaid id Continuing Pass 7 cleanup after the OpenBao/ESO rewrite (`42aeb62`). Gitea README: - Was describing "Bidirectional mirroring for multi-region" with two Gitea instances mirroring repos cross-region. Wrong: Catalyst's agreed model has one Gitea per Sovereign on the management cluster (PLATFORM-TECH-STACK §2.3). Replaced the multi-region mirror diagram with a single-Gitea + intra-cluster HA topology and added a "Why not cross-region bidirectional mirror" explainer (write-conflict semantics would break EnvironmentPolicy enforcement). - Status banner: notes the canonical references. - Backup section: removed "Repository mirror for redundancy" (replaced with Velero scheduled backups). Flux README: - "Multi-Region GitOps" section was showing one Gitea per region with bidirectional mirror. Replaced with one Gitea per Sovereign topology. Per-vcluster Flux pulls from this single Gitea. Mermaid syntax bug: - Earlier mass replace_all of "Catalyst IDP" → "Catalyst console" had left an invalid mermaid node identifier `Catalyst console[Catalyst console]` (mermaid forbids spaces in node IDs). Fixed to `Console[Catalyst console]`. Would have rendered as a broken diagram on GitHub. VALIDATION-LOG: Pass 7 entry added documenting the OpenBao/ESO active-active rewrite (the most consequential drift fix in any pass). Refs #37	2026-04-27 21:36:20 +02:00
hatiyildiz	119a1e53a0	docs(components): terminology pass across platform and product READMEs Bring per-component READMEs in line with the canonical glossary (docs/GLOSSARY.md). Substantive architectural content unchanged — this is a terminology + reference correctness pass. Placeholder rename: <tenant> → <org> in YAML / IaC examples across - platform/cnpg/README.md (Cluster + Pooler + ScheduledBackup) - platform/debezium/README.md (PostgreSQL connector + topic patterns) - platform/external-secrets/README.md (ExternalSecret / SecretStore) - platform/grafana/README.md (Instrumentation namespace) - platform/k8gb/README.md (Gslb + namespace + kubectl examples) - platform/keda/README.md (ScaledObject + Kafka triggers + Prometheus) - platform/opentofu/README.md (server resource example) - platform/velero/README.md (BackupStorageLocation buckets) - platform/vpa/README.md (VerticalPodAutoscaler examples) - platform/flux/README.md (kustomization name + tenants/ → organizations/) "Catalyst IDP" → "Catalyst console": - platform/crossplane/README.md (integration section retitled and rewritten — Crossplane is platform plumbing, not user-facing) - platform/gitea/README.md (architecture diagram + integration table) - platform/kyverno/README.md (rollout tracking surface) - products/fingate/README.md (TPP onboarding portal) "Bootstrap wizard" → "Catalyst bootstrap": - platform/openbao/README.md (bootstrap procedure rewritten — independent Raft per region clarified; cross-references docs/SECURITY.md §5) - platform/opentofu/README.md (Quick Start) Kyverno labels & prose: - openova.io/tenant → openova.io/organization (label rename for consistency; deployed clusters will add new label as a co-label during migration window) - "tenant labels" / "tenant namespace" prose updated to "Organization labels" / "Organization-labeled namespace" - Priority class names (tenant-high, tenant-default, tenant-batch) retained as deployed artifact names — rename pending in a separate migration ticket No banned-term hits remain in component READMEs (verified by grep in docs/GLOSSARY.md banned-terms table). Refs #37	2026-04-27 20:06:51 +02:00
talent-mesh	435f49738d	feat: restructure platform to 52 components and 9 products Technology forecast and strategic review restructure: - Remove 13 components (backstage, mongodb, activemq, vitess, airflow, camel, dapr, superset, searxng, langserve, trino, lago, rabbitmq) - Add 10 components (sigstore, syft-grype, nemo-guardrails, langfuse, reloader, matrix, ferretdb, litmus, livekit, coraza) - Rename product: Synapse → Axon (SaaS LLM Gateway) - Merge products: Titan + Fuse → Fabric (Data & Integration) - New product: Relay (Communication) - Replace Backstage with Catalyst IDP - Replace MongoDB with FerretDB (MongoDB wire protocol on CNPG) - Add supply chain security (Sigstore/Cosign, Syft+Grype) - Add AI safety and observability (NeMo Guardrails, LangFuse) - Add technology forecast 2027-2030 document - Full verification pass: zero stale references across all docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:00:19 +00:00
talent-mesh	10245dff98	feat: ecosystem expansion to 55 components with license compliance - Replace BSL-licensed components with open-source alternatives: Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0), Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0) - Add 14 new platform components: activemq, camel, clickhouse, dapr, debezium, falco, flink, iceberg, opensearch, rabbitmq, superset, temporal, trino, vitess - Rename meta-platforms/ to products/ with new product names: Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse), Fuse (Microservices Integration) - Update all documentation, READMEs, and cross-references Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 18:15:11 +00:00
talent-mesh	c9d04a53b4	refactor: flatten platform/ structure (41 components) Remove hierarchical grouping (networking/, security/, etc.) and use flat structure for all 41 platform components. Changes: - All components now directly under platform/ (no subfolders) - AI Hub components moved from meta-platforms/ai-hub/components/ to platform/ - Open Banking components (lago, openmeter) moved to platform/ - meta-platforms/ now only contains README files that reference platform/ - Open Banking custom services remain in meta-platforms/open-banking/services/ Structure: - platform/ (41 components, flat) - meta-platforms/ai-hub/ (README only, references platform/) - meta-platforms/open-banking/ (README + 6 custom services) All documentation links updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:19:48 +00:00

20 Commits