Commit Graph

11 Commits

Author SHA1 Message Date
e3mrah
ab67a48fe7
fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819)
TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for
9 blueprints because their platform/<name>/chart/Chart.yaml version had
been bumped without a matching update to platform/<name>/blueprint.yaml
spec.version. The pre-existing failure forced 7 recent PRs to self-merge
with --admin, masking real CI failures.

Aligned spec.version to match Chart.yaml version on:

  cert-manager   1.1.1 -> 1.1.2
  flux           1.1.3 -> 1.1.4
  crossplane     1.1.3 -> 1.1.4
  sealed-secrets 1.1.1 -> 1.1.2
  spire          1.1.4 -> 1.1.7
  nats-jetstream 1.1.1 -> 1.1.2
  openbao        1.2.0 -> 1.2.14
  keycloak       1.3.1 -> 1.3.2
  gitea          1.2.1 -> 1.2.3

Verified locally:

  $ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1
  --- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s)
      ... all 10 sub-tests pass (cilium + the 9 above)

The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself
the drift guardrail: it fails CI whenever Chart.yaml is bumped without a
matching blueprint.yaml bump. No additional script needed.
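
A minimal sketch of the comparison such a guardrail performs, assuming both
files are simple enough for a line scan to find the version keys. The real
test at tests/e2e/bootstrap-kit/main_test.go is not reproduced here and may
work differently; chartVersion, blueprintSpecVersion and aligned are
hypothetical names, and the sample YAML in main is illustrative only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cleanValue strips whitespace and optional quotes from a scalar value.
func cleanValue(v string) string {
	return strings.Trim(strings.TrimSpace(v), `"'`)
}

// chartVersion returns the top-level `version:` value of a Chart.yaml.
// Naive line scan (assumption): a real check would use a YAML parser.
func chartVersion(chartYAML string) string {
	sc := bufio.NewScanner(strings.NewReader(chartYAML))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "version:") {
			return cleanValue(strings.TrimPrefix(line, "version:"))
		}
	}
	return ""
}

// blueprintSpecVersion returns spec.version from a blueprint.yaml by looking
// for a `version:` key that is a direct child of the top-level `spec:` block.
func blueprintSpecVersion(blueprintYAML string) string {
	inSpec, childIndent := false, -1
	sc := bufio.NewScanner(strings.NewReader(blueprintYAML))
	for sc.Scan() {
		line := sc.Text()
		trimmed := strings.TrimSpace(line)
		if trimmed == "" || strings.HasPrefix(trimmed, "#") {
			continue
		}
		indent := len(line) - len(strings.TrimLeft(line, " "))
		if indent == 0 {
			inSpec, childIndent = strings.HasPrefix(line, "spec:"), -1
			continue
		}
		if !inSpec {
			continue
		}
		if childIndent < 0 {
			childIndent = indent // first key under spec: fixes the child level
		}
		if indent == childIndent && strings.HasPrefix(trimmed, "version:") {
			return cleanValue(strings.TrimPrefix(trimmed, "version:"))
		}
	}
	return ""
}

// aligned reports whether the chart version and spec.version match.
func aligned(chartYAML, blueprintYAML string) bool {
	cv := chartVersion(chartYAML)
	return cv != "" && cv == blueprintSpecVersion(blueprintYAML)
}

func main() {
	chart := "apiVersion: v2\nname: example\nversion: 1.1.2\n"
	blueprint := "kind: Blueprint\nspec:\n  version: 1.1.1\n"
	fmt.Println("aligned:", aligned(chart, blueprint))
}
```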

Closes #817 once verified on main.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-04 22:32:49 +04:00
e3mrah
a359278b7d
fix(bp-spire): disable oidc ClusterSPIFFEID + chart bump (1.1.7) (#645)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).
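
The server-stamped field pattern can be sketched as follows. This is an
illustration under assumptions, not the actual provisioner code: the Request
type is a trimmed-down stand-in, SovereignName and its JSON key are
hypothetical, and decodeAndStamp / tfvars are hypothetical helpers. The only
behavior relied on is that encoding/json ignores fields tagged json:"-".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a hypothetical, trimmed-down stand-in for provisioner.Request.
type Request struct {
	SovereignName string `json:"sovereignName"` // hypothetical client field

	// json:"-" makes encoding/json skip this field in both directions,
	// so a value sent by the client is never decoded into it.
	HandoverJWTPublicKey string `json:"-"`
}

// decodeAndStamp decodes the client payload, then stamps the key from the
// server side. An empty signerPublicJWK models the no-signer path.
func decodeAndStamp(body []byte, signerPublicJWK string) (Request, error) {
	var req Request
	if err := json.Unmarshal(body, &req); err != nil {
		return Request{}, err
	}
	req.HandoverJWTPublicKey = signerPublicJWK
	return req, nil
}

// tfvars renders the variable the way a tofu.auto.tfvars.json entry would
// carry it; the real writeTfvars emits more keys than this.
func tfvars(req Request) ([]byte, error) {
	return json.Marshal(map[string]string{
		"handover_jwt_public_key": req.HandoverJWTPublicKey,
	})
}

func main() {
	body := []byte(`{"sovereignName":"demo","HandoverJWTPublicKey":"from-client"}`)
	req, err := decodeAndStamp(body, "server-side-jwk")
	if err != nil {
		panic(err)
	}
	out, _ := tfvars(req)
	fmt.Println(string(out))
}
```

With an empty signer value the emitted variable is the empty string, which
matches the variables.tf default described above.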

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.
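
A sketch of the routing shape after the fix, written against net/http only
so it stays self-contained (the real service uses chi, which is not shown).
The cookie name, the 204 success status, the sample token and the helper
names are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireSession stands in for the RequireSession middleware: it rejects any
// request that carries no session cookie. The cookie name is hypothetical.
func requireSession(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("session"); err != nil {
			http.Error(w, `{"error":"unauthenticated"}`, http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// putKubeconfig checks the bearer token against a stored SHA-256 hash.
func putKubeconfig(wantHash [32]byte) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		got := sha256.Sum256([]byte(token))
		if subtle.ConstantTimeCompare(got[:], wantHash[:]) != 1 {
			http.Error(w, "bad token", http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusNoContent) // assumed success status
	})
}

// newRouter keeps GET behind the session gate and leaves PUT outside it.
func newRouter(wantHash [32]byte) http.Handler {
	getKubeconfig := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		isKubeconfig := strings.HasPrefix(r.URL.Path, "/api/v1/deployments/") &&
			strings.HasSuffix(r.URL.Path, "/kubeconfig")
		switch {
		case isKubeconfig && r.Method == http.MethodPut:
			putKubeconfig(wantHash).ServeHTTP(w, r) // bearer hash is the only gate
		case isKubeconfig && r.Method == http.MethodGet:
			requireSession(getKubeconfig).ServeHTTP(w, r)
		default:
			http.NotFound(w, r)
		}
	})
}

// status replays one request against the router and returns the HTTP status.
func status(method, bearer string, withCookie bool) int {
	h := newRouter(sha256.Sum256([]byte("cloud-init-token")))
	req := httptest.NewRequest(method, "/api/v1/deployments/d1/kubeconfig", nil)
	if bearer != "" {
		req.Header.Set("Authorization", "Bearer "+bearer)
	}
	if withCookie {
		req.AddCookie(&http.Cookie{Name: "session", Value: "x"})
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("PUT, valid bearer, no cookie:", status(http.MethodPut, "cloud-init-token", false))
	fmt.Println("GET, no cookie:", status(http.MethodGet, "", false))
}
```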

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.
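
The ordering problem can be reproduced in a few lines. This is a sketch with
hypothetical names (createDeployment, createDeploymentBuggy) and a reduced
Request type; it only illustrates why the stamp has to happen before
Validate() runs.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a reduced, hypothetical stand-in for the provisioning request.
type Request struct {
	HarborRobotToken string `json:"-"` // server-stamped, never from the client
}

// Validate rejects a request that carries no Harbor robot token.
func (r Request) Validate() error {
	if r.HarborRobotToken == "" {
		return errors.New("Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)")
	}
	return nil
}

// createDeploymentBuggy models the pre-fix order: validate first, stamp later.
// The stamp is never reached because validation already failed.
func createDeploymentBuggy(req Request, envToken string) error {
	if err := req.Validate(); err != nil {
		return err
	}
	req.HarborRobotToken = envToken
	return nil
}

// createDeployment models the fixed order: stamp from the environment value
// at the top of the handler, then validate.
func createDeployment(req Request, envToken string) error {
	if req.HarborRobotToken == "" {
		req.HarborRobotToken = envToken
	}
	return req.Validate()
}

func main() {
	fmt.Println("buggy order:", createDeploymentBuggy(Request{}, "robot-token"))
	fmt.Println("fixed order:", createDeployment(Request{}, "robot-token"))
}
```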

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name comes AFTER /v2/, as the first segment of the image path,
not as a URL path prefix in front of /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
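
The two URL shapes can be compared with a small sketch. The rewrite rule
used here (match the whole repository path, prefix it with the project) is
an assumption about what the registries.yaml entry looks like; the actual
pattern in cloudinit-control-plane.tftpl is not shown in this message.
Function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rewriteRepo applies a rewrite in the style of a registries.yaml rule: a
// regular expression matched against the repository path plus a replacement.
func rewriteRepo(pattern, replacement, repo string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(repo, replacement)
}

// brokenURL is the shape produced when the project is part of the endpoint.
func brokenURL(host, project, repo, ref string) string {
	return host + "/" + project + "/v2/" + repo + "/manifests/" + ref
}

// fixedURL uses the bare host as endpoint and prefixes the project onto the
// repository path via a rewrite (assumed rule: "^(.*)$" -> "<project>/$1").
func fixedURL(host, project, repo, ref string) string {
	rewritten := rewriteRepo("^(.*)$", project+"/$1", repo)
	return host + "/v2/" + rewritten + "/manifests/" + ref
}

func main() {
	host, project := "https://harbor.openova.io", "proxy-dockerhub"
	fmt.Println(brokenURL(host, project, "rancher/mirrored-pause", "3.6"))
	fmt.Println(fixedURL(host, project, "rancher/mirrored-pause", "3.6"))
}
```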

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).
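
How the doubled prefix arises, as a sketch. imageRef imitates a chart helper
that always prepends the registry, which is how the subchart is described
above; the repository name autoscaling/vpa-example is a hypothetical
placeholder, not the real image name, and stripRegistry is a hypothetical
helper.

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef imitates a chart helper that always prepends .image.registry
// to .image.repository.
func imageRef(registry, repository, tag string) string {
	return registry + "/" + repository + ":" + tag
}

// stripRegistry drops a redundant registry prefix from a repository override.
func stripRegistry(registry, repository string) string {
	return strings.TrimPrefix(repository, registry+"/")
}

func main() {
	registry := "registry.k8s.io"
	override := "registry.k8s.io/autoscaling/vpa-example" // hypothetical name

	fmt.Println("before:", imageRef(registry, override, "1.5.0"))
	fmt.Println("after: ", imageRef(registry, stripRegistry(registry, override), "1.5.0"))
}
```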

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-spire): Chart.lock missing spire-crds → CRDs never installed (chart 1.1.7)

bp-spire 1.1.4 added spire-crds 0.5.0 as a Helm dependency to register
the spire.spiffe.io/v1alpha1 CRDs (ClusterSPIFFEID, ClusterStaticEntry,
ClusterFederatedTrustDomain) before the spire subchart's controller-
manager Deployment starts. But Chart.lock was never regenerated — only
contained the original `spire` entry. As a result every Blueprint
Release packaged the chart WITHOUT spire-crds, the Sovereign saw no
CRDs registered, and Helm install failed with:

  no matches for kind "ClusterSPIFFEID" in version "spire.spiffe.io/v1alpha1"

bp-openbao / bp-external-secrets / bp-nats-jetstream all dependsOn
bp-spire so this single bug cascades and blocks 5+ HRs from reaching
Ready=True. Caught live during otech29.

Fix: ran `helm dependency update` to regenerate Chart.lock + pull both
spire and spire-crds tarballs; bumps bp-spire 1.1.6 -> 1.1.7 and
bootstrap-kit reference.
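
A check of this kind would have caught the stale Chart.lock before release.
The sketch below is not part of this change; it assumes each dependency
entry starts with `- name:` and uses a naive line scan instead of a YAML
parser. dependencyNames and missingFromLock are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// dependencyNames collects `- name: <x>` entries listed under the top-level
// `dependencies:` key of a Chart.yaml or Chart.lock.
func dependencyNames(yamlText string) []string {
	var names []string
	inDeps := false
	sc := bufio.NewScanner(strings.NewReader(yamlText))
	for sc.Scan() {
		line := sc.Text()
		trimmed := strings.TrimSpace(line)
		if trimmed == "" || strings.HasPrefix(trimmed, "#") {
			continue
		}
		// A new top-level key starts at column 0 and is not a list item.
		if line[0] != ' ' && line[0] != '-' {
			inDeps = strings.HasPrefix(line, "dependencies:")
			continue
		}
		if inDeps && strings.HasPrefix(trimmed, "- name:") {
			names = append(names, strings.TrimSpace(strings.TrimPrefix(trimmed, "- name:")))
		}
	}
	return names
}

// missingFromLock lists dependencies declared in Chart.yaml but absent
// from Chart.lock.
func missingFromLock(chartYAML, lockYAML string) []string {
	locked := map[string]bool{}
	for _, n := range dependencyNames(lockYAML) {
		locked[n] = true
	}
	var missing []string
	for _, n := range dependencyNames(chartYAML) {
		if !locked[n] {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	chart := "name: bp-spire\ndependencies:\n  - name: spire-crds\n    version: 0.5.0\n  - name: spire\n"
	lock := "dependencies:\n- name: spire\ndigest: sha256:0\n"
	fmt.Println("missing from Chart.lock:", missingFromLock(chart, lock))
}
```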

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:27:33 +04:00
e3mrah
5796de12bc
fix(bp-spire): re-enable oidc-discovery-provider ClusterSPIFFEID to fix init stuck (Closes #571) (#575)
The oidc-discovery-provider ClusterSPIFFEID was disabled at bootstrap to
work around a CRD-ordering race (spire-controller-manager applying the
template before CRDs were registered). That race was fixed in bp-spire 1.1.4
by listing spire-crds as the first Helm dependency.

With all ClusterSPIFFEIDs still disabled, the oidc-discovery-provider init
container blocks indefinitely with "PermissionDenied: no identity issued" —
the controller-manager never creates the registration entry so no SVID is
issued.

Re-enable oidc-discovery-provider identity. The default, test-keys, and
child-servers identities remain disabled (not needed for bootstrap).

Also carries the global.imageRegistry field added by issue #560 (was 1.1.5
in working tree, now bumped to 1.1.6 for this fix). Bootstrap-kit slot 06
updated from 1.1.4 → 1.1.6.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:00:43 +04:00
e3mrah
ec3821f7e1
fix(bp-*): event-driven HR install -- drop blanket timeout, use disableWait (#250)
Helm install completes when manifests apply, not when pods reach Ready.
Flux dependsOn checks Ready=True on each HR independently, so
spec.install.disableWait + spec.upgrade.disableWait is the correct
shape for slow-Ready workloads. Blanket spec.timeout: Nm watchdogs from
PR #221 were a band-aid that caused cascading HR failures and blocked
downstream HRs (bp-nats-jetstream, bp-openbao depended on bp-spire).

Founder direction (verbatim): "always event driven robust jobs"

Per-HR audit (drop spec.timeout: 15m, add disableWait, with reason):

- bp-cilium:        envoyconfig CRD self-wait — agent crash-loops until
                    its own CRDs land
- bp-cert-manager:  webhook readiness depends on cainjector mutating
                    Secret — multi-minute on cold start
- bp-flux:          adopts cloud-init Flux objects; the helm-controller
                    reconciling THIS HR is itself a chart target — Ready
                    deadlock without disableWait
- bp-sealed-secrets: single-replica controller + CRD — install completes
                    on manifest apply
- bp-spire:         spire-controller-manager waits for CRD informer cache
                    sync — multi-minute legitimate path; chart fix below
- bp-nats-jetstream: JetStream raft quorum formation across N replicas
- bp-openbao:       3-node Raft sealed-by-default; Ready=True only after
                    operator runs `bao operator init` unseal flow
- bp-keycloak:      DB schema migration + 100+ Liquibase changesets on
                    first install
- bp-gitea:         PostgreSQL DB init + admin user + Blueprint catalog
                    mirror seeding
- bp-external-dns:  pod readiness depends on PowerDNS API + pdns-pg CNPG
                    cascade
- bp-catalyst-platform: ~10 services, inter-service NATS/OTel readiness
                    is not Helm's concern

Intentionally NOT touched (other parallel agents own these):
- bp-crossplane (Agent A): chart split for intra-chart CRD-ordering
- bp-powerdns   (Agent D): post-install hook for intra-chart Job-ordering

bp-spire chart fix (1.1.3 -> 1.1.4):

Root cause investigation on otech.omani.works (live):
  spire-controller-manager has restarted 37 times with:
    "failed to wait for clusterstaticentry caches to sync: timed out
     waiting for cache to be synced for Kind *v1alpha1.ClusterStaticEntry"

`kubectl get crd | grep spire` returns nothing — the spire.spiffe.io
v1alpha1 CRDs (ClusterSPIFFEID / ClusterStaticEntry /
ClusterFederatedTrustDomain) are NOT registered. The upstream `spire`
chart does not install its own CRDs; the spiffe maintainers ship them
via the SEPARATE `spire-crds` chart, expected to be installed first.

Fix: platform/spire/chart/Chart.yaml now declares spire-crds 0.5.0 as
the FIRST dependency. Helm installs subcharts in dependency order, so
listing spire-crds first guarantees CRDs are applied before the spire
subchart's controller-manager Deployment starts. blueprint.yaml +
both 06-spire.yaml cluster references bumped to 1.1.4.

Live error this fixes (otech.omani.works, persistent ~5h):
  Helm upgrade failed for release spire-system/spire with chart
  bp-spire@1.1.3: context deadline exceeded
  + downstream cascade: bp-nats-jetstream / bp-openbao stuck at
    "dependency 'flux-system/bp-spire' is not ready"

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:19 +04:00
e3mrah
da87fb38c4
fix(bp-spire): disable ALL default-enabled clusterSPIFFEIDs (default+oidc+test-keys) (#230)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 08:13:41 +02:00
e3mrah
719c3bac35
fix(bp-spire): disable default ClusterSPIFFEID — CRD not observable in time on fresh install (#228)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 07:51:03 +02:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic-
guards.spec.ts that fail HARD when each user-flagged defect class
returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00
hatiyildiz
1ddd569789
fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
  .enabled → false. tests/observability-toggle.sh extended with Case 4
  (default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added bdeb0f54 + the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
43aff20254
feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/
bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
62d9c7d936
fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.

Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.

This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)

Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.
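
The install pattern described for installCilium can be sketched like this.
It is an illustration, not the code in bootstrap.go: the helper names are
hypothetical, the values bytes are a placeholder, and the command is only
built here, never executed. The helm arguments are the ones quoted above.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// repoAddArgs builds the arguments for `helm repo add ... --force-update`.
func repoAddArgs(name, url string) []string {
	return []string{"repo", "add", name, url, "--force-update"}
}

// installArgs builds the arguments for installing an upstream chart with
// values read from stdin (`--values -`).
func installArgs(release, chart, version string) []string {
	return []string{"install", release, chart, "--version", version, "--values", "-"}
}

// installCmd prepares the helm command with the overlay values on stdin.
// The caller decides when to run it.
func installCmd(release, chart, version string, values []byte) *exec.Cmd {
	cmd := exec.Command("helm", installArgs(release, chart, version)...)
	cmd.Stdin = bytes.NewReader(values)
	return cmd
}

func main() {
	fmt.Println(repoAddArgs("cilium", "https://helm.cilium.io"))
	values := []byte("# overlay values from the wrapper's values.yaml go here\n")
	cmd := installCmd("cilium", "cilium/cilium", "1.16.5", values)
	fmt.Println(cmd.Args)
}
```

Propagating the pattern to the other install functions would then mostly be
a matter of passing a different release, chart, and version.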

After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
2026-04-28 12:57:29 +02:00
hatiyildiz
8c0f76640c
feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI
Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit 07b4bcf) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule.

11 charts created with Chart.yaml + values.yaml + blueprint.yaml each:

Network + GitOps:
- platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API
- platform/flux/chart — wraps flux 2.4.0
- platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest

Security:
- platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor
- platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only)
- platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation)

Catalyst control-plane services:
- platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV)
- platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5)
- platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode)
- platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream)

New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55):
- platform/spire/README.md — workload identity Catalyst control plane component
- platform/nats-jetstream/README.md — control-plane event spine
- platform/sealed-secrets/README.md — transient bootstrap-only

Each blueprint.yaml declares:
- catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3)
- visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card)
- manifests.chart: ./chart pointer
- depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends)

.github/workflows/blueprint-release.yaml:
- New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder)
- Triggers on push to main touching platform/*/chart/** or products/*/chart/**
- detect job: emits matrix of changed Blueprint folders via git diff
- build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation
- Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance

Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live.

After this commit lands, the bootstrap-kit installer in commit 07b4bcf has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR.

Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.
2026-04-28 12:51:06 +02:00