openova/clusters/otech.omani.works/bootstrap-kit/07-nats-jetstream.yaml
e3mrah ec3821f7e1
fix(bp-*): event-driven HR install -- drop blanket timeout, use disableWait (#250)
Helm install completes when manifests apply, not when pods reach Ready.
Flux dependsOn checks Ready=True on each HR independently, so
spec.install.disableWait + spec.upgrade.disableWait is the correct
shape for slow-Ready workloads. Blanket spec.timeout: Nm watchdogs from
PR #221 were a band-aid that caused cascading HR failures and blocked
downstream HRs (bp-nats-jetstream, bp-openbao depended on bp-spire).
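
As a rough sketch of the per-HR change described above: the fragment below is illustrative, not copied from any single HR in this commit. The HR name `bp-example` is hypothetical, and the commented-out line stands in for the PR #221 watchdog that is being dropped.

```yaml
# Illustrative HelmRelease fragment (interval, chart, and other required
# fields are omitted for brevity).
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-example          # hypothetical name
  namespace: flux-system
spec:
  # timeout: 15m            # PR #221 watchdog, dropped by this commit
  install:
    disableWait: true       # install completes once manifests are applied
  upgrade:
    disableWait: true       # same behaviour on upgrade
```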

Founder direction (verbatim): "always event driven robust jobs"

Per-HR audit (drop spec.timeout: 15m, add disableWait, with reason):

- bp-cilium:        envoyconfig CRD self-wait — agent crash-loops until
                    its own CRDs land
- bp-cert-manager:  webhook readiness depends on cainjector mutating
                    Secret — multi-minute on cold start
- bp-flux:          adopts cloud-init Flux objects; the helm-controller
                    reconciling THIS HR is itself a chart target — Ready
                    deadlock without disableWait
- bp-sealed-secrets: single-replica controller + CRD — install completes
                    on manifest apply
- bp-spire:         spire-controller-manager waits for CRD informer cache
                    sync — multi-minute legitimate path; chart fix below
- bp-nats-jetstream: JetStream raft quorum formation across N replicas
- bp-openbao:       3-node Raft sealed-by-default; Ready=True only after
                    operator runs `bao operator init` unseal flow
- bp-keycloak:      DB schema migration + 100+ Liquibase changesets on
                    first install
- bp-gitea:         PostgreSQL DB init + admin user + Blueprint catalog
                    mirror seeding
- bp-external-dns:  pod readiness depends on PowerDNS API + pdns-pg CNPG
                    cascade
- bp-catalyst-platform: ~10 services, inter-service NATS/OTel readiness
                    is not Helm's concern

Intentionally NOT touched (other parallel agents own these):
- bp-crossplane (Agent A): chart split for intra-chart CRD-ordering
- bp-powerdns   (Agent D): post-install hook for intra-chart Job-ordering

bp-spire chart fix (1.1.3 -> 1.1.4):

Root cause investigation on otech.omani.works (live):
  spire-controller-manager has restarted 37 times with:
    "failed to wait for clusterstaticentry caches to sync: timed out
     waiting for cache to be synced for Kind *v1alpha1.ClusterStaticEntry"

`kubectl get crd | grep spire` returns nothing — the spire.spiffe.io
v1alpha1 CRDs (ClusterSPIFFEID / ClusterStaticEntry /
ClusterFederatedTrustDomain) are NOT registered. The upstream `spire`
chart does not install its own CRDs; the spiffe maintainers ship them
via the SEPARATE `spire-crds` chart, expected to be installed first.

Fix: platform/spire/chart/Chart.yaml now declares spire-crds 0.5.0 as
the FIRST dependency. Helm renders all subcharts into a single manifest
set and applies it sorted by kind (CustomResourceDefinition ahead of
Deployment), so bundling spire-crds means the CRDs are applied before
the spire subchart's controller-manager Deployment starts. blueprint.yaml
+ both 06-spire.yaml cluster references bumped to 1.1.4.
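
A minimal sketch of what the dependency block in platform/spire/chart/Chart.yaml might look like after this fix. Only the chart name bp-spire, the version 1.1.4, and spire-crds 0.5.0 come from this commit; the repository URL is assumed to be the upstream SPIFFE Helm repository, and the spire subchart version is a placeholder because the commit does not state it.

```yaml
# Sketch only: fields not named in the commit message are assumptions.
apiVersion: v2
name: bp-spire
version: 1.1.4
dependencies:
  - name: spire-crds
    version: 0.5.0
    repository: https://spiffe.github.io/helm-charts-hardened/  # assumed upstream repo
  - name: spire
    version: "<pinned spire chart version>"                     # placeholder, not stated here
    repository: https://spiffe.github.io/helm-charts-hardened/  # assumed upstream repo
```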

Live error this fixes (otech.omani.works, persistent ~5h):
  Helm upgrade failed for release spire-system/spire with chart
  bp-spire@1.1.3: context deadline exceeded
  + downstream cascade: bp-nats-jetstream / bp-openbao stuck at
    "dependency 'flux-system/bp-spire' is not ready"

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:19 +04:00


# bp-nats-jetstream — Catalyst bootstrap-kit Blueprint. Catalyst's control-plane event spine. Per-Org Account isolation. KV bucket per Environment.
#
# Wrapper chart: platform/nats-jetstream/chart/
# Catalyst-curated values: platform/nats-jetstream/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
---
apiVersion: v1
kind: Namespace
metadata:
  name: nats-system
  labels:
    catalyst.openova.io/sovereign: otech.omani.works
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-nats-jetstream
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-nats-jetstream
  namespace: flux-system
spec:
  interval: 15m
  releaseName: nats-jetstream
  targetNamespace: nats-system
  dependsOn:
    - name: bp-spire
  chart:
    spec:
      chart: bp-nats-jetstream
      version: 1.1.1
      sourceRef:
        kind: HelmRepository
        name: bp-nats-jetstream
        namespace: flux-system
  # Event-driven install: NATS StatefulSet with JetStream raft initialisation.
  # Quorum formation across N replicas is legitimately multi-minute on
  # cold start. Helm install completes when manifests apply; downstream
  # dependsOn checks Ready=True independently. Replaces PR #221 timeout.
  install:
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    disableWait: true
    remediation:
      retries: 3