openova/clusters/otech.omani.works/bootstrap-kit/01-cilium.yaml
e3mrah ec3821f7e1
fix(bp-*): event-driven HR install -- drop blanket timeout, use disableWait (#250)
Helm install completes when manifests apply, not when pods reach Ready.
Flux dependsOn checks Ready=True on each HR independently, so
spec.install.disableWait + spec.upgrade.disableWait is the correct
shape for slow-Ready workloads. Blanket spec.timeout: Nm watchdogs from
PR #221 were a band-aid that caused cascading HR failures and blocked
downstream HRs (bp-nats-jetstream, bp-openbao depended on bp-spire).
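
A minimal sketch of the shape described above, as it might look on one of the blocked downstream HRs. The names `bp-nats-jetstream`, `bp-spire` and `flux-system` come from this commit message. The layout is an illustrative fragment, not the file changed here, and the required `chart` and `interval` fields are omitted for brevity:

```yaml
# Illustrative fragment only; not a complete HelmRelease.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-nats-jetstream
  namespace: flux-system
spec:
  # Flux holds this HR back until bp-spire reports Ready=True.
  dependsOn:
    - name: bp-spire
      namespace: flux-system
  # Let Helm return once manifests are applied instead of waiting for pods.
  install:
    disableWait: true
  upgrade:
    disableWait: true
```

With this shape, a slow-Ready workload no longer fails the Helm action on a deadline; ordering between HRs is still enforced by `dependsOn`.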

Founder direction (verbatim): "always event driven robust jobs"

Per-HR audit (drop spec.timeout: 15m, add disableWait, with reason):

- bp-cilium:        envoyconfig CRD self-wait — agent crash-loops until
                    its own CRDs land
- bp-cert-manager:  webhook readiness depends on cainjector mutating
                    Secret — multi-minute on cold start
- bp-flux:          adopts cloud-init Flux objects; the helm-controller
                    reconciling THIS HR is itself a chart target — Ready
                    deadlock without disableWait
- bp-sealed-secrets: single-replica controller + CRD — install completes
                    on manifest apply
- bp-spire:         spire-controller-manager waits for CRD informer cache
                    sync — multi-minute legitimate path; chart fix below
- bp-nats-jetstream: JetStream raft quorum formation across N replicas
- bp-openbao:       3-node Raft sealed-by-default; Ready=True only after
                    the operator runs the `bao operator init` + unseal flow
- bp-keycloak:      DB schema migration + 100+ Liquibase changesets on
                    first install
- bp-gitea:         PostgreSQL DB init + admin user + Blueprint catalog
                    mirror seeding
- bp-external-dns:  pod readiness depends on PowerDNS API + pdns-pg CNPG
                    cascade
- bp-catalyst-platform: ~10 services, inter-service NATS/OTel readiness
                    is not Helm's concern

Intentionally NOT touched (other parallel agents own these):
- bp-crossplane (Agent A): chart split for intra-chart CRD-ordering
- bp-powerdns   (Agent D): post-install hook for intra-chart Job-ordering

bp-spire chart fix (1.1.3 -> 1.1.4):

Root cause investigation on otech.omani.works (live):
  spire-controller-manager has restarted 37 times with:
    "failed to wait for clusterstaticentry caches to sync: timed out
     waiting for cache to be synced for Kind *v1alpha1.ClusterStaticEntry"

`kubectl get crd | grep spire` returns nothing — the spire.spiffe.io
v1alpha1 CRDs (ClusterSPIFFEID / ClusterStaticEntry /
ClusterFederatedTrustDomain) are NOT registered. The upstream `spire`
chart does not install its own CRDs; the spiffe maintainers ship them
via the SEPARATE `spire-crds` chart, expected to be installed first.

Fix: platform/spire/chart/Chart.yaml now declares spire-crds 0.5.0 as
the FIRST dependency. Helm aggregates a chart's resources with those of
its subcharts and applies them sorted by kind (CustomResourceDefinition
before Deployment), so bundling spire-crds in the same release gets the
CRDs applied before the spire subchart's controller-manager Deployment
starts. blueprint.yaml + both 06-spire.yaml cluster references bumped
to 1.1.4.
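
A rough sketch of what the dependency block in platform/spire/chart/Chart.yaml could look like after this change. Only the chart name `bp-spire`, the version `1.1.4` and `spire-crds 0.5.0` are taken from this commit message. The repository URL and the `spire` version placeholder are assumptions and must be checked against the real file:

```yaml
# Hypothetical reconstruction; not copied from the repository.
apiVersion: v2
name: bp-spire
version: 1.1.4
dependencies:
  # Listed first, per the commit message. Ships ClusterSPIFFEID,
  # ClusterStaticEntry and ClusterFederatedTrustDomain.
  - name: spire-crds
    version: 0.5.0
    repository: https://spiffe.github.io/helm-charts-hardened/  # assumed URL
  - name: spire
    version: "<pinned upstream version>"  # not stated in the commit message
    repository: https://spiffe.github.io/helm-charts-hardened/  # assumed URL
```

After such a change, `kubectl get crd | grep spire` should list the spire.spiffe.io CRDs once the release has been applied.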

Live error this fixes (otech.omani.works, persistent ~5h):
  Helm upgrade failed for release spire-system/spire with chart
  bp-spire@1.1.3: context deadline exceeded
  + downstream cascade: bp-nats-jetstream / bp-openbao stuck at
    "dependency 'flux-system/bp-spire' is not ready"

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:19 +04:00


# bp-cilium — Catalyst bootstrap-kit Blueprint. CNI must come first; k3s started with --flannel-backend=none precisely so Cilium can take over.
#
# Wrapper chart: platform/cilium/chart/
# Catalyst-curated values: platform/cilium/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
---
# kube-system is built into every Kubernetes cluster — never re-declare it.
# Earlier revisions of 01-cilium.yaml AND 05-sealed-secrets.yaml both
# declared it, which collided when kustomize tried to merge the two:
# "may not add resource with an already registered id:
# Namespace.v1.[noGrp]/kube-system.[noNs]"
# This Blueprint installs Cilium INTO kube-system; the HelmRelease's
# targetNamespace field below is sufficient.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-cilium
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-cilium
  namespace: flux-system
spec:
  interval: 15m
  releaseName: cilium
  targetNamespace: kube-system
  chart:
    spec:
      chart: bp-cilium
      version: 1.1.1
      sourceRef:
        kind: HelmRepository
        name: bp-cilium
        namespace: flux-system
  # Event-driven install: Helm completes when manifests apply, not when
  # cilium-agent reaches Ready (agent waits for envoyconfig CRDs that the
  # SAME chart installs; legitimate slow-Ready). Replaces blanket
  # spec.timeout: 15m band-aid from PR #221.
  install:
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    disableWait: true
    remediation:
      retries: 3
  values:
    cilium:
      # Enable L7 proxy so Cilium's chart installs the
      # ciliumenvoyconfigs / ciliumclusterwideenvoyconfigs CRDs that the
      # cilium-agent waits for at startup. Without this, agent crash-loops
      # forever and the node.cilium.io/agent-not-ready taint never lifts.
      l7Proxy: true
      prometheus:
        enabled: false
        serviceMonitor:
          enabled: false
      hubble:
        metrics:
          enabled: null
          serviceMonitor:
            enabled: false
        relay:
          enabled: false
        ui:
          enabled: false