Commit Graph

25 Commits

Author SHA1 Message Date
e3mrah
aaaaadf8bc
feat(openova-flow): server (HTTP+SSE event router) + flux adapter (K8s informer sidecar) (#1390)
Agent #2 of 3 for OpenovaFlow. Ships the Go backend independently of
Agent #1's TS packages (@openova/flow-core + @openova/flow-canvas);
the FlowMessage JSON contract is locked between agents.

Two Go modules (separate go.mod each so the dep graphs stay decoupled):

- products/openova-flow/server/ — stateless HTTP+SSE event router.
  Map<flowId, RingBuffer<FlowMessage>>, in-memory, no DB. Endpoints:
  POST /v1/flows/{flowId}/events, GET /v1/flows/{flowId}/snapshot,
  GET /v1/flows/{flowId}/stream (SSE with 15s heartbeats + Last-Event-ID
  seq stamping), DELETE /v1/flows/{flowId}, GET /healthz, /readyz.
  Zero external Go deps (stdlib net/http). Ring cap default 4096
  (env-overridable). Locked schema validation rejects unknown envelope
  variants with 400.

- products/openova-flow/adapter-flux/ — DaemonSet sidecar that watches
  helm.toolkit.fluxcd.io/v2.HelmRelease + HelmChart CRs via
  client-go's dynamicinformer.NewFilteredDynamicSharedInformerFactory
  (canonical seam: products/catalyst/bootstrap/api/internal/k8scache/factory.go),
  maps each event to FlowMessage via a pure-transform mapper, POSTs to
  the configured openova-flow-server with exponential-backoff retry.
  Status mapping: Ready=True → succeeded, InstallFailed/UpgradeFailed/
  RetriesExhausted → failed, Progressing/Unknown/other-False → running,
  no Ready yet → pending. FlowNode.id format "{REGION_KEY}/{hrName}"
  so multi-region renders correctly. Region-aware: synthetic region
  parent FlowNode emitted on bootstrap; dependsOn entries fan-out to
  finish-to-start relationships.
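
The status mapping above can be sketched as a pure function. This is a
minimal illustration, not the adapter's actual mapper: the Condition type
and the flowStatus name are assumptions made for the example, and only the
four target statuses and the listed reasons come from the commit message.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for a Kubernetes
// status condition; the adapter's real types are not shown in the commit.
type Condition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False" or "Unknown"
	Reason string // e.g. "InstallFailed"
}

// flowStatus applies the Ready-condition mapping listed in the commit message.
func flowStatus(conds []Condition) string {
	for _, c := range conds {
		if c.Type != "Ready" {
			continue
		}
		if c.Status == "True" {
			return "succeeded"
		}
		switch c.Reason {
		case "InstallFailed", "UpgradeFailed", "RetriesExhausted":
			return "failed"
		}
		// Progressing, Unknown, and any other Ready=False reason.
		return "running"
	}
	return "pending" // no Ready condition reported yet
}

func main() {
	fmt.Println(flowStatus(nil))
	fmt.Println(flowStatus([]Condition{{Type: "Ready", Status: "False", Reason: "InstallFailed"}}))
	fmt.Println(flowStatus([]Condition{{Type: "Ready", Status: "Unknown", Reason: "Progressing"}}))
}
```

Keeping the mapping free of I/O is what lets a table-driven test cover
every transition without a cluster.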

Two wrapper charts under platform/openova-flow-{server,emitter}/chart/
(canonical seam: platform/qa-app/chart/ for the simple
Deployment+Service+SA shape; platform/k8s-ws-proxy/chart/ for the
DaemonSet+ClusterRole+ClusterRoleBinding shape). MIRROR-EVERYTHING:
image refs go through harbor.openova.io/proxy-ghcr/openova-io/...
Image tag + required runtime config fail-fast at chart render via
_helpers.tpl so silent ImagePullBackOff / boot crash is impossible.

Two bootstrap-kit HRs added (slots 56 + 57):
- 56-bp-openova-flow-server (dependsOn: bp-cilium, bp-cert-manager) —
  installs on primary cluster only; Cilium Gateway HTTPRoute at
  openova-flow.<sovereignFQDN> for cross-cluster ingest.
- 57-bp-openova-flow-emitter (dependsOn: bp-flux) — DaemonSet, runs
  on every cluster (mother + Sovereign + every secondary region).

scripts/expected-bootstrap-deps.yaml updated; check-bootstrap-deps.sh
audit passes (drift=0, cycles=0).

Tests (all green):
- server contract_test.go — every FlowMessage variant round-trips JSON,
  unknown/malformed variants reject. Cross-flow Triggerer/ToFlowID
  preserved.
- server server_test.go — full HTTP surface, including SSE replay+tail
  with a real httptest.Server.
- adapter mapper_test.go — every HelmRelease.status.conditions[Ready]
  transition + multi-dependsOn fan-out + family-label/heuristic + region
  fallback.
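
The SSE replay+tail behaviour exercised by server_test.go can be pictured
with a small sequence-stamped ring buffer. This is a sketch under stated
assumptions: the Event and Ring types and the Since method are hypothetical
names, the payload is a plain string rather than a FlowMessage, and no
locking is shown (a real server would need a mutex around both methods).
The capacity is assumed to be greater than zero.

```go
package main

import "fmt"

// Event pairs a payload with the sequence number used as the SSE event id.
type Event struct {
	Seq     uint64
	Payload string
}

// Ring keeps the most recent len(buf) events; older ones are overwritten.
type Ring struct {
	buf  []Event
	next uint64 // sequence number assigned to the next appended event
}

func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]Event, capacity), next: 1}
}

// Append stamps the payload with the next sequence number and stores it.
func (r *Ring) Append(payload string) Event {
	e := Event{Seq: r.next, Payload: payload}
	r.buf[(r.next-1)%uint64(len(r.buf))] = e
	r.next++
	return e
}

// Since returns the retained events with Seq > lastSeq, oldest first.
// A reconnecting client would pass the value of its Last-Event-ID header.
func (r *Ring) Since(lastSeq uint64) []Event {
	n := uint64(len(r.buf))
	oldest := uint64(1)
	if r.next-1 > n {
		oldest = r.next - n
	}
	start := lastSeq + 1
	if start < oldest {
		start = oldest // the client missed events that were already evicted
	}
	var out []Event
	for s := start; s < r.next; s++ {
		out = append(out, r.buf[(s-1)%n])
	}
	return out
}

func main() {
	r := NewRing(3) // the commit's default capacity is 4096
	for _, p := range []string{"a", "b", "c", "d", "e"} {
		r.Append(p)
	}
	for _, e := range r.Since(3) {
		fmt.Printf("id: %d data: %s\n", e.Seq, e.Payload)
	}
}
```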

Verification done locally:
- (cd products/openova-flow/server && go build ./... && go test ./...) — PASS
- (cd products/openova-flow/adapter-flux && go build ./... && go test ./...) — PASS
- helm template platform/openova-flow-server/chart/ — renders cleanly
- helm template platform/openova-flow-emitter/chart/ — renders cleanly
- bash scripts/check-bootstrap-deps.sh — PASS (drift=0)

Agent #3 follow-ups (called out in slot 57's HelmRelease comments):
- Thread SOVEREIGN_DEPLOYMENT_ID + REGION_KEY into the
  postBuild.substitute env in infra/hetzner/cloudinit-control-plane.tftpl
  so the emitter's flowId/regionKey become per-deployment + per-region
  automatically. Today the slot uses SOVEREIGN_FQDN as the flowId
  fallback and "primary" as the regionKey default; per-Sovereign overlays
  can override pre-Agent-#3.
- catalyst-api proxy at /sovereign/api/v1/flows/{id}/stream so the
  Sovereign Console canvas hits a single in-tree origin.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:36:54 +04:00
e3mrah
b5181ec5d6
fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184) (#1388)
* fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184)

Raise the catalyst-gitea-token-mint pre-install hook's Gitea-API wait
loop from a hardcoded 60x5s (300s = 5m) budget to a values-driven knob
(giteaWait.iterations x giteaWait.intervalSeconds, default 168x5 =
840s = 14m). Pairs with HR install.timeout=15m to leave 60s slack for
the rest of the umbrella install action.

Root-cause trace (4-layer) on prov #33 (multi-region fsn1+hel1, cpx42
workerCount=0+autoscaler):

  bp-catalyst-platform HR (15m HR-timeout)
    -> Helm pre-install hook Job: catalyst-gitea-token-mint
         -> pod runs alpine/k8s curl loop:
              while ! curl gitea-http.gitea.svc.cluster.local; do
                sleep 5; i=$((i+1))
              done
         -> Hook gave up at iter 60 (= 5 min wall-time)
         -> Meanwhile gitea Pod is Pending: autoscaler-hcloud still
            scaling up workers in fsn1/hel1 (Fix #157 sizing default
            workerCount=0 means cold start).
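
The bounded wait in the trace above is a shell loop in the chart; the same
shape is sketched here in Go so the budget logic can be tested in isolation.
The waitFor name and the probe callback are assumptions made for the
example, standing in for the curl check against the Gitea service.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errBudgetExhausted = errors.New("wait budget exhausted")

// waitFor calls probe up to iterations times, sleeping interval after each
// failed attempt. It returns the 1-based attempt number that succeeded.
func waitFor(probe func() bool, iterations int, interval time.Duration) (int, error) {
	for i := 1; i <= iterations; i++ {
		if probe() {
			return i, nil
		}
		time.Sleep(interval)
	}
	return 0, errBudgetExhausted
}

func main() {
	calls := 0
	// Stand-in probe: reports ready on the third attempt.
	probe := func() bool { calls++; return calls >= 3 }

	n, err := waitFor(probe, 168, 10*time.Millisecond)
	fmt.Println(n, err)
}
```

With iterations=60 and a 5s interval, the loop gives up after 5 minutes
regardless of how close the target is to becoming ready, which matches the
failure described in the trace.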

Budget arithmetic (post-Fix #184 default):
  hook_wait_time = iterations x intervalSeconds = 168 x 5 = 840s (14 min)
  HR install.timeout =                                       900s (15 min)
  slack within HR budget =                                    60s ( 1 min)

The hook MUST complete strictly before HR remediates; the 60s slack
absorbs regular release resources rolling + post-install hooks after
the pre-install Job.
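
The budget arithmetic can be checked mechanically. The helper names below
are hypothetical; the numbers (168 x 5, 900s, and the 240 x 10 override from
the verification notes) come from this commit message.

```go
package main

import "fmt"

// hookWaitSeconds is the worst-case time the hook spends waiting.
func hookWaitSeconds(iterations, intervalSeconds int) int {
	return iterations * intervalSeconds
}

// slackSeconds is what remains of the HR timeout after the hook's worst case.
func slackSeconds(hrTimeoutSeconds, iterations, intervalSeconds int) int {
	return hrTimeoutSeconds - hookWaitSeconds(iterations, intervalSeconds)
}

// fitsBudget reports whether the hook budget is strictly below the HR timeout.
func fitsBudget(hrTimeoutSeconds, iterations, intervalSeconds int) bool {
	return slackSeconds(hrTimeoutSeconds, iterations, intervalSeconds) > 0
}

func main() {
	const hrTimeout = 15 * 60 // install.timeout=15m

	// Default values: prints "840 60 true".
	fmt.Println(hookWaitSeconds(168, 5), slackSeconds(hrTimeout, 168, 5), fitsBudget(hrTimeout, 168, 5))

	// The 240 x 10 override exceeds a 15m HR timeout: prints "false".
	fmt.Println(fitsBudget(hrTimeout, 240, 10))
}
```

An overlay that raises giteaWait would need to raise install.timeout with
it for the strict inequality to hold.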

Canonical-seam citations:
- The hook lives at products/catalyst/chart/templates/
  catalyst-gitea-token-secret.yaml (line ~303 pre-Fix), the
  catalyst-gitea-token-mint Job's `args` block.
- Prior pattern: bp-keycloak chart 1.4.5 (Fix #146) introduced
  keycloakConfigCli.availabilityCheck.timeout as a values knob -
  same shape (chart-internal hook timing knob, distinct from the
  outer HR timeout). See platform/keycloak/chart/values.yaml:413.
- The HR's install.timeout=15m lives at clusters/_template/
  bootstrap-kit/13-bp-catalyst-platform.yaml:484 - the chart-internal
  wait budget MUST stay strictly less than this.

Recurring class: same family as Fix #127 (bp-cutover HR 15m),
Fix #131 (bp-gitea HR 15m), Fix #150 (bp-harbor HR 15m), Fix #154
(HR-timeout audit). Those bumped the HelmRelease install.timeout.
This bumps the chart-INTERNAL wait loop budget inside the pre-
install hook Job, which is a different (lower) seam.

Per INVIOLABLE-PRINCIPLES #4 (never hardcode) the budget is fully
runtime-configurable via .Values.giteaWait. Operators may shorten on
known-warm-cluster overlays or extend on air-gapped Sovereigns.

Changes:
- products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml:
  replace hardcoded `seq 1 60` + `sleep 5` with templated
  ITERATIONS/INTERVAL vars driven by .Values.giteaWait.{iterations,
  intervalSeconds}.
- products/catalyst/chart/values.yaml: add giteaWait block with
  defaults (iterations: 168, intervalSeconds: 5 = 14m budget).
- products/catalyst/chart/Chart.yaml: bump 1.4.139 -> 1.4.140 with
  changelog entry capturing the 4-layer trace + budget arithmetic.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
  HelmRelease pin 1.4.138 -> 1.4.140 (skip 1.4.139 which is a no-op
  packaging bump on main).

Verification:
- helm template renders cleanly (2799 lines, exit 0).
- Force-render with lookup gate bypassed shows ITERATIONS=168 +
  INTERVAL=5 substituted into the rendered Job args.
- --set giteaWait.iterations=240 --set giteaWait.intervalSeconds=10
  override confirmed to emit ITERATIONS=240 + INTERVAL=10.

Test plan (post-merge, on prov #34):
- kubectl logs -n catalyst-system catalyst-gitea-token-mint-* should
  emit `waiting for gitea api ($i/168)` instead of `($i/60)`.
- bp-catalyst-platform HR reaches Ready=True within the 15m HR
  budget (previously installFailures: 2 on prov #33).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-deps): reconcile pre-existing dep-graph audit drift

Two pre-existing drift items surfaced when dep-graph-audit ran on the
Fix #184 PR — both are in `main` already, not introduced here, but the
gate blocks any PR until the expected DAG matches the actual HRs.

1. `bp-catalyst-platform` (slot 13) — actual HR file declares
   `bp-crossplane-claims` as an additional dependsOn edge (added in
   chart-roll-rca iter-15, 2026-05-10, for the XRD-ordering race that
   caused the omantel.biz 90-min wedge). Update expected-deps to
   include it.

2. `bp-hcloud-ccm` (slot 55) — present on disk but absent from
   expected-deps. Cloud-provider seam, no upstream dependencies.
   Added with empty depends_on.

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-11 14:44:54 +04:00
e3mrah
a388a61ae2
fix(bootstrap-kit/_template): wire NetBird/DMZ/Hubble/BGP via envsubst — qa-loop iter-12 Fix #53C+D follow-up (#1280)
* fix(bootstrap-kit/_template): wire NetBird/DMZ/Hubble/BGP/clustermesh-LB via envsubst — qa-loop iter-12 Fix #53C+D follow-up

The omantel chroot reconciles from clusters/_template/bootstrap-kit/ (not the per-Sovereign omantel.omani.works/ overlay). PR #1275 added slot 53 (NetBird) and slot 54 (DMZ vCluster) plus Hubble UI / BGP / clustermesh-LB to the omantel.omani.works overlay only. This PR mirrors the same changes into _template via envsubst so the chroot also picks them up.

01-cilium.yaml:
- Chart pin 1.2.0 → 1.3.0 (Hubble UI HTTPRoute overlay + clustermesh shape)
- hubble.relay/ui.enabled gated on ${HUBBLE_ENABLED:=false} (default off, backward-compat)
- bgpControlPlane.enabled gated on ${BGP_ENABLED:=false}
- clustermesh.apiserver.service.type gated on ${CLUSTERMESH_SERVICE_TYPE:=NodePort} (default NodePort, backward-compat)
- catalystOverlay.hubbleUI block (envsubst gated, off by default)
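
The `${VAR:=default}` gates above resolve to the default when the variable
is unset or empty. The sketch below implements only that one form plus
plain `${VAR}`, to make the default-off behaviour concrete; it is not the
substitution engine Flux uses, and the substitute function is a name chosen
for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// ref matches ${NAME} and ${NAME:=default}.
var ref = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}`)

// substitute replaces each reference with the variable's value, or with the
// inline default when the variable is unset or empty.
func substitute(in string, vars map[string]string) string {
	return ref.ReplaceAllStringFunc(in, func(m string) string {
		parts := ref.FindStringSubmatch(m)
		if v := vars[parts[1]]; v != "" {
			return v
		}
		return parts[2] // empty when no default was given
	})
}

func main() {
	line := "enabled: ${HUBBLE_ENABLED:=false}"
	fmt.Println(substitute(line, nil))
	fmt.Println(substitute(line, map[string]string{"HUBBLE_ENABLED": "true"}))
	fmt.Println(substitute("type: ${CLUSTERMESH_SERVICE_TYPE:=NodePort}", nil))
}
```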

53-bp-netbird.yaml NEW: NetBird Sovereign install, default-OFF via NETBIRD_ENABLED. OIDC issuer / realm parameterized through SOVEREIGN_REALM_NAME so the per-Sovereign realm rename (Fix #53A) flows through.

54-bp-dmz-vcluster.yaml NEW: DMZ vCluster install, default-OFF via DMZ_VCLUSTER_ENABLED. Vcluster name parameterized via DMZ_VCLUSTER_NAME (default `dmz`).

kustomization.yaml: added slots 53/54.

Operator opts in per-Sovereign by setting the substitutes on the bootstrap-kit Kustomization. Live patches applied to omantel for immediate effect:
- HUBBLE_ENABLED=true HUBBLE_HOSTNAME=hubble.console.omantel.biz
- BGP_ENABLED=true
- NETBIRD_ENABLED=true
- DMZ_VCLUSTER_ENABLED=true DMZ_VCLUSTER_NAME=omantel-dmz

* fix(bootstrap-deps): add bp-netbird (slot 53) + bp-dmz-vcluster (slot 54) to expected DAG — qa-loop iter-12 Fix #53C dependency-graph-audit fix
2026-05-10 11:05:20 +04:00
e3mrah
5ca0a7d178
fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236)
* fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots

Closes the scope gap acknowledged by Fix #36: bp-guacamole +
bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI
image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 /
TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment
NotFound".

CI workflows
------------
- .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless
  sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/**, then bumps
  platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml
  patch version + dispatches blueprint-release.
- .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache
  Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry
  we own — no Docker Hub rate limits, no upstream availability risk),
  bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches
  blueprint-release.

Charts (target-state)
---------------------
- bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy`
  regardless of release name (DaemonSet + Service + ClusterRole +
  ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so
  matrix can address them by canonical short name).
- bp-guacamole v0.1.1: canonical short resource names (`guacd`,
  `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream
  images; realm-patch ConfigMap correctly lands in `keycloak`
  namespace (was: realm-name, which would have failed silently on
  every Sovereign); `realmConfig.namespace` override surface added.
- Both charts: `catalyst.openova.io/smoke-render-mode: default-off`
  annotation so blueprint-release smoke-render gate honors the
  default-OFF render shape.

Bootstrap-kit slots
-------------------
- clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml +
  37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned
  to 0.1.1, default-OFF gate flipped via slot values, install/upgrade
  disableWait per session-2026-04-30 architectural decision.
- clusters/omantel.omani.works/bootstrap-kit/* slots mirror the same
  shape with omantel.biz hostnames matching the live HTTPRoutes on
  console.omantel.biz / auth.omantel.biz.

API: shells/issue handler (matrix-canonical URL surface)
--------------------------------------------------------
- POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container=
  alias for the existing
  POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session
  with matrix-canonical response fields (`sessionId`, `guacamoleUrl`,
  `recordingPath`). Same business logic, same audit surface
  (`guacamole-session-opened`), same RBAC gate (tier-developer or
  higher). 6 test cases, all PASS under -race.
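
The alias relationship between the two URL shapes can be sketched as a pure
path rewrite. This is illustrative only: the aliasToCanonical name is an
assumption, and rejecting a request when any of the three query parameters
is missing is a choice made for the example, not documented handler
behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// aliasToCanonical maps
//   /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container=
// onto
//   /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session
func aliasToCanonical(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) != 6 || parts[0] != "api" || parts[1] != "v1" ||
		parts[2] != "sovereigns" || parts[4] != "shells" || parts[5] != "issue" {
		return "", errors.New("not a shells/issue URL")
	}
	q := u.Query()
	ns, pod, ctr := q.Get("namespace"), q.Get("pod"), q.Get("container")
	if ns == "" || pod == "" || ctr == "" {
		return "", errors.New("namespace, pod and container are required")
	}
	return fmt.Sprintf("/api/v1/sovereigns/%s/k8s/exec/%s/%s/%s/session",
		url.PathEscape(parts[3]), url.PathEscape(ns),
		url.PathEscape(pod), url.PathEscape(ctr)), nil
}

func main() {
	p, err := aliasToCanonical("/api/v1/sovereigns/omantel/shells/issue?namespace=catalyst-system&pod=guacd-0&container=guacd")
	fmt.Println(p, err)
}
```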

TCs that flip PASS in iter-8
-----------------------------
- TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath
- TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system
- TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system
- TC-237: kubectl logs ds/k8s-ws-proxy → "listening"
- TC-245: viewer-cookie POST /shells/issue → 403
- TC-246: operator-cookie POST /shells/issue → 200 sessionId

Per feedback_no_mvp_no_workarounds.md: NO follow-up slices; every
gap Fix #36 acknowledged is closed in this PR. Per
feedback_machine_saturation_3rd_violation.md: CI-only build path,
no local docker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up)

CI dependency-graph-audit caught a slot-number collision: slots 36-48
are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative,
bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge,
bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix,
bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the
exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+
slot range) and add their entries to the expected DAG.

- clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-*
- clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-*
- kustomization.yaml updates (both _template + omantel)
- scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full
  dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets,
  bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+
  seaweedfs+k8s-ws-proxy)

scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55
declared HRs, 42 present on disk, 13 deferred (W2.K1-K4).
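
The "0 cycles" half of the audit amounts to cycle detection over the
dependsOn graph. The sketch below uses a three-colour depth-first search; it
is not the logic of check-bootstrap-deps.sh, and the hasCycle name is an
assumption. The sample graph uses the slot 51/52 edges listed above.

```go
package main

import "fmt"

// hasCycle reports whether the dependsOn graph contains a cycle.
// deps maps an HR name to the names it depends on.
func hasCycle(deps map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}

	var visit func(n string) bool
	visit = func(n string) bool {
		switch state[n] {
		case inProgress:
			return true // reached a node that is still on the DFS stack
		case done:
			return false
		}
		state[n] = inProgress
		for _, d := range deps[n] {
			if visit(d) {
				return true
			}
		}
		state[n] = done
		return false
	}

	for n := range deps {
		if visit(n) {
			return true
		}
	}
	return false
}

func main() {
	deps := map[string][]string{
		"bp-k8s-ws-proxy": {"bp-cilium", "bp-sealed-secrets"},
		"bp-guacamole": {"bp-cilium", "bp-cert-manager", "bp-keycloak",
			"bp-sealed-secrets", "bp-seaweedfs", "bp-k8s-ws-proxy"},
	}
	fmt.Println(hasCycle(deps))
}
```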

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:48:25 +04:00
e3mrah
94ffe01ff0
chore(bootstrap-kit): remove slot 95 bp-stalwart-sovereign (Phase-2 deferred) (#958)
The bp-stalwart-sovereign chart's post-install Job times out on fresh
Sovereigns (observed on otech113) and blocks the entire bootstrap-kit
Kustomization. Phase-2 Sovereign-local mail (umbrella #924) is OUT OF
SCOPE for the current Phase-1 cutover.

Phase-1 Console PIN/magic-link delivery already works through the
mothership SMTP relay path:
  - products/catalyst/chart/values.yaml#sovereign.smtp.* defaults to
    mail.openova.io:587 / noreply@openova.io
  - products/catalyst/bootstrap/api/internal/handler/sovereign_smtp_seed.go
    seeds those bytes into catalyst-system/sovereign-smtp-credentials at
    bootstrap, so bp-catalyst-platform's `lookup` resolves on first
    reconcile without waiting for a Sovereign-local Stalwart.

This commit:
  - Deletes clusters/_template/bootstrap-kit/95-bp-stalwart-sovereign.yaml
  - Updates the kustomization.yaml resource list with a comment block
    documenting the deferral and the canonical re-entry conditions.
  - Updates scripts/expected-bootstrap-deps.yaml so check-bootstrap-deps.sh
    no longer expects the slot. Audit re-runs clean (0 drift, 0 cycles).

The chart itself stays at platform/stalwart-sovereign/ for future
Phase-2 work; only the bootstrap slot is removed.

Refs: #883 #924

Co-authored-by: Hatice Yildiz <hatiyildiz@openova.io>
2026-05-05 15:55:30 +04:00
e3mrah
9077016466
feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924) (#931)
Phase-2 follow-up to #883: replace mothership Stalwart relay
(mail.openova.io:587) with a Sovereign-local Stalwart so Console
PIN/magic-link mail originates from `noreply@<sovereignFQDN>` with
per-Sovereign SPF/DKIM/DMARC posture, eliminating the mothership
SMTP SPOF for Sovereign Console login.

What ships:

  1. NEW blueprint platform/stalwart-sovereign/ (otech-level — distinct
     from per-tenant bp-stalwart-tenant). Single Stalwart instance per
     Sovereign cluster, scoped to Sovereign Console system mail. NO
     Keycloak OIDC, NO webmail UI — Sovereign Console is the only
     consumer. Auto-provisioned admin + submission Secrets via the
     lookup-or-generate pattern (#898/#830/#887). Post-install Job:
       - registers the noreply submission principal in Stalwart
       - allows send-as for noreply@<sovereignFQDN>
       - reads DKIM public key, patches dns-records ConfigMap
       - materialises catalyst-system/sovereign-smtp-credentials with
         Sovereign-local infrastructure addresses + credentials,
         carrying BOTH key shapes (smtp-user/smtp-pass + legacy
         user/password) so the consumer chart works either way.

  2. NEW bootstrap-kit slot 95 (clusters/_template/bootstrap-kit/
     95-bp-stalwart-sovereign.yaml). dependsOn: bp-cert-manager,
     bp-catalyst-platform. Sequenced after bp-catalyst-platform (slot
     13) so the chart's post-install Job lands its mirror Secret in
     an already-existing catalyst-system namespace.

  3. bp-catalyst-platform 1.4.19 → 1.4.20: SOURCE-wins precedence
     extended to (a) non-secret fields smtp-host/smtp-port/smtp-from
     so Sovereign-local infra addresses (`mail.<sovereignFQDN>`) take
     over from mothership defaults (`mail.openova.io`) on the next
     reconcile after slot 95 lands, and (b) canonical key shape
     `smtp-user`/`smtp-pass` in addition to legacy `user`/`password`
     source key shape.

  4. expected-bootstrap-deps.yaml: declare slot 95 graph edge.

  5. catalyst-api handler/sovereign_smtp_seed.go: documentation-only
     update to note this Phase-1 step is now a graceful fallback —
     the Phase-2 chart's post-install Job overwrites the mirror
     Secret on first reconcile so the cutover from mothership relay
     to Sovereign-local relay is automatic, no operator action.
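
Items 1 and 3 describe a mirror Secret that carries both key shapes. A
consumer that accepts either shape might look like the sketch below. The
smtpCredentials name is hypothetical, and trying the canonical keys before
the legacy ones is an assumption for the example; the commit only states
that both shapes are supported.

```go
package main

import "fmt"

// smtpCredentials reads a user/password pair from Secret data, accepting
// the canonical smtp-user/smtp-pass keys or the legacy user/password keys.
func smtpCredentials(data map[string]string) (user, pass string, ok bool) {
	shapes := [][2]string{
		{"smtp-user", "smtp-pass"}, // canonical shape, tried first (assumed order)
		{"user", "password"},       // legacy shape
	}
	for _, keys := range shapes {
		u, p := data[keys[0]], data[keys[1]]
		if u != "" && p != "" {
			return u, p, true
		}
	}
	return "", "", false
}

func main() {
	legacy := map[string]string{"user": "noreply", "password": "example"}
	fmt.Println(smtpCredentials(legacy))
}
```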

Verification:
  - `helm template smoke ./platform/stalwart-sovereign/chart` clean
    (smoke-render-safe; per-template gates skip when sovereignFQDN unset).
  - `helm template smoke -f operator-values.yaml` emits StatefulSet,
    LoadBalancer Service, ClusterIP HTTP Service, DKIM-signing config,
    dns-records ConfigMap, Setup Job + RBAC.
  - `chart/tests/sovereign-render.sh` 3 cases all PASS.
  - `helm template smoke ./products/catalyst/chart` (1.4.20) clean.
  - `helm lint` both charts: clean (only icon-recommended INFO).
  - `bash scripts/check-bootstrap-deps.sh` PASSED — bootstrap-kit
    dependency graph audit, 0 drift, 0 cycles.
  - `go test -run TestSeedSovereignSMTP` — Phase-1 seed tests pass.
  - `go test -run TestBootstrapKit_TemplateClusterParses` — slot 95
    YAML parses cleanly.

Out of scope (sub-PR follow-up under #924):
  - DKIM keypair generation in catalyst-api orchestrator + DNS records
    (MX/A/SPF/DMARC/DKIM-pubkey) registration via PDM dynadot adapter
    at omani.works.
  - Hetzner PTR (rDNS) auto-registration via the Hetzner cloud API.
  - Cert-manager Certificate adding mail.<sovereignFQDN> SAN to the
    Sovereign wildcard cert (chart relies on the existing wildcard
    cert from bp-catalyst-platform 1.4.0+'s per-zone Certificate
    template — when that wildcard chain covers the Sovereign FQDN,
    `mail.<sovereignFQDN>` is already covered).

Acceptance (lands when sub-PR follow-up ships):
  - Sovereign Console PIN delivery uses noreply@<sov-fqdn>.
  - External mail server (e.g. Gmail) accepts mail with valid SPF + DKIM.
  - Mothership SMTP no longer SPOF for Sovereign Console login.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:20:16 +04:00
e3mrah
20b3c5258a
feat(bp-newapi): chart maturation + first-otech deploy + Qwen vLLM channel (#799) (#812)
* feat(bp-newapi): chart maturation — ExternalSecret + first-otech vLLM channel + skip-render gates (#799)

Maturation work for the SME-3 turnkey-experience epic (#795). Aligns
the bp-newapi scratch chart with ADR-0003 (RBAC ↔ NewAPI user-create
hook contract) and gets it past the blueprint-release CI smoke render
that has blocked publication since PR #396 (run 25213444992 failed at
default-values render of v1.0.0).

Changes
-------
- templates/external-secret.yaml (NEW). Renders the
  `catalyst-newapi-admin-token` ExternalSecret consumed by unified-rbac
  (ADR-0003 §3.2 + §6) for issuing per-user keys against
  `http://newapi.newapi.svc/api/v1/admin/users`. Sourced from OpenBao
  via the `vault-region1` ClusterSecretStore (canonical default shipped
  by bp-external-secrets-stores). Capabilities-gated on
  `external-secrets.io/v1beta1` so cold installs without ESO don't
  fail-render. Operator supplies the per-Sovereign OpenBao path via
  `catalystIntegration.externalSecret.remoteRef.key`; canonical
  convention is `sovereign/<sovereign-fqdn>/newapi/admin-token` with
  property `ADMIN_API_TOKEN`. Per Inviolable Principle #4 every knob
  is operator-overridable in the cluster overlay.

- values.yaml. Adds `catalystIntegration.externalSecret.{enabled,
  refreshInterval, secretStoreRef.{kind,name}, remoteRef.{key,property}}`
  block (default enabled=true, key="" so a misconfigured overlay fails
  loudly at render rather than silently skipping). Adds
  `defaultChannels.vllm` block — first-otech shorthand that composes a
  vLLM-typed channel into the rendered channels list when enabled.
  Default endpoint is empty per Inviolable Principle #4; the
  `clusters/<sovereign>/bootstrap-kit/80-newapi.yaml` overlay supplies
  the per-Sovereign URL (canonical first-otech reference =
  `https://llm-api.omtd.bankdhofar.com` model `qwen3-coder`, the same
  upstream Axon uses on the OpenOva marketing deployment).

- templates/_helpers.tpl. New `bp-newapi.effectiveChannels` helper
  composes `.Values.channels` with `defaultChannels.vllm` (when
  enabled). The `assertChannelAttestation` helper now operates on the
  effective list so attestation gates apply to defaultChannels
  composition too. `defaultChannels.vllm.enabled=true` with empty
  endpoint fails fast at render with a guided error message.

- templates/configmap.yaml. Channels rendering switches to the
  effectiveChannels helper. OIDC block now skip-renders gracefully when
  `auth.adminUI.keycloak.issuer` is unset (smoke-render path) instead
  of `required`-failing; the per-Sovereign overlay sets the issuer.

- templates/deployment.yaml. Skip-render gate on Deployment when
  `database.existingSecret`, `credentials.existingSecret`, or (when
  Keycloak mode is selected) the OIDC client secret is missing. Removes
  the four `required` calls that were failing CI smoke render. Service,
  ServiceAccount, ConfigMap, NetworkPolicy still render so the smoke
  test gets a non-empty output proving structural soundness; the actual
  Deployment defers until the per-Sovereign overlay wires the secrets.

- templates/ingress.yaml. Same skip-render pattern: when either
  `ingress.host` or `ingress.adminHost` is empty, the entire ingress
  block is silently skipped. Matches the bp-keycloak / bp-openbao /
  bp-external-dns HTTPRoute templates.

- Chart.yaml. version 1.0.0 → 1.1.0 (minor bump — additive features;
  no breaking changes to existing operator overrides).
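
The effectiveChannels composition is a Helm template helper in the chart;
its logic is sketched here in Go for readability. The Channel and
VLLMDefault types, their fields, the "default-vllm" name, and appending the
default channel after the explicit ones are all assumptions made for the
example.

```go
package main

import (
	"errors"
	"fmt"
)

// Channel is a hypothetical, reduced view of one rendered channel entry.
type Channel struct {
	Name     string
	Type     string
	Endpoint string
}

// VLLMDefault mirrors the defaultChannels.vllm values block in spirit.
type VLLMDefault struct {
	Enabled  bool
	Endpoint string
}

// effectiveChannels returns the explicit channels plus the vLLM shorthand
// channel when it is enabled, failing when it is enabled without an endpoint.
func effectiveChannels(channels []Channel, vllm VLLMDefault) ([]Channel, error) {
	out := append([]Channel(nil), channels...) // copy so the input is not mutated
	if !vllm.Enabled {
		return out, nil
	}
	if vllm.Endpoint == "" {
		return nil, errors.New("defaultChannels.vllm.enabled=true requires defaultChannels.vllm.endpoint")
	}
	return append(out, Channel{Name: "default-vllm", Type: "vllm", Endpoint: vllm.Endpoint}), nil
}

func main() {
	chs, err := effectiveChannels(nil, VLLMDefault{Enabled: true, Endpoint: "https://llm.example.invalid"})
	fmt.Println(len(chs), err)

	_, err = effectiveChannels(nil, VLLMDefault{Enabled: true})
	fmt.Println(err)
}
```

Validating the composed list, rather than only the explicit one, is what
lets attestation checks cover the shorthand channel as well.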

Verification
------------
`helm template` smoke render on default values now succeeds with 4
resources (NetworkPolicy / ServiceAccount / ConfigMap / Service); 168
lines, well above the CI 5-line minimum. With a full per-Sovereign
overlay (hosts + secrets + Keycloak issuer + ESO Capabilities + Traefik
Capabilities + defaultChannels.vllm.endpoint), 8 resources render
including Deployment, both Ingresses, the Traefik allowlist Middleware,
and the ExternalSecret. The composed qwen channel writes through to
`channels.yaml` with the expected endpoint + models + attestation.

Refs
----
ADR-0003 §3.2 + §6 — admin-token contract
Issue #795 (epic) — locked decisions
Issue #796 — hook contract spec (sequential blocker, merged)
Inviolable Principles #1, #3, #4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(bootstrap-kit): slot 80 — bp-newapi default install (#799)

Adds the canonical install slot for bp-newapi to every fresh Sovereign's
bootstrap-kit. Sequenced after the W2.K1 dependency wave so NewAPI's
ExternalSecret + Postgres DSN dependencies resolve on first reconcile.

The HelmRelease declares `dependsOn: [bp-openbao, bp-keycloak, bp-cnpg]`:
- bp-openbao(08): admin-token ExternalSecret backend
- bp-keycloak(09): OIDC issuer for ops-staff admin UI at admin.<fqdn>
- bp-cnpg(16): Postgres backing for users/credits/channels/audit

Per-Sovereign overlays inherit the slot's defaults and override:
- ingress.host                                        api.${SOVEREIGN_FQDN}
- ingress.adminHost                                   admin.${SOVEREIGN_FQDN}
- auth.adminUI.keycloak.issuer
- database.existingSecret                             (Crossplane-claimed)
- credentials.existingSecret
- catalystIntegration.externalSecret.remoteRef.key    sovereign/${FQDN}/newapi/admin-token
- defaultChannels.vllm.enabled                        true (first-otech)
- defaultChannels.vllm.endpoint                       (operator-supplied)

The `_template/` slot keeps `defaultChannels.vllm.enabled: false` so a
fresh Sovereign does not silently wire customers to a third-party
endpoint; the canonical first-otech reference (Qwen3 Coder via
`https://llm-api.omtd.bankdhofar.com`, same relay Axon uses on the
OpenOva marketing deployment) is documented in-line for operators
adopting the same upstream.

Refs: #795 (epic), ADR-0003

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-deps): register bp-newapi slot 80 in expected DAG (#799)

Fixes the dependency-graph-audit drift detection caught at PR #812 CI:
the audit script enumerates HelmReleases in clusters/_template/bootstrap-kit/
and compares to scripts/expected-bootstrap-deps.yaml; an HR present on
disk but absent from the expected DAG is treated as drift.
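
The drift rule described above can be sketched as a comparison between two
maps. This is not the audit script itself: the drift and sameSet names are
assumptions, and the example treats "expected but not on disk" as deferred
rather than drift, consistent with the deferred HR counts the audit reports
elsewhere in this log.

```go
package main

import (
	"fmt"
	"sort"
)

// sameSet reports whether a and b hold the same names, ignoring order.
func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// drift compares HRs found on disk (actual) with the expected DAG.
// Entries that are expected but not on disk are treated as deferred.
func drift(expected, actual map[string][]string) []string {
	var out []string
	for name, deps := range actual {
		want, ok := expected[name]
		switch {
		case !ok:
			out = append(out, name+": on disk but not in expected DAG")
		case !sameSet(want, deps):
			out = append(out, name+": dependsOn differs from expected DAG")
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	actual := map[string][]string{
		"bp-newapi": {"bp-openbao", "bp-keycloak", "bp-cnpg"},
	}
	expected := map[string][]string{} // bp-newapi not yet registered
	fmt.Println(drift(expected, actual))

	expected["bp-newapi"] = []string{"bp-openbao", "bp-keycloak", "bp-cnpg"}
	fmt.Println(len(drift(expected, actual)))
}
```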

Adds the canonical entry for bp-newapi at slot 80 with the same
depends_on set declared on the HelmRelease itself
([bp-openbao, bp-keycloak, bp-cnpg]).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-newapi): align blueprint.yaml spec.version with Chart.yaml (#799)

The TestBootstrapKit_BlueprintCardsHaveRequiredFields static-validation
gate asserts Chart.yaml version == blueprint.yaml spec.version. The
chart was bumped to 1.1.0 in c63ecd8c; bumping the blueprint metadata
to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:17:25 +04:00
e3mrah
33dc98782b
feat(bp-self-sovereign-cutover): chart + bootstrap-kit slot 06a (#791) (#808)
New platform Blueprint at `platform/self-sovereign-cutover/chart/`. Ships
DORMANT — eight step PodSpec ConfigMaps, the registry-pivot DaemonSet, the
mutable cutover-status ConfigMap, plus ServiceAccount/RBAC. The catalyst-api
cutover endpoint (#792, merged at 03828641) reads each step ConfigMap by
label selector and stamps real Jobs only on operator-driven trigger.

Step inventory:
  01 gitea-mirror             — git push --mirror upstream → local Gitea
  02 harbor-projects          — create 7 proxy-cache projects
  03 harbor-prewarm           — HEAD-pull bootstrap-kit images through cache
  04 registry-pivot           — DaemonSet rewrites registries.yaml on every node
  05 flux-gitrepository-patch — pivot GitRepository.url → local Gitea
  06 helmrepository-patches   — pivot 38 OCI URLs → local Harbor
  07 catalyst-api-env-patch   — kubectl set env CATALYST_GITOPS_REPO_URL
  08 egress-block-test        — CiliumNetworkPolicy + 10-min sovereignty proof

Plus self-sovereign-cutover-status ConfigMap with the consumer-contract keys
(cutoverComplete, currentStep, step.<name>.result, etc.) shipped at install
with helm.sh/resource-policy: keep so chart uninstall doesn't lose state.

Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml` installs the chart
into the `catalyst` namespace (matches catalyst-api's default discovery
namespace), depends on bp-gitea + bp-harbor, uses disableWait: true.

RBAC splits `create` verbs into their own Rule WITHOUT resourceNames per
feedback_rbac_create_no_resourcenames.md — the bp-openbao loop anchor.

chart/tests/cutover-contract.sh enforces:
  - 8 step ConfigMaps render
  - required labels (part-of/component/cutover-order/cutover-mode)
  - required data keys (stepName + podSpec for job-mode)
  - step 04 mode=daemonset-wait
  - status ConfigMap retained on uninstall
  - RBAC create/resourceNames split

helm template smoke render: 1180 lines, 19 resources (1 Namespace + 1 SA +
11 ConfigMaps + 1 DaemonSet + 1 ClusterRole + 1 ClusterRoleBinding).
helm lint: clean.
scripts/check-bootstrap-deps.sh: PASSED (slot 6a registered, depends_on
[bp-gitea, bp-harbor]).

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:55:19 +04:00
e3mrah
53bc4357ca
feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767)

Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers
couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB):

1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate"
   Section with: bootstrap-kit baseline (sum of mandatory-tier component
   footprints), selected components delta, control-plane overhead, and a
   "Recommended N x <SKU>" line that turns amber when the operator's chosen
   worker count is below the rollup. Backed by per-component RAM/CPU floors
   in components/wizard/steps/componentFootprints.ts (covered by 12 unit
   tests including the otech92 reproduction).

2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at
   bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart
   9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired
   from the canonical flux-system/cloud-credentials.hcloud-token Secret
   cloud-init writes (mirrors the velero/harbor object-storage pattern).
   Pinned to the control-plane node so the autoscaler never schedules onto
   a worker it could itself terminate. 10-minute scale-down idle as the
   cost-saving default.
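
The rollup behind the "Recommended N x <SKU>" line in item 1 can be sketched as a ceiling division. This is only an illustration under stated assumptions: the function names and every figure below are hypothetical and do not come from componentFootprints.ts, and RAM is the only dimension considered.

```shell
# Hypothetical inputs: RAM floors in MiB. None of these figures come from
# componentFootprints.ts; they only illustrate the rollup.
recommended_workers() {
  # $1 = bootstrap-kit baseline, $2 = selected-components delta,
  # $3 = control-plane overhead, $4 = usable RAM per worker SKU
  total=$(( $1 + $2 + $3 ))
  # Ceiling division: the smallest worker count whose RAM covers the total.
  echo $(( (total + $4 - 1) / $4 ))
}

below_rollup() {
  # Succeeds (exit 0) when the chosen worker count ($1) is below the
  # recommendation ($2), the condition under which the line turns amber.
  [ "$1" -lt "$2" ]
}

recommended_workers 14000 3000 1000 8192
```

With these made-up inputs the total is 18000 MiB, so two 8192 MiB workers fall short and `below_rollup 2 3` succeeds.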

Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA /
KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over
KEDA for cluster scaling, and the bounds + safety story.

Per the issue's MVP scope, this PR ships the blueprint + StepReview
estimate WITHOUT the wizard StepProvider min/max pair refactor or the
tofu node-pool template restructuring. Those are tracked as a follow-up
issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps

Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected-
bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776
because the file existed without a matching entry in the expected DAG, AND
collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort +
slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to
the expected-bootstrap-deps.yaml so the audit passes.

`scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:49:44 +04:00
e3mrah
2b60e944e2
fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook (#681)
* fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook

Caught live on otech43-46: cert-manager DNS-01 challenges for
*.otechN.omani.works failed because the Sovereign-side webhook wrote
challenge TXT records to the Sovereign's local PowerDNS. omani.works is
delegated from Dynadot to ns1/2/3.openova.io which run on contabo's
central PowerDNS — the Sovereign's local PowerDNS is INVISIBLE on the
public DNS chain until pool-domain-manager seals the per-Sovereign NS
delegation. Let's Encrypt resolvers walk the public chain, query
contabo, get NXDOMAIN, the cert never issues. Manual workaround was
seeding challenge TXT directly in contabo PowerDNS.

This PR automates the right write path:

- bp-cert-manager-powerdns-webhook chart bumped to 1.0.4. Default
  powerdns.host flips from "" (skip-render) to https://pdns.openova.io
  (contabo's public PowerDNS API ingress, authoritative for omani.works).
- ClusterIssuer letsencrypt-dns01-prod-powerdns now usable with no
  per-cluster powerdns.host override for the omani.works pool.
  apiKeySecretRef.namespace clarified — upstream ignores it; the Secret
  must live in cert-manager namespace (= ChallengeRequest.ResourceNamespace
  for ClusterIssuers).
- bootstrap-kit slot 49 updated: drops bp-powerdns dependsOn (webhook
  calls out-of-cluster contabo, not local PowerDNS), bumps chart version,
  removes inline powerdns.host override (defaults are correct).
- bootstrap-kit slot 49b (bp-cert-manager-dynadot-webhook) DELETED
  entirely — Dynadot is NOT the API-level authority for omani.works
  subdomains, the dynadot webhook silently fails the same way the
  Sovereign-local powerdns one did.
- clusters/_template/sovereign-tls/cilium-gateway-cert.yaml flips
  issuerRef from letsencrypt-dns01-prod (was dynadot-backed) to
  letsencrypt-dns01-prod-powerdns (the new contabo-backed issuer).
- bp-cert-manager chart: certManager.issuers.dns01.enabled defaults to
  false (deprecated dynadot path). letsencrypt-http01-prod retained for
  per-host certs. Cluster overlays MAY flip dns01.enabled=true for
  non-omani.works pools where Dynadot IS the API-level authority.
- scripts/expected-bootstrap-deps.yaml: drops slot 49b, drops bp-powerdns
  edge from slot 49.
- Documentation (README + blueprint.yaml + Chart.yaml description)
  rewritten to reflect contabo retarget and lifecycle reasoning.

Credential plumbing (out of scope here, must be done in cloud-init):
- Every Sovereign needs a `powerdns-api-credentials` Secret in the
  `cert-manager` namespace whose `api-key` value matches contabo's
  PowerDNS API key. Same seeding pattern as `dynadot-api-credentials`
  in infra/hetzner/cloudinit-control-plane.tftpl.

Caveat — basicAuth on contabo's PowerDNS API ingress: contabo currently
fronts pdns.openova.io with Traefik basicAuth (per
clusters/contabo-mkt/apps/powerdns/helmrelease.yaml). The upstream
zachomedia/cert-manager-webhook-pdns binary supports the X-API-Key
header but not HTTP Basic Auth out of the box. To make this end-to-end
green, contabo's basicAuth requirement must be relaxed (X-API-Key alone
provides the auth posture, and contabo's API endpoint is restricted to
operator IPs by other means), OR the Sovereign's webhook needs an
Authorization header injected via the chart's powerdns.headers map
(plaintext password in the ClusterIssuer config — not ideal). This PR
ships the chart side; the basicAuth question is a follow-up on the
contabo side.

Verified locally:
- helm lint platform/cert-manager-powerdns-webhook/chart -> PASS
- helm template platform/cert-manager-powerdns-webhook/chart -> renders
- helm template ... --set clusterIssuer.enabled=true -> renders the
  ClusterIssuer with host="https://pdns.openova.io" + correct apiKey
  Secret reference.
- helm template platform/cert-manager/chart -> renders ONLY
  letsencrypt-http01-prod (the dns01 dynadot issuer correctly gated off).
- scripts/check-bootstrap-deps.sh: net-zero new drift; my branch reduces
  pre-existing errors from 3 to 2 (the dropped slot 49b removed the only
  drift my branch was responsible for).

Closes follow-up to #373. Preconditions for handover URL TLS green
on otech43-46 lineage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): repair YAML structure in expected-bootstrap-deps.yaml

Two pre-existing drifts were blocking dependency-graph-audit CI:

1. Slot 5a (bp-reflector) was missing its closing list separator,
   causing yq to merge the bp-nats-jetstream entry into the bp-reflector
   map and effectively drop bp-reflector from the expected DAG.
   Added explicit `- slot: 7` for bp-nats-jetstream and quoted "5a" so
   yq treats it as a string slot (matches the convention with "49b").

2. bp-powerdns slot 11: actual bootstrap-kit declares dependsOn
   bp-cnpg (live since otech28 — pdns-pg-app secret race) but the
   expected DAG was missing this edge.

This unblocks merging fix/cert-manager-powerdns-webhook-contabo (PR
above) — these drifts existed on main but weren't surfaced until the
last expected-deps edit forced a re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:12:48 +04:00
e3mrah
74921e30f1
fix(architecture): drop bp-spire, Cilium WireGuard is the canonical east-west mesh (#665)
Founder direction 2026-05-03: with 100% Cilium mesh enforcement +
Envoy where required, bp-spire is redundant for the minimal Sovereign
MVP.

Reasoning:
- Cilium 1.13+ has built-in mutual auth using SPIFFE, but it ships
  with its own embedded SPIRE server managed by the Cilium operator.
  External bp-spire is not needed for east-west mTLS.
- Our ESO→OpenBao auth uses the K8s ServiceAccount auth method
  (TokenReview against kube-apiserver), not JWT-SVID.
- WireGuard transparent encryption (already enabled in cilium values)
  encrypts every pod-to-pod connection at the kernel transport layer.
- Cross-Sovereign federation and per-workload-fingerprint attestation
  are not blocking handover; they can be re-introduced as an opt-in
  blueprint when needed.

Changes:
- Delete clusters/_template/bootstrap-kit/06-spire.yaml
- Remove bp-spire from kustomization.yaml + expected-bootstrap-deps.yaml
- Remove bp-spire dependsOn from 07-nats-jetstream.yaml + 08-openbao.yaml
- bp-cilium 1.1.4: add encryption.nodeEncryption=true so node-to-node
  traffic (not just pod-to-pod) is also WireGuard-encrypted; document
  in values.yaml comment that WireGuard is the canonical east-west
  mTLS layer.

Removes 4 pods (spire-server, spire-agent, spire-spiffe-csi-driver,
spire-spiffe-oidc-discovery-provider) from every Sovereign and the
recurring CSI mount race that was getting stuck on otech43.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 13:56:36 +04:00
e3mrah
be6e610093
fix: drop bp-langfuse from minimal + bp-mimir 1.0.2 push_grpc fix (#664)
* fix: drop bp-langfuse from minimal bootstrap-kit + bp-mimir push_grpc fix

Two independent fixes packaged together:

1. **Drop bp-langfuse** from the SOLO minimal bootstrap-kit. Per
   founder direction: langfuse is LLM-specific (prompt/completion
   tracing for AI plane), not platform infrastructure, and belongs
   to a future 'AI Add-On' template. Its CreateContainerConfigError
   on every Sovereign provision (missing langfuse-secrets pre-install)
   was eating Phase-1 reconciliation budget without contributing to
   handover-ready state. Removed:
   - clusters/_template/bootstrap-kit/26-langfuse.yaml
   - kustomization.yaml entry
   - scripts/expected-bootstrap-deps.yaml slot 26 entry

2. **bp-mimir 1.0.2** — re-enable ingester.push_grpc_method_enabled.
   Upstream mimir-distributed 6.0.6 disables Push gRPC when
   ingest-storage is off, but classic-mode ingester REQUIRES it.
   The combo crashloops with 'cannot disable Push gRPC method in
   ingester, while ingest storage (-ingest-storage.enabled) is not
   enabled'. Caught live on otech43 with 17 restarts.

Both issues block Phase-1 ready=40/40 from being a clean signal.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-mimir): chart 1.0.2 push_grpc_method_enabled + finalize langfuse drop

Follow-up to previous commit which only captured the file deletion.
This commit applies: bp-mimir 1.0.2 chart bump, kustomization +
expected-deps removal of langfuse, bootstrap-kit version bumps.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 13:50:38 +04:00
e3mrah
544dc86b5b
fix(wizard): blueprint deps sourced from Flux dependsOn (single source of truth) (#652)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).
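
The key normalisation in step 1 (bp- prefix stripped on both the source and the target side) can be sketched as below. This is not the contents of scripts/generate-blueprint-deps.sh, which are not shown here; `json_entry` is a hypothetical helper, and the HelmRelease parsing that would supply its arguments is left out.

```shell
# Hypothetical helper: prints one JSON object member for a HelmRelease name
# ($1) and its dependsOn names (remaining arguments), stripping the bp-
# prefix from both sides.
json_entry() {
  key=${1#bp-}
  shift
  out=""
  for dep in "$@"; do
    # Append the quoted, prefix-stripped dependency, comma-separated.
    out="${out:+$out, }\"${dep#bp-}\""
  done
  printf '"%s": [%s]\n' "$key" "$out"
}

json_entry bp-keycloak bp-cert-manager bp-gateway-api
```

A release with no dependsOn entries yields an empty array, so every component id still gets a key in the generated file.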

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:47:52 +04:00
e3mrah
8d2ba0495d
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)
2026-05-02 15:18:49 +04:00
hatiyildiz
7c3ff940ff
fix(ci): update solver_test.go fixtures + expected-bootstrap-deps.yaml for #550
- core/cmd/cert-manager-dynadot-webhook/solver_test.go: fix SetDns2Response →
  SetDnsResponse and ResponseCode:"0" → ResponseCode:0 in test fixtures so
  webhook command tests pass against the corrected dynadot-client JSON parsing
- scripts/expected-bootstrap-deps.yaml: declare bp-cert-manager-dynadot-webhook
  at slot 49b so the bootstrap-kit dependency-graph audit passes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:44:18 +02:00
e3mrah
f689766615
fix(infra): add explicit dependsOn to bp-openbao + bp-catalyst-platform (#512) (#513)
Phase-8a-preflight live deployment otech16 (9e14dcc0d2de7586, 2026-05-02):
even after bumping install/upgrade timeout to 15m (commit f47948e7), the
post-install hooks for bp-openbao and bp-catalyst-platform STILL race their
dependencies. The hooks need workload pods Ready before they can do their
work — bp-openbao 3-node Raft init waits for cnpg-postgres + Cilium L7,
and bp-catalyst-platform umbrella init waits for keycloak + cnpg.

Fix (Option C — explicit dependsOn):
- bp-openbao: add bp-cnpg (already had bp-spire, bp-gateway-api)
- bp-catalyst-platform: add bp-keycloak + bp-cnpg (already had bp-gitea, bp-gateway-api)

This makes Flux wait for those HRs Ready=True BEFORE starting the install,
so the post-install hooks run after deps are warm. Eliminates the race.

Updated scripts/expected-bootstrap-deps.yaml to match. Verified:
- bash scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles
- go test ./tests/e2e/bootstrap-kit/... -run TestBootstrapKit_DependencyOrderMatchesCanonical — PASS

Closes #512

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:00:56 +04:00
e3mrah
e1f7d22f3c
fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503) (#505)
Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream
Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them
ahead of every chart that ships HTTPRoute templates: bp-openbao,
bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor,
bp-grafana.

Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on
catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before
stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to
InstallFailed with `no matches for kind "HTTPRoute" in version
"gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI.
enabled=true` flag wires up the cilium gateway controller and creates
the `cilium` GatewayClass, but does NOT install the
gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no
`installCRDs`-equivalent knob for gateway-api so the upstream CRDs must
ship via a separate Blueprint.

Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by
the founder for ALL similar future cases: intra-chart CRD-ordering
breaks → split into two charts + Flux dependsOn. Mirrors the
bp-crossplane/bp-crossplane-claims and bp-external-secrets/
bp-external-secrets-stores splits.

Files:
- platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with
  per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0
  standard-install.yaml; helm.sh/resource-policy: keep on every CRD so
  Helm uninstall does not orphan every HTTPRoute on the cluster
- platform/gateway-api/chart/scripts/regenerate.sh — developer tool
  for re-vendoring on upstream version bump (annotation-driven)
- platform/gateway-api/chart/tests/crd-render.sh — chart integration
  test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin)
- clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease
  + HelmRepository, dependsOn bp-cilium
- clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea,
  11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml —
  add `dependsOn: bp-gateway-api`
- clusters/_template/bootstrap-kit/kustomization.yaml — register
  01a-gateway-api.yaml between 01-cilium and 02-cert-manager
- scripts/expected-bootstrap-deps.yaml — declare slot 1a + add
  bp-gateway-api to depends_on of every HTTPRoute-using slot

Closes #503

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:30:50 +04:00
e3mrah
1865ac8975
fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) (#504)
* fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340)

The upstream seaweedfs/seaweedfs 4.22.0 chart now ships
templates/shared/security-configmap.yaml which calls fromToml — a Sprig
function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm
SDK older than 3.13 and PARSES every template before any
{{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's
mere presence breaks install on every Sovereign with:

  parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21):
    function "fromToml" not defined

even though enableSecurity defaults to false. Setting the gate value
does NOT skip parsing — only deleting / never-shipping the file does.

Fix shape (per ticket #340):

1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/
   (committed bytes, not auto-pulled at build time). Required because the
   upstream Helm repo overwrites 4.22.0 in place — re-pulling would
   re-introduce the broken file.
2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml.
   Every other template that references the deleted ConfigMap is gated
   under {{- if enableSecurity }} so removing it is a no-op for our
   default deployment shape (Catalyst SeaweedFS auth happens at the S3
   layer via IAM creds from External Secrets, not via the upstream
   chart's TLS/JWT machinery).
3. Drop the dependencies: block from chart/Chart.yaml; add
   annotations.catalyst.openova.io/no-upstream=true so the
   blueprint-release workflow's hollow-chart guard (issue #181) skips
   the auto-pull/round-trip checks for this chart.
4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the
   vendored bytes are tracked.
5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled).
6. Add tests/no-fromtoml.sh — chart-test that asserts the offending
   file stays deleted across future re-vendors. Runs in
   .github/workflows/blueprint-release.yaml as a publish-gating check.
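
A guard in the spirit of item 6 might look like the sketch below. The actual contents of tests/no-fromtoml.sh are not shown in this message; `check_no_fromtoml` is a hypothetical name, and the extra grep for any remaining `fromToml` call is an assumption added here, not something the commit states.

```shell
# Hypothetical guard. $1 = path to the vendored chart directory.
check_no_fromtoml() {
  # The offending template must stay deleted across re-vendors.
  if [ -e "$1/templates/shared/security-configmap.yaml" ]; then
    echo "FAIL: security-configmap.yaml is present again" >&2
    return 1
  fi
  # Assumed extra check: no other template may call fromToml either, because
  # templates are parsed even when their values gate is off.
  if grep -rq 'fromToml' "$1/templates" 2>/dev/null; then
    echo "FAIL: a template still references fromToml" >&2
    return 1
  fi
  echo "OK: no fromToml templates"
}
```

It would be called with the vendored chart path, for example `check_no_fromtoml chart/charts/seaweedfs`, and a non-zero return would gate the publish step.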

Unblocks Phase-8a observability + storage chain on otech (bp-loki,
bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn
bp-seaweedfs).

Closes #340

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps

The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines
35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct
architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud
Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG
in scripts/expected-bootstrap-deps.yaml was never updated to match.

Pre-existing drift on main; surfaced by the dependency-graph-audit
check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the
audit passes on the same PR — the two changes are both about the
storage chain on Sovereigns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:20:59 +04:00
e3mrah
87ba48c44e
fix(ci): vendor-coupling guardrail path - products/catalyst/bootstrap/api/internal/objectstorage (closes #438) (#440)
The mode-gate check was looking for ${REPO_ROOT}/internal/objectstorage
but the actual Go package lives at products/catalyst/bootstrap/api/internal/objectstorage.
Update the path so hard-fail mode auto-engages on this repo.

Validation:
  bash scripts/check-vendor-coupling.sh
  -> HARD-FAIL mode banner emitted, exit 0 on clean tree
  Synthetic 'hetzner-object-storage' under platform/ -> exit 1.

Refs: PR #437 (#383) which surfaced the bug.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:21:57 +04:00
e3mrah
0fdd411e79
ci(guardrail): vendor-coupling check - fail CI if chart values use vendor name (closes #428) (#431)
Adds scripts/check-vendor-coupling.sh + .github/workflows/check-vendor-coupling.yaml
that scan platform/, clusters/, products/catalyst/bootstrap/{api,ui} for vendor names
(hetzner|aws|gcp|azure|oci) appearing in capability-named slots:

  1. <vendor>-object-storage          (sealed-secret / overlay-secret name)
  2. <chart>Overlay\.<vendor>\.       (chart values block keyed to vendor)
  3. <vendor>ObjectStorage            (camelCase payload field)

Excludes legitimately-per-provider paths (infra/<provider>/, internal/<provider>/,
internal/objectstorage/<provider>/, core/pkg/<provider>/), Crossplane Provider CR
refs (lines containing "crossplane-contrib/provider-"), and *.md files (docs may
discuss the rule).
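
The three slot shapes and the Crossplane exclusion can be sketched per line as follows. The real regexes in scripts/check-vendor-coupling.sh are not shown in this message, so `PATTERN` and `has_vendor_coupling` are illustrative assumptions; the path-based exclusions and the *.md skip are not modelled.

```shell
VENDORS='hetzner|aws|gcp|azure|oci'
# One alternation per capability-named slot shape listed above.
PATTERN="(${VENDORS})-object-storage|[A-Za-z]+Overlay\.(${VENDORS})\.|(${VENDORS})ObjectStorage"

# Hypothetical helper: succeeds when the given line couples a vendor name to
# a capability-named slot. Crossplane Provider references are dropped first.
has_vendor_coupling() {
  printf '%s\n' "$1" \
    | grep -v 'crossplane-contrib/provider-' \
    | grep -Eq "$PATTERN"
}

has_vendor_coupling 'veleroOverlay.hetzner.enabled: true' && echo "coupled"
```

The patterns are unanchored, so a stricter version would probably want word boundaries to avoid matching vendor names embedded in longer identifiers.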

Mode gate: warn-only while internal/objectstorage/ does not exist (pre-#425
work-in-progress); hard-fail once that directory lands. Locally on this branch
the script emits 49 warnings to stderr and exits 0 against the existing
hetzner-coupled references in platform/velero, platform/seaweedfs, and
clusters/.../bootstrap-kit/34-velero.yaml; once #425's rename lands those
warnings disappear and any future re-introduction fails CI.
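
The mode gate reduces to a directory test plus an exit-code decision. A minimal sketch, assuming the repo root is passed in explicitly; `gate_mode` and `exit_code_for` are hypothetical names, not functions from the script.

```shell
# $1 = repo root. Hard-fail engages only once the directory exists.
gate_mode() {
  if [ -d "$1/internal/objectstorage" ]; then
    echo hard-fail
  else
    echo warn-only
  fi
}

# $1 = mode, $2 = number of findings. Warn-only always exits 0, which matches
# the "49 warnings, exit 0" behaviour described above.
exit_code_for() {
  if [ "$1" = hard-fail ] && [ "$2" -gt 0 ]; then
    echo 1
  else
    echo 0
  fi
}
```

Keying the gate on the directory's existence means the check tightens when the rename lands, with no separate flag to flip.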

Workflow trigger surface: push-to-main + pull_request on the scanned paths +
workflow_dispatch. No schedule: cron per CLAUDE.md "every workflow MUST be
event-driven, NEVER scheduled".

Canonical seam used: scripts/ + .github/workflows/ (mirrors
scripts/check-bootstrap-deps.sh + .github/workflows/blueprint-release.yaml
shape). NOT a duplicate - no prior vendor-coupling guard existed.

Refs: docs/omantel-handover-wbs.md §3a (canonical-seam map)
      docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:49:49 +04:00
e3mrah
92b7db622d
fix(bp-external-secrets-stores): split ClusterSecretStore into separate chart per #247 pattern (closes #331) (#426)
* fix(bp-external-secrets): split ClusterSecretStore into bp-external-secrets-stores chart (resolves CRD ordering, closes #331)

bp-external-secrets@1.0.0 deadlocked on first install on otech.omani.works:

  Helm install failed for release external-secrets-system/external-secrets
  with chart bp-external-secrets@1.0.0:
  failed post-install: unable to build kubernetes object for deleting hook
  bp-external-secrets/templates/clustersecretstore-vault-region1.yaml:
  resource mapping not found for name: "vault-region1" namespace: ""
  no matches for kind "ClusterSecretStore" in version "external-secrets.io/v1beta1"

Root cause: Helm's `helm.sh/hook-delete-policy: before-hook-creation` ran
a kubectl-style lookup of the existing ClusterSecretStore CR before the
upstream `external-secrets` subchart's CRDs finished registration. The
in-line ClusterSecretStore template (templates/clustersecretstore-vault-
region1.yaml) and the upstream subchart's CRDs co-installed in the same
release; admission ordering wasn't deterministic enough to make the
post-install hook safe.

Fix — same pattern as PR #247 (bp-crossplane@1.1.3 ↔ bp-crossplane-claims@1.0.0):
split the chart into controller + stores. Flux dependsOn orders them.

  - bp-external-secrets@1.1.0 — controller-only (just upstream subchart
    + NetworkPolicy + ServiceMonitor toggle). CRDs register here.
  - bp-external-secrets-stores@1.0.0 (NEW) — the default
    ClusterSecretStore CR; depends on bp-external-secrets being Ready.
    No Helm hooks needed: by the time this chart's HelmRelease starts,
    Flux has already verified bp-external-secrets is Ready=True and
    therefore the CRDs are registered.

Files:
  NEW: platform/external-secrets-stores/blueprint.yaml             (1.0.0)
  NEW: platform/external-secrets-stores/chart/Chart.yaml           (1.0.0; no upstream subchart, annotation `catalyst.openova.io/no-upstream: "true"`)
  NEW: platform/external-secrets-stores/chart/values.yaml          (clusterSecretStore.* knobs moved from controller chart)
  MOVED: platform/external-secrets/chart/templates/clustersecretstore-vault-region1.yaml
       → platform/external-secrets-stores/chart/templates/clustersecretstore-vault-region1.yaml
       (Helm hook annotations removed — Flux dependsOn now handles ordering)
  TOUCHED: platform/external-secrets/chart/Chart.yaml              (1.0.0 → 1.1.0; description note appended)
  TOUCHED: platform/external-secrets/blueprint.yaml                (1.0.0 → 1.1.0)
  TOUCHED: platform/external-secrets/chart/values.yaml             (clusterSecretStore block removed; pointer comment added)
  NEW: clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml
       (Flux HelmRelease, dependsOn: [bp-external-secrets, bp-openbao])
  TOUCHED: clusters/_template/bootstrap-kit/15-external-secrets.yaml
       (chart version 1.0.0 → 1.1.0)
  TOUCHED: clusters/_template/bootstrap-kit/kustomization.yaml
       (slot 15a inserted after 15)

Out of scope for this PR (separate tickets):
  - blueprint-release.yaml CI fan-out: verify the path-matrix picks up
    the new platform/external-secrets-stores/ directory automatically;
    if not, add the directory to the matrix in a follow-up.
  - Per-Sovereign cluster directory edits (#257 will delete those).
  - Phase 0 minimum trim (#310 will renumber slots; this PR uses 15a as
    a non-disruptive sub-slot insertion that works with both the current
    35-slot kustomization and the eventual 15-slot canonical layout —
    when #310 renumbers, 15 + 15a become 08 + 09 in the canonical order).

Refs: #331 (this issue), #247 (pattern reference — bp-crossplane split),

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): register bp-external-secrets-stores in expected-bootstrap-deps.yaml

The dependency-graph-audit CI step rejected PR #334 because the new
bp-external-secrets-stores HR was on disk at slot 15a but missing from
the expected DAG. This commit adds it with the same dependsOn shape as
clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml:
[bp-external-secrets, bp-openbao].

Refs: #331, #310 (Phase 0 minimum), PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bp-external-secrets): retire CR cases from controller test, add stores-toggle (#331)

After splitting the default ClusterSecretStore into bp-external-secrets-stores
@1.0.0, the controller chart's observability-toggle integration test still
expected the CR to render in the controller chart (Cases 4 + 5). Those
assertions now belong on the new chart.

Changes:
  - platform/external-secrets/chart/tests/observability-toggle.sh:
    Replace Cases 4+5 with a single inverted assertion — the controller
    chart MUST render ZERO ClusterSecretStore CRs (top-level kind:); only
    the upstream subchart's CRD definition (whose spec.names.kind value is
    "ClusterSecretStore" at non-zero indent) is allowed.
  - platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh:
    NEW. Mirrors the retired Cases 4+5 against the stores chart, plus a
    Case 3 that asserts clusterSecretStore.server overrides propagate.
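
The top-level versus indented `kind:` distinction can be illustrated with an anchored grep. This is a sketch, not the assertion used in observability-toggle.sh; `count_top_level_kind` is a hypothetical helper that reads rendered manifests on stdin.

```shell
# Counts documents whose top-level kind is $1. The ^ anchor excludes the CRD
# definition, where "kind: ClusterSecretStore" sits indented under spec.names.
count_top_level_kind() {
  # grep -c prints 0 but exits 1 when nothing matches; keep the exit status 0.
  grep -c "^kind: $1\$" || true
}

printf 'kind: CustomResourceDefinition\nspec:\n  names:\n    kind: ClusterSecretStore\n' \
  | count_top_level_kind ClusterSecretStore
```

Piping `helm template` output into the helper and comparing the count to zero would give the inverted assertion for the controller chart.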

Local smoke:
  bash platform/external-secrets/chart/tests/observability-toggle.sh         → 4/4 PASS
  bash platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh → 3/3 PASS

Refs: #331, PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): handle alphanumeric sub-slot suffixes in check-bootstrap-deps.sh

PR #334 (issue #331) added slot 15a-external-secrets-stores as a sub-slot
between numeric slots 15 and 16. The bootstrap-deps audit script's
`printf '%02d'` formatter rejected `15a` with:

  scripts/check-bootstrap-deps.sh: line 390: printf: 15a: invalid number

Fix: detect non-numeric slot tokens and pass them through verbatim. Numeric
slots still render as zero-padded `01..49` for output alignment.
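
The pass-through can be sketched with a `case` on the slot token. `format_slot` is a hypothetical name; the function at line 390 of the real script is not shown here.

```shell
# Numeric slots are zero-padded to two digits; anything containing a
# non-digit (such as 15a) is printed verbatim instead of reaching %02d.
format_slot() {
  case "$1" in
    ''|*[!0-9]*)
      printf '%s\n' "$1"
      ;;
    *)
      # Strip leading zeros first so printf does not read 08 or 09 as octal.
      n=$(printf '%s' "$1" | sed 's/^0*//')
      printf '%02d\n' "${n:-0}"
      ;;
  esac
}

format_slot 7
format_slot 15a
```

Checking for a non-digit avoids relying on printf's error path, so the script does not print a diagnostic for every sub-slot.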

Local smoke:
  $ bash scripts/check-bootstrap-deps.sh
  ...
    [P] slot 15  bp-external-secrets        <-- bp-cert-manager bp-openbao
    [P] slot 15a bp-external-secrets-stores <-- bp-external-secrets bp-openbao
  ...
  OK: bootstrap-kit dependency graph audit PASSED

Refs: #331, PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): tick #331 chart-released

bp-external-secrets@1.1.0 (controller-only) + bp-external-secrets-stores@1.0.0
(NEW) shipped in PR #426. Helm-template acceptance + both toggle tests +
dependency-graph-audit all green. Sovereign-impact deferred to Phase 8.

Refs: #331, PR #426.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 17:33:47 +04:00
e3mrah
f7796ef807
feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) (#423)
* feat(bp-velero): Hetzner Object Storage backend wiring (closes #384)

Velero on a Hetzner Sovereign now writes its backups DIRECTLY to Hetzner
Object Storage per ADR-0001 §13 (S3-aware app architecture rule) +
docs/omantel-handover-wbs.md §3 — NOT SeaweedFS, which is reserved as a
POSIX→S3 buffer for legacy POSIX-only writers and is not in the minimal
Sovereign set.

Mirrors the Hetzner-direct backend pattern Agent #383 is wiring for
Harbor; both consume the canonical flux-system/hetzner-object-storage
Secret shipped by issue #371 (cloud-init writes 5 keys: s3-endpoint /
s3-region / s3-bucket / s3-access-key / s3-secret-key, derived from
the operator-issued Hetzner-Console keys + the per-Sovereign bucket
provisioned by OpenTofu's aminueza/minio resource).

platform/velero/chart/ (umbrella chart, bumped to 1.1.0):
  - templates/_helpers.tpl: NEW — bp-velero.fullname / bp-velero.labels
    helpers + bp-velero.hetznerCredentialsSecretName (default
    `velero-hetzner-credentials`).
  - templates/hetzner-credentials-secret.yaml: NEW — synthesises a
    velero-namespace Secret with a single `cloud` key in AWS-CLI INI
    format from .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}.
    The upstream Velero deployment mounts this at /credentials/cloud
    via existingSecret + AWS_SHARED_CREDENTIALS_FILE. Skip-render path
    when veleroOverlay.hetzner.enabled is false (default — keeps
    contabo render clean) or useExistingSecret is true (operator
    supplied Secret out-of-band).
  - values.yaml: BSL provider/region/s3Url/bucket fields populated as
    placeholders the per-Sovereign HelmRelease overrides via Flux
    valuesFrom; backupsEnabled defaults FALSE so default render emits
    no half-broken BSL; veleroOverlay.hetzner block surfaces the
    operator-overridable fields. Long-form rationale comments inline
    on each value per the chart's existing docstring style.
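
Illustrative sketch (not part of this commit): the `cloud` key described
above holds a credentials file in the AWS shared-credentials INI layout.
The chart produces it with a Helm template; the Python below only shows the
expected shape. The function name `render_cloud_credentials` and the
`default` profile name are assumptions made for the example.

```python
def render_cloud_credentials(access_key: str, secret_key: str,
                             profile: str = "default") -> str:
    """Return the text stored under the Secret's `cloud` key.

    Hypothetical helper: it mirrors the AWS shared-credentials INI layout
    that AWS_SHARED_CREDENTIALS_FILE points at.
    """
    if not access_key or not secret_key:
        # Refuse to emit a half-filled credentials file.
        raise ValueError("both access_key and secret_key are required")
    return (
        f"[{profile}]\n"
        f"aws_access_key_id={access_key}\n"
        f"aws_secret_access_key={secret_key}\n"
    )


if __name__ == "__main__":
    # Same placeholder values as the smoke render in this commit message.
    print(render_cloud_credentials("AK_TEST", "SK_TEST"), end="")
```

The file is mounted at /credentials/cloud, so the S3 client reads it like
any other shared-credentials file.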

clusters/_template/bootstrap-kit/34-velero.yaml (+ omantel + otech):
  - dependsOn: bp-seaweedfs REMOVED — Velero is no longer a SeaweedFS
    consumer on Sovereigns (was the old SeaweedFS-tiered architecture
    that minimal-omantel retired in favour of cloud-native S3).
  - chart version bumped 1.0.0 → 1.1.0.
  - valuesFrom block added: 5 Secret-key entries pull each canonical
    s3-* key into the matching umbrella value path. Plaintext
    credentials never appear in the committed manifest; Flux
    dereferences valuesFrom at HelmRelease apply time.
  - values block adds the baseline veleroOverlay.hetzner.enabled=true
    + velero.credentials.{useSecret:true,
    existingSecret:velero-hetzner-credentials} + BSL
    provider/credential/s3ForcePathStyle scaffolding that the
    valuesFrom entries fill in.

docs/omantel-handover-wbs.md:
  - §2 row 19: " chart needs S3 endpoint rework" → "🟢 chart-released
    v1.1.0 — Hetzner Object Storage backend wired to #371 secret".
  - §9 #384 row: detailed status with smoke evidence.

Smoke evidence (contabo, default values — no Hetzner credentials):
  - helm template t . → renders cleanly (no Hetzner Secret, no BSL).
  - helm template t . --set veleroOverlay.hetzner.enabled=true \
      --set ...accessKey=AK_TEST --set ...secretKey=SK_TEST \
      --set velero.backupsEnabled=true (+ BSL config) →
      Secret/velero-hetzner-credentials with `cloud` INI key emitted +
      BackupStorageLocation/default with provider=aws,
      bucket=omantel-velero, region=fsn1,
      s3Url=https://fsn1.your-objectstorage.com.
  - helm install velero-smoke . -n velero-smoke (defaults) → pod
    velero-69bb84c5-669sh Ready 1/1 in 48s. Smoke torn down clean.

Hetzner-S3 E2E deferred to Phase 8 (first omantel run) — contabo has
no Hetzner Object Storage credentials so end-to-end backup→restore
verification can't run here.

Anti-duplication rule: NO bash scripts authored, NO parallel
implementations of upstream Velero functionality. Upstream Velero +
velero-plugin-for-aws natively support any S3-compatible backend; the
work here is values + a credential-shape adapter Secret, not a fork.

Closes #384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): drop bp-seaweedfs dep from bp-velero expected DAG (#384)

Mirrors the dependsOn removal in
clusters/_template/bootstrap-kit/34-velero.yaml from the parent commit.
Velero on Hetzner Sovereigns now writes directly to Hetzner Object
Storage (ADR-0001 §13 + WBS §3); no in-cluster prerequisite Blueprint is
required.

Local `bash scripts/check-bootstrap-deps.sh` now passes (0 drift,
0 cycles). The CI failure on the parent commit's PR was the audit
flagging bp-velero as having a missing edge to bp-seaweedfs because
this expected-DAG file still listed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:24:44 +04:00
e3mrah
6e0f734d62
fix(bootstrap-kit): renumber bp-cert-manager-powerdns-webhook 36→49 + register in expected DAG (#373 followup) (#412)
PR #410 landed slot 36 for bp-cert-manager-powerdns-webhook, but slot 36
was already reserved in scripts/expected-bootstrap-deps.yaml for
bp-stunner (W2.K4 forward-declaration). The bootstrap-kit dependency
audit failed on the merge SHA 04308af7 with:

  ERROR: HR 'bp-cert-manager-powerdns-webhook' (file
  clusters/_template/bootstrap-kit/36-bp-cert-manager-powerdns-webhook.yaml)
  is present on disk but NOT declared in
  scripts/expected-bootstrap-deps.yaml.

Two fixes here:

  1. Move the file to slot 49 (first free slot after W2.K4's 35-48
     forward declarations). File renamed; kustomization.yaml updated;
     in-file comment block updated to explain the slot choice.

  2. Register slot 49 in scripts/expected-bootstrap-deps.yaml as
     `wave: present` with `depends_on: [bp-cert-manager, bp-powerdns]` —
     matches the HelmRelease's actual dependsOn block.

Local audit:
  $ bash scripts/check-bootstrap-deps.sh
  Present on disk:       36
  Declared expected:     49
  Deferred (W2.K1-K4):   13
  Drift:                 0
  Cycles:                0
  OK: bootstrap-kit dependency graph audit PASSED
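
Illustrative sketch (not part of this commit): the counters in the audit
output above can be read as set comparisons between the HelmReleases on
disk and the expected DAG. The Python below is an assumption about how
such a comparison could be written, not the contents of
scripts/check-bootstrap-deps.sh; `audit_graph` and its result keys are
hypothetical names.

```python
def audit_graph(actual: dict[str, list[str]],
                expected: dict[str, list[str]]) -> dict[str, list[str]]:
    """Compare on-disk dependsOn edges with the expected DAG.

    actual   maps HelmRelease name -> dependsOn names found on disk
    expected maps HelmRelease name -> depends_on names declared as expected
    """
    on_disk, declared = set(actual), set(expected)
    return {
        # On disk but never declared: the ERROR case quoted above.
        "undeclared": sorted(on_disk - declared),
        # Declared but not yet on disk: informational only.
        "deferred": sorted(declared - on_disk),
        # Present in both, but the edge sets disagree.
        "drift": sorted(
            name for name in on_disk & declared
            if set(actual[name]) != set(expected[name])
        ),
    }


if __name__ == "__main__":
    expected = {
        "bp-cert-manager": [],
        "bp-powerdns": [],
        "bp-cert-manager-powerdns-webhook": ["bp-cert-manager", "bp-powerdns"],
        "bp-stunner": [],
    }
    actual = {name: deps for name, deps in expected.items()
              if name != "bp-stunner"}
    print(audit_graph(actual, expected))
```

With 49 declared entries and 36 on disk, the deferred set has 49 - 36 = 13
members, which matches the counts reported above.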

This is a CI-only follow-up; chart and runtime semantics from #410 are
unchanged. Sovereign-impact deferred to Phase 8 per chart-only DoD.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:46:49 +04:00
e3mrah
0289f0388d
feat(scripts): bootstrap-kit dependency-graph audit script (W2.K0) (#259)
Adds scripts/check-bootstrap-deps.sh + scripts/expected-bootstrap-deps.yaml,
the W2.K0 deliverable from docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3.

The script parses every clusters/_template/bootstrap-kit/*.yaml, extracts
metadata.name + spec.dependsOn for the HelmRelease document(s), and
mechanically verifies the actual graph against the expected DAG declared
in scripts/expected-bootstrap-deps.yaml. It detects cycles via Kahn's
algorithm and prints the rendered DAG as ASCII grouped by Wave 2 batch
(W2.K1-K4) on success.
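
Illustrative sketch (not part of this commit): cycle detection with Kahn's
algorithm, written in Python for brevity. The script's real implementation
is not shown in this log, so the function name `find_unresolvable` and the
input shape (a mapping from HelmRelease name to its dependsOn list) are
assumptions.

```python
from collections import deque


def find_unresolvable(depends_on: dict[str, list[str]]) -> set[str]:
    """Return nodes that can never be scheduled (on or behind a cycle).

    An empty result means the graph is a DAG.
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    # In-degree here counts a node's unmet dependencies.
    unmet = {node: 0 for node in nodes}
    dependents: dict[str, list[str]] = {node: [] for node in nodes}
    for node, deps in depends_on.items():
        for dep in set(deps):
            unmet[node] += 1
            dependents[dep].append(node)

    # Kahn's algorithm: repeatedly remove nodes with no unmet dependencies.
    ready = deque(sorted(node for node in nodes if unmet[node] == 0))
    resolved: set[str] = set()
    while ready:
        current = ready.popleft()
        resolved.add(current)
        for child in dependents[current]:
            unmet[child] -= 1
            if unmet[child] == 0:
                ready.append(child)

    # Anything left over is part of a cycle or depends on one.
    return nodes - resolved


if __name__ == "__main__":
    print(find_unresolvable({"a": ["b"], "b": ["a"], "c": []}))
```

Kahn's algorithm suits an audit like this because the same pass that
detects cycles also yields a valid install order for the nodes it resolves.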

Behaviour against the in-flight expansion: HRs declared expected but not
yet on disk are reported as "deferred" (informational, not an error), so
that this script can be the static authoritative list while W2.K1-K4
PRs land their HR files in series. After all four W2 PRs merge, the
"deferred" count drops to 0 and the audit goes 100% green.

Wired into the existing .github/workflows/test-bootstrap-kit.yaml as a
new dependency-graph-audit job that runs on every PR touching:
  - clusters/** (any HR file edit)
  - scripts/check-bootstrap-deps.sh
  - scripts/expected-bootstrap-deps.yaml
  - .github/workflows/test-bootstrap-kit.yaml

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 17:16:16 +04:00
hatiyildiz
9e3268f2c5
docs(ops): comprehensive operator runbook + remediation playbook + idempotent recovery script
Adds docs/RUNBOOK-OPERATIONS.md as the single operator-facing entry point for
provisioning, troubleshooting, and recovering Catalyst Sovereigns:

A. Pre-provision checklist — Hetzner project + token, Dynadot pool zones +
   credentials, GHCR pull token (cross-link SECRET-ROTATION.md), PowerDNS pool
   zones bootstrapped, PDM healthy, bp-* chart versions, subchart-guard CI green.
B. Step-by-step walkthrough with timing — Phase 0 OpenTofu (30-60s plan +
   60-120s apply), PDM /commit (~5s), cloud-init (3-5min), Phase 1
   bootstrap-kit (10-15min), cert-manager + Cilium Gateway (1-2min). Total
   15-25min for a solo Sovereign.
C. 18 known failure modes with SYMPTOM / ROOT CAUSE / DIAGNOSIS / RECOVERY,
   each pinned to the canonical fix commit (c6cbfe68, e571ec7a, 54872009,
   2022e1af, 34c8de84, dddbab4b, 43aff202, 418cead0, 64d7de97, 330211d2,
   41c7ac13) or marked fix-in-flight where applicable.
D. Idempotent recovery script (Hetzner purge with a verification sweep for
   the DELETE-204-but-resource-persists case, PDM allocation release,
   catalyst-api deployment-record cancel). Dry-run by default; --apply gates
   real deletes on a validated HETZNER_API_TOKEN.
E. Cross-links to INVIOLABLE-PRINCIPLES, SOVEREIGN-PROVISIONING,
   RUNBOOK-PROVISIONING, BLUEPRINT-AUTHORING, CHART-AUTHORING, SECRET-ROTATION,
   PLATFORM-POWERDNS, IMPLEMENTATION-STATUS — references, doesn't duplicate.
F. Mermaid phase timeline diagram at the top showing ownership boundaries
   (catalyst-provisioner -> cloud-init -> Sovereign cluster) and hand-off points.
G. Mermaid failure decision tree at the end — operators land at the right §C
   entry in 4-6 yes/no questions.

Recovery script gracefully degrades to a name-only preview when
HETZNER_API_TOKEN is unset in dry-run mode (apply mode still hard-fails on
missing/invalid token), so operators can review what WOULD happen before
exporting the token.
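
Illustrative sketch (not part of this commit): the dry-run/apply gating
described above, in Python rather than the script's own language. The
function name `select_mode` and the mode labels are hypothetical, and this
sketch only checks that the token is present; validating it against the
Hetzner API, as the real script is said to do, is left out.

```python
import argparse


def select_mode(argv: list[str], env: dict[str, str]) -> str:
    """Decide how a recovery run should behave.

    Returns "apply", "dry-run", or "dry-run-name-only".
    Exits with an error when --apply is given without a token.
    """
    parser = argparse.ArgumentParser(description="recovery gating sketch")
    parser.add_argument("--apply", action="store_true",
                        help="perform real deletes (default is dry-run)")
    args = parser.parse_args(argv)

    token = env.get("HETZNER_API_TOKEN", "")
    if args.apply:
        if not token:
            # Apply mode hard-fails instead of degrading.
            raise SystemExit("HETZNER_API_TOKEN is required with --apply")
        return "apply"
    # Dry-run degrades to a name-only preview when no token is exported.
    return "dry-run" if token else "dry-run-name-only"


if __name__ == "__main__":
    import os
    import sys
    print(select_mode(sys.argv[1:], dict(os.environ)))
```

Keeping the destructive path behind an explicit flag means a bare
invocation can only ever print a preview.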

Verified dry-run output against the live omantel.omani.works Sovereign:
- Step 1 lists 8 Hetzner kinds + 8 verification-sweep targets to inspect
- Step 2 confirms PDM reports the subdomain currently RESERVED (live state)
- Step 3 correctly identifies catalyst-api deployment 6274daeb7a9873cd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:26:29 +02:00