Commit Graph

239 Commits

Author SHA1 Message Date
e3mrah
2c32fde847
feat(epic-5): NetBird mesh + ClusterMesh activator + DMZ vCluster scaffolds (#1100) (#1171)
Closes the EPIC-5 leftovers (per .claude/architect-briefs/epic-5/00-master-brief-leftovers.md):

* NB — bp-netbird platform Blueprint chart (default-OFF, SHA-pinned, fail-fast).
  Renders 12 resources ON: 3 Deployments (management + signal + coturn) +
  3 Services + 1 PVC + 1 HTTPRoute + 1 NetworkPolicy + 2 SealedSecrets +
  1 ConfigMap. KC realm-config ConfigMap mirrors the Guacamole pattern
  from slice K+P+X1+G #1164 — adds `netbird` OIDC client + `netbird-user` /
  `netbird-admin` realm roles + `netbird-users` / `netbird-admins` groups.

* CM — ClusterMesh activator slice on the existing Cilium chart.
  Adds platform/cilium/chart/values-clustermesh.yaml (operator-applied
  values overlay) + templates/clustermesh-config.yaml (renders the
  catalyst-clustermesh-config ConfigMap when cluster.name + cluster.id
  are set per-Sovereign). Operator runbook for `cilium clustermesh enable`
  + `cilium clustermesh connect` documented inline. Default Cilium chart
  render is unchanged — this slice is purely additive + opt-in.

* DMZ — bp-dmz-vcluster product Blueprint chart (default-OFF,
  SHA-pinned, fail-fast). Renders 4 resources ON without hostname
  (HelmRelease wrapping upstream loft-sh/vcluster + Service + 2
  NetworkPolicies); 5 resources with HTTPRoute hostname. Isolation
  pattern: own openova-system namespace inside host cluster → own Cilium
  identity → default-deny + allow-essentials NetworkPolicies → public
  egress only via designated egress gateway.

All 3 charts: helm lint clean. Tests at chart/tests/render.sh +
chart/tests/clustermesh-overlay.sh. Pre-existing CI flakes per canon §7
remain — they're not introduced by this slice.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:14:56 +04:00
e3mrah
639b94fe55
feat(epic-4): K+P+X1+G — k8s-ws-proxy + projector + WebSocket logs + Guacamole chart (#1099) (#1164)
EPIC-4 Slice K+P+X1+G — bundled backend infrastructure for the
"k9s-on-web" Cloud Resources experience:

K1 — core/cmd/k8s-ws-proxy/ — per-node WebSocket exec proxy.
HMAC-signed (X-Catalyst-HMAC: SHA256({timestamp}:{path})) WebSocket
upgrades on /proxy/exec/{ns}/{pod}/{container} bridged to the local
kube-apiserver via in-cluster ServiceAccount. v4.channel.k8s.io
subprotocol echo. Optional TMUX_CASCADE wraps in a shared
catalyst-ops tmux session. Shipped as a DaemonSet + Service with
internalTrafficPolicy=Local in platform/k8s-ws-proxy/chart/.
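
The signing scheme above can be modelled in a few lines. This is a
minimal sketch, not the proxy's actual Go code: it assumes the header
value is a hex HMAC-SHA256 over "{timestamp}:{path}" with a shared
secret, and the 30-second freshness window and the function names are
hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 30  # assumed freshness window; the commit does not state one


def sign(secret: bytes, timestamp: int, path: str) -> str:
    """Return the hex HMAC-SHA256 of "{timestamp}:{path}"."""
    message = f"{timestamp}:{path}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, path: str, signature: str,
           now: Optional[float] = None) -> bool:
    """Reject timestamps that are too old or too far in the future,
    then compare signatures in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, timestamp, path)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Because only the timestamp and path are signed, a signature minted for
one exec path does not validate for another, which is the property the
"path-only signature" test case appears to exercise.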

P1 — core/cmd/projector/ — NATS catalyst.events JetStream → Valkey
KV projector. Canonical key shape:
  cluster:{cluster-id}:kind:{kind}:{namespace}/{name}
Cold-start does a full LIST across DefaultKinds, then catches up on
the 24h replay window. Multi-replica safe (durable consumer queue
group, last-write-wins on namespacedName). Shipped as a default-OFF
Deployment + RBAC under products/catalyst/chart/templates/services/projector/.
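
As an illustration of the key shape and last-write-wins behaviour, here
is a small sketch that uses a plain dict in place of Valkey. The event
dict layout and both function names are hypothetical; only the key
format and the ADD/MOD/DEL event types come from this commit. The
cluster-scoped key variant is not spelled out here, so it is not
modelled.

```python
def projection_key(cluster_id: str, kind: str, namespace: str, name: str) -> str:
    """Build the canonical key for a namespaced object."""
    return f"cluster:{cluster_id}:kind:{kind}:{namespace}/{name}"


def apply_event(store: dict, cluster_id: str, event: dict) -> None:
    """Project one event into the store; later writes overwrite earlier ones."""
    key = projection_key(cluster_id, event["kind"], event["namespace"], event["name"])
    if event["type"] == "DEL":
        store.pop(key, None)  # deleting an absent key is a no-op
    else:  # ADD or MOD
        store[key] = event["object"]
```

Keying on the namespaced name makes replays idempotent: applying the
same event twice leaves the store unchanged.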

X1 — products/catalyst/bootstrap/api/internal/handler/k8s_logs.go —
WebSocket Pod-log streaming endpoint:
  GET /api/v1/sovereigns/{id}/k8s/logs/{ns}/{pod}/{container}
      ?follow&tailLines&since=<rfc3339>&previous
Reads from kubelet via client-go GetLogs().Stream(); each WS frame =
one log line. Supports `since` resume. Reuses RequireSession middleware
+ chroot cluster-id resolver. New k8scache.Factory.CoreClient(id)
accessor exposes the per-cluster typed client without duplicating
kubeconfig parsing.
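
The query parameters above could be parsed along these lines. This is
a sketch only: parse_log_options is a hypothetical Python analogue of
the Go parseLogOptions helper, and the defaults and the handling of
bare flags are assumptions rather than the handler's documented
behaviour.

```python
from datetime import datetime
from urllib.parse import parse_qs


def parse_log_options(query: str) -> dict:
    """Parse the log endpoint's query string into typed options."""
    params = parse_qs(query, keep_blank_values=True)

    def flag(name: str) -> bool:
        # Treat a bare "?follow" and "follow=true" alike (assumed convention).
        return name in params and params[name][-1].lower() in ("", "1", "true")

    tail_lines = None
    if "tailLines" in params:
        tail_lines = int(params["tailLines"][-1])  # ValueError on non-integers
        if tail_lines < 0:
            raise ValueError("tailLines must be non-negative")

    since = None
    if "since" in params:
        # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
        # so normalise it to an explicit UTC offset first.
        since = datetime.fromisoformat(params["since"][-1].replace("Z", "+00:00"))

    return {
        "follow": flag("follow"),
        "previous": flag("previous"),
        "tail_lines": tail_lines,
        "since": since,
    }
```

A client resuming a dropped stream would pass the timestamp of the last
line it received as `since`.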

G1 — platform/guacamole/chart/ — full Apache Guacamole chart:
guacd Deployment + Service, Tomcat webapp Deployment + Service,
Cilium Gateway HTTPRoute, SeaweedFS-PVC for recordings (RWO,
hcloud-volumes), SealedSecret placeholder for Keycloak OIDC client
secret, NetworkPolicy (default-deny + selective egress to KC +
k8s-ws-proxy + SeaweedFS + NATS), and ConfigMap consumed by
keycloak-config-cli post-deploy Job (mirrors platform/keycloak
realm-config pattern). Default-OFF gate; full-ON renders 9
resources. Empty image.tag / hostname / oidc.issuer fail-fast at
helm template time per INVIOLABLE-PRINCIPLES #4a/#5. ONE Guacamole
per Sovereign per ADR-0001 §11. Blueprint manifest uses
v1alpha1 + version "0.1.0" + upgrades.from ["0.x"].

Tests:
- k8s-ws-proxy: HMAC happy/expired-old/expired-future/malformed/
  bad-signature, path-only signature, WS upgrade + protocol echo,
  bad path, bad HMAC, denied namespace via httptest.
- projector: Apply ADD/MOD/DEL/validation, key shape (ns-scoped +
  cluster-scoped), handleOne ack/nak/term routing with fakeMsg,
  cold-start LIST + project + error continuation via dynamicfake.
- X1: parseLogOptions defaults + edge cases + bad query params,
  503/404/400 paths + full WS happy-path with kfake clientset.
- G1: chart/tests/render.sh — default-OFF=0, empty-tag fail-fast,
  full-ON=9 resources, every required kind present, realm-config
  wires OIDC client.
- bp-k8s-ws-proxy chart: chart/tests/render.sh — default-OFF=0,
  empty-tag fail-fast, full-ON=5 resources.

Pre-existing test status: TestPinIssue and TestBootstrapKit/gitea
remain flaky on main per canon §7 — verified not introduced by
this slice.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:27:39 +04:00
e3mrah
a0c356fe34
fix(cnpg-pair): drop bp-cnpg: prefix from upgrades.from semver range (#1156)
Other platform/*/blueprint.yaml files use bare semver-range strings
(e.g. ["0.x"]) without the bp-name: prefix. C3 blueprint-controller's
validate package rejects "bp-cnpg:1.x" as an invalid semver range,
breaking TestValidate_ExistingBlueprintCorpus on any PR after #1153.

Found by EPIC-6 K-Cont-2 (#1155). Brief at C-DB-1 (.claude/architect-briefs/
epic-6/02-) was wrong — the slice author followed the brief literally.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 06:51:09 +04:00
e3mrah
746901b671
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.

What ships:
  platform/cnpg-pair/
  ├── chart/
  │   ├── Chart.yaml             # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
  │   ├── values.yaml            # default-OFF gate; placement schema constrains active-hotstandby ONLY
  │   ├── templates/
  │   │   ├── _helpers.tpl              # fail-fast on empty image.tag; region pair validation
  │   │   ├── primary-cluster.yaml      # CNPG Cluster CR (region-pinned via openova.io/region affinity)
  │   │   ├── replica-cluster.yaml      # CNPG Cluster CR (replica.enabled=true; externalClusters[])
  │   │   ├── service-replication.yaml  # Cilium ClusterMesh global Service
  │   │   ├── failover-readiness.yaml   # probe Pod flips Ready when WAL lag < threshold
  │   │   ├── networkpolicy.yaml        # default-deny carve-outs for replication + probe
  │   │   └── audit-config.yaml         # NATS audit subjects + types this Blueprint emits
  │   ├── blueprint.yaml          # configSchema + placementSchema (active-hotstandby ONLY)
  │   ├── README.md               # 80-line deployment + failover semantics
  │   └── tests/cnpg-pair-render.sh  # 5-case render gate
  └── DESIGN.md                   # topology, lag-threshold rationale, deferred C-DB-3 plan

Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.
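
The gate and fail-fast rules described above amount to the following
decision logic. The chart implements this in Helm templates; the
Python below is only a model of it, and the flat key names (enabled,
imageTag, primaryRegion, replicaRegion) are hypothetical stand-ins for
the chart's real values paths.

```python
def should_render(values: dict) -> bool:
    """Return False when the gate is off; raise when an enabled config is invalid."""
    if not values.get("enabled", False):
        return False  # default-OFF: render zero resources
    if not values.get("imageTag"):
        raise ValueError("image.tag must be set")
    primary = values.get("primaryRegion")
    replica = values.get("replicaRegion")
    if not primary or not replica:
        raise ValueError("both regions must be set")
    if primary == replica:
        raise ValueError("primary and replica regions must differ (degenerate pair)")
    return True
```

Checking the gate first means a disabled chart never fails on missing
values, which keeps the default render empty rather than erroring.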

CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds an opt-in annotation,
`catalyst.openova.io/smoke-render-mode: default-off`, so charts that
legitimately render zero resources at default values (this chart +
future bp-*-pair Blueprints) skip the `<5 lines` empty-render check.
The chart's own tests/cnpg-pair-render.sh covers the enabled-render
path; without the annotation the empty-render check still fires
unchanged.

Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
  - service.cilium.io/global=true ClusterMesh global Service annotation
    (first chart in the repo to use it; pattern reused by Continuum
    K-Cont-2 for HTTPRoute weight=0 cross-region drains)
  - bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
    Cluster CRs colocated in one Blueprint, region-pinned via
    openova.io/region node-affinity)
  - audit-config ConfigMap co-located with the emitting Blueprint
    (label-selector discovery for K-Cont-2 + U-DR-1; future
    bp-*-pair Blueprints follow this convention)
  - smoke-render-mode=default-off Chart.yaml annotation opt-in for
    the blueprint-release smoke gate

C-DB-2 (publish): the existing blueprint-release.yaml workflow
auto-detects `platform/*/chart/**` paths, so no allowlist edit is
required. The first push triggers the
`ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.

C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.

Tests:
  - bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
  - helm lint platform/cnpg-pair/chart ✓ clean
  - helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
  - smoke-gate logic simulated locally ✓ default-off annotation honored

Pre-existing CI failures untouched:
  - TestPinIssue rate-limit flake — not affected by chart-only slice
  - TestBootstrapKit/gitea version drift — only iterates over a fixed
    10-chart bootstrap list (no cnpg-pair entry)

Out of scope per brief (all deferred to dedicated slices):
  - K-Cont-2 reconciler logic
  - K-Cont-3 lease witness
  - K-Cont-4 Cloudflare Worker
  - C-DB-3 1M-row acceptance test
  - Application controller changes
  - U-DR-1 UI

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 05:16:55 +04:00
e3mrah
a6ccdcef41
feat(rbac): /rbac/assign find-or-create + /rbac/access-matrix + boundary validator (slice A, #1098) (#1143)
EPIC-3 slice A bundles three deliverables on top of the just-landed
slice T1 (5-tier ClusterRoles):

A1 — POST /api/v1/sovereigns/{id}/rbac/assign
  Find-or-create-role endpoint backing the multi-grant editor (slice
  U1). Race-tolerant 409 retry follows the EnsureUser pattern. Three
  paths: created / updated (tier rotation on existing scope) / no-op.
  Authoring side: writes a UserAccess CR with
  metadata.labels[catalyst.openova.io/tier]=<tier> + spec.tierRoleRef
  + spec.scopes[].
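
The three outcomes can be sketched with an in-memory dict standing in
for the apiserver. The function name, the store layout, and the choice
to key grants on (user, normalized scopes) are assumptions made for
illustration; the 409 retry is not modelled.

```python
def assign(store: dict, user: str, scopes: list, tier: str) -> str:
    """Find or create a grant and report which of the three paths was taken."""
    # Normalize scopes so that ordering does not create duplicate grants.
    key = (user, tuple(sorted((s["key"], s["value"]) for s in scopes)))
    existing = store.get(key)
    if existing is None:
        store[key] = tier
        return "created"
    if existing != tier:
        store[key] = tier  # tier rotation on an existing scope
        return "updated"
    return "no-op"
```

Returning the path taken lets a caller such as the multi-grant editor
distinguish a fresh grant from a tier change without a second read.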

A2 — GET /api/v1/sovereigns/{id}/rbac/access-matrix
  Manara-style users × applications × tier matrix with per-CR
  warnings (developer-tier missing env-type=dev surfaces inline).
  Optional org/application filters. Pure aggregator extracted for
  testability — no apiserver, no clock.

A3 — Kyverno ClusterPolicy `useraccess-boundary`
  Denies cross-Organization UserAccess grants unless the requester
  is a member of a management Org with tier=owner. Default Audit
  (values-driven action). Test fixtures + kyverno-test.yaml shape
  ready for kyverno-CLI CI step in a follow-up slice.

UserAccess CRD extension:
  - spec.tierRoleRef (string, openova:tier-* pattern)
  - spec.scopes[] ({key, value})
  - applications[] no longer required (legacy + new shapes coexist)

Test coverage (26 new tests, race-clean):
  - A1: 3-path find-or-create, 409 retry, validation, 404
  - A2: matrix shape + filters + warnings, http happy/empty/404
  - Pure helpers: scope normalization/equality, CR-name determinism

Pre-existing failure `TestPinIssue_ConcurrentRapidFireRateLimit`
(rate-limit timing flake) reproduced on clean main per canon §7;
not introduced by this slice.

Refs: EPIC-3 master brief at .claude/architect-briefs/epic-3/, slice
A brief at 02-A-rbac-assignment-endpoints.md, T1 ancestor #1142.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 03:20:50 +04:00
e3mrah
c215468a61
feat(rbac): land 5-tier ClusterRoles (slice T1, #1098) (#1142)
Renders 5 ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
via Helm template with inherit-chain expansion. Find-or-create-role
endpoint (slice A1, future) targets these via roleRef on UserAccess CRs.

Per-tier action sets live in values.yaml's new `tierActions:` block (227
lines authored by the EPIC-3-T agent before a stream timeout; the
Coordinator finished the template + helper):

- tier-viewer (level 10): 6 rules — `*.read` on common kinds
- tier-developer (level 20): 10 rules — viewer + workloads.exec/console
  + tickets + sessions.playback. Auto-injected scope `openova.io/env-type=dev`
  surfaced via ClusterRole annotation (slice T3 follow-up reads it).
- tier-operator (level 30): 15 rules — developer + console.connect.admin
  + sam.manage + patches.manage + tickets.accept
- tier-admin (level 40): 29 rules — operator + compute.* (no delete)
  + credentials.* + applications.* + actions.* + accounts.* + networks.*
  + sessions.* + workloads.*
- tier-owner (level 50): 33 rules — admin + rbac.* + organization.*
  + compute.delete

Total 93 RBAC rules across the 5 ClusterRoles.

Inherit chain expansion via _tier-helpers.tpl `catalyst.tierRules`
template helper. Each ClusterRole's `metadata.labels` carries:
- `catalyst.openova.io/tier-name: <tier>`
- `catalyst.openova.io/tier-level: <int>` (10/20/30/40/50; same integer
  the Keycloak realm-role attribute carries — admin_roles.go:88-92)

`metadata.annotations.catalyst.openova.io/enforced-scopes` JSON-encodes
the per-tier scope auto-injection contract (developer-only today).

Per ADR-0001 §2.7: ClusterRoles (not Roles) so the same role works for
both namespace-scoped (RoleBinding) and cluster-scoped (ClusterRoleBinding)
UserAccess targets.

Per docs/INVIOLABLE-PRINCIPLES.md #4: every action set is in values.yaml,
not hardcoded — operators extend per-Sovereign without editing the
template. The `tiers.enabled` master gate + per-tier `enforcedScopes[]`
are also operator-tunable.

Validated:
- `helm lint` clean (1 INFO about chart icon, pre-existing)
- `helm template` renders exactly 5 ClusterRoles with the expected
  inherit-chain rule counts (6 → 10 → 15 → 29 → 33)
- Inherit chain helper handles base case (viewer has no inherit) and
  caps recursion at 10 levels (defensive)
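
The inherit-chain arithmetic can be checked with a short model. This
is a sketch, not the `catalyst.tierRules` helper itself: the per-tier
increments (4, 5, 14, 4) are derived from the cumulative counts listed
above on the assumption that inherited rules are carried over without
deduplication, and the rule names are placeholders.

```python
MAX_DEPTH = 10  # the commit says recursion is capped at 10 levels

# tier -> (parent tier, number of rules the tier adds on top of its parent)
TIERS = {
    "viewer": (None, 6),
    "developer": ("viewer", 4),
    "operator": ("developer", 5),
    "admin": ("operator", 14),
    "owner": ("admin", 4),
}


def expand(tier: str, depth: int = 0) -> list:
    """Return the tier's inherited rules followed by its own placeholder rules."""
    if depth >= MAX_DEPTH:
        raise RecursionError(f"inherit chain deeper than {MAX_DEPTH} levels")
    parent, own_count = TIERS[tier]
    inherited = expand(parent, depth + 1) if parent else []
    return inherited + [f"{tier}-rule-{i}" for i in range(own_count)]
```

Expanding each tier in order gives 6, 10, 15, 29 and 33 rules, and
those five counts sum to the 93 rules quoted for the whole set.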

Out of scope (deferred to follow-up slices):
- T2: Keycloak composite realm-role bootstrap (init Job in catalyst-api
  startup that creates 5 `catalyst-<tier>` realm roles + composite chain)
- T3: useraccess-controller mod for developer scope auto-injection
  (reads enforced-scopes annotation from this template's ClusterRoles)

Refs: #1094, #1098, docs/EPICS-1-6-unified-design.md §6.2
(authoritative tier action-set spec).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:53:39 +04:00
e3mrah
f1d0801ad2
feat(catalyst-api): compliance score aggregator + handler (slice S, #1096) (#1141)
Joins Kyverno PolicyReports + slice W2's compliance-evaluator events
+ EnvironmentPolicy weights into per-resource → per-Application →
per-Environment → per-Organization → per-Sovereign weighted scores.
Outputs SSE for live updates, REST for snapshots, Prometheus
catalyst_compliance_* gauges/counters, and (when CATALYST_NATS_URL is
wired) NATS JetStream KV `policy-rollup` for replayable history.

S1 — internal/handler/compliance.go:
  * REST endpoints under /api/v1/sovereigns/{id}/compliance/
    - GET /scorecard   — per-app/env/org/sovereign rollups
    - GET /policies    — per-policy weight + mode + violation tally
    - GET /violations  — paginated fail rows, ?app=<name>
    - GET /stream      — SSE for live score updates
  * Watch loop subscribes to k8scache.Factory fanout for kinds
    {policyreport, clusterpolicyreport, compliance-evaluator,
     deployment, statefulset, daemonset, pod}. Per ADR-0001 §5
    every score recompute is event-driven; no polling.
  * Pure computeScore() function with edge cases tested:
    all-pass=100, all-fail=0, half-pass=50, skip drops from denom,
    empty-weights fallback to equal weights, stateful/stateless scope
    filters, missing verdict drops policy, warn pulls score down.
  * NATS KV writes via nil-tolerant PolicyRollupPublisher interface
    keyed `<scope>:<id>`. Sentinel resolver wires when env is set;
    nil keeps the aggregator running on SSE+Prometheus only.
  * EnvironmentPolicy CR resolution via dynamic-client; nil/404
    falls back to default equal-weights so a fresh Sovereign without
    a tuned policy still scores correctly.
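
The edge cases listed for computeScore() suggest a weighted average
along these lines. This is a hedged sketch rather than the Go
implementation: giving "warn" half credit, defaulting unlisted
policies to weight 1.0, and returning None when nothing is scorable
are all assumptions, and the scope filters are left out.

```python
from typing import Optional

# Credit per verdict; the 0.5 for "warn" is an assumed value.
CREDIT = {"pass": 1.0, "warn": 0.5, "fail": 0.0}


def compute_score(verdicts: dict, weights: Optional[dict] = None) -> Optional[float]:
    """Return a 0-100 weighted score, or None when no policy is scorable."""
    weights = weights or {}  # empty weights fall back to equal weighting
    earned = 0.0
    total = 0.0
    for policy, verdict in verdicts.items():
        if verdict not in CREDIT:
            continue  # "skip" and missing verdicts drop out of the denominator
        weight = weights.get(policy, 1.0)
        earned += weight * CREDIT[verdict]
        total += weight
    if total == 0:
        return None
    return 100.0 * earned / total
```

Keeping the function pure (no clock, no I/O) is what makes each of the
listed edge cases testable as a one-line assertion.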

S2 — platform/mimir/chart/templates/prometheusrule-compliance.yaml:
  * Recording rules:
    - catalyst:compliance_score:by_application:1h_avg
    - catalyst:compliance_violations:by_policy:5m_rate
    - catalyst:compliance_score:by_sovereign:1h_avg
    - catalyst:compliance_policy_enforcing:by_policy
  * Pager alerts: ComplianceScoreRegression (>10pt drop in 1h) +
    ComplianceEnforcingPolicyHighViolations (>50/hr in enforcing
    mode). Every threshold a values.yaml knob per
    docs/INVIOLABLE-PRINCIPLES.md #4.
  * Capabilities-gated on monitoring.coreos.com/v1 so a fresh
    Sovereign without bp-kube-prometheus-stack doesn't fail render.

Tests:
  * 18 unit + integration tests in compliance_test.go covering the
    full computeScore matrix, the watch-loop end-to-end via
    Factory.Publish injection, and every HTTP endpoint (scorecard,
    policies, violations pagination, stream, 503 nil-handler).
  * `go test -count=1 -race ./internal/handler/...` clean (5 runs).
  * `go vet ./...` clean.

Pre-existing CI failures (TestPinIssue_ConcurrentRapidFireRateLimit,
TestRun_FailsFastOnDynadotError, TestAuthHandover_HappyPath nil-ptr,
TestValidate_*Harbor_robot_token*) confirmed not introduced by this
slice — they reproduce on clean main.

Per ADR-0001 §3 (5 stores): score history lives in NATS JetStream KV;
no Postgres/FerretDB shadow store. Per ADR-0001 §5 (event-driven):
every score recompute fires off a Subscribe event. Per
INVIOLABLE-PRINCIPLES #4: SSE retention, KV TTL, alert thresholds all
runtime-configurable.

Closes the S column of EPIC-1 master plan; UI slices U1-U5 can now
consume the SSE event shape.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:37:31 +04:00
e3mrah
d74e0d5e5a
feat(bp-kyverno): land 19 compliance ClusterPolicy templates (slice K, #1096) (#1138)
Slice K of EPIC-1 (#1096) compliance engine: authors the baseline
policy library that the score aggregator (slice S) will consume via
PolicyReport rows. K1 ships 13 baseline policies and K2 adds 7 more,
for 20 policy slots. One K2 slot (hubble-flows-seen, #16) is a stub
file: Kyverno can't natively reach Cilium Hubble's gRPC API, so the
synthetic PolicyReport row is emitted by slice W2's hubble.go
evaluator (per design §4.1). The stub keeps the policy slot explicit
in the bundle, which is why 19 ClusterPolicies render rather than 20.

Architecture per docs/EPICS-1-6-unified-design.md §4.3:

  K1 (13 baseline)
    01 multi-replica-drainability  (resilience, permissive)
    02 pdb-permits-eviction        (resilience, permissive)
    03 topology-spread             (resilience, permissive)
    04 probes-present              (resilience, enforcing)
    05 resource-requests           (resilience, enforcing)
    06 resource-limits             (resilience, permissive)
    07 pvc-volume-expansion        (resilience, permissive — stateful)
    08 hpa-effective               (resilience, permissive)
    09 cilium-l7-mtls              (security,   enforcing)
    10 flux-managed                (governance, enforcing)
    11 harbor-proxy-pull           (governance, enforcing)
    12 image-tag-pinned            (governance, enforcing)
    13 prometheus-scrape           (observability, permissive)

  K2 (7 added)
    14 networkpolicy-present       (security, permissive)
    15 otel-injected               (observability, permissive)
    16 hubble-flows-seen           (deferred to W2 evaluator)
    17 runasnonroot-readonlyrootfs (security, permissive)
    18 cosign-verified             (security, permissive)
    19 secret-not-in-env           (security, permissive)
    20 backup-configured           (resilience, permissive)

Per docs/INVIOLABLE-PRINCIPLES.md #4 every operationally-meaningful
value is runtime-configurable via .Values.compliancePolicies.<name>.*:
  - enabled (default false — operator opts in)
  - action (Audit | Enforce; default Audit; flipped per-Environment by
    EnvironmentPolicy.spec.compliance.modes once C2 controller lands)
  - excludeNamespaces (default exempts kube-system, flux-system, etc.)
  - per-policy specifics (allowedRegistryRegex, cosign keys, ...)

Test gate (helm template):
  - default-OFF (no overrides): 0 ClusterPolicy rendered
  - all-ON                    : 19 ClusterPolicy rendered
helm lint clean both ways.

Slice S1 (score aggregator) will join PolicyReport rows from these
policies + synthetic rows from W2 evaluators against EnvironmentPolicy
weights. UI surfaces (slices U1-U5) consume the SSE/NATS rollups.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:57:51 +04:00
e3mrah
f18dd8df19
feat(bp-opentelemetry-operator): scaffold operator + default Instrumentation CR (slice H5, #1095) (#1121)
New platform/opentelemetry-operator/ Blueprint scaffold per design doc
§3.9 row 5. Companion to existing bp-opentelemetry (the collector) —
this Blueprint ships the OPERATOR that auto-injects OTel SDK sidecars
into Pods based on annotations:

  instrumentation.opentelemetry.io/inject-{java|nodejs|python|dotnet}: "default"

Two-Blueprint split is intentional: collector and operator are separate
upgrade cycles. Mixing them risks coupling observability cadence to
auto-instrumentation cadence, and the operator's mutating admission
webhook intercepts every Pod creation cluster-wide so misconfiguration
is high-blast-radius.

What ships:
- platform/opentelemetry-operator/README.md — activation contract
- platform/opentelemetry-operator/blueprint.yaml — bp-opentelemetry-operator 1.0.0
- platform/opentelemetry-operator/chart/Chart.yaml — wraps upstream
  opentelemetry-operator:0.61.0 from open-telemetry-helm-charts.
  Subchart `condition: enabled` — default-off skips it entirely.
- platform/opentelemetry-operator/chart/values.yaml — gate, default
  Instrumentation CR config (exporterEndpoint, sampler, per-language
  toggles), upstream subchart values (manager.collectorImage.repository
  required, serviceAccount, cert-manager-backed admission webhook)
- platform/opentelemetry-operator/chart/templates/instrumentation-default.yaml
  — Catalyst overlay Instrumentation CR with parentbased_traceidratio
  sampler @ 0.25 default, propagators (tracecontext + baggage + b3),
  per-language injection toggles. Default OFF; namespace = cilium by
  default (operator overrides per Sovereign).

Default-OFF for both layers:
- .Values.enabled: false → the upstream subchart's `condition: enabled`
  also evaluates false and skips the subchart, so 0 resources render
  in total
- Even after .Values.enabled=true, the Catalyst Instrumentation CR
  is gated again by .Values.defaultInstrumentation.enabled=false so
  installing the chart doesn't auto-inject anywhere

Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (sampler ratio,
exporter endpoint, per-language toggles, namespace) is in values.yaml.

Validated:
- helm dependency build pulls upstream cleanly
- helm template with default values: 0 resources rendered
- helm template with enabled=true defaultInstrumentation.enabled=true:
  22 resources rendered (upstream operator manager Deployment, CRDs,
  RBAC, mutating + validating webhooks, cert-manager Issuer +
  Certificate, plus the Catalyst Instrumentation CR)

Out of scope for this slice:
- Add this Blueprint to clusters/_template/bootstrap-kit/ — EPIC-5
  (#1100) sequences both bp-opentelemetry (collector first) and this
  Blueprint as part of the observability roll-out
- Per-Application Instrumentation CRs from Blueprint.spec.observability.
  traces=otlp — application-controller (slice C4 of #1095) renders
  those at install time

Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 5
+ §8.4 (EPIC-5 Networking).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 23:06:29 +04:00
e3mrah
5915e309dc
feat(bp-kyverno): land label-vocab mutate + validate ClusterPolicies (slices E1+E2, #1095) (#1120)
Realizes design doc §3.6 (Label-vocabulary enforcement). Two
ClusterPolicies that together implement the contract in §1: the
openova.io/* label set is the join key across compliance scoring
(#1096), RBAC scope matching (#1098), billing (post-Phase-1), and
networking (#1100). If labels are missing, every downstream consumer
is blind.

E1 — mutate-add-openova-labels (slice E1):
- Mutating ClusterPolicy that derives missing openova.io/{org, env,
  application, blueprint, managed-by} labels from namespace annotations
  + ownerReferences and adds them at admission.
- Three rules:
  * add-org-from-namespace-annotation
  * add-env-from-namespace-annotation
  * add-managed-by-flux-when-flux-instance-label
- Best-effort safety net — Catalyst controllers (C1/C2/C4) are the
  authoritative source. This rule covers resources created OUTSIDE
  the controller path (e.g. a debug Pod from kubectl run, a CronJob
  authored manually).

E2 — validate-require-openova-labels (slice E2):
- Validating ClusterPolicy that REJECTS workload resources missing
  required openova.io/* labels.
- Default action `Audit` (permissive) — per-Environment overlay
  flips to `Enforce` (blocking) via EnvironmentPolicy.spec.modes
  in EPIC-1 #1096.
- One rule per required label (templated from .Values.kyvernoOverlay.
  labelVocab.validate.requiredLabels) — lets the Audit/Enforce decision
  be per-label rather than all-or-nothing.
- excludeNamespaces list exempts control-plane namespaces (kube-system,
  flux-system, cilium, cert-manager, openova-system, catalyst, etc.)
  so existing Sovereign infra doesn't trip on missing org labels.

Both default OFF (.Values.kyvernoOverlay.labelVocab.{mutate,validate}.
enabled). Operator opts in once the prerequisite Organization (slice
B1) + Environment (slice B2) CRs exist on the cluster, otherwise the
mutate rule has nothing to derive from and the validate rule rejects
every workload.
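
The interplay between the two policies can be modelled briefly. This
is an illustrative sketch, not Kyverno policy syntax: it assumes the
namespace annotations use the same keys as the labels they seed, and
REQUIRED is a two-label subset chosen for the example.

```python
REQUIRED = ("openova.io/org", "openova.io/env")  # illustrative subset


def mutate(labels: dict, namespace_annotations: dict) -> dict:
    """Add missing required labels from namespace annotations; never overwrite."""
    out = dict(labels)
    for key in REQUIRED:
        if key not in out and key in namespace_annotations:
            out[key] = namespace_annotations[key]
    return out


def missing_labels(labels: dict) -> list:
    """Return the required labels a workload still lacks after mutation."""
    return [key for key in REQUIRED if key not in labels]
```

With no annotations on the namespace, mutate() returns the labels
unchanged and missing_labels() reports every required key, which is the
"validate rule rejects every workload" failure mode described above.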

Per docs/INVIOLABLE-PRINCIPLES.md #4, every list (requiredLabels,
resourceKinds, excludeNamespaces, action) is in values.yaml.

Validated:
- helm dependency build pulls upstream kyverno cleanly
- helm template with default values: 0 ClusterPolicy resources rendered
- helm template with both gates enabled: exactly 2 ClusterPolicies
  rendered (mutate-add-openova-labels + validate-require-openova-labels)

Chart version bumped 1.0.1 → 1.1.0 (minor — new templates, no breaking).
Blueprint.yaml mirrored 1.0.0 → 1.1.0.

Refs: #1094, #1095, #1096, #1098, #1100, docs/EPICS-1-6-unified-design.md
§1 (label vocab) + §3.6 (E1+E2 scope).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 23:01:43 +04:00
e3mrah
e1d7bf18be
feat(bp-hcloud-csi): scaffold Hetzner CSI driver Blueprint (slice H6, #1095) (#1119)
New platform/hcloud-csi/ Blueprint scaffold per design doc §3.9 row 6.
Wraps the upstream hetznercloud/csi-driver Helm chart and ships the
Catalyst-managed `hcloud-volumes` StorageClass that multi-node stateful
workloads (CNPG primary/replica pairs in EPIC-6 #1101) need.

Default-OFF: chart is a no-op until .Values.enabled is true. Even after
enabling, the cluster's default StorageClass is NOT flipped unless
.Values.defaultStorageClass is also true — that's a destructive change
for Pods relying on the previous default's binding semantics, so the
in-place migration plan is operator-scheduled.

What ships:
- platform/hcloud-csi/README.md — activation contract, why-default-OFF
- platform/hcloud-csi/blueprint.yaml — bp-hcloud-csi 1.0.0, configSchema
- platform/hcloud-csi/chart/Chart.yaml — wraps upstream
  hcloud-csi:2.13.0 from charts.hetzner.cloud, condition=enabled gate
- platform/hcloud-csi/chart/values.yaml — gate, default-storageclass
  flag, hetznerTokenSecretRef (SealedSecret), catalystStorageClasses
  array (renamed from storageClasses to avoid collision with upstream's
  storageClasses key), volumeSnapshotClass block (default off)
- platform/hcloud-csi/chart/templates/storageclass.yaml — renders one
  StorageClass per catalystStorageClasses[] entry; first entry annotated
  as cluster default when defaultStorageClass=true
- platform/hcloud-csi/chart/templates/volumesnapshotclass.yaml —
  VolumeSnapshotClass for backup workflows; default off

Why a separate Blueprint, not a values toggle on bp-cilium:
- CSI drivers are independent of CNI. Mixing them risks coupling the
  network-plane upgrade cycle to the storage-plane upgrade cycle.

Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (StorageClass list,
SealedSecret reference, replicas, resource requests) is in values.yaml.

Validated:
- helm dependency build pulls upstream hcloud-csi:2.13.0 cleanly
- helm template with default values: 0 resources rendered (gate +
  Chart.yaml condition both fire correctly)
- helm template with enabled=true defaultStorageClass=true: 7 resources
  rendered (upstream CSI controller Deployment, node DaemonSet, CSIDriver,
  RBAC, plus Catalyst hcloud-volumes StorageClass with the
  storageclass.kubernetes.io/is-default-class annotation)

Schema collision lesson:
- Initial draft used .Values.storageClasses[] which collided with the
  upstream subchart's storageClasses array (different shape; subchart
  expects array under that exact name). Renamed to catalystStorageClasses
  + passed [] to upstream's hcloud-csi.storageClasses to suppress its
  own StorageClass rendering. Lesson logged in seam map.

Refs: #1094, #1095, #1101, docs/EPICS-1-6-unified-design.md §3.9 row 6,
docs/SRE.md §2.5, platform/cnpg/README.md.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:56:19 +04:00
e3mrah
eca27002ae
feat(bp-cilium): add Hubble UI HTTPRoute overlay (slice H7, #1095) (#1117)
Realizes design doc §3.9 row 7 (Hubble relay+UI on; OIDC ingress) as a
default-OFF scaffold that EPIC-5 (#1100) flips on per Sovereign once the
zero-trust observability tier is ready.

Why default-OFF in Phase-0:
- Hubble relay/UI is intentionally off in production today (SovereignA
  was crash-looping because the monitoring.coreos.com/v1 ServiceMonitor
  was missing until bp-kube-prometheus-stack reconciled; issue #182).
- The OIDC enforcement at the gateway boundary is the missing piece —
  Cilium's L7 OIDC filter wires to bp-keycloak's `hubble-ui` client
  which lands in slice D1.
- Flipping the gate without the OIDC layer would leave Hubble UI
  publicly accessible. The template comments explicitly warn against
  this for production.

What ships:
- platform/cilium/chart/templates/hubble-ui-httproute.yaml — HTTPRoute
  exposing hubble-ui Service via cilium-gateway with the wildcard cert.
  Gated by `catalystOverlay.hubbleUI.{enabled,hostname}`.
- platform/cilium/chart/values.yaml `catalystOverlay:` block: hubbleUI.{
  enabled, hostname, gatewayRef.{name,namespace},
  serviceRef.{name,namespace,port}, auth (oidc|none, default oidc) }.
  All operator-overrideable per docs/INVIOLABLE-PRINCIPLES.md #4.

Operator opt-in path (per-Sovereign overlay at clusters/<sov>/bootstrap-kit/
01-cilium.yaml):
  spec.values.cilium.hubble.relay.enabled: true
  spec.values.cilium.hubble.ui.enabled: true
  spec.values.catalystOverlay.hubbleUI.enabled: true
  spec.values.catalystOverlay.hubbleUI.hostname: hubble.<sovereign-domain>
… AND bp-keycloak realm has a `hubble-ui` OIDC client (slice D1).

Validated:
- helm template with default values: 0 HTTPRoute resources rendered
- helm template with catalystOverlay.hubbleUI.enabled=true + hostname:
  exactly 1 HTTPRoute rendered with proper parentRefs/hostnames/backendRefs
- Original 34-resource render count unchanged in default mode (no
  regression to existing chart output)

Chart version bumped 1.2.1 → 1.3.0 (minor: new templates, no breaking changes).

Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 7,
§8 (EPIC-5 Networking).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:44:18 +04:00
e3mrah
68c68eaf7a
feat(bp-network-policies): land default-deny CCNP + system-namespace + DNS allow templates (slice H8, #1095) (#1116)
New platform/network-policies/ Blueprint scaffold per design doc §3.9 row 8.
Ships the cluster-wide zero-trust primitives that EPIC-5 (#1100) activates
as part of the networking roll-out.

What ships:
- platform/network-policies/blueprint.yaml — bp-network-policies 1.0.0
- platform/network-policies/chart/Chart.yaml — Helm chart, no upstream sub-chart
- platform/network-policies/chart/values.yaml — gate (enabled: false default)
- platform/network-policies/chart/templates/default-deny.yaml — CCNP that
  denies all ingress + egress at endpointSelector: {} (full-cluster scope)
- platform/network-policies/chart/templates/allow-system-namespaces.yaml —
  CCNP allowing full traffic for kube-system, flux-system, cilium,
  cert-manager, catalyst, openova-system, monitoring, ingress (set is
  parametric via .Values.allowSystemNamespaces — operator extends per
  Sovereign for gitea/harbor/loki etc.)
- platform/network-policies/chart/templates/allow-egress-dns.yaml — CCNP
  permitting UDP/TCP/53 to CoreDNS from every Pod (without this the cluster
  is unbootable under default-deny — first DNS lookup fails)

Why a separate Blueprint, not bp-cilium:
- bp-cilium is foundational, installed on every cluster on day 0.
  Default-deny breaks every workload that hasn't been allowlisted, so it
  cannot ship in bp-cilium without operator opt-in semantics.
- Separate Blueprint with enabled: false default preserves the safety
  boundary. EPIC-5 wires the activation when the rest of the zero-trust
  story is ready.

Per-namespace intra-namespace allow is intentionally NOT in this slice:
- Cilium CCNPs cannot express "same namespace as the source Pod" without
  listing every namespace, which contradicts dynamic Org provisioning.
- That allow rule is rendered as a per-namespace CiliumNetworkPolicy (CNP,
  namespace-scoped) by organization-controller (slice C1 of #1095) at
  Organization creation time. README + values.yaml note this for
  downstream Implementers.

Per docs/INVIOLABLE-PRINCIPLES.md #4, every policy parameter
(allowSystemNamespaces list, dnsNamespace, dnsServiceName) is in
values.yaml, not hardcoded.

Validated:
- helm template with default values: 0 resources rendered (gate works)
- helm template with enabled=true: exactly 3 CCNPs rendered (default-deny,
  allow-system-namespaces, allow-egress-dns), all parse cleanly through
  python yaml.safe_load_all
- CCNP CRD validation will happen on Sovereigns where bp-cilium is
  installed; local k3s here uses flannel so server-side dry-run is
  unavailable

Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 8 +
§8 (EPIC-5), ADR-0001 §2 (zero-trust).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:40:30 +04:00
e3mrah
82bf6f6eec
fix(bp-cilium): align declared upstream version with Chart.lock (slice H1, #1095) (#1115)
EPIC-0 audit found provenance drift in bp-cilium:
- Chart.yaml dependencies[0].version declared "1.19.3"
- values.yaml catalystBlueprint.upstream.version declared "1.19.3"
- Chart.lock pinned to 1.16.5 (truth-on-disk — what every Sovereign
  has actually been running)

The declared "1.19.3" was never installed anywhere. Aligning all three
to "1.16.5" so observability/audit pipelines that compare the declared
upstream version with the actually-deployed Cilium version stop reporting
a 3-minor mismatch.
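
The drift check described above can be sketched as a small comparison between the declared versions and the locked one. This is an illustrative sketch only: the function names are hypothetical, it assumes plain `MAJOR.MINOR.PATCH` strings, and it is not the audit pipeline's actual code.

```python
def minor_drift(declared: str, locked: str) -> int:
    """Return how many minor versions `declared` is ahead of `locked`."""
    d_major, d_minor, _ = (int(part) for part in declared.split("."))
    l_major, l_minor, _ = (int(part) for part in locked.split("."))
    if d_major != l_major:
        raise ValueError("major versions differ; minor drift is not meaningful")
    return d_minor - l_minor


def check_provenance(chart_yaml: str, values_yaml: str, chart_lock: str) -> list:
    """Compare both declared versions against the Chart.lock pin.

    Returns one human-readable problem string per mismatching source.
    """
    problems = []
    for source, declared in (("Chart.yaml", chart_yaml), ("values.yaml", values_yaml)):
        if declared != chart_lock:
            drift = minor_drift(declared, chart_lock)
            problems.append(
                f"{source} declares {declared}, Chart.lock pins {chart_lock} "
                f"({drift} minor(s) apart)"
            )
    return problems
```

With the values from this commit, `check_provenance("1.19.3", "1.19.3", "1.16.5")` reports two problems before the fix and none once all three are "1.16.5".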

This is a pure metadata fix — no behavioral change. Rolling forward to a
newer Cilium minor (1.17.x or 1.18.x) is a separate slice that needs
real upgrade testing on a live data-plane cluster, including k3s
--flannel-backend=none compatibility and Gateway API CRD compatibility.

Validated:
- helm dependency build re-resolves to 1.16.5 cleanly
- Chart.lock unchanged (Cilium 1.16.5 was already what it had)

Chart version bumped 1.2.0 → 1.2.1 (patch). Blueprint.yaml mirrored.

Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.9 row 1, §11 row 3.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:36:15 +04:00
e3mrah
e8bf1aab69
feat(bp-nats-jetstream): land Stream + KV CR templates (slice H4, #1095) (#1114)
Realizes design doc §3.9 row 7. The chart had no templates/ directory —
NACK Stream and KeyValue CRs that ADR-0001 §6 mandates as the Catalyst
event spine were declared in docs but not in code.

What this slice ships:
- platform/nats-jetstream/chart/templates/_helpers.tpl — common labels +
  servers helper (defaults to <release>-nats Service URL, override via
  .Values.catalystStreams.servers).
- platform/nats-jetstream/chart/templates/streams.yaml — three Streams:
    * catalyst.audit  : 90-day retention, R=3, mirrored to DR (#1101)
    * catalyst.events : 24-hour retention (cross-replica fan-out + cold-
      start replay), R=3
    * catalyst.billing: 1-year retention, R=3, consumed by future billing
- platform/nats-jetstream/chart/templates/kv-buckets.yaml — three KVs:
    * idempotency  : 24h TTL, 256 MiB cap (write-path idempotency keys)
    * dr-leases    : 60s TTL (Continuum dns-quorum lease path; CF-KV
      bypasses this bucket)
    * policy-rollup: 7-day retention, 1 GiB cap (compliance scorer #1096)

Reconciliation gate:
- All resources render only when .Values.catalystStreams.enabled is true.
- NACK (nats-io/nack) is NOT a current dependency — installing it as a
  sibling Blueprint and flipping this toggle is a follow-up slice.
- Same default-off pattern the chart already uses for promExporter.podMonitor
  (issue #182) so a fresh Sovereign with no NACK keeps booting cleanly.

Per-tenant streams (org.<id>.events, app.<id>.events) are intentionally
NOT shipped here — they'll be created at runtime by organization-controller
(slice C1) and application-controller (slice C4) so they can scale per
tenant.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every retention,
TTL, replicas, and maxBytes is a values.yaml variable; per-Sovereign
overlays override.

Validated:
- helm dependency build pulls upstream nats:1.2.0
- helm template with default values: 0 catalyst-* resources rendered
  (catalystStreams.enabled=false, the safe default)
- helm template with catalystStreams.enabled=true: 6 resources rendered
  exactly as expected (3 Streams + 3 KeyValues, all in
  jetstream.nats.io/v1beta2)

Chart version bumped 1.1.2 → 1.2.0 (minor: new templates, no breaking changes).
Blueprint.yaml version mirrored.

Refs: #1094, #1095, #1096, #1101, docs/EPICS-1-6-unified-design.md §3.9
row 7, ADR-0001 §6.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:32:54 +04:00
e3mrah
25ef20a8e5
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.

Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
  (legacy, served, not storage) and v1 (canonical, served, storage). The
  shared schema means the 38 existing v1alpha1 files in platform/ +
  products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
  spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
  tagline interchangeable; category | family interchangeable; docs |
  documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
  hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
  manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
  Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
  Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
  observability, outputs, depends[].values, manifests.values, etc.

Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
  category (25), family (20), docs (20), documentation (14+1), icon (25),
  tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
  canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
  §3. Those 5 files are fixed in this commit:
    * platform/cert-manager-powerdns-webhook/blueprint.yaml
    * platform/cert-manager-dynadot-webhook/blueprint.yaml
    * platform/crossplane-claims/blueprint.yaml
    * platform/powerdns/blueprint.yaml
    * platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
  --dry-run=server) against the new CRD.
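
The string-form to object-form fix applied to those 5 files amounts to the following normalization. This is a hedged sketch: `normalize_depends` is a hypothetical helper, not a tool that exists in the repository, and the commit itself edited the files by hand.

```python
def normalize_depends(depends: list) -> list:
    """Convert legacy string entries to the canonical object form.

    "bp-name"                -> {"blueprint": "bp-name"}
    {"blueprint": "bp-name"} -> unchanged
    """
    normalized = []
    for entry in depends:
        if isinstance(entry, str):
            normalized.append({"blueprint": entry})
        elif isinstance(entry, dict) and "blueprint" in entry:
            normalized.append(entry)
        else:
            raise ValueError(f"unsupported depends entry: {entry!r}")
    return normalized
```

Entries already in object form pass through untouched, so running the helper twice gives the same result as running it once.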

Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
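
Four of these vectors can be reproduced outside the cluster with a few lines of Python. This is a rough sketch, not the CRD's OpenAPI schema: `validate_spec` is a hypothetical helper, `placementSchema.modes` is assumed to be a list, and the semver and blueprint-name checks are left out because their exact patterns are not quoted in this message.

```python
import re

TTL_PATTERN = re.compile(r"[0-9]+(s|m|h|d)")  # used with fullmatch, so anchors are implicit
VISIBILITY = {"listed", "unlisted", "private"}
PLACEMENT_MODES = {"single-region", "active-active", "active-hotstandby"}


def validate_spec(spec: dict) -> list:
    """Return a list of validation errors for the subset of rules modelled here."""
    errors = []

    visibility = spec.get("visibility")
    if visibility is not None and visibility not in VISIBILITY:
        errors.append(f"visibility: {visibility!r} not in enum")

    for mode in spec.get("placementSchema", {}).get("modes", []):
        if mode not in PLACEMENT_MODES:
            errors.append(f"placementSchema.modes: {mode!r} not in enum")

    for i, dep in enumerate(spec.get("depends", [])):
        if not isinstance(dep, dict):
            errors.append(f"depends[{i}]: must be an object")

    for i, rotation in enumerate(spec.get("rotation", [])):
        ttl = rotation.get("ttl")
        if ttl is not None and not TTL_PATTERN.fullmatch(ttl):
            errors.append(f"rotation[{i}].ttl: {ttl!r} does not match pattern")

    return errors
```

Server-side validation against the CRD remains the authority; a local check like this only shortens the feedback loop.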

This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.

Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:25:08 +04:00
e3mrah
a6fb97f2ef
fix(cutover step-01): clone+push (regular repo) instead of pull-mirror (#1033)
PR #1029 added a step-06 PATCH to flip mirror=false before push so
the cutover-helmrepository-patches Job could write HelmRepository
URL pivots to local Gitea. On Gitea 1.22.3 the PATCH returns 200
but silently no-ops — `mirror_interval` updates but `mirror: true`
stays. The repo remains read-only and step-06 still hits HTTP 403
"remote: mirror repository is read-only". Reproduced on otech127
2026-05-05 with chart 0.1.22 deployed.

Per ADR (cutover ends upstream tracking — Sovereign goes
self-hosted from this point), the architecturally correct fix is
to never create the mirror in the first place. Step-01 now creates
a regular Gitea repo and bare-clones+pushes upstream content. All
refs (branches+tags) replicate via `git push --mirror --force`,
which is idempotent on re-runs.

Trade-off: post-cutover Sovereigns no longer auto-sync from
upstream — that's the intended cutover semantics anyway. Operator
re-runs this Job manually for chart rollouts (next-session
follow-up: dedicated post-cutover sync mechanism, perhaps a
periodic CronJob the operator can opt into).

Bumps:
- bp-self-sovereign-cutover chart 0.1.22 → 0.1.23
- bootstrap-kit pin 0.1.22 → 0.1.23

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 03:19:05 +04:00
e3mrah
a070808eda
fix(cutover step-06): convert pull-mirror to standalone before pushing patches (#1029)
Step-01 creates openova/openova on the Sovereign's local Gitea as a
pull mirror so it tracks upstream openova-public during early
bootstrap. After cutover, the Sovereign is self-hosted and MUST
diverge from upstream — but Gitea blocks pushes to a mirror with
HTTP 403 "remote: mirror repository is read-only".

Step-06 adds a Phase-1.5 PATCH /api/v1/repos/{owner}/{repo}
{"mirror": false, "mirror_interval": "0"} BEFORE attempting to
clone+push the HelmRepository URL pivot. This converts the
pull-mirror into a standalone writable repo — the way the post-
cutover Sovereign architecture expects it.

Caught on otech125 2026-05-05: cutover-helmrepository-patches Job
returned "FATAL: git push failed" with no upstream stderr (chart
0.1.20 lacks the printf '%s\n' "$push_err" fix from PR #1022, which
was published in 0.1.21 only). Reproduced by cloning openova/openova
from a debug pod and running git push: "remote: mirror repository
is read-only / fatal: ... HTTP 403". Without the demirror step,
EVERY Sovereign provisioned fails handover at this step.

Bumps:
- bp-self-sovereign-cutover chart 0.1.21 → 0.1.22
- bootstrap-kit pin 0.1.20 → 0.1.22

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 02:53:45 +04:00
e3mrah
478743db17
fix(cutover-step-06): actually surface git push stderr (PR #1021 merged with only chart bump) (#1022)
PR #1021 was supposed to ship this code fix but the chart-version bump
landed first and the actual sed didn't apply (sed quoting mishap). The
debug-error fix never reached main. Re-shipping now as a clean Edit-
based commit. Captures git push stderr into push_err and prints it on
FATAL so the next iteration's failed Job logs include git's actual
rejection (auth / branch protection / hook).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 02:12:00 +04:00
e3mrah
69980ed48e
chore(bp-self-sovereign-cutover): bump 0.1.20 → 0.1.21 (#1021)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-06 02:10:45 +04:00
e3mrah
608db53a25
fix(cutover 0.1.20): Step-06 pushes YAML edit to local Gitea so patches survive Flux reconcile (#970) (#971)
## Root cause (live on otech116 2026-05-05 14:38)

After the #968 fix shipped (0.1.19), the cutover engine reached Step-7
(87%) successfully — Step-01..07 all completed. Then Step-08 (egress-
block-test) caught 38/38 HelmRepositories had reverted to upstream:

```
external HelmRepositories still pointing at ghcr.io/openova-io: 38
  OFFENDER flux-system/bp-cilium=oci://ghcr.io/openova-io
  ... (37 more)
FAIL — at least one HelmRepository did not pivot
```

But Step-06's job logs say:
```
[helmrepository-patches] OK bp-cilium -> oci://harbor.otech116.omani.works/openova-io
... (37 more OK)
ok=38 skip=0 fail=0
```

So Step-06 thought it succeeded — and it had, momentarily. But then
the bootstrap-kit Kustomization (which had successfully pivoted to
local Gitea via Step-05) reconciled its YAML from local Gitea, where
the YAML still declared `url: oci://ghcr.io/openova-io`. Within ~30s
every kubectl patch was undone. The cutover engine then aborted at
Step-8 verification.

## Fix

Step-06 now runs in two phases:
1. **Live K8s patches** (existing behaviour) — flips spec.url on every
   HelmRepository immediately. Useful for the cluster between cutover
   and the next reconcile.
2. **NEW — Push YAML edit to local Gitea** — clones `openova/openova`
   from the local Gitea over basic-auth, sed-rewrites every
   `clusters/_template/bootstrap-kit/*.yaml` declaration of `url:
   oci://ghcr.io/openova-io` → `oci://harbor.<sov-fqdn>/openova-io`,
   commits with a clear message, pushes back. Subsequent reconciles
   see local Harbor as the steady-state.
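
The rewrite in phase 2 is a line-level substitution. The Job does it with sed; the Python below is only an equivalent sketch for illustration, with a hypothetical function name, and it assumes the declaration always sits alone on a `url:` line.

```python
import re

UPSTREAM_URL = "oci://ghcr.io/openova-io"

# Match a whole "url: <upstream>" line; only spaces/tabs are allowed around it
# so the substitution can never swallow a neighbouring line break.
_URL_LINE = re.compile(
    r"^([ \t]*url:[ \t]*)" + re.escape(UPSTREAM_URL) + r"[ \t]*$",
    re.MULTILINE,
)


def rewrite_helmrepository_urls(yaml_text: str, sovereign_fqdn: str):
    """Point every upstream HelmRepository url at the Sovereign's local Harbor.

    Returns (rewritten_text, number_of_lines_changed).
    """
    local_url = f"oci://harbor.{sovereign_fqdn}/openova-io"
    return _URL_LINE.subn(lambda m: m.group(1) + local_url, yaml_text)
```

A second pass over already-rewritten text changes nothing, which keeps a re-run of the step harmless.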

After the push, the script annotates `flux-system/openova` GitRepository
to trigger immediate reconciliation so the new YAML lands without
waiting for the polling interval.

## Image change

Step-06 image bumped from `bitnami/kubectl:1.31.4` to `alpine/k8s:1.31.4`
because the new phase needs both `kubectl` and `git` in one image
(verified live on otech116 — both binaries present).

## Acceptance gate

Test case 16 added to cutover-contract.sh — guards against future
regressions that remove the `git clone`, the `git push origin main`,
or the `clusters/_template/bootstrap-kit` target dir reference.

## Live verification

Will fire on otech117 (next provision). Expected:
- Step-06 logs `cloning gitea-http.gitea.../openova/openova.git` then `pushed to ...`
- Step-08 verify PASSES (38/38 HelmRepositories pivoted in K8s + Gitea)
- self-sovereign-cutover-status `cutoverComplete: "true"`
- Egress block to ghcr.io safely activates

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:55:22 +04:00
e3mrah
3db19b76b1
fix(cutover 0.1.19): Step-01 gitea-mirror DNS readiness probe + backoffLimit=3 (#968) (#969)
## Root cause (live on otech115 2026-05-05 14:15)

After PR #959 (0.1.18) unblocked the auto-trigger to actually call
/internal/cutover/trigger, the cutover engine fired Step-01 within ~8s
of bp-self-sovereign-cutover Helm-install completing. The gitea Pod
had only just reached Ready state — cluster-DNS endpoint publication
for the headless service `gitea-http` was still in flight. One wget
returned `bad address gitea-http.gitea.svc.cluster.local` and exited
non-zero. Catalyst-api's cutover engine stamped Jobs with backoffLimit=0
(cutover.go:584), so a single DNS miss was terminal and aborted all 8
cutover steps. otech115 finished provisioning with cutoverComplete=false
and tethered to upstream github.com/ghcr.io.

## Fix (dual-layer)

**Layer A — catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3.
A single transient miss is recoverable (4 attempts over each step's
activeDeadlineSeconds) without burning operator-attention. Hard failures
still surface within budget.

**Layer B — chart Step-01 (01-gitea-mirror-job.yaml)**: explicit
nslookup readiness probe at the top of the bash script, before any
wget call. 30 attempts × 5s = 150s budget; alpine/git ships nslookup
in /usr/bin (verified live on otech115). Layer B is faster than Layer A
(in-script DNS retry vs Pod recreate); Layer A is the safety net for
any other transient pre-cluster-stable race we haven't yet enumerated.
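
The Layer B probe is a bounded retry loop around a DNS lookup. The chart implements it in shell with nslookup; the Python below is an illustrative equivalent, with a hypothetical function name and an injectable resolver so the logic can be exercised without a cluster.

```python
import socket
import time


def wait_for_dns(host, attempts=30, delay_s=5.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Block until `host` resolves; return the attempt number that succeeded.

    The defaults mirror the 30 attempts x 5s budget from this commit.
    Raises TimeoutError once the budget is exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host)
            return attempt
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt < attempts:
                sleep(delay_s)
    raise TimeoutError(f"{host} did not resolve after {attempts} attempts")
```

Retrying inside the script is cheaper than letting the Job fail and be recreated, which is why Layer A's backoffLimit stays as the outer safety net.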

## Acceptance gate

Test case 15 added to platform/self-sovereign-cutover/chart/tests/
cutover-contract.sh — guards against future regressions that drop
either the gitea_host extraction or the nslookup loop.

## Live verification

Will fire on the next provision (otech116). Expected:
- Step-01 logs `[gitea-mirror] DNS ready for gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:25:15 +04:00
e3mrah
d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the false-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere
  new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.
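
The shared predicate can be sketched as a lookup over an availability matrix. The real implementations are TypeScript (wizard) and Go (sku_availability.go); this Python version is illustrative only. It assumes that a missing region list means "no restriction recorded" and that unknown SKUs are rejected; the `example-unrestricted` entry is hypothetical.

```python
# SKU -> regions where it can be ordered. None means no restriction recorded
# (assumed semantics of the optional availableRegions field).
AVAILABLE_REGIONS = {
    "cpx32": ["fsn1", "nbg1", "hel1"],
    "cpx21": [],                    # orderable nowhere new
    "cpx31": [],                    # orderable nowhere new
    "example-unrestricted": None,   # hypothetical SKU for illustration
}


def is_sku_available_in_region(sku, region, matrix=AVAILABLE_REGIONS):
    """True if `sku` can be ordered in `region` according to the matrix."""
    if sku not in matrix:
        return False  # assumption: unknown SKUs are rejected
    regions = matrix[sku]
    return regions is None or region in regions


def suggest_alternative_skus(region, matrix=AVAILABLE_REGIONS):
    """List every SKU that is orderable in `region`, sorted by name."""
    return sorted(s for s in matrix if is_sku_available_in_region(s, region, matrix))
```

Evaluating the same predicate in the UI and in `Validate()` is what stops a stale wizard build or a direct API call from dispatching the otech109 topology (cpx32 in `ash`).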

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00
e3mrah
238c6d2010
fix(bp-flux): mitigate helm-controller leader-election loss + stuck-HR recovery (#925) (#960)
* fix(bp-flux): mitigate helm-controller leader-election loss + recovery CronJob (#925)

On otech113.omani.works the bp-vpa HelmRelease became stuck Ready=Unknown
forever after a transient kube-apiserver blip caused helm-controller to
lose its leader-election lease mid-install. The Helm release secret was
already committed (Status=deployed) by the previous leader, but its last
write to the HR's Ready condition was Unknown and the new leader's
"release in storage?" short-circuit never re-evaluates that. The HR
blocked bootstrap-kit → sovereign-tls → cilium-gateway, breaking every
HTTPRoute on the Sovereign.

Fix is two-pronged:

1) PRIMARY (prevent the trigger). Stretch leader-election lease durations
   on the three Catalyst-critical controllers (helm/kustomize/source) from
   the upstream defaults of lease=35s renew=30s retry=5s to lease=60s
   renew=40s retry=5s, and bump memory limits from 256Mi to 512Mi (helm)
   / 384Mi (kustomize, source) so OOMKills during 35-HR fan-out installs
   don't themselves trigger leadership handoffs. Costs ~50s extra failover
   time on a real controller crash; that's acceptable since CP HA is a
   Phase 2 concern and we'd much rather avoid spurious flips during
   transient API pressure.

2) RECOVERY (handle the residual case). New CronJob bp-flux-stuck-hr-recovery
   runs every 2 minutes, scans every HelmRelease cluster-wide, and for each
   HR stuck in Ready=Unknown for >5 minutes whose underlying Helm release
   secret already has status=deployed, force-toggles spec.suspend (the only
   known workaround per #925). Guardrail: refuses to act if more than 10
   HRs would be touched in a single run (signals a cluster-wide outage).
   Operator-disablable via .Values.catalyst.stuckHelmReleaseRecovery.enabled=false.
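
The selection rule and guardrail of the recovery CronJob can be sketched as a pure function. This is an illustration under assumptions: the `HelmReleaseState` type and its field names are hypothetical, and the real Job reads this state from the cluster rather than from in-memory objects.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class HelmReleaseState:
    name: str
    ready_status: str           # "True", "False" or "Unknown"
    ready_since: datetime       # last transition time of the Ready condition
    release_secret_status: str  # status recorded in the Helm release secret


def select_stuck(releases, now, stuck_after=timedelta(minutes=5), max_touched=10):
    """Return the HelmReleases that qualify for a suspend toggle.

    Returns an empty list when more than `max_touched` qualify, since that
    pattern points at a cluster-wide outage rather than isolated stuck HRs.
    """
    stuck = [
        hr for hr in releases
        if hr.ready_status == "Unknown"
        and now - hr.ready_since > stuck_after
        and hr.release_secret_status == "deployed"
    ]
    if len(stuck) > max_touched:
        return []
    return stuck
```

Requiring the release secret to be `deployed` restricts the toggle to releases Helm already finished installing, so an install that is still in progress is left alone.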

Lock-in tests: tests/leader-election-and-recovery.sh covers all three
flag/memory bumps, CronJob render, RBAC presence, disable-toggle, and
threshold operator override. version-pin-replay + observability-toggle
still green.

Chart bumped 1.1.4 → 1.2.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-flux): bump blueprint.yaml spec.version to 1.2.0 to match Chart.yaml (#925)

The bootstrap-kit static validation gate (Chart.yaml version ==
blueprint.yaml spec.version) caught the missed bump on PR #960.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:05:38 +04:00
e3mrah
b7f150db38
fix(cutover 0.1.18): poll /healthz for readiness instead of auth-gated /status (#957) (#959)
The 0.1.17 auto-trigger Job was Complete=True on otech113 but the
cutover never actually started: the readiness probe loop polled
/api/v1/sovereign/cutover/status (auth-gated, behind RequireSession)
and treated 401 as "API not ready". The loop ran 30 times for 300s
and exited 0 — the trigger endpoint was NEVER called.

Live evidence on otech113 2026-05-05:
  - 30 consecutive 401s from auto-trigger Pod (10.42.4.216) on
    /sovereign/cutover/status in catalyst-api access log
  - zero hits on /api/v1/internal/cutover/trigger
  - Helm post-upgrade hook deadline tripped → rollback to 0.1.15

Fix (chart-side only; PR #947 catalyst-api endpoint is correct as-is):
  - poll /healthz (unauthenticated, always 200 when process is up)
  - drop the pre-flight cutoverComplete=true short-circuit since
    /internal/cutover/trigger is already idempotent (returns 200 with
    the existing snapshot when cutoverComplete=true, per
    cutover_internal.go line 279)
  - bump chart 0.1.17 → 0.1.18; pin slot 06a to 0.1.18
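
The failure mode and the fix can be sketched as follows. This is illustrative Python, not the chart's shell script: the function names are hypothetical, the 10s delay is inferred from "30 times for 300s", and raising on timeout is an assumption about the desired behaviour rather than something this commit states.

```python
import time


def wait_until_ready(probe, attempts=30, delay_s=10.0, sleep=time.sleep):
    """Poll `probe` (a callable returning an HTTP status code) until it returns 200.

    Against an auth-gated endpoint an unauthenticated probe sees 401 forever,
    so it can never succeed; an unauthenticated /healthz returns 200 as soon
    as the process is up.
    """
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


def run_auto_trigger(probe, trigger, **wait_kwargs):
    """Call `trigger` only once the API is ready; fail loudly otherwise."""
    if not wait_until_ready(probe, **wait_kwargs):
        raise RuntimeError("API never became ready; trigger was not called")
    return trigger()
```

Because the trigger endpoint is idempotent, calling it unconditionally after readiness is safe even when the cutover has already completed.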

Tests:
  - contract gate Case 13: probe target is /healthz, NOT
    /sovereign/cutover/status (regression guard)
  - contract gate Case 14: no stale cutoverComplete pre-read off
    /tmp/status.json (the file no longer exists)
  - existing 12 contract gates still pass; helm lint clean
  - existing 6 Go unit tests for HandleCutoverInternalTrigger pass

Closes #957

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:02:12 +04:00
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.
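
The strict-match bump of `images.smeTag` can be sketched like this. The workflow does it in shell; the Python below is an illustrative equivalent, with a hypothetical function name and the assumption that `<sha>` is a lowercase hex string.

```python
import re

# Exactly two spaces of indent, the key, and a quoted hex SHA, nothing else.
SME_TAG_LINE = re.compile(r'^  smeTag: "[0-9a-f]+"$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Replace the smeTag value, refusing to act if the line shape changed."""
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one smeTag line, found {len(matches)}; "
            "refusing to auto-bump"
        )
    return SME_TAG_LINE.sub(lambda _m: f'  smeTag: "{new_sha}"', values_yaml, count=1)
```

Failing on an unexpected line shape turns a renamed or re-indented field into a visible workflow error instead of a silent no-op.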

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00
e3mrah
689276889c
fix(bp-catalyst-platform+bp-newapi): unblock alice signup gates 2-6 on Sovereigns (#915) (#951)
Six coupled chart + orchestrator fixes that unblock alice marketplace
signup → tenant ready → SaaS integrations → LLM → ledger on a freshly
franchised Sovereign. C5-final got Gate 1 GREEN on otech113 (2026-05-05)
but every downstream gate failed because the SME bundle hardcoded
contabo-only assumptions.

Bumps:
  - bp-catalyst-platform 1.4.21 → 1.4.22
  - bp-newapi             1.3.0 → 1.4.0
  - bootstrap-kit slot 13 + 80 pins updated in lockstep

Issues addressed (single consolidated PR — smaller PRs would race
against alice signup retries):

  - #934 (auth SMTP empty → "failed to send email"): sme-secrets.yaml
    now reads SMTP_* from `catalyst-system/sovereign-smtp-credentials`
    (the same A5-seeded source #883/#905 the chart 1.4.20 catalyst-
    openova-kc-credentials Secret already uses) with source-wins
    precedence. Both canonical (smtp-host/port/from/user/pass) AND
    legacy (host/port/from/user/password) source-Secret key shapes
    accepted. Empty source falls back to chart-level defaults so the
    contabo path stays clean.

  - #940 (provisioning service GITHUB_TOKEN placeholder + hardcoded
    upstream github.com): chart values
    .Values.smeServices.provisioning.{githubToken,git.{apiURL,owner,
    repo,branch}} make every GitHub-API coordinate operator-overridable
    with topology-aware defaults (Sovereign ⇒ in-cluster Gitea REST
    API + `openova` org; contabo ⇒ api.github.com + `openova-io` org).
    Provisioning binary's startup gate validates the GITHUB_TOKEN does
    NOT contain placeholder substrings (<placeholder>, PLACEHOLDER,
    REPLACE_ME, ...) and crashes the Pod into Pending if it does — the
    operator sees the misconfig immediately instead of after alice
    signups have failed silently in service logs. GitHub client now
    accepts a custom API URL via NewClientWithAPIURL so Gitea's GitHub-
    compatible /api/v1 surface drops in without re-implementing the
    client.

  - #941 (catalog "27 apps COMING SOON"): added `openclaw` and
    `stalwart-mail` to migrateAppDeployable's deployable map at
    core/services/catalog/handlers/seed.go. Both blueprints (bp-openclaw,
    bp-stalwart-{sovereign,tenant}) ship with visibility=listed in the
    embedded blueprints.json AND have working SME-tenant overlay
    templates in sme_tenant_gitops.go, but the catalog handler silently
    filtered them out because they were missing here. Map extracted to
    DeployableAppSlugs() exported function so unit tests can assert
    membership without invoking a Mongo store.

  - #942 (REDPANDA_BROKERS hardcoded to talentmesh): configmap.yaml
    selects broker default at render time based on global.sovereignFQDN
    — Sovereign ⇒ NATS JetStream Service per ADR-0001 (the only local
    bus on Sovereigns); contabo ⇒ legacy Redpanda Service in talentmesh.
    Operator MAY override either default via
    .Values.smeServices.eventBus.brokers without forking the chart.
    The ConfigMap key name stays REDPANDA_BROKERS for back-compat with
    existing SME service Go env wiring; new EVENT_BUS_PROTOCOL key
    surfaces the protocol hint for services that want to switch wire
    format independently.

  - #943 (bp-newapi silently skips Deployment): NEW
    templates/cnpg-cluster.yaml auto-provisions a CNPG-backed Postgres
    Cluster + Helm-`lookup`-persistent DSN Secret when
    .Values.cnpg.enabled (DEFAULT true). NEW templates/credentials-
    secret.yaml auto-generates SESSION_SECRET + CRYPTO_SECRET (each
    64-char randAlphaNum, persistent across reconciles via Helm
    `lookup`) when .Values.credentials.autoProvision (DEFAULT true).
    deployment.yaml gate now resolves Secret names from the chart-
    emitted defaults when the operator hasn't supplied an override.
    Capabilities-gated on postgresql.cnpg.io/v1 so a cold install
    before bp-cnpg is Ready surfaces as "no Cluster yet" rather than
    a hard install error.

  - #944 (CRITICAL — cross-cluster pollution): provisioning.yaml
    templates GIT_BASE_PATH from
    .Values.smeServices.provisioning.gitBasePath with a topology-aware
    default `clusters/<sovereignFQDN>/sme-tenants` on Sovereigns. NEW
    `core/services/provisioning/gitguard` package validates at startup
    AND on every commit code path that the path begins with
    `clusters/<self-FQDN>/` — refusing to commit to any other cluster's
    tree. Defence in depth so a runtime env mutation (kubectl exec,
    ConfigMap update without Pod restart, hostile sidecar) cannot
    bypass the check. Pre-#944 every alice tenant overlay landed in
    upstream openova/openova `clusters/contabo-mkt/tenants/<id>/`
    which contabo Flux would then install on the contabo cluster —
    C5-final caught + reverted the alice2 incident at commit 5715db04.
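
The #940 startup gate and the #944 path guard can be sketched roughly as below. This is a minimal illustration, not the actual `gitguard` package: the function names, the marker list (the commit elides part of it), and the `sov.example.test` FQDN are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// placeholderMarkers is illustrative: the commit names these three markers
// and elides the rest of the list.
var placeholderMarkers = []string{"<placeholder>", "PLACEHOLDER", "REPLACE_ME"}

// ValidateToken (hypothetical name) rejects empty tokens and tokens that
// still contain a placeholder marker.
func ValidateToken(token string) error {
	if strings.TrimSpace(token) == "" {
		return fmt.Errorf("token is empty")
	}
	for _, m := range placeholderMarkers {
		if strings.Contains(token, m) {
			return fmt.Errorf("token contains placeholder marker %q", m)
		}
	}
	return nil
}

// ValidatePath (hypothetical name) accepts gitPath only if, after cleaning
// away "." and ".." segments, it lies under clusters/<selfFQDN>/.
func ValidatePath(selfFQDN, gitPath string) error {
	if selfFQDN == "" {
		return fmt.Errorf("self FQDN is not set")
	}
	if strings.HasPrefix(gitPath, "/") {
		return fmt.Errorf("absolute path %q is not allowed", gitPath)
	}
	cleaned := path.Clean(gitPath)
	prefix := "clusters/" + selfFQDN + "/"
	// Appending "/" lets the tree root itself match while still rejecting
	// prefix collisions such as clusters/<selfFQDN>-other/.
	if !strings.HasPrefix(cleaned+"/", prefix) {
		return fmt.Errorf("path %q is outside %q", cleaned, prefix)
	}
	return nil
}

func main() {
	self := "sov.example.test" // hypothetical Sovereign FQDN
	for _, p := range []string{
		"clusters/sov.example.test/sme-tenants/alice",
		"clusters/sov.example.test/../contabo-mkt/tenants/alice",
		"clusters/sov.example.test-other/sme-tenants/alice",
	} {
		fmt.Printf("%-60s err=%v\n", p, ValidatePath(self, p))
	}
	fmt.Println("token check:", ValidateToken("REPLACE_ME"))
}
```

Calling the same validator at startup and again on each commit path is what gives the defence-in-depth property described above: a value that changes after startup is still checked before anything is written.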

Tests:
  - core/services/provisioning/gitguard: 22 cases covering Sovereign
    + contabo + traversal + prefix-collision + placeholder token
  - core/services/catalog/handlers: openclaw/stalwart-mail in
    deployable map + stable-shape lock against accidental deletes
  - helm-template smoke pass: bp-newapi (default values renders
    Deployment + auto-provisioned Secrets); bp-catalyst-platform
    (Sovereign render shows GIT_BASE_PATH=clusters/otech113.../sme-
    tenants, REDPANDA_BROKERS=nats-jetstream..., GITHUB_OWNER=openova,
    GITHUB_API_URL=http://gitea-http...)

Closes #934 #940 #941 #942 #943 #944
Refs umbrella #915

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:27:23 +04:00
e3mrah
890fa67eff
fix(bp-harbor): inline labels on admin Secret to drop duplicate keys (#949) (#950)
PR #947 (bp-harbor 1.2.14) added templates/admin-secret.yaml that
included the canonical bp-harbor.labels helper AND re-declared
app.kubernetes.io/name + catalyst.openova.io/component with admin-
credential-specific values. Helm's strict YAML post-render parser
rejected the rendered manifest with `mapping key
"app.kubernetes.io/name" already defined at line 8`, blocking the
upgrade chain on otech113 — bp-self-sovereign-cutover dependsOn
bp-harbor and re-blocked, stalling cutover indefinitely.

Per the issue's recommended Option A, labels are inlined verbatim
on the admin Secret. Every key the helper would emit is reproduced
explicitly, except that the two previously re-declared keys
(app.kubernetes.io/name and catalyst.openova.io/component=harbor-admin)
carry Secret-specific values, and an explicit admin-credentials
sub-component label is added.

A regression guard (Case 6) is added to tests/admin-secret.sh: the
rendered Secret block is parsed through PyYAML's safe_load_all,
which enforces mapping-key uniqueness the same way Helm's post-
render does. Duplicate keys raise and break the test.

Bumps:
  - platform/harbor/chart/Chart.yaml    1.2.14 → 1.2.15
  - clusters/_template/bootstrap-kit/19-harbor.yaml  slot pin

Verification (all green locally):
  helm template smoke . --namespace harbor   # renders OK
  bash tests/admin-secret.sh                 # 6 gates green
  helm lint .                                # 0 failed

Closes one half of #949 (bp-harbor side); the slot pin update
delivers it to fresh Sovereigns; existing otech113 picks up the
upgrade on next Flux reconcile after the new chart publishes.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-05 15:19:17 +04:00
e3mrah
88a8ecd8bb
fix(cutover): Reflector-mirror harbor-admin Secret + in-cluster trigger endpoint (#935) (#947)
Two bugs surfaced live on otech113 on 2026-05-05, blocking the
Self-Sovereignty Cutover end-to-end. This commit fixes both in lockstep:

Bug 1 — bp-self-sovereign-cutover Step 02 (harbor-projects) Job in
`catalyst` namespace was hitting `secret "harbor-core" not found` for
11+ retries because the upstream Harbor `harbor-core` Secret only
exists in the `harbor` namespace and Kubernetes forbids cross-namespace
secretKeyRef. Step 02 was stuck in CreateContainerConfigError forever.

  Fix: bp-harbor 1.2.13 → 1.2.14 ships a Catalyst-curated `harbor-admin`
  Secret in the `harbor` namespace with Reflector mirror annotations
  (allowed-namespaces=catalyst, auto-enabled). The same Secret name
  auto-materialises in `catalyst` so the cutover Job's secretKeyRef
  resolves natively. Password is randomly generated on first install
  (32-char alphanum, 190 bits entropy per feedback_passwords.md) and
  preserved across reconciles via `lookup`. The upstream Harbor subchart
  consumes it via `existingSecretAdminPassword: harbor-admin`.
  bp-self-sovereign-cutover 0.1.16 → 0.1.17 updates
  `harbor.adminSecretRef.name` from `harbor-core` to `harbor-admin`.
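
The chart itself does this with Helm's `lookup`; the Go sketch below only illustrates the same lookup-or-generate idea under stated assumptions. The function names and the in-memory map standing in for the existing Secret are hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// 62 symbols: each character carries log2(62) ≈ 5.95 bits, so 32 characters
// give roughly 190 bits of entropy.
const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// generate returns n characters drawn uniformly from alphabet using the
// operating system's CSPRNG.
func generate(n int) (string, error) {
	out := make([]byte, n)
	limit := big.NewInt(int64(len(alphabet)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

// LookupOrGenerate (hypothetical name) reuses the stored value when one
// exists and only generates a fresh one on first install.
func LookupOrGenerate(existing map[string]string, key string, n int) (string, error) {
	if v, ok := existing[key]; ok && v != "" {
		return v, nil
	}
	return generate(n)
}

func main() {
	// First install: nothing stored yet, so a password is generated.
	first, err := LookupOrGenerate(nil, "HARBOR_ADMIN_PASSWORD", 32)
	if err != nil {
		panic(err)
	}
	// Later reconcile: the stored value is found and preserved.
	stored := map[string]string{"HARBOR_ADMIN_PASSWORD": first}
	second, _ := LookupOrGenerate(stored, "HARBOR_ADMIN_PASSWORD", 32)
	fmt.Println(len(first), first == second)
}
```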

Bug 2 — The 0.1.16 auto-trigger Helm post-install Job (#933) POSTed
/api/v1/sovereign/cutover/start which sits behind RequireSession
middleware. The Job has no human session cookie — every request 401'd
forever and cutover never started.

  Fix: new catalyst-api endpoint POST /api/v1/internal/cutover/trigger
  lives OUTSIDE RequireSession and validates the bearer token via the
  apiserver's TokenReview API + checks the resolved username matches
  the canonical `bp-self-sovereign-cutover-runner` SA. Same engine,
  same idempotency, same state machine — different auth surface.
  The auto-trigger Job now mounts its projected SA token at
  /var/run/secrets/kubernetes.io/serviceaccount/token and sends it
  as `Authorization: Bearer <token>`. SA username + accepted list are
  runtime-overridable per Inviolable Principle #4.
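
A rough sketch of the auth flow for the internal trigger endpoint follows. It is not the real `HandleCutoverInternalTrigger`: the `TokenReviewer` interface, the handler constructor, the `catalyst` namespace in the ServiceAccount username, and the choice of 401 for an unauthenticated token are assumptions made for illustration. The 401/502/403/405 mapping otherwise follows the test cases listed in this commit.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// TokenReviewer stands in for a call to the apiserver's TokenReview API.
type TokenReviewer interface {
	Review(token string) (username string, authenticated bool, err error)
}

// runnerSA assumes the ServiceAccount lives in the `catalyst` namespace.
const runnerSA = "system:serviceaccount:catalyst:bp-self-sovereign-cutover-runner"

var errReviewUnavailable = errors.New("token review unavailable")

// NewTriggerHandler (hypothetical) authenticates via bearer token only;
// it never consults a session cookie.
func NewTriggerHandler(tr TokenReviewer, allowed []string, start func()) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		auth := r.Header.Get("Authorization")
		token := strings.TrimPrefix(auth, "Bearer ")
		if !strings.HasPrefix(auth, "Bearer ") || token == "" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		user, ok, err := tr.Review(token)
		if err != nil {
			w.WriteHeader(http.StatusBadGateway) // review call itself failed
			return
		}
		if !ok {
			w.WriteHeader(http.StatusUnauthorized) // assumption: invalid token
			return
		}
		for _, a := range allowed {
			if a == user {
				start()
				w.WriteHeader(http.StatusOK)
				return
			}
		}
		w.WriteHeader(http.StatusForbidden) // valid token, wrong identity
	})
}

// fakeReviewer returns canned results so the flow can be exercised offline.
type fakeReviewer struct {
	user string
	ok   bool
	err  error
}

func (f fakeReviewer) Review(string) (string, bool, error) { return f.user, f.ok, f.err }

// statusFor runs one request through the handler and returns the status code.
func statusFor(method, authHeader string, tr TokenReviewer) int {
	h := NewTriggerHandler(tr, []string{runnerSA}, func() {})
	req := httptest.NewRequest(method, "/api/v1/internal/cutover/trigger", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	good := fakeReviewer{user: runnerSA, ok: true}
	fmt.Println(statusFor(http.MethodPost, "Bearer t", good))
	fmt.Println(statusFor(http.MethodPost, "", good))
	fmt.Println(statusFor(http.MethodPost, "Bearer t", fakeReviewer{err: errReviewUnavailable}))
}
```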

Tests
  - 6 Go unit tests for HandleCutoverInternalTrigger covering happy
    path, missing bearer (401), TokenReview rejection (502), wrong SA
    (403), idempotency (no Jobs created when complete), wrong method
    (405). All pass.
  - bp-harbor admin-secret contract test (5 cases) — Secret renders,
    HARBOR_ADMIN_PASSWORD key present, Reflector annotations, keep
    policy, upstream consumes via existingSecretAdminPassword.
  - bp-self-sovereign-cutover cutover-contract test extended with 3
    new cases — auto-trigger uses /internal/cutover/trigger, sends
    SA bearer token, references harbor-admin (not harbor-core).
  - All 12 cutover-contract gates green; all 4 observability-toggle
    gates green; helm template + helm lint clean on both charts.

Bootstrap-kit slot pins
  - clusters/_template/bootstrap-kit/19-harbor.yaml: 1.2.13 → 1.2.14
  - clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml:
    0.1.16 → 0.1.17

Closes #935

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:12:50 +04:00
e3mrah
e9a72aa00d
feat(self-sovereign-cutover): auto-trigger on install + always-defined State (#933 E1) (#936)
Closes the otech113 dashboard regression where SovereigntyCard rendered
`invalid CutoverState: <undefined>` instead of a Tethered badge, and
makes the Day-2 cutover fire automatically once the chart lands rather
than waiting for an operator click on "Achieve True Sovereignty".

Founder rule per #933: handover is not "done" until cutover has run;
the operator must NOT have to click a CTA on
console.<sov-fqdn>/console/dashboard.

Three coupled changes:

1. catalyst-api: cutoverStatusResponse now ALWAYS emits a `state` field
   ("tethered" or "sovereign"), derived from cutoverComplete. The UI's
   branded parseCutoverState rejects empty/undefined, which is what
   was rendering the user-visible error text. Tests cover the empty
   ConfigMap, missing cutoverComplete, and explicit-true cases.

2. UI parseCutoverStatus: defensive fallback when wire frame omits
   `state` — derive from cutoverComplete (default "tethered"). Hostile/
   typo'd state values (e.g. 'pending', '') still throw via the branded
   parser. Defends against partial-rollout where a stale catalyst-api
   Pod is still serving the old shape.

3. bp-self-sovereign-cutover 0.1.16 (chart): new Helm post-install/
   post-upgrade hook (templates/10-auto-trigger-job.yaml) POSTs
   /api/v1/sovereign/cutover/start on catalyst-api after the step
   ConfigMaps + RBAC land. Idempotent via catalyst-api's durable
   status ConfigMap (200 if already complete, 409 if running, 200
   to start). Fails open: if catalyst-api is transiently unreachable the
   Job exits 0 so the chart install doesn't block; the operator can always
   re-fire via the manual CTA. Gated on .Values.trigger.auto (default
   true; per-Sovereign overlays can disable for soak Sovereigns).
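
The "always-defined state" rule from change 1 can be sketched as follows. The struct shape, the ConfigMap key name, and the function signature are assumptions for illustration; only the `state` values and the derivation from `cutoverComplete` come from this commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CutoverStatusResponse is a simplified stand-in for the real wire type.
type CutoverStatusResponse struct {
	State           string `json:"state"`
	CutoverComplete bool   `json:"cutoverComplete"`
}

// BuildCutoverStatusResponse derives state from the status ConfigMap data.
// An empty ConfigMap or a missing key both yield "tethered", so the state
// field is never empty on the wire.
func BuildCutoverStatusResponse(data map[string]string) CutoverStatusResponse {
	complete := data["cutoverComplete"] == "true" // key name is an assumption
	state := "tethered"
	if complete {
		state = "sovereign"
	}
	return CutoverStatusResponse{State: state, CutoverComplete: complete}
}

func main() {
	for _, data := range []map[string]string{
		nil,                                  // fresh Sovereign, empty ConfigMap
		{"other": "x"},                       // cutoverComplete missing
		{"cutoverComplete": "true"},          // explicit true
	} {
		b, _ := json.Marshal(BuildCutoverStatusResponse(data))
		fmt.Println(string(b))
	}
}
```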

Hard rules honoured:
- No contabo Pods touched.
- Existing tethered Sovereigns that have not cut over stay tethered:
  the auto-trigger Job is in the chart (per-Sovereign), not in the
  mothership; only fresh Sovereign installs of bp-self-sovereign-cutover
  0.1.16+ get it.
- IaC-first: the auto-trigger uses catalyst-api's existing /start
  endpoint (no bespoke cluster mutation outside the chart).
- Event-driven: post-install hook fires on chart install (no cron).

Verification:
- Go: cutover_test.go +TestBuildCutoverStatusResponse_StateAlwaysDefined
  +TestHandleCutoverStatus_StateFieldEmittedOnFreshSovereign — both
  green.
- TS: cutover.test.ts +5 cases for parseCutoverStatus state-fallback;
  35/35 green. Sovereignty widget tests 20/20 green.
- Chart: tests/cutover-contract.sh +Case 8/9 (auto-trigger present by
  default, absent under trigger.auto=false); helm template renders
  cleanly.

Co-authored-by: Hatice Yildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:40:52 +04:00
e3mrah
9077016466
feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924) (#931)
Phase-2 follow-up to #883: replace mothership Stalwart relay
(mail.openova.io:587) with a Sovereign-local Stalwart so Console
PIN/magic-link mail originates from `noreply@<sovereignFQDN>` with
per-Sovereign SPF/DKIM/DMARC posture, eliminating the mothership
SMTP SPOF for Sovereign Console login.

What ships:

  1. NEW blueprint platform/stalwart-sovereign/ (otech-level — distinct
     from per-tenant bp-stalwart-tenant). Single Stalwart instance per
     Sovereign cluster, scoped to Sovereign Console system mail. NO
     Keycloak OIDC, NO webmail UI — Sovereign Console is the only
     consumer. Auto-provisioned admin + submission Secrets via the
     lookup-or-generate pattern (#898/#830/#887). Post-install Job:
       - registers the noreply submission principal in Stalwart
       - allows send-as for noreply@<sovereignFQDN>
       - reads DKIM public key, patches dns-records ConfigMap
       - materialises catalyst-system/sovereign-smtp-credentials with
         Sovereign-local infrastructure addresses + credentials,
         carrying BOTH key shapes (smtp-user/smtp-pass + legacy
         user/password) so the consumer chart works either way.

  2. NEW bootstrap-kit slot 95 (clusters/_template/bootstrap-kit/
     95-bp-stalwart-sovereign.yaml). dependsOn: bp-cert-manager,
     bp-catalyst-platform. Sequenced after bp-catalyst-platform (slot
     13) so the chart's post-install Job lands its mirror Secret in
     an already-existing catalyst-system namespace.

  3. bp-catalyst-platform 1.4.19 → 1.4.20: SOURCE-wins precedence
     extended to (a) non-secret fields smtp-host/smtp-port/smtp-from
     so Sovereign-local infra addresses (`mail.<sovereignFQDN>`) take
     over from mothership defaults (`mail.openova.io`) on the next
     reconcile after slot 95 lands, and (b) canonical key shape
     `smtp-user`/`smtp-pass` in addition to legacy `user`/`password`
     source key shape.

  4. expected-bootstrap-deps.yaml: declare slot 95 graph edge.

  5. catalyst-api handler/sovereign_smtp_seed.go: documentation-only
     update to note this Phase-1 step is now a graceful fallback —
     the Phase-2 chart's post-install Job overwrites the mirror
     Secret on first reconcile so the cutover from mothership relay
     to Sovereign-local relay is automatic, no operator action.
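
The source-wins precedence described in item 3 is implemented in Helm templates; the Go sketch below only models the resolution order (canonical source key, then legacy source key, then chart default). The function name and the example FQDN are hypothetical.

```go
package main

import "fmt"

// Resolve (hypothetical name) returns the first non-empty value among the
// canonical source key, the legacy source key, and the chart default.
func Resolve(source map[string]string, canonical, legacy, chartDefault string) string {
	if v := source[canonical]; v != "" {
		return v
	}
	if legacy != "" {
		if v := source[legacy]; v != "" {
			return v
		}
	}
	return chartDefault
}

func main() {
	// Source Secret as written by the Sovereign-local Stalwart post-install
	// Job; the FQDN is a placeholder for illustration.
	source := map[string]string{
		"smtp-host": "mail.sov.example.test",
		"user":      "noreply", // legacy key shape only
	}
	fmt.Println(Resolve(source, "smtp-host", "host", "mail.openova.io"))
	fmt.Println(Resolve(source, "smtp-user", "user", ""))
	fmt.Println(Resolve(source, "smtp-port", "port", "587"))
}
```

With an empty source map every call returns the chart default, which matches the behaviour wanted before slot 95 has landed.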

Verification:
  - `helm template smoke ./platform/stalwart-sovereign/chart` clean
    (smoke-render-safe; per-template gates skip when sovereignFQDN unset).
  - `helm template smoke -f operator-values.yaml` emits StatefulSet,
    LoadBalancer Service, ClusterIP HTTP Service, DKIM-signing config,
    dns-records ConfigMap, Setup Job + RBAC.
  - `chart/tests/sovereign-render.sh` 3 cases all PASS.
  - `helm template smoke ./products/catalyst/chart` (1.4.20) clean.
  - `helm lint` both charts: clean (only icon-recommended INFO).
  - `bash scripts/check-bootstrap-deps.sh` PASSED — bootstrap-kit
    dependency graph audit, 0 drift, 0 cycles.
  - `go test -run TestSeedSovereignSMTP` — Phase-1 seed tests pass.
  - `go test -run TestBootstrapKit_TemplateClusterParses` — slot 95
    YAML parses cleanly.

Out of scope (sub-PR follow-up under #924):
  - DKIM keypair generation in catalyst-api orchestrator + DNS records
    (MX/A/SPF/DMARC/DKIM-pubkey) registration via PDM dynadot adapter
    at omani.works.
  - Hetzner PTR (rDNS) auto-registration via the Hetzner cloud API.
  - Cert-manager Certificate adding mail.<sovereignFQDN> SAN to the
    Sovereign wildcard cert (chart relies on the existing wildcard
    cert from bp-catalyst-platform 1.4.0+'s per-zone Certificate
    template — when that wildcard chain covers the Sovereign FQDN,
    `mail.<sovereignFQDN>` is already covered).

Acceptance (lands when sub-PR follow-up ships):
  - Sovereign Console PIN delivery uses noreply@<sov-fqdn>.
  - External mail server (e.g. Gmail) accepts mail with valid SPF + DKIM.
  - Mothership SMTP is no longer a SPOF for Sovereign Console login.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:20:16 +04:00
e3mrah
3fe27f625f
feat(bp-wordpress-tenant): wp-cli OIDC bootstrap + oidc.* canonical block (0.2.0, #915) (#927)
Umbrella issue #915 (D1 sub-task). Aligns the chart's post-install OIDC
config Job with the canonical wp-cli flow and the bp-keycloak tenant-
realm contract C1's PR #918 ships.

Chart 0.2.0
-----------
- templates/oidc-config-job.yaml rewritten to use the official
  wordpress:cli-2.12.0-php8.3 image (manifest-list digest pinned per
  Inviolable Principle #4). Replaces direct PHP/SQL UPSERTs against
  wp_options with:
    * wp core install (idempotent: wp core is-installed)
    * wp plugin install openid-connect-generic --activate (idempotent:
      wp plugin is-installed)
    * wp option update openid_connect_generic_settings <json>
    * wp option update default_role
    * wp theme install/activate
    * wp option update siteurl/home
  Going through wp-cli (i.e. WordPress core's own PHP API) is more
  resilient than schema-shape-dependent INSERT statements and survives
  WordPress minor upgrades.

- values.yaml: new canonical oidc.* block —
    oidc.{enabled, issuerURL, clientId, clientSecretName, defaultRole,
          identityKey, roleMapping, cliImage}.
  Default oidc.clientSecretName = "wordpress-oidc-client-secret"
  matches the K8s Secret bp-keycloak's PR #918 emits alongside the
  realm import ConfigMap (so the realm JSON's `secret` field and the
  Secret bytes never drift).

- Legacy keycloak.{realmURL, clientID, clientSecretName} kept as a
  back-compat alias. _helpers.tpl folds it into oidc.* when the
  modern keys are at their values.yaml defaults so chart 0.1.x
  clusters keep reconciling. Removed in chart 0.3.0.

- oidc.defaultRole=subscriber — newly auto-created SSO users land
  with subscriber capability (operator overrides via overlay).

- Redirect URIs: the openid-connect-generic plugin's default callback
  is /wp-admin/admin-ajax.php?action=openid-connect-authorize when
  alternate_redirect_uri=0 (we set 0). bp-keycloak (PR #918)
  registers the same URL plus /wp-login.php and a /* wildcard, so the
  client's allowed-redirect-URI list aligns with what the plugin
  actually issues.

Orchestrator emit
-----------------
- products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
  smeTenantBPWordPress now emits the canonical oidc.* block AND the
  legacy keycloak.* alias (for chart 0.1.x clusters mid-upgrade).

Tests
-----
- chart/tests/oidc-config.sh — 7 helm-template assertions:
    1. Canonical oidc.* render produces a Job with the required
       wp-cli command flow + wordpress:cli-2.12.0-php8.3 image.
    2. Legacy keycloak.* fold path (chart 0.1.x compat).
    3. oidc.enabled=false short-circuits the Job.
    4. alternate_redirect_uri=0 (so plugin URL matches the realm-
       registered redirect URI from PR #918).
    5. defaultRole rendered + propagated.
    6. Render YAML is parseable and contains all required kinds.
    7. wp-content PVC mounted in the Job (so pg4wp's db.php drop-in
       loads — failure here would silently fall back to mysqli).

- internal/handler/sme_tenant_test.go:
    * TestRenderSMETenantOverlay_WordPressEmitsOIDC — pins the
      canonical oidc.* block + legacy keycloak.* alias the
      orchestrator emits for the alice@omantel test fixture.
    * TestRenderSMETenantOverlay_WordPressOIDC_BYOMode — BYO domain
      mode renders wordpress.<byo-domain> as the ingress host.

Verification
------------
- helm lint clean
- helm template smoke green for: oidc.* canonical, keycloak.* legacy
  fold, oidc.enabled=false short-circuit
- chart/tests/oidc-config.sh: 7/7 PASS
- chart/tests/observability-toggle.sh: 2/2 PASS (regression)
- go test ./internal/handler/ -run "SMETenant|TestRenderSME": all
  green (TestAuthHandover_HappyPath failure is pre-existing on main,
  unrelated to this change)

Closes the D1 sub-task of #915.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:10:41 +04:00
e3mrah
a1ca1872aa
feat(bp-stalwart-tenant): wire Keycloak OIDC SSO end-to-end (#915) (#920)
Closes the C2 sub-task of EPIC #915 — alice's Stalwart authenticates
SMTP/IMAP/JMAP/webmail logins against her per-tenant Keycloak realm,
not a shared otech-level IdP.

Three layered changes (matching the three things broken on otech103):

1. Orchestrator (`smeTenantBPStalwart` in sme_tenant_gitops.go)
   now emits per-tenant OIDC values matching the bp-wordpress-tenant
   + bp-openclaw shape:
     keycloak.realmURL = https://keycloak.<sub>.<parent>/realms/sme-<sub>
     keycloak.clientID = stalwart
     keycloak.clientSecretName = stalwart-oidc-client-secret
     keycloak.oidcExternalSecret.remoteRef.key
       = sovereign/<otech-fqdn>/stalwart/<tenant>/oidc
   plus admin externalSecret + dependsOn bp-keycloak so the SME's
   three apps (wordpress, openclaw, stalwart) SSO against ONE realm
   with distinct client IDs (#915 C1 registers all three in the realm
   bootstrap).

2. Chart bootstrap config.toml drops the pre-0.16 kebab-case
   `[directory.keycloak] type = "oidc"` block (silently ignored by
   the upstream registry parser — verified against
   crates/registry/src/schema/structs.rs in stalwartlabs/stalwart;
   OidcDirectory (de)serializes camelCase keys: `@type = "Oidc"`, `issuerUrl`,
   `claimUsername`, `claimName`, `claimGroups`, `requireScopes`).
   The `internal` directory stays as the bootstrap fallback so the
   admin can log in before the post-install Job seeds OIDC.

3. setupJob defaults to enabled (was off in 0.1.1) and POSTs the
   canonical OIDC directory entry to `/api/settings`:
     directory.keycloak.@type            = "Oidc"
     directory.keycloak.issuerUrl        = <realm URL>
     directory.keycloak.claimUsername    = preferred_username
     directory.keycloak.claimName        = name
     directory.keycloak.claimGroups      = groups
     directory.keycloak.requireScopes    = [openid email profile groups]
     directory.keycloak.usernameDomain   = <tenant domain>
     storage.directory                   = keycloak
   The setting POSTs are idempotent (`assert_empty: false`) so Helm
   upgrades re-run without breaking existing logins. Re-uses the
   upstream Stalwart container (ships curl + stalwart-cli) — no new
   image needed.
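
As a rough illustration of items 1 and 3, the sketch below derives the per-tenant realm URL and assembles the settings keys listed above. The type and function names are hypothetical, the tenant names are made up, and the wire encoding the setup Job uses for `/api/settings` is not shown because this commit does not specify it.

```go
package main

import "fmt"

// Setting is one key/value pair destined for the settings API.
type Setting struct {
	Key   string
	Value any
}

// RealmURL follows the pattern
// https://keycloak.<sub>.<parent>/realms/sme-<sub> from item 1.
func RealmURL(sub, parent string) string {
	return fmt.Sprintf("https://keycloak.%s.%s/realms/sme-%s", sub, parent, sub)
}

// OIDCDirectorySettings returns the keys from item 3 in a stable order.
func OIDCDirectorySettings(realmURL, tenantDomain string) []Setting {
	return []Setting{
		{"directory.keycloak.@type", "Oidc"},
		{"directory.keycloak.issuerUrl", realmURL},
		{"directory.keycloak.claimUsername", "preferred_username"},
		{"directory.keycloak.claimName", "name"},
		{"directory.keycloak.claimGroups", "groups"},
		{"directory.keycloak.requireScopes", []string{"openid", "email", "profile", "groups"}},
		{"directory.keycloak.usernameDomain", tenantDomain},
		{"storage.directory", "keycloak"},
	}
}

func main() {
	// Hypothetical tenant: subdomain "alice" under parent "parent.example".
	url := RealmURL("alice", "parent.example")
	for _, s := range OIDCDirectorySettings(url, "alice.parent.example") {
		fmt.Printf("%-36s = %v\n", s.Key, s.Value)
	}
}
```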

Tests:
  - `chart/tests/oidc-render.sh` (NEW): asserts every settings key
    is rendered, the [oauth] env block propagates the per-tenant
    realm URL, and the bootstrap config.toml parses as valid TOML.
  - `chart/tests/expression-syntax.sh`: re-passes (Stalwart
    expression `==` audit per stalwart_expression_syntax.md).
  - `TestRenderSMETenantOverlay_StalwartEmitsKeycloakOIDC` (NEW):
    Go test verifies the orchestrator emits the per-tenant realm
    URL, client metadata, and ExternalSecret-store remoteRef paths.
  - All existing TestRenderSMETenantOverlay_* tests pass.
  - `helm template` clean with default values AND with a per-tenant
    overlay (--api-versions external-secrets.io/v1beta1).

Chart bumps 0.1.1 → 0.1.2; blueprint.yaml spec.version mirrors per
issue #817 (chart/blueprint version invariant).

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:37:46 +04:00
e3mrah
9447d88dfd
feat(bp-newapi): auto-seed channel #1 = Qwen3.6 @ BankDhofar (#915) (#919)
Per epic #915 (SME tenant integration DoD: alice → OpenClaw → NewAPI →
Qwen3.6@BankDhofar end-to-end), bp-newapi must come up with channel
#1 = Qwen3.6 hosted at BankDhofar
(https://llm-api.omtd.bankdhofar.com, model qwen3-coder / alias
qwen3.6) already wired to its admin API, so the FIRST customer
request from an SME's OpenClaw → NewAPI hits a real upstream LLM
rather than a 404 / "no channel found" error.

Until now the chart's channels.yaml ConfigMap was a documentation
surface only; the upstream NewAPI binary persists channel state to
its Postgres `channels` table via its admin API at /api/channel/.
This patch bridges that gap.

Discovery:
  - Canonical BankDhofar relay reference exists in
    openova-private/clusters/contabo-mkt/apps/axon/helmrelease.yaml
    (axon.vllm.baseUrl=https://llm-api.omtd.bankdhofar.com,
    defaultModel=qwen3-coder, secret=axon-vllm-secret).
  - K8s secret confirmed live (axon/axon-vllm-secret, key
    AXON_VLLM_API_KEY).
  - Architecture: bp-newapi is per-Sovereign (one NewAPI per OTECH);
    SME tenants share it via OpenClaw's newapi.baseURL =
    https://newapi.<OTECHFQDN>. Channel seeding therefore happens
    at the Sovereign-level chart install, NOT per-tenant.

Changes:
  1. platform/newapi/chart/values.yaml
     - New `defaultChannels.qwenBankDhofar` block (enabled=false by
       default; per-Sovereign overlay flips it true with the
       canonical endpoint + commercial-contract attestation).
     - New `channelSeed` block configuring the post-install Helm
       hook Job (image, resources, backoff, deadline, hook delete
       policy).

  2. platform/newapi/chart/templates/_helpers.tpl
     - effectiveChannels helper composes qwenBankDhofar BEFORE
       operator-supplied .Values.channels and BEFORE defaultChannels.vllm
       so it lands as channel #1 in NewAPI's row-insertion order
       (NewAPI's router resolves `model` lookups in row order).
     - New channelSeedJobName helper (shared by Job + RBAC + ConfigMap).

  3. platform/newapi/chart/templates/channel-seed-job.yaml (NEW)
     - post-install/post-upgrade Helm hook Job that:
       * Mounts the operator-supplied master-key Secret
         (auth.adminUI.masterKeySecret) for one-time admin API auth.
       * Mounts the per-channel upstream API key Secret
         (defaultChannels.qwenBankDhofar.existingSecret).
       * Polls /api/status until 200 (handles NewAPI startup window).
       * For each default channel: GET /api/channel/?keyword=<name>;
         if a row whose `name` exactly matches exists, SKIP. Otherwise
         POST /api/channel/ with the channel definition. Idempotent —
         re-runs after upgrades are no-ops once channels exist.
       * Bounded RBAC (Role+RoleBinding only on the named Secrets).
       * Skip-render gates: channelSeed.enabled, defaultChannels.*
         enabled, masterKeySecret supplied. helm template with default
         values renders no Job (CI smoke clean).

  4. clusters/_template/bootstrap-kit/80-newapi.yaml
     - Bumped chart version 1.2.0 → 1.3.0.
     - Added defaultChannels.qwenBankDhofar block to the per-Sovereign
       overlay shape (still enabled=false in the template — operator
       supplies endpoint + attestation + Secrets per Sovereign).

  5. platform/newapi/chart/Chart.yaml
     - Bumped 1.2.0 → 1.3.0 with changelog comment.

  6. products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
     - bp-openclaw per-tenant overlay now emits `newapi.defaultModel:
       qwen3.6` so OpenClaw's UI surfaces the friendlier alias by
       default. (Both qwen3.6 and qwen3-coder route to the same
       channel via the chart's `models` list.)
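
The skip-or-create logic of the seed Job in change 3 can be sketched as below. The `ChannelAPI` interface is a hypothetical abstraction over `GET /api/channel/?keyword=<name>` and `POST /api/channel/`; NewAPI's actual request and response shapes are not reproduced here, and the channel names are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// ChannelAPI abstracts the two admin calls the seed Job makes.
type ChannelAPI interface {
	Search(keyword string) (names []string, err error)
	Create(name string) error
}

// SeedChannels creates each named channel unless a row with exactly that
// name already exists. A keyword search may return partial matches, so the
// exact-name comparison is what makes re-runs no-ops.
func SeedChannels(api ChannelAPI, names []string) (created []string, err error) {
	for _, name := range names {
		found, err := api.Search(name)
		if err != nil {
			return created, fmt.Errorf("search %q: %w", name, err)
		}
		exists := false
		for _, f := range found {
			if f == name {
				exists = true
				break
			}
		}
		if exists {
			continue
		}
		if err := api.Create(name); err != nil {
			return created, fmt.Errorf("create %q: %w", name, err)
		}
		created = append(created, name)
	}
	return created, nil
}

// memAPI is an in-memory fake whose Search does substring matching.
type memAPI struct{ rows []string }

func (m *memAPI) Search(keyword string) ([]string, error) {
	var out []string
	for _, r := range m.rows {
		if strings.Contains(r, keyword) {
			out = append(out, r)
		}
	}
	return out, nil
}

func (m *memAPI) Create(name string) error {
	m.rows = append(m.rows, name)
	return nil
}

func main() {
	api := &memAPI{rows: []string{"qwen-bankdhofar-old"}} // partial match only
	first, _ := SeedChannels(api, []string{"qwen-bankdhofar"})
	second, _ := SeedChannels(api, []string{"qwen-bankdhofar"})
	fmt.Println(len(first), len(second))
}
```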

Verification:
  - helm lint .                    PASS (1 chart linted, 0 failed)
  - helm template (defaults)       PASS (no Job rendered)
  - helm template (qwen enabled)   PASS (Job + RBAC + ConfigMap +
                                          channels.yaml all render
                                          with channel #1 first)
  - helm template (endpoint empty) FAIL with helpful message
                                   (configurability gate)
  - go build ./...                 PASS
  - go test ./internal/handler/... PASS for SME tenant overlay tests
                                   (TestRenderSMETenantOverlay_*)
  - Pre-existing AuthHandover panic is unrelated to this change

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every knob is
configurable via the per-Sovereign bootstrap-kit overlay. The
endpoint default is empty so a fresh `helm template` does not
silently wire customers to a third-party host.

Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:32:00 +04:00
e3mrah
7f859dbb4b
feat(bp-keycloak): tenant-mode realm with wordpress/openclaw/stalwart OIDC clients (1.4.0, #915) (#918)
PR #911 wired the SME tenant orchestrator to emit
realmConfig.tenant.enabled=true on the per-tenant bp-keycloak
HelmRelease — but the chart had no template that consumed those values,
so the WordPress / OpenClaw / Stalwart OIDC integrations had no client
registered in the tenant realm and SSO failed end-to-end.

This change adds the chart-side template that consumes the values the
orchestrator was already emitting. When realmConfig.tenant.enabled=true:

  * configmap-sovereign-realm.yaml SKIPS (mutual-exclusion guard added
    on the existing template) so only one realm CM is rendered.
  * NEW templates/configmap-tenant-realm.yaml renders a realm import
    ConfigMap (same name `<release>-sovereign-realm-config` so the
    upstream keycloak-config-cli existingConfigmap reference still
    resolves) carrying the tenant realm + 3 OIDC clients:
      - wordpress  (confidential, auth-code; redirect URIs cover the
                    openid-connect-generic plugin's admin-ajax.php
                    callback + /wp-login.php fallback)
      - openclaw   (confidential, auth-code; redirect URI /oauth/callback
                    per #915 spec)
      - stalwart   (confidential, serviceAccountsEnabled=true so the
                    directory.keycloak type=oidc backend can use
                    client_credentials to introspect IMAP/SMTP tokens;
                    standardFlowEnabled=true for webmail UI auth-code)
  * NEW per-app Secrets emitted in the same template scope as the realm
    ConfigMap so the realm JSON's `secret` field and the K8s Secret
    bytes never drift:
      - wordpress-oidc-client-secret
      - openclaw-oidc-client-secret
      - stalwart-oidc-client-secret  (carries BOTH client-secret AND
                                      OIDC_CLIENT_SECRET keys for the
                                      two consumer paths)
  * Each per-app secret persists across helm upgrade via
    lookup-or-generate (mirrors marketplace-api/secret.yaml pattern from
    issue #887 and the existing catalyst-api-server secret in
    configmap-sovereign-realm.yaml). helm.sh/resource-policy: keep so
    bytes outlive uninstall.
  * Fail-closed validation when realmConfig.tenant.enabled=true and
    any of realmName / parentDomain / subdomain is unset (Inviolable
    Principle #4).
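
The fail-closed validation in the last bullet lives in the Helm template; the Go sketch below only models the rule. The struct and method names are hypothetical, and the sample values are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// TenantRealm mirrors the three required realmConfig.tenant.* fields.
type TenantRealm struct {
	Enabled      bool
	RealmName    string
	ParentDomain string
	Subdomain    string
}

// Validate fails closed: when tenant mode is enabled, every required field
// must be set, and the error names all missing fields at once.
func (t TenantRealm) Validate() error {
	if !t.Enabled {
		return nil // Sovereign mode: nothing to check
	}
	var missing []string
	if t.RealmName == "" {
		missing = append(missing, "realmName")
	}
	if t.ParentDomain == "" {
		missing = append(missing, "parentDomain")
	}
	if t.Subdomain == "" {
		missing = append(missing, "subdomain")
	}
	if len(missing) > 0 {
		return fmt.Errorf("realmConfig.tenant.enabled=true but unset: %s",
			strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	fmt.Println(TenantRealm{Enabled: true, RealmName: "sme-alice"}.Validate())
	fmt.Println(TenantRealm{
		Enabled: true, RealmName: "sme-alice",
		ParentDomain: "parent.example", Subdomain: "alice",
	}.Validate())
}
```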

NEW tests/tenant-realm-oidc-clients.sh covers 6 cases:
  1. Sovereign-mode default render unchanged (kubectl + catalyst-ui +
     catalyst-api-server clients present, no tenant artefacts leak).
  2. Tenant-mode render produces exactly ONE realm CM under the
     expected name + zero leaked Sovereign-only resources.
  3. Tenant realm JSON parses + 3 OIDC clients present with the
     redirect-URI / publicClient / serviceAccountsEnabled shape per
     #915 spec; Secret bytes match realm JSON's `secret` fields.
  4. Fail-closed validation when tenant fields missing.
  5. keycloak-config-cli post-install Job projects the realm CM by
     SAME name in BOTH modes.
  6. Operator-supplied per-app clientSecret overrides the
     lookup-or-generate path.

Existing tests/observability-toggle.sh + tests/oidc-kubectl-client.sh
still pass.

Sovereign-mode unchanged. The chart now consumes the values the
orchestrator (PR #911) was already emitting; no orchestrator change
needed.

Closes #915 (C1 sub-task) and unblocks #899 (per-tenant Keycloak
realm-config materialisation).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:29:40 +04:00
e3mrah
61c8d77b58
feat(bp-openclaw): per-tenant Keycloak SSO + NewAPI as OpenAI-compatible LLM gateway (#915) (#917)
Wire bp-openclaw to the per-tenant Keycloak realm (OIDC SSO) and the
per-tenant NewAPI (OpenAI-compatible LLM endpoint, NOT direct OpenAI),
delivering C3 of umbrella epic #915.

Chart changes (bp-openclaw 0.1.0 → 0.2.0):
- Add canonical `oidc.{issuerURL,clientId,clientSecret.{name,key}}` block.
- Add canonical `llm.{baseURL,apiKey.{name,key},defaultModel}` block.
- Controller Deployment now emits OIDC_*, LLM_*, OPENAI_API_{BASE,KEY},
  LLM_DEFAULT_MODEL envs (legacy KEYCLOAK_*/NEWAPI_BASE_URL_DEFAULT
  retained for back-compat with current controller image).
- Per-user pods carry OPENAI_API_BASE / OPENAI_API_KEY / LLM_DEFAULT_MODEL
  alongside the identity-blind NEWAPI_BASE_URL / NEWAPI_KEY (ADR-0003
  §3.3 unchanged).
- Legacy `keycloak.*` / `newapi.*` keys remain accepted as fallbacks;
  helpers prefer canonical blocks but fall back to the legacy alias when
  the canonical block is unset (or still at placeholder).
- assertNoPlaceholders guard updated to check resolved canonical values.
- render-toggles.sh smoke test extended: asserts both canonical and
  legacy code-paths render and that all expected envs reach the
  rendered Deployment.

Orchestrator changes (catalyst-api smeTenantBPOpenClaw template):
- Emit per-tenant `oidc.issuerURL` = https://keycloak.<sub>.<parent>/realms/sme-<sub>
- Emit per-tenant `oidc.clientId` = openclaw, secret from
  openclaw-oidc-client-secret/OIDC_CLIENT_SECRET (rendered by
  bp-keycloak's post-install hook).
- Emit per-tenant `llm.baseURL` = https://api.<sub>.<parent>/v1 (alice's
  own NewAPI ingress, NOT the otech-wide newapi.<otech-fqdn>); apiKey
  from openclaw-newapi-controller-token/NEWAPI_KEY.
- Emit `llm.defaultModel: qwen3.6` — NewAPI uses this to select the
  backing channel; C4 of #915 wires Qwen3.6@BankDhofar at tenant-create.
- Legacy keycloak/newapi blocks still emitted for back-compat with
  bp-openclaw < 0.2.0.
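
As a rough illustration of the per-tenant values listed above, the Python sketch below derives the canonical `oidc` and `llm` blocks from a tenant subdomain and parent domain. The function name `tenant_openclaw_values` and the sample inputs are hypothetical; the actual smeTenantBPOpenClaw template in catalyst-api is not reproduced here.

```python
def tenant_openclaw_values(sub: str, parent: str) -> dict:
    """Build the canonical oidc + llm blocks for one tenant (illustrative only)."""
    host_suffix = f"{sub}.{parent}"
    return {
        "oidc": {
            # Per-tenant realm is named sme-<sub>, per the commit message.
            "issuerURL": f"https://keycloak.{host_suffix}/realms/sme-{sub}",
            "clientId": "openclaw",
            "clientSecret": {
                "name": "openclaw-oidc-client-secret",
                "key": "OIDC_CLIENT_SECRET",
            },
        },
        "llm": {
            # The tenant's own NewAPI ingress, not the otech-wide endpoint.
            "baseURL": f"https://api.{host_suffix}/v1",
            "apiKey": {
                "name": "openclaw-newapi-controller-token",
                "key": "NEWAPI_KEY",
            },
            "defaultModel": "qwen3.6",
        },
    }


# Sample inputs are placeholders, not real tenant data.
values = tenant_openclaw_values("alice", "example.test")
print(values["oidc"]["issuerURL"])
print(values["llm"]["baseURL"])
```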

Tests:
- New TestRenderSMETenantOverlay_OpenClawOIDCAndLLMBlocks asserts the
  rendered HelmRelease contains the canonical oidc + llm blocks with
  per-tenant values, and that llm.baseURL is the per-tenant
  api.<sub>.<parent>/v1 (NOT the otech-wide newapi).
- bp-openclaw render-toggles.sh extended (Case 2b/2c).

Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:26:59 +04:00
e3mrah
368545369b
fix(bp-stalwart-tenant): unbootable on fresh tenants — values shape, missing admin Secret, sec ctx (#898) (#904)
Fixes three defects that left bp-stalwart-tenant 0.1.0 unable to come up on a
freshly-franchised SME tenant. All surfaced on the otech103 alice
tenant during the Phase-1 DoD sweep.

1. Tenant-domain values shape (HelmRelease render error)

   The 0.1.0 chart referenced `.Values.domain.primary` in five
   templates. The live HR on otech103 had `values.domain:
   acme.omani.works` (a string), emitted by a pre-#897 catalyst-api
   build, so every reconcile died with:

     can't evaluate field primary in type interface {}

   Added `bp-stalwart-tenant.tenantDomain` + `tenantMode` helpers
   that resolve in priority order:

     1. `tenant.domain`        (forward-looking flat shape)
     2. `domain.primary`       (canonical post-#897 map shape)
     3. `domain` (string)      (legacy pre-#897 shape — back-compat)

   Returns "" when none is set, which keeps smoke renders safe;
   per-template gates skip rendering when the value is empty.

2. Missing stalwart-admin Secret

   deployment.yaml + mailbox-provision-job.yaml reference a Secret
   key `ADMIN_PASSWORD` on `.Values.admin.secretName`. The 0.1.0
   chart only emitted an ExternalSecret, and only when
   `admin.externalSecret.remoteRef.key` was non-empty (smoke-render
   concession). Fresh tenants land in CreateContainerConfigError.

   Added `templates/admin-secret.yaml` mirroring marketplace-api/
   secret.yaml (#887): random 32-char ADMIN_PASSWORD generated by
   sprig randAlphaNum, persisted across reconcile via lookup,
   helm.sh/resource-policy: keep so reinstall picks it back up.
   Auto-disabled when an authoritative ExternalSecret is wired —
   no double-bind between two controllers.

3. Pod sec ctx vs. upstream image's file capabilities

   `getcap docker.io/stalwartlabs/stalwart:v0.16.3 /usr/local/bin/
   stalwart` reports `cap_net_bind_service=ep`. The image creates
   user `stalwart` at UID 2000 and the binary IS the entrypoint
   (no demotion script). The 0.1.0 chart ran as UID 65534 with
   `drop: ALL` — kernel refuses to elevate file caps with empty
   bounding set, so exec failed with `operation not permitted`.

   Aligned to image's native UID 2000, kept `drop: ALL` and added
   `NET_BIND_SERVICE` explicitly. fsGroup 2000 ensures /opt/stalwart
   PVC is writable.
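
The priority order described in fix 1 above could be sketched in Python as follows. This is a minimal model of the lookup, assuming the Helm values are available as a plain dict; the function name `tenant_domain` is hypothetical and the real helper is a Helm template, not Python.

```python
def tenant_domain(values: dict) -> str:
    """Resolve the tenant domain from any of the three accepted value shapes."""
    # 1. tenant.domain (forward-looking flat shape)
    tenant = values.get("tenant")
    if isinstance(tenant, dict) and tenant.get("domain"):
        return tenant["domain"]

    domain = values.get("domain")
    # 2. domain.primary (canonical post-#897 map shape)
    if isinstance(domain, dict) and domain.get("primary"):
        return domain["primary"]
    # 3. domain as a bare string (legacy pre-#897 shape)
    if isinstance(domain, str) and domain:
        return domain

    # Nothing set: return "" so callers can skip rendering.
    return ""


print(tenant_domain({"domain": "acme.omani.works"}))
print(tenant_domain({"domain": {"primary": "acme.omani.works"}}))
```

Checking the type before reading `primary` is what avoids the "can't evaluate field primary" failure on the legacy string shape.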

Other:
- Bumped Chart.yaml + blueprint.yaml to 0.1.1 (#817 alignment).
- configSchema in blueprint.yaml now permits the legacy + tenant
  shapes alongside the canonical map.
- mailboxProvisioner.setupJob.enabled defaults to false until the
  canonical stalwart-cli image is published (re-uses upstream
  stalwart container as fallback CLI host).

Acceptance: targeted at otech103 alice tenant
(sme-789ae512-bc0f-467c-a016-001f5496c403) where 0.1.0 reconciliation
fails with the value-shape error and the pod CrashLoops with `exec
... operation not permitted`. Verification on otech103 in #898.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:55:03 +04:00
e3mrah
93c4b700de
fix(bp-keycloak): templatize existingConfigmap reference for per-tenant installs (#899) (#902)
bp-keycloak 1.3.2 hardcoded `keycloak.keycloakConfigCli.existingConfigmap` to
the literal "keycloak-sovereign-realm-config". This worked for the Sovereign-
mothership bootstrap-kit (releaseName=keycloak emits matching ConfigMap) but
broke for every per-tenant install where releaseName=bp-keycloak emits
"bp-keycloak-sovereign-realm-config" — the post-install keycloak-config-cli
Job stuck in ContainerCreating with `MountVolume.SetUp failed for volume
"config-volume" : configmap "keycloak-sovereign-realm-config" not found`,
HelmRelease InstallFailed after 15m timeout, cascading to bp-openclaw and
bp-wordpress-tenant which dependsOn it.

The bitnami/keycloak subchart's `keycloak.keycloakConfigCli.configmapName`
helper (charts/keycloak/templates/_helpers.tpl) applies `tpl` to the
existingConfigmap value, so embedding `{{ .Release.Name }}` inside the
string resolves at chart-render time. With this single-line change:

  - Sovereign-mothership (releaseName=keycloak) → keycloak-sovereign-realm-config (unchanged)
  - Per-tenant (releaseName=bp-keycloak)        → bp-keycloak-sovereign-realm-config (matches actual emitted ConfigMap)

Verified via helm template both modes — backendRef and config-volume
configMap.name match the actual ConfigMap emitted by
templates/configmap-sovereign-realm.yaml.

Chart bumped 1.3.2 → 1.3.3 + bootstrap-kit slot 09 + blueprint.yaml.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:49:39 +04:00
e3mrah
eddf0e62a4
fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter (#891) (#892)
* fix(catalyst-api): SME-tenant orchestrator writes parent kustomization.yaml index (#889)

The Flux Kustomization rendered by bp-catalyst-platform 1.4.13+ at
clusters/<sov-fqdn>/sme-tenants/ requires a parent kustomization.yaml
that enumerates tenant subdirectories. The orchestrator only wrote
per-tenant overlays without the parent index, so on otech103 Flux
hit:

  kustomization path not found: stat /tmp/kustomization-...
  /clusters/otech103.omani.works/sme-tenants: no such file or directory

Even after a tenant signup, the parent path lacked a kustomization.yaml
so Flux couldn't enumerate subdirs.

Fix: NEW writeParentTenantsIndex helper called from both
WriteTenantOverlay and DeleteTenantOverlay. Scans the parent dir for
subdirectories that contain kustomization.yaml, sorts them lexically
for deterministic output (no spurious diffs), and writes a parent
kustomization.yaml listing them under `resources:`. Empty list (no
tenants) renders as `resources: []` — still a valid Kustomization
root, so Flux stays Ready=True after the last tenant teardown.
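
A minimal Python sketch of the index-writing behaviour described above, for illustration only. The real helper is `writeParentTenantsIndex` in catalyst-api; the Python name, the temporary directory layout, and the tenant names below are assumptions.

```python
import os
import tempfile
from typing import List


def write_parent_tenants_index(parent_dir: str) -> List[str]:
    """Write parent_dir/kustomization.yaml listing tenant subdirs; return them."""
    # Only subdirectories that carry their own kustomization.yaml count,
    # sorted lexically so repeated runs produce identical output.
    tenants = sorted(
        entry.name
        for entry in os.scandir(parent_dir)
        if entry.is_dir()
        and os.path.isfile(os.path.join(entry.path, "kustomization.yaml"))
    )
    lines = ["apiVersion: kustomize.config.k8s.io/v1beta1", "kind: Kustomization"]
    if tenants:
        lines.append("resources:")
        lines.extend(f"  - {name}" for name in tenants)
    else:
        # No tenants left: still a valid Kustomization root.
        lines.append("resources: []")
    with open(os.path.join(parent_dir, "kustomization.yaml"), "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return tenants


with tempfile.TemporaryDirectory() as parent:
    for name in ("tenant-b", "tenant-a"):
        os.makedirs(os.path.join(parent, name))
        with open(os.path.join(parent, name, "kustomization.yaml"), "w") as fh:
            fh.write("resources: []\n")
    os.makedirs(os.path.join(parent, "scratch"))  # no kustomization.yaml, ignored
    listed = write_parent_tenants_index(parent)
    with open(os.path.join(parent, "kustomization.yaml"), encoding="utf-8") as fh:
        index_text = fh.read()

print(listed)
print(index_text)
```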

git add covers both the per-tenant subdir AND the parent index, so a
single commit captures the delta.

Live on otech103 post-cutover, 2026-05-05.

* fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter to include clusters/<sov-fqdn>/ (#891)

After Day-2 cutover, the GitRepository ignore filter excluded the
Sovereign's own clusters/<sov-fqdn>/ subtree. This made every
Sovereign-specific Flux Kustomization (sme-tenants, future per-Sov
overlays) hit "kustomization path not found" because source-controller
filtered the path out of the artifact tarball.

Live on otech103 (2026-05-05): sme-tenants Kustomization stuck for
20+ minutes despite the orchestrator successfully committing the
overlay to local Gitea.

Fix: Step-5 (flux-gitrepository-patch) now writes the patch as a
multi-line YAML strategic-merge file via /tmp emptyDir (since the
Pod runs readOnlyRootFilesystem), composing the new ignore filter:

  /*
  !/clusters/_template
  !/clusters/${SOVEREIGN_FQDN}
  !/platform
  !/products

The SOVEREIGN_FQDN is wired from .Values.sovereign.fqdn (already
established in the chart values).
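
For illustration, the ignore filter above can be composed from the Sovereign FQDN as in this Python sketch. The function names are hypothetical, and the patch is shown as a plain dict setting the GitRepository `spec.ignore` field; the actual Step-5 script writes a YAML patch file from a shell Job.

```python
import json


def gitrepository_ignore(sovereign_fqdn: str) -> str:
    """Return the multi-line ignore filter, one rule per line."""
    rules = [
        "/*",  # exclude everything at the repo root
        "!/clusters/_template",
        f"!/clusters/{sovereign_fqdn}",  # the Sovereign's own subtree
        "!/platform",
        "!/products",
    ]
    return "\n".join(rules) + "\n"


def gitrepository_patch(sovereign_fqdn: str) -> dict:
    """Build a patch body that replaces spec.ignore on the GitRepository."""
    return {"spec": {"ignore": gitrepository_ignore(sovereign_fqdn)}}


patch = gitrepository_patch("otech103.omani.works")
print(json.dumps(patch, indent=2))
```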

Bumps chart 0.1.14 -> 0.1.15. Slot 06a pin bumps in lockstep.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 09:39:42 +04:00
e3mrah
8e4c88fd28
fix(bp-self-sovereign-cutover): auto-sync local Gitea mirror from upstream GitHub (#870) (#875)
Step-1 gitea-mirror Job replaces the legacy one-shot create-empty-repo +
git-push pattern with a single call to Gitea's native /repos/migrate API
with mirror=true and mirror_interval=10m0s. Gitea now polls the upstream
openova-io/openova repo on a 10-minute interval and replicates branches
+ tags into the local Sovereign Gitea automatically.

Closes the "Sovereign drifts from upstream main forever after Day-2
cutover" bug — hit twice during the otech103 2026-05-04 overnight DoD
session, requiring manual `git fetch` inside the Gitea pod for every
chart rollout.

Why /repos/migrate over the previous git push approach:
- Gitea cannot convert a regular repo into a pull-mirror after creation
  (the mirror flag is set at create-time only). The migrate endpoint
  creates the repo AS a mirror in one shot.
- The migrate endpoint accepts toggles for issues / pull-requests /
  wiki / labels / milestones / releases — we set them all to false so
  Gitea only replicates branches+tags, the only refs the Sovereign's
  Flux GitRepository needs.
- Recurring sync is a Gitea-native capability; using it avoids a
  parallel CronJob (which would violate the "event-driven not cron"
  inviolable principle) or a long-poll sidecar (which would duplicate
  what Gitea already does).

Idempotency: if the repo already exists from a prior cutover attempt,
the script PATCHes mirror_interval to the desired value and POSTs to
/mirror-sync to trigger an immediate refresh. Note that PATCH alone
cannot convert a legacy non-mirror repo to a mirror — Sovereigns
seeded by chart < 0.1.14 would need an operator-driven repo delete +
re-migrate to retro-fit auto-sync, but new provisions take the
migrate path automatically.
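
The create-or-refresh flow above can be outlined as a request plan. The Python sketch below only builds the requests and sends nothing; the helper names, the local owner `openova`, and the upstream clone URL are assumptions, and the field names follow Gitea's migrate and edit-repo API options as best understood here.

```python
def migrate_payload(clone_addr: str, owner: str, name: str, interval: str = "10m0s") -> dict:
    """Body for POST /api/v1/repos/migrate that creates the repo as a pull mirror."""
    return {
        "clone_addr": clone_addr,
        "repo_owner": owner,
        "repo_name": name,
        "mirror": True,
        "mirror_interval": interval,
        # Replicate branches and tags only.
        "issues": False,
        "pull_requests": False,
        "wiki": False,
        "labels": False,
        "milestones": False,
        "releases": False,
    }


def plan_mirror_requests(repo_exists: bool, clone_addr: str, owner: str, name: str) -> list:
    """Return (method, path, body) tuples for the idempotent mirror setup."""
    if not repo_exists:
        return [("POST", "/api/v1/repos/migrate", migrate_payload(clone_addr, owner, name))]
    repo_path = f"/api/v1/repos/{owner}/{name}"
    return [
        ("PATCH", repo_path, {"mirror_interval": "10m0s"}),  # keep interval in sync
        ("POST", f"{repo_path}/mirror-sync", None),  # trigger an immediate refresh
    ]


upstream = "https://github.com/openova-io/openova.git"  # assumed clone URL
fresh = plan_mirror_requests(False, upstream, "openova", "openova")
rerun = plan_mirror_requests(True, upstream, "openova", "openova")
print(fresh[0][:2])
print([step[:2] for step in rerun])
```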

Verification on the rendered ConfigMap:
  $ helm template smoke .                   # renders 16 docs cleanly
  $ bash tests/cutover-contract.sh          # all 7 gates green
  $ sh -n <rendered-script>                 # POSIX shell syntax OK

Chart bumped 0.1.13 → 0.1.14 (Chart.yaml + blueprint.yaml spec.version
aligned per #817 invariant + slot 06a-bp-self-sovereign-cutover.yaml
pin lockstep).

Refs #870, #790.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 08:35:40 +04:00
e3mrah
9b710049e3
fix(self-sovereign-cutover): Step-8 baseline-diff (only NEW regressions count) (#858)
Live otech103: Step-8 survival window failed because the infrastructure-config Kustomization had been NotReady for 4h pre-cutover (Crossplane provider CRD ordering, unrelated to sovereignty). The sovereignty proof asks 'did the cutover break anything', not 'is the cluster perfect'. Step-8 now captures the baseline NotReady set before the window and fails only on NEW additions during it.
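
The baseline-diff idea amounts to a set difference. A minimal Python sketch, assuming the NotReady resource names have already been collected into sets (the function name and the sample polls are hypothetical; the real check runs in the Step-8 shell script):

```python
def new_regressions(baseline_not_ready: set, current_not_ready: set) -> set:
    """Resources that are NotReady now but were Ready before the window."""
    return current_not_ready - baseline_not_ready


# Captured once, before the survival window starts.
baseline = {"infrastructure-config"}

# Two sample polls taken during the window.
poll_ok = {"infrastructure-config"}
poll_bad = {"infrastructure-config", "sme-tenants"}

print(new_regressions(baseline, poll_ok))   # pre-existing failure is tolerated
print(new_regressions(baseline, poll_bad))  # a new NotReady resource fails the step
```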

Bumps 0.1.12 → 0.1.13 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 04:20:16 +04:00
e3mrah
d5d1d9b2cd
fix(self-sovereign-cutover): Step-8 tolerate slot-managed self-ref HelmRepositories (#857)
Live otech103: Step-8 verification flagged 2 HelmRepositories (bp-newapi + bp-self-sovereign-cutover) still on ghcr.io/openova-io. Both are declared in clusters/_template/bootstrap-kit/ slot files which Flux Kustomization re-applies on every reconcile — Step-6's patch is transient for them. Data-plane impact is null because they're not pulled again until the next cutover cycle which would re-apply the patch first. The 38 leaf-bp HelmRepositories ARE patched durably (live in HelmRelease values, not separate slot files).

Bumps 0.1.11 → 0.1.12 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 04:06:41 +04:00
e3mrah
142ea21534
fix(self-sovereign-cutover): Step-8 passive architectural verification (Cilium can't egressDeny+toFQDNs) (#856)
Live otech103: Step-8 (egress-block-test) failed because Cilium 1.16's CiliumNetworkPolicy schema doesn't support 'spec.egressDeny[].toFQDNs' — strict-decoding error 'unknown field'. FQDN-based matching in Cilium is only allowed in 'egress' (allow), not 'egressDeny'.

Pivot: Step-8 now asserts the architectural pivots from Steps 5-7 are actually live (GitRepository.url + all HelmRepositories + catalyst-api env all point at local Gitea/Harbor) BEFORE entering the durationSeconds survival window during which Flux Kustomization + HelmRelease readiness is polled. Same sovereignty proof, expressed in a form Cilium can evaluate.

Bumps 0.1.10 → 0.1.11 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 03:22:30 +04:00
e3mrah
86ae235804
fix(self-sovereign-cutover): catalyst-api namespace catalyst-system not catalyst-platform (#855)
Live otech103: Step-7 (catalyst-api-env-patch) hit 'deployments.apps catalyst-api not found' in catalyst-platform ns. Actual Sovereign-side namespace is catalyst-system. Bumps 0.1.9 → 0.1.10.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:59:11 +04:00
e3mrah
dd84060d05
fix(self-sovereign-cutover): switch from bitnami/kubectl to alpine/k8s (#854)
Live otech103 2026-05-04: bitnami/kubectl:1.31.4 404 on Docker Hub. Bitnami deprecated public Docker Hub registry in 2025; their kubectl image stopped getting tags. alpine/k8s is the canonical alpine-based replacement — kubectl + helm + standard k8s CLI surface, actively maintained, :1.31.4 verified present.

Bumps 0.1.8 → 0.1.9 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:55:46 +04:00
e3mrah
887ff62200
fix(self-sovereign-cutover): bitnami/kubectl tag :1.31 → :1.31.4 (#853)
Live otech103 2026-05-04: Step-5 (flux-gitrepository-patch) Pod DeadlineExceeded after 10m of ImagePullBackOff. bitnami/kubectl on DockerHub doesn't have a floating :1.31 tag — only patch-level :1.31.X. Pin to :1.31.4 (latest of 1.31 minor as of today).

Bumps 0.1.7 → 0.1.8 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:42:54 +04:00
e3mrah
e9970db7b6
fix(self-sovereign-cutover): proxy-quay adapter type docker-registry (#852)
Live otech103: Harbor rejects project create with metadata.proxy_cache=true on registries with type 'quay' — HTTP 400 'unsupported registry type quay'. Quay speaks plain v2 so docker-registry is the correct adapter (4/7 projects ahead succeeded with the same shape). Bumps 0.1.6 → 0.1.7.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:29:26 +04:00
e3mrah
ea51642092
fix(self-sovereign-cutover): proxy-ghcr Harbor adapter type 'github-ghcr' (#851)
Live otech103 2026-05-04: Step-2 harbor-projects POST /api/v2.0/registries returns 500 'adapter factory for github not found'. Harbor 2.x's canonical GHCR proxy-cache adapter is named 'github-ghcr', not 'github'.

Bumps 0.1.5 → 0.1.6 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:26:51 +04:00
e3mrah
8f96daeb6f
fix(self-sovereign-cutover): harbor service is 'harbor-core' not 'harbor-harbor-core' (#849)
Live failure on otech103 2026-05-04: Step-2 (harbor-projects) Pod exits silently after the first echo because curl exits with code 6 (CURLE_COULDNT_RESOLVE_HOST). The chart's default harborInternalURL was http://harbor-harbor-core.harbor.svc.cluster.local but the actual bitnami harbor chart's service name is harbor-core (the release name isn't double-prefixed when targetNamespace == 'harbor' AND releaseName == 'harbor').

Fix: harborInternalURL → http://harbor-core.harbor.svc.cluster.local. Verified via 'kubectl get svc -n harbor' on otech103.

Bumps 0.1.4 → 0.1.5 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:16:41 +04:00
e3mrah
ab5681e656
fix(self-sovereign-cutover): Step-1 use bare clone + explicit refspec push (#848)
Live failure on otech103 2026-05-04 even after 0.1.3: git push --all in a mirror clone still pushes refs/pull/* because mirror clones store all upstream refs (incl. GitHub PR refs) at the same level as refs/heads/, and --all walks the whole local refstore.

Fix: use git clone --bare (not --mirror) which only fetches refs/heads/* and refs/tags/*, then push with explicit refspecs:
  git push origin 'refs/heads/*:refs/heads/*'
  git push origin 'refs/tags/*:refs/tags/*'
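
To make the effect of the explicit refspecs concrete, this Python sketch filters a list of ref names the way the two push commands above select them. It is only a model of the selection, not of git itself; the sample ref names are made up.

```python
# Namespaces covered by the two explicit refspecs.
PUSH_PREFIXES = ("refs/heads/", "refs/tags/")


def refs_to_push(refs: list) -> list:
    """Keep branch and tag refs; drop everything else (e.g. refs/pull/*)."""
    return [ref for ref in refs if ref.startswith(PUSH_PREFIXES)]


local_refs = [
    "refs/heads/main",
    "refs/pull/848/head",  # GitHub PR ref, present in a --mirror clone
    "refs/tags/v0.1.4",
]
print(refs_to_push(local_refs))
```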

Bumps 0.1.3 → 0.1.4 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:59:25 +04:00