db618cc5eb
239 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2c32fde847
|
feat(epic-5): NetBird mesh + ClusterMesh activator + DMZ vCluster scaffolds (#1100) (#1171)
Closes the EPIC-5 leftovers (per .claude/architect-briefs/epic-5/00-master-brief-leftovers.md):
* NB — bp-netbird platform Blueprint chart (default-OFF, SHA-pinned,
fail-fast). Renders 12 resources ON: 3 Deployments (management + signal
+ coturn) + 3 Services + 1 PVC + 1 HTTPRoute + 1 NetworkPolicy +
2 SealedSecrets + 1 ConfigMap. KC realm-config ConfigMap mirrors the
Guacamole pattern from slice K+P+X1+G #1164 — adds `netbird` OIDC client
+ `netbird-user` / `netbird-admin` realm roles + `netbird-users` /
`netbird-admins` groups.
* CM — ClusterMesh activator slice on the existing Cilium chart. Adds
platform/cilium/chart/values-clustermesh.yaml (operator-applied values
overlay) + templates/clustermesh-config.yaml (renders the
catalyst-clustermesh-config ConfigMap when cluster.name + cluster.id are
set per-Sovereign). Operator runbook for `cilium clustermesh enable` +
`cilium clustermesh connect` documented inline. Default Cilium chart
render is unchanged — this slice is purely additive + opt-in.
* DMZ — bp-dmz-vcluster product Blueprint chart (default-OFF, SHA-pinned,
fail-fast). Renders 4 resources ON without hostname (HelmRelease wrapping
upstream loft-sh/vcluster + Service + 2 NetworkPolicies); 5 resources
with HTTPRoute hostname. Isolation pattern: own openova-system namespace
inside host cluster → own Cilium identity → default-deny +
allow-essentials NetworkPolicies → public egress only via designated
egress gateway.
All 3 charts: helm lint clean. Tests at chart/tests/render.sh +
chart/tests/clustermesh-overlay.sh.
Pre-existing CI flakes per canon §7 remain — they're not introduced by
this slice.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
639b94fe55
|
feat(epic-4): K+P+X1+G — k8s-ws-proxy + projector + WebSocket logs + Guacamole chart (#1099) (#1164)
EPIC-4 Slice K+P+X1+G — bundled backend infrastructure for the
"k9s-on-web" Cloud Resources experience:
K1 — core/cmd/k8s-ws-proxy/ — per-node WebSocket exec proxy.
HMAC-signed (X-Catalyst-HMAC: SHA256({timestamp}:{path})) WebSocket
upgrades on /proxy/exec/{ns}/{pod}/{container} bridged to the local
kube-apiserver via in-cluster ServiceAccount. v4.channel.k8s.io
subprotocol echo. Optional TMUX_CASCADE wraps in a shared
catalyst-ops tmux session. Shipped as a DaemonSet + Service with
internalTrafficPolicy=Local in platform/k8s-ws-proxy/chart/.
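A minimal sketch of the K1 signing scheme described above, written in Python for brevity rather than the proxy's own language. It covers the cases the test list names (expired-old, expired-future, malformed, bad-signature). The header encoding (`<timestamp>:<hex digest>`), the 300-second skew window, and the function names are assumptions for illustration; the commit only states that the signature covers `{timestamp}:{path}`.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical tolerance; the real window is not stated in the commit.
MAX_SKEW_SECONDS = 300


def sign(secret: bytes, timestamp: int, path: str) -> str:
    """Return "<timestamp>:<hex digest>" where the digest covers "{timestamp}:{path}"."""
    message = f"{timestamp}:{path}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{timestamp}:{digest}"


def verify(secret: bytes, header: str, path: str, now: Optional[int] = None) -> bool:
    """Check shape, timestamp freshness (both directions), then the signature."""
    now = int(time.time()) if now is None else now
    ts_text, sep, digest = header.partition(":")
    if not sep or not (ts_text.isascii() and ts_text.isdigit()):
        return False  # malformed header
    timestamp = int(ts_text)
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # expired-old or expired-future
    expected = sign(secret, timestamp, path).partition(":")[2]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), digest.encode())
```

Because the path is part of the signed message, a header minted for one `/proxy/exec/...` target does not validate against another.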
P1 — core/cmd/projector/ — NATS catalyst.events JetStream → Valkey
KV projector. Canonical key shape:
cluster:{cluster-id}:kind:{kind}:{namespace}/{name}
Cold-start does a full LIST across DefaultKinds, then catches up on
the 24h replay window. Multi-replica safe (durable consumer queue
group, last-write-wins on namespacedName). Shipped as a default-OFF
Deployment + RBAC under products/catalyst/chart/templates/services/projector/.
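The canonical key shape and the last-write-wins behaviour can be sketched as follows. This is an illustration in Python with a plain dict standing in for Valkey; the event field names and the key form used for cluster-scoped objects (name only, no namespace prefix) are assumptions, since the commit gives only the namespaced shape.

```python
from typing import Optional


def projection_key(cluster_id: str, kind: str, name: str,
                   namespace: Optional[str] = None) -> str:
    """Build cluster:{cluster-id}:kind:{kind}:{namespace}/{name}."""
    for label, value in (("cluster_id", cluster_id), ("kind", kind), ("name", name)):
        if not value:
            raise ValueError(f"{label} must be non-empty")
    # Assumption: cluster-scoped objects drop the "{namespace}/" segment.
    scope = f"{namespace}/{name}" if namespace else name
    return f"cluster:{cluster_id}:kind:{kind}:{scope}"


def apply_event(store: dict, event: dict) -> None:
    """Project one event into the store; a later write simply overwrites."""
    key = projection_key(event["cluster"], event["kind"],
                         event["name"], event.get("namespace"))
    if event["type"] in ("ADD", "MOD"):
        store[key] = event["object"]
    elif event["type"] == "DEL":
        store.pop(key, None)  # deleting a missing key is not an error
    else:
        raise ValueError(f"unknown event type: {event['type']}")
```

Keying on the namespaced name means two replicas applying the same event converge on the same entry, which is what makes the multi-replica setup safe in this sketch.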
X1 — products/catalyst/bootstrap/api/internal/handler/k8s_logs.go —
WebSocket Pod-log streaming endpoint:
GET /api/v1/sovereigns/{id}/k8s/logs/{ns}/{pod}/{container}
?follow&tailLines&since=<rfc3339>&previous
Reads from kubelet via client-go GetLogs().Stream(); each WS frame =
one log line. Supports `since` resume. Reuses RequireSession middleware
+ chroot cluster-id resolver. New k8scache.Factory.CoreClient(id)
accessor exposes the per-cluster typed client without duplicating
kubeconfig parsing.
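A rough Python sketch of how the X1 query string might be parsed. The real `parseLogOptions` lives in the Go handler; the defaults, the accepted boolean spellings, and the returned dict shape here are assumptions for illustration only.

```python
from datetime import datetime
from urllib.parse import parse_qs


def parse_log_options(query: str) -> dict:
    """Parse ?follow&tailLines&since=<rfc3339>&previous into a dict of options."""
    params = parse_qs(query, keep_blank_values=True)

    def flag(name: str) -> bool:
        # A bare "?follow" (blank value) counts as true.
        if name not in params:
            return False
        value = params[name][-1].lower()
        if value in ("", "1", "true"):
            return True
        if value in ("0", "false"):
            return False
        raise ValueError(f"bad boolean for {name}: {value!r}")

    opts = {"follow": flag("follow"), "previous": flag("previous"),
            "tail_lines": None, "since": None}
    if "tailLines" in params:
        tail = int(params["tailLines"][-1])  # raises ValueError on junk
        if tail < 0:
            raise ValueError("tailLines must be >= 0")
        opts["tail_lines"] = tail
    if "since" in params:
        # fromisoformat on older Pythons does not accept a trailing "Z".
        raw = params["since"][-1].replace("Z", "+00:00")
        opts["since"] = datetime.fromisoformat(raw)
    return opts
```

A client resuming a dropped stream would pass the timestamp of the last line it received as `since`.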
G1 — platform/guacamole/chart/ — full Apache Guacamole chart:
guacd Deployment + Service, Tomcat webapp Deployment + Service,
Cilium Gateway HTTPRoute, SeaweedFS-PVC for recordings (RWO,
hcloud-volumes), SealedSecret placeholder for Keycloak OIDC client
secret, NetworkPolicy (default-deny + selective egress to KC +
k8s-ws-proxy + SeaweedFS + NATS), and ConfigMap consumed by
keycloak-config-cli post-deploy Job (mirrors platform/keycloak
realm-config pattern). Default-OFF gate; full-ON renders 9
resources. Empty image.tag / hostname / oidc.issuer fail-fast at
helm template time per INVIOLABLE-PRINCIPLES #4a/#5. ONE Guacamole
per Sovereign per ADR-0001 §11. Blueprint manifest uses
v1alpha1 + version "0.1.0" + upgrades.from ["0.x"].
Tests:
- k8s-ws-proxy: HMAC happy/expired-old/expired-future/malformed/
bad-signature, path-only signature, WS upgrade + protocol echo,
bad path, bad HMAC, denied namespace via httptest.
- projector: Apply ADD/MOD/DEL/validation, key shape (ns-scoped +
cluster-scoped), handleOne ack/nak/term routing with fakeMsg,
cold-start LIST + project + error continuation via dynamicfake.
- X1: parseLogOptions defaults + edge cases + bad query params,
503/404/400 paths + full WS happy-path with kfake clientset.
- G1: chart/tests/render.sh — default-OFF=0, empty-tag fail-fast,
full-ON=9 resources, every required kind present, realm-config
wires OIDC client.
- bp-k8s-ws-proxy chart: chart/tests/render.sh — default-OFF=0,
empty-tag fail-fast, full-ON=5 resources.
Pre-existing test status: TestPinIssue and TestBootstrapKit/gitea
remain flaky on main per canon §7 — verified not introduced by
this slice.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a0c356fe34
|
fix(cnpg-pair): drop bp-cnpg: prefix from upgrades.from semver range (#1156)
Other platform/*/blueprint.yaml files use bare semver-range strings
(e.g. ["0.x"]) without the bp-name: prefix. C3 blueprint-controller's
validate package rejects "bp-cnpg:1.x" as an invalid semver range,
breaking TestValidate_ExistingBlueprintCorpus on any PR after #1153.
Found by EPIC-6 K-Cont-2 (#1155). Brief at C-DB-1 (.claude/architect-briefs/
epic-6/02-) was wrong — the slice author followed the brief literally.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
746901b671
|
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.
What ships:
platform/cnpg-pair/
├── chart/
│ ├── Chart.yaml # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
│ ├── values.yaml # default-OFF gate; placement schema constrains active-hotstandby ONLY
│ ├── templates/
│ │ ├── _helpers.tpl # fail-fast on empty image.tag; region pair validation
│ │ ├── primary-cluster.yaml # CNPG Cluster CR (region-pinned via openova.io/region affinity)
│ │ ├── replica-cluster.yaml # CNPG Cluster CR (replica.enabled=true; externalClusters[])
│ │ ├── service-replication.yaml # Cilium ClusterMesh global Service
│ │ ├── failover-readiness.yaml # probe Pod flips Ready when WAL lag < threshold
│ │ ├── networkpolicy.yaml # default-deny carve-outs for replication + probe
│ │ └── audit-config.yaml # NATS audit subjects + types this Blueprint emits
│ ├── blueprint.yaml # configSchema + placementSchema (active-hotstandby ONLY)
│ ├── README.md # 80-line deployment + failover semantics
│ └── tests/cnpg-pair-render.sh # 5-case render gate
└── DESIGN.md # topology, lag-threshold rationale, deferred C-DB-3 plan
Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.
CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds a
`catalyst.openova.io/smoke-render-mode: default-off`
annotation opt-in so charts that legitimately
render zero at default values (this chart + future bp-*-pair
Blueprints) skip the `<5 lines` empty-render check. The chart's own
tests/cnpg-pair-render.sh covers the enabled-render path; without
the annotation the empty-render check still fires unchanged.
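The gate logic described above can be sketched in Python as below. The actual check is a step in blueprint-release.yaml; how that step counts lines (here: non-blank lines of the rendered output) and the function name are assumptions.

```python
SMOKE_MODE_ANNOTATION = "catalyst.openova.io/smoke-render-mode"
MIN_RENDERED_LINES = 5  # the "<5 lines" threshold named in the commit


def smoke_render_ok(chart_annotations: dict, rendered: str) -> bool:
    """Return True when the chart passes the empty-render smoke check."""
    # Opt-in: charts that legitimately render nothing at defaults skip the check.
    if chart_annotations.get(SMOKE_MODE_ANNOTATION) == "default-off":
        return True
    # Assumption: only non-blank lines count toward the threshold.
    non_blank = [line for line in rendered.splitlines() if line.strip()]
    return len(non_blank) >= MIN_RENDERED_LINES
```

Without the annotation the check behaves as before, so existing charts that accidentally render nothing still fail.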
Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
- service.cilium.io/global=true ClusterMesh global Service annotation
(first chart in the repo to use it; pattern reused by Continuum
K-Cont-2 for HTTPRoute weight=0 cross-region drains)
- bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
Cluster CRs colocated in one Blueprint, region-pinned via
openova.io/region node-affinity)
- audit-config ConfigMap co-located with the emitting Blueprint
(label-selector discovery for K-Cont-2 + U-DR-1; future
bp-*-pair Blueprints follow this convention)
- smoke-render-mode=default-off Chart.yaml annotation opt-in for
the blueprint-release smoke gate
C-DB-2 (publish): existing blueprint-release.yaml workflow
auto-detects `platform/*/chart/**` paths — no allowlist edit required.
First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.
C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.
Tests:
- bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
- helm lint platform/cnpg-pair/chart ✓ clean
- helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
- smoke-gate logic simulated locally ✓ default-off annotation honored
Pre-existing CI failures untouched:
- TestPinIssue rate-limit flake — not affected by chart-only slice
- TestBootstrapKit/gitea version drift — only iterates over a fixed
10-chart bootstrap list (no cnpg-pair entry)
Out of scope per brief (all deferred to dedicated slices):
- K-Cont-2 reconciler logic
- K-Cont-3 lease witness
- K-Cont-4 Cloudflare Worker
- C-DB-3 1M-row acceptance test
- Application controller changes
- U-DR-1 UI
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a6ccdcef41
|
feat(rbac): /rbac/assign find-or-create + /rbac/access-matrix + boundary validator (slice A, #1098) (#1143)
EPIC-3 slice A bundles three deliverables on top of the just-landed
slice T1 (5-tier ClusterRoles):
A1 — POST /api/v1/sovereigns/{id}/rbac/assign
Find-or-create-role endpoint backing the multi-grant editor (slice
U1). Race-tolerant 409 retry follows the EnsureUser pattern. Three
paths: created / updated (tier rotation on existing scope) / no-op.
Authoring side: writes UserAccess CR with metadata.labels[
catalyst.openova.io/tier]=<tier> + spec.tierRoleRef + spec.scopes[].
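The three A1 outcomes and the 409 retry can be illustrated with an in-memory stand-in for the apiserver. This is a hedged Python sketch, not the handler's code: the `Conflict` exception, the store class, the `ua-<user>` naming, and the retry count are all hypothetical.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 from the apiserver."""


class InMemoryStore:
    def __init__(self):
        self._items = {}  # name -> (resource_version, spec)

    def get(self, name):
        return self._items.get(name)

    def create(self, name, spec):
        if name in self._items:
            raise Conflict(name)
        self._items[name] = (1, spec)

    def update(self, name, spec, expected_version):
        version, _ = self._items[name]
        if version != expected_version:
            raise Conflict(name)
        self._items[name] = (version + 1, spec)


def assign(store, user, tier, scopes, max_retries=3):
    """Find-or-create a grant; returns "created", "updated", or "no-op"."""
    name = f"ua-{user}"  # hypothetical naming; the real CR name is derived differently
    desired = {"tier": tier, "scopes": sorted(scopes)}
    for _ in range(max_retries):
        current = store.get(name)
        try:
            if current is None:
                store.create(name, desired)
                return "created"
            version, spec = current
            if spec == desired:
                return "no-op"
            store.update(name, desired, version)
            return "updated"
        except Conflict:
            continue  # another writer raced us; re-read and try again
    raise Conflict(name)
```

Re-reading after a conflict is what turns a lost create race into an ordinary update or no-op on the next pass.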
A2 — GET /api/v1/sovereigns/{id}/rbac/access-matrix
Manara-style users × applications × tier matrix with per-CR
warnings (developer-tier missing env-type=dev surfaces inline).
Optional org/application filters. Pure aggregator extracted for
testability — no apiserver, no clock.
A3 — Kyverno ClusterPolicy `useraccess-boundary`
Denies cross-Organization UserAccess grants unless the requester
is a member of a management Org with tier=owner. Default Audit
(values-driven action). Test fixtures + kyverno-test.yaml shape
ready for kyverno-CLI CI step in a follow-up slice.
UserAccess CRD extension:
- spec.tierRoleRef (string, openova:tier-* pattern)
- spec.scopes[] ({key, value})
- applications[] no longer required (legacy + new shapes coexist)
Test coverage (26 new tests, race-clean):
- A1: 3-path find-or-create, 409 retry, validation, 404
- A2: matrix shape + filters + warnings, http happy/empty/404
- Pure helpers: scope normalization/equality, CR-name determinism
Pre-existing failure `TestPinIssue_ConcurrentRapidFireRateLimit`
(rate-limit timing flake) reproduced on clean main per canon §7;
not introduced by this slice.
Refs: EPIC-3 master brief at .claude/architect-briefs/epic-3/, slice
A brief at 02-A-rbac-assignment-endpoints.md, T1 ancestor #1142.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c215468a61
|
feat(rbac): land 5-tier ClusterRoles (slice T1, #1098) (#1142)
Renders 5 ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
via Helm template with inherit-chain expansion. Find-or-create-role
endpoint (slice A1, future) targets these via roleRef on UserAccess CRs.
Per-tier action sets in values.yaml's new `tierActions:` block (227
lines authored by EPIC-3-T agent before stream timeout — Coordinator
finished the template + helper):
- tier-viewer (level 10): 6 rules — `*.read` on common kinds
- tier-developer (level 20): 10 rules — viewer + workloads.exec/console
+ tickets + sessions.playback. Auto-injected scope `openova.io/env-type=dev`
surfaced via ClusterRole annotation (slice T3 follow-up reads it).
- tier-operator (level 30): 15 rules — developer + console.connect.admin
+ sam.manage + patches.manage + tickets.accept
- tier-admin (level 40): 29 rules — operator + compute.* (no delete)
+ credentials.* + applications.* + actions.* + accounts.* + networks.*
+ sessions.* + workloads.*
- tier-owner (level 50): 33 rules — admin + rbac.* + organization.*
+ compute.delete
Total 93 RBAC rules across the 5 ClusterRoles.
Inherit chain expansion via _tier-helpers.tpl `catalyst.tierRules`
template helper. Each ClusterRole's `metadata.labels` carries:
- `catalyst.openova.io/tier-name: <tier>`
- `catalyst.openova.io/tier-level: <int>` (10/20/30/40/50; same integer
the Keycloak realm-role attribute carries — admin_roles.go:88-92)
`metadata.annotations.catalyst.openova.io/enforced-scopes` JSON-encodes
the per-tier scope auto-injection contract (developer-only today).
Per ADR-0001 §2.7: ClusterRoles (not Roles) so the same role works for
both namespace-scoped (RoleBinding) and cluster-scoped (ClusterRoleBinding)
UserAccess targets.
Per docs/INVIOLABLE-PRINCIPLES.md #4: every action set is in values.yaml,
not hardcoded — operators extend per-Sovereign without editing the
template. The `tiers.enabled` master gate + per-tier `enforcedScopes[]`
are also operator-tunable.
Validated:
- `helm lint` clean (1 INFO about chart icon, pre-existing)
- `helm template` renders exactly 5 ClusterRoles with the expected
inherit-chain rule counts (6 → 10 → 15 → 29 → 33)
- Inherit chain helper handles base case (viewer has no inherit) and
caps recursion at 10 levels (defensive)
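The inherit-chain expansion and the stated rule counts can be cross-checked with a short Python sketch. The real helper is a Helm template (`catalyst.tierRules`); here the per-tier own-rule counts (6, 4, 5, 14, 4) are derived by subtracting consecutive cumulative counts, the rule names are placeholders, and raising an error at the depth cap is an assumption about how the cap behaves.

```python
MAX_INHERIT_DEPTH = 10  # the defensive recursion cap named in the commit

# Own-rule counts derived from the cumulative 6 -> 10 -> 15 -> 29 -> 33.
OWN_RULE_COUNTS = {"viewer": 6, "developer": 4, "operator": 5, "admin": 14, "owner": 4}
INHERITS = {"viewer": None, "developer": "viewer", "operator": "developer",
            "admin": "operator", "owner": "admin"}

TIERS = {
    name: {"inherit": INHERITS[name],
           "rules": [f"{name}-rule-{i}" for i in range(count)]}  # placeholder names
    for name, count in OWN_RULE_COUNTS.items()
}


def tier_rules(tiers: dict, name: str, depth: int = 0) -> list:
    """Return the tier's inherited rules followed by its own rules."""
    if depth >= MAX_INHERIT_DEPTH:
        raise ValueError(f"inherit chain deeper than {MAX_INHERIT_DEPTH} at {name}")
    tier = tiers[name]
    parent = tier.get("inherit")
    inherited = tier_rules(tiers, parent, depth + 1) if parent else []
    return inherited + tier["rules"]
```

Summing the five expanded lengths gives 6 + 10 + 15 + 29 + 33 = 93, matching the total quoted in the message.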
Out of scope (deferred to follow-up slices):
- T2: Keycloak composite realm-role bootstrap (init Job in catalyst-api
startup that creates 5 `catalyst-<tier>` realm roles + composite chain)
- T3: useraccess-controller mod for developer scope auto-injection
(reads enforced-scopes annotation from this template's ClusterRoles)
Refs: #1094, #1098, docs/EPICS-1-6-unified-design.md §6.2
(authoritative tier action-set spec).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f1d0801ad2
|
feat(catalyst-api): compliance score aggregator + handler (slice S, #1096) (#1141)
Joins Kyverno PolicyReports + slice W2's compliance-evaluator events
+ EnvironmentPolicy weights into per-resource → per-Application →
per-Environment → per-Organization → per-Sovereign weighted scores.
Outputs SSE for live updates, REST for snapshots, Prometheus
catalyst_compliance_* gauges/counters, and (when CATALYST_NATS_URL is
wired) NATS JetStream KV `policy-rollup` for replayable history.
S1 — internal/handler/compliance.go:
* REST endpoints under /api/v1/sovereigns/{id}/compliance/
- GET /scorecard — per-app/env/org/sovereign rollups
- GET /policies — per-policy weight + mode + violation tally
- GET /violations — paginated fail rows, ?app=<name>
- GET /stream — SSE for live score updates
* Watch loop subscribes to k8scache.Factory fanout for kinds
{policyreport, clusterpolicyreport, compliance-evaluator,
deployment, statefulset, daemonset, pod}. Per ADR-0001 §5
every score recompute is event-driven; no polling.
* Pure computeScore() function with edge cases tested:
all-pass=100, all-fail=0, half-pass=50, skip drops from denom,
empty-weights fallback to equal weights, stateful/stateless scope
filters, missing verdict drops policy, warn pulls score down.
* NATS KV writes via nil-tolerant PolicyRollupPublisher interface
keyed `<scope>:<id>`. Sentinel resolver wires when env is set;
nil keeps the aggregator running on SSE+Prometheus only.
* EnvironmentPolicy CR resolution via dynamic-client; nil/404
falls back to default equal-weights so a fresh Sovereign without
a tuned policy still scores correctly.
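The edge cases listed for `computeScore()` suggest a weighted average along these lines. This Python sketch is illustrative only: the half credit for `warn`, the default weight of 1.0 for a policy absent from a non-empty weight map, and returning `None` when nothing is scorable are assumptions not stated in the commit.

```python
from typing import Dict, Optional

# Assumption: warn earns half credit; the commit only says it pulls the score down.
VERDICT_CREDIT = {"pass": 1.0, "warn": 0.5, "fail": 0.0}


def compute_score(verdicts: Dict[str, str],
                  weights: Optional[Dict[str, float]] = None) -> Optional[float]:
    """Weighted 0-100 score over policies that have a scorable verdict."""
    # "skip" and unknown verdicts drop out of both numerator and denominator;
    # a policy with a weight but no verdict never enters the sum at all.
    scored = {p: v for p, v in verdicts.items() if v in VERDICT_CREDIT}
    if not scored:
        return None
    if not weights:
        weights = {p: 1.0 for p in scored}  # empty-weights fallback: equal weights
    total = sum(weights.get(p, 1.0) for p in scored)
    if total == 0:
        return None
    earned = sum(weights.get(p, 1.0) * VERDICT_CREDIT[v] for p, v in scored.items())
    return 100.0 * earned / total
```

Keeping the function pure (no clock, no cluster access) is what lets the full edge-case matrix be tested with plain dictionaries.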
S2 — platform/mimir/chart/templates/prometheusrule-compliance.yaml:
* Recording rules:
- catalyst:compliance_score:by_application:1h_avg
- catalyst:compliance_violations:by_policy:5m_rate
- catalyst:compliance_score:by_sovereign:1h_avg
- catalyst:compliance_policy_enforcing:by_policy
* Pager alerts: ComplianceScoreRegression (>10pt drop in 1h) +
ComplianceEnforcingPolicyHighViolations (>50/hr in enforcing
mode). Every threshold a values.yaml knob per
docs/INVIOLABLE-PRINCIPLES.md #4.
* Capabilities-gated on monitoring.coreos.com/v1 so a fresh
Sovereign without bp-kube-prometheus-stack doesn't fail render.
Tests:
* 18 unit + integration tests in compliance_test.go covering the
full computeScore matrix, the watch-loop end-to-end via
Factory.Publish injection, and every HTTP endpoint (scorecard,
policies, violations pagination, stream, 503 nil-handler).
* `go test -count=1 -race ./internal/handler/...` clean (5 runs).
* `go vet ./...` clean.
Pre-existing CI failures (TestPinIssue_ConcurrentRapidFireRateLimit,
TestRun_FailsFastOnDynadotError, TestAuthHandover_HappyPath nil-ptr,
TestValidate_*Harbor_robot_token*) confirmed not introduced by this
slice — they reproduce on clean main.
Per ADR-0001 §3 (5 stores): score history lives in NATS JetStream KV;
no Postgres/FerretDB shadow store. Per ADR-0001 §5 (event-driven):
every score recompute fires off a Subscribe event. Per
INVIOLABLE-PRINCIPLES #4: SSE retention, KV TTL, alert thresholds all
runtime-configurable.
Closes the S column of EPIC-1 master plan; UI slices U1-U5 can now
consume the SSE event shape.
Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d74e0d5e5a
|
feat(bp-kyverno): land 19 compliance ClusterPolicy templates (slice K, #1096) (#1138)
Slice K of EPIC-1 (#1096) compliance engine — author the baseline policy
library that the score aggregator (slice S) will consume via PolicyReport
rows.
K1 ships 13 baseline policies + K2 ships 7 added policies. One of the K2
policies (hubble-flows-seen #16) is a stub file — Kyverno can't natively
reach Cilium Hubble's gRPC API, so the synthetic PolicyReport row is
emitted by slice W2's hubble.go evaluator (per design §4.1). Stub keeps
the policy slot explicit in the bundle.
Architecture per docs/EPICS-1-6-unified-design.md §4.3:
K1 (13 baseline)
01 multi-replica-drainability (resilience, permissive)
02 pdb-permits-eviction (resilience, permissive)
03 topology-spread (resilience, permissive)
04 probes-present (resilience, enforcing)
05 resource-requests (resilience, enforcing)
06 resource-limits (resilience, permissive)
07 pvc-volume-expansion (resilience, permissive — stateful)
08 hpa-effective (resilience, permissive)
09 cilium-l7-mtls (security, enforcing)
10 flux-managed (governance, enforcing)
11 harbor-proxy-pull (governance, enforcing)
12 image-tag-pinned (governance, enforcing)
13 prometheus-scrape (observability, permissive)
K2 (7 added)
14 networkpolicy-present (security, permissive)
15 otel-injected (observability, permissive)
16 hubble-flows-seen (deferred to W2 evaluator)
17 runasnonroot-readonlyrootfs (security, permissive)
18 cosign-verified (security, permissive)
19 secret-not-in-env (security, permissive)
20 backup-configured (resilience, permissive)
Per docs/INVIOLABLE-PRINCIPLES.md #4 every operationally-meaningful value
is runtime-configurable via .Values.compliancePolicies.<name>.*:
- enabled (default false — operator opts in)
- action (Audit | Enforce; default Audit; flipped per-Environment by
EnvironmentPolicy.spec.compliance.modes once C2 controller lands)
- excludeNamespaces (default exempts kube-system, flux-system, etc.)
- per-policy specifics (allowedRegistryRegex, cosign keys, ...)
Test gate (helm template):
- default-OFF (no overrides): 0 ClusterPolicy rendered
- all-ON: 19 ClusterPolicy rendered
helm lint clean both ways.
Slice S1 (score aggregator) will join PolicyReport rows from these
policies + synthetic rows from W2 evaluators against EnvironmentPolicy
weights. UI surfaces (slices U1-U5) consume the SSE/NATS rollups.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f18dd8df19
|
feat(bp-opentelemetry-operator): scaffold operator + default Instrumentation CR (slice H5, #1095) (#1121)
New platform/opentelemetry-operator/ Blueprint scaffold per design doc
§3.9 row 5. Companion to existing bp-opentelemetry (the collector) —
this Blueprint ships the OPERATOR that auto-injects OTel SDK sidecars
into Pods based on annotations:
instrumentation.opentelemetry.io/inject-{java|nodejs|python|dotnet}: "default"
Two-Blueprint split is intentional: collector and operator are separate
upgrade cycles. Mixing them risks coupling observability cadence to
auto-instrumentation cadence, and the operator's mutating admission
webhook intercepts every Pod creation cluster-wide so misconfiguration
is high-blast-radius.
What ships:
- platform/opentelemetry-operator/README.md — activation contract
- platform/opentelemetry-operator/blueprint.yaml — bp-opentelemetry-operator 1.0.0
- platform/opentelemetry-operator/chart/Chart.yaml — wraps upstream
opentelemetry-operator:0.61.0 from open-telemetry-helm-charts.
Subchart `condition: enabled` — default-off skips it entirely.
- platform/opentelemetry-operator/chart/values.yaml — gate, default
Instrumentation CR config (exporterEndpoint, sampler, per-language
toggles), upstream subchart values (manager.collectorImage.repository
required, serviceAccount, cert-manager-backed admission webhook)
- platform/opentelemetry-operator/chart/templates/instrumentation-default.yaml
— Catalyst overlay Instrumentation CR with parentbased_traceidratio
sampler @ 0.25 default, propagators (tracecontext + baggage + b3),
per-language injection toggles. Default OFF; namespace = cilium by
default (operator overrides per Sovereign).
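For readers unfamiliar with the `parentbased_traceidratio` sampler, the decision it makes at the 0.25 default can be sketched as below. This is a simplified Python illustration of the general technique (follow the parent's decision if there is one, otherwise sample a fixed fraction of trace IDs); comparing the low 64 bits of the trace ID against a bound is one common way SDKs implement the ratio, and is an assumption here, not something the commit specifies.

```python
from typing import Optional

TRACE_ID_LIMIT = 1 << 64  # compare only the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float = 0.25,
                  parent_sampled: Optional[bool] = None) -> bool:
    """Parent-based decision first; fall back to a trace-ID ratio for root spans."""
    if parent_sampled is not None:
        return parent_sampled  # child spans inherit the parent's decision
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound
```

Because the decision is a pure function of the trace ID, every service that sees the same root trace reaches the same answer without coordination.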
Default-OFF for both layers:
- .Values.enabled: false → upstream subchart's `condition: enabled`
also fires, so 0 resources rendered total
- Even after .Values.enabled=true, the Catalyst Instrumentation CR
is gated again by .Values.defaultInstrumentation.enabled=false so
installing the chart doesn't auto-inject anywhere
Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (sampler ratio,
exporter endpoint, per-language toggles, namespace) is in values.yaml.
Validated:
- helm dependency build pulls upstream cleanly
- helm template with default values: 0 resources rendered
- helm template with enabled=true defaultInstrumentation.enabled=true:
22 resources rendered (upstream operator manager Deployment, CRDs,
RBAC, mutating + validating webhooks, cert-manager Issuer +
Certificate, plus the Catalyst Instrumentation CR)
Out of scope for this slice:
- Add this Blueprint to clusters/_template/bootstrap-kit/ — EPIC-5
(#1100) sequences both bp-opentelemetry (collector first) and this
Blueprint as part of the observability roll-out
- Per-Application Instrumentation CRs from Blueprint.spec.observability.
traces=otlp — application-controller (slice C4 of #1095) renders
those at install time
Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 5
+ §8.4 (EPIC-5 Networking).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
5915e309dc
|
feat(bp-kyverno): land label-vocab mutate + validate ClusterPolicies (slices E1+E2, #1095) (#1120)
Realizes design doc §3.6 (Label-vocabulary enforcement). Two
ClusterPolicies that together implement the contract in §1: the
openova.io/* label set is the join key across compliance scoring
(#1096), RBAC scope matching (#1098), billing (post-Phase-1), and
networking (#1100). If labels are missing, every downstream consumer is
blind.
E1 — mutate-add-openova-labels (slice E1):
- Mutating ClusterPolicy that derives missing openova.io/{org, env,
application, blueprint, managed-by} labels from namespace annotations +
ownerReferences and adds them at admission.
- Three rules:
* add-org-from-namespace-annotation
* add-env-from-namespace-annotation
* add-managed-by-flux-when-flux-instance-label
- Best-effort safety net — Catalyst controllers (C1/C2/C4) are the
authoritative source. This rule covers resources created OUTSIDE the
controller path (e.g. a debug Pod from kubectl run, a CronJob authored
manually).
E2 — validate-require-openova-labels (slice E2):
- Validating ClusterPolicy that REJECTS workload resources missing
required openova.io/* labels.
- Default action `Audit` (permissive) — per-Environment overlay flips to
`Enforce` (blocking) via EnvironmentPolicy.spec.modes in EPIC-1 #1096.
- One rule per required label (templated from .Values.kyvernoOverlay.
labelVocab.validate.requiredLabels) — lets the Audit/Enforce decision
be per-label rather than all-or-nothing.
- excludeNamespaces list exempts control-plane namespaces (kube-system,
flux-system, cilium, cert-manager, openova-system, catalyst, etc.) so
existing Sovereign infra doesn't trip on missing org labels.
Both default OFF (.Values.kyvernoOverlay.labelVocab.{mutate,validate}.
enabled). Operator opts in once the prerequisite Organization (slice B1)
+ Environment (slice B2) CRs exist on the cluster, otherwise the mutate
rule has nothing to derive from and the validate rule rejects every
workload.
Per docs/INVIOLABLE-PRINCIPLES.md #4, every list (requiredLabels,
resourceKinds, excludeNamespaces, action) is in values.yaml.
Validated:
- helm dependency build pulls upstream kyverno cleanly
- helm template with default values: 0 ClusterPolicy resources rendered
- helm template with both gates enabled: exactly 2 ClusterPolicies
rendered (mutate-add-openova-labels + validate-require-openova-labels)
Chart version bumped 1.0.1 → 1.1.0 (minor — new templates, no breaking).
Blueprint.yaml mirrored 1.0.0 → 1.1.0.
Refs: #1094, #1095, #1096, #1098, #1100,
docs/EPICS-1-6-unified-design.md §1 (label vocab) + §3.6 (E1+E2 scope).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
e1d7bf18be
|
feat(bp-hcloud-csi): scaffold Hetzner CSI driver Blueprint (slice H6, #1095) (#1119)
New platform/hcloud-csi/ Blueprint scaffold per design doc §3.9 row 6.
Wraps the upstream hetznercloud/csi-driver Helm chart and ships the
Catalyst-managed `hcloud-volumes` StorageClass that multi-node stateful
workloads (CNPG primary/replica pairs in EPIC-6 #1101) need.
Default-OFF: chart is a no-op until .Values.enabled is true. Even after
enabling, the cluster's default StorageClass is NOT flipped unless
.Values.defaultStorageClass is also true — that's a destructive change
for Pods relying on the previous default's binding semantics, so the
in-place migration plan is operator-scheduled.
What ships:
- platform/hcloud-csi/README.md — activation contract, why-default-OFF
- platform/hcloud-csi/blueprint.yaml — bp-hcloud-csi 1.0.0, configSchema
- platform/hcloud-csi/chart/Chart.yaml — wraps upstream hcloud-csi:2.13.0
from charts.hetzner.cloud, condition=enabled gate
- platform/hcloud-csi/chart/values.yaml — gate, default-storageclass
flag, hetznerTokenSecretRef (SealedSecret), catalystStorageClasses
array (renamed from storageClasses to avoid collision with upstream's
storageClasses key), volumeSnapshotClass block (default off)
- platform/hcloud-csi/chart/templates/storageclass.yaml — renders one
StorageClass per catalystStorageClasses[] entry; first entry annotated
as cluster default when defaultStorageClass=true
- platform/hcloud-csi/chart/templates/volumesnapshotclass.yaml —
VolumeSnapshotClass for backup workflows; default off
Why a separate Blueprint, not values toggle on bp-cilium:
- CSI drivers are independent of CNI. Mixing them risks coupling the
network-plane upgrade cycle to the storage-plane upgrade cycle.
Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (StorageClass list,
SealedSecret reference, replicas, resource requests) is in values.yaml.
Validated:
- helm dependency build pulls upstream hcloud-csi:2.13.0 cleanly
- helm template with default values: 0 resources rendered (gate +
Chart.yaml condition both fire correctly)
- helm template with enabled=true defaultStorageClass=true: 7 resources
rendered (upstream CSI controller Deployment, node DaemonSet, CSIDriver,
RBAC, plus Catalyst hcloud-volumes StorageClass with the
storageclass.kubernetes.io/is-default-class annotation)
Schema collision lesson:
- Initial draft used .Values.storageClasses[] which collided with the
upstream subchart's storageClasses array (different shape; subchart
expects array under that exact name). Renamed to catalystStorageClasses
+ passed [] to upstream's hcloud-csi.storageClasses to suppress its own
StorageClass rendering. Lesson logged in seam map.
Refs: #1094, #1095, #1101, docs/EPICS-1-6-unified-design.md §3.9 row 6,
docs/SRE.md §2.5, platform/cnpg/README.md.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eca27002ae
|
feat(bp-cilium): add Hubble UI HTTPRoute overlay (slice H7, #1095) (#1117)
Realizes design doc §3.9 row 7 (Hubble relay+UI on; OIDC ingress) as a
default-OFF scaffold that EPIC-5 (#1100) flips on per Sovereign once the
zero-trust observability tier is ready.
Why default-OFF in Phase-0:
- Hubble relay/UI in production today is intentionally off (SovereignA
was crash-looping on monitoring.coreos.com/v1 ServiceMonitor missing
before bp-kube-prometheus-stack reconciles — issue #182).
- The OIDC enforcement at the gateway boundary is the missing piece —
Cilium's L7 OIDC filter wires to bp-keycloak's `hubble-ui` client which
lands in slice D1.
- Flipping the gate without the OIDC layer would leave Hubble UI publicly
accessible. The template comments explicitly warn against this for
production.
What ships:
- platform/cilium/chart/templates/hubble-ui-httproute.yaml — HTTPRoute
exposing hubble-ui Service via cilium-gateway with the wildcard cert.
Gated by `catalystOverlay.hubbleUI.{enabled,hostname}`.
- platform/cilium/chart/values.yaml `catalystOverlay:` block:
hubbleUI.{ enabled, hostname, gatewayRef.{name,namespace},
serviceRef.{name,namespace,port}, auth (oidc|none, default oidc) }.
All operator-overrideable per docs/INVIOLABLE-PRINCIPLES.md #4.
Operator opt-in path (per-Sovereign overlay at clusters/<sov>/bootstrap-kit/
01-cilium.yaml):
spec.values.cilium.hubble.relay.enabled: true
spec.values.cilium.hubble.ui.enabled: true
spec.values.catalystOverlay.hubbleUI.enabled: true
spec.values.catalystOverlay.hubbleUI.hostname: hubble.<sovereign-domain>
… AND bp-keycloak realm has a `hubble-ui` OIDC client (slice D1).
Validated:
- helm template with default values: 0 HTTPRoute resources rendered
- helm template with catalystOverlay.hubbleUI.enabled=true + hostname:
exactly 1 HTTPRoute rendered with proper parentRefs/hostnames/backendRefs
- Original 34-resource render count unchanged in default mode (no
regression to existing chart output)
Chart version bumped 1.2.1 → 1.3.0 (minor — new templates, no breaking).
Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 7,
§8 (EPIC-5 Networking).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
68c68eaf7a
|
feat(bp-network-policies): land default-deny CCNP + system-namespace + DNS allow templates (slice H8, #1095) (#1116)
New platform/network-policies/ Blueprint scaffold per design doc §3.9 row 8. Ships the cluster-wide zero-trust primitives that EPIC-5 (#1100) activates as part of the networking roll-out.
What ships:
- platform/network-policies/blueprint.yaml — bp-network-policies 1.0.0
- platform/network-policies/chart/Chart.yaml — Helm chart, no upstream sub-chart
- platform/network-policies/chart/values.yaml — gate (enabled: false default)
- platform/network-policies/chart/templates/default-deny.yaml — CCNP that denies all ingress + egress at endpointSelector: {} (full-cluster scope)
- platform/network-policies/chart/templates/allow-system-namespaces.yaml — CCNP allowing full traffic for kube-system, flux-system, cilium, cert-manager, catalyst, openova-system, monitoring, ingress (set is parametric via .Values.allowSystemNamespaces — operator extends per Sovereign for gitea/harbor/loki etc.)
- platform/network-policies/chart/templates/allow-egress-dns.yaml — CCNP permitting UDP/TCP/53 to CoreDNS from every Pod (without this the cluster is unbootable under default-deny — first DNS lookup fails)
Why a separate Blueprint, not bp-cilium:
- bp-cilium is foundational, installed on every cluster on day 0. Default-deny breaks every workload that hasn't been allowlisted, so it cannot ship in bp-cilium without operator opt-in semantics.
- Separate Blueprint with enabled: false default preserves the safety boundary. EPIC-5 wires the activation when the rest of the zero-trust story is ready.
Per-namespace intra-namespace allow is intentionally NOT in this slice:
- Cilium CCNPs cannot express "same namespace as the source Pod" without listing every namespace, which contradicts dynamic Org provisioning.
- That allow rule is rendered as a per-namespace CiliumNetworkPolicy (CNP, namespace-scoped) by organization-controller (slice C1 of #1095) at Organization creation time. README + values.yaml note this for downstream Implementers.
Per docs/INVIOLABLE-PRINCIPLES.md #4, every policy parameter (allowSystemNamespaces list, dnsNamespace, dnsServiceName) is in values.yaml, not hardcoded.
Validated:
- helm template with default values: 0 resources rendered (gate works)
- helm template with enabled=true: exactly 3 CCNPs rendered (default-deny, allow-system-namespaces, allow-egress-dns), all parse cleanly through python yaml.safe_load_all
- CCNP CRD validation will happen on Sovereigns where bp-cilium is installed; local k3s here uses flannel so server-side dry-run is unavailable
Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 8 + §8 (EPIC-5), ADR-0001 §2 (zero-trust).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
82bf6f6eec
|
fix(bp-cilium): align declared upstream version with Chart.lock (slice H1, #1095) (#1115)
EPIC-0 audit found provenance drift in bp-cilium:
- Chart.yaml dependencies[0].version declared "1.19.3"
- values.yaml catalystBlueprint.upstream.version declared "1.19.3"
- Chart.lock pinned to 1.16.5 (truth-on-disk — what every Sovereign has actually been running)
The declared "1.19.3" was never installed anywhere. Aligning all three to "1.16.5" so observability/audit pipelines that compare the declared upstream version with the actually-deployed Cilium version stop reporting a 3-minor mismatch.
This is a pure metadata fix — no behavioral change. Rolling forward to a newer Cilium minor (1.17.x or 1.18.x) is a separate slice that needs real upgrade testing on a live data-plane cluster, including k3s --flannel-backend=none compatibility and Gateway API CRD compatibility.
Validated:
- helm dependency build re-resolves to 1.16.5 cleanly
- Chart.lock unchanged (Cilium 1.16.5 was already what it had)
Chart version bumped 1.2.0 → 1.2.1 (patch). Blueprint.yaml mirrored.
Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.9 row 1, §11 row 3.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
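The "3-minor mismatch" in the bp-cilium fix above can be reproduced with a small check. This is only an illustrative sketch: the function names are hypothetical and are not taken from the audit pipeline, and it assumes plain `MAJOR.MINOR.PATCH` version strings.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def minor_drift(declared: str, locked: str) -> int:
    """Return how many minor versions the declared version is ahead of the lock."""
    declared_v = parse_version(declared)
    locked_v = parse_version(locked)
    if declared_v[0] != locked_v[0]:
        raise ValueError("major versions differ; minor drift is not meaningful")
    return declared_v[1] - locked_v[1]


def drifting_sources(declared: dict[str, str], locked: str) -> list[str]:
    """List every declaring file whose version disagrees with Chart.lock."""
    return [name for name, version in declared.items() if version != locked]
```

With the values from the commit message, `minor_drift("1.19.3", "1.16.5")` gives 3, and both declaring files are reported until they are aligned to the locked version.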
||
|
|
e8bf1aab69
|
feat(bp-nats-jetstream): land Stream + KV CR templates (slice H4, #1095) (#1114)
Realizes design doc §3.9 row 7. The chart had no templates/ directory —
NACK Stream and KeyValue CRs that ADR-0001 §6 mandates as the Catalyst
event spine were declared in docs but not in code.
What this slice ships:
- platform/nats-jetstream/chart/templates/_helpers.tpl — common labels +
servers helper (defaults to <release>-nats Service URL, override via
.Values.catalystStreams.servers).
- platform/nats-jetstream/chart/templates/streams.yaml — three Streams:
* catalyst.audit : 90-day retention, R=3, mirrored to DR (#1101)
* catalyst.events : 24-hour retention (cross-replica fan-out + cold-
start replay), R=3
* catalyst.billing: 1-year retention, R=3, consumed by future billing
- platform/nats-jetstream/chart/templates/kv-buckets.yaml — three KVs:
* idempotency : 24h TTL, 256 MiB cap (write-path idempotency keys)
* dr-leases : 60s TTL (Continuum dns-quorum lease path; CF-KV
bypasses this bucket)
* policy-rollup: 7-day retention, 1 GiB cap (compliance scorer #1096)
Reconciliation gate:
- All resources render only when .Values.catalystStreams.enabled is true.
- NACK (nats-io/nack) is NOT a current dependency — installing it as a
sibling Blueprint and flipping this toggle is a follow-up slice.
- Same default-off pattern the chart already uses for promExporter.podMonitor
(issue #182) so a fresh Sovereign with no NACK keeps booting cleanly.
Per-tenant streams (org.<id>.events, app.<id>.events) are intentionally
NOT shipped here — they'll be created at runtime by organization-controller
(slice C1) and application-controller (slice C4) so they can scale per
tenant.
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every retention,
TTL, replicas, and maxBytes is a values.yaml variable; per-Sovereign
overlays override.
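As a rough illustration of how a per-Sovereign overlay can override chart defaults, the sketch below deep-merges nested maps in a simplified way (it ignores Helm's null-deletes-key behaviour). Apart from `catalystStreams.enabled`, every key name here is a hypothetical placeholder, not the chart's actual values layout.

```python
from copy import deepcopy


def merge_values(base: dict, overlay: dict) -> dict:
    """Deep-merge overlay into a copy of base; nested maps merge, other values replace."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults: only catalystStreams.enabled is named in the commit message.
CHART_DEFAULTS = {
    "catalystStreams": {
        "enabled": False,
        "audit": {"maxAge": "90d", "replicas": 3},
    }
}

# Hypothetical per-Sovereign overlay: turn the gate on and shorten audit retention.
SOVEREIGN_OVERLAY = {"catalystStreams": {"enabled": True, "audit": {"maxAge": "30d"}}}

effective = merge_values(CHART_DEFAULTS, SOVEREIGN_OVERLAY)
```

Keys the overlay does not mention (here `replicas`) keep their chart default, which is what lets the chart ship safe defaults while each Sovereign changes only what it needs.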
Validated:
- helm dependency build pulls upstream nats:1.2.0
- helm template with default values: 0 catalyst-* resources rendered
(catalystStreams.enabled=false, the safe default)
- helm template with catalystStreams.enabled=true: 6 resources rendered
exactly as expected (3 Streams + 3 KeyValues, all in
jetstream.nats.io/v1beta2)
Chart version bumped 1.1.2 → 1.2.0 (minor — new templates, no breaking).
Blueprint.yaml version mirrored.
Refs: #1094, #1095, #1096, #1101, docs/EPICS-1-6-unified-design.md §3.9
row 7, ADR-0001 §6.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
25ef20a8e5
|
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.
Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
(legacy, served, not storage) and v1 (canonical, served, storage). The
shared schema means the 38 existing v1alpha1 files in platform/ +
products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
tagline interchangeable; category | family interchangeable; docs |
documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
observability, outputs, depends[].values, manifests.values, etc.
Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
category (25), family (20), docs (20), documentation (14+1), icon (25),
tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
§3. Those 5 files are fixed in this commit:
* platform/cert-manager-powerdns-webhook/blueprint.yaml
* platform/cert-manager-dynadot-webhook/blueprint.yaml
* platform/crossplane-claims/blueprint.yaml
* platform/powerdns/blueprint.yaml
* platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
--dry-run=server) against the new CRD.
Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
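Some of the rejected vectors above can be approximated outside the cluster. The sketch below is an assumption-laden stand-in for the CRD schema, not the schema itself: the TTL pattern and the visibility enum come from the commit message, while the semver pattern is a simplified guess at "strict semver" and `validate_spec` is a hypothetical helper.

```python
import re

# Taken verbatim from the commit message.
TTL_PATTERN = re.compile(r"^[0-9]+(s|m|h|d)$")
VISIBILITY = {"listed", "unlisted", "private"}
# Assumption: a simplified MAJOR.MINOR.PATCH pattern; the real CRD pattern is not shown.
SEMVER_PATTERN = re.compile(r"^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$")


def validate_spec(spec: dict) -> list[str]:
    """Return the field paths that fail the approximated rules."""
    errors = []
    version = spec.get("version")
    if not isinstance(version, str) or not SEMVER_PATTERN.fullmatch(version):
        errors.append("spec.version")
    card = spec.get("card")
    if not isinstance(card, dict):
        errors.append("spec.card")
    elif not card.get("title"):  # missing or empty title (minLength=1)
        errors.append("spec.card.title")
    if "visibility" in spec and spec["visibility"] not in VISIBILITY:
        errors.append("spec.visibility")
    for index, rotation in enumerate(spec.get("rotation", [])):
        ttl = rotation.get("ttl")
        if ttl is not None and not (isinstance(ttl, str) and TTL_PATTERN.fullmatch(ttl)):
            errors.append(f"spec.rotation[{index}].ttl")
    for index, dependency in enumerate(spec.get("depends", [])):
        if not isinstance(dependency, dict):  # bare strings must be objects
            errors.append(f"spec.depends[{index}]")
    return errors
```

A check like this is only useful as a fast pre-commit hint; the server-side dry-run against the CRD remains the source of truth.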
This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.
Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a6fb97f2ef
|
fix(cutover step-01): clone+push (regular repo) instead of pull-mirror (#1033)
PR #1029 added a step-06 PATCH to flip mirror=false before push so the cutover-helmrepository-patches Job could write HelmRepository URL pivots to local Gitea. On Gitea 1.22.3 the PATCH returns 200 but silently no-ops — `mirror_interval` updates but `mirror: true` stays. The repo remains read-only and step-06 still hits HTTP 403 "remote: mirror repository is read-only". Reproduced on otech127 2026-05-05 with chart 0.1.22 deployed.
Per ADR (cutover ends upstream tracking — Sovereign goes self-hosted from this point), the architecturally correct fix is to never create the mirror in the first place. Step-01 now creates a regular Gitea repo and bare-clones+pushes upstream content. All refs (branches+tags) replicate via `git push --mirror --force`, which is idempotent on re-runs.
Trade-off: post-cutover Sovereigns no longer auto-sync from upstream — that's the intended cutover semantics anyway. Operator re-runs this Job manually for chart rollouts (next-session follow-up: dedicated post-cutover sync mechanism, perhaps a periodic CronJob the operator can opt into).
Bumps:
- bp-self-sovereign-cutover chart 0.1.22 → 0.1.23
- bootstrap-kit pin 0.1.22 → 0.1.23
Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
|
||
|
|
a070808eda
|
fix(cutover step-06): convert pull-mirror to standalone before pushing patches (#1029)
Step-01 creates openova/openova on the Sovereign's local Gitea as a
pull mirror so it tracks upstream openova-public during early
bootstrap. After cutover, the Sovereign is self-hosted and MUST
diverge from upstream — but Gitea blocks pushes to a mirror with
HTTP 403 "remote: mirror repository is read-only".
Step-06 adds a Phase-1.5 PATCH /api/v1/repos/{owner}/{repo}
{"mirror": false, "mirror_interval": "0"} BEFORE attempting to
clone+push the HelmRepository URL pivot. This converts the
pull-mirror into a standalone writable repo — the way the post-
cutover Sovereign architecture expects it.
Caught on otech125 2026-05-05: cutover-helmrepository-patches Job
returned "FATAL: git push failed" with no upstream stderr (chart
0.1.20 lacks the printf '%s\n' "$push_err" fix from PR #1022, which
was published in 0.1.21 only). Reproduced by cloning openova/openova
from a debug pod and running git push: "remote: mirror repository
is read-only / fatal: ... HTTP 403". Without the demirror step,
EVERY Sovereign provisioned fails handover at this step.
Bumps:
- bp-self-sovereign-cutover chart 0.1.21 → 0.1.22
- bootstrap-kit pin 0.1.20 → 0.1.22
Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
|
||
|
|
478743db17
|
fix(cutover-step-06): actually surface git push stderr (PR #1021 merged with only chart bump) (#1022)
PR #1021 was supposed to ship this code fix but the chart-version bump landed first and the actual sed didn't apply (sed quoting mishap). The debug-error fix never reached main. Re-shipping now as a clean Edit-based commit.
Captures git push stderr into push_err and prints it on FATAL so the next iteration's failed Job logs include git's actual rejection (auth / branch protection / hook).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
69980ed48e
|
chore(bp-self-sovereign-cutover): bump 0.1.20 → 0.1.21 (#1021)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
608db53a25
|
fix(cutover 0.1.20): Step-06 pushes YAML edit to local Gitea so patches survive Flux reconcile (#970) (#971)
## Root cause (live on otech116 2026-05-05 14:38)
After the #968 fix shipped (0.1.19), the cutover engine reached Step-7 (87%) successfully — Step-01..07 all completed. Then Step-08 (egress-block-test) caught 38/38 HelmRepositories had reverted to upstream:
```
external HelmRepositories still pointing at ghcr.io/openova-io: 38
OFFENDER flux-system/bp-cilium=oci://ghcr.io/openova-io
... (37 more)
FAIL — at least one HelmRepository did not pivot
```
But Step-06's job logs say:
```
[helmrepository-patches] OK bp-cilium -> oci://harbor.otech116.omani.works/openova-io
... (37 more OK)
ok=38 skip=0 fail=0
```
So Step-06 thought it succeeded — and it had, momentarily. But then the bootstrap-kit Kustomization (which had successfully pivoted to local Gitea via Step-05) reconciled its YAML from local Gitea, where the YAML still declared `url: oci://ghcr.io/openova-io`. Within ~30s every kubectl patch was undone. The cutover engine then aborted at Step-8 verification.
## Fix
Step-06 now runs in two phases:
1. **Live K8s patches** (existing behaviour) — flips spec.url on every HelmRepository immediately. Useful for the cluster between cutover and the next reconcile.
2. **NEW — Push YAML edit to local Gitea** — clones `openova/openova` from the local Gitea over basic-auth, sed-rewrites every `clusters/_template/bootstrap-kit/*.yaml` declaration of `url: oci://ghcr.io/openova-io` → `oci://harbor.<sov-fqdn>/openova-io`, commits with a clear message, pushes back. Subsequent reconciles see local Harbor as the steady-state.
After the push, the script annotates `flux-system/openova` GitRepository to trigger immediate reconciliation so the new YAML lands without waiting for the polling interval.
## Image change
Step-06 image bumped from `bitnami/kubectl:1.31.4` to `alpine/k8s:1.31.4` because the new phase needs both `kubectl` and `git` in one image (verified live on otech116 — both binaries present).
## Acceptance gate
Test case 16 added to cutover-contract.sh — guards against future regressions that remove the `git clone`, the `git push origin main`, or the `clusters/_template/bootstrap-kit` target dir reference.
## Live verification
Will fire on otech117 (next provision). Expected:
- Step-06 logs `cloning gitea-http.gitea.../openova/openova.git` then `pushed to ...`
- Step-08 verify PASSES (38/38 HelmRepositories pivoted in K8s + Gitea)
- self-sovereign-cutover-status `cutoverComplete: "true"`
- Egress block to ghcr.io safely activates
Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
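The phase-2 rewrite described above (upstream registry URL to local Harbor) can be sketched in Python for illustration. The Job itself uses sed; this function, its name, and the exact line shape it matches are assumptions made for the example.

```python
import re

UPSTREAM_URL = "oci://ghcr.io/openova-io"


def rewrite_helmrepository_urls(yaml_text: str, sovereign_fqdn: str) -> tuple[str, int]:
    """Point every `url: oci://ghcr.io/openova-io` line at the Sovereign's Harbor.

    Returns the rewritten text and the number of lines changed.
    """
    # Match only whole `url:` lines so other mentions of the registry are untouched.
    pattern = re.compile(
        r"^([ \t]*url:[ \t]*)" + re.escape(UPSTREAM_URL) + r"[ \t]*$",
        re.MULTILINE,
    )
    local_url = f"oci://harbor.{sovereign_fqdn}/openova-io"
    return pattern.subn(lambda match: match.group(1) + local_url, yaml_text)
```

Running it a second time on its own output changes nothing, so a repeated run leaves no new diff to commit. Whether the real Job handles the empty-commit case that way is not stated in the source.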
||
|
|
3db19b76b1
|
fix(cutover 0.1.19): Step-01 gitea-mirror DNS readiness probe + backoffLimit=3 (#968) (#969)
## Root cause (live on otech115 2026-05-05 14:15)
After PR #959 (0.1.18) unblocked the auto-trigger to actually call /internal/cutover/trigger, the cutover engine fired Step-01 within ~8s of bp-self-sovereign-cutover Helm-install completing. The gitea Pod had only just reached Ready state — cluster-DNS endpoint publication for the headless service `gitea-http` was still in flight. One wget returned `bad address gitea-http.gitea.svc.cluster.local` and exited non-zero.
Catalyst-api's cutover engine stamped Jobs with backoffLimit=0 (cutover.go:584), so a single DNS miss was terminal and aborted all 8 cutover steps. otech115 finished provisioning with cutoverComplete=false and tethered to upstream github.com/ghcr.io.
## Fix (dual-layer)
**Layer A — catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3. A single transient miss is recoverable (4 attempts over each step's activeDeadlineSeconds) without burning operator-attention. Hard failures still surface within budget.
**Layer B — chart Step-01 (01-gitea-mirror-job.yaml)**: explicit nslookup readiness probe at the top of the bash script, before any wget call. 30 attempts × 5s = 150s budget; alpine/git ships nslookup in /usr/bin (verified live on otech115).
Layer B is faster than Layer A (in-script DNS retry vs Pod recreate); Layer A is the safety net for any other transient pre-cluster-stable race we haven't yet enumerated.
## Acceptance gate
Test case 15 added to platform/self-sovereign-cutover/chart/tests/cutover-contract.sh — guards against future regressions that drop either the gitea_host extraction or the nslookup loop.
## Live verification
Will fire on the next provision (otech116). Expected:
- Step-01 logs `[gitea-mirror] DNS ready for gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true
Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
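The Layer B probe above is a bounded retry loop around a DNS lookup. The following is a minimal Python sketch of that idea, not the chart's bash script: `wait_for_dns` is a hypothetical name, and the resolver and sleep function are injectable purely so the loop can be exercised without a cluster.

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    host: str,
    attempts: int = 30,          # matches the 30-attempt budget in the commit message
    delay_seconds: float = 5.0,  # matches the 5s spacing in the commit message
    resolve: Callable[[str], object] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the attempt number on which `host` first resolved.

    Raises TimeoutError if the name never resolves within the budget.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host)
            return attempt
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt < attempts:
                sleep(delay_seconds)
    raise TimeoutError(f"{host} did not resolve after {attempts} attempts")
```

Retrying inside the script is cheaper than relying on `backoffLimit`, because a Job retry has to schedule and start a fresh Pod before it can try the lookup again.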
||
|
|
d1431bed09
|
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is not specified" on every Sovereign (otech112 evidence). HelmRelease reports Ready=True (Helm install succeeded) but the Pod CrashLoopBackOffs invisibly behind the False-positive condition.
Closes #916 — wizard let operators dispatch unbuildable topologies (otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not encode regional orderability. Hetzner rejected the worker creation 41s into `tofu apply` after Phase-0 had already created the CP + network + LB + firewall.
Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so the upstream Deployment's secretKeyRef references resolve cleanly during `helm template` AND in the live cluster regardless of overlay state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps it under `flux-system/cloud-credentials.hcloud-cloud-init`; the bootstrap-kit overlay lifts that key via Flux `valuesFrom` into `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's blueprint-release "Run chart integration tests" step.
Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps current SKU to recommended default when region change drops it out of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate `provisioner.Request.Validate()` with same predicate so a stale wizard build OR direct API caller bypassing the UI cannot dispatch otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the legacy singular path.
Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API side. Chart smoke renders + helm template gates the env wiring at publish time.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
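The availability predicate shared by the wizard and the API can be illustrated in a few lines. This sketch is in Python rather than the TypeScript and Go of the real implementation. The matrix entries come from the commit message; treating a SKU with no entry as orderable everywhere is an assumption drawn from `availableRegions` being optional.

```python
from typing import Optional

# Matrix as stated in the commit message; an empty list means "orderable nowhere new".
AVAILABLE_REGIONS: dict[str, Optional[list[str]]] = {
    "cpx32": ["fsn1", "nbg1", "hel1"],
    "cpx21": [],
    "cpx31": [],
}


def is_sku_available_in_region(sku: str, region: str) -> bool:
    """Return True when the SKU may be ordered in the region."""
    regions = AVAILABLE_REGIONS.get(sku)
    if regions is None:
        # Assumption: no availableRegions entry means no regional restriction.
        return True
    return region in regions


def suggest_alternative_skus(region: str, candidates: list[str]) -> list[str]:
    """Keep only the candidate SKUs that are orderable in the region."""
    return [sku for sku in candidates if is_sku_available_in_region(sku, region)]
```

Applying the same predicate on the API side is what stops a stale UI build or a direct API call from reproducing the otech109 case (`cpx32` in `ash`).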
||
|
|
238c6d2010
|
fix(bp-flux): mitigate helm-controller leader-election loss + stuck-HR recovery (#925) (#960)
* fix(bp-flux): mitigate helm-controller leader-election loss + recovery CronJob (#925)
On otech113.omani.works the bp-vpa HelmRelease became stuck Ready=Unknown forever after a transient kube-apiserver blip caused helm-controller to lose its leader-election lease mid-install. The Helm release secret was already committed (Status=deployed) by the previous leader, but its last write to the HR's Ready condition was Unknown and the new leader's "release in storage?" short-circuit never re-evaluates that. The HR blocked bootstrap-kit → sovereign-tls → cilium-gateway, breaking every HTTPRoute on the Sovereign.
Fix is two-pronged:
1) PRIMARY (prevent the trigger). Stretch leader-election lease durations on the three Catalyst-critical controllers (helm/kustomize/source) from the upstream defaults of lease=35s renew=30s retry=5s to lease=60s renew=40s retry=5s, and bump memory limits from 256Mi to 512Mi (helm) / 384Mi (kustomize, source) so OOMKills during 35-HR fan-out installs don't themselves trigger leadership handoffs. Costs ~50s extra failover time on a real controller crash; that's acceptable since CP HA is a Phase 2 concern and we'd much rather avoid spurious flips during transient API pressure.
2) RECOVERY (handle the residual case). New CronJob bp-flux-stuck-hr-recovery runs every 2 minutes, scans every HelmRelease cluster-wide, and for each HR stuck in Ready=Unknown for >5 minutes whose underlying Helm release secret already has status=deployed, force-toggles spec.suspend (the only known workaround per #925). Guardrail: refuses to act if more than 10 HRs would be touched in a single run (signals a cluster-wide outage). Operator-disablable via .Values.catalyst.stuckHelmReleaseRecovery.enabled=false.
Lock-in tests: tests/leader-election-and-recovery.sh covers all three flag/memory bumps, CronJob render, RBAC presence, disable-toggle, and threshold operator override. version-pin-replay + observability-toggle still green.
Chart bumped 1.1.4 → 1.2.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux): bump blueprint.yaml spec.version to 1.2.0 to match Chart.yaml (#925)
The bootstrap-kit static validation gate (Chart.yaml version == blueprint.yaml spec.version) caught the missed bump on PR #960.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
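The selection rule of the recovery CronJob (Ready=Unknown for more than 5 minutes, release secret already `deployed`, at most 10 per run) can be written as a pure function. This is a sketch under assumptions: the record shape and field names are hypothetical, and the real job reads these facts from the cluster rather than from dicts.

```python
from datetime import datetime, timedelta, timezone


def select_stuck_releases(
    releases: list[dict],
    now: datetime,
    stuck_after: timedelta = timedelta(minutes=5),
    max_per_run: int = 10,
) -> list[str]:
    """Return names of HelmReleases that qualify for a suspend toggle.

    Each record is a hypothetical dict with keys: name, ready, ready_since, helm_status.
    """
    stuck = [
        release["name"]
        for release in releases
        if release["ready"] == "Unknown"
        and now - release["ready_since"] > stuck_after
        and release["helm_status"] == "deployed"
    ]
    if len(stuck) > max_per_run:
        # Guardrail from the commit message: many stuck HRs suggests a wider outage.
        raise RuntimeError(f"{len(stuck)} stuck HelmReleases exceeds limit {max_per_run}")
    return stuck
```

Keeping the decision separate from the `kubectl` side effects makes the thresholds easy to test, which matters because both are described as operator-overridable.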
||
|
|
b7f150db38
|
fix(cutover 0.1.18): poll /healthz for readiness instead of auth-gated /status (#957) (#959)
The 0.1.17 auto-trigger Job was Complete=True on otech113 but the
cutover never actually started: the readiness probe loop polled
/api/v1/sovereign/cutover/status (auth-gated, behind RequireSession)
and treated 401 as "API not ready". The loop ran 30 times for 300s
and exited 0 — the trigger endpoint was NEVER called.
Live evidence on otech113 2026-05-05:
- 30 consecutive 401s from auto-trigger Pod (10.42.4.216) on
/sovereign/cutover/status in catalyst-api access log
- zero hits on /api/v1/internal/cutover/trigger
- Helm post-upgrade hook deadline tripped → rollback to 0.1.15
Fix (chart-side only; PR #947 catalyst-api endpoint is correct as-is):
- poll /healthz (unauthenticated, always 200 when process is up)
- drop the pre-flight cutoverComplete=true short-circuit since
/internal/cutover/trigger is already idempotent (returns 200 with
the existing snapshot when cutoverComplete=true, per
cutover_internal.go line 279)
- bump chart 0.1.17 → 0.1.18; pin slot 06a to 0.1.18
Tests:
- contract gate Case 13: probe target is /healthz, NOT
/sovereign/cutover/status (regression guard)
- contract gate Case 14: no stale cutoverComplete pre-read off
/tmp/status.json (the file no longer exists)
- existing 12 contract gates still pass; helm lint clean
- existing 6 Go unit tests for HandleCutoverInternalTrigger pass
Closes #957
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2ff50f0591
|
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign):
#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden.
Fix:
- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.
#953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain).
Fix:
- After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field).
- Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug.
Tests:
- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly.
Refs: #952 #953 (umbrella #915 — alice signup gate 5).
Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
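The "strict regex, refuse on shape change" behaviour for `images.smeTag` can be sketched as follows. The workflow itself uses sed in a GitHub Actions step, so this Python version is an illustration only. The two-space indentation and the hex-only SHA are assumptions, since the commit message shows the pattern with collapsed whitespace.

```python
import re

# Assumptions: two-space indentation and a lowercase hex SHA.
SME_TAG_LINE = re.compile(r'^(  smeTag: ")[0-9a-f]+"$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Replace the smeTag value, refusing unless exactly one line has the expected shape."""
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        # A renamed or reformatted field must fail loudly instead of being skipped.
        raise ValueError(f"expected exactly one smeTag line, found {len(matches)}")
    return SME_TAG_LINE.sub(lambda match: f'{match.group(1)}{new_sha}"', values_yaml)
```

Failing when the line is missing is the point of the design: the original bug was a rewrite that matched nothing and still reported success.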
||
|
|
689276889c
|
fix(bp-catalyst-platform+bp-newapi): unblock alice signup gates 2-6 on Sovereigns (#915) (#951)
Six coupled chart + orchestrator fixes that unblock alice marketplace
signup → tenant ready → SaaS integrations → LLM → ledger on a freshly
franchised Sovereign. C5-final got Gate 1 GREEN on otech113 (2026-05-05)
but every downstream gate failed because the SME bundle hardcoded
contabo-only assumptions.
Bumps:
- bp-catalyst-platform 1.4.21 → 1.4.22
- bp-newapi 1.3.0 → 1.4.0
- bootstrap-kit slot 13 + 80 pins updated in lockstep
Issues addressed (single consolidated PR — smaller PRs would race
against alice signup retries):
- #934 (auth SMTP empty → "failed to send email"): sme-secrets.yaml
now reads SMTP_* from `catalyst-system/sovereign-smtp-credentials`
(the same A5-seeded source #883/#905 the chart 1.4.20 catalyst-
openova-kc-credentials Secret already uses) with source-wins
precedence. Both canonical (smtp-host/port/from/user/pass) AND
legacy (host/port/from/user/password) source-Secret key shapes
accepted. Empty source falls back to chart-level defaults so the
contabo path stays clean.
- #940 (provisioning service GITHUB_TOKEN placeholder + hardcoded
upstream github.com): chart values
.Values.smeServices.provisioning.{githubToken,git.{apiURL,owner,
repo,branch}} make every GitHub-API coordinate operator-overridable
with topology-aware defaults (Sovereign ⇒ in-cluster Gitea REST
API + `openova` org; contabo ⇒ api.github.com + `openova-io` org).
Provisioning binary's startup gate validates the GITHUB_TOKEN does
NOT contain placeholder substrings (<placeholder>, PLACEHOLDER,
REPLACE_ME, ...) and crashes the Pod into Pending if it does — the
operator sees the misconfig immediately instead of after alice
signups have failed silently in service logs. GitHub client now
accepts a custom API URL via NewClientWithAPIURL so Gitea's GitHub-
compatible /api/v1 surface drops in without re-implementing the
client.
- #941 (catalog "27 apps COMING SOON"): added `openclaw` and
`stalwart-mail` to migrateAppDeployable's deployable map at
core/services/catalog/handlers/seed.go. Both blueprints (bp-openclaw,
bp-stalwart-{sovereign,tenant}) ship with visibility=listed in the
embedded blueprints.json AND have working SME-tenant overlay
templates in sme_tenant_gitops.go, but the catalog handler silently
filtered them out because they were missing here. Map extracted to
DeployableAppSlugs() exported function so unit tests can assert
membership without invoking a Mongo store.
- #942 (REDPANDA_BROKERS hardcoded to talentmesh): configmap.yaml
selects broker default at render time based on global.sovereignFQDN
— Sovereign ⇒ NATS JetStream Service per ADR-0001 (the only local
bus on Sovereigns); contabo ⇒ legacy Redpanda Service in talentmesh.
Operator MAY override either default via
.Values.smeServices.eventBus.brokers without forking the chart.
The ConfigMap key name stays REDPANDA_BROKERS for back-compat with
existing SME service Go env wiring; new EVENT_BUS_PROTOCOL key
surfaces the protocol hint for services that want to switch wire
format independently.
- #943 (bp-newapi silently skips Deployment): NEW
templates/cnpg-cluster.yaml auto-provisions a CNPG-backed Postgres
Cluster + Helm-`lookup`-persistent DSN Secret when
.Values.cnpg.enabled (DEFAULT true). NEW templates/credentials-
secret.yaml auto-generates SESSION_SECRET + CRYPTO_SECRET (each
64-char randAlphaNum, persistent across reconciles via Helm
`lookup`) when .Values.credentials.autoProvision (DEFAULT true).
deployment.yaml gate now resolves Secret names from the chart-
emitted defaults when the operator hasn't supplied an override.
Capabilities-gated on postgresql.cnpg.io/v1 so a cold install
before bp-cnpg is Ready surfaces as "no Cluster yet" rather than
a hard install error.
- #944 (CRITICAL — cross-cluster pollution): provisioning.yaml
templates GIT_BASE_PATH from
.Values.smeServices.provisioning.gitBasePath with a topology-aware
default `clusters/<sovereignFQDN>/sme-tenants` on Sovereigns. NEW
`core/services/provisioning/gitguard` package validates at startup
AND on every commit code path that the path begins with
`clusters/<self-FQDN>/` — refusing to commit to any other cluster's
tree. Defence in depth so a runtime env mutation (kubectl exec,
ConfigMap update without Pod restart, hostile sidecar) cannot
bypass the check. Pre-#944 every alice tenant overlay landed in
upstream openova/openova `clusters/contabo-mkt/tenants/<id>/`
which contabo Flux would then install on the contabo cluster —
C5-final caught + reverted the alice2 incident at commit
|
||
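The #944 path guard described above can be sketched roughly as follows. This is a minimal Python illustration, not the Go `gitguard` package itself: the function name `validate_commit_path` and the normalisation of `..` segments are assumptions made for the example, since the commit message only states that the path must begin with `clusters/<self-FQDN>/`.

```python
import posixpath


def validate_commit_path(path: str, self_fqdn: str) -> str:
    """Return the normalised path if it sits under clusters/<self_fqdn>/, else raise.

    Hypothetical helper: the real gitguard package is written in Go and its
    API is not shown in the commit message.
    """
    if not self_fqdn:
        raise ValueError("self FQDN is empty; refusing to commit")
    # Normalise first so 'clusters/a/../b' cannot slip past a plain prefix check
    # (assumption: the commit message does not say how '..' is handled).
    normalised = posixpath.normpath(path)
    prefix = f"clusters/{self_fqdn}/"
    if not normalised.startswith(prefix):
        raise ValueError(f"{path!r} is outside {prefix!r}; refusing to commit")
    return normalised
```

Calling the same function at startup and again on every commit code path is what gives the defence in depth the message describes: a value that was valid at boot is re-checked at the moment of writing.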
|
|
890fa67eff
|
fix(bp-harbor): inline labels on admin Secret to drop duplicate keys (#949) (#950)
PR #947 (bp-harbor 1.2.14) added templates/admin-secret.yaml that
included the canonical bp-harbor.labels helper AND re-declared
app.kubernetes.io/name + catalyst.openova.io/component with
admin-credential-specific values. Helm's strict YAML post-render parser
rejected the rendered manifest with
`mapping key "app.kubernetes.io/name" already defined at line 8`,
blocking the upgrade chain on otech113: bp-self-sovereign-cutover
dependsOn bp-harbor and re-blocked, stalling cutover indefinitely.
Per the issue's recommended Option A, labels are inlined verbatim on
the admin Secret. Every key the helper would emit is reproduced
explicitly, except the two that need a Secret-specific value
(catalyst.openova.io/component=harbor-admin) plus an explicit
admin-credentials sub-component label.
A regression guard (Case 6) is added to tests/admin-secret.sh: the
rendered Secret block is parsed through PyYAML's safe_load_all, which
enforces mapping-key uniqueness the same way Helm's post-render does.
Duplicate keys raise and break the test.
Bumps:
- platform/harbor/chart/Chart.yaml 1.2.14 → 1.2.15
- clusters/_template/bootstrap-kit/19-harbor.yaml slot pin
Verification (all green locally):
helm template smoke . --namespace harbor # renders OK
bash tests/admin-secret.sh # 6 gates green
helm lint . # 0 failed
Closes one half of #949 (bp-harbor side); the slot pin update delivers
it to fresh Sovereigns; existing otech113 picks up the upgrade on next
Flux reconcile after the new chart publishes.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
|
||
|
|
88a8ecd8bb
|
fix(cutover): Reflector-mirror harbor-admin Secret + in-cluster trigger endpoint (#935) (#947)
Two bugs surfaced live on otech113 2026-05-05 blocking Self-Sovereignty
Cutover end-to-end. Fix both in lockstep:
Bug 1: bp-self-sovereign-cutover Step 02 (harbor-projects) Job in
`catalyst` namespace was hitting `secret "harbor-core" not found` for
11+ retries because the upstream Harbor `harbor-core` Secret only
exists in the `harbor` namespace and Kubernetes forbids cross-namespace
secretKeyRef. Step 02 was stuck in CreateContainerConfigError forever.
Fix: bp-harbor 1.2.13 → 1.2.14 ships a Catalyst-curated `harbor-admin`
Secret in the `harbor` namespace with Reflector mirror annotations
(allowed-namespaces=catalyst, auto-enabled). The same Secret name
auto-materialises in `catalyst` so the cutover Job's secretKeyRef
resolves natively. Password is randomly generated on first install
(32-char alphanum, 190 bits entropy per feedback_passwords.md) and
preserved across reconciles via `lookup`. The upstream Harbor subchart
consumes it via `existingSecretAdminPassword: harbor-admin`.
bp-self-sovereign-cutover 0.1.16 → 0.1.17 updates
`harbor.adminSecretRef.name` from `harbor-core` to `harbor-admin`.
Bug 2: The 0.1.16 auto-trigger Helm post-install Job (#933) POSTed
/api/v1/sovereign/cutover/start which sits behind RequireSession
middleware. The Job has no human session cookie, so every request
401'd forever and cutover never started.
Fix: new catalyst-api endpoint POST /api/v1/internal/cutover/trigger
lives OUTSIDE RequireSession and validates the bearer token via the
apiserver's TokenReview API + checks the resolved username matches the
canonical `bp-self-sovereign-cutover-runner` SA. Same engine, same
idempotency, same state machine; different auth surface. The
auto-trigger Job now mounts its projected SA token at
/var/run/secrets/kubernetes.io/serviceaccount/token and sends it as
`Authorization: Bearer <token>`. SA username + accepted list are
runtime-overridable per Inviolable Principle #4.
Tests
- 6 Go unit tests for HandleCutoverInternalTrigger covering happy path,
missing bearer (401), TokenReview rejection (502), wrong SA (403),
idempotency (no Jobs created when complete), wrong method (405). All
pass.
- bp-harbor admin-secret contract test (5 cases): Secret renders,
HARBOR_ADMIN_PASSWORD key present, Reflector annotations, keep policy,
upstream consumes via existingSecretAdminPassword.
- bp-self-sovereign-cutover cutover-contract test extended with 3 new
cases: auto-trigger uses /internal/cutover/trigger, sends SA bearer
token, references harbor-admin (not harbor-core).
- All 12 cutover-contract gates green; all 4 observability-toggle gates
green; helm template + helm lint clean on both charts.
Bootstrap-kit slot pins
- clusters/_template/bootstrap-kit/19-harbor.yaml: 1.2.13 → 1.2.14
- clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml:
0.1.16 → 0.1.17
Closes #935
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
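The auth decisions listed for the #935 internal trigger endpoint can be illustrated with a small sketch. It is not the Go handler: `trigger_status`, `TokenReviewResult` and the injected `review_fn` are hypothetical names, the `catalyst` namespace in the expected username is assumed from the Job's namespace, and both the order of the checks and the 200 on success are assumptions. The status codes for the failure cases are the ones the test list names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Kubernetes reports ServiceAccount identities as
# system:serviceaccount:<namespace>:<name>. The namespace here is assumed.
DEFAULT_ALLOWED = ("system:serviceaccount:catalyst:bp-self-sovereign-cutover-runner",)


@dataclass
class TokenReviewResult:
    """Stand-in for the fields of a TokenReview status this sketch needs."""
    authenticated: bool
    username: str = ""


def trigger_status(
    method: str,
    authorization: Optional[str],
    review_fn: Callable[[str], Optional[TokenReviewResult]],
    allowed: Iterable[str] = DEFAULT_ALLOWED,
) -> int:
    """Return the HTTP status the sketch would answer with."""
    if method != "POST":
        return 405  # wrong method
    if not authorization or not authorization.startswith("Bearer "):
        return 401  # missing bearer
    token = authorization[len("Bearer "):].strip()
    if not token:
        return 401
    result = review_fn(token)  # hypothetical wrapper around the TokenReview call
    if result is None or not result.authenticated:
        return 502  # TokenReview rejection, per the commit's test list
    if result.username not in set(allowed):
        return 403  # authenticated, but not the cutover runner SA
    return 200  # assumed success status; the engine takes over from here
```

Passing `allowed` as a parameter mirrors the note that the SA username and accepted list are runtime-overridable.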
|
|
e9a72aa00d
|
feat(self-sovereign-cutover): auto-trigger on install + always-defined State (#933 E1) (#936)
Closes the otech113 dashboard regression where SovereigntyCard rendered
`invalid CutoverState: <undefined>` instead of a Tethered badge, and
makes the Day-2 cutover fire automatically once the chart lands rather
than waiting for an operator click on "Achieve True Sovereignty".
Founder rule per #933: handover is not "done" until cutover has run;
the operator must NOT have to click a CTA on
console.<sov-fqdn>/console/dashboard.
Three coupled changes:
1. catalyst-api: cutoverStatusResponse now ALWAYS emits a `state` field
("tethered" or "sovereign"), derived from cutoverComplete. The UI's
branded parseCutoverState rejects empty/undefined, which is what was
rendering the user-visible error text. Tests cover the empty
ConfigMap, missing cutoverComplete, and explicit-true cases.
2. UI parseCutoverStatus: defensive fallback when wire frame omits
`state`: derive from cutoverComplete (default "tethered").
Hostile/typo'd state values (e.g. 'pending', '') still throw via the
branded parser. Defends against partial-rollout where a stale
catalyst-api Pod is still serving the old shape.
3. bp-self-sovereign-cutover 0.1.16 (chart): new Helm
post-install/post-upgrade hook (templates/10-auto-trigger-job.yaml)
POSTs /api/v1/sovereign/cutover/start on catalyst-api after the step
ConfigMaps + RBAC land. Idempotent via catalyst-api's durable status
ConfigMap (200 if already complete, 409 if running, 200 to start).
Fails open: a transient catalyst-api unreachability exits 0 so the
chart install doesn't block; operator can always re-fire via the
manual CTA. Gated on .Values.trigger.auto (default true;
per-Sovereign overlays can disable for soak Sovereigns).
Hard rules honoured:
- No contabo Pods touched.
- Existing tethered Sovereigns that have not cut over stay tethered:
the auto-trigger Job is in the chart (per-Sovereign), not in the
mothership; only fresh Sovereign installs of
bp-self-sovereign-cutover 0.1.16+ get it.
- IaC-first: the auto-trigger uses catalyst-api's existing /start
endpoint (no bespoke cluster mutation outside the chart).
- Event-driven: post-install hook fires on chart install (no cron).
Verification:
- Go: cutover_test.go +TestBuildCutoverStatusResponse_StateAlwaysDefined
+TestHandleCutoverStatus_StateFieldEmittedOnFreshSovereign, both green.
- TS: cutover.test.ts +5 cases for parseCutoverStatus state-fallback;
35/35 green. Sovereignty widget tests 20/20 green.
- Chart: tests/cutover-contract.sh +Case 8/9 (auto-trigger present by
default, absent under trigger.auto=false); helm template renders
cleanly.
Co-authored-by: Hatice Yildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
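The state handling from #933 (server always emits `state`, client falls back when it is absent, unknown values still fail) can be sketched as below. The real code is Go on the API side and TypeScript in the UI; the Python function names are illustrative only, and treating only a missing key (not a `null` value) as "omitted" is an assumption.

```python
VALID_STATES = ("tethered", "sovereign")


def derive_state(cutover_complete: bool) -> str:
    """Server side: the state is a pure function of cutoverComplete."""
    return "sovereign" if cutover_complete else "tethered"


def parse_cutover_status(frame: dict) -> dict:
    """Client side: tolerate a frame without `state`, reject anything unknown."""
    complete = bool(frame.get("cutoverComplete", False))
    if "state" not in frame:
        # Old-shape frame from a stale Pod: derive instead of failing.
        state = derive_state(complete)
    else:
        state = frame["state"]
        if state not in VALID_STATES:
            raise ValueError(f"invalid CutoverState: {state!r}")
    return {"state": state, "cutoverComplete": complete}
```

The fallback only covers the missing-field case, so a typo'd or hostile value such as `'pending'` or `''` is still rejected rather than silently mapped to a badge.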
|
|
9077016466
|
feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924) (#931)
Phase-2 follow-up to #883: replace mothership Stalwart relay (mail.openova.io:587) with a Sovereign-local Stalwart so Console PIN/magic-link mail originates from `noreply@<sovereignFQDN>` with per-Sovereign SPF/DKIM/DMARC posture, eliminating the mothership SMTP SPOF for Sovereign Console login. What ships: 1. NEW blueprint platform/stalwart-sovereign/ (otech-level — distinct from per-tenant bp-stalwart-tenant). Single Stalwart instance per Sovereign cluster, scoped to Sovereign Console system mail. NO Keycloak OIDC, NO webmail UI — Sovereign Console is the only consumer. Auto-provisioned admin + submission Secrets via the lookup-or-generate pattern (#898/#830/#887). Post-install Job: - registers the noreply submission principal in Stalwart - allows send-as for noreply@<sovereignFQDN> - reads DKIM public key, patches dns-records ConfigMap - materialises catalyst-system/sovereign-smtp-credentials with Sovereign-local infrastructure addresses + credentials, carrying BOTH key shapes (smtp-user/smtp-pass + legacy user/password) so the consumer chart works either way. 2. NEW bootstrap-kit slot 95 (clusters/_template/bootstrap-kit/ 95-bp-stalwart-sovereign.yaml). dependsOn: bp-cert-manager, bp-catalyst-platform. Sequenced after bp-catalyst-platform (slot 13) so the chart's post-install Job lands its mirror Secret in an already-existing catalyst-system namespace. 3. bp-catalyst-platform 1.4.19 → 1.4.20: SOURCE-wins precedence extended to (a) non-secret fields smtp-host/smtp-port/smtp-from so Sovereign-local infra addresses (`mail.<sovereignFQDN>`) take over from mothership defaults (`mail.openova.io`) on the next reconcile after slot 95 lands, and (b) canonical key shape `smtp-user`/`smtp-pass` in addition to legacy `user`/`password` source key shape. 4. expected-bootstrap-deps.yaml: declare slot 95 graph edge. 5. 
catalyst-api handler/sovereign_smtp_seed.go: documentation-only update to note this Phase-1 step is now a graceful fallback — the Phase-2 chart's post-install Job overwrites the mirror Secret on first reconcile so the cutover from mothership relay to Sovereign-local relay is automatic, no operator action. Verification: - `helm template smoke ./platform/stalwart-sovereign/chart` clean (smoke-render-safe; per-template gates skip when sovereignFQDN unset). - `helm template smoke -f operator-values.yaml` emits StatefulSet, LoadBalancer Service, ClusterIP HTTP Service, DKIM-signing config, dns-records ConfigMap, Setup Job + RBAC. - `chart/tests/sovereign-render.sh` 3 cases all PASS. - `helm template smoke ./products/catalyst/chart` (1.4.20) clean. - `helm lint` both charts: clean (only icon-recommended INFO). - `bash scripts/check-bootstrap-deps.sh` PASSED — bootstrap-kit dependency graph audit, 0 drift, 0 cycles. - `go test -run TestSeedSovereignSMTP` — Phase-1 seed tests pass. - `go test -run TestBootstrapKit_TemplateClusterParses` — slot 95 YAML parses cleanly. Out of scope (sub-PR follow-up under #924): - DKIM keypair generation in catalyst-api orchestrator + DNS records (MX/A/SPF/DMARC/DKIM-pubkey) registration via PDM dynadot adapter at omani.works. - Hetzner PTR (rDNS) auto-registration via the Hetzner cloud API. - Cert-manager Certificate adding mail.<sovereignFQDN> SAN to the Sovereign wildcard cert (chart relies on the existing wildcard cert from bp-catalyst-platform 1.4.0+'s per-zone Certificate template — when that wildcard chain covers the Sovereign FQDN, `mail.<sovereignFQDN>` is already covered). Acceptance (lands when sub-PR follow-up ships): - Sovereign Console PIN delivery uses noreply@<sov-fqdn>. - External mail server (e.g. Gmail) accepts mail with valid SPF + DKIM. - Mothership SMTP no longer SPOF for Sovereign Console login. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3fe27f625f
|
feat(bp-wordpress-tenant): wp-cli OIDC bootstrap + oidc.* canonical block (0.2.0, #915) (#927)
Umbrella issue #915 (D1 sub-task). Aligns the chart's post-install OIDC config Job with the canonical wp-cli flow and the bp-keycloak tenant- realm contract C1's PR #918 ships. Chart 0.2.0 ----------- - templates/oidc-config-job.yaml rewritten to use the official wordpress:cli-2.12.0-php8.3 image (manifest-list digest pinned per Inviolable Principle #4). Replaces direct PHP/SQL UPSERTs against wp_options with: * wp core install (idempotent: wp core is-installed) * wp plugin install openid-connect-generic --activate (idempotent: wp plugin is-installed) * wp option update openid_connect_generic_settings <json> * wp option update default_role * wp theme install/activate * wp option update siteurl/home Going through wp-cli (i.e. WordPress core's own PHP API) is more resilient than schema-shape-dependent INSERT statements and survives WordPress minor upgrades. - values.yaml: new canonical oidc.* block — oidc.{enabled, issuerURL, clientId, clientSecretName, defaultRole, identityKey, roleMapping, cliImage}. Default oidc.clientSecretName = "wordpress-oidc-client-secret" matches the K8s Secret bp-keycloak's PR #918 emits alongside the realm import ConfigMap (so the realm JSON's `secret` field and the Secret bytes never drift). - Legacy keycloak.{realmURL, clientID, clientSecretName} kept as a back-compat alias. _helpers.tpl folds it into oidc.* when the modern keys are at their values.yaml defaults so chart 0.1.x clusters keep reconciling. Removed in chart 0.3.0. - oidc.defaultRole=subscriber — newly auto-created SSO users land with subscriber capability (operator overrides via overlay). - Redirect URIs: the openid-connect-generic plugin's default callback is /wp-admin/admin-ajax.php?action=openid-connect-authorize when alternate_redirect_uri=0 (we set 0). bp-keycloak (PR #918) registers the same URL plus /wp-login.php and a /* wildcard, so the client's allowed-redirect-URI list aligns with what the plugin actually issues. 
Orchestrator emit ----------------- - products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go smeTenantBPWordPress now emits the canonical oidc.* block AND the legacy keycloak.* alias (for chart 0.1.x clusters mid-upgrade). Tests ----- - chart/tests/oidc-config.sh — 7 helm-template assertions: 1. Canonical oidc.* render produces a Job with the required wp-cli command flow + wordpress:cli-2.12.0-php8.3 image. 2. Legacy keycloak.* fold path (chart 0.1.x compat). 3. oidc.enabled=false short-circuits the Job. 4. alternate_redirect_uri=0 (so plugin URL matches the realm- registered redirect URI from PR #918). 5. defaultRole rendered + propagated. 6. Render YAML is parseable and contains all required kinds. 7. wp-content PVC mounted in the Job (so pg4wp's db.php drop-in loads — failure here would silently fall back to mysqli). - internal/handler/sme_tenant_test.go: * TestRenderSMETenantOverlay_WordPressEmitsOIDC — pins the canonical oidc.* block + legacy keycloak.* alias the orchestrator emits for the alice@omantel test fixture. * TestRenderSMETenantOverlay_WordPressOIDC_BYOMode — BYO domain mode renders wordpress.<byo-domain> as the ingress host. Verification ------------ - helm lint clean - helm template smoke green for: oidc.* canonical, keycloak.* legacy fold, oidc.enabled=false short-circuit - chart/tests/oidc-config.sh: 7/7 PASS - chart/tests/observability-toggle.sh: 2/2 PASS (regression) - go test ./internal/handler/ -run "SMETenant|TestRenderSME": all green (TestAuthHandover_HappyPath failure is pre-existing on main, unrelated to this change) Closes (D1 sub-task) of #915. Co-authored-by: hatiyildiz <hatice@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a1ca1872aa
|
feat(bp-stalwart-tenant): wire Keycloak OIDC SSO end-to-end (#915) (#920)
Closes the C2 sub-task of EPIC #915 — alice's Stalwart authenticates SMTP/IMAP/JMAP/webmail logins against her per-tenant Keycloak realm, not a shared otech-level IdP. Three layered changes (matching the three things broken on otech103): 1. Orchestrator (`smeTenantBPStalwart` in sme_tenant_gitops.go) now emits per-tenant OIDC values matching the bp-wordpress-tenant + bp-openclaw shape: keycloak.realmURL = https://keycloak.<sub>.<parent>/realms/sme-<sub> keycloak.clientID = stalwart keycloak.clientSecretName = stalwart-oidc-client-secret keycloak.oidcExternalSecret.remoteRef.key = sovereign/<otech-fqdn>/stalwart/<tenant>/oidc plus admin externalSecret + dependsOn bp-keycloak so the SME's three apps (wordpress, openclaw, stalwart) SSO against ONE realm with distinct client IDs (#915 C1 registers all three in the realm bootstrap). 2. Chart bootstrap config.toml drops the pre-0.16 kebab-case `[directory.keycloak] type = "oidc"` block (silently ignored by the upstream registry parser — verified against crates/registry/src/schema/structs.rs in stalwartlabs/stalwart; OidcDirectory serdes camelCase: `@type = "Oidc"`, `issuerUrl`, `claimUsername`, `claimName`, `claimGroups`, `requireScopes`). The `internal` directory stays as the bootstrap fallback so the admin can log in before the post-install Job seeds OIDC. 3. setupJob defaults to enabled (was off in 0.1.1) and POSTs the canonical OIDC directory entry to `/api/settings`: directory.keycloak.@type = "Oidc" directory.keycloak.issuerUrl = <realm URL> directory.keycloak.claimUsername = preferred_username directory.keycloak.claimName = name directory.keycloak.claimGroups = groups directory.keycloak.requireScopes = [openid email profile groups] directory.keycloak.usernameDomain = <tenant domain> storage.directory = keycloak The setting POSTs are idempotent (`assert_empty: false`) so Helm upgrades re-run without breaking existing logins. Re-uses the upstream Stalwart container (ships curl + stalwart-cli) — no new image needed. 
Tests: - `chart/tests/oidc-render.sh` (NEW): asserts every settings key is rendered, the [oauth] env block propagates the per-tenant realm URL, and the bootstrap config.toml parses as valid TOML. - `chart/tests/expression-syntax.sh`: re-passes (Stalwart expression `==` audit per stalwart_expression_syntax.md). - `TestRenderSMETenantOverlay_StalwartEmitsKeycloakOIDC` (NEW): Go test verifies the orchestrator emits the per-tenant realm URL, client metadata, and ExternalSecret-store remoteRef paths. - All existing TestRenderSMETenantOverlay_* tests pass. - `helm template` clean with default values AND with a per-tenant overlay (--api-versions external-secrets.io/v1beta1). Chart bumps 0.1.1 → 0.1.2; blueprint.yaml spec.version mirrors per issue #817 (chart/blueprint version invariant). Co-authored-by: hatiyildiz <hatice@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9447d88dfd
|
feat(bp-newapi): auto-seed channel #1 = Qwen3.6 @ BankDhofar (#915) (#919)
Per epic #915 (SME tenant integration DoD: alice → OpenClaw → NewAPI → Qwen3.6@BankDhofar end-to-end), bp-newapi must come up with channel #1 = Qwen3.6 hosted at BankDhofar (https://llm-api.omtd.bankdhofar.com, model qwen3-coder / alias qwen3.6) already wired to its admin API, so the FIRST customer request from an SME's OpenClaw → NewAPI hits a real upstream LLM rather than a 404 / "no channel found" error. Until now the chart's channels.yaml ConfigMap was a documentation surface only; the upstream NewAPI binary persists channel state to its Postgres `channels` table via its admin API at /api/channel/. This patch bridges that gap. Discovery: - Canonical BankDhofar relay reference exists in openova-private/clusters/contabo-mkt/apps/axon/helmrelease.yaml (axon.vllm.baseUrl=https://llm-api.omtd.bankdhofar.com, defaultModel=qwen3-coder, secret=axon-vllm-secret). - K8s secret confirmed live (axon/axon-vllm-secret, key AXON_VLLM_API_KEY). - Architecture: bp-newapi is per-Sovereign (one NewAPI per OTECH); SME tenants share it via OpenClaw's newapi.baseURL = https://newapi.<OTECHFQDN>. Channel seeding therefore happens at the Sovereign-level chart install, NOT per-tenant. Changes: 1. platform/newapi/chart/values.yaml - New `defaultChannels.qwenBankDhofar` block (enabled=false by default; per-Sovereign overlay flips it true with the canonical endpoint + commercial-contract attestation). - New `channelSeed` block configuring the post-install Helm hook Job (image, resources, backoff, deadline, hook delete policy). 2. platform/newapi/chart/templates/_helpers.tpl - effectiveChannels helper composes qwenBankDhofar BEFORE operator-supplied .Values.channels and BEFORE defaultChannels.vllm so it lands as channel #1 in NewAPI's row-insertion order (NewAPI's router resolves `model` lookups in row order). - New channelSeedJobName helper (shared by Job + RBAC + ConfigMap). 3. 
platform/newapi/chart/templates/channel-seed-job.yaml (NEW) - post-install/post-upgrade Helm hook Job that: * Mounts the operator-supplied master-key Secret (auth.adminUI.masterKeySecret) for one-time admin API auth. * Mounts the per-channel upstream API key Secret (defaultChannels.qwenBankDhofar.existingSecret). * Polls /api/status until 200 (handles NewAPI startup window). * For each default channel: GET /api/channel/?keyword=<name>; if a row whose `name` exactly matches exists, SKIP. Otherwise POST /api/channel/ with the channel definition. Idempotent — re-runs after upgrades are no-ops once channels exist. * Bounded RBAC (Role+RoleBinding only on the named Secrets). * Skip-render gates: channelSeed.enabled, defaultChannels.* enabled, masterKeySecret supplied. helm template with default values renders no Job (CI smoke clean). 4. clusters/_template/bootstrap-kit/80-newapi.yaml - Bumped chart version 1.2.0 → 1.3.0. - Added defaultChannels.qwenBankDhofar block to the per-Sovereign overlay shape (still enabled=false in the template — operator supplies endpoint + attestation + Secrets per Sovereign). 5. platform/newapi/chart/Chart.yaml - Bumped 1.2.0 → 1.3.0 with changelog comment. 6. products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go - bp-openclaw per-tenant overlay now emits `newapi.defaultModel: qwen3.6` so OpenClaw's UI surfaces the friendlier alias by default. (Both qwen3.6 and qwen3-coder route to the same channel via the chart's `models` list.) Verification: - helm lint . PASS (1 chart linted, 0 failed) - helm template (defaults) PASS (no Job rendered) - helm template (qwen enabled) PASS (Job + RBAC + ConfigMap + channels.yaml all render with channel #1 first) - helm template (endpoint empty) FAIL with helpful message (configurability gate) - go build ./... PASS - go test ./internal/handler/... 
PASS for SME tenant overlay tests (TestRenderSMETenantOverlay_*) - Pre-existing AuthHandover panic is unrelated to this change Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every knob is configurable via the per-Sovereign bootstrap-kit overlay. The endpoint default is empty so a fresh `helm template` does not silently wire customers to a third-party host. Co-authored-by: alierenbaysal <alierenbaysal@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
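The seed Job's idempotency rule from #919 (search by keyword, skip only on an exact `name` match, otherwise create) is sketched below against an in-memory stand-in. `seed_channels` and `FakeAdminAPI` are hypothetical names; the sketch does not talk to NewAPI, and the substring behaviour of the fake search is an assumption about how a keyword query might behave.

```python
from typing import Callable, Dict, List


def seed_channels(
    desired: List[Dict],
    search: Callable[[str], List[Dict]],
    create: Callable[[Dict], None],
) -> List[str]:
    """Create each desired channel unless a row with exactly that name exists."""
    created = []
    for channel in desired:
        rows = search(channel["name"])
        # A keyword search may return near-matches; only an exact name counts.
        if any(row.get("name") == channel["name"] for row in rows):
            continue
        create(channel)
        created.append(channel["name"])
    return created


class FakeAdminAPI:
    """In-memory stand-in for the admin API, for illustration only."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def search(self, keyword: str) -> List[Dict]:
        return [row for row in self.rows if keyword in row["name"]]

    def create(self, channel: Dict) -> None:
        self.rows.append(dict(channel))
```

Running `seed_channels` twice against the same `FakeAdminAPI` creates the channel once and returns an empty list the second time, which is the "re-runs after upgrades are no-ops" property the message describes.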
|
|
7f859dbb4b
|
feat(bp-keycloak): tenant-mode realm with wordpress/openclaw/stalwart OIDC clients (1.4.0, #915) (#918)
PR #911 wired the SME tenant orchestrator to emit realmConfig.tenant.enabled=true on the per-tenant bp-keycloak HelmRelease — but the chart had no template that consumed those values, so the WordPress / OpenClaw / Stalwart OIDC integrations had no client registered in the tenant realm and SSO failed end-to-end. This change adds the chart-side template the orchestrator was already emitting for. When realmConfig.tenant.enabled=true: * configmap-sovereign-realm.yaml SKIPS (mutual-exclusion guard added on the existing template) so only one realm CM is rendered. * NEW templates/configmap-tenant-realm.yaml renders a realm import ConfigMap (same name `<release>-sovereign-realm-config` so the upstream keycloak-config-cli existingConfigmap reference still resolves) carrying the tenant realm + 3 OIDC clients: - wordpress (confidential, auth-code; redirect URIs cover the openid-connect-generic plugin's admin-ajax.php callback + /wp-login.php fallback) - openclaw (confidential, auth-code; redirect URI /oauth/callback per #915 spec) - stalwart (confidential, serviceAccountsEnabled=true so the directory.keycloak type=oidc backend can use client_credentials to introspect IMAP/SMTP tokens; standardFlowEnabled=true for webmail UI auth-code) * NEW per-app Secrets emitted in the same template scope as the realm ConfigMap so the realm JSON's `secret` field and the K8s Secret bytes never drift: - wordpress-oidc-client-secret - openclaw-oidc-client-secret - stalwart-oidc-client-secret (carries BOTH client-secret AND OIDC_CLIENT_SECRET keys for the two consumer paths) * Each per-app secret persists across helm upgrade via lookup-or-generate (mirrors marketplace-api/secret.yaml pattern from issue #887 and the existing catalyst-api-server secret in configmap-sovereign-realm.yaml). helm.sh/resource-policy: keep so bytes outlive uninstall. * Fail-closed validation when realmConfig.tenant.enabled=true and any of realmName / parentDomain / subdomain is unset (Inviolable Principle #4). 
NEW tests/tenant-realm-oidc-clients.sh covers 6 cases: 1. Sovereign-mode default render unchanged (kubectl + catalyst-ui + catalyst-api-server clients present, no tenant artefacts leak). 2. Tenant-mode render produces exactly ONE realm CM under the expected name + zero leaked Sovereign-only resources. 3. Tenant realm JSON parses + 3 OIDC clients present with the redirect-URI / publicClient / serviceAccountsEnabled shape per #915 spec; Secret bytes match realm JSON's `secret` fields. 4. Fail-closed validation when tenant fields missing. 5. keycloak-config-cli post-install Job projects the realm CM by SAME name in BOTH modes. 6. Operator-supplied per-app clientSecret overrides the lookup-or-generate path. Existing tests/observability-toggle.sh + tests/oidc-kubectl-client.sh still pass. Sovereign-mode unchanged. The chart now consumes the values the orchestrator (PR #911) was already emitting; no orchestrator change needed. Closes #915 (C1 sub-task) and unblocks #899 (per-tenant Keycloak realm-config materialisation). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
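The lookup-or-generate pattern that #918 relies on for the per-app client Secrets can be illustrated as follows. In the chart this is done with Helm's `lookup` and a random-string function; the Python below only models the decision order (operator override, then existing value, then fresh random value). The `store` dict stands in for the cluster's Secrets, and `lookup_or_generate` is a hypothetical name.

```python
import secrets
import string

ALPHANUM = string.ascii_letters + string.digits


def lookup_or_generate(
    store: dict, name: str, key: str, override: str = "", length: int = 32
) -> str:
    """Return the value to render for Secret `name`, key `key`.

    Order: operator-supplied override, then whatever is already stored,
    then a freshly generated alphanumeric string.
    """
    if override:
        value = override
    else:
        existing = store.get(name, {}).get(key)
        # Reuse the stored bytes so upgrades never rotate the secret by accident.
        value = existing or "".join(secrets.choice(ALPHANUM) for _ in range(length))
    store.setdefault(name, {})[key] = value
    return value
```

Because the same value feeds both the realm JSON's `secret` field and the Kubernetes Secret, computing it once per render is what keeps the two from drifting.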
|
|
61c8d77b58
|
feat(bp-openclaw): per-tenant Keycloak SSO + NewAPI as OpenAI-compatible LLM gateway (#915) (#917)
Wire bp-openclaw to the per-tenant Keycloak realm (OIDC SSO) and the
per-tenant NewAPI (OpenAI-compatible LLM endpoint, NOT direct OpenAI),
delivering C3 of umbrella epic #915.
Chart changes (bp-openclaw 0.1.0 → 0.2.0):
- Add canonical `oidc.{issuerURL,clientId,clientSecret.{name,key}}`
block.
- Add canonical `llm.{baseURL,apiKey.{name,key},defaultModel}` block.
- Controller Deployment now emits OIDC_*, LLM_*, OPENAI_API_{BASE,KEY},
LLM_DEFAULT_MODEL envs (legacy KEYCLOAK_*/NEWAPI_BASE_URL_DEFAULT
retained for back-compat with current controller image).
- Per-user pods carry OPENAI_API_BASE / OPENAI_API_KEY /
LLM_DEFAULT_MODEL alongside the identity-blind NEWAPI_BASE_URL /
NEWAPI_KEY (ADR-0003 §3.3 unchanged).
- Legacy `keycloak.*` / `newapi.*` keys remain accepted as fallbacks;
helpers prefer canonical blocks but fall back to the legacy alias when
the canonical block is unset (or still at placeholder).
- assertNoPlaceholders guard updated to check resolved canonical values.
- render-toggles.sh smoke test extended: asserts both canonical and
legacy code-paths render and that all expected envs reach the rendered
Deployment.
Orchestrator changes (catalyst-api smeTenantBPOpenClaw template):
- Emit per-tenant `oidc.issuerURL` =
https://keycloak.<sub>.<parent>/realms/sme-<sub>
- Emit per-tenant `oidc.clientId` = openclaw, secret from
openclaw-oidc-client-secret/OIDC_CLIENT_SECRET (rendered by
bp-keycloak's post-install hook).
- Emit per-tenant `llm.baseURL` = https://api.<sub>.<parent>/v1 (alice's
own NewAPI ingress, NOT the otech-wide newapi.<otech-fqdn>); apiKey
from openclaw-newapi-controller-token/NEWAPI_KEY.
- Emit `llm.defaultModel: qwen3.6`. NewAPI uses this to select the
backing channel; C4 of #915 wires Qwen3.6@BankDhofar at tenant-create.
- Legacy keycloak/newapi blocks still emitted for back-compat with
bp-openclaw < 0.2.0.
Tests:
- New TestRenderSMETenantOverlay_OpenClawOIDCAndLLMBlocks asserts the
rendered HelmRelease contains the canonical oidc + llm blocks with
per-tenant values, and that llm.baseURL is the per-tenant
api.<sub>.<parent>/v1 (NOT the otech-wide newapi).
- bp-openclaw render-toggles.sh extended (Case 2b/2c).
Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
368545369b
|
fix(bp-stalwart-tenant): unbootable on fresh tenants — values shape, missing admin Secret, sec ctx (#898) (#904)
Three fixes that left bp-stalwart-tenant 0.1.0 unable to come up on a
freshly-franchised SME tenant. All surfaced on the otech103 alice
tenant during the Phase-1 DoD sweep.
1. Tenant-domain values shape (HelmRelease render error)
The 0.1.0 chart referenced `.Values.domain.primary` in five
templates. The live HR on otech103 had `values.domain:
acme.omani.works` (a string), emitted by a pre-#897 catalyst-api
build, so every reconcile died with:
can't evaluate field primary in type interface {}
Added `bp-stalwart-tenant.tenantDomain` + `tenantMode` helpers
that resolve in priority order:
1. `tenant.domain` (forward-looking flat shape)
2. `domain.primary` (canonical post-#897 map shape)
3. `domain` (string) (legacy pre-#897 shape — back-compat)
Returns "" smoke-render-safe; per-template gates skip when empty.
2. Missing stalwart-admin Secret
deployment.yaml + mailbox-provision-job.yaml reference a Secret
key `ADMIN_PASSWORD` on `.Values.admin.secretName`. The 0.1.0
chart only emitted an ExternalSecret, and only when
`admin.externalSecret.remoteRef.key` was non-empty (smoke-render
concession). Fresh tenants land in CreateContainerConfigError.
Added `templates/admin-secret.yaml` mirroring marketplace-api/
secret.yaml (#887): random 32-char ADMIN_PASSWORD generated by
sprig randAlphaNum, persisted across reconcile via lookup,
helm.sh/resource-policy: keep so reinstall picks it back up.
Auto-disabled when an authoritative ExternalSecret is wired —
no double-bind between two controllers.
3. Pod sec ctx vs. upstream image's file capabilities
`getcap docker.io/stalwartlabs/stalwart:v0.16.3 /usr/local/bin/
stalwart` reports `cap_net_bind_service=ep`. The image creates
user `stalwart` at UID 2000 and the binary IS the entrypoint
(no demotion script). The 0.1.0 chart ran as UID 65534 with
`drop: ALL` — kernel refuses to elevate file caps with empty
bounding set, so exec failed with `operation not permitted`.
Aligned to image's native UID 2000, kept `drop: ALL` and added
`NET_BIND_SERVICE` explicitly. fsGroup 2000 ensures /opt/stalwart
PVC is writable.
Other:
- Bumped Chart.yaml + blueprint.yaml to 0.1.1 (#817 alignment).
- configSchema in blueprint.yaml now permits the legacy + tenant
shapes alongside the canonical map.
- mailboxProvisioner.setupJob.enabled defaults to false until the
canonical stalwart-cli image is published (re-uses upstream
stalwart container as fallback CLI host).
Acceptance: targeted at otech103 alice tenant
(sme-789ae512-bc0f-467c-a016-001f5496c403) where 0.1.0 reconciliation
fails with the value-shape error and the pod CrashLoops with `exec
... operation not permitted`. Verification on otech103 in #898.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
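The priority order of the `tenantDomain` helper from #898 can be sketched in Python as below. The real helper is a Helm template; `tenant_domain` is an illustrative name, and the input dict stands in for `.Values`.

```python
def tenant_domain(values: dict) -> str:
    """Resolve the tenant domain across the three value shapes.

    1. tenant.domain   (forward-looking flat shape)
    2. domain.primary  (canonical post-#897 map shape)
    3. domain          (legacy pre-#897 string shape)
    Returns "" when nothing is set, so callers can skip rendering.
    """
    tenant = values.get("tenant")
    if isinstance(tenant, dict) and tenant.get("domain"):
        return tenant["domain"]
    domain = values.get("domain")
    if isinstance(domain, dict):
        # Map shape: never index into a string, which is what broke 0.1.0.
        return domain.get("primary") or ""
    if isinstance(domain, str):
        return domain
    return ""
```

Checking the type of `domain` before reading `primary` is the step that avoids the `can't evaluate field primary in type interface {}` failure on the legacy string shape.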
|
|
93c4b700de
|
fix(bp-keycloak): templatize existingConfigmap reference for per-tenant installs (#899) (#902)
bp-keycloak 1.3.2 hardcoded `keycloak.keycloakConfigCli.existingConfigmap` to
the literal "keycloak-sovereign-realm-config". This worked for the Sovereign-
mothership bootstrap-kit (releaseName=keycloak emits matching ConfigMap) but
broke for every per-tenant install where releaseName=bp-keycloak emits
"bp-keycloak-sovereign-realm-config" — the post-install keycloak-config-cli
Job stuck in ContainerCreating with `MountVolume.SetUp failed for volume
"config-volume" : configmap "keycloak-sovereign-realm-config" not found`,
HelmRelease InstallFailed after 15m timeout, cascading to bp-openclaw and
bp-wordpress-tenant which dependsOn it.
The bitnami/keycloak subchart's `keycloak.keycloakConfigCli.configmapName`
helper (charts/keycloak/templates/_helpers.tpl) applies `tpl` to the
existingConfigmap value, so embedding `{{ .Release.Name }}` inside the
string resolves at chart-render time. With this single-line change:
- Sovereign-mothership (releaseName=keycloak) → keycloak-sovereign-realm-config (unchanged)
- Per-tenant (releaseName=bp-keycloak) → bp-keycloak-sovereign-realm-config (matches actual emitted ConfigMap)
Verified via helm template both modes — backendRef and config-volume
configMap.name match the actual ConfigMap emitted by
templates/configmap-sovereign-realm.yaml.
Chart bumped 1.3.2 → 1.3.3 + bootstrap-kit slot 09 + blueprint.yaml.
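The release-name resolution described above can be illustrated with a small Python sketch. It is not Helm's `tpl` function, only a toy substitution that mimics its effect for this one placeholder. The exact templatized value (`{{ .Release.Name }}-sovereign-realm-config`) is an assumption inferred from the two rendered names listed in this message.

```python
import re

# Assumed shape of the templatized value; the real chart value may differ.
EXISTING_CONFIGMAP = "{{ .Release.Name }}-sovereign-realm-config"


def render_release_name(value: str, release_name: str) -> str:
    """Replace `{{ .Release.Name }}` in a values string.

    Toy stand-in for Helm's `tpl`; it handles only this one placeholder.
    """
    pattern = r"\{\{\s*\.Release\.Name\s*\}\}"
    # A callable replacement avoids backslash interpretation in release_name.
    return re.sub(pattern, lambda _match: release_name, value)


if __name__ == "__main__":
    for release in ("keycloak", "bp-keycloak"):
        print(release, "->", render_release_name(EXISTING_CONFIGMAP, release))
```

Both install modes then reference a ConfigMap name that carries the same release prefix as the ConfigMap the chart emits.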
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eddf0e62a4
|
fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter (#891) (#892)
* fix(catalyst-api): SME-tenant orchestrator writes parent kustomization.yaml index (#889)

The Flux Kustomization rendered by bp-catalyst-platform 1.4.13+ at clusters/<sov-fqdn>/sme-tenants/ requires a parent kustomization.yaml that enumerates tenant subdirectories. The orchestrator only wrote per-tenant overlays without the parent index, so on otech103 Flux hit:

kustomization path not found: stat /tmp/kustomization-... /clusters/otech103.omani.works/sme-tenants: no such file or directory

Even after a tenant signup, the parent path lacked a kustomization.yaml so Flux couldn't enumerate subdirs.

Fix: NEW writeParentTenantsIndex helper called from both WriteTenantOverlay and DeleteTenantOverlay. Scans the parent dir for subdirectories that contain kustomization.yaml, sorts them lexically for deterministic output (no spurious diffs), and writes a parent kustomization.yaml listing them under `resources:`. Empty list (no tenants) renders as `resources: []`, which is still a valid Kustomization root, so Flux stays Ready=True after the last tenant teardown. git add covers both the per-tenant subdir AND the parent index, so a single commit captures the delta.

Live on otech103 post-cutover, 2026-05-05.

* fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter to include clusters/<sov-fqdn>/ (#891)

After Day-2 cutover, the GitRepository ignore filter excluded the Sovereign's own clusters/<sov-fqdn>/ subtree. This made every Sovereign-specific Flux Kustomization (sme-tenants, future per-Sov overlays) hit "kustomization path not found" because source-controller filtered the path out of the artifact tarball.

Live on otech103 (2026-05-05): sme-tenants Kustomization stuck for 20+ minutes despite the orchestrator successfully committing the overlay to local Gitea.

Fix: Step-5 (flux-gitrepository-patch) now writes the patch as a multi-line YAML strategic-merge file via /tmp emptyDir (since the Pod runs readOnlyRootFilesystem), composing the new ignore filter:

/*
!/clusters/_template
!/clusters/${SOVEREIGN_FQDN}
!/platform
!/products

The SOVEREIGN_FQDN is wired from .Values.sovereign.fqdn (already established in the chart values).

Bumps chart 0.1.14 -> 0.1.15. Slot 06a pin bumps in lockstep.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
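The parent-index behaviour from #889 (scan, sort, write, with an explicit empty list) can be sketched as follows. This is an illustration in Python, not the catalyst-api helper itself. The function name mirrors `writeParentTenantsIndex`, but the signature and the `apiVersion`/`kind` header lines are assumptions.

```python
from pathlib import Path


def write_parent_tenants_index(parent_dir: Path) -> list[str]:
    """Write parent_dir/kustomization.yaml listing tenant subdirectories.

    Only subdirectories that contain their own kustomization.yaml count.
    Names are sorted so repeated runs produce byte-identical output.
    """
    tenants = sorted(
        child.name
        for child in parent_dir.iterdir()
        if child.is_dir() and (child / "kustomization.yaml").is_file()
    )
    lines = [
        "apiVersion: kustomize.config.k8s.io/v1beta1",  # assumed header
        "kind: Kustomization",
    ]
    if tenants:
        lines.append("resources:")
        lines.extend(f"  - {name}" for name in tenants)
    else:
        # An explicit empty list keeps the root valid after the last teardown.
        lines.append("resources: []")
    (parent_dir / "kustomization.yaml").write_text("\n".join(lines) + "\n")
    return tenants
```

Calling the same function after both a tenant write and a tenant delete keeps the index in step with the directory contents.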
||
|
|
8e4c88fd28
|
fix(bp-self-sovereign-cutover): auto-sync local Gitea mirror from upstream GitHub (#870) (#875)
Step-1 gitea-mirror Job replaces the legacy one-shot create-empty-repo + git-push pattern with a single call to Gitea's native /repos/migrate API with mirror=true and mirror_interval=10m0s. Gitea now polls the upstream openova-io/openova repo on a 10-minute interval and replicates branches + tags into the local Sovereign Gitea automatically.

Closes the "Sovereign drifts from upstream main forever after Day-2 cutover" bug, hit twice during the otech103 2026-05-04 overnight DoD session, requiring manual `git fetch` inside the Gitea pod for every chart rollout.

Why /repos/migrate over the previous git push approach:
- Gitea cannot convert a regular repo into a pull-mirror after creation (the mirror flag is set at create-time only). The migrate endpoint creates the repo AS a mirror in one shot.
- The migrate endpoint accepts toggles for issues / pull-requests / wiki / labels / milestones / releases; we set them all to false so Gitea only replicates branches+tags, the only refs the Sovereign's Flux GitRepository needs.
- Recurring sync is a Gitea-native capability; using it avoids a parallel CronJob (which would violate the "event-driven not cron" inviolable principle) or a long-poll sidecar (which would duplicate what Gitea already does).

Idempotency: if the repo already exists from a prior cutover attempt, the script PATCHes mirror_interval to the desired value and POSTs to /mirror-sync to trigger an immediate refresh. Note that PATCH alone cannot convert a legacy non-mirror repo to a mirror; Sovereigns seeded by chart < 0.1.14 would need an operator-driven repo delete + re-migrate to retro-fit auto-sync, but new provisions take the migrate path automatically.

Verification on the rendered ConfigMap:
$ helm template smoke . # renders 16 docs cleanly
$ bash tests/cutover-contract.sh # all 7 gates green
$ sh -n <rendered-script> # POSIX shell syntax OK

Chart bumped 0.1.13 → 0.1.14 (Chart.yaml + blueprint.yaml spec.version aligned per #817 invariant + slot 06a-bp-self-sovereign-cutover.yaml pin lockstep).

Refs #870, #790.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
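As a rough illustration of the migrate request described in #870, the sketch below builds a JSON body with the mirror flag, the 10-minute interval, and every optional toggle switched off. The field names follow Gitea's migrate options as I understand them. The upstream clone URL and the local owner name are assumptions, and nothing is sent over the network.

```python
import json


def build_migrate_payload(clone_addr: str, owner: str, repo: str,
                          interval: str = "10m0s") -> dict:
    """Build a request body for Gitea's POST /repos/migrate as a pull mirror."""
    return {
        "clone_addr": clone_addr,
        "repo_owner": owner,
        "repo_name": repo,
        "mirror": True,               # mirror status is fixed at creation time
        "mirror_interval": interval,  # Gitea polls upstream on this cadence
        # Only branches and tags are wanted, so the extras are disabled.
        "issues": False,
        "pull_requests": False,
        "wiki": False,
        "labels": False,
        "milestones": False,
        "releases": False,
    }


if __name__ == "__main__":
    payload = build_migrate_payload(
        "https://github.com/openova-io/openova.git",  # assumed upstream URL
        "openova",                                     # hypothetical local owner
        "openova",
    )
    print(json.dumps(payload, indent=2))
```

For a repo that already exists, the commit describes a different path (PATCH the interval, then POST to /mirror-sync), which this sketch does not cover.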
||
|
|
9b710049e3
|
fix(self-sovereign-cutover): Step-8 baseline-diff (only NEW regressions count) (#858)
Live otech103: Step-8 survival window failed because infrastructure-config Kustomization had been NotReady for 4h pre-cutover (Crossplane provider CRD ordering — unrelated to sovereignty). Sovereignty proof asks 'did cutover break anything', not 'is the cluster perfect'. Capture baseline NotReady set before the window, only fail on NEW additions during. Bumps 0.1.12 → 0.1.13 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
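The baseline-diff rule in #858 reduces to a set difference. The sketch below assumes each poll yields the set of NotReady resource names; the function names and the sample resource names other than infrastructure-config are hypothetical.

```python
def new_regressions(baseline: set[str], current: set[str]) -> set[str]:
    """Return resources that are NotReady now but were Ready at baseline."""
    return current - baseline


def survival_window_passed(baseline: set[str], samples: list[set[str]]) -> bool:
    """Pass only if no poll during the window shows a NEW NotReady resource."""
    return all(not new_regressions(baseline, sample) for sample in samples)


if __name__ == "__main__":
    # Captured before the window: already broken for unrelated reasons.
    baseline = {"infrastructure-config"}
    polls = [{"infrastructure-config"}, {"infrastructure-config", "sme-tenants"}]
    print(survival_window_passed(baseline, polls[:1]))
    print(sorted(new_regressions(baseline, polls[1])))
```

A resource that was NotReady before the window never fails the check, while anything that turns NotReady during it does.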
||
|
|
d5d1d9b2cd
|
fix(self-sovereign-cutover): Step-8 tolerate slot-managed self-ref HelmRepositories (#857)
Live otech103: Step-8 verification flagged 2 HelmRepositories (bp-newapi + bp-self-sovereign-cutover) still on ghcr.io/openova-io. Both are declared in clusters/_template/bootstrap-kit/ slot files which Flux Kustomization re-applies on every reconcile — Step-6's patch is transient for them. Data-plane impact is null because they're not pulled again until the next cutover cycle which would re-apply the patch first. The 38 leaf-bp HelmRepositories ARE patched durably (live in HelmRelease values, not separate slot files). Bumps 0.1.11 → 0.1.12 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
142ea21534
|
fix(self-sovereign-cutover): Step-8 passive architectural verification (Cilium can't egressDeny+toFQDNs) (#856)
Live otech103: Step-8 (egress-block-test) failed because Cilium 1.16's CiliumNetworkPolicy schema doesn't support 'spec.egressDeny[].toFQDNs' (strict-decoding error 'unknown field'). FQDN-based matching in Cilium is only allowed in 'egress' (allow), not 'egressDeny'.

Pivot: Step-8 now asserts the architectural pivots from Steps 5-7 are actually live (GitRepository.url + all HelmRepositories + catalyst-api env all point at local Gitea/Harbor) BEFORE entering the durationSeconds survival window during which Flux Kustomization + HelmRelease readiness is polled. Same sovereignty proof, expressed in a form Cilium can evaluate.

Bumps 0.1.10 → 0.1.11 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
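The passive check in #856 amounts to asserting that every source URL points at a local host. A minimal sketch, assuming the URLs have already been collected into a mapping; the host names and resource keys are hypothetical, and the real step reads them from the cluster.

```python
from urllib.parse import urlparse


def non_local_sources(urls: dict[str, str], local_hosts: set[str]) -> dict[str, str]:
    """Return the entries whose URL host is not in the allowed local set."""
    return {
        name: url
        for name, url in urls.items()
        if urlparse(url).hostname not in local_hosts
    }


if __name__ == "__main__":
    local = {"gitea.example.internal", "harbor.example.internal"}
    sources = {
        "GitRepository/flux-system": "http://gitea.example.internal/openova/openova.git",
        "HelmRepository/bp-example": "oci://ghcr.io/openova-io",
    }
    # Any entry printed here would fail the check before the survival window.
    print(non_local_sources(sources, local))
```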
||
|
|
86ae235804
|
fix(self-sovereign-cutover): catalyst-api namespace catalyst-system not catalyst-platform (#855)
Live otech103: Step-7 (catalyst-api-env-patch) hit 'deployments.apps catalyst-api not found' in catalyst-platform ns. Actual Sovereign-side namespace is catalyst-system. Bumps 0.1.9 → 0.1.10. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
dd84060d05
|
fix(self-sovereign-cutover): switch from bitnami/kubectl to alpine/k8s (#854)
Live otech103 2026-05-04: bitnami/kubectl:1.31.4 404 on Docker Hub. Bitnami deprecated public Docker Hub registry in 2025; their kubectl image stopped getting tags. alpine/k8s is the canonical alpine-based replacement — kubectl + helm + standard k8s CLI surface, actively maintained, :1.31.4 verified present. Bumps 0.1.8 → 0.1.9 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
887ff62200
|
fix(self-sovereign-cutover): bitnami/kubectl tag :1.31 → :1.31.4 (#853)
Live otech103 2026-05-04: Step-5 (flux-gitrepository-patch) Pod DeadlineExceeded after 10m of ImagePullBackOff. bitnami/kubectl on DockerHub doesn't have a floating :1.31 tag — only patch-level :1.31.X. Pin to :1.31.4 (latest of 1.31 minor as of today). Bumps 0.1.7 → 0.1.8 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
e9970db7b6
|
fix(self-sovereign-cutover): proxy-quay adapter type docker-registry (#852)
Live otech103: Harbor rejects project create with metadata.proxy_cache=true on registries with type 'quay' — HTTP 400 'unsupported registry type quay'. Quay speaks plain v2 so docker-registry is the correct adapter (4/7 projects ahead succeeded with the same shape). Bumps 0.1.6 → 0.1.7. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
ea51642092
|
fix(self-sovereign-cutover): proxy-ghcr Harbor adapter type 'github-ghcr' (#851)
Live otech103 2026-05-04: Step-2 harbor-projects POST /api/v2.0/registries returns 500 'adapter factory for github not found'. Harbor 2.x's canonical GHCR proxy-cache adapter is named 'github-ghcr', not 'github'. Bumps 0.1.5 → 0.1.6 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
8f96daeb6f
|
fix(self-sovereign-cutover): harbor service is 'harbor-core' not 'harbor-harbor-core' (#849)
Live failure on otech103 2026-05-04: Step-2 (harbor-projects) Pod exits silently after first echo because curl exit 6 (CURLE_COULDNT_RESOLVE_HOST). The chart's default harborInternalURL was http://harbor-harbor-core.harbor.svc.cluster.local but the actual bitnami harbor chart's service name is harbor-core (release name doesn't double-prefix when targetNamespace == 'harbor' AND releaseName == 'harbor'). Fix: harborInternalURL → http://harbor-core.harbor.svc.cluster.local. Verified via 'kubectl get svc -n harbor' on otech103. Bumps 0.1.4 → 0.1.5 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
ab5681e656
|
fix(self-sovereign-cutover): Step-1 use bare clone + explicit refspec push (#848)
Live failure on otech103 2026-05-04 even after 0.1.3: git push --all in a mirror clone still pushes refs/pull/* because mirror clones store all upstream refs (incl. GitHub PR refs) at the same level as refs/heads/, and --all walks the whole local refstore.

Fix: use git clone --bare (not --mirror) which only fetches refs/heads/* and refs/tags/*, then push with explicit refspecs:

git push origin 'refs/heads/*:refs/heads/*'
git push origin 'refs/tags/*:refs/tags/*'

Bumps 0.1.3 → 0.1.4 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
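The effect of the explicit refspecs in #848 can be sketched in Python without running git: one helper decides which refs the two refspecs cover, and another assembles the command lines. The helper names, the working directory, and the use of a destination URL in place of `origin` are assumptions for illustration.

```python
PUSHED_PREFIXES = ("refs/heads/", "refs/tags/")


def is_pushed(ref: str) -> bool:
    """True if the ref falls under refs/heads/* or refs/tags/*."""
    return ref.startswith(PUSHED_PREFIXES)


def mirror_commands(upstream_url: str, dest_url: str,
                    workdir: str = "repo.git") -> list[list[str]]:
    """Return argv lists for a bare clone followed by two refspec pushes."""
    return [
        ["git", "clone", "--bare", upstream_url, workdir],
        ["git", "-C", workdir, "push", dest_url, "refs/heads/*:refs/heads/*"],
        ["git", "-C", workdir, "push", dest_url, "refs/tags/*:refs/tags/*"],
    ]


if __name__ == "__main__":
    refs = ["refs/heads/main", "refs/tags/v0.1.4", "refs/pull/848/head"]
    print([ref for ref in refs if is_pushed(ref)])
```

Pull-request refs such as `refs/pull/848/head` match neither prefix, so they are never offered to the destination.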