openova/products/continuum/chart/values.yaml
e3mrah 96f8b260c9
feat(continuum): F — dry-run report + post-switchover health check + audit-emit coverage (slice F-1+F-2+F-3, #1101) (#1161)
Slice F layers three concerns on top of K-Cont-2's reconciler +
sequencer:

F-1 — extend audit-emit coverage with three new audit-types:
- continuum-cr-created     — fires once per CR observation
- continuum-config-changed — fires on switchover-relevant spec drift
- continuum-lease-collision — fires when Acquire returns
                              ErrLeaseHeldByAnother during the
                              opportunistic re-acquire path
Total reserved Continuum audit-types now 12 (was 9). Order is
K-Cont-2's 9 first, then F-1's 3 (additions at end so existing
index-pinned tests keep working). U-DR-1 subscribes by
audit-type=continuum-* so it receives the new types automatically.
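
The prefix-based subscription described above can be sketched as follows. The three constants use the audit-type strings from this message; the helper `isContinuumAudit` and the sample non-Continuum type are hypothetical, and the real subscriber may match on a NATS subject wildcard rather than a string prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// The three audit-types added by F-1, spelled as in the commit message.
// The nine K-Cont-2 types come before them in the real list and are
// not reproduced here.
const (
	AuditCRCreated      = "continuum-cr-created"
	AuditConfigChanged  = "continuum-config-changed"
	AuditLeaseCollision = "continuum-lease-collision"
)

// isContinuumAudit is a hypothetical helper: it reports whether an
// audit-type belongs to the continuum-* family.
func isContinuumAudit(auditType string) bool {
	return strings.HasPrefix(auditType, "continuum-")
}

func main() {
	// "some-other-audit" is a made-up type used only for contrast.
	types := []string{AuditCRCreated, AuditConfigChanged, AuditLeaseCollision, "some-other-audit"}
	for _, t := range types {
		fmt.Printf("%-28s continuum=%v\n", t, isContinuumAudit(t))
	}
}
```

Because matching is by prefix, a consumer written this way needs no change when further continuum-* types are appended.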

F-2 — Sequencer.DryRun + DryRunReport struct + per-step
preconditions evaluator. Walks the same 7 steps Execute would run,
but read-only end-to-end (asserted by tests: zero audit emits, zero
state mutation). Per-step durations as exported constants. Plan
content fingerprint (16-hex SHA-256 prefix) for cache idempotency.
Blockers (FATAL) are kept separate from Warnings (advisory) so the UI
can render the report and disable [ Confirm Switchover ] whenever a
blocker is present.
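
A minimal sketch of the fingerprint and the blocker rule, assuming the plan content can be reduced to an ordered list of step names. The struct shape, the field names, the `fingerprint` and `CanConfirm` helpers, and the placeholder step names are all illustrative and are not the controller's actual DryRunReport.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// DryRunReport is a hypothetical, trimmed-down shape; the real struct
// may carry different fields.
type DryRunReport struct {
	Steps       []string // ordered step names the plan would execute
	Blockers    []string // FATAL findings
	Warnings    []string // advisory findings
	Fingerprint string   // first 16 hex chars of SHA-256 over the plan
}

// fingerprint hashes the ordered step names and keeps the first 16 hex
// characters (8 bytes) of the SHA-256 digest.
func fingerprint(steps []string) string {
	sum := sha256.Sum256([]byte(strings.Join(steps, "\n")))
	return hex.EncodeToString(sum[:])[:16]
}

// CanConfirm mirrors the UI rule: warnings never block, blockers do.
func (r DryRunReport) CanConfirm() bool { return len(r.Blockers) == 0 }

func main() {
	steps := []string{"step-1", "step-2", "step-3"} // placeholder names
	r := DryRunReport{Steps: steps, Warnings: []string{"example advisory"}}
	r.Fingerprint = fingerprint(r.Steps)
	fmt.Println(r.Fingerprint, r.CanConfirm())
}
```

An identical plan always yields the same fingerprint, which is what lets a cache treat a repeated dry-run of the same plan as a hit.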

F-3 — Sequencer.PostSwitchoverHealth + HealthReport struct + 4
fixed-order checks (replicas-healthy, dns-probes, latency-normal,
audit-posted). Replicas check reads both halves of the cluster-pair
post-switchover (new-primary has replica.enabled=false; new-replica
has replica.enabled=true; both must be Ready=true). DNS check
fans out to multi-vantage resolvers (default 8.8.8.8 / 1.1.1.1 /
9.9.9.9) and asserts every (hostname × vantage) returns at least one
ToRegion IP. Latency check is permanently Deferred=true (Cilium
hubble metrics scrape is SRE follow-up). Audit check queries an
injected AuditTail (recorder in tests; NATS PullConsumer wiring is
follow-up — currently Deferred=true in production).
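
The (hostname × vantage) assertion in the DNS check can be illustrated with the sketch below. The resolver is injected as a function so the example runs without network access; `resolveFunc`, `dnsCheck`, the hostname, and the documentation-range IPs are assumptions made for illustration, not the controller's code.

```go
package main

import "fmt"

// resolveFunc looks up host through the resolver at vantage and returns
// the answer IPs. It is injected so the sketch needs no network.
type resolveFunc func(vantage, host string) ([]string, error)

// dnsCheck returns every "host@vantage" pair that did not return at
// least one IP belonging to the target region. An empty result means
// the check passed.
func dnsCheck(hosts, vantages []string, toRegion map[string]bool, resolve resolveFunc) []string {
	var failures []string
	for _, h := range hosts {
		for _, v := range vantages {
			ips, err := resolve(v, h)
			ok := false
			if err == nil {
				for _, ip := range ips {
					if toRegion[ip] {
						ok = true
						break
					}
				}
			}
			if !ok {
				failures = append(failures, h+"@"+v)
			}
		}
	}
	return failures
}

func main() {
	vantages := []string{"8.8.8.8", "1.1.1.1", "9.9.9.9"}
	toRegion := map[string]bool{"203.0.113.10": true} // example ToRegion IP
	// Fake resolver: one vantage still serves the old region's address.
	resolve := func(vantage, host string) ([]string, error) {
		if vantage == "9.9.9.9" {
			return []string{"198.51.100.7"}, nil
		}
		return []string{"203.0.113.10"}, nil
	}
	fmt.Println(dnsCheck([]string{"app.example.com"}, vantages, toRegion, resolve))
}
```

A lookup error counts as a failure for that pair in this sketch, so a single unreachable vantage fails the check.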

Controller chains PostSwitchoverHealth ~30s after every successful
switchover (HealthDelay; CONTINUUM_HEALTH_DELAY_SECONDS env). Result
written to Continuum CR status condition LastSwitchoverHealthy with
True/False/Unknown + one-line summary message.

Endpoints — small HTTP server in continuum-controller binary on
:8082 (CONTINUUM_API_ADDR env; empty disables):
- POST /v1/continuums/{ns}/{name}/dry-run  → DryRunReport
- GET  /v1/continuums/{ns}/{name}/health   → HealthReport
- GET  /healthz                            → ok

Auth — owner-tier gated per INVIOLABLE-PRINCIPLES #5:
X-Catalyst-Owner-Tier: true header (catalyst-api stamps it after JWT
validation) plus optional Authorization: Bearer <CONTINUUM_API_TOKEN>
for defence in depth. The /api/v1/sovereigns/{id}/... outer envelope
is the catalyst-api's responsibility (separate slice); the controller
exposes only the inner shape.

Chart — values.yaml + deployment.yaml + service.yaml extended with
continuum.api.{port,tokenSecretRef} and
continuum.health.postSwitchoverDelaySeconds. Service exposes new
api port (default 8082) so the catalyst-api proxy can reach it.

Tests — three-tier gate per implementer-canon §6:
- 53 unit tests across switchover (DryRun + Health + integration),
  events (3 new types + roundtrip), api (server + auth + cache),
  controller (4 new test groups for F-1 + F-3 chain).
- End-to-end integration test: DryRun → Execute → PostSwitchoverHealth
  sequence (TestEndToEnd_DryRunThenSwitchoverThenHealth +
  TestEndToEnd_DryRunBlockedSwitchoverNeverRuns).
- go test -count=1 -race ./... clean across all sibling controllers.
- go vet ./... clean.

K-Cont-2's sequencer surface was sufficient — this slice ADDED
DryRun + PostSwitchoverHealth methods without modifying the existing
Execute / RequestFailback / steps() implementations.

Out of scope (per slice F brief): WitnessClient interface changes,
CF Worker changes, U-DR-1 UI, 1M-row C-DB-3 acceptance test,
Cilium hubble latency metrics, NATS PullConsumer for audit-posted
health check (deferred).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:33:37 +04:00

# bp-continuum values — slice K-Cont-1 of EPIC-6 (#1101).
#
# Continuum is a customer-facing capability (per ADR-0001 §3.2.8 — NOT
# under platform/) that orchestrates per-Application DR for placement:
# active-hotstandby Applications. This chart deploys the
# continuum-controller alongside its RBAC + NetworkPolicy.
#
# Default-OFF gate: `continuum.enabled: false`. Operators flip this in
# a per-Sovereign overlay once they have:
#   - bp-cnpg-pair installed (C-DB-1 — primary + replica CNPG cluster)
#   - bp-powerdns + pool-domain-manager reachable (lua-record commits)
#   - lease witness configured (Cloudflare KV per K-Cont-3, or DNS
#     quorum fallback)
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) every value
# below is overridable per-Sovereign overlay. Per #4a (image SHA-pinned
# from CI) `image.tag` is empty by default; the chart fail-fasts at
# render time when `enabled: true` and `image.tag` is empty (see
# templates/_helpers.tpl `continuum.image`). CI stamps the SHA on every
# push to main via .github/workflows/build-continuum-controller.yaml.
# ─── Global registry rewrite (matches catalyst chart pattern) ──────────
# When set, ALL Continuum image pulls route through this registry.
# Per-Sovereign overlays set this to harbor.<sovereign-fqdn> so every
# image pull hits the Sovereign's own Harbor proxy_cache rather than
# ghcr.io directly. Empty = no rewrite.
global:
  imageRegistry: ""
# ─── Continuum controller deployment ───────────────────────────────────
continuum:
  # Default-OFF. Flip to true on a Sovereign overlay once dependencies
  # (bp-cnpg-pair, bp-powerdns, lease witness) are ready.
  enabled: false

  image:
    repository: "ghcr.io/openova-io/openova/continuum-controller"
    # tag is empty by default — CI stamps a SHA on every push to main.
    # Render-time fail-fast (see templates/_helpers.tpl) prevents a
    # half-configured deploy from accidentally running with `:latest`.
    tag: ""
    pullPolicy: IfNotPresent

  replicas: 1
  leaderElection:
    enabled: true

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # ─── Metrics / health endpoints ──────────────────────────────────────
  metrics:
    enabled: true
    # Port exposed on the controller Deployment + Service.
    port: 9090
  health:
    # /healthz + /readyz endpoint port (separate from metrics so the
    # ServiceMonitor scrape can hit /metrics without exposing health
    # publicly).
    port: 8081
    # Slice F-3: post-switchover health-check delay in seconds.
    # Default 30s per design doc §9.5; 0 = run immediately (tests).
    postSwitchoverDelaySeconds: 30

  # Slice F-2 + F-3: HTTP server for dry-run + health endpoints.
  api:
    # Port exposed on the controller Deployment + Service for the
    # F-2 (POST .../dry-run) + F-3 (GET .../health) endpoints. Empty
    # = disabled (no HTTP server bound).
    port: 8082
    # Optional bearer token gate. Leave nil to rely solely on the
    # X-Catalyst-Owner-Tier header (catalyst-api stamps after JWT
    # validation). When set, a SealedSecret reference like:
    #   tokenSecretRef:
    #     name: continuum-api-token
    #     key: token
    # populates the CONTINUUM_API_TOKEN env var.
    tokenSecretRef: null

  # ─── K-Cont-2 / K-Cont-3 config seams (sketch — finalized in
  # downstream slices) ─────────────────────────────────────────────────
  #
  # PDM endpoint that the controller calls /v1/commit on for lua-record
  # writes. Empty = use the in-cluster Service default. Per Inviolable
  # Principle #4 the value is fully runtime-configurable; per-Sovereign
  # overlays may repoint at a Sovereign-local PDM instance.
  pdmURL: ""
  # NATS endpoint for catalyst.audit publishing. Empty = use the
  # in-cluster nats.openova-system.svc.cluster.local default.
  natsURL: ""
  # Lease witness defaults (overridden per-CR via Continuum.spec.leaseClient).
  # These are CONTROLLER-WIDE defaults; per-Continuum-CR config wins.
  lease:
    ttlSeconds: 30
    renewSeconds: 10

  # Free-form extra env vars threaded into the Pod (advanced; for one-off
  # operator-side knobs not yet promoted to a top-level value).
  env: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
# ─── NetworkPolicy ────────────────────────────────────────────────────
# Default-deny except for explicit egress to Kubernetes API server,
# in-cluster CNPG status, NATS, PDM, and the lease witness URL. K-Cont-2
# extends with witness-specific egress when WITNESS_KIND=cloudflare-kv
# (egress to CF API at api.cloudflare.com:443).
networkPolicy:
  enabled: true