Slice F layers three concerns on top of K-Cont-2's reconciler +
sequencer:
F-1 — extend audit-emit coverage with three new audit-types:
- continuum-cr-created — fires once per CR observation
- continuum-config-changed — fires on switchover-relevant spec drift
- continuum-lease-collision — fires when Acquire returns
  ErrLeaseHeldByAnother during the opportunistic re-acquire path
This brings the reserved Continuum audit-types to 12 (was 9). K-Cont-2's
9 come first, followed by F-1's 3; the additions are appended at the end
so existing index-pinned tests keep working. U-DR-1 subscribes by
audit-type=continuum-*, so it receives the new types automatically.
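
A minimal Go sketch of the append-at-end ordering and the wildcard match described above. The helper names (appendAuditTypes, matchesContinuum) and the placeholder entries for K-Cont-2's 9 types are hypothetical, and treating audit-type=continuum-* as a plain prefix match is an assumption; only the three new audit-type strings come from this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// The three audit-types added by F-1, as named in the commit message.
const (
	AuditCRCreated      = "continuum-cr-created"
	AuditConfigChanged  = "continuum-config-changed"
	AuditLeaseCollision = "continuum-lease-collision"
)

// appendAuditTypes returns existing followed by added. Existing entries
// keep their indices, so tests pinned to the old positions stay valid.
func appendAuditTypes(existing []string, added ...string) []string {
	out := make([]string, 0, len(existing)+len(added))
	out = append(out, existing...)
	return append(out, added...)
}

// matchesContinuum reports whether a subscriber filtering on
// audit-type=continuum-* would receive the given type (assumed to be a
// simple prefix match).
func matchesContinuum(auditType string) bool {
	return strings.HasPrefix(auditType, "continuum-")
}

func main() {
	// Placeholder names stand in for K-Cont-2's 9 reserved types.
	existing := make([]string, 9)
	for i := range existing {
		existing[i] = fmt.Sprintf("continuum-existing-%d", i)
	}
	all := appendAuditTypes(existing, AuditCRCreated, AuditConfigChanged, AuditLeaseCollision)
	fmt.Println(len(all), all[9], matchesContinuum(all[11]))
	// prints: 12 continuum-cr-created true
}
```
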
F-2 — Sequencer.DryRun + DryRunReport struct + per-step
preconditions evaluator. DryRun walks the same 7 steps Execute would
run, but is read-only end-to-end (asserted by tests: zero audit emits,
zero state mutation). Per-step durations are exported constants. The
report carries a plan content fingerprint (16-hex SHA-256 prefix) for
cache idempotency. Findings are split into Blockers (FATAL) and Warnings
(advisory) so the UI can render the report and disable
[ Confirm Switchover ] when blockers are present.
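
The fingerprint and the blocker gate can be sketched as follows. This is an illustration under assumptions, not the shipped type: the DryRunReport field names, planFingerprint, CanConfirm, and the plan serialization are all hypothetical. Only the 16-hex SHA-256 prefix and the Blockers/Warnings split come from the description above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DryRunReport is a hypothetical shape; the real struct may differ.
type DryRunReport struct {
	Fingerprint string   // first 16 hex chars of SHA-256 over the plan content
	Blockers    []string // fatal findings: switchover must not proceed
	Warnings    []string // advisory findings: shown to the operator only
}

// planFingerprint returns the first 16 hex characters (8 bytes) of the
// SHA-256 digest of the plan content. Equal plans yield equal keys, which
// is what makes the value usable as a cache key.
func planFingerprint(plan []byte) string {
	sum := sha256.Sum256(plan)
	return hex.EncodeToString(sum[:])[:16]
}

// CanConfirm reports whether a UI may enable the confirm button.
func (r DryRunReport) CanConfirm() bool { return len(r.Blockers) == 0 }

func main() {
	plan := []byte("from=region-a;to=region-b;steps=7") // made-up plan content
	r := DryRunReport{
		Fingerprint: planFingerprint(plan),
		Warnings:    []string{"replica lag near threshold"},
	}
	fmt.Println(len(r.Fingerprint), r.CanConfirm()) // prints: 16 true

	r.Blockers = append(r.Blockers, "lease held by another controller")
	fmt.Println(r.CanConfirm()) // prints: false
}
```

Warnings alone never disable confirmation in this sketch; only a non-empty Blockers list does.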
F-3 — Sequencer.PostSwitchoverHealth + HealthReport struct + 4
fixed-order checks (replicas-healthy, dns-probes, latency-normal,
audit-posted). Replicas check reads both halves of the cluster-pair
post-switchover (new-primary has replica.enabled=false; new-replica
has replica.enabled=true; both must be Ready=true). DNS check
fans out to multi-vantage resolvers (default 8.8.8.8 / 1.1.1.1 /
9.9.9.9) and asserts every (hostname × vantage) returns at least one
ToRegion IP. The latency check is always Deferred=true in this slice
(the Cilium hubble metrics scrape is an SRE follow-up). The audit check
queries an injected AuditTail: a recorder in tests, and Deferred=true in
production until the NATS PullConsumer wiring lands as a follow-up.
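
The (hostname × vantage) assertion in the DNS check can be sketched like this. The lookupFunc type, dnsCheck, and the example hostname and addresses are hypothetical; the resolver is injected so the sketch runs without network access, and treating a lookup error as a failed pair is an assumption.

```go
package main

import "fmt"

// lookupFunc resolves hostname via the resolver at vantage. It is injected
// so the check can be exercised without real DNS traffic.
type lookupFunc func(hostname, vantage string) ([]string, error)

// dnsCheck returns one failure message per (hostname, vantage) pair whose
// answer contains no IP from toRegion. An empty result means every pair
// returned at least one ToRegion IP.
func dnsCheck(hostnames, vantages []string, toRegion map[string]bool, lookup lookupFunc) []string {
	var failures []string
	for _, h := range hostnames {
		for _, v := range vantages {
			ips, err := lookup(h, v)
			if err != nil {
				failures = append(failures, fmt.Sprintf("%s@%s: %v", h, v, err))
				continue
			}
			found := false
			for _, ip := range ips {
				if toRegion[ip] {
					found = true
					break
				}
			}
			if !found {
				failures = append(failures, fmt.Sprintf("%s@%s: no ToRegion IP in answer", h, v))
			}
		}
	}
	return failures
}

func main() {
	vantages := []string{"8.8.8.8", "1.1.1.1", "9.9.9.9"} // defaults named above
	toRegion := map[string]bool{"203.0.113.10": true}     // documentation-range IP
	// Fake resolver: one vantage still serves the old region's address.
	lookup := func(h, v string) ([]string, error) {
		if v == "9.9.9.9" {
			return []string{"198.51.100.7"}, nil
		}
		return []string{"203.0.113.10"}, nil
	}
	fmt.Println(dnsCheck([]string{"app.example.com"}, vantages, toRegion, lookup))
	// prints: [app.example.com@9.9.9.9: no ToRegion IP in answer]
}
```
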
The controller chains PostSwitchoverHealth ~30s after every successful
switchover (HealthDelay; CONTINUUM_HEALTH_DELAY_SECONDS env). The result
is written to the Continuum CR status condition LastSwitchoverHealthy as
True/False/Unknown plus a one-line summary message.
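
A sketch of the delay parsing and the condition mapping, under stated assumptions. The names healthDelay, checkResult, and conditionStatus are hypothetical. Falling back to the default on an invalid value is an assumption, as is the mapping rule (False if any check that ran failed, Unknown if every check was deferred, True otherwise); the commit only states the three possible values.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const defaultHealthDelay = 30 * time.Second

// healthDelay parses the raw value of CONTINUUM_HEALTH_DELAY_SECONDS.
// Empty or invalid input falls back to the default (assumption);
// "0" means run the health check immediately.
func healthDelay(raw string) time.Duration {
	if raw == "" {
		return defaultHealthDelay
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return defaultHealthDelay
	}
	return time.Duration(n) * time.Second
}

// checkResult is a hypothetical per-check outcome.
type checkResult struct {
	Name     string
	Passed   bool
	Deferred bool
}

// conditionStatus maps check results to a condition status string.
// The mapping rule here is an assumption, not taken from the controller.
func conditionStatus(results []checkResult) string {
	ran := 0
	for _, r := range results {
		if r.Deferred {
			continue
		}
		ran++
		if !r.Passed {
			return "False"
		}
	}
	if ran == 0 {
		return "Unknown"
	}
	return "True"
}

func main() {
	results := []checkResult{
		{Name: "replicas-healthy", Passed: true},
		{Name: "dns-probes", Passed: true},
		{Name: "latency-normal", Deferred: true},
		{Name: "audit-posted", Deferred: true},
	}
	fmt.Println(healthDelay(""), healthDelay("0"), conditionStatus(results))
	// prints: 30s 0s True
}
```
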
Endpoints — small HTTP server in continuum-controller binary on
:8082 (CONTINUUM_API_ADDR env; empty disables):
- POST /v1/continuums/{ns}/{name}/dry-run → DryRunReport
- GET /v1/continuums/{ns}/{name}/health → HealthReport
- GET /healthz → ok
Auth — owner-tier gated per INVIOLABLE-PRINCIPLES #5:
X-Catalyst-Owner-Tier: true header (catalyst-api stamps it after JWT
validation) plus optional Authorization: Bearer <CONTINUUM_API_TOKEN>
for defence in depth. The /api/v1/sovereigns/{id}/... outer envelope
is the catalyst-api's responsibility (separate slice); the controller
exposes only the inner shape.
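
The two-layer gate could look roughly like the middleware below. This is a sketch, not the controller's code: authorized and requireOwner are hypothetical names, the 403 response is an assumed choice, and the constant-time comparison is one reasonable way to check the bearer token. An empty token means only the header gate applies, matching the "optional" wording above.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// authorized applies both gates: the owner-tier header must be exactly
// "true", and, when a token is configured, the Authorization header must
// carry it as a bearer token.
func authorized(ownerTier, authorization, token string) bool {
	if ownerTier != "true" {
		return false
	}
	if token == "" {
		return true // bearer gate disabled
	}
	want := "Bearer " + token
	return subtle.ConstantTimeCompare([]byte(authorization), []byte(want)) == 1
}

// requireOwner wraps next with the gate above. Rejecting with 403 is an
// assumption; the real server may use a different status.
func requireOwner(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !authorized(r.Header.Get("X-Catalyst-Owner-Tier"), r.Header.Get("Authorization"), token) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := requireOwner("s3cret", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))

	req := httptest.NewRequest(http.MethodGet, "/v1/continuums/ns/name/health", nil)
	req.Header.Set("X-Catalyst-Owner-Tier", "true")
	req.Header.Set("Authorization", "Bearer s3cret")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Code, rec.Body.String()) // prints: 200 ok
}
```
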
Chart — values.yaml + deployment.yaml + service.yaml extended with
continuum.api.{port,tokenSecretRef} and
continuum.health.postSwitchoverDelaySeconds. Service exposes new
api port (default 8082) so the catalyst-api proxy can reach it.
Tests — three-tier gate per implementer-canon §6:
- 53 unit tests across switchover (DryRun + Health + integration),
events (3 new types + roundtrip), api (server + auth + cache),
controller (4 new test groups for F-1 + F-3 chain).
- End-to-end integration test: DryRun → Execute → PostSwitchoverHealth
sequence (TestEndToEnd_DryRunThenSwitchoverThenHealth +
TestEndToEnd_DryRunBlockedSwitchoverNeverRuns).
- go test -count=1 -race ./... clean across all sibling controllers.
- go vet ./... clean.
K-Cont-2's sequencer surface was sufficient — this slice ADDED
DryRun + PostSwitchoverHealth methods without modifying the existing
Execute / RequestFailback / steps() implementations.
Out of scope (per slice F brief): WitnessClient interface changes,
CF Worker changes, U-DR-1 UI, 1M-row C-DB-3 acceptance test,
Cilium hubble latency metrics, NATS PullConsumer for audit-posted
health check (deferred).
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
120 lines
4.9 KiB
YAML
# bp-continuum values — slice K-Cont-1 of EPIC-6 (#1101).
#
# Continuum is a customer-facing capability (per ADR-0001 §3.2.8 — NOT
# under platform/) that orchestrates per-Application DR for placement:
# active-hotstandby Applications. This chart deploys the
# continuum-controller alongside its RBAC + NetworkPolicy.
#
# Default-OFF gate: `continuum.enabled: false`. Operators flip this in
# a per-Sovereign overlay once they have:
# - bp-cnpg-pair installed (C-DB-1 — primary + replica CNPG cluster)
# - bp-powerdns + pool-domain-manager reachable (lua-record commits)
# - lease witness configured (Cloudflare KV per K-Cont-3, or DNS
#   quorum fallback)
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) every value
# below is overridable per-Sovereign overlay. Per #4a (image SHA-pinned
# from CI) `image.tag` is empty by default; the chart fail-fasts at
# render time when `enabled: true` and `image.tag` is empty (see
# templates/_helpers.tpl `continuum.image`). CI stamps the SHA on every
# push to main via .github/workflows/build-continuum-controller.yaml.

# ─── Global registry rewrite (matches catalyst chart pattern) ──────────
# When set, ALL Continuum image pulls route through this registry.
# Per-Sovereign overlays set this to harbor.<sovereign-fqdn> so every
# image pull hits the Sovereign's own Harbor proxy_cache rather than
# ghcr.io directly. Empty = no rewrite.
global:
  imageRegistry: ""

# ─── Continuum controller deployment ───────────────────────────────────
continuum:
  # Default-OFF. Flip to true on a Sovereign overlay once dependencies
  # (bp-cnpg-pair, bp-powerdns, lease witness) are ready.
  enabled: false

  image:
    repository: "ghcr.io/openova-io/openova/continuum-controller"
    # tag is empty by default — CI stamps a SHA on every push to main.
    # Render-time fail-fast (see templates/_helpers.tpl) prevents a
    # half-configured deploy from accidentally running with `:latest`.
    tag: ""
    pullPolicy: IfNotPresent

  replicas: 1
  leaderElection:
    enabled: true

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # ─── Metrics / health endpoints ──────────────────────────────────────
  metrics:
    enabled: true
    # Port exposed on the controller Deployment + Service.
    port: 9090

  health:
    # /healthz + /readyz endpoint port (separate from metrics so the
    # ServiceMonitor scrape can hit /metrics without exposing health
    # publicly).
    port: 8081
    # Slice F-3: post-switchover health-check delay in seconds.
    # Default 30s per design doc §9.5; 0 = run immediately (tests).
    postSwitchoverDelaySeconds: 30

  # Slice F-2 + F-3: HTTP server for dry-run + health endpoints.
  api:
    # Port exposed on the controller Deployment + Service for the
    # F-2 (POST .../dry-run) + F-3 (GET .../health) endpoints. Empty
    # = disabled (no HTTP server bound).
    port: 8082
    # Optional bearer token gate. Leave nil to rely solely on the
    # X-Catalyst-Owner-Tier header (catalyst-api stamps after JWT
    # validation). When set, a SealedSecret reference like:
    #   tokenSecretRef:
    #     name: continuum-api-token
    #     key: token
    # populates the CONTINUUM_API_TOKEN env var.
    tokenSecretRef: null

  # ─── K-Cont-2 / K-Cont-3 config seams (sketch — finalized in
  # downstream slices) ─────────────────────────────────────────────────
  #
  # PDM endpoint that the controller calls /v1/commit on for lua-record
  # writes. Empty = use the in-cluster Service default. Per Inviolable
  # Principle #4 the value is fully runtime-configurable; per-Sovereign
  # overlays may repoint at a Sovereign-local PDM instance.
  pdmURL: ""

  # NATS endpoint for catalyst.audit publishing. Empty = use the
  # in-cluster nats.openova-system.svc.cluster.local default.
  natsURL: ""

  # Lease witness defaults (overridden per-CR via Continuum.spec.leaseClient).
  # These are CONTROLLER-WIDE defaults; per-Continuum-CR config wins.
  lease:
    ttlSeconds: 30
    renewSeconds: 10

  # Free-form extra env vars threaded into the Pod (advanced; for one-off
  # operator-side knobs not yet promoted to a top-level value).
  env: {}

  nodeSelector: {}
  tolerations: []
  affinity: {}

# ─── NetworkPolicy ────────────────────────────────────────────────────
# Default-deny except for explicit egress to Kubernetes API server,
# in-cluster CNPG status, NATS, PDM, and the lease witness URL. K-Cont-2
# extends with witness-specific egress when WITNESS_KIND=cloudflare-kv
# (egress to CF API at api.cloudflare.com:443).
networkPolicy:
  enabled: true