openova/products/continuum/chart
e3mrah 96f8b260c9
feat(continuum): F — dry-run report + post-switchover health check + audit-emit coverage (slice F-1+F-2+F-3, #1101) (#1161)
Slice F layers three concerns on top of K-Cont-2's reconciler +
sequencer:

F-1 — extend audit-emit coverage with three new audit-types:
- continuum-cr-created     — fires once per CR observation
- continuum-config-changed — fires on switchover-relevant spec drift
- continuum-lease-collision — fires when Acquire returns
                              ErrLeaseHeldByAnother during the
                              opportunistic re-acquire path
Total reserved Continuum audit-types now 12 (was 9). Order is
K-Cont-2's 9 first, then F-1's 3 (additions at end so existing
index-pinned tests keep working). U-DR-1 subscribes by
audit-type=continuum-* so it receives the new types automatically.

F-2 — Sequencer.DryRun + DryRunReport struct + per-step
preconditions evaluator. Walks the same 7 steps Execute would run,
but read-only end-to-end (asserted by tests: zero audit emits, zero
state mutation). Per-step durations as exported constants. Plan
content fingerprint (16-hex SHA-256 prefix) for cache idempotency.
Blockers (FATAL) vs Warnings (advisory) so the UI can render the
report and disable [ Confirm Switchover ] when blockers present.
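
A minimal sketch of how the F-2 report and fingerprint could fit together. The struct fields, the PlanFingerprint and CanConfirm names, and the step names are assumptions for illustration; the commit message only states that the fingerprint is a 16-hex SHA-256 prefix and that blockers disable confirmation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// DryRunReport is a hypothetical shape; the real struct's fields are not
// shown in this commit message.
type DryRunReport struct {
	Steps       []string // ordered step names the sequencer would run
	Blockers    []string // fatal findings: switchover must not proceed
	Warnings    []string // advisory findings: shown, but do not block
	Fingerprint string   // 16-hex prefix of SHA-256 over the plan content
}

// PlanFingerprint hashes the ordered step names and keeps the first 16 hex
// characters (8 bytes) of the SHA-256 digest.
func PlanFingerprint(steps []string) string {
	sum := sha256.Sum256([]byte(strings.Join(steps, "\n")))
	return hex.EncodeToString(sum[:])[:16]
}

// CanConfirm reports whether a UI could enable the confirm button.
func (r DryRunReport) CanConfirm() bool { return len(r.Blockers) == 0 }

func main() {
	// Placeholder step names, not the sequencer's actual seven steps.
	steps := []string{"drain-httproute", "flip-lua-record", "flip-cnpg-primary", "audit"}
	r := DryRunReport{
		Steps:       steps,
		Warnings:    []string{"replication lag above advisory threshold"},
		Fingerprint: PlanFingerprint(steps),
	}
	fmt.Println(r.Fingerprint, r.CanConfirm())
}
```

Because the fingerprint depends only on plan content, two dry-runs of an unchanged plan map to the same cache key, while a reordered or edited plan produces a different one.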

F-3 — Sequencer.PostSwitchoverHealth + HealthReport struct + 4
fixed-order checks (replicas-healthy, dns-probes, latency-normal,
audit-posted). Replicas check reads both halves of the cluster-pair
post-switchover (new-primary has replica.enabled=false; new-replica
has replica.enabled=true; both must be Ready=true). DNS check
fans out to multi-vantage resolvers (default 8.8.8.8 / 1.1.1.1 /
9.9.9.9) and asserts every (hostname × vantage) returns at least one
ToRegion IP. Latency check is permanently Deferred=true (Cilium
hubble metrics scrape is SRE follow-up). Audit check queries an
injected AuditTail (recorder in tests; NATS PullConsumer wiring is
follow-up — currently Deferred=true in production).
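
The DNS assertion described above can be sketched as a pure function over an injected lookup. LookupFunc and CheckDNS are hypothetical names, and the IP addresses come from documentation ranges; the real check's types and resolver wiring are not shown in this commit message.

```go
package main

import "fmt"

// LookupFunc resolves hostname through a single vantage resolver.
// Hypothetical seam so tests can avoid real network lookups.
type LookupFunc func(hostname, vantage string) ([]string, error)

// CheckDNS returns one failure message for every (hostname, vantage) pair
// whose answer does not contain at least one ToRegion IP.
func CheckDNS(hostnames, vantages []string, toRegionIPs map[string]bool, lookup LookupFunc) []string {
	var failures []string
	for _, h := range hostnames {
		for _, v := range vantages {
			ips, err := lookup(h, v)
			if err != nil {
				failures = append(failures, fmt.Sprintf("%s via %s: %v", h, v, err))
				continue
			}
			found := false
			for _, ip := range ips {
				if toRegionIPs[ip] {
					found = true
					break
				}
			}
			if !found {
				failures = append(failures, fmt.Sprintf("%s via %s: no ToRegion IP in %v", h, v, ips))
			}
		}
	}
	return failures
}

func main() {
	vantages := []string{"8.8.8.8", "1.1.1.1", "9.9.9.9"}
	toRegion := map[string]bool{"203.0.113.10": true}
	// Fake lookup: one vantage still serves the old region's address.
	lookup := func(hostname, vantage string) ([]string, error) {
		if vantage == "9.9.9.9" {
			return []string{"198.51.100.7"}, nil
		}
		return []string{"203.0.113.10"}, nil
	}
	for _, f := range CheckDNS([]string{"app.example.com"}, vantages, toRegion, lookup) {
		fmt.Println(f)
	}
}
```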

Controller chains PostSwitchoverHealth ~30s after every successful
switchover (HealthDelay; CONTINUUM_HEALTH_DELAY_SECONDS env). Result
written to Continuum CR status condition LastSwitchoverHealthy with
True/False/Unknown + one-line summary message.
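
As an illustration of the HealthDelay override, the following sketch reads CONTINUUM_HEALTH_DELAY_SECONDS and falls back to 30 seconds. ParseHealthDelay is a hypothetical helper, and falling back to the default on an unparsable or negative value is an assumption, not confirmed behaviour.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultHealthDelay mirrors the ~30s default mentioned in the commit message.
const defaultHealthDelay = 30 * time.Second

// ParseHealthDelay converts the raw env value (whole seconds) to a duration.
// Assumption: empty, non-numeric, or negative input yields the default.
func ParseHealthDelay(raw string) time.Duration {
	secs, err := strconv.Atoi(raw)
	if err != nil || secs < 0 {
		return defaultHealthDelay
	}
	return time.Duration(secs) * time.Second
}

func main() {
	delay := ParseHealthDelay(os.Getenv("CONTINUUM_HEALTH_DELAY_SECONDS"))
	fmt.Println("post-switchover health check delay:", delay)
}
```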

Endpoints — small HTTP server in continuum-controller binary on
:8082 (CONTINUUM_API_ADDR env; empty disables):
- POST /v1/continuums/{ns}/{name}/dry-run  → DryRunReport
- GET  /v1/continuums/{ns}/{name}/health   → HealthReport
- GET  /healthz                            → ok

Auth — owner-tier gated per INVIOLABLE-PRINCIPLES #5:
X-Catalyst-Owner-Tier: true header (catalyst-api stamps it after JWT
validation) plus optional Authorization: Bearer <CONTINUUM_API_TOKEN>
for defence in depth. The /api/v1/sovereigns/{id}/... outer envelope
is the catalyst-api's responsibility (separate slice); the controller
exposes only the inner shape.
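
A sketch of the owner-tier gate as net/http middleware. Authorize and OwnerTierOnly are hypothetical names, and the choice of 403 for a missing owner-tier header versus 401 for a bad bearer token is an assumption; the commit message specifies only the header and the optional token.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"strings"
)

// Authorize returns the HTTP status a request should receive before it
// reaches the handler. An empty wantToken disables the bearer check.
func Authorize(ownerTier, authorization, wantToken string) int {
	if ownerTier != "true" {
		return http.StatusForbidden
	}
	if wantToken == "" {
		return http.StatusOK
	}
	const prefix = "Bearer "
	if !strings.HasPrefix(authorization, prefix) {
		return http.StatusUnauthorized
	}
	got := strings.TrimPrefix(authorization, prefix)
	// Constant-time comparison avoids leaking token contents via timing.
	if subtle.ConstantTimeCompare([]byte(got), []byte(wantToken)) != 1 {
		return http.StatusUnauthorized
	}
	return http.StatusOK
}

// OwnerTierOnly wraps next so that only authorized requests reach it.
func OwnerTierOnly(wantToken string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		code := Authorize(r.Header.Get("X-Catalyst-Owner-Tier"), r.Header.Get("Authorization"), wantToken)
		if code != http.StatusOK {
			http.Error(w, http.StatusText(code), code)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(Authorize("true", "Bearer s3cret", "s3cret")) // header and token match
	fmt.Println(Authorize("", "Bearer s3cret", "s3cret"))     // owner-tier header missing
	fmt.Println(Authorize("true", "Bearer wrong", "s3cret"))  // token mismatch
}
```

Keeping the decision in a plain function makes the gate testable without starting a server; the middleware only translates the result into a response.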

Chart — values.yaml + deployment.yaml + service.yaml extended with
continuum.api.{port,tokenSecretRef} and
continuum.health.postSwitchoverDelaySeconds. Service exposes new
api port (default 8082) so the catalyst-api proxy can reach it.

Tests — three-tier gate per implementer-canon §6:
- 53 unit tests across switchover (DryRun + Health + integration),
  events (3 new types + roundtrip), api (server + auth + cache),
  controller (4 new test groups for F-1 + F-3 chain).
- End-to-end integration test: DryRun → Execute → PostSwitchoverHealth
  sequence (TestEndToEnd_DryRunThenSwitchoverThenHealth +
  TestEndToEnd_DryRunBlockedSwitchoverNeverRuns).
- go test -count=1 -race ./... clean across all sibling controllers.
- go vet ./... clean.

K-Cont-2's sequencer surface was sufficient — this slice ADDED
DryRun + PostSwitchoverHealth methods without modifying the existing
Execute / RequestFailback / steps() implementations.

Out of scope (per slice F brief): WitnessClient interface changes,
CF Worker changes, U-DR-1 UI, 1M-row C-DB-3 acceptance test,
Cilium hubble latency metrics, NATS PullConsumer for audit-posted
health check (deferred).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:33:37 +04:00
crds feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151) 2026-05-09 04:45:00 +04:00
templates feat(continuum): F — dry-run report + post-switchover health check + audit-emit coverage (slice F-1+F-2+F-3, #1101) (#1161) 2026-05-09 08:33:37 +04:00
blueprint.yaml feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151) 2026-05-09 04:45:00 +04:00
Chart.yaml feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151) 2026-05-09 04:45:00 +04:00
README.md feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151) 2026-05-09 04:45:00 +04:00
values.yaml feat(continuum): F — dry-run report + post-switchover health check + audit-emit coverage (slice F-1+F-2+F-3, #1101) (#1161) 2026-05-09 08:33:37 +04:00

bp-continuum

OpenOva Continuum — DR orchestration for active-hotstandby Applications. Slice K-Cont-1 of EPIC-6 (#1101).

What this ships

K-Cont-1 ships the skeleton: chart + Containerfile + binary scaffold + GHA workflow. The Reconcile() body is a no-op. K-Cont-2 fills in the reconcile loop:

  • per-Continuum-CR goroutine maintaining a lease (10s renew, 30s TTL)
  • watches CNPG replication metrics
  • switchover sequencer (drain HTTPRoute → flip lua-record → flip CNPG primary → audit on NATS)
  • failback handler (manual approval gate)

K-Cont-3 wires the lease witness (Cloudflare KV per SRE.md §2.4, with 3-DNS-quorum fallback). K-Cont-4 ships the Cloudflare Worker source.

Install

# Default-OFF gate keeps the controller stopped until the operator
# installs bp-cnpg-pair + bp-powerdns and configures the witness.
helm install bp-continuum ./products/continuum/chart \
  --namespace openova-system \
  --create-namespace

To enable on a per-Sovereign overlay:

# clusters/<sovereign>/values-continuum.yaml
continuum:
  enabled: true
  image:
    tag: "<sha-stamped-by-CI>"
  pdmURL: "http://pool-domain-manager.openova-system.svc.cluster.local:8080"
  natsURL: "nats://nats.openova-system.svc.cluster.local:4222"

CRD

The Continuum.dr.openova.io/v1 CRD lives in the Catalyst chart at products/catalyst/chart/crds/continuum.yaml (slice B8 of EPIC-0, #1110). It is not duplicated here — see crds/README.md.

Reconcile contract (K-Cont-2)

The K-Cont-2 reconciler will:

  1. Fetch the Continuum CR. Validate that the referenced Application has placement: active-hotstandby and that primaryRegion is listed in Application.spec.regions[].
  2. Acquire / renew the lease via the witness (kind from spec.leaseClient.kind).
  3. Watch CNPG cnpg.io/cluster.replicationLag per replica.
  4. On switchover (operator-initiated or auto-failover when health check + witness quorum agree primary is unreachable), execute the sequence in docs/EPICS-1-6-unified-design.md §9.3.
  5. Patch status (phase, primaryRegion, leaseHolder, leaseExpiresAt, replicationLag map, conditions, lastSwitchover).

Reading order for K-Cont-2 implementer

  1. docs/EPICS-1-6-unified-design.md §9 (full Continuum DR spec)
  2. docs/SRE.md §2 (DR runbook + lease witness pattern)
  3. docs/MULTI-REGION-DNS.md (lua-record DNS pattern)
  4. products/catalyst/chart/crds/continuum.yaml (CRD shape)
  5. core/controllers/continuum/DESIGN.md (this slice's design notes)
  6. core/controllers/internal/placement/ (Plan + RegionPlan — reuse for failover region resolution)

Lease witness API contract (K-Cont-3 sketch)

The K-Cont-2 reconciler talks to a WitnessClient interface; K-Cont-3 will land the cloudflare-kv + dns-quorum implementations. Sketched shape (subject to finalization in K-Cont-3):

type WitnessClient interface {
    // Acquire attempts to claim the lease for `holder` with the
    // configured TTL. Returns the granted lease (with lease-id +
    // expiry) or an error. ErrLeaseHeldByAnother is returned when
    // another holder already owns the lease.
    Acquire(ctx context.Context, key string, holder string, ttl time.Duration) (Lease, error)

    // Renew extends the lease for the same holder. Returns
    // ErrLeaseLost if the lease was reassigned.
    Renew(ctx context.Context, lease Lease) (Lease, error)

    // Release relinquishes the lease. Idempotent — releasing a lost
    // lease is not an error.
    Release(ctx context.Context, lease Lease) error

    // Read returns the current lease holder + expiry without
    // attempting to acquire. Used for diagnostic / dry-run.
    Read(ctx context.Context, key string) (LeaseInfo, error)
}

K-Cont-2 wires against this interface with a stub implementation; K-Cont-3 swaps in the real cloudflare-kv + dns-quorum clients.
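
For illustration, a stub satisfying the sketched interface could look like the in-memory version below. The Lease and LeaseInfo fields, the MemWitness type, the injected clock, and treating an expired lease as lost on Renew are all assumptions; the real types are subject to finalization in K-Cont-3.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Error names follow the interface comments; the messages are assumptions.
var (
	ErrLeaseHeldByAnother = errors.New("lease held by another holder")
	ErrLeaseLost          = errors.New("lease lost")
)

// Lease and LeaseInfo are hypothetical shapes for this sketch.
type Lease struct {
	Key, Holder string
	TTL         time.Duration
	Expiry      time.Time
}
type LeaseInfo struct {
	Holder string
	Expiry time.Time
}

// MemWitness is an in-memory stub with an injectable clock for tests.
type MemWitness struct {
	mu     sync.Mutex
	now    func() time.Time
	leases map[string]Lease
}

func (m *MemWitness) Acquire(_ context.Context, key, holder string, ttl time.Duration) (Lease, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur, ok := m.leases[key]
	if ok && cur.Holder != holder && m.now().Before(cur.Expiry) {
		return Lease{}, ErrLeaseHeldByAnother
	}
	l := Lease{Key: key, Holder: holder, TTL: ttl, Expiry: m.now().Add(ttl)}
	m.leases[key] = l
	return l, nil
}

// Renew treats an expired or reassigned lease as lost (an assumption).
func (m *MemWitness) Renew(_ context.Context, l Lease) (Lease, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur, ok := m.leases[l.Key]
	if !ok || cur.Holder != l.Holder || !m.now().Before(cur.Expiry) {
		return Lease{}, ErrLeaseLost
	}
	cur.Expiry = m.now().Add(cur.TTL)
	m.leases[l.Key] = cur
	return cur, nil
}

func (m *MemWitness) Release(_ context.Context, l Lease) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cur, ok := m.leases[l.Key]; ok && cur.Holder == l.Holder {
		delete(m.leases, l.Key)
	}
	return nil
}

func (m *MemWitness) Read(_ context.Context, key string) (LeaseInfo, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur := m.leases[key]
	return LeaseInfo{Holder: cur.Holder, Expiry: cur.Expiry}, nil
}

// scenario drives a fake clock through acquire, collision, renew, and expiry.
func scenario() []string {
	clock := time.Unix(0, 0)
	w := &MemWitness{now: func() time.Time { return clock }, leases: map[string]Lease{}}
	ctx, ttl := context.Background(), 30*time.Second
	var out []string
	note := func(err error) { out = append(out, fmt.Sprint(err)) }
	la, err := w.Acquire(ctx, "app", "region-a", ttl)
	note(err)
	_, err = w.Acquire(ctx, "app", "region-b", ttl)
	note(err)
	clock = clock.Add(10 * time.Second) // renew interval from K-Cont-2
	la, err = w.Renew(ctx, la)
	note(err)
	clock = clock.Add(31 * time.Second) // let the 30s TTL lapse
	_, err = w.Renew(ctx, la)
	note(err)
	_, err = w.Acquire(ctx, "app", "region-b", ttl)
	note(err)
	return out
}

func main() { fmt.Println(scenario()) }
```

Injecting the clock lets a test exercise the 10s renew and 30s TTL timings without sleeping, which is the main reason to prefer it over calling time.Now directly in a stub.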

— K-Cont-1 (#1101)