Slice F layers three concerns on top of K-Cont-2's reconciler +
sequencer:
F-1 — extend audit-emit coverage with three new audit-types:
- continuum-cr-created — fires once per CR observation
- continuum-config-changed — fires on switchover-relevant spec drift
- continuum-lease-collision — fires when Acquire returns
  ErrLeaseHeldByAnother during the opportunistic re-acquire path
Total reserved Continuum audit-types now 12 (was 9). Order is
K-Cont-2's 9 first, then F-1's 3 (additions at end so existing
index-pinned tests keep working). U-DR-1 subscribes by
audit-type=continuum-* so it receives the new types automatically.
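
The subscription rule above depends only on the shared `continuum-` prefix. A minimal sketch of that matching, assuming a plain prefix comparison; the function name `matchesContinuum` is hypothetical and the real subscriber may filter differently:

```go
package main

import (
	"fmt"
	"strings"
)

// The three audit-types added by F-1, as named in this commit message.
var f1AuditTypes = []string{
	"continuum-cr-created",
	"continuum-config-changed",
	"continuum-lease-collision",
}

// matchesContinuum is a hypothetical stand-in for the
// audit-type=continuum-* filter: any type carrying the prefix matches.
func matchesContinuum(auditType string) bool {
	return strings.HasPrefix(auditType, "continuum-")
}

func main() {
	for _, t := range f1AuditTypes {
		fmt.Println(t, matchesContinuum(t))
	}
}
```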
F-2 — Sequencer.DryRun + DryRunReport struct + per-step
preconditions evaluator. Walks the same 7 steps Execute would run,
but read-only end-to-end (asserted by tests: zero audit emits, zero
state mutation). Per-step durations as exported constants. Plan
content fingerprint (16-hex SHA-256 prefix) for cache idempotency.
Blockers (FATAL) vs Warnings (advisory) so the UI can render the
report and disable [ Confirm Switchover ] when blockers present.
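
The fingerprint and the blocker/warning split can be sketched as follows. This is an illustration under assumptions: the struct is trimmed down, and the field names, `planFingerprint`, and `CanConfirm` are hypothetical rather than the slice's actual API. Only the 16-hex SHA-256 prefix and the "blockers disable confirm" rule come from the text above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DryRunReport is a hypothetical, trimmed-down shape; the real struct
// presumably also carries per-step results and durations.
type DryRunReport struct {
	Fingerprint string   // 16 hex chars: first 8 bytes of SHA-256
	Blockers    []string // fatal: the switchover must not proceed
	Warnings    []string // advisory: shown to the operator only
}

// planFingerprint hashes the plan content and keeps the first 8 bytes,
// which encode to a 16-character hex string.
func planFingerprint(planContent []byte) string {
	sum := sha256.Sum256(planContent)
	return hex.EncodeToString(sum[:8])
}

// CanConfirm reports whether a UI may enable the confirm button.
func (r DryRunReport) CanConfirm() bool {
	return len(r.Blockers) == 0
}

func main() {
	r := DryRunReport{
		Fingerprint: planFingerprint([]byte("from=region-a to=region-b")),
		Warnings:    []string{"replication lag above advisory threshold"},
	}
	fmt.Println(len(r.Fingerprint), r.CanConfirm())
}
```

Because the fingerprint is derived from the plan content alone, two dry runs over an unchanged plan produce the same cache key.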
F-3 — Sequencer.PostSwitchoverHealth + HealthReport struct + 4
fixed-order checks (replicas-healthy, dns-probes, latency-normal,
audit-posted). Replicas check reads both halves of the cluster-pair
post-switchover (new-primary has replica.enabled=false; new-replica
has replica.enabled=true; both must be Ready=true). DNS check
fans out to multi-vantage resolvers (default 8.8.8.8 / 1.1.1.1 /
9.9.9.9) and asserts every (hostname × vantage) returns at least one
ToRegion IP. Latency check is permanently Deferred=true (Cilium
hubble metrics scrape is SRE follow-up). Audit check queries an
injected AuditTail (recorder in tests; NATS PullConsumer wiring is
follow-up — currently Deferred=true in production).
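
The DNS check's "every (hostname × vantage) pair" rule can be sketched like this. The `resolveFunc` seam, the function name `dnsCheck`, and the sample hostname and IPs are assumptions for illustration; the real check performs network lookups against the vantage resolvers.

```go
package main

import "fmt"

// resolveFunc is a hypothetical seam: it resolves hostname through the
// given vantage resolver and returns the answer IPs.
type resolveFunc func(vantage, hostname string) ([]string, error)

// dnsCheck returns one failure message per (hostname, vantage) pair whose
// answer contains no IP from toRegionIPs. An empty result means the check passes.
func dnsCheck(resolve resolveFunc, hostnames, vantages []string, toRegionIPs map[string]bool) []string {
	var failures []string
	for _, h := range hostnames {
		for _, v := range vantages {
			ips, err := resolve(v, h)
			if err != nil {
				failures = append(failures, fmt.Sprintf("%s via %s: %v", h, v, err))
				continue
			}
			found := false
			for _, ip := range ips {
				if toRegionIPs[ip] {
					found = true
					break
				}
			}
			if !found {
				failures = append(failures, fmt.Sprintf("%s via %s: no ToRegion IP in answer", h, v))
			}
		}
	}
	return failures
}

func main() {
	vantages := []string{"8.8.8.8", "1.1.1.1", "9.9.9.9"} // defaults named above
	toRegion := map[string]bool{"203.0.113.10": true}     // documentation-range IP
	// Fake resolver: one vantage still serves the old region.
	fake := func(vantage, hostname string) ([]string, error) {
		if vantage == "9.9.9.9" {
			return []string{"198.51.100.7"}, nil
		}
		return []string{"203.0.113.10"}, nil
	}
	for _, f := range dnsCheck(fake, []string{"app.example.com"}, vantages, toRegion) {
		fmt.Println(f)
	}
}
```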
Controller chains PostSwitchoverHealth ~30s after every successful
switchover (HealthDelay; CONTINUUM_HEALTH_DELAY_SECONDS env). Result
written to Continuum CR status condition LastSwitchoverHealthy with
True/False/Unknown + one-line summary message.
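
A sketch of the delay parsing and the report-to-condition mapping described above. The fallback on invalid input, the `checkResult` type, and the exact True/False/Unknown rule are assumptions, not confirmed behaviour of the controller.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

const defaultHealthDelay = 30 * time.Second

// healthDelay parses the raw value of CONTINUUM_HEALTH_DELAY_SECONDS.
// Falling back to the default on empty or invalid input is an assumption.
func healthDelay(raw string) time.Duration {
	secs, err := strconv.Atoi(raw)
	if err != nil || secs < 0 {
		return defaultHealthDelay
	}
	return time.Duration(secs) * time.Second
}

// checkResult is a hypothetical per-check outcome.
type checkResult struct {
	Name     string
	Passed   bool
	Deferred bool // not evaluated, e.g. latency-normal
}

// conditionStatus maps check results to a LastSwitchoverHealthy status.
// Assumed rule: any evaluated failure gives False, nothing evaluated
// gives Unknown, otherwise True.
func conditionStatus(results []checkResult) string {
	evaluated := 0
	for _, r := range results {
		if r.Deferred {
			continue
		}
		evaluated++
		if !r.Passed {
			return "False"
		}
	}
	if evaluated == 0 {
		return "Unknown"
	}
	return "True"
}

func main() {
	delay := healthDelay(os.Getenv("CONTINUUM_HEALTH_DELAY_SECONDS"))
	results := []checkResult{
		{Name: "replicas-healthy", Passed: true},
		{Name: "dns-probes", Passed: true},
		{Name: "latency-normal", Deferred: true},
		{Name: "audit-posted", Deferred: true},
	}
	fmt.Println(delay, conditionStatus(results))
}
```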
Endpoints — small HTTP server in continuum-controller binary on
:8082 (CONTINUUM_API_ADDR env; empty disables):
- POST /v1/continuums/{ns}/{name}/dry-run → DryRunReport
- GET /v1/continuums/{ns}/{name}/health → HealthReport
- GET /healthz → ok
Auth — owner-tier gated per INVIOLABLE-PRINCIPLES #5:
X-Catalyst-Owner-Tier: true header (catalyst-api stamps it after JWT
validation) plus optional Authorization: Bearer <CONTINUUM_API_TOKEN>
for defence in depth. The /api/v1/sovereigns/{id}/... outer envelope
is the catalyst-api's responsibility (separate slice); the controller
exposes only the inner shape.
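
The two-layer gate described above could look roughly like this. It is a sketch under assumptions: the names `allowed` and `ownerTierGate`, the 403 response, and the constant-time comparison are illustrative choices, not the controller's confirmed implementation.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"strings"
)

// allowed applies both checks to raw header values. An empty wantToken
// means no bearer token is configured, so only the owner-tier header
// is required.
func allowed(ownerTier, authorization, wantToken string) bool {
	if ownerTier != "true" {
		return false
	}
	if wantToken == "" {
		return true
	}
	if !strings.HasPrefix(authorization, "Bearer ") {
		return false
	}
	got := strings.TrimPrefix(authorization, "Bearer ")
	return subtle.ConstantTimeCompare([]byte(got), []byte(wantToken)) == 1
}

// ownerTierGate wraps next and rejects requests that fail allowed.
func ownerTierGate(wantToken string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !allowed(r.Header.Get("X-Catalyst-Owner-Tier"), r.Header.Get("Authorization"), wantToken) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	report := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "{}") // placeholder body for illustration
	})
	gated := ownerTierGate("s3cret", report)
	_ = gated // a real binary would mount this under /v1/ and serve it
	fmt.Println(allowed("true", "Bearer s3cret", "s3cret"), allowed("", "Bearer s3cret", "s3cret"))
}
```

Keeping the decision in a pure function over header values makes the gate easy to unit-test without starting a server.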
Chart — values.yaml + deployment.yaml + service.yaml extended with
continuum.api.{port,tokenSecretRef} and
continuum.health.postSwitchoverDelaySeconds. Service exposes new
api port (default 8082) so the catalyst-api proxy can reach it.
Tests — three-tier gate per implementer-canon §6:
- 53 unit tests across switchover (DryRun + Health + integration),
events (3 new types + roundtrip), api (server + auth + cache),
controller (4 new test groups for F-1 + F-3 chain).
- End-to-end integration test: DryRun → Execute → PostSwitchoverHealth
sequence (TestEndToEnd_DryRunThenSwitchoverThenHealth +
TestEndToEnd_DryRunBlockedSwitchoverNeverRuns).
- go test -count=1 -race ./... clean across all sibling controllers.
- go vet ./... clean.
K-Cont-2's sequencer surface was sufficient — this slice ADDED
DryRun + PostSwitchoverHealth methods without modifying the existing
Execute / RequestFailback / steps() implementations.
Out of scope (per slice F brief): WitnessClient interface changes,
CF Worker changes, U-DR-1 UI, 1M-row C-DB-3 acceptance test,
Cilium hubble latency metrics, NATS PullConsumer for audit-posted
health check (deferred).
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bp-continuum
OpenOva Continuum — DR orchestration for active-hotstandby Applications. Slice K-Cont-1 of EPIC-6 (#1101).
What this ships
K-Cont-1 ships the skeleton: chart + Containerfile + binary scaffold + GHA workflow. The Reconcile() body is a no-op. K-Cont-2 fills in the reconcile loop:
- per-Continuum-CR goroutine maintaining a lease (10s renew, 30s TTL)
- watches CNPG replication metrics
- switchover sequencer (drain HTTPRoute → flip lua-record → flip CNPG primary → audit on NATS)
- failback handler (manual approval gate)
K-Cont-3 wires the lease witness (Cloudflare KV per SRE.md §2.4, with 3-DNS-quorum fallback). K-Cont-4 ships the Cloudflare Worker source.
Install
# Default-OFF gate keeps the controller stopped until the operator
# installs bp-cnpg-pair + bp-powerdns and configures the witness.
helm install bp-continuum ./products/continuum/chart \
  --namespace openova-system \
  --create-namespace
To enable on a per-Sovereign overlay:
# clusters/<sovereign>/values-continuum.yaml
continuum:
  enabled: true
  image:
    tag: "<sha-stamped-by-CI>"
  pdmURL: "http://pool-domain-manager.openova-system.svc.cluster.local:8080"
  natsURL: "nats://nats.openova-system.svc.cluster.local:4222"
CRD
The Continuum.dr.openova.io/v1 CRD lives in the Catalyst chart at
products/catalyst/chart/crds/continuum.yaml (slice B8 of EPIC-0,
#1110). It is not duplicated here — see crds/README.md.
Reconcile contract (K-Cont-2)
The K-Cont-2 reconciler will:
- Fetch the Continuum CR. Validate the referenced Application has
  placement: active-hotstandby. Validate
  primaryRegion ∈ Application.spec.regions[].
- Acquire / renew the lease via the witness (kind from
  spec.leaseClient.kind).
- Watch CNPG cnpg.io/cluster.replicationLag per replica.
- On switchover (operator-initiated, or auto-failover when health
  check + witness quorum agree primary is unreachable), execute the
  sequence in docs/EPICS-1-6-unified-design.md §9.3.
- Patch status (phase, primaryRegion, leaseHolder, leaseExpiresAt,
  replicationLag map, conditions, lastSwitchover).
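
The two validation rules in the first step can be sketched as below. The `application` type, the error values, and the region names are hypothetical; the real reconciler reads these fields from the Application and Continuum CRs.

```go
package main

import (
	"errors"
	"fmt"
)

// application is a hypothetical, trimmed view of the Application spec.
type application struct {
	Placement string
	Regions   []string
}

var (
	errPlacement = errors.New("application placement is not active-hotstandby")
	errRegion    = errors.New("primaryRegion is not in Application.spec.regions")
)

// validate checks the placement mode first, then region membership.
func validate(app application, primaryRegion string) error {
	if app.Placement != "active-hotstandby" {
		return errPlacement
	}
	for _, r := range app.Regions {
		if r == primaryRegion {
			return nil
		}
	}
	return errRegion
}

func main() {
	app := application{Placement: "active-hotstandby", Regions: []string{"region-a", "region-b"}}
	fmt.Println(validate(app, "region-a"))
	fmt.Println(validate(app, "region-c"))
}
```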
Reading order for K-Cont-2 implementer
- docs/EPICS-1-6-unified-design.md §9 (full Continuum DR spec)
- docs/SRE.md §2 (DR runbook + lease witness pattern)
- docs/MULTI-REGION-DNS.md (lua-record DNS pattern)
- products/catalyst/chart/crds/continuum.yaml (CRD shape)
- core/controllers/continuum/DESIGN.md (this slice's design notes)
- core/controllers/internal/placement/ (Plan + RegionPlan; reuse for failover region resolution)
Lease witness API contract (K-Cont-3 sketch)
The K-Cont-2 reconciler talks to a WitnessClient interface; K-Cont-3
will land the cloudflare-kv + dns-quorum implementations. Sketched
shape (subject to finalization in K-Cont-3):
type WitnessClient interface {
	// Acquire attempts to claim the lease for `holder` with the
	// configured TTL. Returns the granted lease (with lease-id +
	// expiry) or an error. ErrLeaseHeldByAnother is returned when
	// another holder already owns the lease.
	Acquire(ctx context.Context, key string, holder string, ttl time.Duration) (Lease, error)

	// Renew extends the lease for the same holder. Returns
	// ErrLeaseLost if the lease was reassigned.
	Renew(ctx context.Context, lease Lease) (Lease, error)

	// Release relinquishes the lease. Idempotent: releasing a lost
	// lease is not an error.
	Release(ctx context.Context, lease Lease) error

	// Read returns the current lease holder + expiry without
	// attempting to acquire. Used for diagnostic / dry-run.
	Read(ctx context.Context, key string) (LeaseInfo, error)
}
K-Cont-2 wires against this interface with a stub implementation; K-Cont-3 swaps in the real cloudflare-kv + dns-quorum clients.
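
A stub of the kind mentioned above might look like the following in-memory sketch. The `Lease` and `LeaseInfo` field layouts, the `memWitness` type, and the injectable clock are assumptions, since this README does not define them; only the four method signatures and the two error names come from the contract.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrLeaseHeldByAnother = errors.New("lease held by another holder")
	ErrLeaseLost          = errors.New("lease lost")
)

// Lease and LeaseInfo are hypothetical shapes for this sketch.
type Lease struct {
	Key, Holder string
	TTL         time.Duration
	Expiry      time.Time
}

type LeaseInfo struct {
	Holder string
	Expiry time.Time
}

// memWitness is an in-memory stand-in with the four sketched methods.
type memWitness struct {
	mu     sync.Mutex
	leases map[string]Lease
	now    func() time.Time // injectable clock so tests can advance time
}

func (m *memWitness) Acquire(_ context.Context, key, holder string, ttl time.Duration) (Lease, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cur, ok := m.leases[key]; ok && cur.Holder != holder && m.now().Before(cur.Expiry) {
		return Lease{}, ErrLeaseHeldByAnother
	}
	l := Lease{Key: key, Holder: holder, TTL: ttl, Expiry: m.now().Add(ttl)}
	m.leases[key] = l
	return l, nil
}

func (m *memWitness) Renew(_ context.Context, lease Lease) (Lease, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur, ok := m.leases[lease.Key]
	if !ok || cur.Holder != lease.Holder {
		return Lease{}, ErrLeaseLost
	}
	cur.Expiry = m.now().Add(cur.TTL)
	m.leases[lease.Key] = cur
	return cur, nil
}

func (m *memWitness) Release(_ context.Context, lease Lease) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cur, ok := m.leases[lease.Key]; ok && cur.Holder == lease.Holder {
		delete(m.leases, lease.Key)
	}
	return nil // idempotent: releasing a lost lease is not an error
}

func (m *memWitness) Read(_ context.Context, key string) (LeaseInfo, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	cur := m.leases[key]
	return LeaseInfo{Holder: cur.Holder, Expiry: cur.Expiry}, nil
}

// demo walks one collision and one takeover using a fake clock.
func demo() (collided, tookOver, lost bool) {
	ctx := context.Background()
	clock := time.Unix(0, 0)
	w := &memWitness{leases: map[string]Lease{}, now: func() time.Time { return clock }}
	a, _ := w.Acquire(ctx, "continuum/demo", "region-a", 30*time.Second)
	_, err := w.Acquire(ctx, "continuum/demo", "region-b", 30*time.Second)
	collided = errors.Is(err, ErrLeaseHeldByAnother)
	clock = clock.Add(31 * time.Second) // region-a missed its renewals
	_, err = w.Acquire(ctx, "continuum/demo", "region-b", 30*time.Second)
	tookOver = err == nil
	_, err = w.Renew(ctx, a)
	lost = errors.Is(err, ErrLeaseLost)
	return collided, tookOver, lost
}

func main() {
	fmt.Println(demo())
}
```

With the 10s renew and 30s TTL mentioned earlier, a holder can miss two consecutive renewals before another region is able to take the lease, which is the situation `demo` simulates by advancing the clock past the TTL.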
— K-Cont-1 (#1101)