Slice K-Cont-1 of EPIC-6 (#1101) ships the Continuum product skeleton:

- core/controllers/continuum/{cmd,internal/{controller,events}}
  - cmd/main.go: controller-runtime Manager bootstrap; leader election; /healthz, /readyz, /metrics endpoints; env-only config per INVIOLABLE-PRINCIPLES #4
  - internal/controller: ContinuumReconciler with no-op Reconcile() (K-Cont-2 fills the body); SetupWithManager() watches Continuum CRs via unstructured.Unstructured per ADR-0001 §2.7 (no controller-gen)
  - internal/events: placeholder package documenting K-Cont-2's NATS audit-event-type list
  - Containerfile: multi-stage Go build → alpine:3.20 runtime, UID 65534
- products/continuum/chart/: full Helm chart shape (default-OFF):
  - Chart.yaml + values.yaml (continuum.enabled: false; image.tag empty; fail-fast on empty tag at render time)
  - templates/{_helpers.tpl, deployment, service, serviceaccount, rbac, networkpolicy}.yaml
  - blueprint.yaml: OpenOva Blueprint manifest with configSchema + placementSchema (single-region: management cluster) + depends: bp-cnpg-pair + bp-powerdns
  - crds/README.md: pointer to the canonical Continuum CRD shipped in products/catalyst/chart/crds/continuum.yaml (B8 #1110); not duplicated
- products/continuum/DESIGN.md: chart-vs-binary split decision (Option A: binary in shared core/controllers/ module per CC1 #1135), K-Cont-2 fill list, K-Cont-3 lease witness API contract sketch
- .github/workflows/build-continuum-controller.yaml: event-driven CI (NO cron) with go vet + go test -race + helm template ON/OFF resource count gates + fail-fast verification + GHCR build & push (cosign keyless signed) + repository_dispatch for chart-bump fan-out

helm template verification:

- continuum.enabled=false → 0 resources (default OFF)
- continuum.enabled=true + image.tag=ci-test → 6 resources (ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, Service, NetworkPolicy)
- continuum.enabled=true + empty image.tag → render fails per #4a

go vet ./continuum/... → clean.
go test -count=1 -race → all green.

Out of scope (per the K-Cont-1 brief):

- Reconcile body: K-Cont-2
- Lease witness implementations: K-Cont-3
- Cloudflare Worker source: K-Cont-4
- bp-cnpg-pair Blueprint: C-DB-1

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bp-continuum
OpenOva Continuum — DR orchestration for active-hotstandby Applications. Slice K-Cont-1 of EPIC-6 (#1101).
What this ships
K-Cont-1 ships the skeleton: chart + Containerfile + binary scaffold + GHA workflow. The Reconcile() body is a no-op. K-Cont-2 fills in the reconcile loop:
- per-Continuum-CR goroutine maintaining a lease (10s renew, 30s TTL)
- watches CNPG replication metrics
- switchover sequencer (drain HTTPRoute → flip lua-record → flip CNPG primary → audit on NATS)
- failback handler (manual approval gate)
K-Cont-3 wires the lease witness (Cloudflare KV per SRE.md §2.4, with 3-DNS-quorum fallback). K-Cont-4 ships the Cloudflare Worker source.
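The lease bullet above (10s renew, 30s TTL) could be sketched as a renewal loop like the one below. This is an illustration under assumptions, not the K-Cont-2 implementation: HoldLease, RenewFunc, ErrLeaseExpired and simulate are hypothetical names, and the demo scales the intervals down to milliseconds so that it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrLeaseExpired is a hypothetical sentinel: no renewal succeeded within one TTL.
var ErrLeaseExpired = errors.New("lease expired without a successful renewal")

// RenewFunc is a hypothetical stand-in for a call to the lease witness.
type RenewFunc func(ctx context.Context) error

// HoldLease assumes the lease was just acquired. It calls renew every
// `every` and returns ErrLeaseExpired once a renewal fails and at least
// `ttl` has passed since the last successful one.
func HoldLease(ctx context.Context, renew RenewFunc, every, ttl time.Duration) error {
	lastOK := time.Now()
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			attempt := time.Now()
			if err := renew(ctx); err == nil {
				// Count validity from before the call: the conservative choice.
				lastOK = attempt
				continue
			}
			if time.Since(lastOK) >= ttl {
				return ErrLeaseExpired
			}
		}
	}
}

// simulate runs the loop against a fake witness that accepts three
// renewals and then becomes unreachable.
func simulate() (renewed int, err error) {
	renew := func(context.Context) error {
		if renewed < 3 {
			renewed++
			return nil
		}
		return errors.New("witness unreachable")
	}
	// Same 1:3 ratio as 10s renew / 30s TTL, scaled to milliseconds.
	err = HoldLease(context.Background(), renew, 10*time.Millisecond, 30*time.Millisecond)
	return renewed, err
}

func main() {
	n, err := simulate()
	fmt.Printf("successful renewals: %d, loop ended with: %v\n", n, err)
}
```

With a 1:3 renew-to-TTL ratio, a loop of this shape tolerates a couple of failed renewals before giving up; the exact count depends on timing. A real reconciler would also need to stop acting as primary as soon as the loop returns.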
Install
    # Default-OFF gate keeps the controller stopped until the operator
    # installs bp-cnpg-pair + bp-powerdns and configures the witness.
    helm install bp-continuum ./products/continuum/chart \
      --namespace openova-system \
      --create-namespace
To enable on a per-Sovereign overlay:
    # clusters/<sovereign>/values-continuum.yaml
    continuum:
      enabled: true
    image:
      tag: "<sha-stamped-by-CI>"
    pdmURL: "http://pool-domain-manager.openova-system.svc.cluster.local:8080"
    natsURL: "nats://nats.openova-system.svc.cluster.local:4222"
CRD
The Continuum.dr.openova.io/v1 CRD lives in the Catalyst chart at
products/catalyst/chart/crds/continuum.yaml (slice B8 of EPIC-0,
#1110). It is not duplicated here — see crds/README.md.
Reconcile contract (K-Cont-2)
The K-Cont-2 reconciler will:
- Fetch the Continuum CR. Validate that the referenced Application has placement: active-hotstandby. Validate primaryRegion ∈ Application.spec.regions[].
- Acquire / renew the lease via the witness (kind from spec.leaseClient.kind).
- Watch CNPG cnpg.io/cluster.replicationLag per replica.
- On switchover (operator-initiated, or auto-failover when health check + witness quorum agree the primary is unreachable), execute the sequence in docs/EPICS-1-6-unified-design.md §9.3.
- Patch status (phase, primaryRegion, leaseHolder, leaseExpiresAt, replicationLag map, conditions, lastSwitchover).
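The first step of the contract is a pure check, so it can be illustrated without any cluster access. The sketch below assumes trimmed-down, hypothetical Go views of the two objects (Application, ContinuumSpec) and hypothetical error names; the real reconciler reads unstructured objects, and the field names here are stand-ins only.

```go
package main

import (
	"errors"
	"fmt"
)

// Application and ContinuumSpec are hypothetical, minimal views of the
// fields the validation step needs. They are not the CRD types.
type Application struct {
	Placement string
	Regions   []string
}

type ContinuumSpec struct {
	PrimaryRegion string
}

// Hypothetical sentinel errors for the two validation failures.
var (
	ErrWrongPlacement = errors.New("application placement is not active-hotstandby")
	ErrUnknownRegion  = errors.New("primaryRegion is not one of the application's regions")
)

// ValidateContinuum checks that the Application is active-hotstandby and
// that primaryRegion is a member of the Application's regions.
func ValidateContinuum(spec ContinuumSpec, app Application) error {
	if app.Placement != "active-hotstandby" {
		return fmt.Errorf("%w: got %q", ErrWrongPlacement, app.Placement)
	}
	for _, r := range app.Regions {
		if r == spec.PrimaryRegion {
			return nil
		}
	}
	return fmt.Errorf("%w: %q not in %v", ErrUnknownRegion, spec.PrimaryRegion, app.Regions)
}

func main() {
	// Region names are made up for the example.
	app := Application{Placement: "active-hotstandby", Regions: []string{"region-a", "region-b"}}

	fmt.Println(ValidateContinuum(ContinuumSpec{PrimaryRegion: "region-a"}, app))
	fmt.Println(ValidateContinuum(ContinuumSpec{PrimaryRegion: "region-c"}, app))

	app.Placement = "single-region"
	fmt.Println(ValidateContinuum(ContinuumSpec{PrimaryRegion: "region-a"}, app))
}
```

Wrapping the sentinels with %w lets a caller use errors.Is to map each failure to a distinct status condition, should the reconciler want to surface them separately.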
Reading order for K-Cont-2 implementer
- docs/EPICS-1-6-unified-design.md §9 (full Continuum DR spec)
- docs/SRE.md §2 (DR runbook + lease witness pattern)
- docs/MULTI-REGION-DNS.md (lua-record DNS pattern)
- products/catalyst/chart/crds/continuum.yaml (CRD shape)
- core/controllers/continuum/DESIGN.md (this slice's design notes)
- core/controllers/internal/placement/ (Plan + RegionPlan; reuse for failover region resolution)
Lease witness API contract (K-Cont-3 sketch)
The K-Cont-2 reconciler talks to a WitnessClient interface; K-Cont-3
will land the cloudflare-kv + dns-quorum implementations. Sketched
shape (subject to finalization in K-Cont-3):
    type WitnessClient interface {
        // Acquire attempts to claim the lease for `holder` with the
        // configured TTL. Returns the granted lease (with lease-id +
        // expiry) or an error. ErrLeaseHeldByAnother is returned when
        // another holder already owns the lease.
        Acquire(ctx context.Context, key string, holder string, ttl time.Duration) (Lease, error)

        // Renew extends the lease for the same holder. Returns
        // ErrLeaseLost if the lease was reassigned.
        Renew(ctx context.Context, lease Lease) (Lease, error)

        // Release relinquishes the lease. Idempotent: releasing a lost
        // lease is not an error.
        Release(ctx context.Context, lease Lease) error

        // Read returns the current lease holder + expiry without
        // attempting to acquire. Used for diagnostic / dry-run.
        Read(ctx context.Context, key string) (LeaseInfo, error)
    }
K-Cont-2 wires against this interface with a stub implementation; K-Cont-3 swaps in the real cloudflare-kv + dns-quorum clients.
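One possible shape for that stub is an in-memory implementation. The sketch below is an assumption, not the K-Cont-2 code: the Lease and LeaseInfo fields, the two error values, MemoryWitness and NewMemoryWitness are all hypothetical, since the contract above leaves them to K-Cont-3.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Hypothetical types and errors; the real shapes are K-Cont-3's call.
type Lease struct {
	Key, Holder, ID string
	TTL             time.Duration
	Expiry          time.Time
}

type LeaseInfo struct {
	Holder string
	Expiry time.Time
}

var (
	ErrLeaseHeldByAnother = errors.New("lease held by another holder")
	ErrLeaseLost          = errors.New("lease lost")
)

// WitnessClient repeats the sketched interface so this file compiles alone.
type WitnessClient interface {
	Acquire(ctx context.Context, key string, holder string, ttl time.Duration) (Lease, error)
	Renew(ctx context.Context, lease Lease) (Lease, error)
	Release(ctx context.Context, lease Lease) error
	Read(ctx context.Context, key string) (LeaseInfo, error)
}

// MemoryWitness is an in-process stub. It ignores ctx because it never blocks.
type MemoryWitness struct {
	mu     sync.Mutex
	seq    int
	leases map[string]Lease
}

var _ WitnessClient = (*MemoryWitness)(nil) // compile-time interface check

func NewMemoryWitness() *MemoryWitness {
	return &MemoryWitness{leases: make(map[string]Lease)}
}

func (w *MemoryWitness) Acquire(_ context.Context, key, holder string, ttl time.Duration) (Lease, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	if cur, ok := w.leases[key]; ok && now.Before(cur.Expiry) && cur.Holder != holder {
		return Lease{}, ErrLeaseHeldByAnother
	}
	w.seq++
	l := Lease{Key: key, Holder: holder, ID: fmt.Sprintf("%s-%d", key, w.seq), TTL: ttl, Expiry: now.Add(ttl)}
	w.leases[key] = l
	return l, nil
}

func (w *MemoryWitness) Renew(_ context.Context, lease Lease) (Lease, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	cur, ok := w.leases[lease.Key]
	if !ok || cur.ID != lease.ID || !now.Before(cur.Expiry) {
		return Lease{}, ErrLeaseLost
	}
	cur.Expiry = now.Add(cur.TTL)
	w.leases[lease.Key] = cur
	return cur, nil
}

func (w *MemoryWitness) Release(_ context.Context, lease Lease) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	// Only the current lease is removed; a stale lease is a silent no-op.
	if cur, ok := w.leases[lease.Key]; ok && cur.ID == lease.ID {
		delete(w.leases, lease.Key)
	}
	return nil
}

func (w *MemoryWitness) Read(_ context.Context, key string) (LeaseInfo, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	cur, ok := w.leases[key]
	if !ok || !time.Now().Before(cur.Expiry) {
		return LeaseInfo{}, nil // no live lease: empty holder
	}
	return LeaseInfo{Holder: cur.Holder, Expiry: cur.Expiry}, nil
}

func main() {
	ctx := context.Background()
	w := NewMemoryWitness()
	a, _ := w.Acquire(ctx, "continuum/demo", "region-a", 30*time.Second)
	_, err := w.Acquire(ctx, "continuum/demo", "region-b", 30*time.Second)
	fmt.Println("holder:", a.Holder, "| second acquire:", err)
	_ = w.Release(ctx, a)
	info, _ := w.Read(ctx, "continuum/demo")
	fmt.Printf("holder after release: %q\n", info.Holder)
}
```

Storing the TTL on the lease is one way to let Renew extend by "the configured TTL" without an extra parameter; the real clients may keep that configuration elsewhere.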
— K-Cont-1 (#1101)