openova/docs/MULTI-REGION-DNS.md
# Multi-Region DNS — health-checked failover with PowerDNS lua-records
**Status:** Authoritative. **Updated:** 2026-04-29 (Reconcile Pass 1).
This document is the canonical reference for **how Catalyst routes traffic across regions**. Geographic redundancy in OpenOva is realized at the **authoritative DNS** layer, not at the K8s controller layer. PowerDNS lua-records (`ifurlup`, `ifportup`, `pickclosest`, `pickrandom`, `pickwhashed`) provide everything Catalyst needs:
- **Geo-aware response selection** — answer the closest healthy backend for the resolver's source IP / ECS subnet.
- **Health-checked failover** — drop a backend from the response set when a TCP/HTTP probe fails, restore it when the probe recovers.
- **Latency-aware routing** — combine `ifurlup` (health) with `pickclosest` (geo) for active-active steering.
- **Same operational layer Catalyst already runs** — PowerDNS is bp-powerdns, deployed by the bootstrap kit on every Sovereign's `mgt` cluster. No separate operator, no extra CRDs, no extra reconciliation loop.
This subsumes the role previously assigned to k8gb. The k8gb component has been removed from `componentGroups.ts`, the umbrella chart, and the wizard; lua-records cover every failover scenario k8gb covered without the dedicated GSLB controller.
---
## 1. Why PowerDNS lua-records (and why not k8gb)
| Concern | k8gb (removed) | PowerDNS lua-records (current) |
|---|---|---|
| Authoritative DNS | CoreDNS plugin, separate zone | PowerDNS authoritative — same zones used for `external-dns`, ACME, etc. |
| Operator footprint | k8gb controller + CRDs (`Gslb`, `GslbHttpRoute`) + per-cluster CoreDNS pod set | None — declarative LUA records in the existing PowerDNS zone |
| Health-check primitive | k8gb-managed liveness probes | PowerDNS `ifurlup` / `ifportup` (HTTP / TCP probes from PowerDNS pods) |
| Geo selection | EdgeDNS witness + custom logic | `pickclosest` (geo by source IP), `pickrandom` (uniform random), `pickwhashed` (sticky weighted) |
| DNSSEC | Layered on top, separate signer | Native — PowerDNS signs the lua-record's computed answer with the zone's KSK/ZSK |
| Operational surface | k8gb pods + CoreDNS pods + custom CRDs | Existing PowerDNS deployment + dnsdist rate-limit shield |
| Cluster-coordination | Required (gslb endpoints sync between clusters) | Not required — authoritative DNS is the source of truth |
The architectural cost difference is large enough that the deletion is the right move per [INVIOLABLE-PRINCIPLES.md](INVIOLABLE-PRINCIPLES.md) #2 ("never compromise from quality — pick the unified primitive, not the dual-shape design") and #4 ("never hardcode — health probes, weights, geo policy are configuration in the lua-record body, not code in a controller").
---
## 2. Failover patterns (the lua-record cookbook)
Every Catalyst Sovereign zone is hosted on PowerDNS. The records below sit alongside ordinary A/AAAA/CNAME records that `external-dns` writes via the PowerDNS REST API. Lua-record syntax follows the [upstream PowerDNS documentation](https://doc.powerdns.com/authoritative/lua-records/index.html).
> **Note on examples.** Backend IPv4 addresses (`5.161.42.18`, `95.217.189.42`) and the FQDN `primary.example.com` below are placeholders — they illustrate the lua-record shape only. The canonical 6-record set per Sovereign zone is written by **pool-domain-manager** (PDM, `core/pool-domain-manager/`) on `/v1/commit`; lua-records (geo / health-check policy) are written by the **catalyst-dns** controller (Catalyst control-plane sidecar) from each Application's Placement spec — see [`docs/PLATFORM-POWERDNS.md`](PLATFORM-POWERDNS.md) §"In-cluster consumers".
### 2.1 Active-active across two regions, health-checked
```
foo.acme.com. IN LUA A "ifurlup('https://primary.example.com/healthz', {'5.161.42.18', '95.217.189.42'}, {selector='all'})"
```
- PowerDNS requests `https://primary.example.com/healthz` against each listed backend address from each PowerDNS pod every 5s (default; configurable via `interval` option).
- `selector='all'` returns **every** healthy backend — the resolver's stub then picks one (typical client behaviour: rotate, retry on failure).
- When the probe to a backend fails three times in a row (default `failOnIncerror=true`, 3 fails to drop), that backend is removed from the answer set within the next TTL window.
- When the probe recovers, the backend is restored automatically.
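The drop-and-restore behaviour described in the bullets above can be modelled in a few lines. This is a simplified sketch, not PowerDNS's implementation: `HealthTracker` and `fail_threshold` are hypothetical names, and the threshold of 3 is simply the value quoted in this section.

```python
class HealthTracker:
    """Toy model of a health-checked answer set (selector='all' behaviour)."""

    def __init__(self, backends, fail_threshold=3):
        # fail_threshold is an assumption taken from the description above.
        self.backends = list(backends)
        self.fail_threshold = fail_threshold
        self._fails = {ip: 0 for ip in self.backends}

    def record_probe(self, backend, ok):
        # A success resets the counter; a failure increments it.
        self._fails[backend] = 0 if ok else self._fails[backend] + 1

    def answer_set(self):
        # Every backend below the failure threshold stays in the answer.
        return [ip for ip in self.backends
                if self._fails[ip] < self.fail_threshold]


tracker = HealthTracker(["5.161.42.18", "95.217.189.42"])
for _ in range(3):
    tracker.record_probe("5.161.42.18", ok=False)
print(tracker.answer_set())  # the failed backend is absent
tracker.record_probe("5.161.42.18", ok=True)
print(tracker.answer_set())  # one successful probe restores it
```

Resolvers that cached the earlier answer keep using it until the record's TTL expires, which is why the TTL bounds how quickly clients see the change.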
### 2.2 Geo-aware active-active (`pickclosest`)
```
api.acme.com. IN LUA A "pickclosest({'5.161.42.18', '95.217.189.42'})"
```
- PowerDNS uses ECS (EDNS Client Subnet) when present, falling back to the resolver's source IP.
- The closer regional LB by GeoIP wins.
- Combine with `ifurlup` for health-aware closeness:
```
api.acme.com. IN LUA A "ifurlup('https://primary.example.com/healthz', {{'5.161.42.18', '95.217.189.42'}}, {selector='pickclosest'})"
```
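To make "closest healthy backend" concrete, here is a minimal sketch of the selection logic. It is an illustration only: PowerDNS derives locations from its GeoIP data, whereas this sketch substitutes hard-coded coordinates (assumed, since the placeholder IPs have no real location) and a great-circle distance. `pick_closest` and `BACKEND_LOCATIONS` are hypothetical names.

```python
import math

# Assumption: illustrative (latitude, longitude) pairs for the placeholder IPs.
BACKEND_LOCATIONS = {
    "5.161.42.18": (39.04, -77.49),    # assumed US-East location
    "95.217.189.42": (60.17, 24.94),   # assumed Northern-Europe location
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_closest(client, backends, healthy=None):
    """Return the nearest backend IP, optionally restricted to healthy ones."""
    candidates = {ip: loc for ip, loc in backends.items()
                  if healthy is None or ip in healthy}
    if not candidates:
        return None
    return min(candidates, key=lambda ip: haversine_km(client, candidates[ip]))


frankfurt = (50.11, 8.68)
print(pick_closest(frankfurt, BACKEND_LOCATIONS))
# With the European backend unhealthy, the remaining backend is returned.
print(pick_closest(frankfurt, BACKEND_LOCATIONS, healthy={"5.161.42.18"}))
```

Passing a `healthy` set mirrors the combined `ifurlup` plus `pickclosest` form: health filters the candidates first, then distance picks among the survivors.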
### 2.3 Active-passive (primary → DR)
```
api.acme.com. IN LUA A "ifurlup('https://primary.example.com/healthz', {{'5.161.42.18'}, {'95.217.189.42'}})"
```
- The nested address sets are tried in order: PowerDNS answers from the first set that contains a healthy address, so the second set is used only while the first is down.
- When `5.161.42.18` (primary) is healthy → answer is `5.161.42.18`.
- When primary fails the probe → answer flips to `95.217.189.42` (DR) within one TTL window.
- When primary recovers → answer flips back to primary on the next probe success.
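The flip-and-return sequence above amounts to "first healthy backend in priority order". A minimal sketch that replays it, with `active_passive_answer` as a hypothetical helper name:

```python
def active_passive_answer(ordered_backends, healthy):
    """Return the first backend, in priority order, that is currently healthy."""
    for ip in ordered_backends:
        if ip in healthy:
            return ip
    return None  # nothing healthy; the real fallback behaviour is not modelled


PRIMARY, DR = "5.161.42.18", "95.217.189.42"

# Three probe rounds: both healthy, primary down, primary recovered.
timeline = [{PRIMARY, DR}, {DR}, {PRIMARY, DR}]
for healthy in timeline:
    print(active_passive_answer([PRIMARY, DR], healthy))
```

The three printed answers are the primary, then the DR address, then the primary again, matching the bullets.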
### 2.4 TCP-only / non-HTTP services (`ifportup`)
For services that don't expose an HTTP `/healthz` (e.g. SMTP, IMAP, custom TCP):
```
mail.acme.com. IN LUA A "ifportup(587, {'5.161.42.18', '95.217.189.42'})"
```
- PowerDNS attempts a TCP connect to port 587 on each backend.
- Connect-fail → drop from the response set; connect-success → include.
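The connect test can be approximated with a plain TCP connection attempt. The sketch below shows the idea, not PowerDNS's probe code; `port_is_up`, `ifportup_model`, and the 2-second timeout are assumptions made for illustration.

```python
import socket


def port_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def ifportup_model(port, backends, timeout=2.0):
    """Keep only the backends that accept a TCP connection on the port."""
    return [ip for ip in backends if port_is_up(ip, port, timeout)]


# Example call (not executed here because it would touch the network):
#   ifportup_model(587, ["5.161.42.18", "95.217.189.42"])
```

A successful connect only proves the port accepts connections; it says nothing about whether the SMTP or IMAP service behind it is answering correctly.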
### 2.5 Weighted hashed distribution (`pickwhashed`)
For canary releases or traffic-shifting:
```
api.acme.com. IN LUA A "pickwhashed({{80, '5.161.42.18'}, {20, '95.217.189.42'}})"
```
- 80% of distinct client IPs are pinned to `5.161.42.18`, 20% to `95.217.189.42` (consistent hash on source IP — the same client gets the same answer until the weight changes).
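The pinning behaviour follows from hashing the client address into weighted buckets. The sketch below illustrates the technique; it uses SHA-256 as a stand-in hash (PowerDNS uses its own), and `pick_weighted_hashed` is a hypothetical name.

```python
import hashlib


def pick_weighted_hashed(client_ip, weighted_backends):
    """Map a client IP deterministically onto weighted backends.

    weighted_backends is a list of (weight, ip) pairs, mirroring the
    {{80, '...'}, {20, '...'}} shape of the record body.
    """
    total = sum(weight for weight, _ in weighted_backends)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    point = int.from_bytes(digest[:8], "big") % total
    upto = 0
    for weight, ip in weighted_backends:
        upto += weight
        if point < upto:
            return ip
    raise AssertionError("unreachable: point is always below the total weight")


weights = [(80, "5.161.42.18"), (20, "95.217.189.42")]
# The same client always lands on the same backend while the weights are unchanged.
print(pick_weighted_hashed("203.0.113.7", weights))
```

Because the mapping depends only on the client address and the weights, shifting a canary from 20 to 30 moves some clients across and leaves the rest pinned where they were.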
---
## 3. Catalyst integration points
### 3.1 Where lua-records are written
Lua-records are part of each Sovereign's PowerDNS zone, alongside the canonical 6-record set ([`PLATFORM-POWERDNS.md`](PLATFORM-POWERDNS.md) §"Per-Sovereign zone model"). The 6-record set is written once at provisioning by **pool-domain-manager** (PDM `/v1/commit`); ongoing A/AAAA/CNAME records are written by **external-dns**; LUA records are written by the **catalyst-dns** controller (sidecar to the Catalyst control plane on the `mgt` cluster):
```
PDM ──► PowerDNS REST API ──► canonical 6-record set (one-shot at provision)
external-dns ──► PowerDNS REST API ──► A/AAAA/CNAME records (per-region LB IPs)
catalyst-dns ──► PowerDNS REST API ──► LUA records (geo / health-check policy)
```
This separation matters: `external-dns` knows about a single K8s Service or Ingress; it has no concept of multi-region health policy. The catalyst-dns controller reads the Application's **Placement** field from the per-Org Gitea repo, sees `placement: active-active` (or `active-hotstandby`, etc.), and synthesizes the corresponding lua-record body.
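As an illustration of the last arrow in the diagram, the sketch below builds the JSON body that the PowerDNS HTTP API accepts for an rrset `PATCH` on a zone (`/api/v1/servers/localhost/zones/<zone>`, authenticated with the `X-API-Key` header). How catalyst-dns actually constructs its requests is not described in this document; `lua_rrset_patch`, the TTL of 30, and the `localhost` server id are assumptions.

```python
import json


def lua_rrset_patch(fqdn, lua_body, qtype="A", ttl=30):
    """Build a PowerDNS API PATCH body that replaces one LUA rrset.

    LUA record content is the answer type followed by the quoted Lua snippet.
    """
    name = fqdn if fqdn.endswith(".") else fqdn + "."
    # Escape backslashes and double quotes so the snippet survives quoting.
    escaped = lua_body.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "rrsets": [{
            "name": name,
            "type": "LUA",
            "ttl": ttl,  # assumption: illustrative TTL
            "changetype": "REPLACE",
            "records": [{"content": f'{qtype} "{escaped}"', "disabled": False}],
        }]
    }


body = lua_rrset_patch(
    "api.acme.com",
    "pickclosest({'5.161.42.18', '95.217.189.42'})",
)
print(json.dumps(body, indent=2))
```

Using `REPLACE` makes the write idempotent: re-emitting the same Placement produces the same rrset instead of appending a duplicate.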
### 3.2 Application Placement → lua-record selector mapping
| Application Placement | lua-record idiom |
|---|---|
| `single-region` | Plain A record(s) — no lua-record needed |
| `active-active` | `ifurlup(..., {selector='all'})` (or `selector='pickclosest'` for geo-affinity) |
| `active-hotstandby` | `ifurlup(..., {{primary}, {dr}})` with ordered address sets: primary first, DR second |
| `active-passive-warm` | `ifurlup(..., {{primary}, {dr}})` with ordered address sets + longer TTL (manual operator promotion is the contract; the LUA only flips when the probe fails enough times) |
| `weighted-canary` | `pickwhashed({{w1, ip1}, {w2, ip2}})` — adjust weights via Catalyst console (re-emits the lua-record body with new weights) |
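A rough sketch of how such a mapping could be rendered into record bodies follows. It is not the catalyst-dns implementation: `render_lua_record` and its parameters are hypothetical, and for the standby placements the sketch emits ordered address sets, which upstream `ifurlup` tries in sequence.

```python
def render_lua_record(placement, probe_url, backends, weights=None,
                      geo_affinity=False):
    """Return a lua-record body for a Placement, or None for plain A records."""
    if placement == "single-region":
        return None  # plain A record(s); no LUA record needed
    if placement == "active-active":
        ips = ", ".join(f"'{ip}'" for ip in backends)
        selector = "pickclosest" if geo_affinity else "all"
        return f"ifurlup('{probe_url}', {{{ips}}}, {{selector='{selector}'}})"
    if placement in ("active-hotstandby", "active-passive-warm"):
        # One address set per backend, in priority order (primary first).
        sets = ", ".join(f"{{'{ip}'}}" for ip in backends)
        return f"ifurlup('{probe_url}', {{{sets}}})"
    if placement == "weighted-canary":
        if not weights or len(weights) != len(backends):
            raise ValueError("weighted-canary needs one weight per backend")
        pairs = ", ".join(f"{{{w}, '{ip}'}}" for w, ip in zip(weights, backends))
        return f"pickwhashed({{{pairs}}})"
    raise ValueError(f"unknown placement: {placement}")


print(render_lua_record(
    "active-active",
    "https://primary.example.com/healthz",
    ["5.161.42.18", "95.217.189.42"],
))
```

Keeping the mapping in one pure function means a weight or Placement change is a re-render of the record body and nothing else.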
### 3.3 Probe target
Every Catalyst Application Blueprint MUST expose `/healthz` on its public endpoint. The catalyst-dns controller defaults to `https://<app-fqdn>/healthz` as the probe target, configurable per-Application via `spec.healthCheck.path` in the Blueprint instance.
DNS pods are inside the Sovereign — they probe **outbound** to the regional LB IPs over the public internet (or via the Cilium Cluster Mesh + WireGuard back-channel for cross-region private probes). The probe direction is intentional: DNS pods are the source of truth on whether a regional LB is reachable from the same place the public internet would reach it.
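The default probe target described in this section reduces to a small string rule. A minimal sketch, assuming a hypothetical `probe_url` helper whose `health_path` argument stands in for the value of `spec.healthCheck.path`:

```python
def probe_url(app_fqdn, health_path="/healthz"):
    """Build the HTTPS probe URL for an Application's public endpoint."""
    # Tolerate a trailing dot on the FQDN and a path given without a slash.
    host = app_fqdn.rstrip(".")
    path = health_path if health_path.startswith("/") else "/" + health_path
    return f"https://{host}{path}"


print(probe_url("api.acme.com"))             # default path
print(probe_url("api.acme.com", "/readyz"))  # per-Application override
```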
### 3.4 Split-brain protection (failover-controller)
Lua-records are necessary but not sufficient for split-brain protection during a network partition. The [failover-controller](../platform/failover-controller/README.md) layers a **lease-based witness** on top:
- During healthy operation, each regional cluster renews a lease in a cloud witness (Cloudflare KV or similar — out of band from the Sovereign's own infra).
- The PowerDNS lua-record probes are the *primary* failover signal (sub-minute response).
- The lease becomes the *tie-breaker* for stateful promotion (OpenBao DR, CNPG primary promotion) — only the cluster holding a valid lease is allowed to take over write authority.
- See [`SRE.md`](SRE.md) §2.4 for the witness protocol; this doc covers only the DNS-routing half.
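The tie-breaker rule in the third bullet can be stated as a single predicate. The sketch below is an illustration of that rule only; the `Lease` shape and `may_promote` are hypothetical, and the witness protocol itself is the one specified in `SRE.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    holder: str        # cluster name that last renewed the lease
    expires_at: float  # expiry as UNIX seconds


def may_promote(cluster, lease, now):
    """A cluster may take write authority only while it holds a valid lease."""
    if lease is None:
        return False
    return lease.holder == cluster and now < lease.expires_at


lease = Lease(holder="hz-fsn-rtz-prod", expires_at=1_000.0)
print(may_promote("hz-fsn-rtz-prod", lease, now=900.0))   # holder, lease valid
print(may_promote("hz-hel-rtz-prod", lease, now=900.0))   # not the holder
print(may_promote("hz-fsn-rtz-prod", lease, now=1_100.0)) # lease expired
```

DNS answers can flip on probe results alone, but a stateful promotion additionally requires this predicate to hold, so two partitioned clusters cannot both take write authority.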
---
## 4. When to add a second Sovereign region (the HA upgrade path)
A single-region Sovereign is the SME default ([`PLATFORM-TECH-STACK.md`](PLATFORM-TECH-STACK.md) §9.2). For corporate / regulated tier (and for any Sovereign that signs an SLA strict enough that single-region downtime would breach it), the upgrade path is:
1. **Sovereign provisioned in Region A** (e.g. `hz-fsn-rtz-prod`) — single LB IP, plain A records.
2. **Operator decides to add Region B** via the Catalyst admin UI: Admin → Infrastructure → Add Region (see [`SOVEREIGN-PROVISIONING.md`](SOVEREIGN-PROVISIONING.md) §8).
3. Crossplane provisions Region B's clusters (rtz + dmz) with **the same building blocks** as Region A.
4. Region B's PowerDNS replicas join the Sovereign's authoritative NS set via SOA NOTIFY + AXFR (PowerDNS-native zone replication; no external sync layer needed).
5. **catalyst-dns rewrites every Application's lua-record from `single-region` → `active-active`** (or whichever Placement the Application opts into). Old plain A records are replaced with `ifurlup(...)` lua-records pointing at both regional LBs.
6. The cloud witness (failover-controller) starts arbitrating leases across the two clusters.
The cluster name **never changes** during this upgrade — Region A's cluster is still `hz-fsn-rtz-prod`, Region B is now `hz-hel-rtz-prod`, and neither is "primary" or "DR". This is the explicit design from [`NAMING-CONVENTION.md`](NAMING-CONVENTION.md) §1.3 — failover is a routing event, not a renaming event.
### 4.1 Triggers for adding a second region
| Trigger | Recommendation |
|---|---|
| SLA target ≥ 99.95% uptime | Mandatory second region — single-region cannot meet this |
| Compliance requirement (DORA, NIS2, GDPR data residency split) | Mandatory — typically one region per data-residency boundary |
| Application's Placement set to `active-active` / `active-hotstandby` / `active-passive-warm` | Mandatory — these placements require ≥ 2 regions to honour |
| Latency-sensitive global traffic (regional users far from Region A) | Strongly recommended — `pickclosest` lua-records cut median RTT |
| Cost-sensitive single-tenant Sovereign on a low-tier SLA | Defer — pay for it when a workload demands it |
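To put the SLA trigger in concrete terms, the arithmetic below converts an uptime target into a downtime budget. It is plain arithmetic with no Catalyst-specific inputs; `downtime_budget_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60       # 525,600
MINUTES_PER_30_DAYS = 30 * 24 * 60     # 43,200


def downtime_budget_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted by an uptime SLA over the period."""
    return (1 - sla_percent / 100) * period_minutes


print(round(downtime_budget_minutes(99.95), 1))                       # 262.8
print(round(downtime_budget_minutes(99.95, MINUTES_PER_30_DAYS), 1))  # 21.6
```

At 99.95% the whole year's allowance is under four and a half hours, and a 30-day window allows about 22 minutes.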
---
## 5. Operational checks
### 5.1 Verify a lua-record is healthy
```
dig +short api.acme.com @ns1.openova.io
# Expected: an A record from the healthy regional LB set.
```
```
dig +short api.acme.com @ns1.openova.io \
+subnet=80.81.82.0/24
# Expected: with an EU client subnet, pickclosest returns the EU regional LB.
```
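When these checks are scripted, the `dig +short` output can be compared against the expected backend set. A minimal sketch, assuming the output has already been captured as text; `parse_dig_short` and `unexpected_answers` are hypothetical helper names.

```python
import ipaddress


def parse_dig_short(output):
    """Extract IP addresses from `dig +short` output, skipping other lines."""
    addresses = []
    for line in output.splitlines():
        token = line.strip()
        if not token:
            continue
        try:
            addresses.append(str(ipaddress.ip_address(token)))
        except ValueError:
            continue  # CNAME targets and comment lines are not addresses
    return addresses


def unexpected_answers(output, expected_backends):
    """Return any answered address that is not a known regional LB."""
    return [ip for ip in parse_dig_short(output) if ip not in expected_backends]


sample = "5.161.42.18\n95.217.189.42\n"
print(unexpected_answers(sample, {"5.161.42.18", "95.217.189.42"}))
```

An empty result means every answer came from the known backend set; it does not by itself confirm that an unhealthy backend was dropped.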
### 5.2 Force a probe-failure simulation (chaos-engineering)
The [Litmus](../platform/litmus/README.md) chaos suite includes a scenario that black-holes a regional LB's probe target. After ~1 TTL window:
```
dig +short api.acme.com @ns1.openova.io
# Expected: the affected backend IP is absent from the response.
```
When the probe target is restored, the IP returns automatically — no operator action.
### 5.3 Read PowerDNS probe state
```
kubectl exec -n openova-system deploy/powerdns -- pdns_control bind-list-record api.acme.com
```
PowerDNS exposes the current probe status (last probe timestamp, last result, current selection set) — useful when investigating "why is the answer set what it is?" during an incident.
---
## 6. References
- [PowerDNS Lua Records — upstream documentation](https://doc.powerdns.com/authoritative/lua-records/index.html) — every selector, every option.
- [`PLATFORM-POWERDNS.md`](PLATFORM-POWERDNS.md) — the bp-powerdns deployment, DNSSEC posture, REST API contract.
- [`SOVEREIGN-PROVISIONING.md`](SOVEREIGN-PROVISIONING.md) §7-§8 — multi-region topology + add-region workflow.
- [`NAMING-CONVENTION.md`](NAMING-CONVENTION.md) §1.3 + §7 — building-block naming, no "primary"/"DR" labels.
- [`SRE.md`](SRE.md) §2 — multi-region strategy, split-brain protection, data-replication patterns.
- [`SECURITY.md`](SECURITY.md) §5 — OpenBao independent-Raft-per-region (DNS failover doesn't touch secret authority).
- Issue [#171](https://github.com/openova-io/openova/issues/171) — the change that retired k8gb in favour of PowerDNS lua-records.
---
*Part of [OpenOva Catalyst](https://openova.io). Read [Inviolable Principles](INVIOLABLE-PRINCIPLES.md) before any changes.*