fix(sovereign-tls): cilium-gateway propagates Hetzner LB annotations via spec.infrastructure

Closes #1885 (TBD-A31).

Problem (t28 evidence: A98 + A107 reports, 2026-05-19 00:30Z):
`console.t28.omani.works:443` accepts TCP but TLS resets. Inspection:
`kubectl get svc -n kube-system cilium-gateway-cilium-gateway` shows
type=ClusterIP with no Hetzner LB. Even with the tofu-provisioned
`hcloud_load_balancer.main` (infra/hetzner/main.tf:955) carrying
443→30443 service-port at the infra layer, the cluster-side hcloud-CCM
has no signal to materialise a parallel Service-level LB for the
auto-generated gateway Service, so operators inspecting kubectl see a
non-LoadBalancer Service and conclude the LB chain is broken.

Fix:
Add `spec.infrastructure.annotations` to the Gateway resource. The
Gateway-API spec mandates that controllers propagate these annotations
to any infrastructure resources they create; in Cilium 1.16+ this means
the auto-generated `cilium-gateway-cilium-gateway` Service in
kube-system. hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55)
then picks the annotations up at Service reconcile time and provisions
a Hetzner LB.
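
For orientation, a sketch of the Service metadata one would expect if propagation works as described above. The Service name and annotation keys come from this commit; the values assume the default substitutions (`catalyst`, `primary`), `fsn1` is a placeholder location that does not appear in the source, and `type: LoadBalancer` reflects the acceptance criteria rather than observed output:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cilium-gateway-cilium-gateway
  namespace: kube-system
  annotations:
    # Expected to be copied by the gateway controller from the Gateway's
    # spec.infrastructure.annotations (subset shown).
    load-balancer.hetzner.cloud/name: "catalyst-primary-gateway"
    load-balancer.hetzner.cloud/location: "fsn1"  # placeholder, not from the source
    load-balancer.hetzner.cloud/type: "lb11"
    load-balancer.hetzner.cloud/use-private-ip: "false"
spec:
  type: LoadBalancer  # expected per the acceptance criteria, not verified here
```

Comparing `kubectl get svc -n kube-system cilium-gateway-cilium-gateway -o yaml` against this shape is one way to tell a propagation failure (annotations missing) from a CCM failure (annotations present, no external IP).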

Annotations (mirrors clustermesh-apiserver block in 01-cilium.yaml):
  - load-balancer.hetzner.cloud/name = <slug>-<region>-gateway
  - load-balancer.hetzner.cloud/location = <Hetzner DC>
  - load-balancer.hetzner.cloud/type = lb11
  - load-balancer.hetzner.cloud/use-private-ip = "false" (DoD A2: public IPs always)
  - load-balancer.hetzner.cloud/disable-private-ingress = "true"
  - load-balancer.hetzner.cloud/health-check-protocol = tcp
  - load-balancer.hetzner.cloud/health-check-port = "30443"
  - load-balancer.hetzner.cloud/health-check-interval = 15s
  - load-balancer.hetzner.cloud/health-check-timeout = 10s
  - load-balancer.hetzner.cloud/health-check-retries = "3"

Per-region segmentation:
SOVEREIGN_FQDN_SLUG + SOVEREIGN_REGION_KEY in the LB name so each
multi-region peer's cilium-gateway gets its own public LB (Hetzner LBs
are unique-by-name; duplicate-name allocations collapse to the
first-created instance, hiding the LB for every subsequent region).

Wiring:
3 substitute vars (SOVEREIGN_FQDN_SLUG, SOVEREIGN_REGION_KEY,
HCLOUD_LB_LOCATION) threaded into the sovereign-tls Kustomization's
postBuild.substitute block. These mirror the same vars already passed
to bootstrap-kit's Kustomization for the clustermesh-apiserver LB block
in 01-cilium.yaml apiserver.service.annotations, so the configuration
boundary is symmetric across the gateway LB and the clustermesh LB.
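
A minimal sketch of what the wiring above could look like on the Flux side, assuming a standard Flux `Kustomization`. Only the three variable names and the `postBuild.substitute` mechanism come from this commit; the resource name, namespace, path, source reference, interval, and all three values are hypothetical placeholders:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sovereign-tls            # assumed from the commit text
  namespace: flux-system         # hypothetical
spec:
  interval: 10m                  # illustrative
  prune: true                    # illustrative
  path: ./clusters/example/sovereign-tls   # hypothetical path
  sourceRef:
    kind: GitRepository
    name: flux-system            # hypothetical
  postBuild:
    substitute:
      SOVEREIGN_FQDN_SLUG: "example-slug"   # hypothetical value
      SOVEREIGN_REGION_KEY: "primary"       # hypothetical value
      HCLOUD_LB_LOCATION: "fsn1"            # hypothetical value
```

With these values, the Gateway annotation `${SOVEREIGN_FQDN_SLUG:=catalyst}-${SOVEREIGN_REGION_KEY:=primary}-gateway` would render as `example-slug-primary-gateway`; a second region with a different SOVEREIGN_REGION_KEY would render a distinct LB name.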

Memory rules respected:
  - A2 (PUBLIC IPs for inter-region): use-private-ip=false
  - feedback_overlap_provs_dont_serialize_wait (no provisioning gate)
  - feedback_subagents_inherit_design_system (no new architectural seam,
    reuses existing Gateway-API + hcloud-CCM contracts)

Validation:
  $ kubectl kustomize clusters/_template/sovereign-tls/ | grep -A 30 'kind: Gateway'
  → renders all 10 Hetzner LB annotations under spec.infrastructure
  → ${SOVEREIGN_FQDN_SLUG}/${SOVEREIGN_REGION_KEY}/${HCLOUD_LB_LOCATION}
    substituted at Flux apply time

Acceptance criteria (per issue):
  - kubectl get svc -n kube-system cilium-gateway-cilium-gateway
    shows type=LoadBalancer with external IP (after fresh prov + handover)
  - curl -skI https://console.<fqdn>/ returns HTTP 200

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent ab6f3e6510
commit 16f7284116

@@ -88,6 +88,61 @@ metadata:
     catalyst.openova.io/component: cilium-gateway
 spec:
   gatewayClassName: cilium
+  # ── TBD-A31 (#1885): Hetzner LB annotations for the gateway Service ──
+  #
+  # The Gateway-API spec (`spec.infrastructure.annotations`) is the canonical
+  # mechanism for declaring annotations that the controller MUST propagate
+  # to any infrastructure resources it creates in response to this Gateway —
+  # in Cilium's case, the auto-generated `cilium-gateway-cilium-gateway`
+  # Service in kube-system. Cilium 1.16+ honours this block and forwards
+  # the annotations onto the Service `metadata.annotations`, where
+  # hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55) picks them up
+  # at Service reconcile time and provisions a Hetzner LB.
+  #
+  # Why this matters operationally:
+  # - A98+A107 evidence on t28 (76fdffb42532e6cc): the gateway Service
+  #   showed `type=ClusterIP` with no Hetzner LB attached → public TLS
+  #   to console.t28.omani.works:443 reset at the handshake. Even with
+  #   the tofu-provisioned `hcloud_load_balancer.main` (infra/hetzner/
+  #   main.tf:955) carrying 443→30443 service-port, operators inspecting
+  #   `kubectl get svc -n kube-system cilium-gateway-cilium-gateway`
+  #   saw a non-LoadBalancer Service and concluded the LB chain was
+  #   broken. Without these annotations, hcloud-CCM has no signal to
+  #   materialise a parallel Service-level LB (the tofu LB at the
+  #   infra layer is invisible to the cluster-side CCM).
+  # - For multi-region Sovereigns the per-region cilium-gateway in each
+  #   secondary cluster ALSO needs a public LB so external clients can
+  #   reach region-local listeners directly (the omani.homes / omani.rest
+  #   SME-pool subdomains attach to the secondary region's gateway).
+  #   `${SOVEREIGN_REGION_KEY:=primary}` segments the LB name per region
+  #   (mirrors the clustermesh-apiserver LB naming in
+  #   clusters/_template/bootstrap-kit/01-cilium.yaml:237).
+  #
+  # use-private-ip: "false" — per docs/SOVEREIGN-MULTI-REGION-DOD.md A2
+  # (inter-region link = PUBLIC IPs ALWAYS) AND the empirical lesson from
+  # PR #1538: the Hetzner per-region LB has no private-network attachment
+  # by default so CCM rejects `use private ip: missing network id`. The
+  # firewall already opens 30000-32767/tcp (infra/hetzner/main.tf:233) so
+  # the public-IP LB health checks pass against node:30443.
+  #
+  # health-check pinned to TCP:30443 — without this annotation, hcloud-CCM
+  # defaults the health check to the Service's nodePort (which Cilium
+  # allocates randomly when hostNetwork=true). Pinning to 30443 (the
+  # actual host-bound cilium-envoy HTTPS listener) ensures the CCM LB
+  # marks targets healthy AS SOON AS envoy is listening — without this,
+  # the LB stayed `unhealthy` indefinitely on prov #76 (2026-05-14).
+  infrastructure:
+    annotations:
+      load-balancer.hetzner.cloud/name: "${SOVEREIGN_FQDN_SLUG:=catalyst}-${SOVEREIGN_REGION_KEY:=primary}-gateway"
+      load-balancer.hetzner.cloud/location: "${HCLOUD_LB_LOCATION}"
+      load-balancer.hetzner.cloud/type: "lb11"
+      load-balancer.hetzner.cloud/use-private-ip: "false"
+      load-balancer.hetzner.cloud/disable-private-ingress: "true"
+      load-balancer.hetzner.cloud/health-check-protocol: "tcp"
+      load-balancer.hetzner.cloud/health-check-port: "30443"
+      load-balancer.hetzner.cloud/health-check-interval: "15s"
+      load-balancer.hetzner.cloud/health-check-timeout: "10s"
+      load-balancer.hetzner.cloud/health-check-retries: "3"
   # NOTE: ports 30080/30443 (not 80/443) — even with hostNetwork=true,
   # cilium-envoy refuses to bind privileged ports because cilium-agent
   # gates that bind through its `envoy-keep-cap-netbindservice` flag and
@@ -1256,6 +1256,28 @@ write_files:
             # (no 5/168h limit); default → PROD. Locals in main.tf
             # render the final string so this template stays declarative.
             WILDCARD_CERT_ISSUER: "${wildcard_cert_issuer}"
+            # TBD-A31 (#1885) — Hetzner LB annotations on cilium-gateway
+            # Gateway resource (spec.infrastructure.annotations). These
+            # substitute vars name and locate the LB hcloud-CCM provisions
+            # for the auto-generated `cilium-gateway-cilium-gateway`
+            # Service in kube-system. Mirrors the same 3 vars threaded
+            # into the bootstrap-kit Kustomization for the clustermesh-
+            # apiserver LB (see 01-cilium.yaml apiserver.service.annotations).
+            # - SOVEREIGN_FQDN_SLUG: short, DNS-safe Sovereign identifier
+            #   used as the LB name prefix so operators can spot the
+            #   gateway LB in the Hetzner Console.
+            # - SOVEREIGN_REGION_KEY: per-region suffix so each
+            #   multi-region peer's cilium-gateway gets its own LB
+            #   (Hetzner LBs are unique by name — duplicates collapse to
+            #   the first-created instance, hiding the LB for every
+            #   subsequent region).
+            # - HCLOUD_LB_LOCATION: Hetzner datacenter location for the
+            #   LB. Per-region rendered (primary CP renders var.region,
+            #   secondary CPs render each.value.cloudRegion) so the LB
+            #   and its backend node are co-located.
+            SOVEREIGN_FQDN_SLUG: "${sovereign_fqdn_slug}"
+            SOVEREIGN_REGION_KEY: ${sovereign_region_key}
+            HCLOUD_LB_LOCATION: "${region}"
       ---
       apiVersion: kustomize.toolkit.fluxcd.io/v1
       kind: Kustomization