openova/infra
e3mrah ef93a2cdbe
feat(cloud-init): patch node providerID after k3s healthz (unblocks Gap A) (#1520)
Architecturally clean replacement for the reverted PRs #1513 (k3s flag)
and #1516 (pre-install hcloud-ccm). Both prior approaches broke
cold-start (chicken-and-egg with the uninitialized taint).

This patch instead lets k3s boot normally with its default embedded
cloud controller (which sets `providerID=k3s://<hostname>` — the
problem), then immediately patches the local Node's `spec.providerID`
to `hcloud://<id>` using the Hetzner instance metadata endpoint
(169.254.169.254). The patch runs ONCE per CP node, right after k3s
apiserver healthz becomes reachable, BEFORE flux-bootstrap.yaml applies
the bootstrap-kit Kustomization.
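
The sequence above can be sketched roughly as follows. This is an
illustration in Python, not the cloud-init script from this commit. The
function names, the retry budget, the metadata URL path, and the use of
`kubectl patch` are assumptions made for the example; only the metadata
address 169.254.169.254 and the `hcloud://<id>` format come from the
message above.

```python
import json
import subprocess
import time
import urllib.request

# Assumed path on the Hetzner metadata service; the commit message only
# names the address 169.254.169.254.
METADATA_URL = "http://169.254.169.254/hetzner/v1/metadata/instance-id"


def provider_id_for(instance_id: str) -> str:
    """Return the hcloud providerID string for a numeric server id."""
    instance_id = instance_id.strip()
    if not instance_id.isdigit():
        raise ValueError(f"unexpected instance id: {instance_id!r}")
    return f"hcloud://{instance_id}"


def build_patch(instance_id: str) -> str:
    """Build a JSON merge patch body that sets spec.providerID."""
    return json.dumps({"spec": {"providerID": provider_id_for(instance_id)}})


def wait_for_healthz(check, attempts: int = 60, delay: float = 5.0) -> bool:
    """Poll check() until it returns True or the attempt budget runs out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False


def fetch_instance_id(url: str = METADATA_URL, timeout: float = 5.0) -> str:
    """Read this server's id from the instance metadata endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode().strip()


def patch_node(node_name: str, instance_id: str) -> None:
    """Apply the patch to the local Node object.

    Hypothetical invocation: assumes kubectl and a working kubeconfig
    are available on the node at this point in boot.
    """
    subprocess.run(
        ["kubectl", "patch", "node", node_name,
         "--type=merge", "-p", build_patch(instance_id)],
        check=True,
    )
```

The health check is passed in as a callable so the sketch does not
commit to a particular probe; the actual script may poll the apiserver
differently.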

Once providerID has the canonical `hcloud://` prefix, bp-hcloud-ccm
(installed by Flux later in the bootstrap-kit chain) accepts the node
as a Hetzner-managed instance and allocates LBs for Service
type=LoadBalancer normally. That unblocks:

- D12: clustermesh-apiserver Service gets a real external IP
       instead of <pending>
- D10: AutoEstablishClusterMesh (PR #1508) can read each region's
       LB IP and write peer entries into the cilium-clustermesh Secret
- D11: inter-region pod-to-pod traffic flows via Cilium WireGuard over
       the per-region LB IPs
- D5: child catalyst-api can reach secondary regions via mesh, so
      the /cloud view aggregates all 3 regions instead of 1/1

Failure is non-fatal: if the metadata lookup or the patch fails, we log
the error and continue (bp-hcloud-ccm may still set providerID later via
its own node-list-and-match logic). Cold-start is never blocked.
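
The log-and-continue behaviour can be illustrated with a small wrapper.
This is a hedged sketch in Python, not the script's actual code: the
`best_effort` helper and the logger name are hypothetical, and the only
property taken from the message above is that a failed step is logged
and never aborts boot.

```python
import logging

log = logging.getLogger("providerid-patch")  # hypothetical logger name


def best_effort(step, *args) -> bool:
    """Run step(*args); log any failure and report it without raising."""
    try:
        step(*args)
        return True
    except Exception as exc:
        # Deliberately broad: no error in this step may block cold-start.
        log.warning("providerID patch skipped: %s", exc)
        return False
```

A caller would wrap the metadata lookup and the patch in `best_effort`
and proceed to the next bootstrap step regardless of the return value.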

Canonical topology (1 cpx52 per region, workerCount=0) means every
node is a CP — covered by this patch. Operator-added workers
(workerCount>0) would also need providerID patched; a follow-up Job
in bp-providerid-patcher can iterate all nodes post-Flux.

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:12:26 +04:00
..
cloudflare-worker-leases feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159) 2026-05-09 08:01:44 +04:00
hetzner feat(cloud-init): patch node providerID after k3s healthz (unblocks Gap A) (#1520) 2026-05-16 14:12:26 +04:00