Commit 3f1a028493 by e3mrah, 2026-05-10 11:56:50 +04:00

fix(infra): hcloud-CCM + cilium DNS hardening + chart-side gitea token - qa-loop iter-12 Fix #54 (#1281)

Four chart-side fixes follow-on to Fix #53 to unblock the remaining multi-region + DNS + gitea-bootstrap matrix gaps.

Workstream 1 - bp-hcloud-ccm (NEW Blueprint @ 1.0.0)
====================================================
platform/hcloud-ccm/ - full Catalyst-curated umbrella over upstream hetznercloud/hcloud-cloud-controller-manager 1.20.0. Pulled into clusters/_template/bootstrap-kit/55-bp-hcloud-ccm.yaml @ slot 55.

Reads hcloud-token from canonical flux-system/cloud-credentials Secret via Flux valuesFrom (mirrors bp-cluster-autoscaler-hcloud + bp-velero + bp-harbor wiring patterns). Renders namespace-local kube-system/hcloud-token Secret consumed by upstream subchart's HCLOUD_TOKEN env var binding. Pinned to k3s control plane via nodeSelector + node.cloudprovider.kubernetes.io/uninitialized toleration.

Why: without hcloud-CCM, every Service-of-type-LoadBalancer stays in EXTERNAL-IP: <pending> forever - the proximate root cause clustermesh-apiserver could not migrate from NodePort to LB on omantel multi-region (Fix #53D PR #1274). Also flips node providerIDs from k3s://<node-name> to hcloud://<server-id> so the scheduler can correlate Pod placement with Hetzner zones.

Workstream 2 - bp-cilium 1.3.1 (DNS hardening)
==============================================
platform/cilium/chart/values.yaml - adds two defensive defaults to mitigate cilium/cilium#28456 ("DNS races during node bring-up when BPF maps allocate on-demand"):

- cilium.bpf.preallocateMaps: true (~12 MiB extra RSS per agent; eliminates the lazy-allocate window where pods on first-join workers fail DNS lookups)
- cilium.socketLB.hostNamespaceOnly: true (pinned explicit; future-proofs against an upstream default flip that re-introduces the per-pod-netns kube-proxy-replacement DNS race)

Why: fresh worker pods on catalyst-omantel-biz-w2/w3 cannot resolve github.com (DNS lookup races). Operational hack today is scheduling sync Jobs only on w1 (source-controller node). Per feedback_no_mvp_no_workarounds.md rule #3, the chart-side defaults are the canonical fix.

Bootstrap-kit slot pin bumped 1.3.0 → 1.3.1 in both _template + omantel overlay.

Workstream 4 - catalyst-gitea-token chart-side template
=======================================================
products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml NEW - chart 1.4.127. Replaces the kubectl-applied operational hack documented in qa-loop-state/iter12-diagnostic-audit.md §"(e) infra-blocked" TC-081.

Pattern mirrors catalyst-openova-kc-credentials-secret.yaml:

1. Helm `lookup` of gitea/gitea-admin-secret to gate render (Sovereign-only; contabo skips because the Secret doesn't exist in that ns layout).
2. Helm `lookup` of catalyst-system/catalyst-gitea-token for idempotency - re-emits same bytes on every reconcile after first install.
3. Post-install Job (helm.sh/hook=post-install,post-upgrade) that calls Gitea's POST /api/v1/users/{admin}/tokens to mint a fresh PAT on first install, patches catalyst-gitea-token.data.token via kubectl. Job is gated on token=="" so it ONLY fires on first install (subsequent reconciles see the token, skip the Job render entirely).

RBAC: the minter SA gets get/patch/update on catalyst-gitea-token in catalyst-system + read-only on gitea/gitea-admin-secret. No cluster-wide permissions.

Bootstrap-kit slot 13 pin bumped 1.4.126 → 1.4.127.

Workstream 3 - keycloak realm verification
==========================================
Already deployed via PR #1271 (chart 1.5.0 with sovereignRealm.name parameterized) + PR #1279 (template envsubst plumb of SOVEREIGN_REALM_NAME). Confirmed live state on omantel chroot: SOVEREIGN_REALM_NAME=omantel is set on bootstrap-kit Kustomization postBuild.substitute. Awaiting Flux reconcile of latest main into the in-cluster Gitea (currently blocked on the same DNS pathology Workstream 2 addresses - gitea-mirror Job fails on Could not resolve host: github.com from worker-side pods).

Workstream 5 - bp-pdm-operator
==============================
Out of scope. TC-345 verifies a DoT cert on `pdm-1.openova.io:853` which is the central PDM (lives on contabo-mkt openova-private). The related per-Sovereign PDM CRs are already chart-side via products/catalyst/chart/templates/qa-fixtures/pdm-qa.yaml. The DoT-on-port-853 question is a contabo-side infra change handled separately.

Test plan
=========
- helm dependency build + helm template smoke render (offline) - passes for hcloud-ccm + cilium + catalyst chart changes.
- Live cluster verification deferred until CI publishes the new Blueprint OCI artifacts and Flux reconciles them onto omantel.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md

bp-hcloud-ccm

Catalyst Blueprint umbrella for hcloud-cloud-controller-manager — Hetzner Cloud's Kubernetes cloud-provider integration.

Why this Blueprint exists

Without a cloud-provider implementation, k3s nodes get providerID: k3s://<node-name> (the --cloud-provider=external default). Two consequences cascade from that:

1. Service-of-type-LoadBalancer stays <pending> forever. kube-controller-manager has no cloud integration to call out to. This is the root cause of clustermesh-apiserver being unable to migrate from NodePort to LB on omantel multi-region (qa-loop iter-12 Fix #53D + Fix #54 Workstream 1).
2. The scheduler and cnpg/cnpg-pair cannot pin Pods to Hetzner zones. topology.kubernetes.io/zone and the Hetzner-private-network IP fields are not populated until a CCM hot-fills them from the Hetzner API.

This Blueprint installs the upstream hetznercloud/hcloud-cloud-controller-manager chart and sources the Hetzner API token from the canonical flux-system/cloud-credentials Secret that cloud-init writes at Phase 0.
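
To make the first consequence concrete, here is a minimal sketch of a Service that would sit in <pending> without a CCM and that hcloud-CCM would materialise as a Hetzner Cloud LB. The Service name, selector, ports, and the fsn1 location are hypothetical placeholders, not values from this repository. The load-balancer.hetzner.cloud/location annotation comes from upstream hcloud-CCM; whether this chart sets a default location some other way is not stated here.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-lb                 # hypothetical name, for illustration only
  namespace: default
  annotations:
    # Tells hcloud-CCM where to create the Hetzner Cloud LB.
    # "fsn1" is a placeholder location.
    load-balancer.hetzner.cloud/location: fsn1
spec:
  type: LoadBalancer               # stays <pending> until a CCM reconciles it
  selector:
    app: example                   # hypothetical selector
  ports:
    - name: http
      port: 80
      targetPort: 8080
```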

Wiring summary

infra/hetzner/cloudinit-control-plane.tftpl
  → flux-system/cloud-credentials  (key: hcloud-token)
       │
       │  Flux `valuesFrom`
       ▼
clusters/<sovereign>/bootstrap-kit/55-bp-hcloud-ccm.yaml HelmRelease
  → bp-hcloud-ccm chart (this directory)
       │
       │  templates/hcloud-token-secret.yaml
       ▼
kube-system/hcloud-token  (key: token)
       │
       │  upstream subchart's env.HCLOUD_TOKEN.valueFrom.secretKeyRef
       ▼
kube-system/hcloud-cloud-controller-manager Pod
  → reads HCLOUD_TOKEN, calls Hetzner Cloud API to:
      a) flip every Node's providerID from k3s://<name>
                                       to hcloud://<server-id>
      b) hot-fill .status.addresses (InternalIP from private network IF
                                     networkID is set, ExternalIP always)
      c) materialise type=LoadBalancer Services as Hetzner Cloud LBs
         (e.g. clustermesh-apiserver svc → real `hcloud://...` LB IP)
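
As a rough illustration of the middle hop in the diagram above, templates/hcloud-token-secret.yaml could look something like the following. This is a sketch, not the chart's actual file: the Secret name, namespace, key, and the hcloudCcm.hcloudToken values path are taken from this README, while the use of `required` to fail the render on a missing token is an assumption.

```yaml
# templates/hcloud-token-secret.yaml (illustrative sketch only)
apiVersion: v1
kind: Secret
metadata:
  name: hcloud-token
  namespace: kube-system
type: Opaque
stringData:
  # Key "token" is the key the upstream subchart's secretKeyRef reads.
  # `required` (assumed here) aborts the render if Flux valuesFrom
  # did not inject hcloudCcm.hcloudToken.
  token: {{ required "hcloudCcm.hcloudToken must be set" .Values.hcloudCcm.hcloudToken | quote }}
```

Failing the render early would surface a missing cloud-credentials key as a HelmRelease error rather than as a crash-looping CCM Pod; a chart that prefers to skip the Secret silently would guard it with an `if` instead.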

Per-Sovereign overlay

# clusters/<sovereign>/bootstrap-kit/55-bp-hcloud-ccm.yaml
spec:
  valuesFrom:
    - kind: Secret
      name: cloud-credentials
      valuesKey: hcloud-token
      targetPath: hcloudCcm.hcloudToken
  values:
    hcloudCcm:
      networkID: ""  # or "12345678" if the Sovereign uses a Hetzner Network
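
For orientation, the valuesFrom entry above expects a Secret of roughly the following shape. The name, namespace, and key come from this README; the token value is a placeholder. Flux resolves valuesFrom references in the HelmRelease's own namespace, so this sketch assumes the HelmRelease lives in flux-system.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials
  namespace: flux-system
type: Opaque
stringData:
  # Placeholder only; cloud-init writes the real token at Phase 0.
  hcloud-token: "<hetzner-api-token>"
```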

ADR-0001 compliance

Per ADR-0001 §13 (cloud-direct architecture rule): every cloud-API call from inside the cluster is gated through a sanctioned operator. hcloud-CCM is the canonical operator for node providerID + LB materialisation; this Blueprint is the only path to that integration. Bespoke kubectl patch node providerID=... is forbidden.
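
For comparison with a hand-applied patch, the Node state that hcloud-CCM is expected to converge to looks roughly like the excerpt below. The node name, server ID, and region/zone values are placeholders, and the exact label set is an assumption based on upstream hcloud-CCM behaviour rather than anything verified on a Catalyst cluster.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: example-node                          # placeholder node name
  labels:
    topology.kubernetes.io/region: fsn1       # placeholder region
    topology.kubernetes.io/zone: fsn1-dc14    # placeholder zone
spec:
  # Set by hcloud-CCM from the Hetzner API; never patched by hand.
  providerID: hcloud://12345678               # placeholder server ID
```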