* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service
PR #546 (Closes #542) introduced a dependency cycle:
    hcloud_server.control_plane.user_data → local.control_plane_cloud_init
    local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
    Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during the first end-to-end provisioning attempt for otech23.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs on
the CP node itself, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
    curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546: the kubeconfig `server:` field is rewritten
to the CP public IP, not the LB IP, which preserves the fix for the wizard
Jobs page getting stuck in PENDING. The graph cycle is gone.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra+api): wire handover_jwt_public_key end-to-end
The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wired it:
- main.tf templatefile() call did not pass the key → "vars map does not
  contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → the value would have been
  empty even with the template wired
Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:
    Error: Invalid function argument
    on main.tf line 170, in locals:
    170: control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.
Fix:
- main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
  server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
  when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json
variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface. The handover flow is simply not
wired on that Sovereign, which is a graceful degradation rather than a
hard failure.
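
The server-stamping pattern described above can be sketched as follows. This is
a minimal illustration, not the actual catalyst-api code: the `Request` type is
a trimmed-down stand-in, and `Name`, `decodeAndStamp`, and `tfvars` are
hypothetical names chosen for the example. It shows that a field tagged
`json:"-"` is never populated from client JSON and that an empty server value
simply flows through as "".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a hypothetical, trimmed-down stand-in for provisioner.Request.
type Request struct {
	Name string `json:"name"` // illustrative client-supplied field (assumed)

	// HandoverJWTPublicKey is server-stamped. The json:"-" tag makes
	// encoding/json skip it on decode, so a client cannot set it.
	HandoverJWTPublicKey string `json:"-"`
}

// decodeAndStamp decodes the client payload and then stamps the key from the
// server side. serverJWK is "" when no signer is configured.
func decodeAndStamp(body []byte, serverJWK string) (Request, error) {
	var r Request
	if err := json.Unmarshal(body, &r); err != nil {
		return Request{}, err
	}
	r.HandoverJWTPublicKey = serverJWK
	return r, nil
}

// tfvars builds the map that would be serialized into tofu.auto.tfvars.json.
// The key matches the variable name declared in variables.tf.
func tfvars(r Request) map[string]string {
	return map[string]string{
		"handover_jwt_public_key": r.HandoverJWTPublicKey,
	}
}

func main() {
	// The client tries to smuggle a key in; it is ignored on decode.
	body := []byte(`{"name":"demo","HandoverJWTPublicKey":"client-supplied"}`)
	r, err := decodeAndStamp(body, "server-side-jwk")
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(tfvars(r))
	fmt.Println(string(out))
}
```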
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(api): cloud-init kubeconfig postback must live outside RequireSession
The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.
Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned HTTP 401.
catalyst-api never received the kubeconfig, so Phase 1 helmwatch never
started and the wizard's Jobs page stayed in PENDING forever.
Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.
Caught live during the first end-to-end provisioning of otech23. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io
PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far was provisioned with an empty token in registries.yaml. containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and the pull fails with:
    Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...
cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.
Wiring (mirrors the GHCRPullToken pattern):
1. Provisioner.HarborRobotToken: read from the CATALYST_HARBOR_ROBOT_TOKEN
   env at New().
2. Stamped onto every Request in Provision() and Destroy() before
   writeTfvars.
3. Request.HarborRobotToken: server-stamped (json:"-"); never accepted
   from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
   (mirrored from openova-harbor; Reflector-managed on Sovereign
   clusters, copied per-namespace on Catalyst-Zero contabo) as
   CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
   still come up.
variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)
PR #638 added Validate() rejection for a missing harbor_robot_token, but
req.HarborRobotToken was only stamped from p.HarborRobotToken inside
Provision(), and Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned
    Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)
even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.
Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.
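
The ordering bug and its fix can be sketched as below. This is an illustration
under stated assumptions, not the real handler: `Request`, `createDeployment`,
and `createDeploymentBuggy` are simplified, hypothetical stand-ins, and the env
value is passed in as a plain argument instead of being read from the process
environment.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical, trimmed-down stand-in for provisioner.Request.
type Request struct {
	HarborRobotToken string
}

// Validate rejects a request that carries no Harbor robot token.
func (r Request) Validate() error {
	if r.HarborRobotToken == "" {
		return errors.New("Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)")
	}
	return nil
}

// createDeploymentBuggy mirrors the old order: Validate() runs before the
// token is stamped, so it fails even when envToken is set.
func createDeploymentBuggy(req Request, envToken string) error {
	if err := req.Validate(); err != nil {
		return err
	}
	req.HarborRobotToken = envToken // too late, never reached on failure
	return nil
}

// createDeployment mirrors the fixed order: stamp first, then validate.
func createDeployment(req Request, envToken string) error {
	req.HarborRobotToken = envToken
	return req.Validate()
}

func main() {
	fmt.Println("buggy order:", createDeploymentBuggy(Request{}, "robot-token"))
	fmt.Println("fixed order:", createDeployment(Request{}, "robot-token"))
}
```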
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>
PR #557 wrote registries.yaml with mirror endpoints like
    https://harbor.openova.io/proxy-dockerhub
expecting containerd to build URLs like
    https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6
But Harbor proxy-cache projects expose their API at
    https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name is the first segment after /v2/, ahead of the image path,
not a path prefix before /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
    "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever. Caught live during
otech24 and otech25.
Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:
    curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
    -> 200 application/vnd.docker.distribution.manifest.list.v2+json
This unblocks every Sovereign image pull through the central Harbor.
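
To make the URL shape concrete, here is a small sketch of what a per-mirror
rewrite does to the image path before the manifest URL is built. It only models
the string transformation; containerd performs the real request. The function
name `manifestURL` and the specific pattern `^(.*)$` with replacement
`proxy-dockerhub/$1` are assumptions for illustration and are not copied from
the actual registries.yaml.

```go
package main

import (
	"fmt"
	"regexp"
)

// manifestURL applies a rewrite (regex pattern -> replacement) to the image
// repository path and then builds the registry v2 manifest URL against the
// bare endpoint host.
func manifestURL(endpoint, pattern, replacement, repo, ref string) string {
	re := regexp.MustCompile(pattern)
	rewritten := re.ReplaceAllString(repo, replacement)
	return endpoint + "/v2/" + rewritten + "/manifests/" + ref
}

func main() {
	// Endpoint is the bare Harbor host; the rewrite prefixes the project.
	fmt.Println(manifestURL(
		"https://harbor.openova.io",
		"^(.*)$",             // assumed pattern: match the whole image path
		"proxy-dockerhub/$1", // assumed replacement: prepend the proxy project
		"rancher/mirrored-pause",
		"3.6",
	))
}
```

With these inputs the result has the same shape as the manually verified curl
URL: the project name sits directly after /v2/.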
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it
cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.
Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).
Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
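
The doubled prefix is easy to reproduce with a tiny sketch of how a chart that
prepends `.image.registry` to `.image.repository` renders the image reference.
`imageRef` is a hypothetical helper, not the subchart's actual template, and
`autoscaling/vpa-COMPONENT` is a placeholder for the elided repository name.

```go
package main

import "fmt"

// imageRef models a chart template that always prepends the registry to the
// repository when rendering the image reference.
func imageRef(registry, repository, tag string) string {
	return registry + "/" + repository + ":" + tag
}

func main() {
	const registry = "registry.k8s.io" // subchart default per the text above

	// Old override: repository already carries the registry, so it doubles.
	fmt.Println(imageRef(registry, "registry.k8s.io/autoscaling/vpa-COMPONENT", "1.5.0"))

	// Fixed override: repository only, the chart supplies the registry.
	fmt.Println(imageRef(registry, "autoscaling/vpa-COMPONENT", "1.5.0"))
}
```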
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB
CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.
CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)
- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign. Caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>