* fix(infra): break tofu cycle - resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:

    hcloud_server.control_plane.user_data → local.control_plane_cloud_init
    local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:

    Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:

    curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP, which preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:

- main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with:

    Error: Invalid function argument

      on main.tf line 170, in locals:
     170: control_plane_cloud_init = replace(templatefile(...

    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:

- main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign: degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie; it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth, because the operator's "Download kubeconfig" button needs the session cookie.

Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu - never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml: containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:

    Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24.

Wiring (mirrors the GHCRPullToken pattern):

1. Provisioner.HarborRobotToken: read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
3. Request.HarborRobotToken: server-stamped (json:"-"); never accepted from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor: Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up.

variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.

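Step 5 of the wiring could look roughly like the container `env` fragment below. This is a sketch, not the actual api-deployment.yaml: the Secret name `harbor-robot-token` and the env var name come from the commit message, while the container name and the key name `token` are assumptions.

```yaml
# Hypothetical fragment of the catalyst-api Deployment pod spec.
containers:
  - name: catalyst-api                 # container name is an assumption
    env:
      - name: CATALYST_HARBOR_ROBOT_TOKEN
        valueFrom:
          secretKeyRef:
            name: harbor-robot-token   # Secret in the catalyst namespace
            key: token                 # assumed key name
            optional: true             # Pod still starts if the Secret is absent
```

With `optional: true`, Kubernetes starts the container even when the Secret or key is missing; the environment variable is then simply left unset.
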
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision(), and Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned

    Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite - Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like

    https://harbor.openova.io/proxy-dockerhub

hoping containerd would build URLs like

    https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at

    https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6

(project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with:

    "unexpected media type text/html for sha256:... not found"

and pause-image / cilium / coredns pulls fail forever. Caught live during otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed.

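A minimal sketch of what such a registries.yaml might look like, using the k3s `mirrors.<registry>.rewrite` map. The project names `proxy-dockerhub` and `proxy-k8s` are taken from the URLs in these commit messages; the catch-all regex, the robot account name, and the token placeholder are assumptions, not the contents of the actual template.

```yaml
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"      # bare Harbor host, no project path
    rewrite:
      "^(.*)$": "proxy-dockerhub/$1"     # prefix every image path with the project
  registry.k8s.io:
    endpoint:
      - "https://harbor.openova.io"
    rewrite:
      "^(.*)$": "proxy-k8s/$1"
configs:
  "harbor.openova.io":
    auth:
      username: "robot$example"          # hypothetical robot account name
      password: "<harbor robot token>"   # placeholder for the templated token
```

Under these assumptions, a pull of `rancher/mirrored-pause:3.6` should resolve to a path beginning with `/v2/proxy-dockerhub/rancher/mirrored-pause/`, which is the project-prefixed shape Harbor expects.
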
Verified manually:

    curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
    -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository - upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0`: doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
136 lines
4.7 KiB
YAML
# Catalyst Blueprint umbrella metadata: the upstream chart is resolved as
# a Helm subchart via Chart.yaml `dependencies:`. Catalyst-curated values
# under the `vertical-pod-autoscaler:` key flow into the upstream subchart
# unchanged.
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every
# operationally-meaningful value is configurable; cluster overlays in
# clusters/<sovereign>/ may override any of these without rebuilding the
# Blueprint OCI artifact.

catalystBlueprint:
  upstream:
    chart: vertical-pod-autoscaler
    version: "11.1.1"
    repo: "https://cowboysysop.github.io/charts/"

# ─── Upstream chart values (subchart key: vertical-pod-autoscaler) ───────
# Note: the cowboysysop chart name on disk is `vertical-pod-autoscaler`;
# Helm's umbrella convention requires the values key match that name.
vertical-pod-autoscaler:

  # Pin upstream image versions. DO NOT use floating tags per
  # docs/INVIOLABLE-PRINCIPLES.md.
  recommender:
    enabled: true
    replicaCount: 1
    image:
      # Upstream cowboysysop chart prepends `.image.registry` (default
      # registry.k8s.io) to `.image.repository`, so we MUST NOT include
      # the registry hostname in repository: the rendered image would
      # be `registry.k8s.io/registry.k8s.io/autoscaling/...` (doubled
      # prefix) and pulls fail with "image not found" (caught live on
      # otech26).
      repository: autoscaling/vpa-recommender
      tag: "1.5.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 50m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 1Gi
    # Recommender prefers metrics-server but will also scrape Prometheus
    # when configured. The default (metrics-server only) is fine for a
    # fresh Sovereign; per-Sovereign overlays MAY add Prometheus extra
    # args once bp-mimir reconciles.
    extraArgs: {}
    # ServiceMonitor / PrometheusRule: DEFAULT FALSE per
    # docs/BLUEPRINT-AUTHORING.md §11.2. monitoring.coreos.com/v1 CRDs
    # ship with kube-prometheus-stack, which is an Application-tier
    # Blueprint. Operator opts in via per-cluster overlay.
    metrics:
      serviceMonitor:
        enabled: false
      prometheusRule:
        enabled: false

  updater:
    enabled: true
    replicaCount: 1
    image:
      repository: autoscaling/vpa-updater
      tag: "1.5.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 512Mi
    # Per-Sovereign overlays MAY tune `--eviction-tolerance` etc.
    extraArgs: {}
    metrics:
      serviceMonitor:
        enabled: false
      prometheusRule:
        enabled: false

  admissionController:
    enabled: true
    replicaCount: 1
    image:
      repository: autoscaling/vpa-admission-controller
      tag: "1.5.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 512Mi
    # The admission controller serves a MutatingWebhookConfiguration over
    # TLS. The upstream chart bootstraps a self-signed CA + cert via an
    # init job, which is fine for Catalyst because the webhook is
    # cluster-internal and not exposed externally. Per-Sovereign overlays
    # MAY swap to cert-manager-issued certs if preferred.
    generateCertificate: true
    metrics:
      serviceMonitor:
        enabled: false
      prometheusRule:
        enabled: false

  # CRDs: VPA's CRDs (verticalpodautoscalers,
  # verticalpodautoscalercheckpoints) ship with the chart. `helm install`
  # creates them. Per docs/INVIOLABLE-PRINCIPLES.md #3, we keep CRD
  # lifecycle managed by the chart; Flux will surface CRD upgrade
  # conflicts as InstallFailed/UpgradeFailed events that the operator
  # resolves explicitly (no silent drift).
  crds:
    install: true
    keep: true

  # RBAC: chart manages its own ClusterRole + ServiceAccount.
  rbac:
    create: true

  # Default UpdateMode for autogenerated VPAs is `Off` (recommend only).
  # Per docs/INVIOLABLE-PRINCIPLES.md #1: Catalyst doesn't auto-enable
  # mutation of customer workloads; SREs opt in per-workload via a
  # VerticalPodAutoscaler CR with `updatePolicy.updateMode: Auto`.
  # (This setting is not consumed by the upstream chart directly; it
  # documents intent. Per-workload VPA CRs override it.)

# ─── Catalyst overlay values (consumed by templates/ in this chart) ──────
# Reserved for Catalyst-side overlays (NetworkPolicy) added in a follow-up
# PR once bp-vpa is consumed in clusters/_template/.
vpaOverlay:
  networkPolicy:
    enabled: false
    admissionWebhookPort: 8000
    metricsPort: 8942
    metricsServerPort: 4443
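
The per-workload opt-in described in the UpdateMode comment could look like the VerticalPodAutoscaler object below. It is an illustrative sketch only: the workload name and namespace are hypothetical, and nothing in this chart creates such an object.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app        # hypothetical workload
  namespace: example       # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Auto"     # set to "Off" to record recommendations without applying them
```

Keeping the opt-in in a separate object per workload means the values file above never has to change when an SRE enables or disables mutation for a single Deployment.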