openova/platform/vpa/chart/values.yaml
e3mrah 487ebebda2
fix(bp-vpa): drop registry.k8s.io/ prefix in repository (upstream prepends it) (#641)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.
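
A minimal sketch of the boot-time lookup as a cloud-config step. The
metadata URL is the one quoted above; the k3s kubeconfig path and loopback
server address are k3s defaults, and the output path
/root/kubeconfig-public.yaml is a hypothetical name, not taken from
cloudinit-control-plane.tftpl:

```yaml
#cloud-config
runcmd:
  - |
    # Ask the Hetzner metadata service for this node's public IPv4 at boot,
    # so nothing about the server's own address is templated at plan time.
    CP_IP="$(curl -fsS http://169.254.169.254/hetzner/v1/metadata/public-ipv4)"
    # Assumed step: write a copy of the kubeconfig whose server: field points
    # at the CP public IP instead of loopback. $CP_IP is written without
    # braces so that templatefile() does not try to interpolate it.
    sed "s#https://127.0.0.1:6443#https://$CP_IP:6443#" \
      /etc/rancher/k3s/k3s.yaml > /root/kubeconfig-public.yaml
```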

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401
unauthenticated. catalyst-api never received the kubeconfig, Phase 1
helmwatch never started, and the wizard's Jobs page stayed in PENDING
forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.
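
Step 5 could look roughly like the fragment below. This is a sketch, not
the actual api-deployment.yaml: the Secret name and env var come from the
list above, while the key name `token` is an assumption.

```yaml
# Fragment of the catalyst-api container spec (illustrative only).
env:
  - name: CATALYST_HARBOR_ROBOT_TOKEN
    valueFrom:
      secretKeyRef:
        name: harbor-robot-token   # Secret in the catalyst namespace
        key: token                 # assumed key name
        optional: true             # Pod still starts if the Secret is absent
```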

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
req.HarborRobotToken was only stamped from p.HarborRobotToken inside
Provision(), and Validate() runs in the handler BEFORE Provision() gets
the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name comes AFTER the /v2/ API root, as the first segment of
the image path, not as a path prefix before /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
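
A minimal sketch of the rewrite form for the Docker Hub mirror, using the
k3s registries.yaml `rewrite` key. The regex and the auth placeholders are
assumptions for illustration, not the contents of the real template:

```yaml
# /etc/rancher/k3s/registries.yaml (illustrative only)
mirrors:
  docker.io:
    endpoint:
      - "https://harbor.openova.io"      # bare Harbor host, no project path
    rewrite:
      # rancher/mirrored-pause -> proxy-dockerhub/rancher/mirrored-pause,
      # so the request lands on /v2/proxy-dockerhub/rancher/mirrored-pause/...
      "^(.*)$": "proxy-dockerhub/$1"
configs:
  "harbor.openova.io":
    auth:
      username: "<robot-account-name>"   # placeholder
      password: "<harbor_robot_token>"   # placeholder for the templated token
```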

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. The subchart's default `.image.registry`
still points at registry.k8s.io, which the new Sovereign's containerd
routes through harbor.openova.io/v2/proxy-k8s/... via the
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
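
For reference, the values change for one component, shown as a sketch with
the old line left as a comment (the updater and admission controller
follow the same pattern):

```yaml
vertical-pod-autoscaler:
  recommender:
    image:
      # Before: the subchart renders registry + "/" + repository, giving
      # registry.k8s.io/registry.k8s.io/autoscaling/vpa-recommender:1.5.0
      # repository: registry.k8s.io/autoscaling/vpa-recommender
      # After: repository only; the subchart's default registry is
      # prepended exactly once.
      repository: autoscaling/vpa-recommender
      tag: "1.5.0"
```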

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:32:35 +04:00

# Catalyst Blueprint umbrella metadata — the upstream chart is resolved as
# a Helm subchart via Chart.yaml `dependencies:`. Catalyst-curated values
# under the `vertical-pod-autoscaler:` key flow into the upstream subchart
# unchanged.
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every operationally-
# meaningful value is configurable; cluster overlays in clusters/<sovereign>/
# may override any of these without rebuilding the Blueprint OCI artifact.
catalystBlueprint:
  upstream:
    chart: vertical-pod-autoscaler
    version: "11.1.1"
    repo: "https://cowboysysop.github.io/charts/"
# ─── Upstream chart values (subchart key: vertical-pod-autoscaler) ───────
# Note: the cowboysysop chart name on disk is `vertical-pod-autoscaler` —
# Helm's umbrella convention requires the values key match that name.
vertical-pod-autoscaler:
  # Pin upstream image versions. DO NOT use floating tags, per
  # docs/INVIOLABLE-PRINCIPLES.md.
  recommender:
    enabled: true
    replicaCount: 1
    image:
      # Upstream cowboysysop chart prepends `.image.registry` (default
      # registry.k8s.io) to `.image.repository`, so we MUST NOT include
      # the registry hostname in repository: the rendered image would
      # be `registry.k8s.io/registry.k8s.io/autoscaling/...` (doubled
      # prefix) and pulls fail with "image not found" (caught live on
      # otech26).
      repository: autoscaling/vpa-recommender
      tag: "1.5.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 50m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 1Gi
    # Recommender prefers metrics-server but will also scrape Prometheus
    # when configured. The default (metrics-server only) is fine for a
    # fresh Sovereign; per-Sovereign overlays MAY add Prometheus extra
    # args once bp-mimir reconciles.
    extraArgs: {}
    # ServiceMonitor / PrometheusRule: DEFAULT FALSE per
    # docs/BLUEPRINT-AUTHORING.md §11.2. monitoring.coreos.com/v1 CRDs
    # ship with kube-prometheus-stack, which is an Application-tier
    # Blueprint. Operator opts in via per-cluster overlay.
    metrics:
      serviceMonitor:
        enabled: false
      prometheusRule:
        enabled: false

  updater:
    enabled: true
    replicaCount: 1
    image:
      repository: autoscaling/vpa-updater
      tag: "1.5.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 512Mi
    # Per-Sovereign overlays MAY tune `--eviction-tolerance` etc.
    extraArgs: {}
    metrics:
      serviceMonitor:
        enabled: false
      prometheusRule:
        enabled: false

  admissionController:
    enabled: true
    replicaCount: 1
    image:
      repository: autoscaling/vpa-admission-controller
      tag: "1.5.0"
      pullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 512Mi
    # The admission controller serves a MutatingWebhookConfiguration over
    # TLS. The upstream chart bootstraps a self-signed CA + cert via an
    # init job, which is fine for Catalyst because the webhook is
    # cluster-internal and not exposed externally. Per-Sovereign overlays
    # MAY swap to cert-manager-issued certs if preferred.
    generateCertificate: true
    metrics:
      serviceMonitor:
        enabled: false
      prometheusRule:
        enabled: false

  # CRDs: VPA's CRDs (verticalpodautoscalers,
  # verticalpodautoscalercheckpoints) ship with the chart. `helm install`
  # creates them. Per docs/INVIOLABLE-PRINCIPLES.md #3, we keep CRD
  # lifecycle managed by the chart; Flux will surface CRD upgrade
  # conflicts as InstallFailed/UpgradeFailed events that the operator
  # resolves explicitly (no silent drift).
  crds:
    install: true
    keep: true

  # RBAC: chart manages its own ClusterRole + ServiceAccount.
  rbac:
    create: true

# Default UpdateMode for autogenerated VPAs is `Off` (recommend only).
# Per docs/INVIOLABLE-PRINCIPLES.md #1: Catalyst doesn't auto-enable
# mutation of customer workloads — SREs opt in per-workload via a
# VerticalPodAutoscaler CR with `updatePolicy.updateMode: Auto`.
# (This setting is not consumed by the upstream chart directly — it
# documents intent. Per-workload VPA CRs override it.)
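#
# A minimal sketch of such a per-workload opt-in, assuming a hypothetical
# Deployment named `my-app` (not part of this chart):
#
#   apiVersion: autoscaling.k8s.io/v1
#   kind: VerticalPodAutoscaler
#   metadata:
#     name: my-app
#   spec:
#     targetRef:
#       apiVersion: apps/v1
#       kind: Deployment
#       name: my-app
#     updatePolicy:
#       updateMode: "Auto"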
# ─── Catalyst overlay values (consumed by templates/ in this chart) ──────
# Reserved for Catalyst-side overlays (NetworkPolicy) added in a follow-up
# PR once bp-vpa is consumed in clusters/_template/.
vpaOverlay:
  networkPolicy:
    enabled: false
    admissionWebhookPort: 8000
    metricsPort: 8942
    metricsServerPort: 4443