openova

Author	SHA1	Message	Date
e3mrah	25ef20a8e5	feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112) Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io) from a YAML-loaded contract to a schema-validated CRD. Schema design: - Two versions served from one inline schema (YAML anchors): v1alpha1 (legacy, served, not storage) and v1 (canonical, served, storage). The shared schema means the 38 existing v1alpha1 files in platform/ + products/ continue to validate; migration to v1 is a follow-up slice. - Required at this layer: spec.version (strict semver pattern), spec.card.title (minLength=1). - Card variants accommodated as documented: summary | description | tagline interchangeable; category | family interchangeable; docs | documentation interchangeable. All optional except title. - visibility enum: listed | unlisted | private. - placementSchema.modes enum: single-region | active-active | active-hotstandby — same set Application.spec.placement validates against. - depends[].blueprint pattern accepts both bp-* and bare-name (legacy). - manifests accepts both manifests.chart (legacy short-form) AND manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart, Kustomize, OAM. - rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'. - x-kubernetes-preserve-unknown-fields liberally on configSchema (per-Blueprint JSON Schema is arbitrary by design), card, manifests, owner, observability, outputs, depends[].values, manifests.values, etc. Existing files validation: - Surveyed all blueprint.yaml in platform/ + products/ (59 files). - Card field frequency: title (59), summary (38), description (20+1), category (25), family (20), docs (20), documentation (14+1), icon (25), tags (14), license (14). - 54 of 59 files passed the schema unchanged. 
- 5 files used `depends: [- bp-name]` (string form) instead of the canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING §3. Those 5 files are fixed in this commit: * platform/cert-manager-powerdns-webhook/blueprint.yaml * platform/cert-manager-dynadot-webhook/blueprint.yaml * platform/crossplane-claims/blueprint.yaml * platform/powerdns/blueprint.yaml * platform/self-sovereign-cutover/blueprint.yaml - After fix: ALL 59 files pass server-side validation (kubectl apply --dry-run=server) against the new CRD. Negative validation (tests/blueprint-sample-invalid.yaml): - spec.version "1.3" → semver pattern - spec.card missing → required - spec.card.title missing → required - spec.visibility "secret" → enum listed|unlisted|private - spec.placementSchema.modes "round-robin" → enum - spec.depends[0] bare string "bp-bad-string" → must be object - spec.depends[1].blueprint "Foo" → pattern fails (uppercase) - spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$' All 8 seeded vectors rejected. This commit ONLY touches new CRD + test files + the 5 depends fixes — leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a parallel agent and the .claude/worktrees/ directory untouched. Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4, docs/BLUEPRINT-AUTHORING.md §3 Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 22:25:08 +04:00
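The checks described in commit 25ef20a8e5 can be approximated outside the cluster. The following is a minimal Python sketch, not the CRD schema itself: the `rotation[].ttl` pattern and both enums are quoted from the commit message, while the function names `normalize_depends` and `check_spec` are hypothetical. The semver and `depends[].blueprint` patterns are not spelled out in the message, so they are deliberately left out rather than guessed.

```python
import re

# Quoted from the commit message.
TTL_PATTERN = re.compile(r"^[0-9]+(s|m|h|d)$")
VISIBILITY = {"listed", "unlisted", "private"}
PLACEMENT_MODES = {"single-region", "active-active", "active-hotstandby"}


def normalize_depends(depends):
    """Rewrite legacy string entries into the canonical object form.

    Hypothetical helper: mirrors the manual fix applied to the 5 files.
    """
    normalized = []
    for entry in depends:
        if isinstance(entry, str):
            normalized.append({"blueprint": entry})
        else:
            normalized.append(dict(entry))
    return normalized


def check_spec(spec):
    """Return a list of violations; an empty list means the spec passed."""
    errors = []
    card = spec.get("card")
    if not isinstance(card, dict):
        errors.append("spec.card: required")
    elif not card.get("title"):
        errors.append("spec.card.title: required")
    if "visibility" in spec and spec["visibility"] not in VISIBILITY:
        errors.append("spec.visibility: not in enum")
    for mode in spec.get("placementSchema", {}).get("modes", []):
        if mode not in PLACEMENT_MODES:
            errors.append(f"spec.placementSchema.modes: {mode!r} not in enum")
    for i, dep in enumerate(spec.get("depends", [])):
        if not isinstance(dep, dict):
            errors.append(f"spec.depends[{i}]: must be object")
    for i, rot in enumerate(spec.get("rotation", [])):
        if not TTL_PATTERN.match(str(rot.get("ttl", ""))):
            errors.append(f"spec.rotation[{i}].ttl: pattern mismatch")
    return errors
```

Running `normalize_depends` before `check_spec` reproduces the string-form to object-form fix; the server-side `kubectl apply --dry-run=server` check named in the commit remains the authoritative validation.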
e3mrah	2e9cfd4a57	fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643) * fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. 
Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. 
Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule. CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2) - Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4. - Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27. - Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-02 23:52:42 +04:00
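The registries.yaml fix in commit 2e9cfd4a57 comes down to where the Harbor project name sits relative to `/v2/`. The Python sketch below only illustrates the two URL shapes quoted in the message; the function names are hypothetical, and it does not model k3s `rewrite` rules themselves.

```python
def harbor_manifest_url(host, project, repository, reference):
    """Shape Harbor proxy-cache projects actually serve: /v2/<proj>/<repo>."""
    return f"https://{host}/v2/{project}/{repository}/manifests/{reference}"


def prefixed_manifest_url(host, project, repository, reference):
    """Shape produced by the original mirror endpoint: /<proj>/v2/<repo>.

    Per the commit message, Harbor answers this with its SPA HTML (status 200),
    which containerd rejects as an unexpected media type.
    """
    return f"https://{host}/{project}/v2/{repository}/manifests/{reference}"


if __name__ == "__main__":
    args = ("harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6")
    print("works: ", harbor_manifest_url(*args))
    print("broken:", prefixed_manifest_url(*args))
```

The first URL matches the one the commit reports as manually verified with `curl`; the second matches the shape the earlier PR #557 configuration was expected to produce.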
e3mrah	19c06c63bc	fix(bp-cert-manager-dynadot-webhook): dedupe template labels (Closes #561) (#564) deployment.yaml pod template included both selectorLabels and labels named templates; since selectorLabels is a strict subset of labels, this produced duplicate app.kubernetes.io/name and app.kubernetes.io/instance keys in the rendered pod template metadata — triggering the HelmRelease validation error "spec.values.metadata.labels has duplicate key". Remove the redundant selectorLabels include from the pod template (selector.matchLabels still uses selectorLabels correctly). Bump chart 1.1.0 → 1.1.1. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:50:11 +04:00
e3mrah	ccc38987c2	fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558) Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49) but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook (groupName: acme.dynadot.openova.io). Without slot 49b the APIService for acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely → HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL. Changes: - core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response, API returns SetDnsResponse); change ResponseCode to json.Number (API returns integer 0, not string "0"); update tests to match real API response format - platform/cert-manager-dynadot-webhook/chart: - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix) - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.* and privateKeySecretRefName; add rbac.create comment for domain-solver - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64) - clusterissuer.yaml: new template (skip-render default, enabled via overlay) - deployment.yaml: add imagePullSecrets support (required for private GHCR) - Chart.yaml: bump to 1.1.0 - clusters/_template/bootstrap-kit: - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer) - kustomization.yaml: add 49b entry - infra/hetzner: - variables.tf: add dynadot_managed_domains variable - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials Secret + apply it before Flux reconciles bootstrap-kit Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:42:13 +04:00
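The DNS-01 JSON bug in commit ccc38987c2 was a decoding mismatch: the wrong envelope key and a `ResponseCode` typed as a string when the API returns an integer. The actual fix is in the Go client (`json.Number`); the Python sketch below only illustrates the same tolerant-decoding idea. The function name `response_code` is hypothetical, and only the `SetDnsResponse` key and the `ResponseCode` field are taken from the commit message.

```python
import json


def response_code(body, key="SetDnsResponse"):
    """Extract ResponseCode, accepting both integer 0 and string "0".

    Raises KeyError when the envelope key is missing, which is how a
    decoder keyed on the old SetDns2Response name would have failed.
    """
    payload = json.loads(body)
    if key not in payload:
        raise KeyError(f"expected top-level key {key!r}, got {sorted(payload)}")
    return int(payload[key]["ResponseCode"])


if __name__ == "__main__":
    print(response_code('{"SetDnsResponse": {"ResponseCode": 0}}'))
    print(response_code('{"SetDnsResponse": {"ResponseCode": "0"}}'))
```

Normalizing to `int` at the boundary keeps the rest of the caller independent of which representation the API happens to return.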
e3mrah	5502d9aa48	feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291) Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer in bp-cert-manager by shipping the missing piece — a Go binary that satisfies cert-manager's external webhook contract (`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json. Architecture ============ * `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with pool-domain-manager and catalyst-dns). Encapsulates the api3.json transport, command builders, response decoding, and the safe read-modify-write semantics required to never accidentally wipe a zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2` variant is unexported. * `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook binary. Implements `Solver.Present` via the client's append-only `AddRecord` path and `Solver.CleanUp` via the read-modify-write `RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`) rejects challenges for unmanaged apexes BEFORE any Dynadot call. * `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm wrapper. Templates Deployment + Service + APIService + serving Certificate (CA chain via cert-manager Issuer self-signing) + RBAC + ServiceAccount. Mirrors the standard cert-manager external-webhook deployment shape. * `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the paired ClusterIssuer activates. The interim http01 issuer remains templated as the rollback path. Test results ============ core/pkg/dynadot-client — 7 tests PASS (race-clean) core/cmd/cert-manager-dynadot-... 
— 9 tests PASS (race-clean) Test coverage includes a Present/CleanUp round-trip against an httptest fixture that models Dynadot's zone state, an explicit unmanaged-domain rejection, a regression preserving a pre-existing CNAME across the DNS-01 round-trip (the zone-wipe defence), and a typed-error propagation test that surfaces `ErrInvalidToken` to cert-manager so the controller will retry. Helm template smoke render ========================== `helm template` against the new chart with default values yields 12 resources / 424 lines (APIService, Certificate, ClusterRoleBinding, Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The modified bp-cert-manager chart still renders both ClusterIssuers (`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default values; flipping `certManager.issuers.dns01.enabled=false` is the clean rollback. Smoke command (post-deploy) =========================== kubectl get apiservices.apiregistration.k8s.io \ v1alpha1.acme.dynadot.openova.io # Issue a *.<sovereign>.<pool> wildcard cert and watch the # Order/Challenge progress through cert-manager. CI == `.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the pool-domain-manager-build pattern (cosign keyless signing, SBOM attestation, GHCR push at `ghcr.io/openova-io/openova/cert-manager-dynadot-webhook:<sha>`). Triggered by changes to either the binary or the shared dynadot-client package. Closes #159 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:37:47 +04:00
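The zone-wipe defence described in commit 5502d9aa48 rests on two ideas: check the domain allowlist before touching the API, and remove only the challenge record while leaving every other record in place. The Python sketch below models both against an in-memory zone. It is not the Go client's API: the function names, the `(type, subdomain, value)` record tuple, and the example domains are all hypothetical.

```python
def is_managed(fqdn, managed_domains):
    """True if fqdn equals, or is a subdomain of, an allowlisted apex."""
    name = fqdn.rstrip(".").lower()
    return any(name == d or name.endswith("." + d) for d in managed_domains)


def add_record(zone, record):
    """Append-only add: existing records are never dropped or reordered."""
    return list(zone) if record in zone else list(zone) + [record]


def remove_sub_record(zone, record):
    """Read-modify-write removal: drop only the matching record."""
    return [r for r in zone if r != record]


def present(zone, fqdn, token, managed_domains):
    """Publish a DNS-01 TXT record, rejecting unmanaged domains up front."""
    if not is_managed(fqdn, managed_domains):
        raise PermissionError(f"{fqdn} is not under a managed domain")
    return add_record(zone, ("TXT", fqdn, token))


if __name__ == "__main__":
    managed = ["example.org"]  # hypothetical allowlist
    zone = [("CNAME", "www.example.org", "target.example.net")]
    challenge = "_acme-challenge.sov.example.org"
    zone = present(zone, challenge, "token-value", managed)
    zone = remove_sub_record(zone, ("TXT", challenge, "token-value"))
    print(zone)  # the pre-existing CNAME is still present
```

This mirrors the regression named in the test results, where a pre-existing CNAME must survive a full Present/CleanUp round-trip.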

5 Commits