7b59292cad
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
25ef20a8e5
|
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.
Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
(legacy, served, not storage) and v1 (canonical, served, storage). The
shared schema means the 38 existing v1alpha1 files in platform/ +
products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
tagline interchangeable; category | family interchangeable; docs |
documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
observability, outputs, depends[].values, manifests.values, etc.
Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
category (25), family (20), docs (20), documentation (14+1), icon (25),
tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
§3. Those 5 files are fixed in this commit:
* platform/cert-manager-powerdns-webhook/blueprint.yaml
* platform/cert-manager-dynadot-webhook/blueprint.yaml
* platform/crossplane-claims/blueprint.yaml
* platform/powerdns/blueprint.yaml
* platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
--dry-run=server) against the new CRD.
Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.
Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2e9cfd4a57
|
fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule. CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2) - Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4. - Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27. - Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
19c06c63bc
|
fix(bp-cert-manager-dynadot-webhook): dedupe template labels (Closes #561) (#564)
deployment.yaml pod template included both selectorLabels and labels named templates; since selectorLabels is a strict subset of labels, this produced duplicate app.kubernetes.io/name and app.kubernetes.io/instance keys in the rendered pod template metadata — triggering the HelmRelease validation error "spec.values.metadata.labels has duplicate key". Remove the redundant selectorLabels include from the pod template (selector.matchLabels still uses selectorLabels correctly). Bump chart 1.1.0 → 1.1.1. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
ccc38987c2
|
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.
Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
API returns SetDnsResponse); change ResponseCode to json.Number (API returns
integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
- rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
- values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
and privateKeySecretRefName; add rbac.create comment for domain-solver
- certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
- clusterissuer.yaml: new template (skip-render default, enabled via overlay)
- deployment.yaml: add imagePullSecrets support (required for private GHCR)
- Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
- 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
- kustomization.yaml: add 49b entry
- infra/hetzner:
- variables.tf: add dynadot_managed_domains variable
- main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
- cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
Secret + apply it before Flux reconciles bootstrap-kit
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
5502d9aa48
|
feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291)
Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer
in bp-cert-manager by shipping the missing piece — a Go binary that
satisfies cert-manager's external webhook contract
(`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json.
Architecture
============
* `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with
pool-domain-manager and catalyst-dns). Encapsulates the api3.json
transport, command builders, response decoding, and the safe
read-modify-write semantics required to never accidentally wipe a
zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2`
variant is unexported.
* `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook
binary. Implements `Solver.Present` via the client's append-only
`AddRecord` path and `Solver.CleanUp` via the read-modify-write
`RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`)
rejects challenges for unmanaged apexes BEFORE any Dynadot call.
* `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm
wrapper. Templates Deployment + Service + APIService + serving
Certificate (CA chain via cert-manager Issuer self-signing) +
RBAC + ServiceAccount. Mirrors the standard cert-manager external-
webhook deployment shape.
* `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the
paired ClusterIssuer activates. The interim http01 issuer remains
templated as the rollback path.
Test results
============
core/pkg/dynadot-client — 7 tests PASS (race-clean)
core/cmd/cert-manager-dynadot-... — 9 tests PASS (race-clean)
Test coverage includes a Present/CleanUp round-trip against an
httptest fixture that models Dynadot's zone state, an explicit
unmanaged-domain rejection, a regression preserving a pre-existing
CNAME across the DNS-01 round-trip (the zone-wipe defence), and a
typed-error propagation test that surfaces `ErrInvalidToken` to
cert-manager so the controller will retry.
Helm template smoke render
==========================
`helm template` against the new chart with default values yields 12
resources / 424 lines (APIService, Certificate, ClusterRoleBinding,
Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The
modified bp-cert-manager chart still renders both ClusterIssuers
(`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default
values; flipping `certManager.issuers.dns01.enabled=false` is the
clean rollback.
Smoke command (post-deploy)
===========================
kubectl get apiservices.apiregistration.k8s.io \
v1alpha1.acme.dynadot.openova.io
# Issue a *.<sovereign>.<pool> wildcard cert and watch the
# Order/Challenge progress through cert-manager.
CI
==
`.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the
pool-domain-manager-build pattern (cosign keyless signing, SBOM
attestation, GHCR push at `ghcr.io/openova-io/openova/cert-manager-
dynadot-webhook:<sha>`). Triggered by changes to either the binary or
the shared dynadot-client package.
Closes #159
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|