openova/products/catalyst/bootstrap/api
e3mrah 1cd6c3f432
fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race (#1499)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.
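
The pre-registration loop described above can be sketched roughly as
follows. This is a minimal illustration, not the adapter's actual code:
the glueRegistrar interface, fakeRegistrar, errRateLimited and the
function names are hypothetical stand-ins, and the real adapter's types
are not shown in this commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// glueRegistrar is a hypothetical stand-in for the registrar calls the
// Dynadot adapter makes; the names are assumptions for this sketch.
type glueRegistrar interface {
	RegisterGlueRecord(host, ip string) error
	SetNS(domain string, nameservers []string) error
}

// isInBailiwick reports whether host is a child of domain, ignoring case
// and a trailing dot on either name.
func isInBailiwick(host, domain string) bool {
	h := strings.ToLower(strings.TrimSuffix(host, "."))
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	if h == "" || d == "" {
		return false
	}
	return strings.HasSuffix(h, "."+d)
}

// setNameservers registers glue for every out-of-bailiwick NS host first
// and only then calls SetNS. An empty glueIP disables pre-registration.
func setNameservers(r glueRegistrar, domain string, nameservers []string, glueIP string) error {
	if glueIP != "" {
		for _, ns := range nameservers {
			if isInBailiwick(ns, domain) {
				continue // child of the domain being set: skipped entirely
			}
			if err := r.RegisterGlueRecord(ns, glueIP); err != nil {
				return err // fail fast: SetNS must not run
			}
		}
	}
	return r.SetNS(domain, nameservers)
}

// fakeRegistrar records calls so the ordering can be inspected.
type fakeRegistrar struct {
	calls  []string
	failOn string
}

var errRateLimited = errors.New("rate limited")

func (f *fakeRegistrar) RegisterGlueRecord(host, ip string) error {
	f.calls = append(f.calls, "register:"+host)
	if host == f.failOn {
		return errRateLimited
	}
	return nil
}

func (f *fakeRegistrar) SetNS(domain string, nameservers []string) error {
	f.calls = append(f.calls, "set_ns:"+domain)
	return nil
}

func main() {
	f := &fakeRegistrar{}
	err := setNameservers(f, "omani.homes",
		[]string{"ns1.openova.io", "ns1.omani.homes"}, "192.0.2.10")
	fmt.Println(f.calls, err)
}
```

Returning the registrar error as-is, rather than wrapping it, mirrors
the "surfaces ErrRateLimited unwrapped" expectation in the tests below.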

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates with the bearer token from
    the kubeconfig, and the connection goes through the public
    Hetzner LB, which terminates HTTPS. TLS verification is skipped
    only for mothership informers reading Sovereign
    HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.
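
A minimal sketch of the config change, assuming local stand-in types.
restConfig and tlsClientConfig below only mirror the field names this
commit mentions (Insecure, CAData, CAFile); they are not client-go's
types, and skipTLSVerify is a hypothetical helper name. The CA fields
are cleared together with setting Insecure because client-go refuses a
config that combines a root CA with the insecure flag.

```go
package main

import "fmt"

// tlsClientConfig is a stand-in that mirrors the three fields named in
// this commit; it is NOT client-go's rest.TLSClientConfig.
type tlsClientConfig struct {
	Insecure bool
	CAData   []byte
	CAFile   string
}

// restConfig is a stand-in for the REST client configuration built from
// a kubeconfig. Field names are assumptions for illustration.
type restConfig struct {
	Host            string
	BearerToken     string
	TLSClientConfig tlsClientConfig
}

// skipTLSVerify turns off server certificate verification and drops any
// CA material, leaving bearer-token authentication untouched.
func skipTLSVerify(cfg *restConfig) {
	cfg.TLSClientConfig.Insecure = true
	cfg.TLSClientConfig.CAData = nil
	cfg.TLSClientConfig.CAFile = ""
}

func main() {
	cfg := &restConfig{
		Host:        "https://sovereign.example:6443", // placeholder host
		BearerToken: "token-from-kubeconfig",          // placeholder token
		TLSClientConfig: tlsClientConfig{
			CAData: []byte("self-signed k3s CA bytes"),
		},
	}
	skipTLSVerify(cfg)
	fmt.Println(cfg.TLSClientConfig.Insecure, len(cfg.TLSClientConfig.CAData), cfg.BearerToken != "")
}
```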

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

* fix(sovereign-tls): escape $ in tls-restart Job so Flux doesn't eat the bash vars

Root cause caught on prov t101.omani.works (c9df5eed1c1ba6cf, 2026-05-15):

The cilium-envoy-tls-restart Job's shell command uses bash variables
${SECRET_NS}, ${SECRET_NAME}, ${DS_NS}, ${DS_NAME}, ${tls_crt}, ${i}.
Flux's postBuild.substitute processes ${...} in the YAML BEFORE the
Job manifest lands in the cluster, and replaces every $-reference that
isn't in the Kustomization's substituteFrom map with an empty string.

Result on prov t101 (T+13m, mothership flipped status=ready):

  Job logs: "[tls-restart] waiting for / with non-empty tls.crt"
                                      ^^^ — namespace and name both empty

  Command becomes: `kubectl get secret -n "" "" --ignore-not-found ...`
  → polls a nonexistent secret forever
  → cilium-operator never gets the rollout-restart
  → CiliumEnvoyConfig's additionalAddresses.socketAddress: 0.0.0.0:30443
    bind never lands
  → cilium-envoy host:30443 stays unbound
  → Hetzner LB targets stay unhealthy on 30080/30443
  → console.<fqdn> serves HTTP 000 indefinitely
  → mothership's "Handover gate" timeout fires AT THE WRONG TIME — flips
    deployment status=ready before TLS is actually serving

The "Sovereign was up at t101" reading we saw briefly was a transient
TRAEFIK fallback cert from upstream during cert-issuance, NOT the
Sovereign envoy.

Fix: escape every bash variable reference inside the script as $$VAR so
Flux postBuild.substitute emits a literal $VAR which bash then evaluates
correctly at Job runtime. SOVEREIGN_FQDN in YAML labels stays as
${SOVEREIGN_FQDN} because that IS a Flux substitute (kept intentionally).
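
The effect of the escaping can be illustrated with a simplified model
of the substitution step. This is not Flux's implementation: the
substitute function and its regular expression are assumptions that
only reproduce the behaviour described above (an unknown ${VAR} becomes
an empty string, $$ becomes a literal $).

```go
package main

import (
	"fmt"
	"regexp"
)

// substRe matches either an escaped dollar ($$) or a ${NAME} reference.
// The $$ alternative comes first so it wins at the same position.
var substRe = regexp.MustCompile(`\$\$|\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// substitute is a simplified model of the behaviour described in this
// commit: ${NAME} is replaced from vars (empty when missing) and $$ is
// emitted as a single literal $.
func substitute(s string, vars map[string]string) string {
	return substRe.ReplaceAllStringFunc(s, func(m string) string {
		if m == "$$" {
			return "$"
		}
		name := m[2 : len(m)-1] // strip the leading "${" and trailing "}"
		return vars[name]       // missing keys yield ""
	})
}

func main() {
	vars := map[string]string{"SOVEREIGN_FQDN": "t101.omani.works"}

	// Unescaped: both bash variables are blanked before the Job runs.
	fmt.Println(substitute(`kubectl get secret -n "${SECRET_NS}" "${SECRET_NAME}"`, vars))

	// Escaped: bash receives $SECRET_NS and $SECRET_NAME at Job runtime.
	fmt.Println(substitute(`kubectl get secret -n "$$SECRET_NS" "$$SECRET_NAME"`, vars))

	// Intentional substitution still works for known keys.
	fmt.Println(substitute(`console.${SOVEREIGN_FQDN}`, vars))
}
```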

This is yet another recurrence of "sibling deps lost / cilium-envoy host
bind missing / fresh prov console=000" on the same code path. Fixes so far:
  PR #1431 — derive HR dependsOn from live watcher
  PR #1470 — persist DependsOn on every event
  PR #1494 — restart cilium-operator BEFORE cilium-envoy on first install
  PR #1497 — skip TLS verify on Sovereign k3s self-signed CA
  THIS  — escape $VAR as $$VAR in Job command so Flux doesn't blank them

Each prior PR fixed a layer above the Job's own correctness. The Job
itself has been broken on fresh provs ever since the cilium-operator
restart line was added.

* fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race

Real architectural fix for the recurring "sibling deps lost on every fresh
provision" regression. PR #1431, PR #1470, PR #1497 each patched a layer
above the actual gap: the per-event emit path at helmwatch.go:1525 had
the unstructured HelmRelease in scope but THREW AWAY spec.dependsOn before
emitting the provisioner.Event. The bridge then wrote Job.DependsOn=[]
on every event, relying on a pre-existing seed having populated deps —
which never happened on fresh provs because the watcher's initial-list
sync (T+2m, right after tofu) fires with 0 HRs (Flux hasn't installed
anything yet).

The fix walks the data end-to-end:

  provisioner.Event   gains DependsOn []string
  helmwatch.processEvent  populates DependsOn: extractDependsOn(u) on
                          every PhaseComponent emit (the unstructured
                          HelmRelease was already in scope, just being
                          dropped at the event boundary)
  spawnSecondaryRegionWatchers  region-prefixes each entry so secondary
                                Jobs (install-<region>:<chart>) wire to
                                intra-region siblings, not bare primary
                                names
  Bridge.OnProvisionerEvent  passes ev.DependsOn to OnHelmReleaseEvent
  Bridge.OnHelmReleaseEvent  new dependsOn []string parameter; resolves
                             with 3-tier preference:
                               prior store value  >
                               event-carried (live HR spec.dependsOn) >
                               empty.
                             The prior-store branch keeps PR #1470's
                             pod-restart preservation; the event-carried
                             branch closes the fresh-prov gap.
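
The extraction and the 3-tier preference can be sketched as below. This
is an illustration under stated assumptions, not the code in this PR:
Event is a trimmed-down stand-in for provisioner.Event, the HelmRelease
is modelled as a plain map[string]any rather than an unstructured
object, and resolveDependsOn is a hypothetical name for the resolution
step inside Bridge.OnHelmReleaseEvent.

```go
package main

import "fmt"

// Event is a trimmed-down stand-in for provisioner.Event; only the
// field added by this change is modelled.
type Event struct {
	Component string
	DependsOn []string
}

// extractDependsOn reads spec.dependsOn[].name from a decoded
// HelmRelease object and returns nil when the field is absent.
func extractDependsOn(obj map[string]any) []string {
	spec, ok := obj["spec"].(map[string]any)
	if !ok {
		return nil
	}
	entries, ok := spec["dependsOn"].([]any)
	if !ok {
		return nil
	}
	var deps []string
	for _, e := range entries {
		m, ok := e.(map[string]any)
		if !ok {
			continue
		}
		if name, ok := m["name"].(string); ok && name != "" {
			deps = append(deps, name)
		}
	}
	return deps
}

// resolveDependsOn applies the preference order described above:
// prior store value > event-carried value > empty.
func resolveDependsOn(prior, fromEvent []string) []string {
	if len(prior) > 0 {
		return prior
	}
	if len(fromEvent) > 0 {
		return fromEvent
	}
	return []string{}
}

func main() {
	hr := map[string]any{
		"spec": map[string]any{
			"dependsOn": []any{
				map[string]any{"name": "cert-manager"},
				map[string]any{"name": "cilium"},
			},
		},
	}
	ev := Event{Component: "example-chart", DependsOn: extractDependsOn(hr)}

	// Fresh provision: nothing in the store yet, so the event value wins.
	fmt.Println(resolveDependsOn(nil, ev.DependsOn))

	// Pod restart: the previously stored value is preserved.
	fmt.Println(resolveDependsOn([]string{"stored-dep"}, ev.DependsOn))
}
```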

No timing race, no re-seed band-aid, no /refresh-watch dependency. Every
HR transition observed by the watcher carries the live spec.dependsOn
through to the Job row. This is exactly the architecture that
ComponentSnapshot already documents at helmwatch.go:679-689, which the
event path had silently dropped.

Caught on prov t102.omani.works (22af2b1120158239, 2026-05-15) — all
hel1-2 HRs showed Deps:— in the JobsTable despite the bridge being
healthy (verified: x509 errors=0 post PR #1497, kubeconfigs present at
mtime T+2m, OnInitialListSynced fired).

Prior recurrences (each patched a layer above the actual gap):
  PR #1431 (2026-05-11) — derive HR dependsOn from live watcher (seed path)
  PR #1470 (2026-05-14) — persist DependsOn on every event (preserve prior)
  PR #1497 (2026-05-15) — skip TLS verify on Sovereign k3s self-signed CA
  PR #1498 (2026-05-15) — escape $ in tls-restart Job so Flux doesn't blank vars
  THIS  (2026-05-15) — actually plumb spec.dependsOn through the Event

Tests:
  go test ./internal/jobs/... ./internal/helmwatch/... ./internal/provisioner/...
  All green. 9 OnHelmReleaseEvent callsites updated for the new signature.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 16:39:52 +04:00
cmd feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396) 2026-05-11 16:01:09 +04:00
internal fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race (#1499) 2026-05-15 16:39:52 +04:00
Containerfile fix(build): unblock Build & Deploy Catalyst — Containerfile + test typing (#1172) 2026-05-09 12:28:59 +04:00
go.mod feat(epic-4): K+P+X1+G — k8s-ws-proxy + projector + WebSocket logs + Guacamole chart (#1099) (#1164) 2026-05-09 09:27:39 +04:00
go.sum feat(epic-4): K+P+X1+G — k8s-ws-proxy + projector + WebSocket logs + Guacamole chart (#1099) (#1164) 2026-05-09 09:27:39 +04:00