Two related fixes for multi-region + qa-fixtures DoD on prov #64:
1. **k3s TLS cert needs the public IPv4 in SAN.**
Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP
(cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s
auto-generates the server cert with SANs from --tls-san flags. We
only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2
+ cluster-ip + 127.0.0.1 only. Bridge connection from contabo
rejected with:
"x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1,
::1, not 204.168.212.113"
→ silent watcher failure → 0 secondary HRs observed → canvas missing
region sub-groups.
Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before
k3s install, add it as --tls-san=$CP_PUBLIC_IPV4.
2. **openova.io/region=hz-fsn-rtz-prod node label.**
qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs,
qa-wp Application) carry hard nodeAffinity for
`openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion
default in products/catalyst/chart/templates/qa-fixtures/*.yaml).
Without the label every fixture pod FailedScheduling → bp-catalyst-
platform post-install hook waits forever → bootstrap-kit chain hangs
at 44/45 with bp-catalyst-platform Running.
Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP
(qa-fixtures pin to primary by design).
Both shipped in same commit since both are inside the same k3s server
install line.
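The x509 rejection in item 1 can be reproduced offline. This is a minimal sketch in Python, assuming the SAN list quoted in the error message; `ip_covered_by_sans` is a hypothetical helper for illustration, not part of helmwatch or k3s.

```python
import ipaddress


def ip_covered_by_sans(target: str, sans: list[str]) -> bool:
    """Return True when `target` equals one of the IP entries in `sans`.

    Entries that are not IP literals (e.g. the sovereign FQDN) are skipped,
    because a connection made by IP is only matched against IP SANs.
    """
    want = ipaddress.ip_address(target)
    for entry in sans:
        try:
            if ipaddress.ip_address(entry) == want:
                return True
        except ValueError:
            continue  # DNS-name SAN, irrelevant for an IP dial
    return False


# SAN list quoted in the error message.
sans_before = ["10.0.10.2", "10.43.0.1", "127.0.0.1", "::1"]
# After the fix the public IPv4 is appended via --tls-san=$CP_PUBLIC_IPV4.
sans_after = sans_before + ["204.168.212.113"]
```

With `sans_before` the public IP is not covered, which matches the Bridge failure; with `sans_after` it is.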
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook
timed out because the catalyst-api Helm-released pod stayed Pending
with "Too many pods. 0/1 nodes are available".
k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed
deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns
Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/
flux/cnpg sidecars fill all 110 pod slots on a single node. With workers
NotReady on prov #63 the CP carried everything alone and stopped
scheduling new pods at 110.
Bump to 220 on both CP and worker so the per-node pod limit doesn't gate
the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU
+ 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit
weight.
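A rough pod-count bound, sketched in Python, shows why 110 is too tight. It uses only the figures stated in this message (about 45 deployments, 1-3 pods each) and leaves out the qa-fixtures stack and sidecars, for which no counts are given. `pod_range` is a hypothetical helper.

```python
DEFAULT_MAX_PODS = 110  # kubelet default cited in this message
BUMPED_MAX_PODS = 220   # value chosen by this commit


def pod_range(deployments: int, min_pods: int, max_pods: int) -> tuple[int, int]:
    """Lower and upper bound on the pod count for a set of deployments."""
    return deployments * min_pods, deployments * max_pods


# Bootstrap-kit only: ~45 deployments with 1-3 pods each.
low, high = pod_range(deployments=45, min_pods=1, max_pods=3)
```

The upper bound for bootstrap-kit alone already exceeds the default limit when a single node carries everything, while it still fits under 220.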
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop
with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects
its node IP from the primary interface, which on Hetzner cpx52 binds
to the public IPv4 (49.x.x.x) instead of the private network IP
(10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there;
nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the
private IP from cilium-config k8sServiceHost — times out, CrashLoop.
Worked by luck on cpx42 (earlier kernel + Hetzner network attach
timing). cpx52 reproduces 100%.
Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip}
in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP
AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443
(cilium-config substitute) find the API server every time.
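The flag pair can be sketched as a small builder that refuses a non-private address. This is an illustration in Python, not the cloud-init template itself; `k3s_server_flags` is a hypothetical name and `8.8.8.8` is just an arbitrary public address used for the negative case.

```python
import ipaddress


def k3s_server_flags(cp_private_ip: str) -> str:
    """Build the k3s server flags that pin the node and API server to the
    private network address. Raises ValueError for a non-private IP."""
    if not ipaddress.ip_address(cp_private_ip).is_private:
        raise ValueError(f"{cp_private_ip} is not a private address")
    return " ".join([
        "server",
        f"--node-ip={cp_private_ip}",            # node INTERNAL-IP
        f"--advertise-address={cp_private_ip}",  # address kube-apiserver advertises
    ])


flags = k3s_server_flags("10.0.1.2")
```

The guard is a design choice of the sketch: it makes the cpx52 failure mode (auto-detecting the public IPv4) an explicit error instead of a silent misbinding.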
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during
phase-1 watch on a 3-region Sovereign. The in-memory state has grown
substantially since the 1Gi limit was set:
- 1 primary helmwatch.Watcher (45 HRs + informer cache)
- N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each
with its own informer cache)
- jobs.Store backed by on-disk + in-memory tree
- per-/snapshot poll: composes per-region region groups across all
Job rows + cross-references hrDeps from the live primary watcher
Combined steady-state exceeds 1Gi on cpx32-equivalent loads. Bumped
limits to 4Gi (request 512Mi up from 128Mi). The mothership node has
8GB+ resident, no other tight constraint. Future fix: persist region
in Job rows so secondary watchers don't need to be retained post
phase-1 (orthogonal cleanup).
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix)
Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a
running secondary-region install-* row on /sovereign/provision/<id>/jobs
landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover
and returned "404 page not found".
Root cause: useJobLinkBuilder was passing the FULL canvas JobID form
through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping
the "<deploymentId>:" prefix. The canvas emits ids like
"<deploymentId>:install-X" (single-region) or
"<deploymentId>:<region>:install-X" (multi-region, see
flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName —
exact-match URL lookup of the prefix-bearing form misses every time.
FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the
first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs
row click and a canvas drill-down resolve to the SAME backend endpoint.
The existing JobsTable row-link test uses a job.id with no `:` prefix,
so the strip is a no-op for that fixture and the `/jobs/job-install-cilium`
assertion still holds.
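The prefix strip can be sketched as follows. The real code is TypeScript in useJobLinkBuilder; this Python version only mirrors the behaviour described here, and both function names are hypothetical. `quote(..., safe=":")` approximates `encodeURIComponent(...).replace(/%3A/g, ':')`.

```python
from urllib.parse import quote


def strip_deployment_prefix(job_id: str) -> str:
    """Drop everything up to and including the first ':'; ids without a
    ':' are returned unchanged (the no-op case for the existing test)."""
    _, sep, rest = job_id.partition(":")
    return rest if sep else job_id


def job_path_segment(job_id: str) -> str:
    """Percent-encode the bare job name but keep ':' literal."""
    return quote(strip_deployment_prefix(job_id), safe=":")
```

For example, `strip_deployment_prefix("a43364f11c10cde3:install-cilium")` yields the bare `install-cilium` that jobs.Store.GetJob keys by, while `job-install-cilium` passes through untouched.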
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(flow_snapshot_local): derive region from persisted JobName, synth region groups
Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region
canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-*
leaves as direct children of bootstrap-kit (no region sub-groups visible),
and the provision-hetzner→bootstrap-kit edge fans M×N across all 135.
Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits
events with `ev.Component = region + "/" + componentName`. The jobs
bridge persists them with `JobName=install-<region>/<chart>` and
`AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no
region awareness). After phase 1 terminates the deferred stopSecondaries()
clears `dep.secondaryWatchers`, so the multi-region snapshot block
(line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op.
flowSnapshotFromJobs then emits all 135 install Jobs flat under
bootstrap-kit, no Region field set, no region group bubbles, and
flowLayoutOrganic.ts's temporal-endpoint cascade fans the
provisioner→bootstrap-kit edge onto all 135 because there's no
intermediate region group to absorb it.
Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical
multi-region prefix marker), derive the region key, set
FlowNode.Region, and re-parent to a synthesised
"<deploymentId>:<region>:bootstrap-kit" group. After the loop,
synthesise one bootstrap-kit sub-group node per discovered region
with a `contains` edge to the parent bootstrap-kit. The resulting
shape:
bootstrap-kit
├── 45 primary install-* (legacy parent, no region)
├── <region-A>:bootstrap-kit ── 45 install-* (region tagged)
└── <region-B>:bootstrap-kit ── 45 install-* (region tagged)
This persists ACROSS phase-1 termination because the source of truth
is jobs.Store (durable), not dep.secondaryWatchers (transient).
The multi-region block (line 408+) still runs WHEN secondary watchers
are alive (during phase 1) — it emits ADDITIONAL FlowNodes with
"<deploymentId>:<region>:install-X" IDs distinct from the persisted
"<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't
collide. Post-phase-1 the watchers clear and only the persisted-Job
path remains, but now WITH region structure preserved.
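The per-Job loop and the group synthesis can be sketched like this. The real code is Go in flow_snapshot_local.go; this Python version is a simplified model. The `Job` and `FlowNode` shapes, the root id `<deploymentId>:bootstrap-kit`, and the `(parent, child)` tuple used for `contains` edges are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    app_id: str    # "cilium" (primary) or "nbg1-1/cilium" (secondary)
    job_name: str  # "install-cilium" or "install-nbg1-1/cilium"


@dataclass
class FlowNode:
    id: str
    parent_id: str
    region: str = ""


def build_nodes(deployment_id: str, jobs: list[Job]):
    root = f"{deployment_id}:bootstrap-kit"  # assumed id of the parent group
    nodes, regions = [], []
    for j in jobs:
        # A "/" in AppID marks a secondary-region job; the part before it
        # is the region key.
        region, sep, _ = j.app_id.partition("/")
        if sep:
            parent = f"{deployment_id}:{region}:bootstrap-kit"
            if region not in regions:
                regions.append(region)
        else:
            region, parent = "", root  # primary: legacy parent, no region
        nodes.append(FlowNode(f"{deployment_id}:{j.job_name}", parent, region))
    # After the loop: one sub-group per discovered region, each contained
    # in the parent bootstrap-kit.
    groups = [FlowNode(f"{deployment_id}:{r}:bootstrap-kit", root, r) for r in regions]
    contains_edges = [(root, g.id) for g in groups]
    return nodes, groups, contains_edges
```

Because the only input is the list of persisted jobs, the same structure comes out whether or not any secondary watcher is still alive.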
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(flow_snapshot): remove duplicate live-watcher multi-region block
PR #1454 added region-group synthesis from persisted Job rows. The old
secondaryWatchers-based block at line 442+ emitted nodes with the SAME
region-group IDs AND child nodes, so during phase 1 (when both paths
are live) the snapshot rendered with 90 children per region group
instead of 45 — visible on prov #61 (2e197a934a0e0461):
bootstrap-kit: 49 children
hel1-2:bootstrap-kit: 90 children (should be 45)
nbg1-1:bootstrap-kit: 90 children (should be 45)
Plus the region groups appeared twice in the node list.
Root cause: the per-Job loop (PR #1454) and the legacy block both write
to the same region-group IDs without deduping. The per-Job path covers
the persisted-Job state (durable across phase-1 termination), so the
live-watcher path is redundant.
Fix: delete the legacy block. The earlier
secondaryWatchers-snapshot-into-map work (lines 182-205) is kept
because that path also reads dep.liveWatcher (primary) for the hrDeps
lookup the per-Job loop uses for primary-region dep edges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(infra): pass cp_private_ip to primary CP templatefile too
PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".
Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile
prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl
started consuming ${cp_private_ip} (PR #1446):
Invalid value for "vars" parameter: vars map does not contain key
"cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.
The primary CP templatefile call (main.tf:342) and the secondary WORKER
templatefile call (main.tf:944) both pass `cp_private_ip`, but the
secondary CP templatefile call (main.tf:860) was missed, so every
multi-region provision since PR #1446 fails here at plan time.
Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the
secondary CP templatefile so each secondary region's cilium-operator
reaches its OWN local CP (matching CA), not the primary across regions.
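This class of plan-time failure can be caught before `tofu plan` with a simple pre-check. A minimal sketch in Python, assuming the template only uses plain `${name}` interpolations; `missing_template_vars` is a hypothetical helper and is not part of OpenTofu.

```python
import re

# Matches simple ${name} interpolations. The $${...} escape (a literal
# "${") is excluded by the lookbehind; expressions such as ${a.b} or
# ${f(x)} and %{ ... } directives are deliberately not handled.
_VAR = re.compile(r"(?<!\$)\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_template_vars(template: str, provided: dict) -> list[str]:
    """Return the interpolated names that are absent from the vars map,
    in order of first appearance."""
    missing = []
    for name in _VAR.findall(template):
        if name not in provided and name not in missing:
            missing.append(name)
    return missing
```

Running it once per templatefile call site (primary CP, secondary CP, worker) with that call's vars map would flag the one that lacks `cp_private_ip`.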
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort)
Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops:
fatal: failed to start: daemon creation failed: unable to initialize
BPF masquerade support: BPF masquerade requires NodePort
(--enable-node-port="true")
Chart default kubeProxyReplacement=false leaves enable-node-port=false in
the rendered cilium-config ConfigMap. Combined with bpf.masquerade=true
(also default-on) the cilium-agent rejects the BPF masquerade datapath
on startup. CP cilium-agent survives because it was started by cloudinit
with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the
ConfigMap. Every WORKER node that joins after Flux's upgrade sees the
new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-
ready taint persists → every post-install Job pod (keycloak-config-cli,
powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain
stalls at ~60% Ready.
Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl
write_files entry /var/lib/catalyst/cilium-values.yaml) already uses
kubeProxyReplacement: true. This change aligns the Flux HR overlay with
the working pre-Flux bootstrap so the agent config never regresses when
helm-controller does its first upgrade.
Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match.
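The incompatibility can be expressed as a values check. This is a deliberately narrow sketch in Python that models only the two keys named in this message and mirrors the defaults described here (kubeProxyReplacement off, bpf.masquerade on); other Cilium settings that can influence NodePort are ignored. `masquerade_config_error` is a hypothetical helper.

```python
from typing import Optional


def masquerade_config_error(values: dict) -> Optional[str]:
    """Return an error string when BPF masquerade is on but
    kubeProxyReplacement is off, else None."""
    # Accept both booleans and "true"/"false" strings.
    kpr = str(values.get("kubeProxyReplacement", "false")).lower() == "true"
    bpf_masq = bool(values.get("bpf", {}).get("masquerade", True))
    if bpf_masq and not kpr:
        return "BPF masquerade requires NodePort"
    return None
```

A check like this run against the rendered HR values would flag the overlay before a worker node ever sees the ConfigMap.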
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race
prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with:
Internal error occurred: failed calling webhook "mcluster.cnpg.io":
no endpoints available for service "cnpg-webhook-service"
Chain:
1. bp-cnpg install with disableWait: true → HR goes Ready immediately
when manifests apply (operator pod still spinning up).
2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the
dependsOn check on bp-cnpg.
3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs.
4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints
yet → admission webhook call fails → Helm install fails →
RetriesExceeded → entire DB-backed chain wedges.
Carve out the disableWait: true blanket for bp-cnpg specifically.
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the
agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply
to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule,
so Helm-wait blocks only on pod readiness, not on a self-referencing CRD.
With this change bp-cnpg's HR stays Reconciling until cnpg-controller-
manager + cnpg-webhook-service are both rolled + Available, so Flux
dependsOn correctly gates downstream consumers behind a webhook that's
actually serving.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-guacamole): render.sh expects 19 resources (Fix#125 bootstrap Job)
Fix#125's guacamole-oidc bootstrap Job added 4 K8s resources to the
chart's full-ON render (1 Job + 1 ServiceAccount + 1 Role + 1 RoleBinding)
but render.sh's expect_total was never bumped from 15 → 19. Every
Blueprint Release run since 5b711427 fails the test and bails before
publishing the chart to GHCR.
Consequence: Build bp-guacamole's mirror job successfully mirrors
upstream images + bumps Chart.yaml to 0.1.13/0.1.14/.../0.1.18/0.1.19,
but the chained Blueprint Release on each bump commit fails render.sh
and never publishes. GHCR is stuck at 0.1.12. Bootstrap-kit overlay
HRs pinned to anything beyond 0.1.12 wedge with:
failed to download chart for remote reference: failed to get
'oci://ghcr.io/openova-io/bp-guacamole:0.1.17': not found
Caught on prov #58 (d4f60afe4f13aee9, 2026-05-12) when bp-guacamole
HR went False with that exact error across all 3 regions.
Also bump bootstrap-kit overlay version pin 0.1.17 → 0.1.19 so the
catch-up Blueprint Release (triggered by this commit) lands a tag the
overlay actually references.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3
"no stretched fault domain". Cilium on each region MUST talk to its
OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites
hardcoded the primary's IP:
1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665):
`k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region
by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2).
2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}`
so each region's k3s API cert validates against the LOCAL CP's IP.
3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml):
add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR
values so Flux postBuild.substitute can override per region. The
cloud-init Kustomization renders the substitute var to `${cp_private_ip}`.
Single-region (primary-only) provisions fall back to the
default `10.0.1.2` and stay byte-identical to today.
Live evidence of the bug — prov #52 (3-region) on 2026-05-12:
cilium-operator on nbg1 secondary:
"Establishing connection to apiserver" host="https://10.0.1.2:6443"
"failed to start: ... tls: failed to verify certificate:
x509: certificate signed by unknown authority"
Each region's k3s has its OWN self-signed CA (cluster-init per CP). The
primary's API cert isn't signed by the secondary's CA → cilium crash-
loops → no CNI → flux controllers Pending → no HRs → canvas shows only
primary's HRs. This fix points each region's cilium at the LOCAL CP,
whose API server presents the matching CA from this cluster.
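The `${NAME:=default}` fallback used in site 3 can be modelled in a few lines. This Python sketch covers only that one form plus plain `${NAME}`; it is not Flux's substitution engine, and leaving unset names without a default untouched is a choice made for the sketch.

```python
import re

_SUBST = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def substitute(text: str, variables: dict) -> str:
    """Replace ${NAME} / ${NAME:=default} using `variables`."""
    def repl(m):
        name, default = m.group(1), m.group(2)
        value = variables.get(name)
        if value:                  # set and non-empty: use it
            return value
        if default is not None:    # unset or empty: fall back
            return default
        return m.group(0)          # no default given: leave as-is
    return _SUBST.sub(repl, text)


line = "k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}"
```

A single-region provision passes no variable and gets `10.0.1.2`; a secondary region passes its own CP address and gets that instead.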
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The conflict-resolution Python script in PR #1444 wrote a literal
newline where the regex string needed the two-character "\n" escape
sequence. tofu init rejected main.tf with "Invalid multi-line string /
Unterminated template string" at main.tf:925.
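The mistake is easy to reproduce in Python. In this sketch `render_hcl_regex_call` and the `local.input` argument are hypothetical stand-ins for whatever line the script generated; only the escaping behaviour is the point.

```python
def render_hcl_regex_call(pattern: str) -> str:
    """Render one line of HCL that passes `pattern` as a quoted string."""
    return f'regex("{pattern}", local.input)'


# Raw string: backslash + "n" reach the file as two characters, which the
# HCL parser then reads as an escape sequence inside the quoted string.
good = render_hcl_regex_call(r"\n")

# Ordinary string: Python turns "\n" into a real line break, so the
# quoted HCL string is split across two lines.
bad = render_hcl_regex_call("\n")
```

`good` stays on one line; `bad` spans two, which is the shape that produces an unterminated-string error.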
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.
End-to-end change across infra + handler:
1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
appends `?region=<kubeconfig_postback_region>` when the var is set.
main.tf templatefile call passes empty for primary CP, `each.key`
(e.g. "nbg1-1", "hel1-2") for each secondary region.
2) PutKubeconfig handler: reads ?region= query param. Empty → primary
path (unchanged: stores at <dir>/<id>.yaml, sets
Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
→ secondary path: stores at <dir>/<id>-<region>.yaml, populates
Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
per-region (the same bearer secures every CP's PUT — secondaries
reuse it for their own slot). NO Phase-1 watch re-launch from a
secondary PUT.
3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
the Watcher on Deployment.secondaryWatchers[region]. Per-region
watchers emit ordinary helmwatch events with region-prefixed
Component names so the wizard's per-component view doesn't collide
primary vs secondary bp-cilium events. They do NOT contribute to
markPhase1Done — outcome remains the primary's classification.
4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
bubbles + install-* nodes from each secondary watcher's
SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
FlowNode.region set so the canvas can colour-group. Intra-region
finish-to-start deps emitted from cs.DependsOn — same-region only,
never cross-region (per NAMING-CONVENTION §1.3 independent fault
domains, no stretched cluster).
5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
kubeconfig file on Sovereign wipe.
Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.
Caught after the operator pointed out that 3-region prov #50 showed
only 52 install-* nodes (all from fsn1) on the canvas, which exposed
this architectural gap.
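
The storage and node-id conventions from steps 2 and 4 can be sketched
as below. This is an illustration only, written in TypeScript; the real
handler is not this code, and both function names are hypothetical.

```typescript
// Hypothetical helpers mirroring the conventions described above.

// Empty region means the primary CP: <dir>/<id>.yaml.
// Non-empty region means a secondary CP: <dir>/<id>-<region>.yaml.
function kubeconfigPath(dir: string, id: string, region: string): string {
  return region === "" ? `${dir}/${id}.yaml` : `${dir}/${id}-${region}.yaml`;
}

// Canvas node id for a secondary region's install job:
// <depID>:<region>:install-<chart>.
function secondaryNodeId(depId: string, region: string, chart: string): string {
  return `${depId}:${region}:install-${chart}`;
}

const kubeconfigsDir = "/var/lib/catalyst/kubeconfigs";
const primaryPath = kubeconfigPath(kubeconfigsDir, "bfdccbdbd6f700e1", "");
const secondaryPath = kubeconfigPath(kubeconfigsDir, "bfdccbdbd6f700e1", "nbg1-1");
const secondaryNode = secondaryNodeId("bfdccbdbd6f700e1", "nbg1-1", "cilium");
```

Keeping the primary path unchanged is what lets single-region
provisions behave exactly as before.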
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name /
hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster
autoscaler could attach scale-up VMs to the private network. The
primary CP's templatefile call at main.tf:483-485 was updated, but the
matching call for secondary regions at main.tf:899 was missed.
Result: any provision with regions[] of length > 1 fails at tofu plan
with "vars map does not contain key hcloud_network_name" referenced in
cloudinit-control-plane.tftpl:478.
Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash)
at T+0:47. Forward the same three resource refs to every secondary
region's templatefile call.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator-reported design fix completing #1437/#1440 — the cross-phase
ordering between provisioner and bootstrap-kit groups was either an
M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at
leaf level (post-#1440 with the both-elided skip). Neither was right.
Real design: when a group→group dependency edge is lifted onto the
leaf graph because one or both endpoints elided, cascade ONLY to the
temporal endpoint pair:
upstream_terminals → downstream_initials
Where:
- upstream_terminals = visible descendants of the upstream group
that nothing else in the group depends on (sinks of intra-group
DAG). For the tofu chain this collapses to just cluster-bootstrap.
- downstream_initials = visible descendants of the downstream group
that depend on nothing else in the group (sources of intra-group
DAG). For bootstrap-kit this is install-cilium / install-flux /
install-gateway-api / etc — the install-* roots.
Net result for provisioner→bootstrap-kit at depth=all: a small fan of
edges from cluster-bootstrap to the bp-* roots — the real temporal
gate, no spurious phantom edges, no missing cross-phase chain.
Two call sites updated:
- Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now
cascades to groupTerminals(G) instead of fanOutVisibleChildren(G).
- Outbound: elidedGroup G with G.dependsOn = [D] cascades to
groupInitials(G) on the receive side; D-side cascades to
groupTerminals(D) when D is also elided, or uses D directly when
D is a visible job.
11/11 flowLayoutOrganic.test.ts pass.
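
A minimal sketch of the terminals → initials cascade, assuming a
simplified Job shape. The Job interface, liftGroupEdge, and the tofu-*
job names are hypothetical; only groupTerminals / groupInitials are
named in this change, and their real signatures may differ.

```typescript
// Hypothetical minimal job shape; the real types in flowLayoutOrganic.ts may differ.
interface Job {
  id: string;
  groupId: string;
  dependsOn: string[];
}

function membersOf(jobs: Job[], groupId: string): Job[] {
  return jobs.filter((j) => j.groupId === groupId);
}

// Sinks of the intra-group DAG: members that no other member depends on.
function groupTerminals(jobs: Job[], groupId: string): string[] {
  const members = membersOf(jobs, groupId);
  const memberIds = new Set(members.map((j) => j.id));
  const dependedOn = new Set<string>();
  for (const j of members) {
    for (const d of j.dependsOn) {
      if (memberIds.has(d)) dependedOn.add(d);
    }
  }
  return members.filter((j) => !dependedOn.has(j.id)).map((j) => j.id);
}

// Sources of the intra-group DAG: members that depend on no other member.
function groupInitials(jobs: Job[], groupId: string): string[] {
  const members = membersOf(jobs, groupId);
  const memberIds = new Set(members.map((j) => j.id));
  return members
    .filter((j) => !j.dependsOn.some((d) => memberIds.has(d)))
    .map((j) => j.id);
}

// Lift one group-to-group edge onto the leaf graph as terminals x initials.
function liftGroupEdge(jobs: Job[], upstream: string, downstream: string): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  for (const from of groupTerminals(jobs, upstream)) {
    for (const to of groupInitials(jobs, downstream)) {
      edges.push([from, to]);
    }
  }
  return edges;
}

const sampleJobs: Job[] = [
  { id: "tofu-init", groupId: "provisioner", dependsOn: [] },
  { id: "tofu-apply", groupId: "provisioner", dependsOn: ["tofu-init"] },
  { id: "cluster-bootstrap", groupId: "provisioner", dependsOn: ["tofu-apply"] },
  { id: "install-cilium", groupId: "bootstrap-kit", dependsOn: [] },
  { id: "install-flux", groupId: "bootstrap-kit", dependsOn: [] },
  { id: "install-cnpg", groupId: "bootstrap-kit", dependsOn: ["install-flux"] },
];
const lifted = liftGroupEdge(sampleJobs, "provisioner", "bootstrap-kit");
```

On this sample the lift yields two edges (cluster-bootstrap to
install-cilium and to install-flux) rather than the 3 × 3 fan-out.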
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator reported "non of the right click functionalites working
other than the open in new tab". Root cause: the previous handler
only mutated urlFoldedSet, which had no visible effect when the
clicked group was folded by the depth default (same class of bug
toggleFold had before #1439). The menu items also had confusing
labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative).
Rewrite to use the same compose-state pattern toggleFold uses:
- "Show only this group" — switch to depth=all + fold every OTHER
group. Only the clicked group's subtree expands; sibling groups
stay collapsed.
- "Hide this group" — switch to depth=default + add clicked group
to urlFoldedSet. Group renders as a folded bubble; its subtree
hidden.
- "Expand subtree" — switch to depth=all + remove this group and
all its descendant groups from urlFoldedSet. Fully unfolded
subtree.
- "Open in new tab" — unchanged (was working since #1435).
Dropped the misleading "Fold to level N" item (was just stepDepth(-1)).
The depth chip ◀▶ at the top-right is the canonical global depth
control.
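
The three rewritten menu actions can be read as pure state
transitions. The sketch below assumes a simplified view state and a
parent map of groups; ViewState, ParentMap, the function names, and
the nested "bootstrap-kit/network" group are all hypothetical. showOnly
follows the wording above literally (fold every group other than the
clicked one).

```typescript
// Hypothetical view state; field names are illustrative.
interface ViewState {
  depth: "all" | "default";
  urlFolded: string[];
}

// Maps each group id to its parent group id (null for top-level groups).
type ParentMap = Record<string, string | null>;

// True when candidate is the ancestor itself or sits somewhere below it.
function isSelfOrDescendant(parentOf: ParentMap, candidate: string, ancestor: string): boolean {
  let cur: string | null = candidate;
  while (cur !== null) {
    if (cur === ancestor) return true;
    cur = parentOf[cur] ?? null;
  }
  return false;
}

// "Show only this group": depth=all, every other group folded.
function showOnly(parentOf: ParentMap, groupId: string): ViewState {
  return { depth: "all", urlFolded: Object.keys(parentOf).filter((g) => g !== groupId) };
}

// "Hide this group": depth=default, clicked group added to the folded set.
function hideGroup(state: ViewState, groupId: string): ViewState {
  const urlFolded = state.urlFolded.includes(groupId)
    ? state.urlFolded
    : [...state.urlFolded, groupId];
  return { depth: "default", urlFolded };
}

// "Expand subtree": depth=all, clicked group and its descendants unfolded.
function expandSubtree(parentOf: ParentMap, state: ViewState, groupId: string): ViewState {
  return {
    depth: "all",
    urlFolded: state.urlFolded.filter((g) => !isSelfOrDescendant(parentOf, g, groupId)),
  };
}

const topLevel: ParentMap = { provisioner: null, "bootstrap-kit": null };
const nested: ParentMap = { ...topLevel, "bootstrap-kit/network": "bootstrap-kit" };

const onlyKit = showOnly(topLevel, "bootstrap-kit");
const hidden = hideGroup({ depth: "all", urlFolded: [] }, "provisioner");
const expanded = expandSubtree(
  nested,
  { depth: "default", urlFolded: ["provisioner", "bootstrap-kit", "bootstrap-kit/network"] },
  "bootstrap-kit",
);
```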
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three operator-reported issues from the same dblclick session:
1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx
used absolute /jobs which on contabo resolves to /sovereign/jobs —
the mother's flat /jobs view, NOT the chroot-scoped
/sovereign/provision/<id>/jobs. Operator reported "chroot principle
violation". Fix: chroot-aware /provision/<deploymentId>/jobs when
deploymentId is present.
2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no
edge between them — temporal ordering invisible. Earlier #1437
dropped the group→group edge entirely because the FE layout's
lift-on-elide cascaded it into M×N phantom edges at ?depth=all.
Re-emit the edge AND fix the lift logic in
flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH
endpoints of the elided-group dep are elided. At ?depth=1 the
edge renders between the two folded groups as intended; at
?depth=all both groups elide and the lift is suppressed so the
spurious cascade doesn't reappear. The actual install-* deps are
already visible via each leaf's own dependsOn — skipping the lift
costs no information.
3) (Documented separately) Right-click menu only attaches to GROUP
nodes per design (FlowCanvasOrganic line 1277). When all groups
are elided (as happens at ?depth=all), the menu is unreachable.
The dblclick-on-group fold fix (#1439) makes group bubbles
reachable at ?depth=1, where right-click works.
Caught via Playwright after operator reported all three.
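
Items 1 and 2 reduce to two small decisions, sketched here under
assumed names (backToJobsPath and shouldLiftGroupDep are hypothetical,
not the identifiers used in JobDetail.tsx or flowLayoutOrganic.ts).

```typescript
// Router-relative target for the "Back to jobs" link. The router is
// assumed to prepend its basepath, so no /sovereign prefix appears here.
function backToJobsPath(deploymentId?: string): string {
  return deploymentId ? `/provision/${deploymentId}/jobs` : "/jobs";
}

// Lift a group-level dep onto leaf nodes only when exactly one endpoint
// is elided. Both elided: skip the lift (this is the fix). Neither
// elided: the edge is drawn between the two group bubbles directly.
function shouldLiftGroupDep(fromElided: boolean, toElided: boolean): boolean {
  return fromElided !== toElided;
}

const scopedLink = backToJobsPath("bfdccbdbd6f700e1");
const flatLink = backToJobsPath();
```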
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
toggleFold previously only mutated urlFoldedSet, which had no effect
when the clicked node was folded BY THE DEPTH DEFAULT (not by an
explicit URL override). Result: at ?depth=1 where both groups are
folded by depth-default, double-clicking bootstrap-kit (after #1438's
dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet
delete didn't change the composed foldedSet, the canvas didn't budge.
New behaviour:
- If clicked node is folded by ANY source: switch to depth=all AND
explicitly fold every OTHER previously-folded group. Only the
clicked group ends up visibly unfolded — exactly the operator-
requested "expand only the respective parent" UX.
- If clicked node is unfolded: add to urlFoldedSet to fold it
without changing depth.
Caught via Playwright after #1438 landed and dblclick still didn't
unfold the clicked group at ?depth=1.
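
A minimal sketch of the new toggle, assuming the caller passes in the
composed folded set (depth default plus URL overrides). FoldState and
the parameter names are hypothetical; the real toggleFold signature may
differ.

```typescript
// Hypothetical state shape; the real hook state may differ.
interface FoldState {
  depth: "all" | number;
  urlFolded: string[];
}

// composedFolded is the effective folded set the canvas currently renders.
function toggleFold(state: FoldState, composedFolded: string[], nodeId: string): FoldState {
  if (composedFolded.includes(nodeId)) {
    // Folded by ANY source: go to depth=all and keep every other
    // previously-folded group explicitly folded.
    return { depth: "all", urlFolded: composedFolded.filter((id) => id !== nodeId) };
  }
  // Currently unfolded: fold it without touching depth.
  return { depth: state.depth, urlFolded: [...state.urlFolded, nodeId] };
}

const atDepth1: FoldState = { depth: 1, urlFolded: [] };
const unfolded = toggleFold(atDepth1, ["provisioner", "bootstrap-kit"], "bootstrap-kit");
const refolded = toggleFold(unfolded, unfolded.urlFolded, "bootstrap-kit");
```

In the second call the composed set is taken to equal urlFolded, which
is an assumption about how depth=all composes.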
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator reported "double-click on a parent bubble it is expanding
all the parent instead of expanding only the respective parent."
Reproduced in Playwright: at ?depth=1 only the 2 group bubbles
render folded; double-click on bootstrap-kit navigated to
/jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page
defaulted to depth=2 → groups elided → all 50 install-* + Phase-0
bubbles rendered. Exactly the "expanding all parents" symptom.
Two fixes:
1) Branch handleNodeDoubleClick: if the bubble is a group, call
toggleFold(nodeId) in place — fold or unfold ONLY that group.
Tree-explorer UX where a leaf double-click drills in but a group
double-click expands/collapses.
2) For the leaf path, preserve window.location.search across the
navigate so the destination page renders with the same depth /
folded filter the operator had on screen. Without this, the new
page defaults to depth=2 and the visible bubble set changes
beneath them.
Caught via Playwright double-click simulation on bootstrap-kit at
?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles)
to .../jobs/bootstrap-kit (50 bubbles).
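
The two branches can be sketched as a pure decision function. The
types, the onNodeDoubleClick name, and the "abc" deployment id are
hypothetical; the real handleNodeDoubleClick performs the navigation
itself rather than returning an action.

```typescript
interface FlowNodeLike {
  id: string;
  isGroup: boolean;
}

type DblClickAction =
  | { kind: "toggleFold"; nodeId: string }
  | { kind: "navigate"; href: string };

// currentSearch is the query string on screen, e.g. window.location.search.
function onNodeDoubleClick(node: FlowNodeLike, jobsBase: string, currentSearch: string): DblClickAction {
  if (node.isGroup) {
    // Groups fold or unfold in place; no navigation.
    return { kind: "toggleFold", nodeId: node.id };
  }
  // Leaves drill in, carrying the depth / folded filter along.
  return { kind: "navigate", href: `${jobsBase}/${encodeURIComponent(node.id)}${currentSearch}` };
}

const groupAction = onNodeDoubleClick({ id: "bootstrap-kit", isGroup: true }, "/provision/abc/jobs", "?depth=1");
const leafAction = onNodeDoubleClick({ id: "install-cnpg", isGroup: false }, "/provision/abc/jobs", "?depth=1");
```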
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound
deps onto EACH of its visible children, and if the dep target is
itself an elided group, fans out to THAT group's visible children
too. With both top-level groups elided at depth=all, the single
group→group finish-to-start edge I added cascades into M×N phantom
edges (each install-* gains a dep on every tofu-* + cluster-bootstrap
step). The operator-reported "install-cnpg has 5 connections from
terraform jobs" was exactly this layout-side fan-out.
Removing the group→group edge leaves Phase-0 and Phase-1 as separate
connected components on the canvas — the correct minimum-edge
rendering. Ordering between phases is implicit in the timestamps +
status flow, not in the edge graph.
Caught by Playwright-probing the canvas after operator pushback: data
side had only the 1 real direct dep (install-flux → install-cnpg)
yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1435 (depth-chip basepath fix) failed CI because removing `to:`
from navigate() narrowed the search reducer's typed return to never,
producing TS2322 on the `Record<string, unknown>` cast.
Forward-fix: bypass TanStack navigate() entirely for the search-only
mutation path. Update window.location's query string via
history.replaceState (preserves pathname verbatim including basepath)
and dispatch a synthetic popstate so TanStack's useSearch picks up
the new query on next render. No TanStack path resolution → no
basepath drop → no colon re-encoding → depth-chip click stops 404ing.
This also re-applies the two fixes carried over from #1435:
open-new-tab (window.open of absolute /sovereign/... ) and
handleNodeDoubleClick (strip + encode jobId).
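
A sketch of the search-only mutation with the browser objects injected,
so it can run outside a browser. setSearchOnly, LocationLike, and
HistoryLike are hypothetical names, and "abc" is a placeholder
deployment id.

```typescript
interface LocationLike {
  pathname: string;
  search: string;
}

interface HistoryLike {
  replaceState(data: unknown, unused: string, url: string): void;
}

// Rewrites only the query string; the pathname is reused verbatim, so
// the basepath and any ":" in the path are left untouched.
function setSearchOnly(
  loc: LocationLike,
  hist: HistoryLike,
  notify: () => void,
  patch: Record<string, string>,
): string {
  const params = new URLSearchParams(loc.search);
  for (const key of Object.keys(patch)) params.set(key, patch[key]);
  const query = params.toString();
  const url = query ? `${loc.pathname}?${query}` : loc.pathname;
  hist.replaceState(null, "", url);
  notify(); // in the browser: dispatch a synthetic popstate
  return url;
}

// Browser wiring (not executed here):
//   setSearchOnly(window.location, window.history,
//     () => window.dispatchEvent(new PopStateEvent("popstate")),
//     { depth: "all" });

// Stubbed run.
const replaced: string[] = [];
let notified = 0;
const nextUrl = setSearchOnly(
  { pathname: "/sovereign/provision/abc/jobs/install-cnpg", search: "?depth=1" },
  { replaceState: (_data, _unused, url) => { replaced.push(url); } },
  () => { notified += 1; },
  { depth: "all" },
);
```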
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface:
1) Clicking the depth chip arrows (◀ / ▶) on
/sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser
to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath
was dropped AND the colon was re-encoded as %3A, both via TanStack's
`to: '.'` path resolution. The new URL 404s at the BE because the
colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup.
Fix: omit `to:` entirely. TanStack treats a search-only navigate as
a pure search-params mutation and preserves the current path verbatim
including the basepath. The colon-prefixed jobId in the URL comes
from older deep-links; the strip-on-click fix landed in #1431.
2) Right-click → "Open in new tab" also passed the raw nodeId
verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror
handleNodeDoubleClick: strip the "<deploymentId>:" prefix,
encodeURIComponent the remainder, AND prepend /sovereign for the
absolute-path window.open (window.open isn't routed through
TanStack so basepath isn't auto-prepended).
Caught after operator reported "level arrows redirect to wrong URLs
and giving 404" + "right click on a parent bubble … none of the
functions are working properly."
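
The "Open in new tab" target described in item 2 can be sketched as
follows. newTabHref is a hypothetical name, and the route shape is
taken from the URLs quoted above.

```typescript
// Builds the absolute href for window.open. window.open bypasses the
// router, so the basepath has to be prepended by hand.
function newTabHref(basePath: string, deploymentId: string, nodeId: string): string {
  const prefix = `${deploymentId}:`;
  // Strip the "<deploymentId>:" prefix when present.
  const jobName = nodeId.startsWith(prefix) ? nodeId.slice(prefix.length) : nodeId;
  return `${basePath}/provision/${deploymentId}/jobs/${encodeURIComponent(jobName)}`;
}

const prefixedHref = newTabHref("/sovereign", "bfdccbdbd6f700e1", "bfdccbdbd6f700e1:install-cnpg");
const bareHref = newTabHref("/sovereign", "bfdccbdbd6f700e1", "install-cnpg");
```

Stripping before encoding means no ":" is left to become %3A in the
common case.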
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
helmwatch.Bridge writes SOME Job.DependsOn entries as bare names
("install-flux") rather than the canonical JobID form
("<deploymentId>:install-flux") — 71 such entries observed on prov
bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied
those bare names verbatim into Relationship.fromId. The canvas
reducer matches FlowNode.id by exact string, so the bare-name fromId
became a phantom edge pointing to a non-existent node. In the
force-directed layout these phantom edges visually routed through
the nearest real bubbles, manifesting as 5-edge fan-outs from every
Phase-0 tofu job to every install-* bubble (operator-reported on
install-cnpg, but symmetric across all install-*).
Normalise every fromId to jobs.JobID(deploymentID, dep) form when
the stored value lacks a ":" separator.
Caught after operator reported "install-cnpg has 5 different
connections from terraform jobs — this is matter of a proper
chaining" — looking at the snapshot showed Job.DependsOn=[install-flux]
without the prefix.
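
The normalisation rule, sketched in TypeScript for illustration only
(the actual emit is not this code; jobId and normaliseDep are
hypothetical stand-ins for jobs.JobID and the fix).

```typescript
// Canonical form: "<deploymentId>:<jobName>".
function jobId(deploymentId: string, jobName: string): string {
  return `${deploymentId}:${jobName}`;
}

// Bare names (no ":" separator) get the deployment prefix; values that
// already carry a separator pass through unchanged.
function normaliseDep(deploymentId: string, dep: string): string {
  return dep.includes(":") ? dep : jobId(deploymentId, dep);
}

const fromBare = normaliseDep("bfdccbdbd6f700e1", "install-flux");
const fromCanonical = normaliseDep("bfdccbdbd6f700e1", "bfdccbdbd6f700e1:install-flux");
```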
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per products/openova-flow/core/src/types.ts line 112:
"contains — toId (parent) contains fromId (child)"
My emit had this inverted: I set FromID=parent, ToID=child, which
made the FE adapter (flowStreamToOrganic.ts line 134) interpret every
install-* leaf as a group containing the bootstrap-kit/provisioner
group nodes. Net result: only 2 bubbles ever rendered on the canvas
regardless of ?depth= because the hierarchy graph was upside-down.
Caught by opening the canvas in a browser via Playwright after the
operator reported "still showing only 2 bubbles, no drill-down".
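
The corrected direction, as a small sketch. The Relationship shape is
reduced to the fields mentioned here; containsEdge and childrenByParent
are hypothetical helpers, not the adapter's real functions.

```typescript
interface Relationship {
  type: "contains";
  fromId: string; // child
  toId: string; // parent
}

// "contains": toId (parent) contains fromId (child).
function containsEdge(parentId: string, childId: string): Relationship {
  return { type: "contains", fromId: childId, toId: parentId };
}

// Groups children under their parent, the way a consumer would read it.
function childrenByParent(rels: Relationship[]): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const r of rels) {
    if (!out[r.toId]) out[r.toId] = [];
    out[r.toId].push(r.fromId);
  }
  return out;
}

const hierarchy = childrenByParent([
  containsEdge("bootstrap-kit", "install-flux"),
  containsEdge("bootstrap-kit", "install-cnpg"),
]);
```

With the fields swapped, each install-* leaf would appear as a parent
of bootstrap-kit, which is the upside-down hierarchy described above.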
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the Pod restarts between PutKubeconfig writing the file AND the
next Result.Save() persisting the field, dep.Result.KubeconfigPath
comes back empty even though the file exists at the canonical
convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was
returning 409 watch-not-resumable in this state, which left the
mothership canvas frozen because the live watcher couldn't re-attach
to source HR.spec.dependsOn for the install-* edge derivation.
Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for
PR #1431 restarted catalyst-api Pod, the file
/var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but
RefreshWatch refused to use it because the record field was empty.
Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured
AND a file exists at <dir>/<depID>.yaml, use that path and patch the
record so subsequent /components/state + flow snapshot calls see a
populated field.
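
The fallback decision, sketched in TypeScript for illustration only
(the handler itself is not this code). resolveKubeconfigPath and the
injected fileExists callback are hypothetical; a null result stands for
the existing 409 watch-not-resumable response.

```typescript
// Returns the path to use, or null when the watch is not resumable.
function resolveKubeconfigPath(
  recordPath: string,
  kubeconfigsDir: string,
  depId: string,
  fileExists: (path: string) => boolean,
): string | null {
  if (recordPath !== "") return recordPath; // record is intact
  if (kubeconfigsDir === "") return null; // no directory configured
  const candidate = `${kubeconfigsDir}/${depId}.yaml`;
  // The caller is expected to patch the record when this returns a path.
  return fileExists(candidate) ? candidate : null;
}

const onDisk = "/var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml";
const recovered = resolveKubeconfigPath("", "/var/lib/catalyst/kubeconfigs", "bfdccbdbd6f700e1", (p) => p === onDisk);
const notResumable = resolveKubeconfigPath("", "/var/lib/catalyst/kubeconfigs", "bfdccbdbd6f700e1", () => false);
```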
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>