* fix(infra): pass cp_private_ip to primary CP templatefile too
PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".
Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile
prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl
started consuming ${cp_private_ip} (PR #1446):
Invalid value for "vars" parameter: vars map does not contain key
"cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.
The primary CP templatefile call (main.tf:342) and the secondary WORKER
templatefile call (main.tf:944) both pass `cp_private_ip`, but the
secondary CP templatefile call (main.tf:860) was missed — every
multi-region provision since PR #1446 lands here at plan-time.
Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the
secondary CP templatefile so each secondary region's cilium-operator
reaches its OWN local CP (matching CA), not the primary across regions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort)
Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops:
fatal: failed to start: daemon creation failed: unable to initialize
BPF masquerade support: BPF masquerade requires NodePort
(--enable-node-port="true")
Chart default kubeProxyReplacement=false leaves enable-node-port=false in
the rendered cilium-config ConfigMap. Combined with bpf.masquerade=true
(also default-on) the cilium-agent rejects the BPF masquerade datapath
on startup. CP cilium-agent survives because it was started by cloudinit
with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the
ConfigMap. Every WORKER node that joins after Flux's upgrade sees the
new (broken) ConfigMap → CrashLoopBackOff →
node.cilium.io/agent-not-ready taint persists → every post-install Job
pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending →
whole bootstrap-kit chain stalls at ~60% Ready.
Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl
write_files entry /var/lib/catalyst/cilium-values.yaml) already uses
kubeProxyReplacement: true. This change aligns the Flux HR overlay with
the working pre-Flux bootstrap so the agent config never regresses when
helm-controller does its first upgrade.
Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race
prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with:
Internal error occurred: failed calling webhook "mcluster.cnpg.io":
no endpoints available for service "cnpg-webhook-service"
Chain:
1. bp-cnpg install with disableWait: true → HR goes Ready immediately
when manifests apply (operator pod still spinning up).
2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the
dependsOn check on bp-cnpg.
3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs.
4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints
yet → admission webhook call fails → Helm install fails →
RetriesExceeded → entire DB-backed chain wedges.
Carve out the disableWait: true blanket for bp-cnpg specifically.
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the
agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply
to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule,
so Helm-wait blocks only on pod readiness, not on a self-referencing CRD.
With this change bp-cnpg's HR stays Reconciling until
cnpg-controller-manager + cnpg-webhook-service are both rolled +
Available, so Flux dependsOn correctly gates downstream consumers behind
a webhook that's actually serving.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3
"no stretched fault domain". Cilium on each region MUST talk to its
OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites
hardcoded the primary's IP:
1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665):
`k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region
by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2).
2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}`
so each region's k3s API cert validates against the LOCAL CP's IP.
3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml):
add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR
values so Flux postBuild.substitute can override per region. The
cloud-init Kustomization renders the substitute var to `${cp_private_ip}`.
Single-region (primary-only) provisions fall back to the
default `10.0.1.2` and stay byte-identical to today.
Live evidence of the bug — prov #52 (3-region) on 2026-05-12:
cilium-operator on nbg1 secondary:
"Establishing connection to apiserver" host="https://10.0.1.2:6443"
"failed to start: ... tls: failed to verify certificate:
x509: certificate signed by unknown authority"
Each region's k3s has its OWN self-signed CA (cluster-init per CP). The
primary's API cert isn't signed by the secondary's CA → cilium crash-
loops → no CNI → flux controllers Pending → no HRs → canvas shows only
primary's HRs. This fix points each region's cilium at the LOCAL CP,
whose API server presents the matching CA from this cluster.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The conflict-resolution Python script in PR #1444 wrote a literal
newline where the regex string needed the two-char "\n" escape. tofu
init rejected with "Invalid multi-line string / Unterminated template
string" on main.tf:925.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.
End-to-end change across infra + handler:
1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
appends `?region=<kubeconfig_postback_region>` when the var is set.
main.tf templatefile call passes empty for primary CP, `each.key`
(e.g. "nbg1-1", "hel1-2") for each secondary region.
2) PutKubeconfig handler: reads ?region= query param. Empty → primary
path (unchanged: stores at <dir>/<id>.yaml, sets
Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
→ secondary path: stores at <dir>/<id>-<region>.yaml, populates
Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
per-region (the same bearer secures every CP's PUT — secondaries
reuse it for their own slot). NO Phase-1 watch re-launch from a
secondary PUT.
3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
the Watcher on Deployment.secondaryWatchers[region]. Per-region
watchers emit ordinary helmwatch events with region-prefixed
Component names so the wizard's per-component view doesn't collide
primary vs secondary bp-cilium events. They do NOT contribute to
markPhase1Done — outcome remains the primary's classification.
4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
bubbles + install-* nodes from each secondary watcher's
SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
FlowNode.region set so the canvas can colour-group. Intra-region
finish-to-start deps emitted from cs.DependsOn — same-region only,
never cross-region (per NAMING-CONVENTION §1.3 independent fault
domains, no stretched cluster).
5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
kubeconfig file on Sovereign wipe.
Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.
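The storage and naming convention from steps 2 and 4 can be sketched as follows. This is an illustrative TypeScript sketch only (the handler itself is Go); `kubeconfigPathFor` and `regionNodeId` are hypothetical helper names, and only the path and id shapes are taken from the description above.

```typescript
// Hypothetical helpers; only the path/id shapes come from the commit text.

// An empty region selects the primary slot <dir>/<id>.yaml;
// a non-empty region selects the secondary slot <dir>/<id>-<region>.yaml.
function kubeconfigPathFor(dir: string, deploymentId: string, region: string): string {
  const suffix = region === "" ? "" : `-${region}`;
  return `${dir}/${deploymentId}${suffix}.yaml`;
}

// Secondary-region canvas node id: <depID>:<region>:install-<chart>.
function regionNodeId(deploymentId: string, region: string, chart: string): string {
  return `${deploymentId}:${region}:install-${chart}`;
}
```

Keeping the region in the file name is what lets the watcher discover secondaries with a plain `<id>-*.yaml` scan.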
Caught after operator pointed out that 3-region prov #50 was showing
only 52 install-* nodes (all from fsn1) on the canvas — the
architectural gap.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name /
hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster
autoscaler could attach scale-up VMs to the private network. The
primary CP's templatefile call at main.tf:483-485 was updated, but the
matching call for secondary regions at main.tf:899 was missed.
Result: any provision with regions[] of length > 1 fails at tofu plan
with "vars map does not contain key hcloud_network_name" referenced in
cloudinit-control-plane.tftpl:478.
Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash)
at T+0:47. Forward the same three resource refs to every secondary
region's templatefile call.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator-reported design fix completing #1437/#1440 — the cross-phase
ordering between provisioner and bootstrap-kit groups was either an
M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at
leaf level (post-#1440 with the both-elided skip). Neither was right.
Real design: when a group→group dependency edge is lifted onto the
leaf graph because one or both endpoints elided, cascade ONLY to the
temporal endpoint pair:
upstream_terminals → downstream_initials
Where:
- upstream_terminals = visible descendants of the upstream group
that nothing else in the group depends on (sinks of intra-group
DAG). For the tofu chain this collapses to just cluster-bootstrap.
- downstream_initials = visible descendants of the downstream group
that depend on nothing else in the group (sources of intra-group
DAG). For bootstrap-kit this is install-cilium / install-flux /
install-gateway-api / etc — the install-* roots.
Net result for provisioner→bootstrap-kit at depth=all: a small fan of
edges from cluster-bootstrap to the bp-* roots — the real temporal
gate, no spurious phantom edges, no missing cross-phase chain.
Two call sites updated:
- Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now
cascades to groupTerminals(G) instead of fanOutVisibleChildren(G).
- Outbound: elidedGroup G with G.dependsOn = [D] cascades to
groupInitials(G) on the receive side; D-side cascades to
groupTerminals(D) when D is also elided, or uses D directly when
D is a visible job.
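A minimal sketch of the terminals/initials cascade described above. The `Job` shape and `liftGroupEdge` are assumptions made for illustration, and each `members` array is assumed to already hold only the visible descendants of one group; the real flowLayoutOrganic.ts code may be structured differently.

```typescript
// Hypothetical minimal shape; the real layout types may differ.
interface Job {
  id: string;
  dependsOn: string[]; // ids this job waits on
}

// Sinks of the intra-group DAG: members that no other member depends on.
function groupTerminals(members: Job[]): string[] {
  const ids = members.map((m) => m.id);
  const dependedOn: string[] = [];
  for (const m of members) {
    for (const d of m.dependsOn) {
      if (ids.indexOf(d) !== -1) dependedOn.push(d);
    }
  }
  return ids.filter((id) => dependedOn.indexOf(id) === -1);
}

// Sources of the intra-group DAG: members with no dependency inside the group.
function groupInitials(members: Job[]): string[] {
  const ids = members.map((m) => m.id);
  return members
    .filter((m) => !m.dependsOn.some((d) => ids.indexOf(d) !== -1))
    .map((m) => m.id);
}

// Lift one group→group edge onto the leaf graph: terminals × initials only.
function liftGroupEdge(upstream: Job[], downstream: Job[]): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  for (const from of groupTerminals(upstream)) {
    for (const to of groupInitials(downstream)) {
      edges.push([from, to]);
    }
  }
  return edges;
}
```

For a linear tofu chain ending in cluster-bootstrap, the upstream side collapses to that single job, so the edge count equals the number of install-* roots rather than M×N.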
11/11 flowLayoutOrganic.test.ts pass.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator reported "non of the right click functionalites working
other than the open in new tab". Root cause: the previous handler
only mutated urlFoldedSet, which had no visible effect when the
clicked group was folded by the depth default (same class of bug
toggleFold had before #1439). The menu items also had confusing
labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative).
Rewrite to use the same compose-state pattern toggleFold uses:
- "Show only this group" — switch to depth=all + fold every OTHER
group. Only the clicked group's subtree expands; sibling groups
stay collapsed.
- "Hide this group" — switch to depth=default + add clicked group
to urlFoldedSet. Group renders as a folded bubble; its subtree
hidden.
- "Expand subtree" — switch to depth=all + remove this group and
all its descendant groups from urlFoldedSet. Fully unfolded
subtree.
- "Open in new tab" — unchanged (was working since #1435).
Dropped the misleading "Fold to level N" item (was just stepDepth(-1)).
The depth chip ◀▶ at the top-right is the canonical global depth
control.
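The three rewritten menu items can be read as pure state transitions. The sketch below assumes a hypothetical `FoldState` shape and function names; it follows the wording above literally ("fold every OTHER group") and is not the component's actual code.

```typescript
// Hypothetical state shape; only the depth/urlFoldedSet concepts come from the text.
type Depth = "all" | "default";
interface FoldState {
  depth: Depth;
  urlFoldedSet: string[];
}

// "Show only this group": depth=all, fold every group except the clicked one.
function showOnlyThisGroup(group: string, allGroups: string[]): FoldState {
  return { depth: "all", urlFoldedSet: allGroups.filter((g) => g !== group) };
}

// "Hide this group": depth=default, add the clicked group to urlFoldedSet.
function hideThisGroup(state: FoldState, group: string): FoldState {
  const already = state.urlFoldedSet.indexOf(group) !== -1;
  const folded = already ? state.urlFoldedSet : [...state.urlFoldedSet, group];
  return { depth: "default", urlFoldedSet: folded };
}

// "Expand subtree": depth=all, unfold the group and all its descendant groups.
function expandSubtree(state: FoldState, group: string, descendantGroups: string[]): FoldState {
  const unfold = [group, ...descendantGroups];
  return {
    depth: "all",
    urlFoldedSet: state.urlFoldedSet.filter((g) => unfold.indexOf(g) === -1),
  };
}
```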
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three operator-reported issues from the same dblclick session:
1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx
used absolute /jobs which on contabo resolves to /sovereign/jobs —
the mother's flat /jobs view, NOT the chroot-scoped
/sovereign/provision/<id>/jobs. Operator reported "chroot principle
violation". Fix: chroot-aware /provision/<deploymentId>/jobs when
deploymentId is present.
2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no
edge between them — temporal ordering invisible. Earlier #1437
dropped the group→group edge entirely because the FE layout's
lift-on-elide cascaded it into M×N phantom edges at ?depth=all.
Re-emit the edge AND fix the lift logic in
flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH
endpoints of the elided-group dep are elided. At ?depth=1 the
edge renders between the two folded groups as intended; at
?depth=all both groups elide and the lift is suppressed so the
spurious cascade doesn't reappear. The actual install-* deps are
already visible via each leaf's own dependsOn — skipping the lift
costs no information.
3) (Documented separately) Right-click menu only attaches to GROUP
nodes per design (FlowCanvasOrganic line 1277). When all groups
are elided (?depth=all auto-folds groups out), the menu is
unreachable. The dblclick-on-group fold fix (#1439) makes group
bubbles reachable at ?depth=1 where right-click works.
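The chroot-aware link from item 1 might look like the sketch below. `backToJobsHref` is a hypothetical name, and falling back to the flat /jobs path when no deploymentId is present is an assumption; the text only specifies the deploymentId-present case.

```typescript
// Hypothetical helper. The "/jobs" fallback for a missing deploymentId is an
// assumption; the commit text only specifies the chroot-scoped form.
function backToJobsHref(deploymentId?: string): string {
  if (deploymentId) {
    // Encoding is a defensive addition; hex deployment ids pass through unchanged.
    return `/provision/${encodeURIComponent(deploymentId)}/jobs`;
  }
  return "/jobs";
}
```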
Caught via Playwright after operator reported all three.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
toggleFold previously only mutated urlFoldedSet, which had no effect
when the clicked node was folded BY THE DEPTH DEFAULT (not by an
explicit URL override). Result: at ?depth=1 where both groups are
folded by depth-default, double-clicking bootstrap-kit (after #1438's
dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet
delete didn't change the composed foldedSet, the canvas didn't budge.
New behaviour:
- If clicked node is folded by ANY source: switch to depth=all AND
explicitly fold every OTHER previously-folded group. Only the
clicked group ends up visibly unfolded — exactly the operator-
requested "expand only the respective parent" UX.
- If clicked node is unfolded: add to urlFoldedSet to fold it
without changing depth.
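The new behaviour can be sketched as a pure function. The `FoldState` shape and the `composedFolded` argument (the effective folded set, depth default plus URL overrides) are assumptions for illustration, not the actual toggleFold signature.

```typescript
// Hypothetical state shape; depth is either "all" or a numeric level.
interface FoldState {
  depth: "all" | number;
  urlFoldedSet: string[];
}

// composedFolded: every group currently folded, whatever the source.
function toggleFold(state: FoldState, nodeId: string, composedFolded: string[]): FoldState {
  if (composedFolded.indexOf(nodeId) !== -1) {
    // Folded by any source: go to depth=all and re-fold every OTHER
    // previously-folded group explicitly, so only nodeId opens.
    return { depth: "all", urlFoldedSet: composedFolded.filter((g) => g !== nodeId) };
  }
  // Currently unfolded: fold it via the URL set, depth untouched.
  return { depth: state.depth, urlFoldedSet: [...state.urlFoldedSet, nodeId] };
}
```

Passing the composed set in, rather than only urlFoldedSet, is what removes the no-op at ?depth=1.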
Caught via Playwright after #1438 landed and dblclick still didn't
unfold the clicked group at ?depth=1.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator reported "double-click on a parent bubble it is expanding
all the parent instead of expanding only the respective parent."
Reproduced in Playwright: at ?depth=1 only the 2 group bubbles
render folded; double-click on bootstrap-kit navigated to
/jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page
defaulted to depth=2 → groups elided → all 50 install-* + Phase-0
bubbles rendered. Exactly the "expanding all parents" symptom.
Two fixes:
1) Branch handleNodeDoubleClick: if the bubble is a group, call
toggleFold(nodeId) in place — fold or unfold ONLY that group.
Tree-explorer UX where a leaf double-click drills in but a group
double-click expands/collapses.
2) For the leaf path, preserve window.location.search across the
navigate so the destination page renders with the same depth /
folded filter the operator had on screen. Without this, the new
page defaults to depth=2 and the visible bubble set changes
beneath them.
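Both fixes can be sketched as one decision function. `CanvasNode`, `DoubleClickAction` and `onNodeDoubleClick` are hypothetical names, and the node id is assumed to already be the bare job name; the real handleNodeDoubleClick calls toggleFold and the router directly.

```typescript
// Hypothetical shapes for illustration only.
interface CanvasNode {
  id: string; // assumed to be the bare job name, e.g. "install-cnpg"
  isGroup: boolean;
}

type DoubleClickAction =
  | { kind: "toggleFold"; nodeId: string }
  | { kind: "navigate"; href: string };

function onNodeDoubleClick(
  node: CanvasNode,
  jobsBasePath: string,
  currentSearch: string, // e.g. window.location.search, "?depth=1"
): DoubleClickAction {
  if (node.isGroup) {
    // Group bubble: fold/unfold in place, no navigation.
    return { kind: "toggleFold", nodeId: node.id };
  }
  // Leaf bubble: drill in, carrying the current query string along.
  return {
    kind: "navigate",
    href: `${jobsBasePath}/${encodeURIComponent(node.id)}${currentSearch}`,
  };
}
```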
Caught via Playwright double-click simulation on bootstrap-kit at
?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles)
to .../jobs/bootstrap-kit (50 bubbles).
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound
deps onto EACH of its visible children, and if the dep target is
itself an elided group, fans out to THAT group's visible children
too. With both top-level groups elided at depth=all, the single
group→group finish-to-start edge I added cascades into M×N phantom
edges (each install-* gains a dep on every tofu-* + cluster-bootstrap
step). The operator-reported "install-cnpg has 5 connections from
terraform jobs" was exactly this layout-side fan-out.
Removing the group→group edge leaves Phase-0 and Phase-1 as separate
connected components on the canvas — the correct minimum-edge
rendering. Ordering between phases is implicit in the timestamps +
status flow, not in the edge graph.
Caught by Playwright-probing the canvas after operator pushback: data
side had only the 1 real direct dep (install-flux → install-cnpg)
yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1435 (depth-chip basepath fix) failed CI because removing `to:`
from navigate() narrowed the search reducer's typed return to never,
producing TS2322 on the `Record<string, unknown>` cast.
Forward-fix: bypass TanStack navigate() entirely for the search-only
mutation path. Update window.location's query string via
history.replaceState (preserves pathname verbatim including basepath)
and dispatch a synthetic popstate so TanStack's useSearch picks up
the new query on next render. No TanStack path resolution → no
basepath drop → no colon re-encoding → depth-chip click stops 404ing.
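The URL the search-only path needs can be computed without touching the pathname at all. The sketch below covers only that pure step; `withSearchParam` and `searchOnlyUrl` are hypothetical names, and keys are assumed to be simple unencoded tokens such as depth.

```typescript
// Replace (or add) one query parameter, leaving every other pair as-is.
// Assumes keys in the existing query are plain, unencoded tokens.
function withSearchParam(search: string, key: string, value: string): string {
  const pairs = search.replace(/^\?/, "").split("&").filter((p) => p !== "");
  const kept = pairs.filter((p) => p.split("=")[0] !== key);
  kept.push(`${encodeURIComponent(key)}=${encodeURIComponent(value)}`);
  return `?${kept.join("&")}`;
}

// The pathname is passed through verbatim: no basepath resolution and
// no re-encoding of characters such as ":".
function searchOnlyUrl(pathname: string, search: string, key: string, value: string): string {
  return pathname + withSearchParam(search, key, value);
}
```

In the browser the returned string would be handed to history.replaceState, followed by the synthetic popstate dispatch described above.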
Also carries over from #1435 the open-new-tab fix (window.open of
absolute /sovereign/... ) and the handleNodeDoubleClick fix (strip +
encode jobId).
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface:
1) Clicking the depth chip arrows (◀ / ▶) on
/sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser
to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath
was dropped AND the colon was re-encoded as %3A, both via TanStack's
`to: '.'` path resolution. The new URL 404s at the BE because the
colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup.
Fix: omit `to:` entirely. TanStack treats a search-only navigate as
a pure search-params mutation and preserves the current path verbatim
including the basepath. The colon-prefixed jobId in the URL comes
from older deep-links; the strip-on-click fix landed in #1431.
2) Right-click → "Open in new tab" also passed the raw nodeId
verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror
handleNodeDoubleClick: strip the "<deploymentId>:" prefix,
encodeURIComponent the remainder, AND prepend /sovereign for the
absolute-path window.open (window.open isn't routed through
TanStack so basepath isn't auto-prepended).
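The "Open in new tab" URL construction from item 2 can be sketched as below. `newTabJobUrl` is a hypothetical name; the path shape is taken from the URLs quoted in item 1.

```typescript
// Hypothetical helper: strip "<deploymentId>:", encode the remainder, and
// prepend /sovereign because window.open bypasses the router's basepath.
function newTabJobUrl(deploymentId: string, nodeId: string): string {
  const prefix = `${deploymentId}:`;
  const jobName = nodeId.indexOf(prefix) === 0 ? nodeId.slice(prefix.length) : nodeId;
  return `/sovereign/provision/${deploymentId}/jobs/${encodeURIComponent(jobName)}`;
}
```

A caller would pass the result to `window.open(url, "_blank")`.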
Caught after operator reported "level arrows redirect to wrong URLs
and giving 404" + "right click on a parent bubble … none of the
functions are working properly."
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
helmwatch.Bridge writes SOME Job.DependsOn entries as bare names
("install-flux") rather than the canonical JobID form
("<deploymentId>:install-flux") — 71 such entries observed on prov
bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied
those bare names verbatim into Relationship.fromId. The canvas
reducer matches FlowNode.id by exact string, so the bare-name fromId
became a phantom edge pointing to a non-existent node. In the
force-directed layout these phantom edges visually routed through
the nearest real bubbles, manifesting as 5-edge fan-outs from every
Phase-0 tofu job to every install-* bubble (operator-reported on
install-cnpg, but symmetric across all install-*).
Normalise every fromId to jobs.JobID(deploymentID, dep) form when
the stored value lacks a ":" separator.
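The normalisation rule, sketched in TypeScript for illustration (the emit itself is Go). `jobId` stands in for jobs.JobID and `normaliseFromId` is a hypothetical name.

```typescript
// Stand-in for jobs.JobID: canonical "<deploymentId>:<jobName>" form.
function jobId(deploymentId: string, jobName: string): string {
  return `${deploymentId}:${jobName}`;
}

// Bare names (no ":" separator) are prefixed; canonical ids pass through.
function normaliseFromId(deploymentId: string, dep: string): string {
  return dep.indexOf(":") === -1 ? jobId(deploymentId, dep) : dep;
}
```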
Caught after operator reported "install-cnpg has 5 different
connections from terraform jobs — this is matter of a proper
chaining" — looking at the snapshot showed Job.DependsOn=[install-flux]
without the prefix.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per products/openova-flow/core/src/types.ts line 112:
"contains — toId (parent) contains fromId (child)"
My emit had this inverted: I set FromID=parent, ToID=child, which
made the FE adapter (flowStreamToOrganic.ts line 134) interpret every
install-* leaf as a group containing the bootstrap-kit/provisioner
group nodes. Net result: only 2 bubbles ever rendered on the canvas
regardless of ?depth= because the hierarchy graph was upside-down.
Caught by opening the canvas in a browser via Playwright after the
operator reported "still showing only 2 bubbles, no drill-down".
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the Pod restarts between PutKubeconfig writing the file AND the
next Result.Save() persisting the field, dep.Result.KubeconfigPath
comes back empty even though the file exists at the canonical
convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was
returning 409 watch-not-resumable in this state, which left the
mothership canvas frozen because the live watcher couldn't re-attach
to source HR.spec.dependsOn for the install-* edge derivation.
Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for
PR #1431 restarted catalyst-api Pod, the file
/var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but
RefreshWatch refused to use it because the record field was empty.
Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured
AND a file exists at <dir>/<depID>.yaml, use that path and patch the
record so subsequent /components/state + flow snapshot calls see a
populated field.
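The fallback decision, sketched in TypeScript for illustration (RefreshWatch is Go). `resolveKubeconfigPath` is a hypothetical name, and the file-existence check is injected so the sketch stays free of filesystem calls; patching the record afterwards is left to the caller.

```typescript
// Returns the path to use, or "" when the watch is not resumable.
function resolveKubeconfigPath(
  recordedPath: string,
  kubeconfigsDir: string,
  deploymentId: string,
  fileExists: (path: string) => boolean,
): string {
  if (recordedPath !== "") return recordedPath; // record is intact
  if (kubeconfigsDir === "") return ""; // no directory configured
  // Canonical convention: <kubeconfigsDir>/<deploymentID>.yaml
  const candidate = `${kubeconfigsDir}/${deploymentId}.yaml`;
  return fileExists(candidate) ? candidate : "";
}
```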
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs the operator hit on /sovereign/provision/<id>/jobs:
1) Phase-1 install-* Jobs rendered DISCONNECTED on the canvas —
helmwatch.Bridge doesn't persist Job.DependsOn (only the Phase-0
tofu chain + cluster-bootstrap is wired today). Pull HR.spec.dependsOn
from the live Watcher's informer cache via SnapshotComponents()
(ComponentSnapshot.DependsOn already populated by extractDependsOn)
at snapshot-time and emit finish-to-start edges from upstream
install-<dep> to install-<self>. Also add provisioner→bootstrap-kit
group-to-group finish-to-start so the Phase-0/Phase-1 ordering is
visible on the canvas.
2) Clicking a canvas node → "404 page not found" because
FlowPage.handleNodeDoubleClick passed the full
"<deploymentId>:install-X" id verbatim. The backend Store.GetJob
keys by bare jobName ("install-X"), so the colon-prefixed id missed
exact-match and JobDetail returned 404. Mirror useJobLinkBuilder
(JobsTable.tsx line 364): strip the "<deploymentId>:" prefix and
encodeURIComponent the remainder before pushing to the router.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template
Without this env the proxy resolveFlowServerURL() falls back to
per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>) which
only exists on Sovereigns that already installed bootstrap-kit slot 56
with httproute=enabled. Every other catalyst-api deployment (mothership
contabo + Sovereigns that haven't reached cutover yet) returns 502 on
/api/v1/flows/{deploymentId}/snapshot. This is the live regression the
founder saw at console.openova.io: "No nodes to render."
The env points at the in-cluster Service DNS for the LOCAL
openova-flow-server. Both the mothership (catalyst-system or catalyst
namespace) and each Sovereign chroot run the bp-openova-flow-server chart
with a local Service, so this URL is correct for every cluster
catalyst-api runs in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy
Mothership canvas at /sovereign/provision/<id>/jobs was empty for the
first ~30 minutes of every fresh provision because the snapshot
endpoint went straight to https://openova-flow.<sovereignFQDN> which
can't serve until cilium + cert-manager + the HTTPRoute TLS cert are
all up on the chroot. The Phase-0 + Phase-1 lifecycle Jobs catalyst-api
ALREADY owns (tofu-init/plan/apply/output, flux-bootstrap,
install-bp-<chart>, ...) were invisible the whole time.
This change adds flowSnapshotFromJobs which assembles the canonical
FlowMessage envelope from h.jobsStore().ListJobs(deploymentID) — every
Job becomes a FlowNode with the legacy <deploymentId>:<jobName> id form
the canvas drill-down already expects, every Job.DependsOn becomes a
finish-to-start Relationship, every Job.ParentID becomes a contains
Relationship. HandleFlowSnapshot checks the local store first and
returns immediately when it has data; otherwise falls through to the
existing upstream proxy path.
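The Job-to-envelope mapping can be sketched as below, in TypeScript for illustration (flowSnapshotFromJobs is Go). All type and field names are assumptions, as is the premise that dependsOn and parentId hold bare job names. The contains direction follows the types.ts comment quoted elsewhere in this log: toId (parent) contains fromId (child).

```typescript
// Hypothetical shapes; the real Job and FlowMessage types carry more fields.
interface Job {
  name: string;
  status: string;
  dependsOn: string[];
  parentId?: string;
}
interface FlowNode {
  id: string;
  label: string;
  status: string;
}
interface Relationship {
  type: "finish-to-start" | "contains";
  fromId: string;
  toId: string;
}
interface Snapshot {
  nodes: FlowNode[];
  relationships: Relationship[];
}

function snapshotFromJobs(deploymentId: string, jobs: Job[]): Snapshot {
  // Legacy id form the canvas drill-down expects: <deploymentId>:<jobName>.
  const idOf = (name: string): string => `${deploymentId}:${name}`;
  const nodes = jobs.map((j) => ({ id: idOf(j.name), label: j.name, status: j.status }));
  const relationships: Relationship[] = [];
  for (const j of jobs) {
    for (const dep of j.dependsOn) {
      // Upstream dependency finishes before this job starts.
      relationships.push({ type: "finish-to-start", fromId: idOf(dep), toId: idOf(j.name) });
    }
    if (j.parentId !== undefined) {
      // contains: toId is the parent, fromId is the child.
      relationships.push({ type: "contains", fromId: idOf(j.name), toId: idOf(j.parentId) });
    }
  }
  return { nodes, relationships };
}
```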
HandleFlowStream gets the same treatment via flowStreamLocal: emit a
snapshot frame on connect AND every 3 seconds thereafter, plus a 15s
heartbeat. The OpenovaFlow consumer's reducer is idempotent on
snapshot replay so re-emitting an unchanged envelope is harmless;
in exchange the canvas reflects Job state transitions within ~3s
of when helmwatch.Bridge writes them.
No FE change required — the same /api/v1/flows/<id>/snapshot and
/stream endpoints serve the same envelope shape the chroot adapter
emits (products/openova-flow/adapter-flux/internal/types/flow.go),
named SSE events including 'snapshot' and 'heartbeat'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.
Two lock-step changes widen both bounds:
1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
chart genuinely needs >15m worst case when the full SME + Catalyst
service stack rolls cold.
2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
watch never terminates while helm-controller still has remediation
attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
was already wired (issue #538 baseline) — chart template now
declares the explicit "120m" value so the runtime knob is
discoverable for capacity-bounded environments. Per
INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable.
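
The budget arithmetic above can be checked with a short sketch (Python,
illustrative only; the function names are hypothetical, and the attempt
count of 3 is taken from the "30m × 3" figure in this message):

```python
from datetime import timedelta


def worst_case_retry_chain(action_timeout: timedelta, attempts: int) -> timedelta:
    # Each remediation attempt may run for the full install/upgrade timeout.
    return action_timeout * attempts


def outer_budget_covers(watch_timeout: timedelta,
                        action_timeout: timedelta,
                        attempts: int) -> bool:
    # The outer watch must strictly outlast the inner retry chain.
    return watch_timeout > worst_case_retry_chain(action_timeout, attempts)


hr_timeout = timedelta(minutes=30)

# New pair: 30m x 3 = 90m fits inside the 120m watch.
print(outer_budget_covers(timedelta(minutes=120), hr_timeout, 3))

# Raising only the HR timeout would leave the old 60m watch too short.
print(outer_budget_covers(timedelta(minutes=60), hr_timeout, 3))
```

The second call is why the two changes have to land in lock-step.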
New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override + field-
override tests still pass unchanged.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (autoscaler pod log, prov #43 chroot):
W orchestrator.go:626 Node group workers is not ready for scaleup -
backoff with status: Scale-up timed out for node group workers after
15m2.273255226s
Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY:
workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[]
workers-a6410e81b24cced public_net.ipv4=178.105.73.210 private_net=[]
The worker cloud-init (identical to Phase-0 user_data) issues
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh -
against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment
that URL is unreachable → k3s agent install fails silently → node never
registers with apiserver → autoscaler 15m timeout → backoff →
bp-catalyst-platform Pending Pods never schedulable → chroot canvas
tests blocked.
Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on
the cluster-autoscaler deployment so the Hetzner provider attaches every
scale-up VM to the SAME private network + firewall + ssh-key the Phase-0
Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net /
-fw / catalyst-<sov-fqdn-with-dashes>). Names flow:
Tofu (hcloud_network.main.name + hcloud_firewall.main.name +
hcloud_ssh_key.main.name)
→ cloudinit-control-plane.tftpl (3 new template vars)
→ /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys)
→ flux-system/cloud-credentials Secret
→ bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries
with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*)
→ upstream chart's deployment env
Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent
regression of the three env-var slots in chart values.yaml.
Reaffirms canonical seam: values flow through Tofu → cloud-init →
flux-system Secret → Flux valuesFrom → chart values → upstream env.
Never via kubectl patch, never via bespoke Go API calls.
Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):
bp-catalyst-platform HR install.timeout=15m
→ Helm pre-install hook: qa-finalizer-strip Job (weight -99)
→ Pod requests 50m CPU + 64Mi memory (tiny)
→ BUT no tolerations → scheduler restricted to worker
→ worker cpx32 (8vCPU/16GB) at 99% CPU requests
(7980m of 8000m allocated) after bootstrap-kit fan-out
→ FailedScheduling: "0/2 nodes are available: 1
Insufficient cpu, 1 node(s) had untolerated taint
{node-role.kubernetes.io/control-plane: true}"
→ autoscaler triggers scale-up worker 2→3 → "1 in backoff
after failed scale-up" → still Pending → 15m timeout
→ InstallFailed → Flux uninstall+rollback → installFailures: 3
→ Flux gives up entirely
Live evidence quoted from chroot kubeconfig on prov #41:
- bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
message="Running 'install' action with timeout of 15m0s"`
- HR `Released=False, reason=InstallFailed, message="Helm install
failed for release catalyst-system/catalyst-platform with chart
bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
* timed out waiting for the condition"`
- Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
`Warning FailedScheduling 108s default-scheduler 0/2 nodes are
available: 1 Insufficient cpu, 1 node(s) had untolerated taint
{node-role.kubernetes.io/control-plane: true}`
- Worker `Allocated cpu 7980m (99%) of 8000m capacity`
- Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)
Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
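
A deliberately simplified model of the scheduling decision, using the
prov #41 numbers quoted above (Python, illustrative only). It checks CPU
requests and taints and nothing else; the real kube-scheduler evaluates
far more, and priorityClassName is not modelled here. The dict layout
and the `schedulable` helper are hypothetical:

```python
CP_TAINT = "node-role.kubernetes.io/control-plane:NoSchedule"

# Allocated CPU figures are the ones quoted from prov #41.
worker = {"capacity_m": 8000, "allocated_m": 7980, "taints": []}
control_plane = {"capacity_m": 8000, "allocated_m": 635, "taints": [CP_TAINT]}


def schedulable(node: dict, pod: dict) -> bool:
    """True when the pod tolerates every taint and its CPU request fits."""
    untolerated = [t for t in node["taints"] if t not in pod["tolerations"]]
    free_m = node["capacity_m"] - node["allocated_m"]
    return not untolerated and free_m >= pod["cpu_request_m"]


hook_before = {"cpu_request_m": 50, "tolerations": []}
hook_after = {"cpu_request_m": 50, "tolerations": [CP_TAINT]}

# Before the fix: the worker has 20m free, the CP taint is untolerated.
print([schedulable(n, hook_before) for n in (worker, control_plane)])

# After the fix: the control-plane node (7365m free) accepts the hook.
print([schedulable(n, hook_after) for n in (worker, control_plane)])
```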
Why prior fixes didn't suffice:
- Fix#114 introduced this hook to break a finalizer-deadlock loop
on prov #9. Correct fix for that wedge; never anticipated worker
saturation as a scheduling failure mode for the hook itself.
- Fix#138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
qa-cnpg-status-seed hooks (weight 0/post-install) to regular
release resources to break a circular DAG dep. Different hook
surface.
- Fix#184 (chart 1.4.140) raised the gitea-token-mint pre-install
hook (weight +10) wait budget for cold-start autoscaler. That
hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
never starts, the +10 hook never runs.
Recurring class: same family as Fix#114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:
- Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
redirect for deprecated Bitnami images, 2025-08 cutover
documented at
platform/self-sovereign-cutover/chart/values.yaml:252) →
harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4,
the canonical alpine-based kubectl image already used by sibling
hook catalyst-gitea-token-mint (Fix#163). Per the MIRROR-EVERYTHING +
ARCHITECT-FIRST rules.
Coordinator follow-up tickets:
- Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
(qa-cnpgpair-status-seed) still reference
bitnamilegacy/kubectl:1.29.3, the same Bitnami-deprecation class.
Out of scope for this Fix (not part of the recurrence cluster);
flagged for a sweep.
- Worker cpx32 may be undersized for the bootstrap-kit fan-out on
omantel.biz; separate sizing ticket, not blocking.
Changes:
- products/catalyst/chart/templates/qa-fixtures/pre-install-finalizer-strip.yaml:
add tolerations + priorityClassName; switch image to
alpine/k8s:1.31.4. Inline doc comments explain the 4-layer trace
and the Fix #114/#138/#184 history.
- products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
changelog entry capturing root cause + budget arithmetic.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
bump HR pin 1.4.140 → 1.4.141.
Verification:
- helm template renders cleanly (exit 0, ~6700 lines).
- kubectl apply --dry-run=client validates the rendered Job
manifest (job.batch/qa-finalizer-strip created (dry run)).
- Rendered Job contains tolerations[control-plane Exists NoSchedule],
priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix#144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook.
That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12)
both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT
`DeadlineExceeded`. The deadline never got a chance to fire.
Trace from prov #38 chroot (`KUBECONFIG=/tmp/prov38.kubeconfig kubectl
get hr bp-powerdns -o yaml`):
status:
Helm install failed for release powerdns/powerdns with chart
bp-powerdns@1.2.2: failed post-install: 1 error occurred:
* job powerdns-zone-bootstrap failed: BackoffLimitExceeded
Pod events for powerdns-zone-bootstrap-tq7qq:
59m Started container zone-bootstrap
56m Back-off restarting failed container zone-bootstrap
55m Job has reached the specified backoff limit
Root cause walked end-to-end (per CLAUDE.md TRACE rule):
TEST: bp-powerdns HR Ready=True
↑
HR: Helm install succeeds (post-install Job exits 0)
↑
Zone-bootstrap Job: curl POST succeeds
↑
powerdns:8081 Service: reachable (has Ready endpoints)
↑
powerdns Deployment: Pods Ready (3 replicas) ← Pending, blocked here
↑
CNPG cluster: pdns-pg-app Secret exists
↑
pdns-pg-1-initdb Pod: scheduled, Running, Completed ← Pending too
↑
Worker node has capacity ← 99% CPU requested
The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded` — well before activeDeadlineSeconds=840s
(14min) could even consider firing.
Fix#144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.
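
A rough check of the "≈10min" figure, assuming retry delays that double
from a 10s base (the documented Kubernetes pattern) and ignoring caps and
container runtime. The helper name is hypothetical and the delays on
prov #38 were not measured; this is arithmetic only (Python,
illustrative):

```python
def cumulative_backoff_seconds(retries: int, base_s: int = 10) -> int:
    # 10s + 20s + 40s + ... for the given number of retries.
    return sum(base_s * 2**i for i in range(retries))


ACTIVE_DEADLINE_S = 840  # zoneBootstrap.activeDeadlineSeconds after Fix#144

total = cumulative_backoff_seconds(6)
print(total, total < ACTIVE_DEADLINE_S)
```

Under these assumptions six retries accumulate 630s of delay, which is
below the 840s deadline, so the backoff limit is exhausted first.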
Architectural fix (this commit):
1. Move the wait-for-API loop INSIDE the container (one Pod, one inner
poll loop, restartPolicy=Never). The inner loop polls
GET /api/v1/servers every 10s until HTTP 200, bounded by new
`apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container
run owns the full wait budget instead of N short-lived containers
racing the backoff timer.
2. restartPolicy: OnFailure → Never. The container script handles its
own retry; Kubernetes-level backoff is reserved for genuinely
transient pod failures (image-pull, OS eviction) where the Job-level
backoffLimit=6 still triggers a fresh Pod.
3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower
clusters can raise the inner deadline without forking the chart
(per docs/INVIOLABLE-PRINCIPLES.md #4).
4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s).
Sits below activeDeadlineSeconds (840s) so the zone-creation phase
keeps ≥240s of headroom AFTER the API comes Ready.
Curl status handling in the wait loop:
200 → API up, proceed to bootstrap
401|403 → auth failure, FATAL (no retry — operator misconfig)
000|5xx|... → transient, sleep & retry until inner deadline
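
The status handling above, sketched as a bounded poll loop (Python,
illustrative only; the shipped loop is a shell script inside the Job
container). `probe` is a hypothetical callable returning the HTTP status,
with 0 standing in for curl's 000; `sleep` and `clock` are injectable so
the loop can be exercised without waiting:

```python
import time


def classify(status: int) -> str:
    if status == 200:
        return "ready"   # API up, proceed to bootstrap
    if status in (401, 403):
        return "fatal"   # auth failure, no retry
    return "retry"       # 000, 5xx, anything else: transient


def wait_for_api(probe, timeout_s=600, interval_s=10,
                 sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll until ready (True) or until the inner deadline passes (False)."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        verdict = classify(probe())
        if verdict == "ready":
            return True
        if verdict == "fatal":
            raise PermissionError("API rejected credentials; not retrying")
        sleep(interval_s)
    return False
```

The defaults mirror apiReadyTimeoutSeconds=600 and the 10s poll interval
named in this message.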
Files changed:
- platform/powerdns/chart/Chart.yaml 1.2.2 → 1.2.3 + history
- platform/powerdns/chart/values.yaml + apiReadyTimeoutSeconds knob
- platform/powerdns/chart/templates/
zone-bootstrap-job.yaml inner wait-for-API loop;
restartPolicy: Never
- clusters/_template/bootstrap-kit/
11-powerdns.yaml pin to 1.2.3 + HR comment
Why this is sufficient where Fix#144 was not:
Fix#144 worked the chart-level deadline. This commit works the
inner-loop ownership — the wait budget is now owned by the script
inside the container, not by the Job spec arithmetic
(backoffLimit × backoff-delay). The Job's outer activeDeadlineSeconds
still caps the worst-case runtime (no runaway poll), but the script
now actually GETS to use it.
Verification:
- helm template renders cleanly (deps build OK, empty-zones short-
circuit preserved, non-empty zones render Job + RBAC + Audit CM)
- kubectl create --dry-run=client --validate=false: 5/5 resources
created (sa, role, rb, cm, job)
- chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml
Companion infrastructure note (NOT addressed by this commit, flagged
for Coordinator):
The DEEPER bottom of the trace stack is worker capacity. Prov #38's
single cpx32 worker (8 vCPU / 16 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because
two unscheduled pods (gitea/gitea-* PV affinity conflict from a
previous wedged install; trivy-system/node-collector NodeAffinity)
poison the autoscaler's "can the template node fit" check. Even with
this chart fix in place, the powerdns Deployment cannot become Ready
until either:
(a) the worker autoscales successfully (gitea PV migrated / trivy
taints relaxed), or
(b) worker_count is bumped from 2 to 3 in the provisioning body, or
(c) qa_worker_size is bumped to cpx42.
This chart fix ensures bp-powerdns survives a slow CNPG cold-start.
It does NOT fix a fundamentally undersized cluster. Coordinator next
step: reprovision with worker_count=3 OR qa_worker_size=cpx42, with this
chart fix landed. Either should converge.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chroot bp-openova-flow-emitter posts to
http://openova-flow-server.catalyst-system.svc.cluster.local:8080
but the bp-openova-flow-server chart's Service is exposed on :80
(port: 80 → targetPort: 8080, standard Kubernetes Service indirection).
Result: every event POST from the chroot emitter times out on dial, the
chroot's openova-flow data plane never populates, and canvas pages
viewing the chroot render empty.
Same fix as PR #124 on mothership emitter-helmrelease.yaml (private
repo). Slot 57 in the bootstrap-kit template was missed in that round.
Live regression on prov #37 (2026-05-11): chroot has 38 bp-* HRs True
but openova-flow snapshot is empty because emitter can't reach server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TC-035 (iter-2, 2026-05-11): OpenovaFlow rows merged into JobsPage
(PR #1413) lost their region-prefixed identity in the URL. The link
builder sliced the "<prefix>:" segment off every id with a colon —
intended to strip the legacy "<deploymentId>:install-keycloak" form,
but it also stripped "contabo:bp-openova-flow-server" → bare
"bp-openova-flow-server" in the href. The matrix asserts the
verbatim form "/jobs/contabo:bp-openova-flow-server" must appear in
the rendered DOM.
Fix: stop slicing. `encodeURIComponent` still escapes unsafe path
chars (`/` for live K8s job ids like "job/syft-grype/..."), then we
restore `:` because RFC 3986 permits it as a path-segment `pchar`.
FlowPage canvas navigation (PR #1411) and JobDetail flow-fallback
(PR #1412) already pass on the colon-present form, so this round-
trips end-to-end. Legacy "bp-cilium" / "cluster-bootstrap" hrefs are
unchanged (no `:` to encode). The previously-stripped legacy form
"<deploymentId>:install-keycloak" now lands as the full id in the
URL, and JobDetail's `jobsById` lookup is already keyed by BOTH the
canonical id AND the bare jobName (JobDetail.tsx:124-131), so the
resolution path is preserved.
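
The encode-then-restore step can be sketched as follows (Python,
illustrative only; the real link builder is TypeScript). `job_href` is a
hypothetical name, and `urllib.parse.quote` with `safe=""` only
approximates `encodeURIComponent`: the two differ on a few punctuation
characters, though not on `/` or `:`:

```python
from urllib.parse import quote


def job_href(job_id: str) -> str:
    # Percent-encode everything unsafe in a path segment, then put the
    # colon back, since RFC 3986 allows ":" as a path-segment pchar.
    encoded = quote(job_id, safe="")
    return "/jobs/" + encoded.replace("%3A", ":")


print(job_href("contabo:bp-openova-flow-server"))
print(job_href("job/syft-grype/scan"))  # "scan" is a made-up suffix
print(job_href("bp-cilium"))
```

The region-prefixed id keeps its colon, slashes in live K8s job ids stay
escaped, and colon-free legacy ids pass through unchanged.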
Test coverage: new Case 4 in JobsPage.flow-merge.test.tsx asserts
the openova-flow row's anchor `href` contains
`/jobs/contabo:bp-openova-flow-server` and is NOT the bare-jobName
form. All 4 flow-merge cases PASS. The 3 pre-existing failures in
JobsPage.test.tsx (back-to-apps href, canonical-columns header,
Show-as-Flow button) are the documented iter-2 baseline — untouched
by this change.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>