openova

Author	SHA1	Message	Date
github-actions[bot]	cd5ace8dcb	deploy: update catalyst images to `32e0b40`	2026-05-13 15:42:13 +00:00
e3mrah	32e0b408bf	fix(k3s): add public IP --tls-san + openova.io/region node label (#1459) Two related fixes for multi-region + qa-fixtures DoD on prov #64: 1. k3s TLS cert needs the public IPv4 in SAN. Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP (cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s auto-generates the server cert with SANs from --tls-san flags. We only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2 + cluster-ip + 127.0.0.1 only. Bridge connection from contabo rejected with: "x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1, ::1, not 204.168.212.113" → silent watcher failure → 0 secondary HRs observed → canvas missing region sub-groups. Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before k3s install, add it as --tls-san=$CP_PUBLIC_IPV4. 2. openova.io/region=hz-fsn-rtz-prod node label. qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs, qa-wp Application) carry hard nodeAffinity for `openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion default in products/catalyst/chart/templates/qa-fixtures/*.yaml). Without the label every fixture pod FailedScheduling → bp-catalyst-platform post-install hook waits forever → bootstrap-kit chain hangs at 44/45 with bp-catalyst-platform Running. Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP (qa-fixtures pin to primary by design). Both shipped in same commit since both are inside the same k3s server install line. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:38:25 +04:00
github-actions[bot]	55edb953d5	deploy: update catalyst images to `44913d8`	2026-05-13 14:40:02 +00:00
e3mrah	44913d8a6a	fix(k3s): --kubelet-arg=max-pods=220 (CP + worker) for qa-fixtures load (#1458) prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook timed out because the catalyst-api Helm-released pod stayed Pending with "Too many pods. 0/1 nodes are available". k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/flux/cnpg sidecars saturate the slot cleanly. With workers NotReady on prov #63 the CP carried everything alone and dropped scheduling at 110. Bump to 220 on both CP and worker so the saturation point doesn't gate the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU + 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit weight. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 18:37:42 +04:00
github-actions[bot]	b6e6470ccf	deploy: update catalyst images to `5f4f9f2`	2026-05-13 14:01:04 +00:00
e3mrah	5f4f9f2cb5	fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457) prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects its node IP from the primary interface, which on Hetzner cpx52 binds to the public IPv4 (49.x.x.x) instead of the private network IP (10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there; nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the private IP from cilium-config k8sServiceHost — times out, CrashLoop. Worked by luck on cpx42 (earlier kernel + Hetzner network attach timing). cpx52 reproduces 100%. Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip} in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443 (cilium-config substitute) find the API server every time. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:34:30 +04:00
e3mrah	6fac1481d3	fix(catalyst-api): bump memory limit 1Gi → 4Gi for multi-region snapshot load (#1456) prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during phase-1 watch on a 3-region Sovereign. The in-memory state has grown substantially since the 1Gi limit was set: - 1 primary helmwatch.Watcher (45 HRs + informer cache) - N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each with its own informer cache) - jobs.Store backed by on-disk + in-memory tree - per-/snapshot poll: composes per-region region groups across all Job rows + cross-references hrDeps from the live primary watcher Combined steady-state exceeds 1Gi on cpx32-equivalent loads. Bumped limits to 4Gi (request 512Mi up from 128Mi). The mothership node has 8GB+ resident, no other tight constraint. Future fix: persist region in Job rows so secondary watchers don't need to be retained post phase-1 (orthogonal cleanup). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:20:00 +04:00
github-actions[bot]	2c6374b200	deploy: update catalyst images to `8518bb1`	2026-05-13 12:48:59 +00:00
e3mrah	8518bb1f50	fix(flow_snapshot): drop duplicate live-watcher multi-region block (#1455) * fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a running secondary-region install-* row on /sovereign/provision/<id>/jobs landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover and returned "404 page not found". Root cause: useJobLinkBuilder was passing the FULL canvas JobID form through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping the "<deploymentId>:" prefix. The canvas emits ids like "<deploymentId>:install-X" (single-region) or "<deploymentId>:<region>:install-X" (multi-region, see flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName — exact-match URL lookup of the prefix-bearing form misses every time. FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs row click and a canvas drill-down resolve to the SAME backend endpoint. The existing JobsTable row-link test uses a job.id with no `:` prefix, so the strip is a no-op for that fixture and the `/jobs/job-install-cilium` assertion still holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow_snapshot_local): derive region from persisted JobName, synth region groups Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-* leaves as direct children of bootstrap-kit (no region sub-groups visible), and the provision-hetzner→bootstrap-kit edge fans M×N across all 135. Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits events with `ev.Component = region + "/" + componentName`. The jobs bridge persists them with `JobName=install-<region>/<chart>` and `AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no region awareness). 
After phase 1 terminates the deferred stopSecondaries() clears `dep.secondaryWatchers`, so the multi-region snapshot block (line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op. flowSnapshotFromJobs then emits all 135 install Jobs flat under bootstrap-kit, no Region field set, no region group bubbles, and flowLayoutOrganic.ts's temporal-endpoint cascade fans the provisioner→bootstrap-kit edge onto all 135 because there's no intermediate region group to absorb it. Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical multi-region prefix marker), derive the region key, set FlowNode.Region, and re-parent to a synthesised "<deploymentId>:<region>:bootstrap-kit" group. After the loop, synthesise one bootstrap-kit sub-group node per discovered region with a `contains` edge to the parent bootstrap-kit. The resulting shape: bootstrap-kit ├── 45 primary install-* (legacy parent, no region) ├── <region-A>:bootstrap-kit ── 45 install-* (region tagged) └── <region-B>:bootstrap-kit ── 45 install-* (region tagged) This persists ACROSS phase-1 termination because the source of truth is jobs.Store (durable), not dep.secondaryWatchers (transient). The multi-region block (line 408+) still runs WHEN secondary watchers are alive (during phase 1) — it emits ADDITIONAL FlowNodes with "<deploymentId>:<region>:install-X" IDs distinct from the persisted "<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't collide. Post-phase-1 the watchers clear and only the persisted-Job path remains, but now WITH region structure preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow_snapshot): remove duplicate live-watcher multi-region block PR #1454 added region-group synthesis from persisted Job rows. 
The old secondaryWatchers-based block at line 442+ emitted nodes with the SAME region-group IDs AND child nodes, so during phase 1 (when both paths are live) the snapshot rendered with 90 children per region group instead of 45 — visible on prov #61 (2e197a934a0e0461): bootstrap-kit: 49 children hel1-2:bootstrap-kit: 90 children (should be 45) nbg1-1:bootstrap-kit: 90 children (should be 45) Plus the region groups appeared twice in the node list. Root cause: the per-Job loop (PR #1454) and the legacy block both write to the same region-group IDs without deduping. The per-Job path covers the persisted-Job state (durable across phase-1 termination), so the live-watcher path is redundant. Fix: delete the legacy block. The earlier secondaryWatchers-snapshot-into-map work (lines 182-205) is kept because that path also reads dep.liveWatcher (primary) for the hrDeps lookup the per-Job loop uses for primary-region dep edges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:47:00 +04:00
github-actions[bot]	ed4f66438f	deploy: update catalyst images to `d9d7fa2`	2026-05-13 12:26:59 +00:00
e3mrah	d9d7fa2baa	fix(flow_snapshot): derive region from persisted JobName, synth region groups (#1454) * fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a running secondary-region install-* row on /sovereign/provision/<id>/jobs landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover and returned "404 page not found". Root cause: useJobLinkBuilder was passing the FULL canvas JobID form through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping the "<deploymentId>:" prefix. The canvas emits ids like "<deploymentId>:install-X" (single-region) or "<deploymentId>:<region>:install-X" (multi-region, see flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName — exact-match URL lookup of the prefix-bearing form misses every time. FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs row click and a canvas drill-down resolve to the SAME backend endpoint. The existing JobsTable row-link test uses a job.id with no `:` prefix, so the strip is a no-op for that fixture and the `/jobs/job-install-cilium` assertion still holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow_snapshot_local): derive region from persisted JobName, synth region groups Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-* leaves as direct children of bootstrap-kit (no region sub-groups visible), and the provision-hetzner→bootstrap-kit edge fans M×N across all 135. Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits events with `ev.Component = region + "/" + componentName`. The jobs bridge persists them with `JobName=install-<region>/<chart>` and `AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no region awareness). 
After phase 1 terminates the deferred stopSecondaries() clears `dep.secondaryWatchers`, so the multi-region snapshot block (line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op. flowSnapshotFromJobs then emits all 135 install Jobs flat under bootstrap-kit, no Region field set, no region group bubbles, and flowLayoutOrganic.ts's temporal-endpoint cascade fans the provisioner→bootstrap-kit edge onto all 135 because there's no intermediate region group to absorb it. Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical multi-region prefix marker), derive the region key, set FlowNode.Region, and re-parent to a synthesised "<deploymentId>:<region>:bootstrap-kit" group. After the loop, synthesise one bootstrap-kit sub-group node per discovered region with a `contains` edge to the parent bootstrap-kit. The resulting shape: bootstrap-kit ├── 45 primary install-* (legacy parent, no region) ├── <region-A>:bootstrap-kit ── 45 install-* (region tagged) └── <region-B>:bootstrap-kit ── 45 install-* (region tagged) This persists ACROSS phase-1 termination because the source of truth is jobs.Store (durable), not dep.secondaryWatchers (transient). The multi-region block (line 408+) still runs WHEN secondary watchers are alive (during phase 1) — it emits ADDITIONAL FlowNodes with "<deploymentId>:<region>:install-X" IDs distinct from the persisted "<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't collide. Post-phase-1 the watchers clear and only the persisted-Job path remains, but now WITH region structure preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:24:20 +04:00
github-actions[bot]	6f50bc0a4a	deploy: update catalyst images to `3a08c23`	2026-05-13 12:05:56 +00:00
e3mrah	3a08c23ae4	fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) (#1453) Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a running secondary-region install-* row on /sovereign/provision/<id>/jobs landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover and returned "404 page not found". Root cause: useJobLinkBuilder was passing the FULL canvas JobID form through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping the "<deploymentId>:" prefix. The canvas emits ids like "<deploymentId>:install-X" (single-region) or "<deploymentId>:<region>:install-X" (multi-region, see flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName — exact-match URL lookup of the prefix-bearing form misses every time. FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs row click and a canvas drill-down resolve to the SAME backend endpoint. The existing JobsTable row-link test uses a job.id with no `:` prefix, so the strip is a no-op for that fixture and the `/jobs/job-install-cilium` assertion still holds. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:03:47 +04:00
github-actions[bot]	f1d77fc9bb	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.20	2026-05-12 18:53:16 +00:00
e3mrah	64876c0de3	fix(bp-guacamole): render.sh resource count 15→19 unblocks Blueprint Release (#1451) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops: fatal: failed to start: daemon creation failed: unable to initialize BPF masquerade support: BPF masquerade requires NodePort (--enable-node-port="true") Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. 
Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup. CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready. Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade. Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with: Internal error occurred: failed calling webhook "mcluster.cnpg.io": no endpoints available for service "cnpg-webhook-service" Chain: 1. bp-cnpg install with disableWait: true → HR goes Ready immediately when manifests apply (operator pod still spinning up). 2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the dependsOn check on bp-cnpg. 3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs. 4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints yet → admission webhook call fails → Helm install fails → RetriesExceeded → entire DB-backed chain wedges. Carve out the disableWait: true blanket for bp-cnpg specifically. 
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule, so Helm-wait blocks only on pod readiness, not on a self-referencing CRD. With this change bp-cnpg's HR stays Reconciling until cnpg-controller-manager + cnpg-webhook-service are both rolled + Available, so Flux dependsOn correctly gates downstream consumers behind a webhook that's actually serving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-guacamole): render.sh expects 19 resources (Fix #125 bootstrap Job) Fix #125's guacamole-oidc bootstrap Job added 4 K8s resources to the chart's full-ON render (1 Job + 1 ServiceAccount + 1 Role + 1 RoleBinding) but render.sh's expect_total was never bumped from 15 → 19. Every Blueprint Release run since `5b711427` fails the test and bails before publishing the chart to GHCR. Consequence: Build bp-guacamole's mirror job successfully mirrors upstream images + bumps Chart.yaml to 0.1.13/0.1.14/.../0.1.18/0.1.19, but the chained Blueprint Release on each bump commit fails render.sh and never publishes. GHCR is stuck at 0.1.12. Bootstrap-kit overlay HRs pinned to anything beyond 0.1.12 wedge with: failed to download chart for remote reference: failed to get 'oci://ghcr.io/openova-io/bp-guacamole:0.1.17': not found Caught on prov #58 (d4f60afe4f13aee9, 2026-05-12) when bp-guacamole HR went False with that exact error across all 3 regions. Also bump bootstrap-kit overlay version pin 0.1.17 → 0.1.19 so the catch-up Blueprint Release (triggered by this commit) lands a tag the overlay actually references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:52:41 +04:00
github-actions[bot]	500fd47aee	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.19	2026-05-12 18:50:01 +00:00
e3mrah	855e106d87	fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race (#1450) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops: fatal: failed to start: daemon creation failed: unable to initialize BPF masquerade support: BPF masquerade requires NodePort (--enable-node-port="true") Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. 
Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup. CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready. Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade. Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with: Internal error occurred: failed calling webhook "mcluster.cnpg.io": no endpoints available for service "cnpg-webhook-service" Chain: 1. bp-cnpg install with disableWait: true → HR goes Ready immediately when manifests apply (operator pod still spinning up). 2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the dependsOn check on bp-cnpg. 3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs. 4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints yet → admission webhook call fails → Helm install fails → RetriesExceeded → entire DB-backed chain wedges. Carve out the disableWait: true blanket for bp-cnpg specifically. 
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule, so Helm-wait blocks only on pod readiness, not on a self-referencing CRD. With this change bp-cnpg's HR stays Reconciling until cnpg-controller-manager + cnpg-webhook-service are both rolled + Available, so Flux dependsOn correctly gates downstream consumers behind a webhook that's actually serving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:23:04 +04:00
e3mrah	fb563e9fd6	fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) (#1449) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops: fatal: failed to start: daemon creation failed: unable to initialize BPF masquerade support: BPF masquerade requires NodePort (--enable-node-port="true") Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. 
Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup. CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready. Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade. Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 21:05:01 +04:00
github-actions[bot]	16f41bef56	deploy: update catalyst images to `68372d7`	2026-05-12 16:13:41 +00:00
e3mrah	68372d700b	fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:11:23 +04:00
github-actions[bot]	1c6e82b83b	deploy: update catalyst images to `be47815`	2026-05-12 16:03:56 +00:00
e3mrah	be47815ddf	fix(infra): pass cp_private_ip to primary CP templatefile too (#1447) PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:01:43 +04:00
github-actions[bot]	034da82c00	deploy: update catalyst images to `cdcc50a`	2026-05-12 15:58:30 +00:00
e3mrah	cdcc50a213	fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446) Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3 "no stretched fault domain". Cilium on each region MUST talk to its OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites hardcoded the primary's IP: 1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665): `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2). 2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}` so each region's k3s API cert validates against the LOCAL CP's IP. 3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml): add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR values so Flux postBuild.substitute can override per region. The cloud-init Kustomization renders the substitute var to `${cp_private_ip}`. Single-region (primary-only) provisions fall back to the default `10.0.1.2` and stay byte-identical to today. Live evidence of the bug — prov #52 (3-region) on 2026-05-12: cilium-operator on nbg1 secondary: "Establishing connection to apiserver" host="https://10.0.1.2:6443" "failed to start: ... tls: failed to verify certificate: x509: certificate signed by unknown authority" Each region's k3s has its OWN self-signed CA (cluster-init per CP). The primary's API cert isn't signed by the secondary's CA → cilium crash-loops → no CNI → flux controllers Pending → no HRs → canvas shows only primary's HRs. This fix points each region's cilium at the LOCAL CP, whose API server presents the matching CA from this cluster. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:56:18 +04:00
github-actions[bot]	fc71800a52	deploy: update catalyst images to `19a847e`	2026-05-12 12:30:55 +00:00
e3mrah	19a847e514	fix(infra): restore \n escape in secondary CP templatefile regex (#1445 ) The conflict-resolution Python script in PR #1444 wrote a literal newline where the regex string needed the two-char "\n" escape. tofu init rejected with "Invalid multi-line string / Unterminated template string" on main.tf:925. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:27:10 +04:00
github-actions[bot]	bc0f56eb4e	deploy: update catalyst images to `4923938`	2026-05-12 12:15:30 +00:00
e3mrah	4923938c2b	feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444) Operator mandate (2026-05-12): the mothership canvas must surface install-* HRs from EVERY region of a multi-region provision, not just the primary CP's. Today catalyst-api stores ONE kubeconfig per deployment (the primary CP's) and spawns ONE helmwatch.Bridge against it. Result: secondary regions are invisible on the canvas even though their k3s clusters are fully reconciling. End-to-end change across infra + handler: 1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL appends `?region=<kubeconfig_postback_region>` when the var is set. main.tf templatefile call passes empty for primary CP, `each.key` (e.g. "nbg1-1", "hel1-2") for each secondary region. 2) PutKubeconfig handler: reads ?region= query param. Empty → primary path (unchanged: stores at <dir>/<id>.yaml, sets Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty → secondary path: stores at <dir>/<id>-<region>.yaml, populates Deployment.secondaryKubeconfigPaths[region]. Single-use guard is per-region (the same bearer secures every CP's PUT — secondaries reuse it for their own slot). NO Phase-1 watch re-launch from a secondary PUT. 3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s, spawns one helmwatch.NewWatcher per kubeconfig discovered, stores the Watcher on Deployment.secondaryWatchers[region]. Per-region watchers emit ordinary helmwatch events with region-prefixed Component names so the wizard's per-component view doesn't collide primary vs secondary bp-cilium events. They do NOT contribute to markPhase1Done — outcome remains the primary's classification. 4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group bubbles + install-* nodes from each secondary watcher's SnapshotComponents. Node id: <depID>:<region>:install-<chart>. FlowNode.region set so the canvas can colour-group. Intra-region finish-to-start deps emitted from cs.DependsOn — same-region only, never cross-region (per NAMING-CONVENTION §1.3 independent fault domains, no stretched cluster). 5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary kubeconfig file on Sovereign wipe. Storage model is uniform across SME and corporate Sovereigns. No hardcoding of provider, region count, or building block. Caught after operator pointed out that 3-region prov #50 was showing only 52 install-* nodes (all from fsn1) on the canvas — the architectural gap. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:12:38 +04:00
github-actions[bot]	effd75e4a7	deploy: update catalyst images to `c5d891a`	2026-05-12 11:26:54 +00:00
e3mrah	c5d891ad0b	fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443 ) The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name / hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster autoscaler could attach scale-up VMs to the private network. The primary CP's templatefile call at main.tf:483-485 was updated, but the matching call for secondary regions at main.tf:899 was missed. Result: any provision with regions[] of length > 1 fails at tofu plan with "vars map does not contain key hcloud_network_name" referenced in cloudinit-control-plane.tftpl:478. Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash) at T+0:47. Forward the same three resource refs to every secondary region's templatefile call. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:23:53 +04:00
github-actions[bot]	5fb99be8e8	deploy: update catalyst images to `bd5d439`	2026-05-12 10:00:04 +00:00
e3mrah	bd5d4393ec	fix(canvas): cross-group edges cascade to leaf temporal endpoints (#1442 ) Operator-reported design fix completing #1437/#1440 — the cross-phase ordering between provisioner and bootstrap-kit groups was either an M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at leaf level (post-#1440 with the both-elided skip). Neither was right. Real design: when a group→group dependency edge is lifted onto the leaf graph because one or both endpoints elided, cascade ONLY to the temporal endpoint pair: upstream_terminals → downstream_initials Where: - upstream_terminals = visible descendants of the upstream group that nothing else in the group depends on (sinks of intra-group DAG). For the tofu chain this collapses to just cluster-bootstrap. - downstream_initials = visible descendants of the downstream group that depend on nothing else in the group (sources of intra-group DAG). For bootstrap-kit this is install-cilium / install-flux / install-gateway-api / etc — the install-* roots. Net result for provisioner→bootstrap-kit at depth=all: a small fan of edges from cluster-bootstrap to the bp-* roots — the real temporal gate, no spurious phantom edges, no missing cross-phase chain. Two call sites updated: - Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now cascades to groupTerminals(G) instead of fanOutVisibleChildren(G). - Outbound: elidedGroup G with G.dependsOn = [D] cascades to groupInitials(G) on the receive side; D-side cascades to groupTerminals(D) when D is also elided, or uses D directly when D is a visible job. 11/11 flowLayoutOrganic.test.ts pass. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:47:42 +04:00
github-actions[bot]	064fc3073f	deploy: update catalyst images to `0fe0cac`	2026-05-12 09:32:31 +00:00
e3mrah	0fe0cacc15	fix(canvas): right-click menu actions actually work + clearer labels (#1441 ) Operator reported "non of the right click functionalites working other than the open in new tab". Root cause: the previous handler only mutated urlFoldedSet, which had no visible effect when the clicked group was folded by the depth default (same class of bug toggleFold had before #1439). The menu items also had confusing labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative). Rewrite to use the same compose-state pattern toggleFold uses: - "Show only this group" — switch to depth=all + fold every OTHER group. Only the clicked group's subtree expands; sibling groups stay collapsed. - "Hide this group" — switch to depth=default + add clicked group to urlFoldedSet. Group renders as a folded bubble; its subtree hidden. - "Expand subtree" — switch to depth=all + remove this group and all its descendant groups from urlFoldedSet. Fully unfolded subtree. - "Open in new tab" — unchanged (was working since #1435). Dropped the misleading "Fold to level N" item (was just stepDepth(-1)). The depth chip ◀▶ at the top-right is the canonical global depth control. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:30:31 +04:00
github-actions[bot]	c80d43c6d8	deploy: update catalyst images to `2c1f767`	2026-05-12 09:27:06 +00:00
e3mrah	2c1f767b52	fix(canvas): back-to-jobs chroot-scoped + group→group edge w/o M×N lift (#1440 ) Three operator-reported issues from the same dblclick session: 1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx used absolute /jobs which on contabo resolves to /sovereign/jobs — the mother's flat /jobs view, NOT the chroot-scoped /sovereign/provision/<id>/jobs. Operator reported "chroot principle violation". Fix: chroot-aware /provision/<deploymentId>/jobs when deploymentId is present. 2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no edge between them — temporal ordering invisible. Earlier #1437 dropped the group→group edge entirely because the FE layout's lift-on-elide cascaded it into M×N phantom edges at ?depth=all. Re-emit the edge AND fix the lift logic in flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH endpoints of the elided-group dep are elided. At ?depth=1 the edge renders between the two folded groups as intended; at ?depth=all both groups elide and the lift is suppressed so the spurious cascade doesn't reappear. The actual install-* deps are already visible via each leaf's own dependsOn — skipping the lift costs no information. 3) (Documented separately) Right-click menu only attaches to GROUP nodes per design (FlowCanvasOrganic line 1277). When all groups are elided (?depth=all auto-folds groups out), the menu is unreachable. The dblclick-on-group fold fix (#1439) makes group bubbles reachable at ?depth=1 where right-click works. Caught via Playwright after operator reported all three. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:24:50 +04:00
github-actions[bot]	fe337d571c	deploy: update catalyst images to `bb1bff2`	2026-05-12 08:42:18 +00:00
e3mrah	bb1bff245a	fix(canvas): toggleFold handles depth-default-folded nodes (#1439) toggleFold previously only mutated urlFoldedSet, which had no effect when the clicked node was folded BY THE DEPTH DEFAULT (not by an explicit URL override). Result: at ?depth=1 where both groups are folded by depth-default, double-clicking bootstrap-kit (after #1438's dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet delete didn't change the composed foldedSet, the canvas didn't budge. New behaviour: - If clicked node is folded by ANY source: switch to depth=all AND explicitly fold every OTHER previously-folded group. Only the clicked group ends up visibly unfolded — exactly the operator-requested "expand only the respective parent" UX. - If clicked node is unfolded: add to urlFoldedSet to fold it without changing depth. Caught via Playwright after #1438 landed and dblclick still didn't unfold the clicked group at ?depth=1. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:39:58 +04:00
github-actions[bot]	24a2b13870	deploy: update catalyst images to `9da662c`	2026-05-12 08:36:45 +00:00
e3mrah	9da662c6f5	fix(canvas): double-click on group toggles fold (not navigate) (#1438 ) Operator reported "double-click on a parent bubble it is expanding all the parent instead of expanding only the respective parent." Reproduced in Playwright: at ?depth=1 only the 2 group bubbles render folded; double-click on bootstrap-kit navigated to /jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page defaulted to depth=2 → groups elided → all 50 install-* + Phase-0 bubbles rendered. Exactly the "expanding all parents" symptom. Two fixes: 1) Branch handleNodeDoubleClick: if the bubble is a group, call toggleFold(nodeId) in place — fold or unfold ONLY that group. Tree-explorer UX where a leaf double-click drills in but a group double-click expands/collapses. 2) For the leaf path, preserve window.location.search across the navigate so the destination page renders with the same depth / folded filter the operator had on screen. Without this, the new page defaults to depth=2 and the visible bubble set changes beneath them. Caught via Playwright double-click simulation on bootstrap-kit at ?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles) to .../jobs/bootstrap-kit (50 bubbles). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:33:59 +04:00
github-actions[bot]	41787d66c6	deploy: update catalyst images to `5e96d30`	2026-05-12 08:33:55 +00:00
e3mrah	5e96d30552	fix(flow-snapshot): drop provisioner→bootstrap-kit edge — causes M×N fan-out (#1437 ) flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound deps onto EACH of its visible children, and if the dep target is itself an elided group, fans out to THAT group's visible children too. With both top-level groups elided at depth=all, the single group→group finish-to-start edge I added cascades into M×N phantom edges (each install-* gains a dep on every tofu-* + cluster-bootstrap step). The operator-reported "install-cnpg has 5 connections from terraform jobs" was exactly this layout-side fan-out. Removing the group→group edge leaves Phase-0 and Phase-1 as separate connected components on the canvas — the correct minimum-edge rendering. Ordering between phases is implicit in the timestamps + status flow, not in the edge graph. Caught by Playwright-probing the canvas after operator pushback: data side had only the 1 real direct dep (install-flux → install-cnpg) yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:30:44 +04:00
github-actions[bot]	732949bc73	deploy: update catalyst images to `f980356`	2026-05-12 08:14:36 +00:00
e3mrah	f980356ce9	fix(canvas): setSearchPatch uses window.history (forward-fix CI tsc TS2322) (#1436) PR #1435 (depth-chip basepath fix) failed CI because removing `to:` from navigate() narrowed the search reducer's typed return to never, producing TS2322 on the `Record<string, unknown>` cast. Forward-fix: bypass TanStack navigate() entirely for the search-only mutation path. Update window.location's query string via history.replaceState (preserves pathname verbatim including basepath) and dispatch a synthetic popstate so TanStack's useSearch picks up the new query on next render. No TanStack path resolution → no basepath drop → no colon re-encoding → depth-chip click stops 404ing. Also re-applies the fixes for open-new-tab (window.open of absolute /sovereign/...) and handleNodeDoubleClick (strip + encode jobId) carried over from #1435. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:11:26 +04:00
e3mrah	4d1ccfbd44	fix(canvas): depth-chip click drops /sovereign basepath + open-new-tab 404 (#1435 ) Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface: 1) Clicking the depth chip arrows (◀ / ▶) on /sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath was dropped AND the colon was re-encoded as %3A, both via TanStack's `to: '.'` path resolution. The new URL 404s at the BE because the colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup. Fix: omit `to:` entirely. TanStack treats a search-only navigate as a pure search-params mutation and preserves the current path verbatim including the basepath. The colon-prefixed jobId in the URL comes from older deep-links; the strip-on-click fix landed in #1431. 2) Right-click → "Open in new tab" also passed the raw nodeId verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror handleNodeDoubleClick: strip the "<deploymentId>:" prefix, encodeURIComponent the remainder, AND prepend /sovereign for the absolute-path window.open (window.open isn't routed through TanStack so basepath isn't auto-prepended). Caught after operator reported "level arrows redirect to wrong URLs and giving 404" + "right click on a parent bubble … none of the functions are working properly." Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:02:37 +04:00
e3mrah	1d9dd99915	fix(flow-snapshot): normalise bare-name Job.DependsOn to canonical JobID form (#1434 ) helmwatch.Bridge writes SOME Job.DependsOn entries as bare names ("install-flux") rather than the canonical JobID form ("<deploymentId>:install-flux") — 71 such entries observed on prov bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied those bare names verbatim into Relationship.fromId. The canvas reducer matches FlowNode.id by exact string, so the bare-name fromId became a phantom edge pointing to a non-existent node. In the force-directed layout these phantom edges visually routed through the nearest real bubbles, manifesting as 5-edge fan-outs from every Phase-0 tofu job to every install-* bubble (operator-reported on install-cnpg, but symmetric across all install-*). Normalise every fromId to jobs.JobID(deploymentID, dep) form when the stored value lacks a ":" separator. Caught after operator reported "install-cnpg has 5 different connections from terraform jobs — this is matter of a proper chaining" — looking at the snapshot showed Job.DependsOn=[install-flux] without the prefix. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:00:04 +04:00
github-actions[bot]	1a0333a43f	deploy: update catalyst images to `93c3e81`	2026-05-12 07:27:29 +00:00
e3mrah	93c3e81f0c	fix(flow-snapshot): contains edge direction — toId is parent per canon (#1433 ) Per products/openova-flow/core/src/types.ts line 112: "contains — toId (parent) contains fromId (child)" My emit had this inverted: I set FromID=parent, ToID=child, which made the FE adapter (flowStreamToOrganic.ts line 134) interpret every install-* leaf as a group containing the bootstrap-kit/provisioner group nodes. Net result: only 2 bubbles ever rendered on the canvas regardless of ?depth= because the hierarchy graph was upside-down. Caught by opening the canvas in a browser via Playwright after the operator reported "still showing only 2 bubbles, no drill-down". Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:24:30 +04:00
github-actions[bot]	9011d1b635	deploy: update catalyst images to `048a4d8`	2026-05-12 06:46:54 +00:00
e3mrah	048a4d8910	fix(refresh-watch): disk-fallback when Result.KubeconfigPath is empty (#1432 ) When the Pod restarts between PutKubeconfig writing the file AND the next Result.Save() persisting the field, dep.Result.KubeconfigPath comes back empty even though the file exists at the canonical convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was returning 409 watch-not-resumable in this state, which left the mothership canvas frozen because the live watcher couldn't re-attach to source HR.spec.dependsOn for the install-* edge derivation. Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for PR #1431 restarted catalyst-api Pod, the file /var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but RefreshWatch refused to use it because the record field was empty. Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured AND a file exists at <dir>/<depID>.yaml, use that path and patch the record so subsequent /components/state + flow snapshot calls see a populated field. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:44:55 +04:00

1983 Commits