openova

Author	SHA1	Message	Date
github-actions[bot]	500fd47aee	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.19	2026-05-12 18:50:01 +00:00
e3mrah	855e106d87	fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race (#1450 ) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops: fatal: failed to start: daemon creation failed: unable to initialize BPF masquerade support: BPF masquerade requires NodePort (--enable-node-port="true") Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. 
Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup. CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready. Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade. Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with: Internal error occurred: failed calling webhook "mcluster.cnpg.io": no endpoints available for service "cnpg-webhook-service" Chain: 1. bp-cnpg install with disableWait: true → HR goes Ready immediately when manifests apply (operator pod still spinning up). 2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the dependsOn check on bp-cnpg. 3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs. 4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints yet → admission webhook call fails → Helm install fails → RetriesExceeded → entire DB-backed chain wedges. Carve out the disableWait: true blanket for bp-cnpg specifically. 
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule, so Helm-wait blocks only on pod readiness, not on a self-referencing CRD. With this change bp-cnpg's HR stays Reconciling until cnpg-controller-manager + cnpg-webhook-service are both rolled + Available, so Flux dependsOn correctly gates downstream consumers behind a webhook that's actually serving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:23:04 +04:00
e3mrah	fb563e9fd6	fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) (#1449 ) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops: fatal: failed to start: daemon creation failed: unable to initialize BPF masquerade support: BPF masquerade requires NodePort (--enable-node-port="true") Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. 
Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup. CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready. Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade. Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 21:05:01 +04:00
github-actions[bot]	16f41bef56	deploy: update catalyst images to `68372d7`	2026-05-12 16:13:41 +00:00
e3mrah	68372d700b	fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448 ) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:11:23 +04:00
github-actions[bot]	1c6e82b83b	deploy: update catalyst images to `be47815`	2026-05-12 16:03:56 +00:00
e3mrah	be47815ddf	fix(infra): pass cp_private_ip to primary CP templatefile too (#1447 ) PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:01:43 +04:00
github-actions[bot]	034da82c00	deploy: update catalyst images to `cdcc50a`	2026-05-12 15:58:30 +00:00
e3mrah	cdcc50a213	fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446 ) Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3 "no stretched fault domain". Cilium on each region MUST talk to its OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites hardcoded the primary's IP: 1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665): `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2). 2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}` so each region's k3s API cert validates against the LOCAL CP's IP. 3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml): add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR values so Flux postBuild.substitute can override per region. The cloud-init Kustomization renders the substitute var to `${cp_private_ip}`. Single-region (primary-only) provisions fall back to the default `10.0.1.2` and stay byte-identical to today. Live evidence of the bug — prov #52 (3-region) on 2026-05-12: cilium-operator on nbg1 secondary: "Establishing connection to apiserver" host="https://10.0.1.2:6443" "failed to start: ... tls: failed to verify certificate: x509: certificate signed by unknown authority" Each region's k3s has its OWN self-signed CA (cluster-init per CP). The primary's API cert isn't signed by the secondary's CA → cilium crash-loops → no CNI → flux controllers Pending → no HRs → canvas shows only primary's HRs. This fix points each region's cilium at the LOCAL CP, whose API server presents the matching CA from this cluster. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 19:56:18 +04:00
github-actions[bot]	fc71800a52	deploy: update catalyst images to `19a847e`	2026-05-12 12:30:55 +00:00
e3mrah	19a847e514	fix(infra): restore \n escape in secondary CP templatefile regex (#1445 ) The conflict-resolution Python script in PR #1444 wrote a literal newline where the regex string needed the two-char "\n" escape. tofu init rejected with "Invalid multi-line string / Unterminated template string" on main.tf:925. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:27:10 +04:00
github-actions[bot]	bc0f56eb4e	deploy: update catalyst images to `4923938`	2026-05-12 12:15:30 +00:00
e3mrah	4923938c2b	feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444 ) Operator mandate (2026-05-12): the mothership canvas must surface install-* HRs from EVERY region of a multi-region provision, not just the primary CP's. Today catalyst-api stores ONE kubeconfig per deployment (the primary CP's) and spawns ONE helmwatch.Bridge against it. Result: secondary regions are invisible on the canvas even though their k3s clusters are fully reconciling. End-to-end change across infra + handler: 1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL appends `?region=<kubeconfig_postback_region>` when the var is set. main.tf templatefile call passes empty for primary CP, `each.key` (e.g. "nbg1-1", "hel1-2") for each secondary region. 2) PutKubeconfig handler: reads ?region= query param. Empty → primary path (unchanged: stores at <dir>/<id>.yaml, sets Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty → secondary path: stores at <dir>/<id>-<region>.yaml, populates Deployment.secondaryKubeconfigPaths[region]. Single-use guard is per-region (the same bearer secures every CP's PUT — secondaries reuse it for their own slot). NO Phase-1 watch re-launch from a secondary PUT. 3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the primary's watcher. Scans <kubeconfigsDir>/<id>-.yaml every 15s, spawns one helmwatch.NewWatcher per kubeconfig discovered, stores the Watcher on Deployment.secondaryWatchers[region]. Per-region watchers emit ordinary helmwatch events with region-prefixed Component names so the wizard's per-component view doesn't collide primary vs secondary bp-cilium events. They do NOT contribute to markPhase1Done — outcome remains the primary's classification. 4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group bubbles + install- nodes from each secondary watcher's SnapshotComponents. Node id: <depID>:<region>:install-<chart>. 
FlowNode.region set so the canvas can colour-group. Intra-region finish-to-start deps emitted from cs.DependsOn — same-region only, never cross-region (per NAMING-CONVENTION §1.3 independent fault domains, no stretched cluster). 5) wipe.go: removes both <id>.yaml AND every <id>-.yaml secondary kubeconfig file on Sovereign wipe. Storage model is uniform across SME and corporate Sovereigns. No hardcoding of provider, region count, or building block. Caught after operator pointed out that 3-region prov #50 was showing only 52 install- nodes (all from fsn1) on the canvas — the architectural gap. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:12:38 +04:00
github-actions[bot]	effd75e4a7	deploy: update catalyst images to `c5d891a`	2026-05-12 11:26:54 +00:00
e3mrah	c5d891ad0b	fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443 ) The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name / hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster autoscaler could attach scale-up VMs to the private network. The primary CP's templatefile call at main.tf:483-485 was updated, but the matching call for secondary regions at main.tf:899 was missed. Result: any provision with regions[] of length > 1 fails at tofu plan with "vars map does not contain key hcloud_network_name" referenced in cloudinit-control-plane.tftpl:478. Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash) at T+0:47. Forward the same three resource refs to every secondary region's templatefile call. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:23:53 +04:00
github-actions[bot]	5fb99be8e8	deploy: update catalyst images to `bd5d439`	2026-05-12 10:00:04 +00:00
e3mrah	bd5d4393ec	fix(canvas): cross-group edges cascade to leaf temporal endpoints (#1442 ) Operator-reported design fix completing #1437/#1440 — the cross-phase ordering between provisioner and bootstrap-kit groups was either an M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at leaf level (post-#1440 with the both-elided skip). Neither was right. Real design: when a group→group dependency edge is lifted onto the leaf graph because one or both endpoints elided, cascade ONLY to the temporal endpoint pair: upstream_terminals → downstream_initials Where: - upstream_terminals = visible descendants of the upstream group that nothing else in the group depends on (sinks of intra-group DAG). For the tofu chain this collapses to just cluster-bootstrap. - downstream_initials = visible descendants of the downstream group that depend on nothing else in the group (sources of intra-group DAG). For bootstrap-kit this is install-cilium / install-flux / install-gateway-api / etc — the install-* roots. Net result for provisioner→bootstrap-kit at depth=all: a small fan of edges from cluster-bootstrap to the bp-* roots — the real temporal gate, no spurious phantom edges, no missing cross-phase chain. Two call sites updated: - Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now cascades to groupTerminals(G) instead of fanOutVisibleChildren(G). - Outbound: elidedGroup G with G.dependsOn = [D] cascades to groupInitials(G) on the receive side; D-side cascades to groupTerminals(D) when D is also elided, or uses D directly when D is a visible job. 11/11 flowLayoutOrganic.test.ts pass. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:47:42 +04:00
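The cascade rule in bd5d4393ec (upstream_terminals → downstream_initials) can be sketched in TypeScript as below. This is a minimal illustration under assumptions: the `Job` shape, the `group` field, and the function bodies are hypothetical and are not taken from flowLayoutOrganic.ts; only the names groupTerminals and groupInitials come from the commit message.

```typescript
// Hypothetical job shape for illustration; the real canvas types may differ.
interface Job {
  id: string;
  group: string;       // owning group, e.g. "provisioner" or "bootstrap-kit"
  dependsOn: string[]; // ids of jobs this job waits for
}

// Members of `group` that no other member depends on (sinks of the intra-group DAG).
function groupTerminals(jobs: Job[], group: string): string[] {
  const members = jobs.filter((j) => j.group === group);
  const dependedOn: Record<string, boolean> = {};
  members.forEach((j) => j.dependsOn.forEach((d) => { dependedOn[d] = true; }));
  return members.filter((j) => !dependedOn[j.id]).map((j) => j.id);
}

// Members of `group` that depend on no other member (sources of the intra-group DAG).
function groupInitials(jobs: Job[], group: string): string[] {
  const members = jobs.filter((j) => j.group === group);
  const memberIds: Record<string, boolean> = {};
  members.forEach((j) => { memberIds[j.id] = true; });
  return members
    .filter((j) => j.dependsOn.every((d) => !memberIds[d]))
    .map((j) => j.id);
}

// Lift one group-to-group edge onto the leaf graph: terminals x initials only,
// instead of every upstream child x every downstream child.
function cascadeEdges(jobs: Job[], upstream: string, downstream: string): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  groupTerminals(jobs, upstream).forEach((from) => {
    groupInitials(jobs, downstream).forEach((to) => { edges.push([from, to]); });
  });
  return edges;
}
```

For a linear tofu chain ending in cluster-bootstrap, `groupTerminals` returns only cluster-bootstrap, so the lifted edge count equals the number of install-* roots rather than M×N.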
github-actions[bot]	064fc3073f	deploy: update catalyst images to `0fe0cac`	2026-05-12 09:32:31 +00:00
e3mrah	0fe0cacc15	fix(canvas): right-click menu actions actually work + clearer labels (#1441 ) Operator reported "non of the right click functionalites working other than the open in new tab". Root cause: the previous handler only mutated urlFoldedSet, which had no visible effect when the clicked group was folded by the depth default (same class of bug toggleFold had before #1439). The menu items also had confusing labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative). Rewrite to use the same compose-state pattern toggleFold uses: - "Show only this group" — switch to depth=all + fold every OTHER group. Only the clicked group's subtree expands; sibling groups stay collapsed. - "Hide this group" — switch to depth=default + add clicked group to urlFoldedSet. Group renders as a folded bubble; its subtree hidden. - "Expand subtree" — switch to depth=all + remove this group and all its descendant groups from urlFoldedSet. Fully unfolded subtree. - "Open in new tab" — unchanged (was working since #1435). Dropped the misleading "Fold to level N" item (was just stepDepth(-1)). The depth chip ◀▶ at the top-right is the canonical global depth control. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:30:31 +04:00
github-actions[bot]	c80d43c6d8	deploy: update catalyst images to `2c1f767`	2026-05-12 09:27:06 +00:00
e3mrah	2c1f767b52	fix(canvas): back-to-jobs chroot-scoped + group→group edge w/o M×N lift (#1440 ) Three operator-reported issues from the same dblclick session: 1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx used absolute /jobs which on contabo resolves to /sovereign/jobs — the mother's flat /jobs view, NOT the chroot-scoped /sovereign/provision/<id>/jobs. Operator reported "chroot principle violation". Fix: chroot-aware /provision/<deploymentId>/jobs when deploymentId is present. 2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no edge between them — temporal ordering invisible. Earlier #1437 dropped the group→group edge entirely because the FE layout's lift-on-elide cascaded it into M×N phantom edges at ?depth=all. Re-emit the edge AND fix the lift logic in flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH endpoints of the elided-group dep are elided. At ?depth=1 the edge renders between the two folded groups as intended; at ?depth=all both groups elide and the lift is suppressed so the spurious cascade doesn't reappear. The actual install-* deps are already visible via each leaf's own dependsOn — skipping the lift costs no information. 3) (Documented separately) Right-click menu only attaches to GROUP nodes per design (FlowCanvasOrganic line 1277). When all groups are elided (?depth=all auto-folds groups out), the menu is unreachable. The dblclick-on-group fold fix (#1439) makes group bubbles reachable at ?depth=1 where right-click works. Caught via Playwright after operator reported all three. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 13:24:50 +04:00
github-actions[bot]	fe337d571c	deploy: update catalyst images to `bb1bff2`	2026-05-12 08:42:18 +00:00
e3mrah	bb1bff245a	fix(canvas): toggleFold handles depth-default-folded nodes (#1439 ) toggleFold previously only mutated urlFoldedSet, which had no effect when the clicked node was folded BY THE DEPTH DEFAULT (not by an explicit URL override). Result: at ?depth=1 where both groups are folded by depth-default, double-clicking bootstrap-kit (after #1438's dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet delete didn't change the composed foldedSet, the canvas didn't budge. New behaviour: - If clicked node is folded by ANY source: switch to depth=all AND explicitly fold every OTHER previously-folded group. Only the clicked group ends up visibly unfolded — exactly the operator-requested "expand only the respective parent" UX. - If clicked node is unfolded: add to urlFoldedSet to fold it without changing depth. Caught via Playwright after #1438 landed and dblclick still didn't unfold the clicked group at ?depth=1. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:39:58 +04:00
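The two toggleFold branches described in bb1bff245a can be sketched as a pure state transition. This is an assumption-laden illustration: `FoldState`, the `composedFolded` parameter, and the array-based sets are hypothetical stand-ins, not the canvas's actual state model.

```typescript
// Hypothetical view state: global depth plus explicit per-URL folds.
interface FoldState {
  depth: number | "all";
  urlFolded: string[];
}

// `composedFolded` lists the groups currently folded by ANY source
// (depth default or explicit URL override), as computed by the caller.
function toggleFold(state: FoldState, composedFolded: string[], nodeId: string): FoldState {
  if (composedFolded.indexOf(nodeId) !== -1) {
    // Folded by any source: go to depth=all and re-fold every OTHER folded group,
    // so only the clicked group ends up visibly unfolded.
    return { depth: "all", urlFolded: composedFolded.filter((id) => id !== nodeId) };
  }
  // Currently unfolded: fold it explicitly without touching the depth.
  return { depth: state.depth, urlFolded: state.urlFolded.concat([nodeId]) };
}
```

Passing the composed set in explicitly is what avoids the original no-op: the decision no longer depends on whether the node happens to be in the URL override set.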
github-actions[bot]	24a2b13870	deploy: update catalyst images to `9da662c`	2026-05-12 08:36:45 +00:00
e3mrah	9da662c6f5	fix(canvas): double-click on group toggles fold (not navigate) (#1438 ) Operator reported "double-click on a parent bubble it is expanding all the parent instead of expanding only the respective parent." Reproduced in Playwright: at ?depth=1 only the 2 group bubbles render folded; double-click on bootstrap-kit navigated to /jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page defaulted to depth=2 → groups elided → all 50 install-* + Phase-0 bubbles rendered. Exactly the "expanding all parents" symptom. Two fixes: 1) Branch handleNodeDoubleClick: if the bubble is a group, call toggleFold(nodeId) in place — fold or unfold ONLY that group. Tree-explorer UX where a leaf double-click drills in but a group double-click expands/collapses. 2) For the leaf path, preserve window.location.search across the navigate so the destination page renders with the same depth / folded filter the operator had on screen. Without this, the new page defaults to depth=2 and the visible bubble set changes beneath them. Caught via Playwright double-click simulation on bootstrap-kit at ?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles) to .../jobs/bootstrap-kit (50 bubbles). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:33:59 +04:00
github-actions[bot]	41787d66c6	deploy: update catalyst images to `5e96d30`	2026-05-12 08:33:55 +00:00
e3mrah	5e96d30552	fix(flow-snapshot): drop provisioner→bootstrap-kit edge — causes M×N fan-out (#1437 ) flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound deps onto EACH of its visible children, and if the dep target is itself an elided group, fans out to THAT group's visible children too. With both top-level groups elided at depth=all, the single group→group finish-to-start edge I added cascades into M×N phantom edges (each install-* gains a dep on every tofu-* + cluster-bootstrap step). The operator-reported "install-cnpg has 5 connections from terraform jobs" was exactly this layout-side fan-out. Removing the group→group edge leaves Phase-0 and Phase-1 as separate connected components on the canvas — the correct minimum-edge rendering. Ordering between phases is implicit in the timestamps + status flow, not in the edge graph. Caught by Playwright-probing the canvas after operator pushback: data side had only the 1 real direct dep (install-flux → install-cnpg) yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:30:44 +04:00
github-actions[bot]	732949bc73	deploy: update catalyst images to `f980356`	2026-05-12 08:14:36 +00:00
e3mrah	f980356ce9	fix(canvas): setSearchPatch uses window.history (forward-fix CI tsc TS2322) (#1436 ) PR #1435 (depth-chip basepath fix) failed CI because removing `to:` from navigate() narrowed the search reducer's typed return to never, producing TS2322 on the `Record<string, unknown>` cast. Forward-fix: bypass TanStack navigate() entirely for the search-only mutation path. Update window.location's query string via history.replaceState (preserves pathname verbatim including basepath) and dispatch a synthetic popstate so TanStack's useSearch picks up the new query on next render. No TanStack path resolution → no basepath drop → no colon re-encoding → depth-chip click stops 404ing. Also re-applies the open-new-tab fix (window.open of absolute /sovereign/... ) and the handleNodeDoubleClick fix (strip + encode jobId) carried over from #1435. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:11:26 +04:00
e3mrah	4d1ccfbd44	fix(canvas): depth-chip click drops /sovereign basepath + open-new-tab 404 (#1435 ) Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface: 1) Clicking the depth chip arrows (◀ / ▶) on /sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath was dropped AND the colon was re-encoded as %3A, both via TanStack's `to: '.'` path resolution. The new URL 404s at the BE because the colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup. Fix: omit `to:` entirely. TanStack treats a search-only navigate as a pure search-params mutation and preserves the current path verbatim including the basepath. The colon-prefixed jobId in the URL comes from older deep-links; the strip-on-click fix landed in #1431. 2) Right-click → "Open in new tab" also passed the raw nodeId verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror handleNodeDoubleClick: strip the "<deploymentId>:" prefix, encodeURIComponent the remainder, AND prepend /sovereign for the absolute-path window.open (window.open isn't routed through TanStack so basepath isn't auto-prepended). Caught after operator reported "level arrows redirect to wrong URLs and giving 404" + "right click on a parent bubble … none of the functions are working properly." Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:02:37 +04:00
e3mrah	1d9dd99915	fix(flow-snapshot): normalise bare-name Job.DependsOn to canonical JobID form (#1434 ) helmwatch.Bridge writes SOME Job.DependsOn entries as bare names ("install-flux") rather than the canonical JobID form ("<deploymentId>:install-flux") — 71 such entries observed on prov bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied those bare names verbatim into Relationship.fromId. The canvas reducer matches FlowNode.id by exact string, so the bare-name fromId became a phantom edge pointing to a non-existent node. In the force-directed layout these phantom edges visually routed through the nearest real bubbles, manifesting as 5-edge fan-outs from every Phase-0 tofu job to every install-* bubble (operator-reported on install-cnpg, but symmetric across all install-*). Normalise every fromId to jobs.JobID(deploymentID, dep) form when the stored value lacks a ":" separator. Caught after operator reported "install-cnpg has 5 different connections from terraform jobs — this is matter of a proper chaining" — looking at the snapshot showed Job.DependsOn=[install-flux] without the prefix. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:00:04 +04:00
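The normalisation in 1d9dd99915 amounts to prefixing any dependency that lacks a ":" separator. The sketch below is in TypeScript for illustration only; the commit describes backend code (jobs.JobID), and the helper names `jobID` and `normaliseDependsOn` here are hypothetical.

```typescript
// Canonical JobID form per the commit message: "<deploymentId>:<jobName>".
function jobID(deploymentId: string, jobName: string): string {
  return deploymentId + ":" + jobName;
}

// Prefix bare names; leave entries that already carry a ":" separator untouched.
function normaliseDependsOn(deploymentId: string, dependsOn: string[]): string[] {
  return dependsOn.map((dep) => (dep.indexOf(":") === -1 ? jobID(deploymentId, dep) : dep));
}
```

With every fromId in canonical form, an exact-string match against FlowNode.id can no longer produce an edge that points at a non-existent node.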
github-actions[bot]	1a0333a43f	deploy: update catalyst images to `93c3e81`	2026-05-12 07:27:29 +00:00
e3mrah	93c3e81f0c	fix(flow-snapshot): contains edge direction — toId is parent per canon (#1433 ) Per products/openova-flow/core/src/types.ts line 112: "contains — toId (parent) contains fromId (child)" My emit had this inverted: I set FromID=parent, ToID=child, which made the FE adapter (flowStreamToOrganic.ts line 134) interpret every install-* leaf as a group containing the bootstrap-kit/provisioner group nodes. Net result: only 2 bubbles ever rendered on the canvas regardless of ?depth= because the hierarchy graph was upside-down. Caught by opening the canvas in a browser via Playwright after the operator reported "still showing only 2 bubbles, no drill-down". Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:24:30 +04:00
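The direction rule quoted in 93c3e81f0c ("toId (parent) contains fromId (child)") can be pinned down with a small constructor. This is a sketch under assumptions: the `Relationship` shape and both helper functions are hypothetical, with only the fromId/toId convention taken from the commit message.

```typescript
// Hypothetical relationship shape; only the fromId/toId convention comes from the source.
type RelationshipType = "contains" | "finish-to-start";

interface Relationship {
  type: RelationshipType;
  fromId: string;
  toId: string;
}

// Build a contains edge with the canonical direction: child in fromId, parent in toId.
function containsEdge(parentId: string, childId: string): Relationship {
  return { type: "contains", fromId: childId, toId: parentId };
}

// Read the hierarchy back: children are the fromId side of edges whose toId is the parent.
function childrenOf(rels: Relationship[], parentId: string): string[] {
  return rels
    .filter((r) => r.type === "contains" && r.toId === parentId)
    .map((r) => r.fromId);
}
```

Routing every emit through one constructor keeps the argument order (parent, child) separate from the wire order (fromId = child), which is where the inversion crept in.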
github-actions[bot]	9011d1b635	deploy: update catalyst images to `048a4d8`	2026-05-12 06:46:54 +00:00
e3mrah	048a4d8910	fix(refresh-watch): disk-fallback when Result.KubeconfigPath is empty (#1432 ) When the Pod restarts between PutKubeconfig writing the file AND the next Result.Save() persisting the field, dep.Result.KubeconfigPath comes back empty even though the file exists at the canonical convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was returning 409 watch-not-resumable in this state, which left the mothership canvas frozen because the live watcher couldn't re-attach to source HR.spec.dependsOn for the install-* edge derivation. Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for PR #1431 restarted catalyst-api Pod, the file /var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but RefreshWatch refused to use it because the record field was empty. Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured AND a file exists at <dir>/<depID>.yaml, use that path and patch the record so subsequent /components/state + flow snapshot calls see a populated field. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:44:55 +04:00
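The disk-fallback in 048a4d8910 can be sketched as a small resolver. This is a TypeScript illustration only; the commit describes backend code, and `DeploymentRecord`, `resolveKubeconfigPath`, and the injected `fileExists` predicate are hypothetical names introduced here so the logic stays testable without touching a filesystem.

```typescript
// Hypothetical record shape; the real record stores the path under Result.KubeconfigPath.
interface DeploymentRecord {
  id: string;
  kubeconfigPath: string;
}

// Returns the kubeconfig path to use, or null when the watch is not resumable.
function resolveKubeconfigPath(
  rec: DeploymentRecord,
  kubeconfigsDir: string,
  fileExists: (path: string) => boolean,
): string | null {
  if (rec.kubeconfigPath !== "") return rec.kubeconfigPath; // normal path: field persisted
  if (kubeconfigsDir === "") return null;                   // no directory configured
  const candidate = kubeconfigsDir + "/" + rec.id + ".yaml"; // canonical <dir>/<depID>.yaml
  if (!fileExists(candidate)) return null;
  rec.kubeconfigPath = candidate; // patch the record so later reads see a populated field
  return candidate;
}
```

Returning null corresponds to the previous 409 watch-not-resumable outcome; the fallback only narrows the cases in which that outcome is reached.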
github-actions[bot]	7e4f38ec62	deploy: update catalyst images to `e3771f6`	2026-05-12 06:38:32 +00:00
e3mrah	e3771f6813	fix(flow): derive HR dependsOn from live watcher + fix canvas drill-down 404 (#1431 ) Two bugs the operator hit on /sovereign/provision/<id>/jobs: 1) Phase-1 install-* Jobs rendered DISCONNECTED on the canvas — helmwatch.Bridge doesn't persist Job.DependsOn (only the Phase-0 tofu chain + cluster-bootstrap is wired today). Pull HR.spec.dependsOn from the live Watcher's informer cache via SnapshotComponents() (ComponentSnapshot.DependsOn already populated by extractDependsOn) at snapshot-time and emit finish-to-start edges from upstream install-<dep> to install-<self>. Also add provisioner→bootstrap-kit group-to-group finish-to-start so the Phase-0/Phase-1 ordering is visible on the canvas. 2) Clicking a canvas node → "404 page not found" because FlowPage.handleNodeDoubleClick passed the full "<deploymentId>:install-X" id verbatim. The backend Store.GetJob keys by bare jobName ("install-X"), so the colon-prefixed id missed exact-match and JobDetail returned 404. Mirror useJobLinkBuilder (JobsTable.tsx line 364): strip the "<deploymentId>:" prefix and encodeURIComponent the remainder before pushing to the router. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:36:22 +04:00
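The drill-down fix in the commit above (strip the `<deploymentId>:` prefix, then percent-encode the remainder) might look like this in outline. It is a Python sketch of the idea only; the real change lives in the TypeScript front end, `job_route_segment` is a hypothetical name, and `urllib.parse.quote` is used as an approximation of `encodeURIComponent`.

```python
from urllib.parse import quote


def job_route_segment(deployment_id: str, node_id: str) -> str:
    """Turn a canvas node id into the bare, URL-safe job name (hypothetical helper)."""
    prefix = f"{deployment_id}:"
    # Drop the "<deploymentId>:" prefix only when it is actually present.
    job_name = node_id[len(prefix):] if node_id.startswith(prefix) else node_id
    # safe="" escapes "/" and ":" as well, similar to encodeURIComponent.
    return quote(job_name, safe="")
```

With this shape, a node id such as `<deploymentId>:install-X` resolves to the bare `install-X` key that the commit says the backend store matches on.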
github-actions[bot]	59b6940c18	deploy: update catalyst images to `2fbab45`	2026-05-12 06:08:41 +00:00
e3mrah	2fbab45b43	feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy (#1429 ) * fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template Without this env the proxy resolveFlowServerURL() falls back to per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>) which only exists on Sovereigns that already installed bootstrap-kit slot 56 with httproute=enabled. Every other catalyst-api deployment (mothership contabo + Sovereigns that haven't reached cutover yet) returns 502 on /api/v1/flows/{deploymentId}/snapshot — the live regression founder saw at console.openova.io: "No nodes to render." The env points at the in-cluster Service DNS for the LOCAL openova-flow- server. Both the mothership (catalyst-system or catalyst namespace) and each Sovereign chroot run the bp-openova-flow-server chart with a local Service, so this URL is correct for every cluster catalyst-api runs in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy Mothership canvas at /sovereign/provision/<id>/jobs was empty for the first ~30 minutes of every fresh provision because the snapshot endpoint went straight to https://openova-flow.<sovereignFQDN> which can't serve until cilium + cert-manager + the HTTPRoute TLS cert are all up on the chroot. The Phase-0 + Phase-1 lifecycle Jobs catalyst-api ALREADY owns (tofu-init/plan/apply/output, flux-bootstrap, install-bp-<chart>, ...) were invisible the whole time. This change adds flowSnapshotFromJobs which assembles the canonical FlowMessage envelope from h.jobsStore().ListJobs(deploymentID) — every Job becomes a FlowNode with the legacy <deploymentId>:<jobName> id form the canvas drill-down already expects, every Job.DependsOn becomes a finish-to-start Relationship, every Job.ParentID becomes a contains Relationship. 
HandleFlowSnapshot checks the local store first and returns immediately when it has data; otherwise falls through to the existing upstream proxy path. HandleFlowStream gets the same treatment via flowStreamLocal: emit a snapshot frame on connect AND every 3 seconds thereafter, plus a 15s heartbeat. The OpenovaFlow consumer's reducer is idempotent on snapshot replay so re-emitting an unchanged envelope is harmless; in exchange the canvas reflects Job state transitions within ~3s of when helmwatch.Bridge writes them. No FE change required — the same /api/v1/flows/<id>/snapshot and /stream endpoints serve the same envelope shape the chroot adapter emits (products/openova-flow/adapter-flux/internal/types/flow.go), named SSE events including 'snapshot' and 'heartbeat'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:06:28 +04:00
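As a rough picture of the snapshot assembly described in the commit above, the following Python sketch maps jobs to nodes and relationships. The `Job` dataclass, the dictionary keys, and the assumption that `parent_id` holds a job name are all illustrative; the actual envelope type is the Go struct the commit points to. The `contains` edge uses the direction the log states elsewhere as canonical (toId is the parent).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    """Simplified stand-in for a stored job (hypothetical shape)."""
    name: str
    depends_on: List[str] = field(default_factory=list)
    parent_id: Optional[str] = None


def snapshot_from_jobs(deployment_id: str, jobs: List[Job]) -> Dict[str, list]:
    def node_id(job_name: str) -> str:
        # Legacy "<deploymentId>:<jobName>" id form mentioned in the commit.
        return f"{deployment_id}:{job_name}"

    nodes = [{"id": node_id(j.name)} for j in jobs]
    relationships = []
    for j in jobs:
        for dep in j.depends_on:
            # Upstream must finish before this job starts.
            relationships.append(
                {"type": "finish-to-start", "fromId": node_id(dep), "toId": node_id(j.name)}
            )
        if j.parent_id:
            # contains: toId is the parent, fromId is the child.
            relationships.append(
                {"type": "contains", "fromId": node_id(j.name), "toId": node_id(j.parent_id)}
            )
    return {"nodes": nodes, "relationships": relationships}
```

Because the function is a pure mapping over the job list, re-running it every few seconds yields the same envelope when nothing has changed, which fits the idempotent-replay behaviour the commit relies on.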
github-actions[bot]	4ceb74067f	deploy: update catalyst images to `50bf7a5`	2026-05-12 04:12:24 +00:00
e3mrah	50bf7a59ed	fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428) prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs True. F1-F7 are correct and live on main (qa-finalizer-strip Completed, autoscaler workers joined). The remaining wall is total bootstrap-kit install time exceeding the outer watch budget on a fresh cpx42×1 Sovereign without a warm Harbor proxy-cache. Two lock-step changes widen both bounds: 1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella chart genuinely needs >15m worst case when the full SME + Catalyst service stack rolls cold. 2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go: DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the watch never terminates while helm-controller still has remediation attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path was already wired (issue #538 baseline); the chart template now declares the explicit "120m" value so the runtime knob is discoverable for capacity-bounded environments. Per INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable. New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the F8 floor against future regression. Existing env-var override + field-override tests still pass unchanged. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 08:10:24 +04:00
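The budget arithmetic in the commit above (30m × 3 = 90m, which must stay below the 120m outer watch) can be checked with a few lines of Python. The function names here are hypothetical and the sketch only restates the inequality from the commit message.

```python
from datetime import timedelta


def worst_case_retry_chain(hr_timeout: timedelta, attempts: int) -> timedelta:
    """Longest time the inner HelmRelease attempts can take back to back."""
    return hr_timeout * attempts


def watch_budget_is_safe(
    outer_budget: timedelta, hr_timeout: timedelta, attempts: int
) -> bool:
    # The outer watch must outlast every remaining remediation attempt.
    return outer_budget > worst_case_retry_chain(hr_timeout, attempts)
```

Plugging in the values from the commit, a 30m timeout with 3 attempts needs 90m, so the old 60m watch fails the check and the new 120m default passes it.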
github-actions[bot]	dd095b8597	deploy: update catalyst images to `b743b64`	2026-05-12 02:13:30 +00:00
e3mrah	b743b646ac	fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427 ) Root cause (autoscaler pod log, prov #43 chroot): W orchestrator.go:626 Node group workers is not ready for scaleup - backoff with status: Scale-up timed out for node group workers after 15m2.273255226s Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY: workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[] workers-a6410e81b24cced public_net.ipv4=178.105.73.210 private_net=[] The worker cloud-init (identical to Phase-0 user_data) issues curl -sfL https://get.k3s.io \| K3S_URL=https://10.0.1.2:6443 ... sh - against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment that URL is unreachable → k3s agent install silent-fails → node never registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst- platform Pending Pods never schedulable → chroot canvas tests blocked. Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on the cluster-autoscaler deployment so the Hetzner provider attaches every scale-up VM to the SAME private network + firewall + ssh-key the Phase-0 Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net / -fw / catalyst-<sov-fqdn-with-dashes>). Names flow: Tofu (hcloud_network.main.name + hcloud_firewall.main.name + hcloud_ssh_key.main.name) → cloudinit-control-plane.tftpl (3 new template vars) → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys) → flux-system/cloud-credentials Secret → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*) → upstream chart's deployment env Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent regression of the three env-var slots in chart values.yaml. Reaffirms canonical seam: values flow through Tofu → cloud-init → flux-system Secret → Flux valuesFrom → chart values → upstream env. 
Never via kubectl patch, never via bespoke Go API calls. Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 06:11:30 +04:00
github-actions[bot]	d4d05f16f6	deploy: update catalyst images to `8c7d326`	2026-05-12 00:38:43 +00:00
e3mrah	8c7d32616e	fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185 , prov #38/#39/#41 recurrence) (#1426 ) Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC): bp-catalyst-platform HR install.timeout=15m → Helm pre-install hook: qa-finalizer-strip Job (weight -99) → Pod requests 50m CPU + 64Mi memory (tiny) → BUT no tolerations → scheduler restricted to worker → worker cpx32 (8vCPU/16GB) at 99% CPU requests (7980m of 8000m allocated) after bootstrap-kit fan-out → FailedScheduling: "0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: true}" → autoscaler triggers scale-up worker 2→3 → "1 in backoff after failed scale-up" → still Pending → 15m timeout → InstallFailed → Flux uninstall+rollback → installFailures: 3 → Flux gives up entirely Live evidence quoted from chroot kubeconfig on prov #41: - bp-catalyst-platform HR `Reconciling=True, reason=Progressing, message="Running 'install' action with timeout of 15m0s"` - HR `Released=False, reason=InstallFailed, message="Helm install failed for release catalyst-system/catalyst-platform with chart bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred: * timed out waiting for the condition"` - Pod `qa-finalizer-strip-m2hdb` status=Pending; events: `Warning FailedScheduling 108s default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: true}` - Worker `Allocated cpu 7980m (99%) of 8000m capacity` - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle) Fix: add tolerations for the control-plane NoSchedule taint + priorityClassName: system-cluster-critical so the qa-finalizer-strip Job can ALWAYS schedule regardless of worker-node CPU saturation. 
The hook is a defense-in-depth cleanup that runs in seconds on a clean cluster; it legitimately belongs anywhere with free capacity including the control-plane node (which on prov #41 had 7365m CPU free vs. the hook's 50m request). Why prior fixes didn't suffice: - Fix #114 introduced this hook to break a finalizer-deadlock loop on prov #9. Correct fix for that wedge; never anticipated worker saturation as a scheduling failure mode for the hook itself. - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed + qa-cnpg-status-seed hooks (weight 0/post-install) to regular release resources to break a circular DAG dep. Different hook surface. - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install hook (weight +10) wait budget for cold-start autoscaler. That hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook never starts, the +10 hook never runs. Recurring class: same family as Fix #114 (hook scheduling failure wedges entire HR install). 3 consecutive recurrences (prov #38, #39, #41) on chart pin 1.4.140 trigger the category-level audit threshold (CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene swept in same commit: - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub redirect for deprecated Bitnami images, 2025-08 cutover documented at platform/self-sovereign-cutover/chart/values.yaml: 252) → harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4 — the canonical alpine-based kubectl image already used by sibling hook catalyst-gitea-token-mint (Fix #163). MIRROR-EVERYTHING + ARCHITECT-FIRST rules. Coordinator follow-up tickets: - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml (qa-cnpgpair-status-seed) still reference bitnamilegacy/kubectl :1.29.3 — same Bitnami-deprecation class. Out of scope for this Fix (not part of the recurrence cluster); flagged for a sweep. - Worker cpx32 sizing may be undersized for the bootstrap-kit fan- out on omantel.biz — separate sizing ticket, not blocking. 
Changes: - products/catalyst/chart/templates/qa-fixtures/pre-install- finalizer-strip.yaml: add tolerations + priorityClassName; switch image to alpine/k8s:1.31.4. Inline doc comments explain the 4-layer trace and the Fix #114/#138/#184 history. - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with changelog entry capturing root cause + budget arithmetic. - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump HR pin 1.4.140 → 1.4.141. Verification: - helm template renders cleanly (exit 0, ~6700 lines). - kubectl apply --dry-run=client validates the rendered Job manifest (job.batch/qa-finalizer-strip created (dry run)). - Rendered Job contains tolerations[control-plane Exists NoSchedule], priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 04:36:35 +04:00
e3mrah	ce76a7b7ab	fix(bp-powerdns): root-cause Job DeadlineExceeded recurrence (post Fix #144 ) (#1425 ) Fix #144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook. That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12) both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT `DeadlineExceeded`. The deadline never got a chance to fire. Trace from prov #38 chroot (`KUBECONFIG=/tmp/prov38.kubeconfig kubectl get hr bp-powerdns -o yaml`): status: Helm install failed for release powerdns/powerdns with chart bp-powerdns@1.2.2: failed post-install: 1 error occurred: * job powerdns-zone-bootstrap failed: BackoffLimitExceeded Pod events for powerdns-zone-bootstrap-tq7qq: 59m Started container zone-bootstrap 56m Back-off restarting failed container zone-bootstrap 55m Job has reached the specified backoff limit Root cause walked end-to-end (per CLAUDE.md TRACE rule): TEST: bp-powerdns HR Ready=True ↑ HR: Helm install succeeds (post-install Job exits 0) ↑ Zone-bootstrap Job: curl POST succeeds ↑ powerdns:8081 Service: reachable (has Ready endpoints) ↑ powerdns Deployment: Pods Ready (3 replicas) ← Pending, blocked here ↑ CNPG cluster: pdns-pg-app Secret exists ↑ pdns-pg-1-initdb Pod: scheduled, Running, Completed ← Pending too ↑ Worker node has capacity ← 99% CPU requested The zone-bootstrap container curl'd `http://powerdns:8081`, hit "connection refused" (empty Service endpoints), exited 7, container restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level backoffs (≈10min wall-time with exponential delay), the Job declared `BackoffLimitExceeded` — well before activeDeadlineSeconds=840s (14min) could even consider firing. Fix #144 was directionally right (the upstream IS slow on cold k3s) but operated on the wrong knob. The container's outer-loop retry budget is bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds. 
Bumping only the deadline left the BackoffLimit ceiling unchanged. Architectural fix (this commit): 1. Move the wait-for-API loop INSIDE the container (one Pod, one inner poll loop, restartPolicy=Never). The inner loop polls GET /api/v1/servers every 10s until HTTP 200, bounded by new `apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container run owns the full wait budget instead of N short-lived containers racing the backoff timer. 2. restartPolicy: OnFailure → Never. The container script handles its own retry; Kubernetes-level backoff is reserved for genuinely transient pod failures (image-pull, OS eviction) where the Job-level backoffLimit=6 still triggers a fresh Pod. 3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower clusters can raise the inner deadline without forking the chart (per docs/INVIOLABLE-PRINCIPLES.md #4). 4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s). Sits below activeDeadlineSeconds (840s) so the zone-creation phase keeps ≥240s of headroom AFTER the API comes Ready. Curl status handling in the wait loop: 200 → API up, proceed to bootstrap 401\|403 → auth failure, FATAL (no retry — operator misconfig) 000\|5xx\|... → transient, sleep & retry until inner deadline Files changed: - platform/powerdns/chart/Chart.yaml 1.2.2 → 1.2.3 + history - platform/powerdns/chart/values.yaml + apiReadyTimeoutSeconds knob - platform/powerdns/chart/templates/ zone-bootstrap-job.yaml inner wait-for-API loop; restartPolicy: Never - clusters/_template/bootstrap-kit/ 11-powerdns.yaml pin to 1.2.3 + HR comment Why this is sufficient where Fix #144 was not: Fix #144 worked the chart-level deadline. This commit works the inner-loop ownership — the wait budget is now owned by the script inside the container, not by the Job spec arithmetic (backoffLimit × backoff-delay). The Job's outer activeDeadlineSeconds still caps the worst-case runtime (no runaway poll), but the script now actually GETS to use it. 
Verification: - helm template renders cleanly (deps build OK, empty-zones short- circuit preserved, non-empty zones render Job + RBAC + Audit CM) - kubectl create --dry-run=client --validate=false: 5/5 resources created (sa, role, rb, cm, job) - chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml Companion infrastructure note (NOT addressed by this commit, flagged for Coordinator): The DEEPER bottom of the trace stack is worker capacity. Prov #38's single cpx32 worker (8 vCPU / 16 GB) is at 99% CPU requested. The cluster-autoscaler attempted 2→3 scale-up but is in backoff because two unscheduled pods (gitea/gitea-* PV affinity conflict from a previous wedged install; trivy-system/node-collector NodeAffinity) poison the autoscaler's "can the template node fit" check. Even with this chart fix in place, the powerdns Deployment cannot become Ready until either: (a) the worker autoscales successfully (gitea PV migrated / trivy taints relaxed), or (b) worker_count is bumped from 2 to 3 in the provisioning body, or (c) qa_worker_size is bumped to cpx42. This chart fix ensures bp-powerdns survives a slow CNPG cold-start. It does NOT fix a fundamentally undersized cluster. Coordinator next step: reprov with worker_count=3 OR qa_worker_size=cpx42 + this chart landed. Either should converge. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 02:13:34 +04:00
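The inner wait-for-API loop and its status handling, as described in the commit above, could be sketched like this. It is a Python illustration rather than the chart's shell script; `wait_for_api`, `FatalAuthError`, and the injectable `probe`, `clock`, and `sleep` parameters are assumptions made so the logic can be exercised without a real server. A probe result of `0` stands in for curl's `000` (no HTTP response).

```python
import time
from typing import Callable


class FatalAuthError(Exception):
    """Raised on 401/403: retrying cannot fix a misconfigured credential."""


def wait_for_api(
    probe: Callable[[], int],
    timeout_s: float = 600.0,   # inner deadline; 600s is the default named in the commit
    interval_s: float = 10.0,   # poll interval named in the commit
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the API answers 200. Returns False if the inner deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = probe()
        if status == 200:
            return True  # API is up, proceed to the bootstrap step
        if status in (401, 403):
            raise FatalAuthError(f"API rejected credentials (HTTP {status})")
        # Anything else (0, 5xx, ...) is treated as transient.
        if clock() + interval_s > deadline:
            return False  # the next poll would land past the inner deadline
        sleep(interval_s)
```

Keeping the whole wait inside one call is the point of the design: a single run owns the full 600s budget, which leaves the 240s of headroom under the 840s outer deadline that the commit mentions.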
Claude Code	569d780b86	fix(bp-openova-flow-emitter slot 57): drop :8080 port (Service is :80) The chroot bp-openova-flow-emitter posts to http://openova-flow-server.catalyst-system.svc.cluster.local:8080 but the bp-openova-flow-server chart's Service is exposed on :80 (targetPort:8080 → port:80, kubernetes Service indirection). Result: every event POST from the chroot emitter dial-times-out, the chroot's openova-flow data plane never populates, and canvas pages viewing the chroot show empty. Same fix as PR #124 on mothership emitter-helmrelease.yaml (private repo). Slot 57 in the bootstrap-kit template was missed in that round. Live regression on prov #37 (2026-05-11): chroot has 38 bp-* HRs True but openova-flow snapshot is empty because emitter can't reach server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:49:29 +02:00
github-actions[bot]	5fdd33b7c0	deploy: update catalyst images to `0ba87bb`	2026-05-11 18:32:08 +00:00
e3mrah	0ba87bb8da	fix(JobsPage): use FlowNode.id in row anchor href (region prefix) (#1414 ) TC-035 (iter-2, 2026-05-11): OpenovaFlow rows merged into JobsPage (PR #1413) lost their region-prefixed identity in the URL. The link builder sliced the "<prefix>:" segment off every id with a colon — intended to strip the legacy "<deploymentId>:install-keycloak" form, but it also stripped "contabo:bp-openova-flow-server" → bare "bp-openova-flow-server" in the href. The matrix asserts the verbatim form "/jobs/contabo:bp-openova-flow-server" must appear in the rendered DOM. Fix: stop slicing. `encodeURIComponent` still escapes unsafe path chars (`/` for live K8s job ids like "job/syft-grype/..."), then we restore `:` because RFC 3986 permits it as a path-segment `pchar`. FlowPage canvas navigation (PR #1411) and JobDetail flow-fallback (PR #1412) already pass on the colon-present form, so this round- trips end-to-end. Legacy "bp-cilium" / "cluster-bootstrap" hrefs are unchanged (no `:` to encode). The previously-stripped legacy form "<deploymentId>:install-keycloak" now lands as the full id in the URL, and JobDetail's `jobsById` lookup is already keyed by BOTH the canonical id AND the bare jobName (JobDetail.tsx:124-131), so the resolution path is preserved. Test coverage: new Case 4 in JobsPage.flow-merge.test.tsx asserts the openova-flow row's anchor `href` contains `/jobs/contabo:bp-openova-flow-server` and is NOT the bare-jobName form. All 4 flow-merge cases PASS. The 3 pre-existing failures in JobsPage.test.tsx (back-to-apps href, canonical-columns header, Show-as-Flow button) are the documented iter-2 baseline — untouched by this change. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:29:46 +04:00
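The href rule from the commit above (encode unsafe path characters but keep `:` literal) can be approximated in Python as follows. The real code is TypeScript using `encodeURIComponent`; `job_href` is a hypothetical name, and `urllib.parse.quote` with `safe=":"` is a stand-in that escapes `/` while leaving the colon alone.

```python
from urllib.parse import quote


def job_href(node_id: str) -> str:
    """Build a row link that preserves a region or deployment prefix (hypothetical helper)."""
    # RFC 3986 allows ":" inside a path segment, so it is kept literal;
    # "/" is still escaped so ids like "job/..." stay within one segment.
    return "/jobs/" + quote(node_id, safe=":")
```

Ids without a colon, such as `bp-cilium`, pass through unchanged, which matches the commit's note that legacy hrefs are unaffected.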
github-actions[bot]	5c987309b5	deploy: update catalyst images to `5332ed0`	2026-05-11 17:56:31 +00:00


2168 Commits