Commit Graph

2168 Commits

Author SHA1 Message Date
github-actions[bot]
500fd47aee deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.19 2026-05-12 18:50:01 +00:00
e3mrah
855e106d87
fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race (#1450)
* fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race

prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with:

  Internal error occurred: failed calling webhook "mcluster.cnpg.io":
  no endpoints available for service "cnpg-webhook-service"

Chain:
1. bp-cnpg install with disableWait: true → HR goes Ready immediately
   when manifests apply (operator pod still spinning up).
2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the
   dependsOn check on bp-cnpg.
3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs.
4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints
   yet → admission webhook call fails → Helm install fails →
   RetriesExceeded → entire DB-backed chain wedges.

Carve out the disableWait: true blanket for bp-cnpg specifically.
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the
agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply
to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule,
so Helm-wait blocks only on pod readiness, not on a self-referencing CRD.

With this change bp-cnpg's HR stays Reconciling until cnpg-controller-
manager + cnpg-webhook-service are both rolled + Available, so Flux
dependsOn correctly gates downstream consumers behind a webhook that's
actually serving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:23:04 +04:00
e3mrah
fb563e9fd6
fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) (#1449)
* fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort)

Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops:

  fatal: failed to start: daemon creation failed: unable to initialize
  BPF masquerade support: BPF masquerade requires NodePort
  (--enable-node-port="true")

Chart default kubeProxyReplacement=false leaves enable-node-port=false in
the rendered cilium-config ConfigMap. Combined with bpf.masquerade=true
(also default-on) the cilium-agent rejects the BPF masquerade datapath
on startup. CP cilium-agent survives because it was started by cloudinit
with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the
ConfigMap. Every WORKER node that joins after Flux's upgrade sees the
new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not-
ready taint persists → every post-install Job pod (keycloak-config-cli,
powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain
stalls at ~60% Ready.

Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl
write_files entry /var/lib/catalyst/cilium-values.yaml) already uses
kubeProxyReplacement: true. This change aligns the Flux HR overlay with
the working pre-Flux bootstrap so the agent config never regresses when
helm-controller does its first upgrade.
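
The incompatible value combination described above could be caught before rollout with a small pre-flight check on the rendered values. This is a minimal sketch, not code from this repo: `CiliumValues` and `checkCiliumValues` are hypothetical names, and the interface covers only the two keys this commit discusses.

```typescript
// Hypothetical subset of the bp-cilium Helm values relevant to this bug.
interface CiliumValues {
  kubeProxyReplacement: boolean;
  bpf: { masquerade: boolean };
}

// Returns a list of problems; an empty list means the combination is safe.
function checkCiliumValues(v: CiliumValues): string[] {
  const problems: string[] = [];
  // Per the agent's fatal error quoted above, BPF masquerade requires
  // NodePort, which this commit enables via kubeProxyReplacement: true.
  if (v.bpf.masquerade && !v.kubeProxyReplacement) {
    problems.push("bpf.masquerade=true requires kubeProxyReplacement=true");
  }
  return problems;
}
```

Run against the pre-fix overlay (`kubeProxyReplacement: false`, `bpf.masquerade: true`), the check reports one problem; against the cloud-init values it reports none.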

Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:05:01 +04:00
github-actions[bot]
16f41bef56 deploy: update catalyst images to 68372d7 2026-05-12 16:13:41 +00:00
e3mrah
68372d700b
fix(hetzner): pass cp_private_ip into secondary CP templatefile (multi-region prov #52-54 unblock) (#1448)
* fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile

prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl
started consuming ${cp_private_ip} (PR #1446):

    Invalid value for "vars" parameter: vars map does not contain key
    "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43.

The primary CP templatefile call (main.tf:342) and the secondary WORKER
templatefile call (main.tf:944) both pass `cp_private_ip`, but the
secondary CP templatefile call (main.tf:860) was missed — every
multi-region provision since PR #1446 lands here at plan-time.

Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the
secondary CP templatefile so each secondary region's cilium-operator
reaches its OWN local CP (matching CA), not the primary across regions.
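
This class of failure (a template gains a variable, one templatefile call site is missed) can be linted by comparing the names a template references with the keys each call site passes. The sketch below is an illustration only: `referencedVars` and `missingVars` are hypothetical helpers, and the parser is a simplification that matches bare `${name}` references and skips `$${...}` escapes, ignoring `%{...}` directives and expressions.

```typescript
// Collect bare `${name}` interpolation references from a template body.
// Simplification: expressions and `%{...}` directives are not parsed.
function referencedVars(template: string): string[] {
  const names = new Set<string>();
  const pattern = /(\$?)\$\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}/g;
  let m: RegExpExecArray | null = pattern.exec(template);
  while (m !== null) {
    const escaped = m[1] === "$"; // `$${name}` is a literal, not a reference
    const name = m[2];
    if (!escaped && name !== undefined) names.add(name);
    m = pattern.exec(template);
  }
  return Array.from(names).sort();
}

// Names the template needs that the vars map of one call site does not provide.
function missingVars(template: string, vars: Record<string, unknown>): string[] {
  return referencedVars(template).filter(
    (name) => !Object.prototype.hasOwnProperty.call(vars, name),
  );
}
```

Running `missingVars` once per call site (primary CP, secondary CP, worker) would flag the missed `cp_private_ip` before `tofu plan` does.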

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:11:23 +04:00
github-actions[bot]
1c6e82b83b deploy: update catalyst images to be47815 2026-05-12 16:03:56 +00:00
e3mrah
be47815ddf
fix(infra): pass cp_private_ip to primary CP templatefile too (#1447)
PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl
but only the SECONDARY templatefile call at main.tf:840 already had
that var threaded. The PRIMARY CP call at line 342 was missed and
tofu plan blew up with "vars map does not contain key cp_private_ip".

Set it to "10.0.1.2" for the primary (the hardcoded value the chart
default + worker_cloud_init already use for the canonical 10.0.1.0/24
primary subnet).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:01:43 +04:00
github-actions[bot]
034da82c00 deploy: update catalyst images to cdcc50a 2026-05-12 15:58:30 +00:00
e3mrah
cdcc50a213
fix(multi-region): cilium k8sServiceHost uses LOCAL CP private IP per region (#1446)
Each region's k3s is an INDEPENDENT cluster per NAMING-CONVENTION §1.3
"no stretched fault domain". Cilium on each region MUST talk to its
OWN local CP's k3s API server, not the primary's 10.0.1.2. Three sites
hardcoded the primary's IP:

1) Pre-Flux cilium helm install (cloudinit-control-plane.tftpl:665):
   `k8sServiceHost: 10.0.1.2` → `${cp_private_ip}` (rendered per-region
   by main.tf — primary 10.0.1.2, nbg1-1 10.0.11.2, hel1-2 10.0.12.2).

2) k3s install --tls-san=10.0.1.2 (line 1206): same `${cp_private_ip}`
   so each region's k3s API cert validates against the LOCAL CP's IP.

3) bp-cilium HelmRelease (clusters/_template/bootstrap-kit/01-cilium.yaml):
   add `k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` to the HR
   values so Flux postBuild.substitute can override per region. The
   cloud-init Kustomization renders the substitute var to `${cp_private_ip}`.
   Single-region (primary-only) provisions fall back to the
   default `10.0.1.2` and stay byte-identical to today.

Live evidence of the bug — prov #52 (3-region) on 2026-05-12:

  cilium-operator on nbg1 secondary:
  "Establishing connection to apiserver" host="https://10.0.1.2:6443"
  "failed to start: ... tls: failed to verify certificate:
   x509: certificate signed by unknown authority"

Each region's k3s has its OWN self-signed CA (cluster-init per CP). The
primary's API cert isn't signed by the secondary's CA → cilium crash-
loops → no CNI → flux controllers Pending → no HRs → canvas shows only
primary's HRs. This fix points each region's cilium at the LOCAL CP,
whose API server presents the matching CA from this cluster.
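
The `${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}` form in site 3 relies on substitute-with-default behaviour. The sketch below imitates that behaviour to show why single-region provisions stay unchanged; `substitute` is a hypothetical stand-in, not Flux's implementation, and it assumes an unset and an empty variable both fall back to the default.

```typescript
// Minimal stand-in for `${NAME}` / `${NAME:=default}` substitution.
// Assumption: unset and empty variables both take the default.
function substitute(input: string, vars: Record<string, string>): string {
  return input.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}/g,
    (_match: string, name: string, fallback: string | undefined): string => {
      const value: string | undefined = vars[name];
      if (value !== undefined && value !== "") return value;
      return fallback ?? "";
    },
  );
}
```

With no variable set, `substitute("k8sServiceHost: ${CILIUM_K8S_SERVICE_HOST:=10.0.1.2}", {})` yields the primary default; passing `{ CILIUM_K8S_SERVICE_HOST: "10.0.11.2" }` yields the nbg1-1 value from site 1.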

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 19:56:18 +04:00
github-actions[bot]
fc71800a52 deploy: update catalyst images to 19a847e 2026-05-12 12:30:55 +00:00
e3mrah
19a847e514
fix(infra): restore \n escape in secondary CP templatefile regex (#1445)
The conflict-resolution Python script in PR #1444 wrote a literal
newline where the regex string needed the two-char "\n" escape. tofu
init rejected with "Invalid multi-line string / Unterminated template
string" on main.tf:925.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:27:10 +04:00
github-actions[bot]
bc0f56eb4e deploy: update catalyst images to 4923938 2026-05-12 12:15:30 +00:00
e3mrah
4923938c2b
feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444)
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.

End-to-end change across infra + handler:

1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
   appends `?region=<kubeconfig_postback_region>` when the var is set.
   main.tf templatefile call passes empty for primary CP, `each.key`
   (e.g. "nbg1-1", "hel1-2") for each secondary region.

2) PutKubeconfig handler: reads ?region= query param. Empty → primary
   path (unchanged: stores at <dir>/<id>.yaml, sets
   Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
   → secondary path: stores at <dir>/<id>-<region>.yaml, populates
   Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
   per-region (the same bearer secures every CP's PUT — secondaries
   reuse it for their own slot). NO Phase-1 watch re-launch from a
   secondary PUT.

3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
   primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
   spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
   the Watcher on Deployment.secondaryWatchers[region]. Per-region
   watchers emit ordinary helmwatch events with region-prefixed
   Component names so the wizard's per-component view doesn't collide
   primary vs secondary bp-cilium events. They do NOT contribute to
   markPhase1Done — outcome remains the primary's classification.

4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
   bubbles + install-* nodes from each secondary watcher's
   SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
   FlowNode.region set so the canvas can colour-group. Intra-region
   finish-to-start deps emitted from cs.DependsOn — same-region only,
   never cross-region (per NAMING-CONVENTION §1.3 independent fault
   domains, no stretched cluster).

5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
   kubeconfig file on Sovereign wipe.
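
The naming conventions in items 2, 4 and 5 can be summarised as three small pure functions. This is a sketch of the layout as described in this message, not the handler code; all function names are hypothetical, and the chart argument is assumed to be the bare name that follows `install-`.

```typescript
// Items 2 and 5: primary at <dir>/<id>.yaml, secondaries at <dir>/<id>-<region>.yaml.
// An empty region selects the primary path.
function kubeconfigPath(dir: string, deploymentId: string, region: string): string {
  const suffix = region === "" ? "" : `-${region}`;
  return `${dir}/${deploymentId}${suffix}.yaml`;
}

// Item 4: node id layout <depID>:<region>:install-<chart>.
function secondaryNodeId(deploymentId: string, region: string, chart: string): string {
  return `${deploymentId}:${region}:install-${chart}`;
}

// Item 5: which entries of a directory listing a wipe has to remove.
function wipeTargets(fileNames: string[], deploymentId: string): string[] {
  return fileNames.filter(
    (f) =>
      f === `${deploymentId}.yaml` ||
      (f.startsWith(`${deploymentId}-`) && f.endsWith(".yaml")),
  );
}
```

Because the region is part of both the file name and the node id, primary and secondary entries for the same chart cannot collide.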

Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.

Caught after operator pointed out that 3-region prov #50 was showing
only 52 install-* nodes (all from fsn1) on the canvas — the
architectural gap.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:12:38 +04:00
github-actions[bot]
effd75e4a7 deploy: update catalyst images to c5d891a 2026-05-12 11:26:54 +00:00
e3mrah
c5d891ad0b
fix(infra): forward hcloud_*_name to secondary regions' CP cloud-init (#1443)
The F7 fix (Issue #1778) added hcloud_network_name / hcloud_firewall_name /
hcloud_ssh_key_name to cloudinit-control-plane.tftpl so the cluster
autoscaler could attach scale-up VMs to the private network. The
primary CP's templatefile call at main.tf:483-485 was updated, but the
matching call for secondary regions at main.tf:899 was missed.

Result: any provision with regions[] of length > 1 fails at tofu plan
with "vars map does not contain key hcloud_network_name" referenced in
cloudinit-control-plane.tftpl:478.

Hit live on prov #47 (ce25c31fff15c30c, 4-region: fsn1/nbg1/hel1/ash)
at T+0:47. Forward the same three resource refs to every secondary
region's templatefile call.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:23:53 +04:00
github-actions[bot]
5fb99be8e8 deploy: update catalyst images to bd5d439 2026-05-12 10:00:04 +00:00
e3mrah
bd5d4393ec
fix(canvas): cross-group edges cascade to leaf temporal endpoints (#1442)
Operator-reported design fix completing #1437/#1440 — the cross-phase
ordering between provisioner and bootstrap-kit groups was either an
M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at
leaf level (post-#1440 with the both-elided skip). Neither was right.

Real design: when a group→group dependency edge is lifted onto the
leaf graph because one or both endpoints elided, cascade ONLY to the
temporal endpoint pair:

  upstream_terminals → downstream_initials

Where:
  - upstream_terminals = visible descendants of the upstream group
    that nothing else in the group depends on (sinks of intra-group
    DAG). For the tofu chain this collapses to just cluster-bootstrap.
  - downstream_initials = visible descendants of the downstream group
    that depend on nothing else in the group (sources of intra-group
    DAG). For bootstrap-kit this is install-cilium / install-flux /
    install-gateway-api / etc — the install-* roots.

Net result for provisioner→bootstrap-kit at depth=all: a small fan of
edges from cluster-bootstrap to the bp-* roots — the real temporal
gate, no spurious phantom edges, no missing cross-phase chain.

Two call sites updated:
  - Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now
    cascades to groupTerminals(G) instead of fanOutVisibleChildren(G).
  - Outbound: elidedGroup G with G.dependsOn = [D] cascades to
    groupInitials(G) on the receive side; D-side cascades to
    groupTerminals(D) when D is also elided, or uses D directly when
    D is a visible job.

11/11 flowLayoutOrganic.test.ts pass.
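
The terminal and initial sets defined above can be sketched as follows. This is an illustration under simplifying assumptions, not the code in flowLayoutOrganic.ts: the `Job` shape is hypothetical, groups are flat (no nested groups), and every member is treated as visible. The `tofu-*` job names in the usage note are invented for the example.

```typescript
// Hypothetical minimal job shape; the real canvas types are not shown here.
interface Job {
  id: string;
  group: string;
  dependsOn: string[];
}

// Members of `group` that no other member depends on (sinks of the intra-group DAG).
function groupTerminals(jobs: Job[], group: string): string[] {
  const members = jobs.filter((j) => j.group === group);
  const memberIds = new Set(members.map((j) => j.id));
  const dependedOn = new Set<string>();
  for (const j of members) {
    for (const d of j.dependsOn) {
      if (memberIds.has(d)) dependedOn.add(d);
    }
  }
  return members.filter((j) => !dependedOn.has(j.id)).map((j) => j.id);
}

// Members of `group` with no dependency inside the group (sources of the DAG).
function groupInitials(jobs: Job[], group: string): string[] {
  const members = jobs.filter((j) => j.group === group);
  const memberIds = new Set(members.map((j) => j.id));
  return members
    .filter((j) => !j.dependsOn.some((d) => memberIds.has(d)))
    .map((j) => j.id);
}

// Edges for a lifted group-to-group dependency: terminals x initials only.
function cascadeEdges(jobs: Job[], upstream: string, downstream: string): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  for (const from of groupTerminals(jobs, upstream)) {
    for (const to of groupInitials(jobs, downstream)) {
      edges.push([from, to]);
    }
  }
  return edges;
}
```

For a linear chain tofu-init, tofu-plan, tofu-apply, cluster-bootstrap upstream and install-cilium, install-flux, install-cnpg (depending on install-flux) downstream, `cascadeEdges` yields two edges from cluster-bootstrap, rather than the twelve a full fan-out over every pair would produce.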

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:47:42 +04:00
github-actions[bot]
064fc3073f deploy: update catalyst images to 0fe0cac 2026-05-12 09:32:31 +00:00
e3mrah
0fe0cacc15
fix(canvas): right-click menu actions actually work + clearer labels (#1441)
Operator reported "non of the right click functionalites working
other than the open in new tab". Root cause: the previous handler
only mutated urlFoldedSet, which had no visible effect when the
clicked group was folded by the depth default (same class of bug
toggleFold had before #1439). The menu items also had confusing
labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative).

Rewrite to use the same compose-state pattern toggleFold uses:

  - "Show only this group" — switch to depth=all + fold every OTHER
    group. Only the clicked group's subtree expands; sibling groups
    stay collapsed.
  - "Hide this group" — switch to depth=default + add clicked group
    to urlFoldedSet. Group renders as a folded bubble; its subtree
    hidden.
  - "Expand subtree" — switch to depth=all + remove this group and
    all its descendant groups from urlFoldedSet. Fully unfolded
    subtree.
  - "Open in new tab" — unchanged (was working since #1435).

Dropped the misleading "Fold to level N" item (was just stepDepth(-1)).
The depth chip ◀▶ at the top-right is the canonical global depth
control.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:30:31 +04:00
github-actions[bot]
c80d43c6d8 deploy: update catalyst images to 2c1f767 2026-05-12 09:27:06 +00:00
e3mrah
2c1f767b52
fix(canvas): back-to-jobs chroot-scoped + group→group edge w/o M×N lift (#1440)
Three operator-reported issues from the same dblclick session:

1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx
   used absolute /jobs which on contabo resolves to /sovereign/jobs —
   the mother's flat /jobs view, NOT the chroot-scoped
   /sovereign/provision/<id>/jobs. Operator reported "chroot principle
   violation". Fix: chroot-aware /provision/<deploymentId>/jobs when
   deploymentId is present.

2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no
   edge between them — temporal ordering invisible. Earlier #1437
   dropped the group→group edge entirely because the FE layout's
   lift-on-elide cascaded it into M×N phantom edges at ?depth=all.
   Re-emit the edge AND fix the lift logic in
   flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH
   endpoints of the elided-group dep are elided. At ?depth=1 the
   edge renders between the two folded groups as intended; at
   ?depth=all both groups elide and the lift is suppressed so the
   spurious cascade doesn't reappear. The actual install-* deps are
   already visible via each leaf's own dependsOn — skipping the lift
   costs no information.

3) (Documented separately) Right-click menu only attaches to GROUP
   nodes per design (FlowCanvasOrganic line 1277). When all groups
   are elided (?depth=all auto-folds groups out), the menu is
   unreachable. The dblclick-on-group fold fix (#1439) makes group
   bubbles reachable at ?depth=1 where right-click works.

Caught via Playwright after operator reported all three.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:24:50 +04:00
github-actions[bot]
fe337d571c deploy: update catalyst images to bb1bff2 2026-05-12 08:42:18 +00:00
e3mrah
bb1bff245a
fix(canvas): toggleFold handles depth-default-folded nodes (#1439)
toggleFold previously only mutated urlFoldedSet, which had no effect
when the clicked node was folded BY THE DEPTH DEFAULT (not by an
explicit URL override). Result: at ?depth=1 where both groups are
folded by depth-default, double-clicking bootstrap-kit (after #1438's
dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet
delete didn't change the composed foldedSet, the canvas didn't budge.

New behaviour:
  - If clicked node is folded by ANY source: switch to depth=all AND
    explicitly fold every OTHER previously-folded group. Only the
    clicked group ends up visibly unfolded — exactly the operator-
    requested "expand only the respective parent" UX.
  - If clicked node is unfolded: add to urlFoldedSet to fold it
    without changing depth.
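
The two branches above can be modelled as a pure state transition. The sketch is an assumption-laden model rather than the component code: `FoldState`, `composedFolded` and the rule that depth 1 folds every group by default while depth "all" folds none are hypothetical simplifications.

```typescript
type Depth = 1 | "all";

interface FoldState {
  depth: Depth;
  urlFolded: Set<string>; // explicit per-group overrides from the URL
}

// Composed fold set: explicit overrides plus whatever the depth folds by default.
// Assumption: depth 1 folds every group, depth "all" folds none.
function composedFolded(s: FoldState, groups: string[]): Set<string> {
  const folded = new Set<string>(s.urlFolded);
  if (s.depth === 1) {
    for (const g of groups) folded.add(g);
  }
  return folded;
}

function toggleFold(s: FoldState, groups: string[], clicked: string): FoldState {
  const folded = composedFolded(s, groups);
  if (folded.has(clicked)) {
    // Folded by ANY source: go to depth=all and keep every other
    // previously-folded group folded explicitly.
    folded.delete(clicked);
    return { depth: "all", urlFolded: folded };
  }
  // Unfolded: fold it explicitly, depth unchanged.
  const urlFolded = new Set<string>(s.urlFolded);
  urlFolded.add(clicked);
  return { depth: s.depth, urlFolded };
}
```

Starting at depth 1 with no overrides, toggling bootstrap-kit produces depth "all" with only provisioner in the explicit set, so bootstrap-kit is the one group that ends up unfolded.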

Caught via Playwright after #1438 landed and dblclick still didn't
unfold the clicked group at ?depth=1.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:39:58 +04:00
github-actions[bot]
24a2b13870 deploy: update catalyst images to 9da662c 2026-05-12 08:36:45 +00:00
e3mrah
9da662c6f5
fix(canvas): double-click on group toggles fold (not navigate) (#1438)
Operator reported "double-click on a parent bubble it is expanding
all the parent instead of expanding only the respective parent."
Reproduced in Playwright: at ?depth=1 only the 2 group bubbles
render folded; double-click on bootstrap-kit navigated to
/jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page
defaulted to depth=2 → groups elided → all 50 install-* + Phase-0
bubbles rendered. Exactly the "expanding all parents" symptom.

Two fixes:

1) Branch handleNodeDoubleClick: if the bubble is a group, call
   toggleFold(nodeId) in place — fold or unfold ONLY that group.
   Tree-explorer UX where a leaf double-click drills in but a group
   double-click expands/collapses.

2) For the leaf path, preserve window.location.search across the
   navigate so the destination page renders with the same depth /
   folded filter the operator had on screen. Without this, the new
   page defaults to depth=2 and the visible bubble set changes
   beneath them.

Caught via Playwright double-click simulation on bootstrap-kit at
?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles)
to .../jobs/bootstrap-kit (50 bubbles).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:33:59 +04:00
github-actions[bot]
41787d66c6 deploy: update catalyst images to 5e96d30 2026-05-12 08:33:55 +00:00
e3mrah
5e96d30552
fix(flow-snapshot): drop provisioner→bootstrap-kit edge — causes M×N fan-out (#1437)
flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound
deps onto EACH of its visible children, and if the dep target is
itself an elided group, fans out to THAT group's visible children
too. With both top-level groups elided at depth=all, the single
group→group finish-to-start edge I added cascades into M×N phantom
edges (each install-* gains a dep on every tofu-* + cluster-bootstrap
step). The operator-reported "install-cnpg has 5 connections from
terraform jobs" was exactly this layout-side fan-out.

Removing the group→group edge leaves Phase-0 and Phase-1 as separate
connected components on the canvas — the correct minimum-edge
rendering. Ordering between phases is implicit in the timestamps +
status flow, not in the edge graph.

Caught by Playwright-probing the canvas after operator pushback: data
side had only the 1 real direct dep (install-flux → install-cnpg)
yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:30:44 +04:00
github-actions[bot]
732949bc73 deploy: update catalyst images to f980356 2026-05-12 08:14:36 +00:00
e3mrah
f980356ce9
fix(canvas): setSearchPatch uses window.history (forward-fix CI tsc TS2322) (#1436)
PR #1435 (depth-chip basepath fix) failed CI because removing `to:`
from navigate() narrowed the search reducer's typed return to never,
producing TS2322 on the `Record<string, unknown>` cast.

Forward-fix: bypass TanStack navigate() entirely for the search-only
mutation path. Update window.location's query string via
history.replaceState (preserves pathname verbatim including basepath)
and dispatch a synthetic popstate so TanStack's useSearch picks up
the new query on next render. No TanStack path resolution → no
basepath drop → no colon re-encoding → depth-chip click stops 404ing.
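
The search-only mutation can be sketched as a pure URL computation plus two browser calls. `patchSearch` and `patchedUrl` are hypothetical names for illustration; the real setSearchPatch is not reproduced here. The browser step appears only as a comment so the block stays runnable outside a browser.

```typescript
// Merge a patch into a query string; an `undefined` value removes the key.
function patchSearch(search: string, patch: Record<string, string | undefined>): string {
  const params = new URLSearchParams(search);
  for (const key of Object.keys(patch)) {
    const value = patch[key];
    if (value === undefined) {
      params.delete(key);
    } else {
      params.set(key, value);
    }
  }
  const next = params.toString();
  return next === "" ? "" : `?${next}`;
}

// The pathname is reused verbatim, so the basepath and any literal ":" survive.
function patchedUrl(
  pathname: string,
  search: string,
  patch: Record<string, string | undefined>,
): string {
  return pathname + patchSearch(search, patch);
}

// In a browser this would be applied roughly as:
//   history.replaceState(history.state, "", patchedUrl(location.pathname, location.search, patch));
//   window.dispatchEvent(new PopStateEvent("popstate"));
```

Only the query string goes through an encoder; the path segment containing `<depId>:install-X` is never touched, which is the property the fix depends on.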

Also re-applies the fixes for open-new-tab (window.open of absolute
/sovereign/...) and handleNodeDoubleClick (strip + encode jobId) carried
over from #1435.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:11:26 +04:00
e3mrah
4d1ccfbd44
fix(canvas): depth-chip click drops /sovereign basepath + open-new-tab 404 (#1435)
Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface:

1) Clicking the depth chip arrows (◀ / ▶) on
   /sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser
   to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath
   was dropped AND the colon was re-encoded as %3A, both via TanStack's
   `to: '.'` path resolution. The new URL 404s at the BE because the
   colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup.
   Fix: omit `to:` entirely. TanStack treats a search-only navigate as
   a pure search-params mutation and preserves the current path verbatim
   including the basepath. The colon-prefixed jobId in the URL comes
   from older deep-links; the strip-on-click fix landed in #1431.

2) Right-click → "Open in new tab" also passed the raw nodeId
   verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror
   handleNodeDoubleClick: strip the "<deploymentId>:" prefix,
   encodeURIComponent the remainder, AND prepend /sovereign for the
   absolute-path window.open (window.open isn't routed through
   TanStack so basepath isn't auto-prepended).
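
The URL construction for item 2 can be sketched as below. The helper names are hypothetical; the route shape `/sovereign/provision/<id>/jobs/<job>` is taken from the URLs quoted in item 1.

```typescript
// Basepath as named in this commit; window.open needs it spelled out.
const BASEPATH = "/sovereign";

// Strip the "<deploymentId>:" prefix if present, leaving the bare job name.
function bareJobName(nodeId: string, deploymentId: string): string {
  const prefix = `${deploymentId}:`;
  return nodeId.startsWith(prefix) ? nodeId.slice(prefix.length) : nodeId;
}

// Absolute URL suitable for window.open(url, "_blank").
function newTabUrl(deploymentId: string, nodeId: string): string {
  const job = encodeURIComponent(bareJobName(nodeId, deploymentId));
  return `${BASEPATH}/provision/${deploymentId}/jobs/${job}`;
}
```

Stripping before encoding matters: encoding the raw node id would turn the colon into `%3A` and reproduce the 404 from item 1.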

Caught after operator reported "level arrows redirect to wrong URLs
and giving 404" + "right click on a parent bubble … none of the
functions are working properly."

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:02:37 +04:00
e3mrah
1d9dd99915
fix(flow-snapshot): normalise bare-name Job.DependsOn to canonical JobID form (#1434)
helmwatch.Bridge writes SOME Job.DependsOn entries as bare names
("install-flux") rather than the canonical JobID form
("<deploymentId>:install-flux") — 71 such entries observed on prov
bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied
those bare names verbatim into Relationship.fromId. The canvas
reducer matches FlowNode.id by exact string, so the bare-name fromId
became a phantom edge pointing to a non-existent node. In the
force-directed layout these phantom edges visually routed through
the nearest real bubbles, manifesting as 5-edge fan-outs from every
Phase-0 tofu job to every install-* bubble (operator-reported on
install-cnpg, but symmetric across all install-*).

Normalise every fromId to jobs.JobID(deploymentID, dep) form when
the stored value lacks a ":" separator.
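
The normalisation rule is small enough to state as code. This sketch mirrors the rule in TypeScript for illustration; it assumes jobs.JobID produces `<deploymentId>:<jobName>`, as the canonical form quoted above suggests, and `normaliseDep` is a hypothetical name.

```typescript
// Assumed canonical form, per the message above: "<deploymentId>:<jobName>".
function jobId(deploymentId: string, jobName: string): string {
  return `${deploymentId}:${jobName}`;
}

// Bare names gain the deployment prefix; already-canonical ids pass through.
function normaliseDep(deploymentId: string, dep: string): string {
  return dep.includes(":") ? dep : jobId(deploymentId, dep);
}
```

Applied to every entry, `["install-flux"]` and `["bfdccbdbd6f700e1:install-flux"]` both resolve to the same canonical id, so the reducer's exact-match lookup finds a real node.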

Caught after operator reported "install-cnpg has 5 different
connections from terraform jobs — this is matter of a proper
chaining" — looking at the snapshot showed Job.DependsOn=[install-flux]
without the prefix.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:00:04 +04:00
github-actions[bot]
1a0333a43f deploy: update catalyst images to 93c3e81 2026-05-12 07:27:29 +00:00
e3mrah
93c3e81f0c
fix(flow-snapshot): contains edge direction — toId is parent per canon (#1433)
Per products/openova-flow/core/src/types.ts line 112:
  "contains — toId (parent) contains fromId (child)"

My emit had this inverted: I set FromID=parent, ToID=child, which
made the FE adapter (flowStreamToOrganic.ts line 134) interpret every
install-* leaf as a group containing the bootstrap-kit/provisioner
group nodes. Net result: only 2 bubbles ever rendered on the canvas
regardless of ?depth= because the hierarchy graph was upside-down.
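
The direction rule can be made concrete with a parent lookup built from `contains` edges. The `Relationship` shape below is a hypothetical reduction of the real type; only the fromId/toId semantics quoted from types.ts are taken from the source.

```typescript
interface Relationship {
  type: "contains" | "finish-to-start";
  fromId: string;
  toId: string;
}

// contains: toId (parent) contains fromId (child), so the map is child -> parent.
function parentOf(rels: Relationship[]): Map<string, string> {
  const parent = new Map<string, string>();
  for (const r of rels) {
    if (r.type === "contains") parent.set(r.fromId, r.toId);
  }
  return parent;
}
```

With the inverted emit, the same lookup would report each install-* leaf as the parent of bootstrap-kit, which matches the upside-down hierarchy described above.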

Caught by opening the canvas in a browser via Playwright after the
operator reported "still showing only 2 bubbles, no drill-down".

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:24:30 +04:00
github-actions[bot]
9011d1b635 deploy: update catalyst images to 048a4d8 2026-05-12 06:46:54 +00:00
e3mrah
048a4d8910
fix(refresh-watch): disk-fallback when Result.KubeconfigPath is empty (#1432)
When the Pod restarts between PutKubeconfig writing the file AND the
next Result.Save() persisting the field, dep.Result.KubeconfigPath
comes back empty even though the file exists at the canonical
convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was
returning 409 watch-not-resumable in this state, which left the
mothership canvas frozen because the live watcher couldn't re-attach
to source HR.spec.dependsOn for the install-* edge derivation.

Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for
PR #1431 restarted catalyst-api Pod, the file
/var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but
RefreshWatch refused to use it because the record field was empty.

Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured
AND a file exists at <dir>/<depID>.yaml, use that path and patch the
record so subsequent /components/state + flow snapshot calls see a
populated field.
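
The fallback rule can be sketched in Python as below. This is an
illustration of the decision only, not the Go handler: the function name
and arguments are assumptions, and persisting the resolved path back to the
record is left to the caller.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_kubeconfig_path(recorded: str, kubeconfigs_dir: str,
                            deployment_id: str) -> Optional[str]:
    """Return the recorded path, else the on-disk fallback, else None."""
    if recorded:
        return recorded
    if not kubeconfigs_dir:
        return None  # no directory configured, nothing to fall back to
    candidate = Path(kubeconfigs_dir) / f"{deployment_id}.yaml"
    return str(candidate) if candidate.is_file() else None


# Demo in a throwaway directory; the deployment id is the one quoted above.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bfdccbdbd6f700e1.yaml").write_text("apiVersion: v1\n")
    found = resolve_kubeconfig_path("", tmp, "bfdccbdbd6f700e1")
    missing = resolve_kubeconfig_path("", tmp, "0000000000000000")

print(found is not None, missing)
```

A `None` result corresponds to the case where the watch stays
non-resumable.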

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:44:55 +04:00
github-actions[bot]
7e4f38ec62 deploy: update catalyst images to e3771f6 2026-05-12 06:38:32 +00:00
e3mrah
e3771f6813
fix(flow): derive HR dependsOn from live watcher + fix canvas drill-down 404 (#1431)
Two bugs the operator hit on /sovereign/provision/<id>/jobs:

1) Phase-1 install-* Jobs rendered DISCONNECTED on the canvas —
   helmwatch.Bridge doesn't persist Job.DependsOn (only the Phase-0
   tofu chain + cluster-bootstrap is wired today). Pull HR.spec.dependsOn
   from the live Watcher's informer cache via SnapshotComponents()
   (ComponentSnapshot.DependsOn already populated by extractDependsOn)
   at snapshot-time and emit finish-to-start edges from upstream
   install-<dep> to install-<self>. Also add provisioner→bootstrap-kit
   group-to-group finish-to-start so the Phase-0/Phase-1 ordering is
   visible on the canvas.

2) Clicking a canvas node → "404 page not found" because
   FlowPage.handleNodeDoubleClick passed the full
   "<deploymentId>:install-X" id verbatim. The backend Store.GetJob
   keys by bare jobName ("install-X"), so the colon-prefixed id missed
   exact-match and JobDetail returned 404. Mirror useJobLinkBuilder
   (JobsTable.tsx line 364): strip the "<deploymentId>:" prefix and
   encodeURIComponent the remainder before pushing to the router.
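
A rough Python equivalent of the id handling in item 2, for illustration
only. The function name is hypothetical, and `urllib.parse.quote` with
`safe=""` stands in for `encodeURIComponent`; the two agree on `/` and
spaces but differ on a few punctuation characters.

```python
from urllib.parse import quote


def job_route_segment(node_id: str, deployment_id: str) -> str:
    """Strip a leading '<deploymentId>:' and percent-encode the job name."""
    prefix = f"{deployment_id}:"
    job_name = node_id[len(prefix):] if node_id.startswith(prefix) else node_id
    return quote(job_name, safe="")


print(job_route_segment("bfdccbdbd6f700e1:install-cnpg", "bfdccbdbd6f700e1"))
```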

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:36:22 +04:00
github-actions[bot]
59b6940c18 deploy: update catalyst images to 2fbab45 2026-05-12 06:08:41 +00:00
e3mrah
2fbab45b43
feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy (#1429)
* fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template

Without this env the proxy resolveFlowServerURL() falls back to
per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>) which
only exists on Sovereigns that already installed bootstrap-kit slot 56
with httproute=enabled. Every other catalyst-api deployment (mothership
contabo + Sovereigns that haven't reached cutover yet) returns 502 on
/api/v1/flows/{deploymentId}/snapshot, which is the live regression the
founder saw at console.openova.io: "No nodes to render."

The env points at the in-cluster Service DNS for the LOCAL openova-flow-
server. Both the mothership (catalyst-system or catalyst namespace) and
each Sovereign chroot run the bp-openova-flow-server chart with a local
Service, so this URL is correct for every cluster catalyst-api runs in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy

Mothership canvas at /sovereign/provision/<id>/jobs was empty for the
first ~30 minutes of every fresh provision because the snapshot
endpoint went straight to https://openova-flow.<sovereignFQDN> which
can't serve until cilium + cert-manager + the HTTPRoute TLS cert are
all up on the chroot. The Phase-0 + Phase-1 lifecycle Jobs catalyst-api
ALREADY owns (tofu-init/plan/apply/output, flux-bootstrap,
install-bp-<chart>, ...) were invisible the whole time.

This change adds flowSnapshotFromJobs which assembles the canonical
FlowMessage envelope from h.jobsStore().ListJobs(deploymentID) — every
Job becomes a FlowNode with the legacy <deploymentId>:<jobName> id form
the canvas drill-down already expects, every Job.DependsOn becomes a
finish-to-start Relationship, every Job.ParentID becomes a contains
Relationship. HandleFlowSnapshot checks the local store first and
returns immediately when it has data; otherwise falls through to the
existing upstream proxy path.
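
The mapping described above can be sketched in Python as follows. The
dict keys (`name`, `depends_on`, `parent_id`, `nodes`, `relationships`)
are assumptions for illustration, not the real FlowMessage schema, and
prefixing the parent id with the deployment id is likewise assumed.

```python
def snapshot_from_jobs(deployment_id, jobs):
    """Turn Job records into nodes plus dependency and containment edges."""
    def node_id(job_name):
        # Legacy "<deploymentId>:<jobName>" id form.
        return f"{deployment_id}:{job_name}"

    nodes = [{"id": node_id(j["name"])} for j in jobs]
    relationships = []
    for j in jobs:
        for dep in j.get("depends_on", []):
            # Upstream job must finish before this one starts.
            relationships.append({"type": "finish-to-start",
                                  "fromId": node_id(dep),
                                  "toId": node_id(j["name"])})
        if j.get("parent_id"):
            # contains: toId is the parent, fromId the child.
            relationships.append({"type": "contains",
                                  "fromId": node_id(j["name"]),
                                  "toId": node_id(j["parent_id"])})
    return {"nodes": nodes, "relationships": relationships}


jobs = [
    {"name": "install-flux", "parent_id": "bootstrap-kit"},
    {"name": "install-cnpg", "parent_id": "bootstrap-kit",
     "depends_on": ["install-flux"]},
]
snapshot = snapshot_from_jobs("dep1", jobs)
print(len(snapshot["nodes"]), len(snapshot["relationships"]))
```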

HandleFlowStream gets the same treatment via flowStreamLocal: emit a
snapshot frame on connect AND every 3 seconds thereafter, plus a 15s
heartbeat. The OpenovaFlow consumer's reducer is idempotent on
snapshot replay so re-emitting an unchanged envelope is harmless;
in exchange the canvas reflects Job state transitions within ~3s
of when helmwatch.Bridge writes them.
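
As a sketch of that cadence, the Python below formats a named
Server-Sent Events frame and lists which frames fall due at a given
second. Both helpers are hypothetical; the real loop is timer-driven
rather than computed from elapsed seconds.

```python
import json


def sse_frame(event: str, payload: dict) -> str:
    """Format one named SSE frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def frames_due(elapsed_s: int, snapshot_every: int = 3,
               heartbeat_every: int = 15) -> list:
    """Frames to emit at a whole second since connect (snapshot at 0)."""
    frames = []
    if elapsed_s % snapshot_every == 0:
        frames.append("snapshot")
    if elapsed_s > 0 and elapsed_s % heartbeat_every == 0:
        frames.append("heartbeat")
    return frames


for second in range(16):
    for name in frames_due(second):
        print(second, repr(sse_frame(name, {})))
```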

No FE change required — the same /api/v1/flows/<id>/snapshot and
/stream endpoints serve the same envelope shape the chroot adapter
emits (products/openova-flow/adapter-flux/internal/types/flow.go),
named SSE events including 'snapshot' and 'heartbeat'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:06:28 +04:00
github-actions[bot]
4ceb74067f deploy: update catalyst images to 50bf7a5 2026-05-12 04:12:24 +00:00
e3mrah
50bf7a59ed
fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428)
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.

Two lock-step changes widen both bounds:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
   install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
   chart genuinely needs >15m worst case when the full SME + Catalyst
   service stack rolls cold.

2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
   DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
   now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
   watch never terminates while helm-controller still has remediation
   attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
   was already wired (issue #538 baseline) — chart template now
   declares the explicit "120m" value so the runtime knob is
   discoverable for capacity-bounded environments. Per
   INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable.
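
The budget arithmetic in item 2 can be checked with a short Python
sketch. The function names are hypothetical; the three attempts come
from the "30m × 3" figure above.

```python
from datetime import timedelta


def worst_case_retry_chain(hr_timeout: timedelta, attempts: int) -> timedelta:
    """Longest time the inner HR install can keep retrying."""
    return hr_timeout * attempts


def watch_outlives_retries(watch_timeout: timedelta, hr_timeout: timedelta,
                           attempts: int = 3) -> bool:
    """True when the outer watch budget exceeds the inner retry chain."""
    return watch_timeout > worst_case_retry_chain(hr_timeout, attempts)


def minutes(n: int) -> timedelta:
    return timedelta(minutes=n)


print(worst_case_retry_chain(minutes(30), 3))
print(watch_outlives_retries(minutes(60), minutes(30)))   # old outer budget
print(watch_outlives_retries(minutes(120), minutes(30)))  # new outer budget
```

Raising only the HR timeout to 30m would have left the 60m watch shorter
than the 90m retry chain, which is why both bounds move together.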

New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override + field-
override tests still pass unchanged.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 08:10:24 +04:00
github-actions[bot]
dd095b8597 deploy: update catalyst images to b743b64 2026-05-12 02:13:30 +00:00
e3mrah
b743b646ac
fix(autoscaler): attach scale-up VMs to private network so they k3s-join (#1427)
Root cause (autoscaler pod log, prov #43 chroot):
  W orchestrator.go:626 Node group workers is not ready for scaleup -
  backoff with status: Scale-up timed out for node group workers after
  15m2.273255226s

Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY:
  workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[]
  workers-a6410e81b24cced  public_net.ipv4=178.105.73.210  private_net=[]

The worker cloud-init (identical to Phase-0 user_data) issues
  curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh -
against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment
that URL is unreachable → k3s agent install silent-fails → node never
registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst-
platform Pending Pods never schedulable → chroot canvas tests blocked.

Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on
the cluster-autoscaler deployment so the Hetzner provider attaches every
scale-up VM to the SAME private network + firewall + ssh-key the Phase-0
Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net /
-fw / catalyst-<sov-fqdn-with-dashes>). Names flow:

  Tofu (hcloud_network.main.name + hcloud_firewall.main.name +
        hcloud_ssh_key.main.name)
   → cloudinit-control-plane.tftpl (3 new template vars)
   → /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys)
   → flux-system/cloud-credentials Secret
   → bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries
     with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*)
   → upstream chart's deployment env

Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent
regression of the three env-var slots in chart values.yaml.

Reaffirms canonical seam: values flow through Tofu → cloud-init →
flux-system Secret → Flux valuesFrom → chart values → upstream env.
Never via kubectl patch, never via bespoke Go API calls.

Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 06:11:30 +04:00
github-actions[bot]
d4d05f16f6 deploy: update catalyst images to 8c7d326 2026-05-12 00:38:43 +00:00
e3mrah
8c7d32616e
fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185, prov #38/#39/#41 recurrence) (#1426)
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):

  bp-catalyst-platform HR install.timeout=15m
    → Helm pre-install hook: qa-finalizer-strip Job (weight -99)
      → Pod requests 50m CPU + 64Mi memory (tiny)
        → BUT no tolerations → scheduler restricted to worker
          → worker cpx32 (8vCPU/16GB) at 99% CPU requests
            (7980m of 8000m allocated) after bootstrap-kit fan-out
            → FailedScheduling: "0/2 nodes are available: 1
              Insufficient cpu, 1 node(s) had untolerated taint
              {node-role.kubernetes.io/control-plane: true}"
            → autoscaler triggers scale-up worker 2→3 → "1 in backoff
              after failed scale-up" → still Pending → 15m timeout
              → InstallFailed → Flux uninstall+rollback → installFailures: 3
              → Flux gives up entirely

Live evidence quoted from chroot kubeconfig on prov #41:
  - bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
    message="Running 'install' action with timeout of 15m0s"`
  - HR `Released=False, reason=InstallFailed, message="Helm install
    failed for release catalyst-system/catalyst-platform with chart
    bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
    * timed out waiting for the condition"`
  - Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
    `Warning  FailedScheduling 108s default-scheduler 0/2 nodes are
    available: 1 Insufficient cpu, 1 node(s) had untolerated taint
    {node-role.kubernetes.io/control-plane: true}`
  - Worker `Allocated cpu 7980m (99%) of 8000m capacity`
  - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)

Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
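
A deliberately simplified Python model of the scheduling decision, using
the CPU figures quoted above. It considers only CPU requests and one
taint key, which is an assumption; the real scheduler evaluates far more.

```python
CP_TAINT = "node-role.kubernetes.io/control-plane"


def can_schedule(node: dict, pod: dict) -> bool:
    """Pod fits if every node taint is tolerated and CPU is available."""
    free_m = node["capacity_m"] - node["allocated_m"]
    tolerated = all(t in pod["tolerations"] for t in node["taints"])
    return tolerated and free_m >= pod["cpu_request_m"]


worker = {"capacity_m": 8000, "allocated_m": 7980, "taints": []}
control_plane = {"capacity_m": 8000, "allocated_m": 635, "taints": [CP_TAINT]}

hook_before = {"cpu_request_m": 50, "tolerations": []}
hook_after = {"cpu_request_m": 50, "tolerations": [CP_TAINT]}

for label, pod in (("before", hook_before), ("after", hook_after)):
    print(label, can_schedule(worker, pod), can_schedule(control_plane, pod))
```

The worker has 20m free against a 50m request, so only the toleration
opens a node where the hook can land.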

Why prior fixes didn't suffice:
  - Fix #114 introduced this hook to break a finalizer-deadlock loop
    on prov #9. Correct fix for that wedge; never anticipated worker
    saturation as a scheduling failure mode for the hook itself.
  - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
    qa-cnpg-status-seed hooks (weight 0/post-install) to regular
    release resources to break a circular DAG dep. Different hook
    surface.
  - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install
    hook (weight +10) wait budget for cold-start autoscaler. That
    hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
    never starts, the +10 hook never runs.

Recurring class: same family as Fix #114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:

  - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
    redirect for deprecated Bitnami images, 2025-08 cutover documented
    at platform/self-sovereign-cutover/chart/values.yaml:252)
    → harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4, the
    canonical alpine-based kubectl image already used by sibling
    hook catalyst-gitea-token-mint (Fix #163). MIRROR-EVERYTHING +
    ARCHITECT-FIRST rules.

Coordinator follow-up tickets:
  - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
    (qa-cnpgpair-status-seed) still reference
    bitnamilegacy/kubectl:1.29.3 (same Bitnami-deprecation class). Out
    of scope for this Fix (not part of the recurrence cluster); flagged
    for a sweep.
  - Worker cpx32 may be undersized for the bootstrap-kit fan-out on
    omantel.biz; separate sizing ticket, not blocking.

Changes:
  - products/catalyst/chart/templates/qa-fixtures/
    pre-install-finalizer-strip.yaml: add tolerations +
    priorityClassName; switch image to alpine/k8s:1.31.4. Inline doc
    comments explain the 4-layer trace and the Fix #114/#138/#184
    history.
  - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
    changelog entry capturing root cause + budget arithmetic.
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
    bump HR pin 1.4.140 → 1.4.141.

Verification:
  - helm template renders cleanly (exit 0, ~6700 lines).
  - kubectl apply --dry-run=client validates the rendered Job
    manifest (job.batch/qa-finalizer-strip created (dry run)).
  - Rendered Job contains tolerations[control-plane Exists NoSchedule],
    priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 04:36:35 +04:00
e3mrah
ce76a7b7ab
fix(bp-powerdns): root-cause Job DeadlineExceeded recurrence (post Fix #144) (#1425)
Fix #144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook.
That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12)
both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT
`DeadlineExceeded`. The deadline never got a chance to fire.

Trace from prov #38 chroot (`KUBECONFIG=/tmp/prov38.kubeconfig kubectl
get hr bp-powerdns -o yaml`):

  status:
    Helm install failed for release powerdns/powerdns with chart
    bp-powerdns@1.2.2: failed post-install: 1 error occurred:
      * job powerdns-zone-bootstrap failed: BackoffLimitExceeded

Pod events for powerdns-zone-bootstrap-tq7qq:
  59m Started container zone-bootstrap
  56m Back-off restarting failed container zone-bootstrap
  55m Job has reached the specified backoff limit

Root cause walked end-to-end (per CLAUDE.md TRACE rule):

  TEST: bp-powerdns HR Ready=True
    ↑
  HR: Helm install succeeds (post-install Job exits 0)
    ↑
  Zone-bootstrap Job: curl POST succeeds
    ↑
  powerdns:8081 Service: reachable (has Ready endpoints)
    ↑
  powerdns Deployment: Pods Ready (3 replicas)  ← Pending, blocked here
    ↑
  CNPG cluster: pdns-pg-app Secret exists
    ↑
  pdns-pg-1-initdb Pod: scheduled, Running, Completed  ← Pending too
    ↑
  Worker node has capacity                              ← 99% CPU requested

The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded` — well before activeDeadlineSeconds=840s
(14min) could even consider firing.

Fix #144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.
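
A back-of-envelope Python check of why the backoff ceiling fires first.
It assumes restart delays start at 10s and double up to a 300s cap (the
documented kubelet container-restart back-off) and ignores the time each
container actually runs; both parameters are adjustable.

```python
def restart_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> list:
    """Delay before each restart: base doubling per attempt, capped."""
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]


ACTIVE_DEADLINE_S = 840  # value set by Fix #144

delays = restart_delays(6)
total_backoff_s = sum(delays)

print(delays, total_backoff_s)
print(total_backoff_s < ACTIVE_DEADLINE_S)
```

Under these assumptions six backoffs add up to roughly ten minutes,
short of the 14-minute deadline.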

Architectural fix (this commit):

1. Move the wait-for-API loop INSIDE the container (one Pod, one inner
   poll loop, restartPolicy=Never). The inner loop polls
   GET /api/v1/servers every 10s until HTTP 200, bounded by new
   `apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container
   run owns the full wait budget instead of N short-lived containers
   racing the backoff timer.

2. restartPolicy: OnFailure → Never. The container script handles its
   own retry; Kubernetes-level backoff is reserved for genuinely
   transient pod failures (image-pull, OS eviction) where the Job-level
   backoffLimit=6 still triggers a fresh Pod.

3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower
   clusters can raise the inner deadline without forking the chart
   (per docs/INVIOLABLE-PRINCIPLES.md #4).

4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s).
   Sits below activeDeadlineSeconds (840s) so the zone-creation phase
   keeps ≥240s of headroom AFTER the API comes Ready.

Curl status handling in the wait loop:
  200          → API up, proceed to bootstrap
  401|403      → auth failure, FATAL (no retry — operator misconfig)
  000|5xx|...  → transient, sleep & retry until inner deadline
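
The status table and the bounded inner loop can be sketched in Python as
below. This is an illustration of the control flow, not the chart's shell
script: `probe`, `sleep` and `now` are injected (hypothetical parameters)
so the demo runs on a fake clock, and status 0 stands in for curl's 000.

```python
import time


def classify(status: int) -> str:
    """Map an HTTP status (0 means no response) to an action."""
    if status == 200:
        return "ready"
    if status in (401, 403):
        return "fatal"  # auth failure: retrying cannot help
    return "retry"      # 000, 5xx and anything else count as transient


def wait_for_api(probe, timeout_s=600, interval_s=10,
                 sleep=time.sleep, now=time.monotonic) -> str:
    """Poll until ready or fatal, giving up once the deadline would pass."""
    deadline = now() + timeout_s
    while True:
        outcome = classify(probe())
        if outcome != "retry":
            return outcome
        if now() + interval_s > deadline:
            return "timeout"
        sleep(interval_s)


# Demo with a fake clock so nothing actually sleeps.
clock = [0]
responses = iter([0, 503, 200])
result = wait_for_api(
    probe=lambda: next(responses),
    sleep=lambda s: clock.__setitem__(0, clock[0] + s),
    now=lambda: clock[0],
)
print(result, clock[0])
```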

Files changed:
- platform/powerdns/chart/Chart.yaml         1.2.2 → 1.2.3 + history
- platform/powerdns/chart/values.yaml        + apiReadyTimeoutSeconds knob
- platform/powerdns/chart/templates/
    zone-bootstrap-job.yaml                  inner wait-for-API loop;
                                              restartPolicy: Never
- clusters/_template/bootstrap-kit/
    11-powerdns.yaml                         pin to 1.2.3 + HR comment

Why this is sufficient where Fix #144 was not:

Fix #144 worked the chart-level deadline. This commit works the
inner-loop ownership — the wait budget is now owned by the script
inside the container, not by the Job spec arithmetic
(backoffLimit × backoff-delay). The Job's outer activeDeadlineSeconds
still caps the worst-case runtime (no runaway poll), but the script
now actually GETS to use it.

Verification:
- helm template renders cleanly (deps build OK, empty-zones short-
  circuit preserved, non-empty zones render Job + RBAC + Audit CM)
- kubectl create --dry-run=client --validate=false: 5/5 resources
  created (sa, role, rb, cm, job)
- chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml

Companion infrastructure note (NOT addressed by this commit, flagged
for Coordinator):

The deepest layer of the trace stack is worker capacity. Prov #38's
single cpx32 worker (8 vCPU / 16 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because
two unscheduled pods (gitea/gitea-* PV affinity conflict from a
previous wedged install; trivy-system/node-collector NodeAffinity)
poison the autoscaler's "can the template node fit" check. Even with
this chart fix in place, the powerdns Deployment cannot become Ready
until either:
  (a) the worker autoscales successfully (gitea PV migrated / trivy
      taints relaxed), or
  (b) worker_count is bumped from 2 to 3 in the provisioning body, or
  (c) qa_worker_size is bumped to cpx42.

This chart fix ensures bp-powerdns survives a slow CNPG cold-start.
It does NOT fix a fundamentally undersized cluster. Coordinator next
step: reprov with worker_count=3 OR qa_worker_size=cpx42 + this chart
landed. Either should converge.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 02:13:34 +04:00
Claude Code
569d780b86 fix(bp-openova-flow-emitter slot 57): drop :8080 port (Service is :80)
The chroot bp-openova-flow-emitter posts to
http://openova-flow-server.catalyst-system.svc.cluster.local:8080
but the bp-openova-flow-server chart's Service is exposed on :80
(targetPort:8080 → port:80, Kubernetes Service indirection).

Result: every event POST from the chroot emitter times out on dial, the
chroot's openova-flow data plane never populates, and canvas pages
viewing the chroot render empty.

Same fix as PR #124 on mothership emitter-helmrelease.yaml (private
repo). Slot 57 in the bootstrap-kit template was missed in that round.

Live regression on prov #37 (2026-05-11): chroot has 38 bp-* HRs True
but openova-flow snapshot is empty because emitter can't reach server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:49:29 +02:00
github-actions[bot]
5fdd33b7c0 deploy: update catalyst images to 0ba87bb 2026-05-11 18:32:08 +00:00
e3mrah
0ba87bb8da
fix(JobsPage): use FlowNode.id in row anchor href (region prefix) (#1414)
TC-035 (iter-2, 2026-05-11): OpenovaFlow rows merged into JobsPage
(PR #1413) lost their region-prefixed identity in the URL. The link
builder sliced the "<prefix>:" segment off every id with a colon —
intended to strip the legacy "<deploymentId>:install-keycloak" form,
but it also stripped "contabo:bp-openova-flow-server" → bare
"bp-openova-flow-server" in the href. The matrix asserts the
verbatim form "/jobs/contabo:bp-openova-flow-server" must appear in
the rendered DOM.

Fix: stop slicing. `encodeURIComponent` still escapes unsafe path
chars (`/` for live K8s job ids like "job/syft-grype/..."), then we
restore `:` because RFC 3986 permits it as a path-segment `pchar`.
FlowPage canvas navigation (PR #1411) and JobDetail flow-fallback
(PR #1412) already pass on the colon-present form, so this round-
trips end-to-end. Legacy "bp-cilium" / "cluster-bootstrap" hrefs are
unchanged (no `:` to encode). The previously-stripped legacy form
"<deploymentId>:install-keycloak" now lands as the full id in the
URL, and JobDetail's `jobsById` lookup is already keyed by BOTH the
canonical id AND the bare jobName (JobDetail.tsx:124-131), so the
resolution path is preserved.
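
A rough Python analogue of the new href rule, for illustration only.
The function name and the full "job/syft-grype/scan" id are hypothetical,
and `urllib.parse.quote` with `safe=":"` approximates "encode, then
restore the colon" in one step.

```python
from urllib.parse import quote


def job_href(node_id: str) -> str:
    """Percent-encode the id as one path segment but keep ':' literal."""
    return "/jobs/" + quote(node_id, safe=":")


print(job_href("contabo:bp-openova-flow-server"))
print(job_href("job/syft-grype/scan"))
print(job_href("bp-cilium"))
```

Slashes are escaped so the id stays a single path segment, while the
region prefix survives verbatim.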

Test coverage: new Case 4 in JobsPage.flow-merge.test.tsx asserts
the openova-flow row's anchor `href` contains
`/jobs/contabo:bp-openova-flow-server` and is NOT the bare-jobName
form. All 4 flow-merge cases PASS. The 3 pre-existing failures in
JobsPage.test.tsx (back-to-apps href, canonical-columns header,
Show-as-Flow button) are the documented iter-2 baseline — untouched
by this change.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:29:46 +04:00
github-actions[bot]
5c987309b5 deploy: update catalyst images to 5332ed0 2026-05-11 17:56:31 +00:00