openova

Author	SHA1	Message	Date
e3mrah	115c58885b	fix(cilium-gateway): allow world ingress to reserved:ingress (unblocks Sovereign public surfaces) (#1482) * fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled. On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks Ready=False — Cilium Gateway has no valid TLS secret → envoy listener never binds → public TLS handshake to console.<fqdn> dies with SSL_ERROR_SYSCALL. Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ? staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}. Default behaviour unchanged for non-QA (production) Sovereigns — they still resolve to letsencrypt-dns01-prod-powerdns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and a default-deny CCNP is present, every public request to a Sovereign host (console, auth, gitea, registry, api, ...) hits the gateway listener and gets DENIED at envoy's cilium.l7policy filter with: cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy. Root cause: Cilium creates a special endpoint with identity reserved:ingress (8) representing the gateway listener. By default this endpoint has policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty L4 rules — so no port is permitted.
The default-deny CCNP's NotIn-namespace endpointSelector does NOT cover this endpoint (it has no io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes are Programmed, backends are healthy in-cluster, but every request 403s. Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork fix (#1480) finally activated host-bind on :30443. Verified by: - envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443 - cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1] - transiently applying the same CCNP via kubectl: console.omantel.biz → 200 Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world, cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver, plus egress to all so envoy can forward to any backend service. This is the canonical Cilium hostNetwork Gateway-API zero-trust pattern. Chart bump: catalyst 1.4.142 → 1.4.143. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>	2026-05-14 18:50:34 +04:00
github-actions[bot]	fb99ae5fd0	deploy: update catalyst images to `a88e132`	2026-05-14 14:27:51 +00:00
e3mrah	a88e132be9	fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu (#1481) clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled. On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks Ready=False — Cilium Gateway has no valid TLS secret → envoy listener never binds → public TLS handshake to console.<fqdn> dies with SSL_ERROR_SYSCALL. Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ? staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}. Default behaviour unchanged for non-QA (production) Sovereigns — they still resolve to letsencrypt-dns01-prod-powerdns. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 18:25:45 +04:00
e3mrah	6edb8b4635	fix(cilium): gatewayAPI hostNetwork.nodes.matchLabels (prov #76) (#1480) Cilium gatewayAPI.hostNetwork.enabled=true was set in values.yaml, but without nodes.matchLabels Cilium silently DISABLES hostNetwork mode. The configmap key gateway-api-hostnetwork-nodelabelselector is rendered EMPTY → eBPF redirect for the gateway NodePorts is never programmed → envoy listener has empty bind address → incoming 30443/30080 traffic dead-ends at the Hetzner LB target. Caught on prov #76 (omantel.biz, 2026-05-14): public TLS handshake to console.omantel.biz returns SSL_ERROR_SYSCALL because envoy isn't listening on the NodePort. cilium service list shows zero 30443/30080 entries. cilium proxy status shows 0 redirects active. Set nodes.matchLabels: kubernetes.io/os: linux (every k3s node carries this label) so the gateway listener is exposed on every CP. Chart: 1.3.4 → 1.3.5. bootstrap-kit slot 01 version pin bumped to match. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 18:17:35 +04:00
github-actions[bot]	5752fc751f	deploy: update catalyst images to `bdceb3a`	2026-05-14 12:45:34 +00:00
e3mrah	bdceb3a78a	fix(canvas): region phase sub-groups default to pending (not running) (#1479) Empty handover/apps phase groups (no Jobs emitted yet for those lifecycle phases) were hardcoded to 'running' which propagated up to the root phase groups. With the rollup fix preserving stored status when no children, the correct stored default is 'pending'. After this, fresh-prov handover + apps groups show 'pending' (accurate — those phases haven't started) and the rollup correctly classifies bootstrap-kit + cutover region groups based on their real install-* children. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 16:43:24 +04:00
github-actions[bot]	0e4cb67319	deploy: update catalyst images to `690d588`	2026-05-14 12:40:44 +00:00
e3mrah	690d588a04	fix(canvas): rollup preserves leaf status when group has no children (#1478) Bug found on prov #76 rollup: cluster-bootstrap (a leaf with family='bootstrap') was being treated as an empty group and reset from succeeded → pending. That status then cascaded up through provisioner (whose 5 children include cluster-bootstrap) making provisioner show pending despite all 5 phase jobs being succeeded. Fix: when a node in groupNodeIdx has zero children in contains rels, keep its STORED status instead of forcing pending. This preserves leaf-with-group-family nodes (cluster-bootstrap) AND empty phase groups (handover/apps before their Jobs exist). Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 16:38:30 +04:00
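The zero-children rule from #1478 can be illustrated with a small tree walk. This is a minimal Python sketch, not the catalyst-api code (which is Go): the `Node` type, the `rolled_up_status` name, and the `phase-job-*` ids are hypothetical, and recursing over direct children is an assumption standing in for the `contains` relationships.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    status: str  # status stored at emit time
    children: list["Node"] = field(default_factory=list)


def rolled_up_status(node: Node) -> str:
    """Roll a node's status up from its children, bottom-up."""
    if not node.children:
        # Zero children: keep the STORED status instead of forcing "pending".
        # Covers leaves with a group family and empty phase groups alike.
        return node.status
    child_statuses = [rolled_up_status(child) for child in node.children]
    if all(s == "succeeded" for s in child_statuses):
        return "succeeded"
    if any(s == "failed" for s in child_statuses):
        return "failed"
    if any(s == "running" for s in child_statuses):
        return "running"
    return "pending"


if __name__ == "__main__":
    jobs = [Node("cluster-bootstrap", "succeeded")]
    jobs += [Node(f"phase-job-{i}", "succeeded") for i in range(4)]  # hypothetical ids
    provisioner = Node("provisioner", "pending", jobs)
    print(rolled_up_status(provisioner))
    print(rolled_up_status(Node("handover", "pending")))
```

With this rule a childless `cluster-bootstrap` keeps its stored `succeeded`, so the parent no longer inherits a spurious `pending`.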
github-actions[bot]	195c6b5bc5	deploy: update catalyst images to `13d79c7`	2026-05-14 12:35:31 +00:00
e3mrah	13d79c77f5	fix(flow-emit): lazy-start emit loop on snapshot request (#1477) Bug found on prov #76: rolled-up group status fix wasn't visible because catalyst-api Pod restart (image roll) killed the emit goroutine. startFlowEmitLoop is only invoked from phase1_watch start — for a deployment already at status=ready, the new Pod has no emit loop until someone fires phase1 again. Add idempotent startFlowEmitLoop call inside HandleFlowSnapshot so any UI page load (which polls snapshot) reactivates the emit loop. Combined with the existing phase1-start invocation, this covers both fresh provisioning and post-restart UI access patterns. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 16:33:25 +04:00
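The idempotent lazy start described in #1477 might look roughly like the following. It is a Python sketch under stated assumptions: the real code is a Go goroutine started by `startFlowEmitLoop`, whereas the `FlowEmitter` class and its method names here are hypothetical, and the sketch only records which deployments have a loop rather than spawning a ticker.

```python
import threading


class FlowEmitter:
    """Tracks which deployments already have an emit loop (sketch only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running: set[str] = set()

    def start_emit_loop(self, deployment_id: str) -> bool:
        """Start the loop at most once; True only if this call started it."""
        with self._lock:
            if deployment_id in self._running:
                return False  # already running: the call is a no-op
            self._running.add(deployment_id)
        # A real service would spawn its per-deployment ticker here.
        return True

    def handle_flow_snapshot(self, deployment_id: str) -> dict:
        # Any snapshot poll (re)activates the loop, e.g. after a Pod restart.
        self.start_emit_loop(deployment_id)
        return {"deploymentId": deployment_id}


if __name__ == "__main__":
    emitter = FlowEmitter()
    emitter.handle_flow_snapshot("dep-1")
    print(emitter.start_emit_loop("dep-1"))
```

Guarding the check-and-add with one lock is what makes concurrent snapshot polls safe to call the start function repeatedly.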
github-actions[bot]	5527652b49	deploy: update catalyst images to `f334950`	2026-05-14 12:29:07 +00:00
e3mrah	f3349501b8	fix(canvas): roll-up group status from descendants (prov #76) (#1476) Founder reported on prov #76: 'there are pending and running jobs still I dont think they are true'. Examination showed all 135 install-* leaf statuses are succeeded but the synthetic group nodes (cutover, handover, apps + per-region sub-groups) carried hardcoded placeholder statuses ('running' / 'pending') from emit time. Add bottom-up roll-up after all nodes/rels are emitted: - all descendants succeeded → succeeded - any descendant failed → failed - any descendant running → running - else → pending (no descendants or all pending) Now cutover phase bubble shows succeeded when its install-self-sovereign-cutover child has finished, etc. handover/apps stay pending until real Jobs are emitted for them (jobs.Store integration is the follow-up that materialises those phases). Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 16:26:59 +04:00
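The four roll-up rules in #1476 can be read as an ordered classifier. A minimal Python sketch follows; the function name `roll_up` is hypothetical and the real implementation is Go, so treat this as an illustration of the rule order only.

```python
from typing import Iterable


def roll_up(descendant_statuses: Iterable[str]) -> str:
    """Classify a group from its descendants, in the commit's rule order."""
    statuses = list(descendant_statuses)
    if statuses and all(s == "succeeded" for s in statuses):
        return "succeeded"
    if any(s == "failed" for s in statuses):
        return "failed"
    if any(s == "running" for s in statuses):
        return "running"
    # No descendants, or none failed/running and not all succeeded.
    return "pending"


if __name__ == "__main__":
    print(roll_up(["succeeded"] * 135))
    print(roll_up(["succeeded", "running", "pending"]))
    print(roll_up([]))
```

Note that an empty group falls through to `pending` under these rules; the explicit `statuses and` guard avoids `all()` returning true for an empty list.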
github-actions[bot]	ebac4ed63f	chore(deploy): bump openova-flow-server image to `a2167f3` [skip ci]	2026-05-14 10:25:19 +00:00
e3mrah	a2167f36de	fix(openova-flow): COPY go.sum + go mod download in Dockerfile (#1475) CI build failed with missing go.sum entry for pgx after the in-memory→CNPG rewrite (now has real deps). The previous Dockerfile only COPYed go.mod — fine when the codebase had zero external deps, broken once pgx + pgxpool + x/text + x/sync landed in go.sum. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:23:57 +04:00
e3mrah	808310b144	fix(openova-flow): pin pgx to v5.5.5 for Go 1.22 build compat (#1472) CI Dockerfile uses golang:1.22-alpine. Default pgx@v5.9.2 requires Go 1.25 — fix by pinning pgx@v5.5.5 + x/text@v0.21.0 + x/sync@v0.10.0. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:21:33 +04:00
github-actions[bot]	fb8303766e	deploy: update catalyst images to `587a985`	2026-05-14 10:18:12 +00:00
e3mrah	587a985dc6	refactor(openova-flow): CNPG-backed durable store + emit loop (#1471 ) Founder feedback on prov #75: "uncappetabel stupid design… if our pods are restarting entire flow information are exec logs are being wiped". Root cause: openova-flow-server had ZERO persistence (in-memory map+RingBuffer per flowId) so pod restart wiped all canvas state. catalyst-api's flow_snapshot_local.go composer was added as a "fallback" precisely because openova-flow-server couldn't be trusted — but that created TWO half-broken paths instead of one durable backend. ## Waterfall delivery — single PR, end-to-end ### openova-flow-server: in-memory → CNPG (Postgres) backed - New schema: `flow_instances`, `flow_nodes`, `flow_relationships`, `flow_events`, `flow_log_lines`, `flow_executions` with CASCADE FK, indexes on (flow_id, status/region/family), and a bounded-retention trigger on `flow_events` (keeps last 4096 per flow_id — matches the prior RingBuffer capacity). - `pgstore.go` rewires Append/Snapshot/Subscribe/Drop with pgxpool transactional writes + LISTEN/NOTIFY pub/sub via per-flow channel hash. Migrations applied at startup via embedded `embed.FS`. - Backend abstraction (`store.Backend`) lets api/ swap between PGStore (production) and the legacy MemBackend (tests/dev). `FLOW_SERVER_BACKEND=pg\|memory` env selects. - New endpoints: POST/GET `/v1/flows/{id}/log-lines` for exec log ingest+replay against the `flow_log_lines` table. ### Helm chart: CNPG Cluster CR + DSN wire-in - New `templates/cnpg-cluster.yaml` provisions `openova-flow-pg` via bp-cnpg's `postgresql.cnpg.io/v1.Cluster`. CASCADE-FK-aware schema + Reflector annotations for cross-NS secret access. - Deployment env wires `FLOW_SERVER_PG_DSN` from CNPG's auto-generated `<cluster>-app` Secret (`uri` key — full libpq URI with auth). - `chart 0.1.1 → 0.2.0` (breaking schema change). - bootstrap-kit slot 56: `dependsOn: bp-cnpg` so cold install order is correct. 
### catalyst-api: emit loop + remove local fallback first - New `internal/flowemit/` HTTP client posts FlowMessage envelopes (snapshot, upsert-nodes, upsert-rels, delete-*) to `OPENOVA_FLOW_SERVER_URL/v1/flows/{id}/events`. Bounded retry, fire-and-forget. - New `flow_emitter.go` runs a per-deployment 5s ticker goroutine that composes the current snapshot via `flowSnapshotFromJobs` and emits it. State changes via Bridge call `triggerFlowEmit(depID)` for sub-second propagation. - `HandleFlowSnapshot` order INVERTED: proxy to openova-flow-server FIRST, fall back to local composer ONLY in degraded mode (proxy unreachable). Production traffic now durably reads from CNPG. - Emit loop starts when phase 1 watch begins; idempotent; survives catalyst-api restart because state is in CNPG. ## What this delivers - ✅ Canvas data is DURABLE — survives any pod restart (catalyst-api, openova-flow-server, or both). - ✅ openova-flow-server is now stateless — every read hits CNPG. - ✅ Wire contract (FlowMessage envelopes) unchanged. UI unchanged. - ✅ catalyst-api can be horizontally scaled — no in-memory state needed for the graph path (deployments map + jobs.Store retire in follow-up). ## What's NOT in this PR (clear follow-up) - jobs.Store + PVC retirement: exec logs still on PVC. Moving them to `flow_log_lines` requires updating ~30 callers across the catalyst-api handler/ package — out of scope for this single PR's blast radius. The new `POST /v1/flows/{id}/log-lines` endpoint is already in place; only the call sites need to migrate. - flow_snapshot_local.go: kept as the degraded-mode fallback (proxy unreachable). Will be deleted once jobs.Store retirement removes the underlying read path. Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 14:16:11 +04:00
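The bounded-retention behaviour described in #1471 (keep the last 4096 events per `flow_id`) can be illustrated in memory. This Python sketch only models the retention semantics; it is not the Postgres trigger, and the `EventLog` class and its methods are hypothetical names.

```python
from collections import deque

MAX_EVENTS_PER_FLOW = 4096  # capacity named in the commit message


class EventLog:
    """Per-flow event history that drops the oldest entries past the cap."""

    def __init__(self, cap: int = MAX_EVENTS_PER_FLOW) -> None:
        self._cap = cap
        self._events: dict[str, deque] = {}

    def append(self, flow_id: str, event: object) -> None:
        # deque(maxlen=...) evicts from the left once the cap is reached.
        self._events.setdefault(flow_id, deque(maxlen=self._cap)).append(event)

    def events(self, flow_id: str) -> list:
        return list(self._events.get(flow_id, ()))


if __name__ == "__main__":
    log = EventLog(cap=3)
    for seq in range(5):
        log.append("flow-a", seq)
    print(log.events("flow-a"))
```

Retention is scoped per flow, so a busy flow never evicts another flow's history.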
github-actions[bot]	bb2726bcf9	deploy: update catalyst images to `f110a54`	2026-05-14 06:51:04 +00:00
e3mrah	f110a540d8	fix(canvas): persist DependsOn on every event + /refresh-watch fans out to secondary regions (#1470 ) Founder caught on prov #75 (b7ae422089d4fde9) after PR #1469 deploy: all 3 regions' 45-children dep wiring vanished after the catalyst-api pod restart. Root cause: the deps were never in Job.DependsOn — they were only in the Pod's in-memory hrDeps cache built from liveWatcher.SnapshotComponents() Layer-2 in flow_snapshot_local.go. Pod restart killed the cache. ## Two fixes ### Fix A — Bridge.OnHelmReleaseEvent preserves existing DependsOn `OnHelmReleaseEvent` previously hardcoded `DependsOn: []string{}` on every HR state-transition event, relying on `mergeJob` to keep the prior list. That works when SeedJobsFromInformerList wrote the deps FIRST. But the seed fires once at OnInitialListSynced; if the seed ran during a window when HR.spec.dependsOn was being applied/rolling, or if the seed didn't run at all (silent informer failure post-Pod restart), Job.DependsOn stays `[]` forever and every subsequent event re-confirms it. Fix: load the existing Job from store first, carry its DependsOn through on the upsert. Same pattern as OnRawComponentLog at line ~939. Combined with mergeJob's preserve-prev behaviour, deps are durable across event waves. ### Fix B — /refresh-watch respawns secondary watchers `POST /refresh-watch` rebuilt the PRIMARY helmwatch.Watcher and re-ran SeedJobsFromInformerList for the primary. But it did NOT respawn secondary watchers — so after a Pod restart, secondaries' 90 install Jobs stayed flat indefinitely. Fix: call `spawnSecondaryRegionWatchers(dep)` from RefreshWatch (idempotent — already running watchers short-circuit on `stopWatchers[region]`). With this, /refresh-watch restores deps for ALL regions, not just primary. ## Validation Caught the bug via per-region edge audit on prov #75 (NOT aggregate counts — per `feedback_validate_full_dod_before_declaring_pass.md`). Pre-fix: fsn1=0 / hel1-2=0 / nbg1-1=0 intra-region edges. 
Post-fix target: fsn1=71 / hel1-2=71 / nbg1-1=71. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 10:48:46 +04:00
github-actions[bot]	b4c96a6d0d	deploy: update catalyst images to `df1dfed`	2026-05-14 06:30:40 +00:00
e3mrah	df1dfed707	fix(canvas): opaque bubbles + explicit wires-below layering (#1469 ) Founder rule on prov #75 review: "make sure the bubbles are no more transparent and wires are always below the bubbles". Two fixes: 1. Opaque bubbles always. Previously `groupOpacity = isDimmed ? 0.35 : 1` dropped the entire group's opacity to 35% when another job was open and this node wasn't on the focused path — making the bubble fill see-through and the edges behind visible THROUGH the bubble. Replaced with a CSS `filter: grayscale + brightness` treatment that desaturates the dimmed node without making it transparent. 2. Explicit edges-then-nodes paint layers. Wrapped the edges loop in `<g className="flow-edges-layer" data-layer="edges">` and the nodes loop in `<g className="flow-nodes-layer" data-layer="nodes">`. SVG paint order already produced the correct ordering via JSX source order, but a future code change inserting another element between the two could quietly break it; the explicit wrappers make the contract load-bearing and inspectable. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 10:28:40 +04:00
github-actions[bot]	331e6b2834	deploy: update catalyst images to `b4c2f54`	2026-05-14 06:12:28 +00:00
e3mrah	b4c2f54fa2	fix(canvas): don't region-prefix PRIMARY install deps (prov #74 ) (#1468 ) Regression caught immediately after PR #1467 by founder on prov #74 (be70efe343e58b5a). My validation declared "✅ all 5 issues passed" based on aggregate 292 edges + 5 sampled hel1-2 deps, missing that PRIMARY fsn1 had 0 intra-region edges + 71 phantom cross-region edges. ## Root cause PR #1467 wired primary install jobs into a primary region sub-group (jobRegion = dep.Request.Region) for symmetric multi-region rendering. `regionalise()` triggered on `jobRegion != ""` — over-applying the `fsn1:` prefix to PRIMARY's bare-named DependsOn entries: install-cilium → install-fsn1:cilium (PHANTOM — no such node exists) PRIMARY install Jobs have BARE JobNames in the store ("install-cilium"); only SECONDARY install Jobs have region-prefixed JobNames ("install-hel1-2:cilium"). Region-prefixing primary deps produces a JobID that matches no node, so the edge is dropped or points at nothing. A second related bug: Layer-1 heuristic `!strings.Contains(dep, ":")` was used to detect bare-jobName form, but with the new `:` separator a region-prefixed JobName ("install-hel1-2:cilium") now contains a colon — so the heuristic mis-classified it as "already a full JobID" and emitted FromID without the deploymentID prefix. Phantom edge. ## Fixes 1. `isSecondaryRegionJob := strings.IndexByte(j.AppID, ':') > 0` replaces `jobRegion != ""` as the regionalise() gate. Primary jobs have no `:` in AppID → no prefix injection. 2. `fullJobIDPrefix := deploymentID + ":"` replaces the `strings.Contains(dep, ":")` heuristic. Only deps that ALREADY carry the deploymentID prefix are passed through verbatim; bare JobNames (with or without region prefix) get the JobID() wrap. ## Lesson learned Saved `feedback_validate_full_dod_before_declaring_pass.md` — aggregate metrics and sample checks are NOT validation. Every DoD bullet must run an explicit per-tier pass/fail check before declaring resolved. 
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 10:10:28 +04:00
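The two gates from #1468 can be sketched together as one dependency-to-JobID normaliser. This is a Python illustration of Go logic, with several assumptions: the function name `dep_job_id` is hypothetical, a JobID is taken to be `<deploymentID>:<JobName>` as the commit messages show, and a dependency that already carries a region prefix is assumed to be left unchanged.

```python
def dep_job_id(deployment_id: str, app_id: str, dep: str) -> str:
    """Turn a DependsOn entry into a full JobID for the given source job."""
    full_prefix = deployment_id + ":"
    if dep.startswith(full_prefix):
        return dep  # already a full JobID: pass through verbatim

    # Secondary-region jobs are the ones whose AppID is "<region>:<chart>".
    sep = app_id.find(":")
    is_secondary = sep > 0
    if is_secondary and dep.startswith("install-") and ":" not in dep:
        region = app_id[:sep]
        dep = f"install-{region}:{dep[len('install-'):]}"

    # Bare JobNames (with or without a region prefix) get the deployment wrap.
    return full_prefix + dep


if __name__ == "__main__":
    print(dep_job_id("dep1", "cert-manager", "install-cilium"))
    print(dep_job_id("dep1", "hel1-2:cert-manager", "install-cilium"))
```

Keying the region gate on the AppID, rather than on whether a region is known, is what stops primary jobs from receiving a prefix that matches no node.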
github-actions[bot]	2f5b1cd0ee	deploy: update catalyst images to `4814c68`	2026-05-14 05:55:28 +00:00
e3mrah	4814c6849b	fix(canvas): wire deps + phase groups + URL-safe separator (prov #73 ) (#1467 ) Founder caught 5 canvas defects on prov #73 (8cd1ff1a80430dc5): 1. ✅ depth=1 shows 2 bubbles (provisioner + bootstrap-kit) — confirmed correct architecture per composer. 2. ✅ Expanding bootstrap-kit shows 3 region sub-groups — confirmed. 3. 🐛 All 135 install-* nodes had ZERO inter-HR dep edges. Snapshot showed only 5 finish-to-start rels (tofu chain + bootstrap-kit sequence). install-cert-manager → install-cilium etc. all missing. 4. 🐛 Canvas only emitted 2 phase groups (provisioner + bootstrap-kit). Missing cutover/handover/apps despite being part of the canonical 5-phase lifecycle. 5. 🐛 /jobs/install-hel1-2/newapi returned 404 because TanStack Router splits "/" in the $jobId param. ## Fixes ### Fix 3a: mergeJob preserves prev.DependsOn when next is empty store.go:283 — `if len(next.DependsOn)==0 && len(prev.DependsOn)>0` keeps prior list. Without this, every OnHelmReleaseEvent (which hardcodes `DependsOn: []string{}` at line 508 because it doesn't re-look up HR.spec.dependsOn per event) CLOBBERED the seeded deps. Confirmed in store: 135/135 install Jobs had `dependsOn: []` despite SeedJobsFromInformerList running with proper deps. Founder reported this same flat-leaves bug 4 sessions in a row. ### Fix 3b: secondary watchers get region-aware seeder hook New `attachSecondaryBridgeSeederHook` + `snapshotsToSeedsForRegion` wire the seed path for secondary helmwatch.Watchers. Without this, secondary install-* Jobs were only ever created by per-event OnHelmReleaseEvent (DependsOn=[]) so the canvas dep graph was permanently flat under secondary region groups regardless of fix 3a. ### Fix 3c: composer Layer-2 reads secondary watchers' HR.spec.dependsOn flow_snapshot_local.go now also walks dep.secondaryWatchers and populates hrDeps with region-prefixed keys + region-prefixed values. 
After fix 3a+3b the stored Job.DependsOn is the authoritative source (Layer 1) — this Layer-2 enrichment is the safety net for hot-shipped charts that bypass the seed path. ### Fix 4: cutover/handover/apps phase groups types.go — add GroupCutover/Handover/Apps constants + Display. flow_snapshot_local.go — add phaseForChart() classifier (currently maps self-sovereign-cutover → cutover), reparent install jobs to the correct phase sub-group, synthesise per-region sub-groups for each phase, emit top-level phase groups, and chain them with finish-to-start: provisioner → bootstrap-kit → cutover → handover → apps. ### Fix 5: JobName separator `/` → `:` (canonical per memory rule) phase1_watch.go:457 emits ev.Component = region + ":" + chart. jobs_backfill.go + flow_snapshot_local.go updated to detect ":" instead of "/". useJobLinkBuilder's encodeURIComponent already handles ":". /jobs/install-hel1-2:newapi now matches the TanStack Router $jobId route. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 09:53:23 +04:00
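Fix 3a in #1467 boils down to one guard in the merge step. The sketch below restates it in Python under stated assumptions: the `Job` dataclass and `merge_job` name are hypothetical stand-ins for the Go `mergeJob` in store.go, and only the `DependsOn` handling is modelled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Job:
    name: str
    status: str
    depends_on: tuple[str, ...] = ()


def merge_job(prev: Job, nxt: Job) -> Job:
    """Take the newer job, but keep prior deps when the update carries none."""
    if len(nxt.depends_on) == 0 and len(prev.depends_on) > 0:
        return replace(nxt, depends_on=prev.depends_on)
    return nxt


if __name__ == "__main__":
    seeded = Job("install-cert-manager", "pending", ("install-cilium",))
    event = Job("install-cert-manager", "running")  # event arrives with no deps
    print(merge_job(seeded, event))
```

An event that arrives with an empty dependency list then updates the status without clobbering the seeded edges.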
github-actions[bot]	f5929e6114	deploy: update catalyst images to `2626d40`	2026-05-14 04:27:53 +00:00
e3mrah	2626d40117	chore(catalyst-chart): bump 1.4.141 → 1.4.142 — propagate prov #72 fixes (#1466) PR #1465 added `catalyst` + `newapi` to default-deny allowlist and shipped `allow-kube-apiserver` CNP for qa-omantel, but the chart version wasn't bumped so HRs across active provisions kept resolving the OLD 1.4.141 artifact (with the broken allowlist). Bumping to 1.4.142 forces Flux on every Sovereign to upgrade and pick up the fix. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 08:25:55 +04:00
github-actions[bot]	edf8e6fd18	deploy: update catalyst images to `c267ab5`	2026-05-14 04:20:59 +00:00
e3mrah	c267ab5338	fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72 ) (#1465 ) * fix(flow_snapshot): region-scope dep edges (no cross-region wiring) Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's install-* nodes all rendered dep arrows pointing at PRIMARY's install nodes — cross-region edges where NAMING-CONVENTION §1.3 demands independent fault domains (no cross-region wiring). Root cause: helmwatch.Bridge persists secondary-region Jobs with bare dep names ("install-cilium") because HR.spec.dependsOn carries chart names without region context. The snapshot composer's normaliser turned `install-cilium` → `<depID>:install-cilium` which IS the primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`. Every secondary install therefore drew a phantom cross-region edge. Fix: in flow_snapshot_local.go, region-scope dep names when the source Job is regional: jobRegion=="hel1-2" + dep="install-cilium" → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium" Same fix applied to the Layer-2 hrDeps derivation path (per-AppID lookup also gets bare chart names from the primary watcher). hrDeps lookup is now done with the unprefixed AppID so it actually hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-init): wait for private NIC before k3s install (prov #71) Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks BEFORE the NIC is ready, renders netplan with only eth0, and the private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN. Effect on secondary CPs: k3s server starts with --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2 and fatals on "listen tcp 10.0.11.2:2380: bind: cannot assign requested address" then crashloops. 
Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service restart counter reached 5394, kubeconfig never PUT back to mothership, canvas showed secondary region as a permanent black hole. Diagnosed via Hetzner rescue mode SSH 2026-05-14. Primary CP works by luck of faster fsn1 zone NIC attach. Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for the expected private IP (control plane) or a route to it (worker). If the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true and `netplan apply`. Bail loudly if the IP/route never appears — failures surface in cloud-init.log instead of disguising as a slow boot. Symmetric fix in worker template covers autoscaler-spawned secondary workers when worker_count > 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72) The qa-fixtures chart's `default-deny` CiliumClusterwideNetworkPolicy excluded `catalyst-system` from its NotIn list but FORGOT `catalyst` (where bp-self-sovereign-cutover's Jobs live: auto-trigger, gitea-mirror, harbor-projects, registry-pivot) and `newapi` (where bp-newapi's Application pods live). Effect on prov #72: - bp-self-sovereign-cutover-auto-trigger Job stuck 20m+ on HTTP 000000 curling http://catalyst-api.catalyst-system.svc → DNS resolution + TCP egress both denied by default-deny. Cutover never fires → handover blocked → bp-catalyst-platform's --wait never completes. - newapi-bp-newapi pod gets `secret newapi-oidc not found` but its inability to resolve apiserver compounds the issue. - qa-omantel cnpg cluster-primary/replica stuck "Setting up primary" for 18m because initdb's `dial tcp 10.43.0.1:443 i/o timeout` — the ClusterIP-rewritten kube-apiserver address has no allow-egress. Fixes: 1. Add `catalyst` + `newapi` to $excludedNamespaces — they're first-party blueprint namespaces analogous to catalyst-system. 2. 
Add `allow-kube-apiserver` CNP in qa-omantel using Cilium's canonical `toEntities: [kube-apiserver]` directive so cnpg initdb can reach the apiserver regardless of whether traffic resolves to ClusterIP, node IP, or Service VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 08:18:54 +04:00
github-actions[bot]	5f2298c550	deploy: update catalyst images to `a75463f`	2026-05-14 03:42:19 +00:00
e3mrah	a75463f76a	fix(cloud-init): wait for private NIC before k3s install (prov #71 ) (#1464 ) * fix(flow_snapshot): region-scope dep edges (no cross-region wiring) Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's install-* nodes all rendered dep arrows pointing at PRIMARY's install nodes — cross-region edges where NAMING-CONVENTION §1.3 demands independent fault domains (no cross-region wiring). Root cause: helmwatch.Bridge persists secondary-region Jobs with bare dep names ("install-cilium") because HR.spec.dependsOn carries chart names without region context. The snapshot composer's normaliser turned `install-cilium` → `<depID>:install-cilium` which IS the primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`. Every secondary install therefore drew a phantom cross-region edge. Fix: in flow_snapshot_local.go, region-scope dep names when the source Job is regional: jobRegion=="hel1-2" + dep="install-cilium" → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium" Same fix applied to the Layer-2 hrDeps derivation path (per-AppID lookup also gets bare chart names from the primary watcher). hrDeps lookup is now done with the unprefixed AppID so it actually hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-init): wait for private NIC before k3s install (prov #71) Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks BEFORE the NIC is ready, renders netplan with only eth0, and the private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN. Effect on secondary CPs: k3s server starts with --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2 and fatals on "listen tcp 10.0.11.2:2380: bind: cannot assign requested address" then crashloops. 
Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service restart counter reached 5394, kubeconfig never PUT back to mothership, canvas showed secondary region as a permanent black hole. Diagnosed via Hetzner rescue mode SSH 2026-05-14. Primary CP works by luck of faster fsn1 zone NIC attach. Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for the expected private IP (control plane) or a route to it (worker). If the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true and `netplan apply`. Bail loudly if the IP/route never appears — failures surface in cloud-init.log instead of disguising as a slow boot. Symmetric fix in worker template covers autoscaler-spawned secondary workers when worker_count > 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 07:39:25 +04:00
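The "poll up to 120s, then bail loudly" step in #1464 follows a common bounded-wait pattern. The real fix lives in cloud-init `runcmd`; this Python sketch only illustrates the control flow. The `wait_until` name, the 2-second interval, and the injectable `clock`/`sleep` parameters are assumptions added for illustration and testability.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 120.0,  # upper bound named in the commit message
    interval_s: float = 2.0,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it succeeds or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # caller should fail loudly rather than proceed
        sleep(interval_s)


if __name__ == "__main__":
    attempts = []

    def private_ip_present() -> bool:  # hypothetical stand-in for the NIC check
        attempts.append(1)
        return len(attempts) >= 3

    ready = wait_until(private_ip_present, interval_s=0.01)
    print("ready" if ready else "private NIC never appeared")
```

Returning a boolean leaves the failure policy to the caller, which matches the intent that a missing IP or route surfaces as an explicit error instead of a slow boot.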
github-actions[bot]	af3a1e6375	deploy: update catalyst images to `410a3db`	2026-05-13 18:05:18 +00:00
e3mrah	410a3dbd33	fix(flow_snapshot): region-scope dep edges (no cross-region wiring) (#1461 ) Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's install-* nodes all rendered dep arrows pointing at PRIMARY's install nodes — cross-region edges where NAMING-CONVENTION §1.3 demands independent fault domains (no cross-region wiring). Root cause: helmwatch.Bridge persists secondary-region Jobs with bare dep names ("install-cilium") because HR.spec.dependsOn carries chart names without region context. The snapshot composer's normaliser turned `install-cilium` → `<depID>:install-cilium` which IS the primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`. Every secondary install therefore drew a phantom cross-region edge. Fix: in flow_snapshot_local.go, region-scope dep names when the source Job is regional: jobRegion=="hel1-2" + dep="install-cilium" → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium" Same fix applied to the Layer-2 hrDeps derivation path (per-AppID lookup also gets bare chart names from the primary watcher). hrDeps lookup is now done with the unprefixed AppID so it actually hits. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 22:03:06 +04:00
github-actions[bot]	3c38565951	deploy: update catalyst images to `4a14bbf`	2026-05-13 16:34:30 +00:00
e3mrah	4a14bbf328	fix(flow_snapshot): symmetric region groups — primary gets its own too (#1460 ) Founder caught on prov #65 (6e2fd14bb8b6ed4d, 2026-05-13): canvas shows ASYMMETRIC structure — primary's 45 install jobs render as BARE LEAVES directly under bootstrap-kit, while secondary regions get a proper region sub-group. Result: M×N fan-out from provision-hetzner cascades onto every primary leaf because there's no primary region group to absorb the elided-group edge. PR #1454 introduced region derivation from JobName's `/` separator (secondary watchers emit `install-<region>/<chart>`). Primary's bridge emits bare `install-<chart>` names — no `/`, no region derived, no group synthesized. Fix: derive primary region from `dep.Request.Region` and apply it to every install job with no `/` in AppID. The synth-region-group loop below already creates one group per discovered region, so primary automatically gets its own `<deploymentId>:<primaryRegion>:bootstrap-kit` bubble containing all 45 primary installs. End state: 3 symmetric region sub-groups under bootstrap-kit (fsn1 + nbg1-1 + hel1-2 for 3-region prov), each with exactly 45 install-* children, region-bounded temporal-endpoint cascade prevents M×N fan-out at depth=all. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:31:20 +04:00
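A minimal sketch of the symmetric region derivation described in this message: an AppID containing `/` carries its own region, and a bare AppID falls back to the primary region. The names `regionFor` and `groupID` are hypothetical, and `primaryRegion` stands in for the value the message says is read from `dep.Request.Region`.

```go
package main

import (
	"fmt"
	"strings"
)

// regionFor returns the region key for an install job's AppID.
// "<region>/<chart>" yields <region>; a bare "<chart>" yields primaryRegion.
func regionFor(appID, primaryRegion string) string {
	if region, _, ok := strings.Cut(appID, "/"); ok {
		return region
	}
	return primaryRegion
}

// groupID builds the region sub-group ID in the
// "<deploymentId>:<region>:bootstrap-kit" shape quoted in the message.
func groupID(deploymentID, region string) string {
	return deploymentID + ":" + region + ":bootstrap-kit"
}

func main() {
	for _, appID := range []string{"cilium", "nbg1-1/cilium", "hel1-2/cilium"} {
		fmt.Println(appID, "->", groupID("dep", regionFor(appID, "fsn1")))
	}
}
```

With the fallback in place every install job maps to exactly one region group, which is what gives the primary its own bubble alongside the secondaries.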
github-actions[bot]	cd5ace8dcb	deploy: update catalyst images to `32e0b40`	2026-05-13 15:42:13 +00:00
e3mrah	32e0b408bf	fix(k3s): add public IP --tls-san + openova.io/region node label (#1459 ) Two related fixes for multi-region + qa-fixtures DoD on prov #64: 1. k3s TLS cert needs the public IPv4 in SAN. Mothership helmwatch.Bridge connects to secondary CPs via PUBLIC IP (cloud-init rewrites kubeconfig 127.0.0.1 → CP_PUBLIC_IPV4). k3s auto-generates the server cert with SANs from --tls-san flags. We only had [sovereign_fqdn, cp_private_ip] → cert valid for 10.0.10.2 + cluster-ip + 127.0.0.1 only. Bridge connection from contabo rejected with: "x509: certificate is valid for 10.0.10.2, 10.43.0.1, 127.0.0.1, ::1, not 204.168.212.113" → silent watcher failure → 0 secondary HRs observed → canvas missing region sub-groups. Fix: pre-fetch the CP's public IPv4 from Hetzner metadata before k3s install, add it as --tls-san=$CP_PUBLIC_IPV4. 2. openova.io/region=hz-fsn-rtz-prod node label. qa-fixtures Pods (CNPGPair primary/replica, status seeder Jobs, qa-wp Application) carry hard nodeAffinity for `openova.io/region in [hz-fsn-rtz-prod]` (per qaFixtures.primaryRegion default in products/catalyst/chart/templates/qa-fixtures/*.yaml). Without the label every fixture pod FailedScheduling → bp-catalyst- platform post-install hook waits forever → bootstrap-kit chain hangs at 44/45 with bp-catalyst-platform Running. Fix: --node-label openova.io/region=hz-fsn-rtz-prod on primary CP (qa-fixtures pin to primary by design). Both shipped in same commit since both are inside the same k3s server install line. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 19:38:25 +04:00
github-actions[bot]	55edb953d5	deploy: update catalyst images to `44913d8`	2026-05-13 14:40:02 +00:00
e3mrah	44913d8a6a	fix(k3s): --kubelet-arg=max-pods=220 (CP + worker) for qa-fixtures load (#1458 ) prov #63 (cpx52 × 3, all PRs live): bp-catalyst-platform install hook timed out because the catalyst-api Helm-released pod stayed Pending with "Too many pods. 0/1 nodes are available". k3s kubelet default max-pods is 110. Full bootstrap-kit (~45 HR-managed deployments, each with 1-3 pods) + qa-fixtures stack (qa-omantel ns Application + Continuum + CNPGPair + PDM CRs + seeder Jobs) + Cilium/ flux/cnpg sidecars saturate the slot cleanly. With workers NotReady on prov #63 the CP carried everything alone and dropped scheduling at 110. Bump to 220 on both CP and worker so the saturation point doesn't gate the bootstrap chain. Safe ceiling: each Hetzner cpx52 node has 16 vCPU + 32GB RAM, plenty of headroom for 220 pods of typical bootstrap-kit weight. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 18:37:42 +04:00
github-actions[bot]	b6e6470ccf	deploy: update catalyst images to `5f4f9f2`	2026-05-13 14:01:04 +00:00
e3mrah	5f4f9f2cb5	fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457 ) prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop with "dial tcp 10.0.1.2:6443: i/o timeout". k3s server auto-detects its node IP from the primary interface, which on Hetzner cpx52 binds to the public IPv4 (49.x.x.x) instead of the private network IP (10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there; nothing answers on 10.0.1.2:6443. Cilium agent's k8s-client wants the private IP from cilium-config k8sServiceHost — times out, CrashLoop. Worked by luck on cpx42 (earlier kernel + Hetzner network attach timing). cpx52 reproduces 100%. Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip} in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443 (cilium-config substitute) find the API server every time. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:34:30 +04:00
e3mrah	6fac1481d3	fix(catalyst-api): bump memory limit 1Gi → 4Gi for multi-region snapshot load (#1456 ) prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during phase-1 watch on a 3-region Sovereign. The in-memory state has grown substantially since the 1Gi limit was set: - 1 primary helmwatch.Watcher (45 HRs + informer cache) - N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each with its own informer cache) - jobs.Store backed by on-disk + in-memory tree - per-/snapshot poll: composes per-region region groups across all Job rows + cross-references hrDeps from the live primary watcher Combined steady-state exceeds 1Gi on cpx32-equivalent loads. Bumped limits to 4Gi (request 512Mi up from 128Mi). The mothership node has 8GB+ resident, no other tight constraint. Future fix: persist region in Job rows so secondary watchers don't need to be retained post phase-1 (orthogonal cleanup). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:20:00 +04:00
github-actions[bot]	2c6374b200	deploy: update catalyst images to `8518bb1`	2026-05-13 12:48:59 +00:00
e3mrah	8518bb1f50	fix(flow_snapshot): drop duplicate live-watcher multi-region block (#1455 ) * fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a running secondary-region install-* row on /sovereign/provision/<id>/jobs landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover and returned "404 page not found". Root cause: useJobLinkBuilder was passing the FULL canvas JobID form through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping the "<deploymentId>:" prefix. The canvas emits ids like "<deploymentId>:install-X" (single-region) or "<deploymentId>:<region>:install-X" (multi-region, see flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName — exact-match URL lookup of the prefix-bearing form misses every time. FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs row click and a canvas drill-down resolve to the SAME backend endpoint. The existing JobsTable row-link test uses a job.id with no `:` prefix, so the strip is a no-op for that fixture and the `/jobs/job-install-cilium` assertion still holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow_snapshot_local): derive region from persisted JobName, synth region groups Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-* leaves as direct children of bootstrap-kit (no region sub-groups visible), and the provision-hetzner→bootstrap-kit edge fans M×N across all 135. Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits events with `ev.Component = region + "/" + componentName`. The jobs bridge persists them with `JobName=install-<region>/<chart>` and `AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no region awareness). 
After phase 1 terminates the deferred stopSecondaries() clears `dep.secondaryWatchers`, so the multi-region snapshot block (line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op. flowSnapshotFromJobs then emits all 135 install Jobs flat under bootstrap-kit, no Region field set, no region group bubbles, and flowLayoutOrganic.ts's temporal-endpoint cascade fans the provisioner→bootstrap-kit edge onto all 135 because there's no intermediate region group to absorb it. Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical multi-region prefix marker), derive the region key, set FlowNode.Region, and re-parent to a synthesised "<deploymentId>:<region>:bootstrap-kit" group. After the loop, synthesise one bootstrap-kit sub-group node per discovered region with a `contains` edge to the parent bootstrap-kit. The resulting shape: bootstrap-kit ├── 45 primary install-* (legacy parent, no region) ├── <region-A>:bootstrap-kit ── 45 install-* (region tagged) └── <region-B>:bootstrap-kit ── 45 install-* (region tagged) This persists ACROSS phase-1 termination because the source of truth is jobs.Store (durable), not dep.secondaryWatchers (transient). The multi-region block (line 408+) still runs WHEN secondary watchers are alive (during phase 1) — it emits ADDITIONAL FlowNodes with "<deploymentId>:<region>:install-X" IDs distinct from the persisted "<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't collide. Post-phase-1 the watchers clear and only the persisted-Job path remains, but now WITH region structure preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow_snapshot): remove duplicate live-watcher multi-region block PR #1454 added region-group synthesis from persisted Job rows. 
The old secondaryWatchers-based block at line 442+ emitted nodes with the SAME region-group IDs AND child nodes, so during phase 1 (when both paths are live) the snapshot rendered with 90 children per region group instead of 45 — visible on prov #61 (2e197a934a0e0461): bootstrap-kit: 49 children hel1-2:bootstrap-kit: 90 children (should be 45) nbg1-1:bootstrap-kit: 90 children (should be 45) Plus the region groups appeared twice in the node list. Root cause: the per-Job loop (PR #1454) and the legacy block both write to the same region-group IDs without deduping. The per-Job path covers the persisted-Job state (durable across phase-1 termination), so the live-watcher path is redundant. Fix: delete the legacy block. The earlier secondaryWatchers-snapshot-into-map work (lines 182-205) is kept because that path also reads dep.liveWatcher (primary) for the hrDeps lookup the per-Job loop uses for primary-region dep edges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:47:00 +04:00
github-actions[bot]	ed4f66438f	deploy: update catalyst images to `d9d7fa2`	2026-05-13 12:26:59 +00:00
e3mrah	d9d7fa2baa	fix(flow_snapshot): derive region from persisted JobName, synth region groups (#1454 ) * fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a running secondary-region install-* row on /sovereign/provision/<id>/jobs landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover and returned "404 page not found". Root cause: useJobLinkBuilder was passing the FULL canvas JobID form through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping the "<deploymentId>:" prefix. The canvas emits ids like "<deploymentId>:install-X" (single-region) or "<deploymentId>:<region>:install-X" (multi-region, see flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName — exact-match URL lookup of the prefix-bearing form misses every time. FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs row click and a canvas drill-down resolve to the SAME backend endpoint. The existing JobsTable row-link test uses a job.id with no `:` prefix, so the strip is a no-op for that fixture and the `/jobs/job-install-cilium` assertion still holds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(flow_snapshot_local): derive region from persisted JobName, synth region groups Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-* leaves as direct children of bootstrap-kit (no region sub-groups visible), and the provision-hetzner→bootstrap-kit edge fans M×N across all 135. Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits events with `ev.Component = region + "/" + componentName`. The jobs bridge persists them with `JobName=install-<region>/<chart>` and `AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no region awareness). 
After phase 1 terminates the deferred stopSecondaries() clears `dep.secondaryWatchers`, so the multi-region snapshot block (line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op. flowSnapshotFromJobs then emits all 135 install Jobs flat under bootstrap-kit, no Region field set, no region group bubbles, and flowLayoutOrganic.ts's temporal-endpoint cascade fans the provisioner→bootstrap-kit edge onto all 135 because there's no intermediate region group to absorb it. Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical multi-region prefix marker), derive the region key, set FlowNode.Region, and re-parent to a synthesised "<deploymentId>:<region>:bootstrap-kit" group. After the loop, synthesise one bootstrap-kit sub-group node per discovered region with a `contains` edge to the parent bootstrap-kit. The resulting shape: bootstrap-kit ├── 45 primary install-* (legacy parent, no region) ├── <region-A>:bootstrap-kit ── 45 install-* (region tagged) └── <region-B>:bootstrap-kit ── 45 install-* (region tagged) This persists ACROSS phase-1 termination because the source of truth is jobs.Store (durable), not dep.secondaryWatchers (transient). The multi-region block (line 408+) still runs WHEN secondary watchers are alive (during phase 1) — it emits ADDITIONAL FlowNodes with "<deploymentId>:<region>:install-X" IDs distinct from the persisted "<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't collide. Post-phase-1 the watchers clear and only the persisted-Job path remains, but now WITH region structure preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:24:20 +04:00
github-actions[bot]	6f50bc0a4a	deploy: update catalyst images to `3a08c23`	2026-05-13 12:05:56 +00:00
e3mrah	3a08c23ae4	fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) (#1453 ) Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a running secondary-region install-* row on /sovereign/provision/<id>/jobs landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover and returned "404 page not found". Root cause: useJobLinkBuilder was passing the FULL canvas JobID form through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping the "<deploymentId>:" prefix. The canvas emits ids like "<deploymentId>:install-X" (single-region) or "<deploymentId>:<region>:install-X" (multi-region, see flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName — exact-match URL lookup of the prefix-bearing form misses every time. FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs row click and a canvas drill-down resolve to the SAME backend endpoint. The existing JobsTable row-link test uses a job.id with no `:` prefix, so the strip is a no-op for that fixture and the `/jobs/job-install-cilium` assertion still holds. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:03:47 +04:00
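The prefix strip described in this message can be sketched as below. The real change is in the TypeScript `useJobLinkBuilder`; this Go version only illustrates the rule "drop everything up to and including the first `:`", and `bareJobName` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// bareJobName removes the leading "<deploymentId>:" segment from a canvas
// JobID. IDs without ':' are returned unchanged, so the strip is a no-op for
// fixtures such as "job-install-cilium".
func bareJobName(canvasID string) string {
	if _, rest, ok := strings.Cut(canvasID, ":"); ok {
		return rest
	}
	return canvasID
}

func main() {
	fmt.Println(bareJobName("a43364f11c10cde3:install-nbg1-1/self-sovereign-cutover"))
	fmt.Println(bareJobName("job-install-cilium"))
}
```

Only the first segment is removed, so an ID of the form `<deploymentId>:<region>:install-X` would keep its `<region>:` part under this sketch.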
github-actions[bot]	f1d77fc9bb	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.20	2026-05-12 18:53:16 +00:00
e3mrah	64876c0de3	fix(bp-guacamole): render.sh resource count 15→19 unblocks Blueprint Release (#1451 ) * fix(infra): pass cp_private_ip to primary CP templatefile too PR #1446 added cp_private_ip references in cloudinit-control-plane.tftpl but only the SECONDARY templatefile call at main.tf:840 already had that var threaded. The PRIMARY CP call at line 342 was missed and tofu plan blew up with "vars map does not contain key cp_private_ip". Set it to "10.0.1.2" for the primary (the hardcoded value the chart default + worker_cloud_init already use for the canonical 10.0.1.0/24 primary subnet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hetzner): pass cp_private_ip into secondary-region CP cloud-init templatefile prov #52-54 all failed at `tofu plan` once cloudinit-control-plane.tftpl started consuming ${cp_private_ip} (PR #1446): Invalid value for "vars" parameter: vars map does not contain key "cp_private_ip", referenced at ./cloudinit-control-plane.tftpl:657,30-43. The primary CP templatefile call (main.tf:342) and the secondary WORKER templatefile call (main.tf:944) both pass `cp_private_ip`, but the secondary CP templatefile call (main.tf:860) was missed — every multi-region provision since PR #1446 lands here at plan-time. Fix: thread `cp_private_ip = local.secondary_region_cp_ips[k]` into the secondary CP templatefile so each secondary region's cilium-operator reaches its OWN local CP (matching CA), not the primary across regions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cilium 1.3.4): kubeProxyReplacement true (BPF masq needs NodePort) Worker cilium-agent on prov #55 (8d85a64cb8807cdc, 2026-05-12) crashloops: fatal: failed to start: daemon creation failed: unable to initialize BPF masquerade support: BPF masquerade requires NodePort (--enable-node-port="true") Chart default kubeProxyReplacement=false leaves enable-node-port=false in the rendered cilium-config ConfigMap. 
Combined with bpf.masquerade=true (also default-on) the cilium-agent rejects the BPF masquerade datapath on startup. CP cilium-agent survives because it was started by cloudinit with the working pre-Flux values BEFORE Flux's helm-upgrade rolled the ConfigMap. Every WORKER node that joins after Flux's upgrade sees the new (broken) ConfigMap → CrashLoopBackOff → node.cilium.io/agent-not- ready taint persists → every post-install Job pod (keycloak-config-cli, powerdns, mimir, openbao) stays Pending → whole bootstrap-kit chain stalls at ~60% Ready. Cloud-init's pre-Flux Cilium install (cloudinit-control-plane.tftpl write_files entry /var/lib/catalyst/cilium-values.yaml) already uses kubeProxyReplacement: true. This change aligns the Flux HR overlay with the working pre-Flux bootstrap so the agent config never regresses when helm-controller does its first upgrade. Bumps bp-cilium 1.3.3 → 1.3.4 and the bootstrap-kit overlay pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-cnpg): wait for webhook readiness so downstream Cluster CRs don't race prov #55+#56 caught bp-harbor / bp-powerdns failing Helm install with: Internal error occurred: failed calling webhook "mcluster.cnpg.io": no endpoints available for service "cnpg-webhook-service" Chain: 1. bp-cnpg install with disableWait: true → HR goes Ready immediately when manifests apply (operator pod still spinning up). 2. Flux releases dependents (bp-harbor, bp-powerdns) — they pass the dependsOn check on bp-cnpg. 3. Downstream chart renders postgresql.cnpg.io/v1.Cluster CRs. 4. cnpg mutating webhook (Service cnpg-webhook-service) has no endpoints yet → admission webhook call fails → Helm install fails → RetriesExceeded → entire DB-backed chain wedges. Carve out the disableWait: true blanket for bp-cnpg specifically. 
INVIOLABLE-PRINCIPLES #3's "event-driven install" rationale (avoid the agent-waits-for-its-own-CRDs deadlock — see bp-cilium) does NOT apply to CNPG: CNPG's CRDs are loaded by helm-controller BEFORE pods schedule, so Helm-wait blocks only on pod readiness, not on a self-referencing CRD. With this change bp-cnpg's HR stays Reconciling until cnpg-controller- manager + cnpg-webhook-service are both rolled + Available, so Flux dependsOn correctly gates downstream consumers behind a webhook that's actually serving. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-guacamole): render.sh expects 19 resources (Fix #125 bootstrap Job) Fix #125's guacamole-oidc bootstrap Job added 4 K8s resources to the chart's full-ON render (1 Job + 1 ServiceAccount + 1 Role + 1 RoleBinding) but render.sh's expect_total was never bumped from 15 → 19. Every Blueprint Release run since `5b711427` fails the test and bails before publishing the chart to GHCR. Consequence: Build bp-guacamole's mirror job successfully mirrors upstream images + bumps Chart.yaml to 0.1.13/0.1.14/.../0.1.18/0.1.19, but the chained Blueprint Release on each bump commit fails render.sh and never publishes. GHCR is stuck at 0.1.12. Bootstrap-kit overlay HRs pinned to anything beyond 0.1.12 wedge with: failed to download chart for remote reference: failed to get 'oci://ghcr.io/openova-io/bp-guacamole:0.1.17': not found Caught on prov #58 (d4f60afe4f13aee9, 2026-05-12) when bp-guacamole HR went False with that exact error across all 3 regions. Also bump bootstrap-kit overlay version pin 0.1.17 → 0.1.19 so the catch-up Blueprint Release (triggered by this commit) lands a tag the overlay actually references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:52:41 +04:00

2018 Commits