1365 Commits

Author SHA1 Message Date
e3mrah
587a985dc6
refactor(openova-flow): CNPG-backed durable store + emit loop (#1471)
Founder feedback on prov #75: "uncappetabel stupid design… if our pods
are restarting entire flow information are exec logs are being wiped".
Root cause: openova-flow-server had ZERO persistence (in-memory
map+RingBuffer per flowId) so pod restart wiped all canvas state.
catalyst-api's flow_snapshot_local.go composer was added as a "fallback"
precisely because openova-flow-server couldn't be trusted — but that
created TWO half-broken paths instead of one durable backend.

## Waterfall delivery — single PR, end-to-end

### openova-flow-server: in-memory → CNPG (Postgres) backed

- New schema: `flow_instances`, `flow_nodes`, `flow_relationships`,
  `flow_events`, `flow_log_lines`, `flow_executions` with CASCADE FK,
  indexes on (flow_id, status/region/family), and a bounded-retention
  trigger on `flow_events` (keeps last 4096 per flow_id — matches the
  prior RingBuffer capacity).
- `pgstore.go` rewires Append/Snapshot/Subscribe/Drop with pgxpool
  transactional writes + LISTEN/NOTIFY pub/sub via per-flow channel
  hash. Migrations are applied at startup from an `embed.FS`.
- Backend abstraction (`store.Backend`) lets api/ swap between
  PGStore (production) and the legacy MemBackend (tests/dev).
  `FLOW_SERVER_BACKEND=pg|memory` env selects.
- New endpoints: POST/GET `/v1/flows/{id}/log-lines` for exec log
  ingest+replay against the `flow_log_lines` table.
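
The backend switch described above could look roughly like the sketch below. It is a minimal illustration, not the code from this PR: the `Backend` interface is trimmed to a single hypothetical `Name` method, `selectBackend` is an invented helper, and defaulting to `pg` when the variable is unset is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// Backend is a hypothetical, trimmed-down stand-in for store.Backend.
type Backend interface {
	Name() string
}

type memBackend struct{}

func (memBackend) Name() string { return "memory" }

type pgBackend struct{ dsn string }

func (pgBackend) Name() string { return "pg" }

// selectBackend maps a FLOW_SERVER_BACKEND value to a Backend.
// Treating an empty value as "pg" is an assumption of this sketch.
func selectBackend(kind, dsn string) (Backend, error) {
	switch kind {
	case "", "pg":
		if dsn == "" {
			return nil, fmt.Errorf("pg backend selected but FLOW_SERVER_PG_DSN is empty")
		}
		return pgBackend{dsn: dsn}, nil
	case "memory":
		return memBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown FLOW_SERVER_BACKEND %q", kind)
	}
}

func main() {
	b, err := selectBackend(os.Getenv("FLOW_SERVER_BACKEND"), os.Getenv("FLOW_SERVER_PG_DSN"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("using backend:", b.Name())
}
```

Failing fast on a missing DSN keeps a misconfigured production pod from silently falling back to the in-memory store.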

### Helm chart: CNPG Cluster CR + DSN wire-in

- New `templates/cnpg-cluster.yaml` provisions `openova-flow-pg` via
  bp-cnpg's `postgresql.cnpg.io/v1.Cluster`. CASCADE-FK-aware schema
  + Reflector annotations for cross-NS secret access.
- Deployment env wires `FLOW_SERVER_PG_DSN` from CNPG's auto-generated
  `<cluster>-app` Secret (`uri` key — full libpq URI with auth).
- `chart 0.1.1 → 0.2.0` (breaking schema change).
- bootstrap-kit slot 56: `dependsOn: bp-cnpg` so cold install order
  is correct.

### catalyst-api: emit loop + proxy-first reads (local composer demoted to fallback)

- New `internal/flowemit/` HTTP client posts FlowMessage envelopes
  (snapshot, upsert-nodes, upsert-rels, delete-*) to
  `OPENOVA_FLOW_SERVER_URL/v1/flows/{id}/events`. Bounded retry,
  fire-and-forget.
- New `flow_emitter.go` runs a per-deployment 5s ticker goroutine
  that composes the current snapshot via `flowSnapshotFromJobs` and
  emits it. State changes via Bridge call `triggerFlowEmit(depID)`
  for sub-second propagation.
- `HandleFlowSnapshot` order INVERTED: proxy to openova-flow-server
  FIRST, fall back to local composer ONLY in degraded mode (proxy
  unreachable). Production traffic now durably reads from CNPG.
- Emit loop starts when phase 1 watch begins; idempotent; survives
  catalyst-api restart because state is in CNPG.

## What this delivers

- Canvas data is DURABLE: it survives any pod restart (catalyst-api,
  openova-flow-server, or both).
- openova-flow-server is now stateless: every read hits CNPG.
- Wire contract (FlowMessage envelopes) unchanged. UI unchanged.
- catalyst-api can be horizontally scaled: no in-memory state
  needed for the graph path (deployments map + jobs.Store retire
  in follow-up).

## What's NOT in this PR (clear follow-up)

- jobs.Store + PVC retirement: exec logs still on PVC. Moving them
  to `flow_log_lines` requires updating ~30 callers across the
  catalyst-api handler/ package — out of scope for this single PR's
  blast radius. The new `POST /v1/flows/{id}/log-lines` endpoint is
  already in place; only the call sites need to migrate.
- flow_snapshot_local.go: kept as the degraded-mode fallback (proxy
  unreachable). Will be deleted once jobs.Store retirement removes
  the underlying read path.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:16:11 +04:00
github-actions[bot]
bb2726bcf9 deploy: update catalyst images to f110a54 2026-05-14 06:51:04 +00:00
e3mrah
f110a540d8
fix(canvas): persist DependsOn on every event + /refresh-watch fans out to secondary regions (#1470)
Founder caught on prov #75 (b7ae422089d4fde9) after PR #1469 deploy:
all 3 regions' 45-children dep wiring vanished after the catalyst-api
pod restart. Root cause: the deps were never in Job.DependsOn — they
were only in the Pod's in-memory hrDeps cache built from
liveWatcher.SnapshotComponents() Layer-2 in flow_snapshot_local.go.
Pod restart killed the cache.

## Two fixes

### Fix A — Bridge.OnHelmReleaseEvent preserves existing DependsOn

`OnHelmReleaseEvent` previously hardcoded `DependsOn: []string{}` on
every HR state-transition event, relying on `mergeJob` to keep the
prior list. That works when SeedJobsFromInformerList wrote the deps
FIRST. But the seed fires once at OnInitialListSynced; if the seed
ran during a window when HR.spec.dependsOn was being applied/rolling,
or if the seed didn't run at all (silent informer failure post-Pod
restart), Job.DependsOn stays `[]` forever and every subsequent event
re-confirms it.

Fix: load the existing Job from store first, carry its DependsOn
through on the upsert. Same pattern as OnRawComponentLog at line
~939. Combined with mergeJob's preserve-prev behaviour, deps are
durable across event waves.

### Fix B — /refresh-watch respawns secondary watchers

`POST /refresh-watch` rebuilt the PRIMARY helmwatch.Watcher and
re-ran SeedJobsFromInformerList for the primary. But it did NOT
respawn secondary watchers — so after a Pod restart, secondaries'
90 install Jobs stayed flat indefinitely. Fix: call
`spawnSecondaryRegionWatchers(dep)` from RefreshWatch (idempotent —
already running watchers short-circuit on `stopWatchers[region]`).
With this, /refresh-watch restores deps for ALL regions, not just
primary.
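
The idempotent respawn in Fix B can be sketched as follows. This is an illustration under assumptions: `watcherSet` and `spawn` are hypothetical names, and only the idea of short-circuiting on an existing per-region stop entry comes from the text.

```go
package main

import (
	"fmt"
	"sync"
)

// watcherSet tracks one stop function per region (hypothetical type).
type watcherSet struct {
	mu   sync.Mutex
	stop map[string]func()
}

// spawn starts a watcher for every region that does not already have one
// and returns how many it started. Regions that are already being
// watched are skipped, so repeated calls are safe.
func (w *watcherSet) spawn(regions []string, start func(region string) func()) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.stop == nil {
		w.stop = make(map[string]func())
	}
	started := 0
	for _, r := range regions {
		if _, running := w.stop[r]; running {
			continue
		}
		w.stop[r] = start(r)
		started++
	}
	return started
}

func main() {
	var ws watcherSet
	start := func(region string) func() {
		fmt.Println("watching", region)
		return func() {}
	}
	secondaries := []string{"hel1-2", "nbg1-1"}
	fmt.Println("started:", ws.spawn(secondaries, start))
	fmt.Println("started:", ws.spawn(secondaries, start))
}
```

Because the second call starts nothing, a refresh endpoint can invoke it unconditionally after a restart.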

## Validation

Caught the bug via per-region edge audit on prov #75 (NOT aggregate
counts — per `feedback_validate_full_dod_before_declaring_pass.md`).
Pre-fix: fsn1=0 / hel1-2=0 / nbg1-1=0 intra-region edges. Post-fix
target: fsn1=71 / hel1-2=71 / nbg1-1=71.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:48:46 +04:00
github-actions[bot]
b4c96a6d0d deploy: update catalyst images to df1dfed 2026-05-14 06:30:40 +00:00
e3mrah
df1dfed707
fix(canvas): opaque bubbles + explicit wires-below layering (#1469)
Founder rule on prov #75 review: "make sure the bubbles are no more
transparent and wires are always below the bubbles".

Two fixes:

1. **Opaque bubbles always**. Previously `groupOpacity = isDimmed ? 0.35 : 1`
   dropped the entire group's opacity to 35% when another job was open
   and this node wasn't on the focused path — making the bubble fill
   see-through and the edges behind visible THROUGH the bubble. Replaced
   with a CSS `filter: grayscale + brightness` treatment that desaturates
   the dimmed node without making it transparent.

2. **Explicit edges-then-nodes paint layers**. Wrapped the edges loop in
   `<g className="flow-edges-layer" data-layer="edges">` and the nodes
   loop in `<g className="flow-nodes-layer" data-layer="nodes">`. SVG
   paint order already produced the correct ordering via JSX source
   order, but a future code change inserting another element between the
   two could quietly break it; the explicit wrappers make the contract
   load-bearing and inspectable.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:28:40 +04:00
github-actions[bot]
331e6b2834 deploy: update catalyst images to b4c2f54 2026-05-14 06:12:28 +00:00
e3mrah
b4c2f54fa2
fix(canvas): don't region-prefix PRIMARY install deps (prov #74) (#1468)
Regression caught immediately after PR #1467 by founder on prov #74
(be70efe343e58b5a). My validation declared "all 5 issues passed"
based on aggregate 292 edges + 5 sampled hel1-2 deps, missing that
PRIMARY fsn1 had 0 intra-region edges + 71 phantom cross-region edges.

## Root cause

PR #1467 wired primary install jobs into a primary region sub-group
(jobRegion = dep.Request.Region) for symmetric multi-region rendering.
`regionalise()` triggered on `jobRegion != ""` — over-applying the
`fsn1:` prefix to PRIMARY's bare-named DependsOn entries:

  install-cilium → install-fsn1:cilium (PHANTOM — no such node exists)

PRIMARY install Jobs have BARE JobNames in the store
("install-cilium"); only SECONDARY install Jobs have region-prefixed
JobNames ("install-hel1-2:cilium"). Region-prefixing primary deps
produces a JobID that matches no node, so the edge is dropped or
points at nothing.

A second related bug: Layer-1 heuristic
`!strings.Contains(dep, ":")` was used to detect bare-jobName form,
but with the new `:` separator a region-prefixed JobName
("install-hel1-2:cilium") now contains a colon — so the heuristic
mis-classified it as "already a full JobID" and emitted FromID
without the deploymentID prefix. Phantom edge.

## Fixes

1. `isSecondaryRegionJob := strings.IndexByte(j.AppID, ':') > 0`
   replaces `jobRegion != ""` as the regionalise() gate. Primary
   jobs have no `:` in AppID → no prefix injection.

2. `fullJobIDPrefix := deploymentID + ":"` replaces the
   `strings.Contains(dep, ":")` heuristic. Only deps that ALREADY
   carry the deploymentID prefix are passed through verbatim; bare
   JobNames (with or without region prefix) get the JobID() wrap.
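
The two gates can be sketched together. `depJobID` is a hypothetical helper, not the composer's actual function; the `<deploymentID>:<jobName>` wrap and the `install-<region>:<chart>` naming follow the examples in this message, while leaving an already region-prefixed dep untouched is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isSecondaryRegionJob reports whether an AppID carries a region prefix,
// e.g. "hel1-2:cilium". Primary AppIDs such as "cilium" have no ':'.
func isSecondaryRegionJob(appID string) bool {
	return strings.IndexByte(appID, ':') > 0
}

// depJobID resolves one DependsOn entry of the job identified by appID
// to a full job ID (hypothetical helper for illustration).
func depJobID(deploymentID, appID, dep string) string {
	fullJobIDPrefix := deploymentID + ":"
	if strings.HasPrefix(dep, fullJobIDPrefix) {
		return dep // already a full job ID, pass through verbatim
	}
	if isSecondaryRegionJob(appID) && strings.HasPrefix(dep, "install-") {
		chart := strings.TrimPrefix(dep, "install-")
		// Assumption: a dep that already names a region is left alone.
		if !strings.Contains(chart, ":") {
			region, _, _ := strings.Cut(appID, ":")
			dep = "install-" + region + ":" + chart
		}
	}
	return fullJobIDPrefix + dep
}

func main() {
	fmt.Println(depJobID("dep1", "cert-manager", "install-cilium"))
	fmt.Println(depJobID("dep1", "hel1-2:cert-manager", "install-cilium"))
}
```

Keying the pass-through on the deployment ID prefix, rather than on the mere presence of a colon, is what keeps region-prefixed job names from being mistaken for full job IDs.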

## Lesson learned

Saved `feedback_validate_full_dod_before_declaring_pass.md` —
aggregate metrics and sample checks are NOT validation. Every DoD
bullet must run an explicit per-tier pass/fail check before
declaring resolved.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:28 +04:00
github-actions[bot]
2f5b1cd0ee deploy: update catalyst images to 4814c68 2026-05-14 05:55:28 +00:00
e3mrah
4814c6849b
fix(canvas): wire deps + phase groups + URL-safe separator (prov #73) (#1467)
Founder caught 5 canvas defects on prov #73 (8cd1ff1a80430dc5):

1. depth=1 shows 2 bubbles (provisioner + bootstrap-kit): confirmed
   correct architecture per composer.
2. Expanding bootstrap-kit shows 3 region sub-groups: confirmed.
3. 🐛 All 135 install-* nodes had ZERO inter-HR dep edges. Snapshot
   showed only 5 finish-to-start rels (tofu chain + bootstrap-kit
   sequence). install-cert-manager → install-cilium etc. all missing.
4. 🐛 Canvas only emitted 2 phase groups (provisioner + bootstrap-kit).
   Missing cutover/handover/apps despite being part of the canonical
   5-phase lifecycle.
5. 🐛 /jobs/install-hel1-2/newapi returned 404 because TanStack Router
   splits "/" in the $jobId param.

## Fixes

### Fix 3a: mergeJob preserves prev.DependsOn when next is empty
   store.go:283 — `if len(next.DependsOn)==0 && len(prev.DependsOn)>0`
   keeps prior list. Without this, every OnHelmReleaseEvent (which
   hardcodes `DependsOn: []string{}` at line 508 because it doesn't
   re-look up HR.spec.dependsOn per event) CLOBBERED the seeded deps.
   Confirmed in store: 135/135 install Jobs had `dependsOn: []`
   despite SeedJobsFromInformerList running with proper deps. Founder
   reported this same flat-leaves bug 4 sessions in a row.
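
A minimal sketch of the preserve-on-empty rule follows. The `Job` struct is reduced to the fields needed here and is an assumption; only the `len(next.DependsOn)==0 && len(prev.DependsOn)>0` condition is quoted from the fix.

```go
package main

import "fmt"

// Job is a reduced, hypothetical version of the stored job row.
type Job struct {
	Name      string
	Status    string
	DependsOn []string
}

// mergeJob returns next, keeping prev's dependency list when next
// carries none. Status and other fields always come from next.
func mergeJob(prev, next Job) Job {
	if len(next.DependsOn) == 0 && len(prev.DependsOn) > 0 {
		next.DependsOn = prev.DependsOn
	}
	return next
}

func main() {
	seeded := Job{Name: "install-cert-manager", Status: "pending", DependsOn: []string{"install-cilium"}}
	event := Job{Name: "install-cert-manager", Status: "running", DependsOn: []string{}}
	merged := mergeJob(seeded, event)
	fmt.Println(merged.Status, merged.DependsOn)
}
```

One trade-off of this rule is that an event can no longer clear a dependency list by sending an empty one.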

### Fix 3b: secondary watchers get region-aware seeder hook
   New `attachSecondaryBridgeSeederHook` + `snapshotsToSeedsForRegion`
   wire the seed path for secondary helmwatch.Watchers. Without this,
   secondary install-* Jobs were only ever created by per-event
   OnHelmReleaseEvent (DependsOn=[]) so the canvas dep graph was
   permanently flat under secondary region groups regardless of fix
   3a.

### Fix 3c: composer Layer-2 reads secondary watchers' HR.spec.dependsOn
   flow_snapshot_local.go now also walks dep.secondaryWatchers and
   populates hrDeps with region-prefixed keys + region-prefixed values.
   After fix 3a+3b the stored Job.DependsOn is the authoritative source
   (Layer 1) — this Layer-2 enrichment is the safety net for hot-
   shipped charts that bypass the seed path.

### Fix 4: cutover/handover/apps phase groups
   types.go — add GroupCutover/Handover/Apps constants + Display.
   flow_snapshot_local.go — add phaseForChart() classifier (currently
   maps self-sovereign-cutover → cutover), reparent install jobs to
   the correct phase sub-group, synthesise per-region sub-groups for
   each phase, emit top-level phase groups, and chain them with
   finish-to-start: provisioner → bootstrap-kit → cutover → handover
   → apps.

### Fix 5: JobName separator `/` → `:` (canonical per memory rule)
   phase1_watch.go:457 emits ev.Component = region + ":" + chart.
   jobs_backfill.go + flow_snapshot_local.go updated to detect ":"
   instead of "/". useJobLinkBuilder's encodeURIComponent already
   handles ":". /jobs/install-hel1-2:newapi now matches the TanStack
   Router $jobId route.
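
The effect of the separator change on a router that splits paths on "/" can be shown with a short sketch. `componentName` and `routeSegments` are hypothetical helpers; this does not reproduce TanStack Router, only the segment count that makes a single `$jobId` parameter match or miss.

```go
package main

import (
	"fmt"
	"strings"
)

// componentName builds the region-scoped component name with the ":"
// separator, mirroring region + ":" + chart.
func componentName(region, chart string) string {
	return region + ":" + chart
}

// routeSegments splits a URL path the way a "/"-delimited router would.
func routeSegments(path string) []string {
	return strings.Split(strings.Trim(path, "/"), "/")
}

func main() {
	oldJob := "install-hel1-2/newapi"
	newJob := "install-" + componentName("hel1-2", "newapi")
	// A route of the form /jobs/$jobId expects exactly two segments.
	fmt.Println(len(routeSegments("/jobs/"+oldJob)), routeSegments("/jobs/"+oldJob))
	fmt.Println(len(routeSegments("/jobs/"+newJob)), routeSegments("/jobs/"+newJob))
}
```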

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:53:23 +04:00
github-actions[bot]
f5929e6114 deploy: update catalyst images to 2626d40 2026-05-14 04:27:53 +00:00
e3mrah
2626d40117
chore(catalyst-chart): bump 1.4.141 → 1.4.142 — propagate prov #72 fixes (#1466)
PR #1465 added `catalyst` + `newapi` to default-deny allowlist and
shipped `allow-kube-apiserver` CNP for qa-omantel, but the chart
version wasn't bumped so HRs across active provisions kept resolving
the OLD 1.4.141 artifact (with the broken allowlist). Bumping to
1.4.142 forces Flux on every Sovereign to upgrade and pick up the fix.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:25:55 +04:00
github-actions[bot]
edf8e6fd18 deploy: update catalyst images to c267ab5 2026-05-14 04:20:59 +00:00
e3mrah
c267ab5338
fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72) (#1465)
* fix(flow_snapshot): region-scope dep edges (no cross-region wiring)

Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-init): wait for private NIC before k3s install (prov #71)

Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server
create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks
BEFORE the NIC is ready, renders netplan with only eth0, and the
private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN.

Effect on secondary CPs: k3s server starts with
  --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2
and fatals on
  "listen tcp 10.0.11.2:2380: bind: cannot assign requested address"
then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service
restart counter reached 5394, kubeconfig never PUT back to mothership,
canvas showed secondary region as a permanent black hole. Diagnosed via
Hetzner rescue-mode SSH on 2026-05-14. The primary CP only works because
the fsn1 zone happens to attach the NIC faster.

Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for
the expected private IP (control plane) or a route to it (worker). If
the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true
and `netplan apply`. Bail loudly if the IP/route never appears — failures
surface in cloud-init.log instead of disguising as a slow boot.

Symmetric fix in worker template covers autoscaler-spawned secondary
workers when worker_count > 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72)

The qa-fixtures chart's `default-deny` CiliumClusterwideNetworkPolicy
excluded `catalyst-system` from its NotIn list but FORGOT `catalyst`
(where bp-self-sovereign-cutover's Jobs live: auto-trigger,
gitea-mirror, harbor-projects, registry-pivot) and `newapi` (where
bp-newapi's Application pods live).

Effect on prov #72:
- bp-self-sovereign-cutover-auto-trigger Job stuck 20m+ on HTTP 000000
  curling http://catalyst-api.catalyst-system.svc → DNS resolution + TCP
  egress both denied by default-deny. Cutover never fires → handover
  blocked → bp-catalyst-platform's --wait never completes.
- newapi-bp-newapi pod gets `secret newapi-oidc not found`, and its
  inability to reach the apiserver compounds the issue.
- qa-omantel cnpg cluster-primary/replica stuck "Setting up primary"
  for 18m because initdb fails with `dial tcp 10.43.0.1:443 i/o timeout`:
  the ClusterIP-rewritten kube-apiserver address has no allow-egress.

Fixes:
1. Add `catalyst` + `newapi` to $excludedNamespaces — they're first-party
   blueprint namespaces analogous to catalyst-system.
2. Add `allow-kube-apiserver` CNP in qa-omantel using Cilium's canonical
   `toEntities: [kube-apiserver]` directive so cnpg initdb can reach the
   apiserver regardless of whether traffic resolves to ClusterIP, node
   IP, or Service VIP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:18:54 +04:00
github-actions[bot]
5f2298c550 deploy: update catalyst images to a75463f 2026-05-14 03:42:19 +00:00
github-actions[bot]
af3a1e6375 deploy: update catalyst images to 410a3db 2026-05-13 18:05:18 +00:00
e3mrah
410a3dbd33
fix(flow_snapshot): region-scope dep edges (no cross-region wiring) (#1461)
Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:03:06 +04:00
github-actions[bot]
3c38565951 deploy: update catalyst images to 4a14bbf 2026-05-13 16:34:30 +00:00
e3mrah
4a14bbf328
fix(flow_snapshot): symmetric region groups — primary gets its own too (#1460)
Founder caught on prov #65 (6e2fd14bb8b6ed4d, 2026-05-13): canvas shows
ASYMMETRIC structure — primary's 45 install jobs render as BARE LEAVES
directly under bootstrap-kit, while secondary regions get a proper
region sub-group. Result: M×N fan-out from provision-hetzner cascades
onto every primary leaf because there's no primary region group to
absorb the elided-group edge.

PR #1454 introduced region derivation from JobName's `/` separator
(secondary watchers emit `install-<region>/<chart>`). Primary's bridge
emits bare `install-<chart>` names — no `/`, no region derived, no
group synthesized.

Fix: derive primary region from `dep.Request.Region` and apply it to
every install job with no `/` in AppID. The synth-region-group loop
below already creates one group per discovered region, so primary
automatically gets its own `<deploymentId>:<primaryRegion>:bootstrap-kit`
bubble containing all 45 primary installs.

End state: 3 symmetric region sub-groups under bootstrap-kit
(fsn1 + nbg1-1 + hel1-2 for 3-region prov), each with exactly 45
install-* children, region-bounded temporal-endpoint cascade prevents
M×N fan-out at depth=all.
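
The symmetric grouping can be sketched as below, using the `/` separator that was current at the time of this commit. `regionOf` and `groupByRegion` are hypothetical helpers; the group ID shape `<deploymentId>:<region>:bootstrap-kit` and the fallback to the primary region for AppIDs without `/` are taken from the description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// regionOf returns the region encoded before "/" in an AppID, or the
// primary region when the AppID is bare.
func regionOf(appID, primaryRegion string) string {
	if region, _, ok := strings.Cut(appID, "/"); ok {
		return region
	}
	return primaryRegion
}

// groupByRegion maps each region group ID to the AppIDs it contains.
func groupByRegion(deploymentID, primaryRegion string, appIDs []string) map[string][]string {
	groups := make(map[string][]string)
	for _, id := range appIDs {
		gid := deploymentID + ":" + regionOf(id, primaryRegion) + ":bootstrap-kit"
		groups[gid] = append(groups[gid], id)
	}
	return groups
}

func main() {
	groups := groupByRegion("dep1", "fsn1", []string{"cilium", "hel1-2/cilium", "nbg1-1/cilium"})
	ids := make([]string, 0, len(groups))
	for id := range groups {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fmt.Println(id, groups[id])
	}
}
```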

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:31:20 +04:00
github-actions[bot]
cd5ace8dcb deploy: update catalyst images to 32e0b40 2026-05-13 15:42:13 +00:00
github-actions[bot]
55edb953d5 deploy: update catalyst images to 44913d8 2026-05-13 14:40:02 +00:00
github-actions[bot]
b6e6470ccf deploy: update catalyst images to 5f4f9f2 2026-05-13 14:01:04 +00:00
e3mrah
6fac1481d3
fix(catalyst-api): bump memory limit 1Gi → 4Gi for multi-region snapshot load (#1456)
prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during
phase-1 watch on a 3-region Sovereign. The in-memory state has grown
substantially since the 1Gi limit was set:

- 1 primary helmwatch.Watcher (45 HRs + informer cache)
- N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each
  with its own informer cache)
- jobs.Store backed by on-disk + in-memory tree
- per-/snapshot poll: composes per-region region groups across all
  Job rows + cross-references hrDeps from the live primary watcher

Combined steady-state exceeds 1Gi on cpx32-equivalent loads. Bumped
limits to 4Gi (request 512Mi up from 128Mi). The mothership node has
8GB+ resident, no other tight constraint. Future fix: persist region
in Job rows so secondary watchers don't need to be retained post
phase-1 (orthogonal cleanup).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:20:00 +04:00
github-actions[bot]
2c6374b200 deploy: update catalyst images to 8518bb1 2026-05-13 12:48:59 +00:00
e3mrah
8518bb1f50
fix(flow_snapshot): drop duplicate live-watcher multi-region block (#1455)
* fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix)

Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a
running secondary-region install-* row on /sovereign/provision/<id>/jobs
landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover
and returned "404 page not found".

Root cause: useJobLinkBuilder was passing the FULL canvas JobID form
through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping
the "<deploymentId>:" prefix. The canvas emits ids like
"<deploymentId>:install-X" (single-region) or
"<deploymentId>:<region>:install-X" (multi-region, see
flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName —
exact-match URL lookup of the prefix-bearing form misses every time.

FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the
first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs
row click and a canvas drill-down resolve to the SAME backend endpoint.

The existing JobsTable row-link test uses a job.id with no `:` prefix,
so the strip is a no-op for that fixture and the `/jobs/job-install-cilium`
assertion still holds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow_snapshot_local): derive region from persisted JobName, synth region groups

Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region
canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-*
leaves as direct children of bootstrap-kit (no region sub-groups visible),
and the provision-hetzner→bootstrap-kit edge fans M×N across all 135.

Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits
events with `ev.Component = region + "/" + componentName`. The jobs
bridge persists them with `JobName=install-<region>/<chart>` and
`AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no
region awareness). After phase 1 terminates the deferred stopSecondaries()
clears `dep.secondaryWatchers`, so the multi-region snapshot block
(line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op.
flowSnapshotFromJobs then emits all 135 install Jobs flat under
bootstrap-kit, no Region field set, no region group bubbles, and
flowLayoutOrganic.ts's temporal-endpoint cascade fans the
provisioner→bootstrap-kit edge onto all 135 because there's no
intermediate region group to absorb it.

Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical
multi-region prefix marker), derive the region key, set
FlowNode.Region, and re-parent to a synthesised
"<deploymentId>:<region>:bootstrap-kit" group. After the loop,
synthesise one bootstrap-kit sub-group node per discovered region
with a `contains` edge to the parent bootstrap-kit. The resulting
shape:

  bootstrap-kit
   ├── 45 primary install-* (legacy parent, no region)
   ├── <region-A>:bootstrap-kit ── 45 install-*  (region tagged)
   └── <region-B>:bootstrap-kit ── 45 install-*  (region tagged)

This persists ACROSS phase-1 termination because the source of truth
is jobs.Store (durable), not dep.secondaryWatchers (transient).

The multi-region block (line 408+) still runs WHEN secondary watchers
are alive (during phase 1) — it emits ADDITIONAL FlowNodes with
"<deploymentId>:<region>:install-X" IDs distinct from the persisted
"<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't
collide. Post-phase-1 the watchers clear and only the persisted-Job
path remains, but now WITH region structure preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow_snapshot): remove duplicate live-watcher multi-region block

PR #1454 added region-group synthesis from persisted Job rows. The old
secondaryWatchers-based block at line 442+ emitted nodes with the SAME
region-group IDs AND child nodes, so during phase 1 (when both paths
are live) the snapshot rendered with 90 children per region group
instead of 45 — visible on prov #61 (2e197a934a0e0461):

  bootstrap-kit: 49 children
  hel1-2:bootstrap-kit: 90 children  (should be 45)
  nbg1-1:bootstrap-kit: 90 children  (should be 45)

Plus the region groups appeared twice in the node list.

Root cause: the per-Job loop (PR #1454) and the legacy block both write
to the same region-group IDs without deduping. The per-Job path covers
the persisted-Job state (durable across phase-1 termination), so the
live-watcher path is redundant.

Fix: delete the legacy block. The earlier
secondaryWatchers-snapshot-into-map work (lines 182-205) is kept
because that path also reads dep.liveWatcher (primary) for the hrDeps
lookup the per-Job loop uses for primary-region dep edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:47:00 +04:00
github-actions[bot]
ed4f66438f deploy: update catalyst images to d9d7fa2 2026-05-13 12:26:59 +00:00
e3mrah
d9d7fa2baa
fix(flow_snapshot): derive region from persisted JobName, synth region groups (#1454)
* fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix)

Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a
running secondary-region install-* row on /sovereign/provision/<id>/jobs
landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover
and returned "404 page not found".

Root cause: useJobLinkBuilder was passing the FULL canvas JobID form
through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping
the "<deploymentId>:" prefix. The canvas emits ids like
"<deploymentId>:install-X" (single-region) or
"<deploymentId>:<region>:install-X" (multi-region, see
flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName —
exact-match URL lookup of the prefix-bearing form misses every time.

FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the
first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs
row click and a canvas drill-down resolve to the SAME backend endpoint.

The existing JobsTable row-link test uses a job.id with no `:` prefix,
so the strip is a no-op for that fixture and the `/jobs/job-install-cilium`
assertion still holds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(flow_snapshot_local): derive region from persisted JobName, synth region groups

Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region
canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-*
leaves as direct children of bootstrap-kit (no region sub-groups visible),
and the provision-hetzner→bootstrap-kit edge fans M×N across all 135.

Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits
events with `ev.Component = region + "/" + componentName`. The jobs
bridge persists them with `JobName=install-<region>/<chart>` and
`AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no
region awareness). After phase 1 terminates the deferred stopSecondaries()
clears `dep.secondaryWatchers`, so the multi-region snapshot block
(line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op.
flowSnapshotFromJobs then emits all 135 install Jobs flat under
bootstrap-kit, no Region field set, no region group bubbles, and
flowLayoutOrganic.ts's temporal-endpoint cascade fans the
provisioner→bootstrap-kit edge onto all 135 because there's no
intermediate region group to absorb it.

Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical
multi-region prefix marker), derive the region key, set
FlowNode.Region, and re-parent to a synthesised
"<deploymentId>:<region>:bootstrap-kit" group. After the loop,
synthesise one bootstrap-kit sub-group node per discovered region
with a `contains` edge to the parent bootstrap-kit. The resulting
shape:

  bootstrap-kit
   ├── 45 primary install-* (legacy parent, no region)
   ├── <region-A>:bootstrap-kit ── 45 install-*  (region tagged)
   └── <region-B>:bootstrap-kit ── 45 install-*  (region tagged)

This persists ACROSS phase-1 termination because the source of truth
is jobs.Store (durable), not dep.secondaryWatchers (transient).

The multi-region block (line 408+) still runs WHEN secondary watchers
are alive (during phase 1) — it emits ADDITIONAL FlowNodes with
"<deploymentId>:<region>:install-X" IDs distinct from the persisted
"<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't
collide. Post-phase-1 the watchers clear and only the persisted-Job
path remains, but now WITH region structure preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:24:20 +04:00
github-actions[bot]
6f50bc0a4a deploy: update catalyst images to 3a08c23 2026-05-13 12:05:56 +00:00
e3mrah
3a08c23ae4
fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) (#1453)
Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a
running secondary-region install-* row on /sovereign/provision/<id>/jobs
landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover
and returned "404 page not found".

Root cause: useJobLinkBuilder was passing the FULL canvas JobID form
through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping
the "<deploymentId>:" prefix. The canvas emits ids like
"<deploymentId>:install-X" (single-region) or
"<deploymentId>:<region>:install-X" (multi-region, see
flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName —
exact-match URL lookup of the prefix-bearing form misses every time.

FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the
first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs
row click and a canvas drill-down resolve to the SAME backend endpoint.

The existing JobsTable row-link test uses a job.id with no `:` prefix,
so the strip is a no-op for that fixture and the `/jobs/job-install-cilium`
assertion still holds.
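
The strip-then-encode order described above can be sketched as
follows. jobRowHref is a hypothetical stand-in for the real
useJobLinkBuilder hook, and the /provision/<id>/jobs/ path shape is
taken from the URLs quoted in this message:

```typescript
// Drop everything up to and including the first ":"; ids without ":" pass through.
function stripDeploymentPrefix(jobId: string): string {
  const colon = jobId.indexOf(":");
  return colon === -1 ? jobId : jobId.slice(colon + 1);
}

// Hypothetical helper, not the actual hook in JobsTable.tsx.
function jobRowHref(deploymentId: string, jobId: string): string {
  const bare = stripDeploymentPrefix(jobId);
  // Encode unsafe path characters, then restore ":" for readability.
  const segment = encodeURIComponent(bare).replace(/%3A/g, ":");
  return "/provision/" + deploymentId + "/jobs/" + segment;
}
```

For an id with no `:` the strip is a no-op, which is why the existing
fixture keeps passing.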

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:03:47 +04:00
github-actions[bot]
16f41bef56 deploy: update catalyst images to 68372d7 2026-05-12 16:13:41 +00:00
github-actions[bot]
1c6e82b83b deploy: update catalyst images to be47815 2026-05-12 16:03:56 +00:00
github-actions[bot]
034da82c00 deploy: update catalyst images to cdcc50a 2026-05-12 15:58:30 +00:00
github-actions[bot]
fc71800a52 deploy: update catalyst images to 19a847e 2026-05-12 12:30:55 +00:00
github-actions[bot]
bc0f56eb4e deploy: update catalyst images to 4923938 2026-05-12 12:15:30 +00:00
e3mrah
4923938c2b
feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444)
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.

End-to-end change across infra + handler:

1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
   appends `?region=<kubeconfig_postback_region>` when the var is set.
   main.tf templatefile call passes empty for primary CP, `each.key`
   (e.g. "nbg1-1", "hel1-2") for each secondary region.

2) PutKubeconfig handler: reads ?region= query param. Empty → primary
   path (unchanged: stores at <dir>/<id>.yaml, sets
   Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
   → secondary path: stores at <dir>/<id>-<region>.yaml, populates
   Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
   per-region (the same bearer secures every CP's PUT — secondaries
   reuse it for their own slot). NO Phase-1 watch re-launch from a
   secondary PUT.

3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
   primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
   spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
   the Watcher on Deployment.secondaryWatchers[region]. Per-region
   watchers emit ordinary helmwatch events with region-prefixed
   Component names so the wizard's per-component view doesn't collide
   primary vs secondary bp-cilium events. They do NOT contribute to
   markPhase1Done — outcome remains the primary's classification.

4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
   bubbles + install-* nodes from each secondary watcher's
   SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
   FlowNode.region set so the canvas can colour-group. Intra-region
   finish-to-start deps emitted from cs.DependsOn — same-region only,
   never cross-region (per NAMING-CONVENTION §1.3 independent fault
   domains, no stretched cluster).

5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
   kubeconfig file on Sovereign wipe.

Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.

Caught after operator pointed out that 3-region prov #50 was showing
only 52 install-* nodes (all from fsn1) on the canvas — the
architectural gap.
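
The file-naming convention from steps 2 and 5 can be sketched as
below. This is an illustration in TypeScript, not the Go handler;
both function names are hypothetical:

```typescript
// Empty region selects the primary slot, anything else a secondary slot.
function kubeconfigPath(dir: string, deploymentId: string, region: string): string {
  const suffix = region === "" ? "" : "-" + region;
  return dir + "/" + deploymentId + suffix + ".yaml";
}

// True for a secondary file "<id>-<region>.yaml" of the given deployment,
// i.e. the set that the periodic scan and the wipe both iterate over.
function isSecondaryKubeconfig(fileName: string, deploymentId: string): boolean {
  const prefix = deploymentId + "-";
  return fileName.indexOf(prefix) === 0
    && fileName.length > prefix.length + ".yaml".length
    && fileName.slice(-5) === ".yaml";
}
```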

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:12:38 +04:00
github-actions[bot]
effd75e4a7 deploy: update catalyst images to c5d891a 2026-05-12 11:26:54 +00:00
github-actions[bot]
5fb99be8e8 deploy: update catalyst images to bd5d439 2026-05-12 10:00:04 +00:00
e3mrah
bd5d4393ec
fix(canvas): cross-group edges cascade to leaf temporal endpoints (#1442)
Operator-reported design fix completing #1437/#1440 — the cross-phase
ordering between provisioner and bootstrap-kit groups was either an
M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at
leaf level (post-#1440 with the both-elided skip). Neither was right.

Real design: when a group→group dependency edge is lifted onto the
leaf graph because one or both endpoints elided, cascade ONLY to the
temporal endpoint pair:

  upstream_terminals → downstream_initials

Where:
  - upstream_terminals = visible descendants of the upstream group
    that nothing else in the group depends on (sinks of intra-group
    DAG). For the tofu chain this collapses to just cluster-bootstrap.
  - downstream_initials = visible descendants of the downstream group
    that depend on nothing else in the group (sources of intra-group
    DAG). For bootstrap-kit this is install-cilium / install-flux /
    install-gateway-api / etc — the install-* roots.

Net result for provisioner→bootstrap-kit at depth=all: a small fan of
edges from cluster-bootstrap to the bp-* roots — the real temporal
gate, no spurious phantom edges, no missing cross-phase chain.

Two call sites updated:
  - Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now
    cascades to groupTerminals(G) instead of fanOutVisibleChildren(G).
  - Outbound: elidedGroup G with G.dependsOn = [D] cascades to
    groupInitials(G) on the receive side; D-side cascades to
    groupTerminals(D) when D is also elided, or uses D directly when
    D is a visible job.
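
A minimal sketch of the terminal/initial computation, assuming a
simplified edge shape where `to` depends on `from`. The function
names mirror the ones above, but the bodies are an illustration, not
the code in flowLayoutOrganic.ts:

```typescript
interface Dep { from: string; to: string } // "to" depends on "from"

// Keep only edges whose two endpoints are both members of the group.
function intraGroup(members: string[], deps: Dep[]): Dep[] {
  return deps.filter(d => members.indexOf(d.from) !== -1 && members.indexOf(d.to) !== -1);
}

// Sinks of the intra-group DAG: nothing else in the group depends on them.
function groupTerminals(members: string[], deps: Dep[]): string[] {
  const upstreamOfSomething = intraGroup(members, deps).map(d => d.from);
  return members.filter(id => upstreamOfSomething.indexOf(id) === -1);
}

// Sources of the intra-group DAG: they depend on nothing else in the group.
function groupInitials(members: string[], deps: Dep[]): string[] {
  const dependsOnSomething = intraGroup(members, deps).map(d => d.to);
  return members.filter(id => dependsOnSomething.indexOf(id) === -1);
}

// One group→group edge becomes upstream_terminals × downstream_initials.
function cascadeGroupEdge(upstream: string[], downstream: string[], deps: Dep[]): Dep[] {
  const edges: Dep[] = [];
  groupTerminals(upstream, deps).forEach(from =>
    groupInitials(downstream, deps).forEach(to => edges.push({ from, to })));
  return edges;
}
```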

11/11 flowLayoutOrganic.test.ts pass.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:47:42 +04:00
github-actions[bot]
064fc3073f deploy: update catalyst images to 0fe0cac 2026-05-12 09:32:31 +00:00
e3mrah
0fe0cacc15
fix(canvas): right-click menu actions actually work + clearer labels (#1441)
Operator reported "non of the right click functionalites working
other than the open in new tab". Root cause: the previous handler
only mutated urlFoldedSet, which had no visible effect when the
clicked group was folded by the depth default (same class of bug
toggleFold had before #1439). The menu items also had confusing
labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative).

Rewrite to use the same compose-state pattern toggleFold uses:

  - "Show only this group" — switch to depth=all + fold every OTHER
    group. Only the clicked group's subtree expands; sibling groups
    stay collapsed.
  - "Hide this group" — switch to depth=default + add clicked group
    to urlFoldedSet. Group renders as a folded bubble; its subtree
    hidden.
  - "Expand subtree" — switch to depth=all + remove this group and
    all its descendant groups from urlFoldedSet. Fully unfolded
    subtree.
  - "Open in new tab" — unchanged (was working since #1435).

Dropped the misleading "Fold to level N" item (was just stepDepth(-1)).
The depth chip ◀▶ at the top-right is the canonical global depth
control.
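
The three rewritten actions can be read as pure state transitions.
The sketch below assumes a simplified FoldState (a depth mode plus
the list of explicitly folded groups); the real handler composes URL
state and is not shown here:

```typescript
type Depth = "all" | "default";
interface FoldState { depth: Depth; folded: string[] }

// "Show only this group": depth=all, every OTHER group folded.
function showOnly(allGroups: string[], clicked: string): FoldState {
  return { depth: "all", folded: allGroups.filter(g => g !== clicked) };
}

// "Hide this group": depth=default, clicked group added to the folded set.
function hideGroup(state: FoldState, clicked: string): FoldState {
  const folded = state.folded.indexOf(clicked) === -1
    ? state.folded.concat([clicked])
    : state.folded;
  return { depth: "default", folded };
}

// "Expand subtree": depth=all, clicked group and its descendants unfolded.
function expandSubtree(state: FoldState, clicked: string, descendants: string[]): FoldState {
  const open = [clicked].concat(descendants);
  return { depth: "all", folded: state.folded.filter(g => open.indexOf(g) === -1) };
}
```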

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:30:31 +04:00
github-actions[bot]
c80d43c6d8 deploy: update catalyst images to 2c1f767 2026-05-12 09:27:06 +00:00
e3mrah
2c1f767b52
fix(canvas): back-to-jobs chroot-scoped + group→group edge w/o M×N lift (#1440)
Three operator-reported issues from the same dblclick session:

1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx
   used absolute /jobs which on contabo resolves to /sovereign/jobs —
   the mother's flat /jobs view, NOT the chroot-scoped
   /sovereign/provision/<id>/jobs. Operator reported "chroot principle
   violation". Fix: chroot-aware /provision/<deploymentId>/jobs when
   deploymentId is present.

2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no
   edge between them — temporal ordering invisible. Earlier #1437
   dropped the group→group edge entirely because the FE layout's
   lift-on-elide cascaded it into M×N phantom edges at ?depth=all.
   Re-emit the edge AND fix the lift logic in
   flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH
   endpoints of the elided-group dep are elided. At ?depth=1 the
   edge renders between the two folded groups as intended; at
   ?depth=all both groups elide and the lift is suppressed so the
   spurious cascade doesn't reappear. The actual install-* deps are
   already visible via each leaf's own dependsOn — skipping the lift
   costs no information.

3) (Documented separately) Right-click menu only attaches to GROUP
   nodes per design (FlowCanvasOrganic line 1277). When all groups
   are elided (?depth=all auto-folds groups out), the menu is
   unreachable. The dblclick-on-group fold fix (#1439) makes group
   bubbles reachable at ?depth=1 where right-click works.

Caught via Playwright after operator reported all three.
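
The lift decision from item 2 reduces to a three-way choice per
group→group dependency. A hedged sketch, with a hypothetical function
name and a plain list standing in for the layout's elided set:

```typescript
interface GroupDep { from: string; to: string } // "to" depends on "from"

// How one group→group dependency is drawn for a given set of elided groups.
function renderGroupDep(dep: GroupDep, elided: string[]): "direct" | "lift" | "skip" {
  const fromElided = elided.indexOf(dep.from) !== -1;
  const toElided = elided.indexOf(dep.to) !== -1;
  if (fromElided && toElided) return "skip";  // ?depth=all: no M×N cascade
  if (fromElided || toElided) return "lift";  // one side elided: lift onto children
  return "direct";                            // ?depth=1: edge between the two bubbles
}
```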

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:24:50 +04:00
github-actions[bot]
fe337d571c deploy: update catalyst images to bb1bff2 2026-05-12 08:42:18 +00:00
e3mrah
bb1bff245a
fix(canvas): toggleFold handles depth-default-folded nodes (#1439)
toggleFold previously only mutated urlFoldedSet, which had no effect
when the clicked node was folded BY THE DEPTH DEFAULT (not by an
explicit URL override). Result: at ?depth=1 where both groups are
folded by depth-default, double-clicking bootstrap-kit (after #1438's
dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet
delete didn't change the composed foldedSet, the canvas didn't budge.

New behaviour:
  - If clicked node is folded by ANY source: switch to depth=all AND
    explicitly fold every OTHER previously-folded group. Only the
    clicked group ends up visibly unfolded — exactly the operator-
    requested "expand only the respective parent" UX.
  - If clicked node is unfolded: add to urlFoldedSet to fold it
    without changing depth.

Caught via Playwright after #1438 landed and dblclick still didn't
unfold the clicked group at ?depth=1.
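
A sketch of the new behaviour as a pure function. FoldView and the
composedFolded parameter (the set folded by ANY source, depth default
included) are assumptions made for illustration:

```typescript
interface FoldView { depth: string; urlFolded: string[] }

function toggleFold(view: FoldView, composedFolded: string[], clicked: string): FoldView {
  if (composedFolded.indexOf(clicked) !== -1) {
    // Folded by any source: go to depth=all and keep every OTHER
    // previously-folded group explicitly folded.
    return { depth: "all", urlFolded: composedFolded.filter(g => g !== clicked) };
  }
  // Unfolded: fold it without touching depth.
  return { depth: view.depth, urlFolded: view.urlFolded.concat([clicked]) };
}
```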

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:39:58 +04:00
github-actions[bot]
24a2b13870 deploy: update catalyst images to 9da662c 2026-05-12 08:36:45 +00:00
e3mrah
9da662c6f5
fix(canvas): double-click on group toggles fold (not navigate) (#1438)
Operator reported "double-click on a parent bubble it is expanding
all the parent instead of expanding only the respective parent."
Reproduced in Playwright: at ?depth=1 only the 2 group bubbles
render folded; double-click on bootstrap-kit navigated to
/jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page
defaulted to depth=2 → groups elided → all 50 install-* + Phase-0
bubbles rendered. Exactly the "expanding all parents" symptom.

Two fixes:

1) Branch handleNodeDoubleClick: if the bubble is a group, call
   toggleFold(nodeId) in place — fold or unfold ONLY that group.
   Tree-explorer UX where a leaf double-click drills in but a group
   double-click expands/collapses.

2) For the leaf path, preserve window.location.search across the
   navigate so the destination page renders with the same depth /
   folded filter the operator had on screen. Without this, the new
   page defaults to depth=2 and the visible bubble set changes
   beneath them.

Caught via Playwright double-click simulation on bootstrap-kit at
?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles)
to .../jobs/bootstrap-kit (50 bubbles).
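
The two fixes amount to one branch plus a query-string carry-over. A
minimal sketch that returns the intended action instead of calling
the router; the type and parameter names are hypothetical:

```typescript
type DblClickAction =
  | { kind: "toggleFold"; nodeId: string }
  | { kind: "navigate"; href: string };

function onNodeDoubleClick(
  nodeId: string,
  isGroup: boolean,
  jobsBasePath: string,   // e.g. "/provision/<id>/jobs"
  currentSearch: string,  // e.g. window.location.search, "?depth=1"
): DblClickAction {
  // Group bubble: fold or unfold in place, never navigate.
  if (isGroup) return { kind: "toggleFold", nodeId };
  // Leaf bubble: drill in, keeping the depth / folded filter on screen.
  return { kind: "navigate", href: jobsBasePath + "/" + encodeURIComponent(nodeId) + currentSearch };
}
```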

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:33:59 +04:00
github-actions[bot]
41787d66c6 deploy: update catalyst images to 5e96d30 2026-05-12 08:33:55 +00:00
e3mrah
5e96d30552
fix(flow-snapshot): drop provisioner→bootstrap-kit edge — causes M×N fan-out (#1437)
flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound
deps onto EACH of its visible children, and if the dep target is
itself an elided group, fans out to THAT group's visible children
too. With both top-level groups elided at depth=all, the single
group→group finish-to-start edge I added cascades into M×N phantom
edges (each install-* gains a dep on every tofu-* + cluster-bootstrap
step). The operator-reported "install-cnpg has 5 connections from
terraform jobs" was exactly this layout-side fan-out.

Removing the group→group edge leaves Phase-0 and Phase-1 as separate
connected components on the canvas — the correct minimum-edge
rendering. Ordering between phases is implicit in the timestamps +
status flow, not in the edge graph.

Caught by Playwright-probing the canvas after operator pushback: data
side had only the 1 real direct dep (install-flux → install-cnpg)
yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:30:44 +04:00
github-actions[bot]
732949bc73 deploy: update catalyst images to f980356 2026-05-12 08:14:36 +00:00
e3mrah
f980356ce9
fix(canvas): setSearchPatch uses window.history (forward-fix CI tsc TS2322) (#1436)
PR #1435 (depth-chip basepath fix) failed CI because removing `to:`
from navigate() narrowed the search reducer's typed return to never,
producing TS2322 on the `Record<string, unknown>` cast.

Forward-fix: bypass TanStack navigate() entirely for the search-only
mutation path. Update window.location's query string via
history.replaceState (preserves pathname verbatim including basepath)
and dispatch a synthetic popstate so TanStack's useSearch picks up
the new query on next render. No TanStack path resolution → no
basepath drop → no colon re-encoding → depth-chip click stops 404ing.

Also carries over two fixes from #1435: open-new-tab (window.open of
absolute /sovereign/...) and handleNodeDoubleClick (strip + encode jobId).
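
A sketch of the search-only mutation path under stated assumptions:
HistoryLike is a stand-in for window.history, and notify stands in
for the synthetic popstate dispatch, so the logic can be exercised
outside a browser. Only the function name setSearchPatch comes from
this commit:

```typescript
type Pair = [string, string];
type SearchPatch = { [key: string]: string | undefined };
interface HistoryLike { replaceState(data: unknown, unused: string, url: string): void }

// Merge a patch into a query string; an undefined value deletes the key.
function patchSearch(search: string, patch: SearchPatch): string {
  let pairs: Pair[] = search.replace(/^\?/, "").split("&")
    .filter(p => p !== "")
    .map((p): Pair => {
      const eq = p.indexOf("=");
      return eq === -1
        ? [decodeURIComponent(p), ""]
        : [decodeURIComponent(p.slice(0, eq)), decodeURIComponent(p.slice(eq + 1))];
    });
  Object.keys(patch).forEach(key => {
    const value = patch[key];
    pairs = pairs.filter(pair => pair[0] !== key);
    if (value !== undefined) pairs.push([key, value]);
  });
  if (pairs.length === 0) return "";
  return "?" + pairs.map(pair => encodeURIComponent(pair[0]) + "=" + encodeURIComponent(pair[1])).join("&");
}

// The pathname is passed through verbatim, so basepath and ":" survive.
function setSearchPatch(history: HistoryLike, pathname: string, search: string,
                        patch: SearchPatch, notify: () => void): void {
  history.replaceState(null, "", pathname + patchSearch(search, patch));
  notify(); // in the browser: dispatch a synthetic popstate event
}
```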

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:11:26 +04:00
e3mrah
4d1ccfbd44
fix(canvas): depth-chip click drops /sovereign basepath + open-new-tab 404 (#1435)
Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface:

1) Clicking the depth chip arrows (◀ / ▶) on
   /sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser
   to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath
   was dropped AND the colon was re-encoded as %3A, both via TanStack's
   `to: '.'` path resolution. The new URL 404s at the BE because the
   colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup.
   Fix: omit `to:` entirely. TanStack treats a search-only navigate as
   a pure search-params mutation and preserves the current path verbatim
   including the basepath. The colon-prefixed jobId in the URL comes
   from older deep-links; the strip-on-click fix landed in #1431.

2) Right-click → "Open in new tab" also passed the raw nodeId
   verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror
   handleNodeDoubleClick: strip the "<deploymentId>:" prefix,
   encodeURIComponent the remainder, AND prepend /sovereign for the
   absolute-path window.open (window.open isn't routed through
   TanStack so basepath isn't auto-prepended).

Caught after operator reported "level arrows redirect to wrong URLs
and giving 404" + "right click on a parent bubble … none of the
functions are working properly."

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:02:37 +04:00
e3mrah
1d9dd99915
fix(flow-snapshot): normalise bare-name Job.DependsOn to canonical JobID form (#1434)
helmwatch.Bridge writes SOME Job.DependsOn entries as bare names
("install-flux") rather than the canonical JobID form
("<deploymentId>:install-flux") — 71 such entries observed on prov
bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied
those bare names verbatim into Relationship.fromId. The canvas
reducer matches FlowNode.id by exact string, so the bare-name fromId
became a phantom edge pointing to a non-existent node. In the
force-directed layout these phantom edges visually routed through
the nearest real bubbles, manifesting as 5-edge fan-outs from every
Phase-0 tofu job to every install-* bubble (operator-reported on
install-cnpg, but symmetric across all install-*).

Normalise every fromId to jobs.JobID(deploymentID, dep) form when
the stored value lacks a ":" separator.

Caught after operator reported "install-cnpg has 5 different
connections from terraform jobs — this is matter of a proper
chaining" — looking at the snapshot showed Job.DependsOn=[install-flux]
without the prefix.
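
The normalisation rule, sketched in TypeScript for illustration (the
actual fix is in the Go emit and uses jobs.JobID). danglingSources is
a hypothetical helper showing why the bare names turned into phantom
edges:

```typescript
// Prefix a bare dependency name; values that already contain ":" pass through.
function normaliseDep(deploymentId: string, dep: string): string {
  return dep.indexOf(":") === -1 ? deploymentId + ":" + dep : dep;
}

// Edge sources that match no node id exactly; these are the phantom edges.
function danglingSources(nodeIds: string[], fromIds: string[]): string[] {
  return fromIds.filter(id => nodeIds.indexOf(id) === -1);
}
```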

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:00:04 +04:00
github-actions[bot]
1a0333a43f deploy: update catalyst images to 93c3e81 2026-05-12 07:27:29 +00:00
e3mrah
93c3e81f0c
fix(flow-snapshot): contains edge direction — toId is parent per canon (#1433)
Per products/openova-flow/core/src/types.ts line 112:
  "contains — toId (parent) contains fromId (child)"

My emit had this inverted: I set FromID=parent, ToID=child, which
made the FE adapter (flowStreamToOrganic.ts line 134) interpret every
install-* leaf as a group containing the bootstrap-kit/provisioner
group nodes. Net result: only 2 bubbles ever rendered on the canvas
regardless of ?depth= because the hierarchy graph was upside-down.

Caught by opening the canvas in a browser via Playwright after the
operator reported "still showing only 2 bubbles, no drill-down".
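
To make the direction concrete, here is a small sketch of how a
consumer could build the hierarchy from `contains` edges. The
Relationship shape and childrenByParent are simplified assumptions,
not the adapter's actual code:

```typescript
interface Relationship { type: string; fromId: string; toId: string }

// contains: toId (parent) contains fromId (child).
function childrenByParent(rels: Relationship[]): { [parent: string]: string[] } {
  const out: { [parent: string]: string[] } = {};
  rels.filter(r => r.type === "contains").forEach(r => {
    const list = out[r.toId] || [];
    list.push(r.fromId);
    out[r.toId] = list;
  });
  return out;
}
```

With the edge inverted, each install-* leaf would appear as a key of
this map and the group node as its child, which matches the
upside-down hierarchy described above.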

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:24:30 +04:00
github-actions[bot]
9011d1b635 deploy: update catalyst images to 048a4d8 2026-05-12 06:46:54 +00:00
e3mrah
048a4d8910
fix(refresh-watch): disk-fallback when Result.KubeconfigPath is empty (#1432)
When the Pod restarts between PutKubeconfig writing the file AND the
next Result.Save() persisting the field, dep.Result.KubeconfigPath
comes back empty even though the file exists at the canonical
convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was
returning 409 watch-not-resumable in this state, which left the
mothership canvas frozen because the live watcher couldn't re-attach
to source HR.spec.dependsOn for the install-* edge derivation.

Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for
PR #1431 restarted catalyst-api Pod, the file
/var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but
RefreshWatch refused to use it because the record field was empty.

Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured
AND a file exists at <dir>/<depID>.yaml, use that path and patch the
record so subsequent /components/state + flow snapshot calls see a
populated field.
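
The fallback order can be sketched as a pure function. This is an
illustration in TypeScript rather than the Go handler; the `exists`
callback is an assumption that keeps the sketch free of filesystem
calls, and an undefined result corresponds to the 409
watch-not-resumable case:

```typescript
function resolveKubeconfigPath(
  recorded: string,        // dep.Result.KubeconfigPath, possibly empty
  kubeconfigsDir: string,  // empty when not configured
  deploymentId: string,
  exists: (path: string) => boolean,
): string | undefined {
  if (recorded !== "") return recorded;          // record is authoritative
  if (kubeconfigsDir === "") return undefined;   // nowhere to look
  const candidate = kubeconfigsDir + "/" + deploymentId + ".yaml";
  return exists(candidate) ? candidate : undefined;
}
```

The caller would then write the returned path back to the record so
later reads see a populated field.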

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:44:55 +04:00
github-actions[bot]
7e4f38ec62 deploy: update catalyst images to e3771f6 2026-05-12 06:38:32 +00:00
e3mrah
e3771f6813
fix(flow): derive HR dependsOn from live watcher + fix canvas drill-down 404 (#1431)
Two bugs the operator hit on /sovereign/provision/<id>/jobs:

1) Phase-1 install-* Jobs rendered DISCONNECTED on the canvas —
   helmwatch.Bridge doesn't persist Job.DependsOn (only the Phase-0
   tofu chain + cluster-bootstrap is wired today). Pull HR.spec.dependsOn
   from the live Watcher's informer cache via SnapshotComponents()
   (ComponentSnapshot.DependsOn already populated by extractDependsOn)
   at snapshot-time and emit finish-to-start edges from upstream
   install-<dep> to install-<self>. Also add provisioner→bootstrap-kit
   group-to-group finish-to-start so the Phase-0/Phase-1 ordering is
   visible on the canvas.

2) Clicking a canvas node → "404 page not found" because
   FlowPage.handleNodeDoubleClick passed the full
   "<deploymentId>:install-X" id verbatim. The backend Store.GetJob
   keys by bare jobName ("install-X"), so the colon-prefixed id missed
   exact-match and JobDetail returned 404. Mirror useJobLinkBuilder
   (JobsTable.tsx line 364): strip the "<deploymentId>:" prefix and
   encodeURIComponent the remainder before pushing to the router.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:36:22 +04:00
github-actions[bot]
59b6940c18 deploy: update catalyst images to 2fbab45 2026-05-12 06:08:41 +00:00
e3mrah
2fbab45b43
feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy (#1429)
* fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template

Without this env the proxy resolveFlowServerURL() falls back to
per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>) which
only exists on Sovereigns that already installed bootstrap-kit slot 56
with httproute=enabled. Every other catalyst-api deployment (mothership
contabo + Sovereigns that haven't reached cutover yet) returns 502 on
/api/v1/flows/{deploymentId}/snapshot — the live regression founder
saw at console.openova.io: "No nodes to render."

The env points at the in-cluster Service DNS for the LOCAL openova-flow-
server. Both the mothership (catalyst-system or catalyst namespace) and
each Sovereign chroot run the bp-openova-flow-server chart with a local
Service, so this URL is correct for every cluster catalyst-api runs in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy

Mothership canvas at /sovereign/provision/<id>/jobs was empty for the
first ~30 minutes of every fresh provision because the snapshot
endpoint went straight to https://openova-flow.<sovereignFQDN> which
can't serve until cilium + cert-manager + the HTTPRoute TLS cert are
all up on the chroot. The Phase-0 + Phase-1 lifecycle Jobs catalyst-api
ALREADY owns (tofu-init/plan/apply/output, flux-bootstrap,
install-bp-<chart>, ...) were invisible the whole time.

This change adds flowSnapshotFromJobs which assembles the canonical
FlowMessage envelope from h.jobsStore().ListJobs(deploymentID) — every
Job becomes a FlowNode with the legacy <deploymentId>:<jobName> id form
the canvas drill-down already expects, every Job.DependsOn becomes a
finish-to-start Relationship, every Job.ParentID becomes a contains
Relationship. HandleFlowSnapshot checks the local store first and
returns immediately when it has data; otherwise falls through to the
existing upstream proxy path.

HandleFlowStream gets the same treatment via flowStreamLocal: emit a
snapshot frame on connect AND every 3 seconds thereafter, plus a 15s
heartbeat. The OpenovaFlow consumer's reducer is idempotent on
snapshot replay so re-emitting an unchanged envelope is harmless;
in exchange the canvas reflects Job state transitions within ~3s
of when helmwatch.Bridge writes them.

No FE change required — the same /api/v1/flows/<id>/snapshot and
/stream endpoints serve the same envelope shape the chroot adapter
emits (products/openova-flow/adapter-flux/internal/types/flow.go),
named SSE events including 'snapshot' and 'heartbeat'.
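
The Job-to-envelope mapping described above, sketched in TypeScript
for illustration. The Job, FlowNode and Relationship shapes are
simplified assumptions (bare job names for dependsOn and the parent),
not the canonical envelope types:

```typescript
interface Job { jobName: string; status: string; dependsOn: string[]; parentName?: string }
interface FlowNode { id: string; label: string; status: string }
interface Relationship { type: "finish-to-start" | "contains"; fromId: string; toId: string }
interface Snapshot { nodes: FlowNode[]; relationships: Relationship[] }

function snapshotFromJobs(deploymentId: string, jobs: Job[]): Snapshot {
  // Legacy id form the canvas drill-down expects: <deploymentId>:<jobName>.
  const idOf = (name: string): string => deploymentId + ":" + name;
  const nodes: FlowNode[] = [];
  const relationships: Relationship[] = [];
  jobs.forEach(job => {
    const id = idOf(job.jobName);
    nodes.push({ id, label: job.jobName, status: job.status });
    // Each dependency becomes a finish-to-start edge from the upstream job.
    job.dependsOn.forEach(dep =>
      relationships.push({ type: "finish-to-start", fromId: idOf(dep), toId: id }));
    // The parent becomes a contains edge between the job and its parent.
    if (job.parentName !== undefined) {
      relationships.push({ type: "contains", fromId: id, toId: idOf(job.parentName) });
    }
  });
  return { nodes, relationships };
}
```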

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:06:28 +04:00
github-actions[bot]
4ceb74067f deploy: update catalyst images to 50bf7a5 2026-05-12 04:12:24 +00:00
e3mrah
50bf7a59ed
fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428)
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.

Two lock-step changes widen both bounds:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
   install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
   chart genuinely needs >15m worst case when the full SME + Catalyst
   service stack rolls cold.

2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
   DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
   now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
   watch never terminates while helm-controller still has remediation
   attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
   was already wired (issue #538 baseline) — chart template now
   declares the explicit "120m" value so the runtime knob is
   discoverable for capacity-bounded environments. Per INVIOLABLE-
   PRINCIPLES.md #4 the knob remains runtime-configurable.

New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override + field-
override tests still pass unchanged.
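
The budget arithmetic behind the two bounds can be checked with a few
lines. The helper names are hypothetical; the numbers are the ones
quoted in this message:

```typescript
const MINUTE_MS = 60 * 1000;

// Worst case: every remediation attempt runs to its full timeout.
function worstCaseRetryChainMs(hrTimeoutMs: number, attempts: number): number {
  return hrTimeoutMs * attempts;
}

// The outer watch must strictly outlast the inner retry chain.
function watchBudgetCovers(watchMs: number, hrTimeoutMs: number, attempts: number): boolean {
  return watchMs > worstCaseRetryChainMs(hrTimeoutMs, attempts);
}
```

With a 30m HR timeout and 3 attempts the chain is 90m, so a 120m
watch covers it while the previous 60m cap would not.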

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 08:10:24 +04:00
github-actions[bot]
dd095b8597 deploy: update catalyst images to b743b64 2026-05-12 02:13:30 +00:00
github-actions[bot]
d4d05f16f6 deploy: update catalyst images to 8c7d326 2026-05-12 00:38:43 +00:00
e3mrah
8c7d32616e
fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185, prov #38/#39/#41 recurrence) (#1426)
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):

  bp-catalyst-platform HR install.timeout=15m
    → Helm pre-install hook: qa-finalizer-strip Job (weight -99)
      → Pod requests 50m CPU + 64Mi memory (tiny)
        → BUT no tolerations → scheduler restricted to worker
          → worker cpx32 (8vCPU/16GB) at 99% CPU requests
            (7980m of 8000m allocated) after bootstrap-kit fan-out
            → FailedScheduling: "0/2 nodes are available: 1
              Insufficient cpu, 1 node(s) had untolerated taint
              {node-role.kubernetes.io/control-plane: true}"
            → autoscaler triggers scale-up worker 2→3 → "1 in backoff
              after failed scale-up" → still Pending → 15m timeout
              → InstallFailed → Flux uninstall+rollback → installFailures: 3
              → Flux gives up entirely

Live evidence quoted from chroot kubeconfig on prov #41:
  - bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
    message="Running 'install' action with timeout of 15m0s"`
  - HR `Released=False, reason=InstallFailed, message="Helm install
    failed for release catalyst-system/catalyst-platform with chart
    bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
    * timed out waiting for the condition"`
  - Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
    `Warning  FailedScheduling 108s default-scheduler 0/2 nodes are
    available: 1 Insufficient cpu, 1 node(s) had untolerated taint
    {node-role.kubernetes.io/control-plane: true}`
  - Worker `Allocated cpu 7980m (99%) of 8000m capacity`
  - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)

Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).

Why prior fixes didn't suffice:
  - Fix #114 introduced this hook to break a finalizer-deadlock loop
    on prov #9. Correct fix for that wedge; never anticipated worker
    saturation as a scheduling failure mode for the hook itself.
  - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
    qa-cnpg-status-seed hooks (weight 0/post-install) to regular
    release resources to break a circular DAG dep. Different hook
    surface.
  - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install
    hook (weight +10) wait budget for cold-start autoscaler. That
    hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
    never starts, the +10 hook never runs.

Recurring class: same family as Fix #114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:

  - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
    redirect for deprecated Bitnami images, 2025-08 cutover
    documented at platform/self-sovereign-cutover/chart/values.yaml:
    252) → harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4 —
    the canonical alpine-based kubectl image already used by sibling
    hook catalyst-gitea-token-mint (Fix #163). MIRROR-EVERYTHING +
    ARCHITECT-FIRST rules.

Coordinator follow-up tickets:
  - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
    (qa-cnpgpair-status-seed) still reference
    bitnamilegacy/kubectl:1.29.3, the same Bitnami-deprecation class.
    Out of scope for this Fix (not part of the recurrence cluster);
    flagged for a sweep.
  - Worker cpx32 sizing may be undersized for the bootstrap-kit fan-
    out on omantel.biz — separate sizing ticket, not blocking.

Changes:
  - products/catalyst/chart/templates/qa-fixtures/pre-install-
    finalizer-strip.yaml: add tolerations + priorityClassName;
    switch image to alpine/k8s:1.31.4. Inline doc comments explain
    the 4-layer trace and the Fix #114/#138/#184 history.
  - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
    changelog entry capturing root cause + budget arithmetic.
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
    bump HR pin 1.4.140 → 1.4.141.

Verification:
  - helm template renders cleanly (exit 0, ~6700 lines).
  - kubectl apply --dry-run=client validates the rendered Job
    manifest (job.batch/qa-finalizer-strip created (dry run)).
  - Rendered Job contains tolerations[control-plane Exists NoSchedule],
    priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 04:36:35 +04:00
github-actions[bot]
5fdd33b7c0 deploy: update catalyst images to 0ba87bb 2026-05-11 18:32:08 +00:00
e3mrah
0ba87bb8da
fix(JobsPage): use FlowNode.id in row anchor href (region prefix) (#1414)
TC-035 (iter-2, 2026-05-11): OpenovaFlow rows merged into JobsPage
(PR #1413) lost their region-prefixed identity in the URL. The link
builder sliced the "<prefix>:" segment off every id with a colon —
intended to strip the legacy "<deploymentId>:install-keycloak" form,
but it also stripped "contabo:bp-openova-flow-server" → bare
"bp-openova-flow-server" in the href. The matrix asserts the
verbatim form "/jobs/contabo:bp-openova-flow-server" must appear in
the rendered DOM.

Fix: stop slicing. `encodeURIComponent` still escapes unsafe path
chars (`/` for live K8s job ids like "job/syft-grype/..."), then we
restore `:` because RFC 3986 permits it as a path-segment `pchar`.
FlowPage canvas navigation (PR #1411) and JobDetail flow-fallback
(PR #1412) already pass on the colon-present form, so this round-
trips end-to-end. Legacy "bp-cilium" / "cluster-bootstrap" hrefs are
unchanged (no `:` to encode). The previously-stripped legacy form
"<deploymentId>:install-keycloak" now lands as the full id in the
URL, and JobDetail's `jobsById` lookup is already keyed by BOTH the
canonical id AND the bare jobName (JobDetail.tsx:124-131), so the
resolution path is preserved.
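
A minimal sketch of the href rule described above. The helper name
`jobHref` is hypothetical (the commit does not name the link builder);
only the encode-then-restore behaviour is taken from the description.

```typescript
// Hypothetical helper name; illustrates the "stop slicing" rule only.
function jobHref(jobId: string): string {
  // encodeURIComponent escapes unsafe path characters such as "/",
  // and it also escapes ":" as "%3A".
  const encoded = encodeURIComponent(jobId);
  // Restore ":" because RFC 3986 allows it inside a path segment (pchar).
  return "/jobs/" + encoded.replace(/%3A/g, ":");
}
```

Under this rule a region-prefixed id keeps its colon in the URL, while
the "/" in live K8s job ids is still percent-encoded.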

Test coverage: new Case 4 in JobsPage.flow-merge.test.tsx asserts
the openova-flow row's anchor `href` contains
`/jobs/contabo:bp-openova-flow-server` and is NOT the bare-jobName
form. All 4 flow-merge cases PASS. The 3 pre-existing failures in
JobsPage.test.tsx (back-to-apps href, canonical-columns header,
Show-as-Flow button) are the documented iter-2 baseline — untouched
by this change.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:29:46 +04:00
github-actions[bot]
5c987309b5 deploy: update catalyst images to 5332ed0 2026-05-11 17:56:31 +00:00
e3mrah
5332ed0691
fix(JobsPage): merge openova-flow snapshot rows into legacy /jobs table (#1413)
TC-035 iter-1 FAIL (2026-05-11): /sovereign/provision/12e194090631a885/jobs
asserts rows for the openova-flow-server + openova-flow-emitter HRs but the
JobsTable only sourced from /api/v1/deployments/<id>/jobs (legacy event
stream) — verified live: GET /v1/flows/<id>/snapshot returns 2 leaf nodes
(contabo:bp-openova-flow-server, contabo:bp-openova-flow-emitter) whose ids
NEVER appear in the legacy /jobs payload. Sovereigns whose state lives only
in the OpenovaFlow snapshot silently drop these rows.

Fix: wire `useFlowStream({deploymentId})` alongside the existing legacy
reducer + live-jobs backfill. Synthesize a Job stub per FlowNode via
`synthesizeJobFromFlowNode` (PR #1412 — same adapter JobDetail's
flow-fallback path uses) and append the rows whose ids are absent from the
legacy set. Legacy wins dedup on id collisions because it carries real
execution timeline / appId / parentId / dependsOn — the flow synth is
intentionally a minimal stub.

Behavior unchanged for Sovereigns without an active flow stream: empty
FlowNode map → empty `flowJobs` → `legacyMerged` passes through untouched.
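
The merge rule can be sketched as follows. This is an illustration
under assumptions, not the shipped code: the `Job` shape is reduced to
two fields and `mergeJobs` is a hypothetical name.

```typescript
// Reduced, hypothetical Job shape; the real type lives in @/lib/jobs.types.
interface Job {
  id: string;
  status: string;
}

// Keep every legacy row, then append flow-derived stubs whose ids are
// absent from the legacy set. On an id collision the legacy row wins.
function mergeJobs(legacy: Job[], flowJobs: Job[]): Job[] {
  const legacyIds = new Set(legacy.map((j) => j.id));
  const extras = flowJobs.filter((j) => !legacyIds.has(j.id));
  return [...legacy, ...extras];
}
```

An empty `flowJobs` array returns the legacy rows untouched, which
matches the pass-through behaviour described above.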

Test coverage (JobsPage.flow-merge.test.tsx — 3 cases, all PASS):
  1. Legacy 5 / flow empty → 5 rows, no behavior change.
  2. Legacy 5 / flow has 2 distinct ids → 7 rows with the contabo:bp-*
     ids present.
  3. Legacy 5 / flow has 1 id-collision + 1 new → 6 rows, legacy wins
     dedup (DOM scan asserts the colliding testid appears exactly once).

Validation:
  vitest: 3/3 PASS on new file; 13 prior tests in JobsPage.test.tsx
  unchanged from origin/main baseline (3 unrelated pre-existing failures
  in chrome/columns/Show-as-Flow tests, untouched by this fix).
  tsc --noEmit -p tsconfig.app.json: 27 errors, ALL pre-existing in
  @openova/flow-canvas + @openova/flow-core workspaces — zero new errors
  introduced.

Canonical seam reused (no new code paths):
  - @/lib/openflow-adapter-sse → useFlowStream (FlowPage / JobDetail share)
  - @/lib/synthesizeJobFromFlowNode (PR #1412 helper)
  - @/lib/jobs.types → Job (single source of truth)

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:54:14 +04:00
github-actions[bot]
1f05e52e77 deploy: update catalyst images to 36d1f56 2026-05-11 17:47:04 +00:00
e3mrah
36d1f56840
fix(JobDetail): fall back to OpenovaFlow snapshot when legacy /jobs 404 (#1412)
JobDetail built `jobsById` from the legacy useDeploymentEvents reducer
+ useLiveJobsBackfill polling. For Sovereigns whose state lives ONLY in
the openova-flow snapshot (post-flux-only flow, fresh chroot before the
catalyst-api event bridge has emitted any rows), that lookup misses and
JobDetail short-circuited to "Job not found" — never mounting FlowPage,
the very surface that would have painted the node.

Verified live this turn against deployment 12e194090631a885:
  GET /api/v1/flows/12e194090631a885/snapshot → 200, 2 leaf nodes
  GET /api/v1/deployments/12e194090631a885/jobs/<nodeId> → 404

This blocks 21 of the 26 iter-1 FAILs on the OpenovaFlow canvas test
matrix (TC-019/020/021/023/024/025/027/028/033/034/036/037/038/039/040/
041/042/053/054/060/064).

Fix:
  • JobDetail now reads the same useFlowStream hook FlowPage uses.
  • When `jobsById[jobId]` is undefined, look up the node in the flow
    snapshot's nodes Map. If found, synthesize a flat Job stub from the
    FlowNode (id, label, status) so the canvas mounts with the right
    hostJobId.
  • Behaviour for Sovereigns WITH an active event stream is unchanged
    — the legacy lookup wins and the synth stub is never read.
  • "Job not found" panel renders ONLY when BOTH lookups miss.

Tests:
  Added JobDetail.flow-fallback.test.tsx (vitest, 3 cases):
    1. Legacy has the job → FlowPage renders, no fallback.
    2. Legacy empty, flow snapshot has the node → FlowPage renders
       via synth job (the iter-1 FAIL scenario).
    3. Both empty → "Job not found" panel.
  All 3 new + 5 existing JobDetail tests pass.
  No tsc regressions (27 → 27 baseline errors, all pre-existing
  in flow-canvas/flow-core packages).

Refs INVIOLABLE-PRINCIPLES.md:
  #1 (waterfall): target-state fallback, no MVP "show loading" stub.
  #2 (no compromise): no field is faked with plausible data; absent
    timestamps land as null / 0 so fmtTime renders "—".
  #4 (never hardcode): the synth helper coerces FlowNode.status into
    the JobStatus vocabulary; the label falls back to the node id when
    `label` is empty.
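
A hedged sketch of the synth stub described above. All type shapes,
the `JobStatus` members, and the "unknown status becomes pending"
mapping are assumptions made for illustration; the commit only states
that status is coerced, the label falls back to the node id, and absent
timestamps stay null.

```typescript
// Hypothetical shapes; the real FlowNode and Job types are richer.
interface FlowNode {
  id: string;
  label?: string;
  status: string;
}
type JobStatus = "pending" | "running" | "succeeded" | "failed";
interface JobStub {
  id: string;
  title: string;
  status: JobStatus;
  startedAt: string | null;
  finishedAt: string | null;
}

const KNOWN_STATUSES: JobStatus[] = ["pending", "running", "succeeded", "failed"];

// Assumption: anything outside the known vocabulary is treated as pending.
function coerceStatus(raw: string): JobStatus {
  const idx = KNOWN_STATUSES.indexOf(raw as JobStatus);
  return idx >= 0 ? KNOWN_STATUSES[idx] : "pending";
}

function synthesizeJobStub(node: FlowNode): JobStub {
  return {
    id: node.id,
    // Fall back to the node id when the label is missing or empty.
    title: node.label ? node.label : node.id,
    status: coerceStatus(node.status),
    // No timestamps are invented; absent values stay null.
    startedAt: null,
    finishedAt: null,
  };
}
```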

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:43:43 +04:00
github-actions[bot]
0a869c3805 deploy: update catalyst images to 1863a25 2026-05-11 16:54:46 +00:00
Claude Code
1863a25c53 fix(openflow-adapter-sse): guard reducer iterations against missing fields
Root cause of live crash 'TypeError: t.relationships is not iterable':
the Go server uses omitempty JSON tags on FlowMessage so empty slices
are dropped from the wire (snapshot with 2 nodes + 0 rels arrives as
'{"type":"snapshot","nodes":[...]}' with no 'relationships' key).
The reducer iterates msg.relationships, msg.nodes, msg.ids, msg.pairs
without nullish guards → crashes on first frame.

Fix: add a defensive (?? []) fallback to every reducer iteration. The
state shape is unchanged and the reducer stays idempotent.
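
A minimal sketch of the guard, assuming simplified message and state
shapes (the field names on `FlowNode` and `Relationship` and the
function name `applySnapshot` are hypothetical):

```typescript
interface FlowNode {
  id: string;
  status: string;
}
interface Relationship {
  id: string;
  from: string;
  to: string;
}
// Both arrays are optional on the wire because the Go server's
// omitempty tags drop empty slices from the JSON payload.
interface SnapshotMessage {
  type: "snapshot";
  nodes?: FlowNode[];
  relationships?: Relationship[];
}
interface FlowState {
  nodes: Map<string, FlowNode>;
  rels: Map<string, Relationship>;
}

function applySnapshot(msg: SnapshotMessage): FlowState {
  const nodes = new Map<string, FlowNode>();
  const rels = new Map<string, Relationship>();
  // "?? []" turns a missing key into an empty iteration instead of a TypeError.
  for (const n of msg.nodes ?? []) nodes.set(n.id, n);
  for (const r of msg.relationships ?? []) rels.set(r.id, r);
  return { nodes, rels };
}
```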

Observed bundle: index-CEnQMVBy.js@2285:51356.
Snapshot proven empty-rel: GET /v1/flows/12e194090631a885/snapshot
returns {type:'snapshot',nodes:[2 items]} with relationships key absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:52:27 +02:00
Claude Code
75aac8d53c deploy: update catalyst images to 2ffdba0
Manual bump — Build & Deploy Catalyst workflow's deploy job lost the
push race twice on PR #1411 merge. Images exist in GHCR; this commit
lands the template+values bump so Flux on contabo-mkt reconciles and
the natural-view canvas restore (FlowCanvasOrganic + fold badges +
depth chip) takes effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:34:42 +02:00
e3mrah
2ffdba038f
fix: restore natural FlowPage canvas + drop synthetic phase/region pillars (#1411)
Founder rejected the lane-layout + synthetic-phase scaffolding shipped
via PR #1399/#1400/#1407. This commit restores the founder-tuned
natural view (FlowCanvasOrganic) and adds the per-bubble fold-
disclosure badge + top-right depth chip on top of it.

Adapter (products/openova-flow/adapter-flux/):
  - mapper.go: BuildFromHR now returns ONE leaf FlowNode + finish-to-
    start edges from spec.dependsOn only. Deleted BuildRegionNode,
    BuildPhaseNodes, BuildPhaseEdges, phaseLabels, phaseSortKey,
    AllPhaseSuffixes, PhaseSuffix* constants, derivePhase, PhaseLabel,
    PhaseSortKey. Node-id separator changed "/" → ":" so ids do not
    collide with URL routing (founder hit "Not Found" drilling into
    contabo/phase-0).
  - hr_informer.go: dropped bootstrap(), tracker, nodeGroups,
    reemitGroups(), buildGroupNode(). handle() is now single-leaf
    upsert + dependsOn edges.
  - rollup.go: deleted entirely (StatusTracker only existed for
    synthetic group rollups).
  - mapper_synthetic_test.go + rollup_test.go: deleted; mapper_test.go
    updated for the ":" separator + no-synthetic-rels assertions.

UI (products/catalyst/bootstrap/ui/):
  - FlowPage.tsx: switched from @openova/flow-canvas's FlowCanvas back
    to FlowCanvasOrganic. Dropped lane-layout (regionDescriptorsFromFlow),
    defaultFoldedAtDepth from @openova/flow-core, FoldControls chrome
    strip. Kept useFlowStream + ?folded=/?depth= URL contract.
  - flowStreamToOrganic.ts (new): bridges live SSE state to the Job[]
    + hints + region/family descriptors flowLayoutOrganic expects.
    Treats `contains` rels as parent-child and FS/SS/FF/SF/triggers as
    dependsOn.
  - FlowCanvasOrganic.tsx: ADDITIVE optional props onFoldToggle,
    badgeCounts, nodeActions, onNodeAction. Renders per-bubble "⊕ K"/
    "⊖" disclosure badge on group bubbles when wired; right-click
    opens a small action menu. Existing call sites are unchanged.
  - Depth chip: ◀ L<n>/<max> ▶ pinned top-right of canvas host,
    visible only when real groups exist in the data. Esc clears
    manual fold overrides.

Verification:
  - go build ./... in adapter-flux: clean
  - go test ./... in adapter-flux: PASS (12 tests)
  - tsc --noEmit on bootstrap/ui: clean
  - vitest FlowPage + FlowCanvasOrganic.bounded: 25/25 PASS
  - vitest JobDetail + distribution + flowLayoutOrganic + flow-bridge:
    27/27 PASS

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 20:22:58 +04:00
github-actions[bot]
98d5975566 deploy: update catalyst images to d96d3dd 2026-05-11 16:02:47 +00:00
e3mrah
d96d3dd0ff
fix(openflow-adapter-sse): subscribe to NAMED SSE events (#1410)
* feat(openova-flow-canvas): fold UX + lane layout + actions menu + cross-flow nav (Agent #9)

Wires the 6 founder-locked canvas views agreed 2026-05-11:

  • Lane layout — `meta.layout: 'lane-vertical' | 'lane-horizontal'`
    on a `contains`-parent renders the group as a rounded-rect
    swim-lane; children pack inside (L→R horizontal, T→B vertical).
    Lanes nest: region (vertical) → phase (horizontal) → HR bubbles.
    Falls back to organic d3-force when no group declares a layout
    hint, so single-region provisions look unchanged.
  • Child-count badge `[N]` on every foldable parent — recursive
    descendant count through `contains` edges, surfaced via
    PositionedNode.descendantCount. Renders independent of fold
    state per the founder-locked View 4 ASCII (region keeps `[43]`
    even when expanded to phases only).
  • Hover dim — onMouseEnter/Leave on a node dims non-neighbor
    nodes + non-incident edges to 35% opacity. Selection / host /
    neighbor rings keep full opacity per spec precedence.
  • Right-click → adapter actions menu — new `actions` +
    `onNodeAction` props on FlowCanvasProps. Renders the supplied
    NodeAction[] (filtered by per-action `enabled` predicate) in a
    NodeActionsMenu (click-outside + Esc dismissal, mirrors
    ProfileMenu's canonical seam).
  • `triggeredBy` cross-flow badge — when FlowInstance.triggeredBy
    is non-empty, a top-left banner lists the parent flows with a
    `[↗ open flow]` button → onNavigateFlow callback.
  • Cross-flow edges — when a Relationship's `toFlowId` references a
    flow not in the current canvas, the source node renders a
    "→ flow" tag that calls onNavigateFlow.

FlowPage wires onNodeAction to POST /api/v1/flows/{id}/nodes/{nodeId}
/actions/{actionId} and onNavigateFlow to the router. Default action
list (Retry/Suspend/View logs) supplied by FlowPage; adapters can
override.

Canonical seam citations (per ARCHITECT-FIRST):

  • core/src/layout.ts (Agent #1) — pure layout function. Extended
    with LaneDescriptor[] + descendantCount, cycle-safe lane-depth
    walks reusing the existing visited-set pattern. Lane geometry
    stays in canvas (the layout is pure topology).
  • widgets/auth/ProfileMenu.tsx — canonical click-outside + ESC
    dismissal pattern. NodeActionsMenu mirrors this verbatim so we
    stay consistent without a new radix/headless-ui dependency.

Tests: 25 core (was 20, +5 for lanes + descendantCount) + 22 canvas
(was 9, +13 for lane layout, badge math, hover dim, action menu,
triggeredBy banner, cross-flow tag). FlowPage tests still 8/8 green.

No vite/next builds (Rule 7). No kubectl writes (Rule 11). Lane
geometry has zero domain knowledge — the canvas never reads "phase"
or "region" as words; everything is `meta.layout` + `meta.isGroup`
+ `contains` edges driven by the adapter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(openflow-adapter-sse): subscribe to NAMED SSE events not just onmessage

Root cause of canvas "No nodes to render": the openova-flow-server
emits SSE frames with named event types per the contract:

  event: snapshot
  event: upsert-nodes
  event: upsert-rels
  ...

EventSource's `onmessage` handler ONLY fires for the default
("message") event type. addEventListener with the explicit name is
required for named events. The hook only had `next.onmessage = onMessage`
so EVERY frame the server emitted was silently dropped; the local state
stayed at the initial empty value and FlowCanvas rendered the empty
fallback message.

Verified live: in-browser test showed onmessage_count=0,
addEventListener('snapshot') count=1 — exactly one snapshot frame
arrived but the hook ignored it.

Fix: register addEventListener for every event name in the contract
(snapshot, upsert-flow, upsert-nodes, upsert-rels, delete-nodes,
delete-rels, heartbeat). onmessage retained as defensive default.
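
The registration pattern can be sketched like this. `EventSourceLike`
is a hypothetical minimal interface covering only the two members the
fix relies on, so the sketch runs without a browser; the function name
`subscribeToFlowEvents` is also an assumption.

```typescript
// Event names listed in the commit message above.
const FLOW_EVENT_NAMES = [
  "snapshot",
  "upsert-flow",
  "upsert-nodes",
  "upsert-rels",
  "delete-nodes",
  "delete-rels",
  "heartbeat",
] as const;

interface FrameEvent {
  data: string;
}
// Hypothetical subset of the EventSource surface used here.
interface EventSourceLike {
  onmessage: ((ev: FrameEvent) => void) | null;
  addEventListener(type: string, listener: (ev: FrameEvent) => void): void;
}

function subscribeToFlowEvents(
  source: EventSourceLike,
  onFrame: (name: string, data: string) => void,
): void {
  // Named events never reach onmessage, so each one needs its own listener.
  for (const name of FLOW_EVENT_NAMES) {
    source.addEventListener(name, (ev) => onFrame(name, ev.data));
  }
  // Kept as a defensive default for unnamed ("message") frames.
  source.onmessage = (ev) => onFrame("message", ev.data);
}
```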

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 19:59:42 +04:00
github-actions[bot]
0fc2adc601 deploy: update catalyst images to 1e14439 2026-05-11 15:59:18 +00:00
e3mrah
1e14439f95
fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template (#1409)
Without this env the proxy's resolveFlowServerURL() falls back to
per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>), which
only exists on Sovereigns that already installed bootstrap-kit slot 56
with httproute=enabled. Every other catalyst-api deployment (mothership
contabo + Sovereigns that haven't reached cutover yet) returns 502 on
/api/v1/flows/{deploymentId}/snapshot, which is the live regression the
founder saw at console.openova.io: "No nodes to render."

The env points at the in-cluster Service DNS for the LOCAL openova-flow-
server. Both the mothership (catalyst-system or catalyst namespace) and
each Sovereign chroot run the bp-openova-flow-server chart with a local
Service, so this URL is correct for every cluster catalyst-api runs in.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 19:55:58 +04:00
github-actions[bot]
62d807f808 deploy: update catalyst images to 5bd68ae 2026-05-11 14:11:33 +00:00
e3mrah
5bd68ae0f6
feat(openova-flow-canvas): fold UX + lane layout + actions menu + cross-flow nav (Agent #9) (#1407)
Wires the 6 founder-locked canvas views agreed 2026-05-11:

  • Lane layout — `meta.layout: 'lane-vertical' | 'lane-horizontal'`
    on a `contains`-parent renders the group as a rounded-rect
    swim-lane; children pack inside (L→R horizontal, T→B vertical).
    Lanes nest: region (vertical) → phase (horizontal) → HR bubbles.
    Falls back to organic d3-force when no group declares a layout
    hint, so single-region provisions look unchanged.
  • Child-count badge `[N]` on every foldable parent — recursive
    descendant count through `contains` edges, surfaced via
    PositionedNode.descendantCount. Renders independent of fold
    state per the founder-locked View 4 ASCII (region keeps `[43]`
    even when expanded to phases only).
  • Hover dim — onMouseEnter/Leave on a node dims non-neighbor
    nodes + non-incident edges to 35% opacity. Selection / host /
    neighbor rings keep full opacity per spec precedence.
  • Right-click → adapter actions menu — new `actions` +
    `onNodeAction` props on FlowCanvasProps. Renders the supplied
    NodeAction[] (filtered by per-action `enabled` predicate) in a
    NodeActionsMenu (click-outside + Esc dismissal, mirrors
    ProfileMenu's canonical seam).
  • `triggeredBy` cross-flow badge — when FlowInstance.triggeredBy
    is non-empty, a top-left banner lists the parent flows with a
    `[↗ open flow]` button → onNavigateFlow callback.
  • Cross-flow edges — when a Relationship's `toFlowId` references a
    flow not in the current canvas, the source node renders a
    "→ flow" tag that calls onNavigateFlow.

FlowPage wires onNodeAction to POST /api/v1/flows/{id}/nodes/{nodeId}
/actions/{actionId} and onNavigateFlow to the router. Default action
list (Retry/Suspend/View logs) supplied by FlowPage; adapters can
override.

Canonical seam citations (per ARCHITECT-FIRST):

  • core/src/layout.ts (Agent #1) — pure layout function. Extended
    with LaneDescriptor[] + descendantCount, cycle-safe lane-depth
    walks reusing the existing visited-set pattern. Lane geometry
    stays in canvas (the layout is pure topology).
  • widgets/auth/ProfileMenu.tsx — canonical click-outside + ESC
    dismissal pattern. NodeActionsMenu mirrors this verbatim so we
    stay consistent without a new radix/headless-ui dependency.
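
A sketch of the recursive descendant count with a visited set, as
described for the child-count badge. The `Rel` shape (with `from` as
the parent and `to` as the child) and the function name are
assumptions; the real logic lives in core/src/layout.ts.

```typescript
// Hypothetical relationship shape: `from` contains `to`.
interface Rel {
  type: string;
  from: string;
  to: string;
}

function descendantCount(rootId: string, rels: Rel[]): number {
  // Index children by parent, following `contains` edges only.
  const children = new Map<string, string[]>();
  for (const r of rels) {
    if (r.type !== "contains") continue;
    const list = children.get(r.from) ?? [];
    list.push(r.to);
    children.set(r.from, list);
  }
  // The visited set makes the walk cycle-safe and counts a node once
  // even when it has more than one `contains` parent.
  const visited = new Set<string>([rootId]);
  const stack: string[] = [rootId];
  let count = 0;
  while (stack.length > 0) {
    const id = stack.pop() as string;
    for (const child of children.get(id) ?? []) {
      if (visited.has(child)) continue;
      visited.add(child);
      count += 1;
      stack.push(child);
    }
  }
  return count;
}
```

Because the count depends only on topology, it stays the same whether
the parent is folded or expanded.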

Tests: 25 core (was 20, +5 for lanes + descendantCount) + 22 canvas
(was 9, +13 for lane layout, badge math, hover dim, action menu,
triggeredBy banner, cross-flow tag). FlowPage tests still 8/8 green.

No vite/next builds (Rule 7). No kubectl writes (Rule 11). Lane
geometry has zero domain knowledge — the canvas never reads "phase"
or "region" as words; everything is `meta.layout` + `meta.isGroup`
+ `contains` edges driven by the adapter.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:03:00 +04:00
e3mrah
410ce2d394
fix(openova-flow-proxy): derive upstream URL from deployment FQDN (HTTPRoute) — Agent #8 (#1405)
Mothership catalyst-api serves /sovereign/api/v1/flows/{deploymentId}/* for
every Sovereign's user-facing job view, but the previous resolver only knew
about OPENOVA_FLOW_SERVER_URL (or the in-cluster Service DNS default). On
the mothership both end up at a Service name that DNS cannot resolve, so
prov #34 hit:

  HTTP/2 502 openova-flow-server unreachable:
    Get "http://openova-flow-server.catalyst-system.svc.cluster.local:8080/v1/flows/.../snapshot":
    dial tcp: lookup openova-flow-server.catalyst-system.svc.cluster.local: no such host

Resolution order is now:

  1. OPENOVA_FLOW_SERVER_URL env override — wins (chroot catalyst-api).
  2. h.deployments.Load(deploymentId) → Request.SovereignFQDN → build
     `https://openova-flow.<sovereignFQDN>` (HTTPRoute pattern documented
     in platform/openova-flow-server/chart/values.yaml comment + the
     bootstrap-kit overlay clusters/_template/bootstrap-kit/56-bp-openova-
     flow-server.yaml which sets `hostname: openova-flow.${SOVEREIGN_FQDN}`).
  3. No deployment in store (and no env): return 404 instead of silently
     dialing a Service URL the mothership can't reach.
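
The resolution order above, sketched for illustration. The real
resolver is Go code in catalyst-api; this TypeScript version only
mirrors the three-step decision, and the function signature and
`Resolution` type are hypothetical.

```typescript
type Resolution =
  | { ok: true; url: string }
  | { ok: false; status: 404 };

// envOverride stands in for OPENOVA_FLOW_SERVER_URL; sovereignFQDN stands
// in for the value loaded from the deployment record (undefined if absent).
function resolveFlowServerURL(
  envOverride: string | undefined,
  sovereignFQDN: string | undefined,
): Resolution {
  // 1. The env override wins.
  if (envOverride) return { ok: true, url: envOverride };
  // 2. Otherwise derive the HTTPRoute hostname from the deployment FQDN.
  if (sovereignFQDN) return { ok: true, url: "https://openova-flow." + sovereignFQDN };
  // 3. No env and no usable FQDN: report 404 rather than dialing blindly.
  return { ok: false, status: 404 };
}
```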

Canonical patterns cited (ARCHITECT-FIRST rule):
  - PDM-by-deploymentId lookup: deployments.go GetDeployment lines 1201-1216
    (h.deployments.Load(id) → (*Deployment).Request.SovereignFQDN). The
    chrootEnsureDeployment fallback (jobs.go lines 53-86) covers the
    chroot case; on the mother it returns nil and surfaces 404.
  - Self-signed TLS skip-verify: deployment_handover_export.go line 62
    (&tls.Config{InsecureSkipVerify: true} with nolint:gosec, gated by
    explicit operator opt-in). Gated here on
    OPENOVA_FLOW_TLS_SKIP_VERIFY=true so qa-loop Sovereigns minting
    LE-staging "Fake LE Intermediate X1" certs are reachable, while
    production stays strict.

SSE streaming logic is unchanged. Per docs/INVIOLABLE-PRINCIPLES.md #4
the only hostname literal added is the chart-documented prefix
`openova-flow.`; the FQDN suffix itself comes from the per-deployment
record at runtime.

Tests:
  - TestFlowProxy_EnvOverride_TakesPrecedence — chroot path
  - TestFlowProxy_DerivesURLFromDeploymentFQDN — mother path
  - TestFlowProxy_DerivedURL_NotFoundReturns404
  - TestFlowProxy_DerivedURL_EmptyFQDNReturns404
  - TestFlowProxy_DerivedURL_PathAssembly
All 15 TestFlowProxy_* tests pass (go test ./internal/handler -run TestFlowProxy).
go vet ./... clean. go build ./cmd/api clean. The two pre-existing
TestHandleWhoami_* failures on origin/main are unrelated.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:32:08 +04:00
github-actions[bot]
386884fa2e deploy: update catalyst images to 52cc679 2026-05-11 13:13:04 +00:00
e3mrah
52cc6794ee
fix(ui-build): include @types/node so tests referencing global compile (#1403)
build-ui on 841b6133 surfaced TS2304 "Cannot find name 'global'" in
several layout tests after the workspace-root npm ci fix exposed
errors that the prior react/d3-* failures had masked. The tests use
`global.fetch = vi.fn(...)` which requires @types/node ambient types.

tsconfig.app.json restricted `types` to ["vite/client"], so node
types weren't auto-loaded. Add "node" so the existing @types/node
devDep (^24.12.0) is in scope.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:10:08 +04:00
e3mrah
841b61336c
fix(ui-build): npm ci from workspace root for @openova/flow-* resolution (#1401)
PR #1399 (Agent #5) added npm workspaces at the repo root, but the
Containerfile still ran `npm ci` from /repo/products/catalyst/bootstrap/ui/
which bypasses workspace activation. Cross-workspace bare-spec imports
(react / d3-force / d3-drag / d3-selection) from the canvas package
source couldn't resolve, breaking the Docker build with ~120 TS2307
errors on commit 2c6595a3 (2026-05-11).

Fix: COPY the workspace-root package.json + package-lock.json + each
workspace's package.json BEFORE installing. Run `npm ci --workspaces
--include-workspace-root` from /repo. Then WORKDIR into the leaf for
the Vite build. This is the canonical npm workspaces flow.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:06:13 +04:00
e3mrah
b8a75962a8
feat(openova-flow-adapter-flux): synthetic phase/region nodes + contains edges (Agent #6) (#1400)
OpenovaFlow's FlowNode is deliberately domain-agnostic — Phase 0/1/2/3
+ multi-region structure are conveyed via synthetic group nodes,
contains relationships, and adapter-supplied meta.layout hints (same
primitives a Temporal/Argo/Airflow adapter would use for their own
concepts). Catalyst-specific knowledge stays in the adapter.

What this PR ships
==================

products/openova-flow/adapter-flux:
  - mapper.go: phase-suffix constants, BuildPhaseNodes, BuildPhaseEdges,
    derivePhase (slot-label / component-label driven, no hardcoded
    HR-name → phase table). BuildFromHR now returns two `contains` rels
    per leaf (region row + phase column). BuildRegionNode carries
    meta.layout=lane-vertical + isGroup.
  - rollup.go (new): StatusTracker + RollupStatus (worst-of:
    failed > running > pending > succeeded). Mirrors the same worst-of
    rollup the catalyst-api status-projection uses for the Sovereign
    Console progress widget.
  - hr_informer.go: bootstrap emits region + 4 phase nodes + 3 FS edges
    per region; HR upserts/deletes update the StatusTracker and re-emit
    affected synthetic parents with fresh rolled-up status.
  - test/mapper_synthetic_test.go (new): 9 cases — phase nodes,
    phase edges, slot/component/name-fallback derivation, 43-mock-HR
    acceptance, region-scoped IDs, default region fallback.
  - test/rollup_test.go (new): 9 cases — rollup palette, tracker
    lifecycle, per-group isolation.
  - test/mapper_test.go: updated existing assertions for the new
    contains-edge count (2 per HR, was 1).

clusters/_template/bootstrap-kit/*.yaml (45 HRs):
  - Added catalyst.openova.io/slot=<NN> label per HR (chart-level slot
    surface so the adapter doesn't hardcode HR-name → phase). Mirrors
    the existing catalyst.openova.io/component label pattern in
    platform/external-secrets-stores/chart/templates/*.yaml +
    platform/openclaw/chart/templates/*.yaml.
  - 06a-bp-self-sovereign-cutover.yaml + 13-bp-catalyst-platform.yaml
    also get catalyst.openova.io/component={cutover,catalyst-platform}
    so their phase derivation is explicit, not name-fallback.

Canonical patterns cited
========================
1. catalyst.openova.io/component label on platform/* charts
   (platform/external-secrets-stores, platform/openclaw) — same label
   vocabulary, extended with slot.
2. worst-of-children rollup matches the existing catalyst-api
   status-projection pattern (Sovereign Console progress widget).
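
A sketch of the worst-of rollup ordering (failed > running > pending >
succeeded). The shipped implementation is Go (rollup.go); this
TypeScript version is illustrative only, and returning "pending" for a
group with no children is an assumption the commit does not state.

```typescript
type Status = "succeeded" | "pending" | "running" | "failed";

// Higher number means "worse" and therefore wins the rollup.
const SEVERITY: Record<Status, number> = {
  succeeded: 0,
  pending: 1,
  running: 2,
  failed: 3,
};

function rollupStatus(children: Status[]): Status {
  // Assumption: an empty group reports "pending".
  if (children.length === 0) return "pending";
  let worst: Status = children[0];
  for (const s of children) {
    if (SEVERITY[s] > SEVERITY[worst]) worst = s;
  }
  return worst;
}
```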

Tests
=====
  go test ./test/... → 31 PASS, 0 FAIL.
  go vet ./... → clean.

Definition of Done (after Build & Deploy + emitter reconcile)
=============================================================
GET /sovereign/api/v1/flows/<deploymentId>/snapshot returns:
  - N region root nodes (1 per adapter sidecar)
  - 4 phase nodes per region (8 total for 2-region prov)
  - N HR nodes per region with TWO `contains` edges each
  - 3 phase-FS edges per region

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:00:26 +04:00
e3mrah
2c6595a378
feat(openova-flow): npm workspaces + FlowPage canvas real-adapter rewire (Agent #5) (#1399)
Lands the OpenovaFlow Foundation end-to-end so the catalyst-ui FlowPage
consumes the new openova-flow-server's merged multi-region SSE stream
(`GET /api/v1/flows/{deploymentId}/stream`) and renders the per-region
adapter-flux emissions directly via @openova/flow-canvas. Closes the
revert from PR #1394 and unblocks the prov #34 multi-region 2-bubble
demo (fsn1 + hel1 each install bp-gateway-api → two bubbles).

# What ships

## A. npm workspaces at repo root

  • New `package.json` declares `openova-monorepo` private root with
    three workspaces: products/openova-flow/{core,canvas} +
    products/catalyst/bootstrap/ui.
  • Root `package-lock.json` resolves @openova/flow-* as workspace
    symlinks into the hoisted node_modules tree.
  • react / react-dom / d3-* are now hoisted into the monorepo's root
    node_modules, so flow-canvas's bare `import 'react'` resolves via
    standard upward-walking node_modules — no per-package sibling
    node_modules required (the root cause of PR #1389's build break).

## B. Catalyst-ui consumes @openova/flow-* via file: deps

  • catalyst-ui's `package.json` adds `@openova/flow-core` and
    `@openova/flow-canvas` as `file:../../../openova-flow/{core,canvas}`
    deps so `npm ci` from within catalyst-ui (today's CI path) keeps
    working without needing root-level `npm ci -ws`.
  • Vite `resolve.alias` + tsconfig `paths` bind `@openova/flow-core`
    and `@openova/flow-canvas` to the source-only `./src/index.ts`
    entry points. `dedupe: ['react', 'react-dom']` guards against
    double-instancing.
  • `tsconfig.app.json` `include` adds the two flow-package src trees
    so tsc covers them with catalyst-ui's strict settings (instead of
    each package's standalone `tsc -p tsconfig.json`, which lacks the
    React/d3 node_modules siblings).

## C. New SSE consumer + bridge

  • `src/lib/openflow-adapter-sse.ts` — `useFlowStream` React hook +
    pure `reduceFlowMessage` reducer. Consumes the contract verbatim
    (snapshot / upsert-flow / upsert-nodes / upsert-rels / delete-nodes
    / delete-rels). Owns the EventSource lifecycle, GET /snapshot
    pre-paint, capped exponential reconnect.
  • `src/lib/flow-bridge.ts` — catalyst-specific glue:
    `CATALYST_STATUS_PALETTE` (mirrors `--bubble-*` CSS tokens onto
    `StatusTone`), `flowStateToArrays` (Map→Array materialiser),
    `regionDescriptorsFromFlow` (derives FlowCanvas regions from live
    region tags + optional wizard-store augmentation), and
    `rollupFlowStatus` (provisioning-status rollup on the new
    contract).
  • NOT a Job-shape bridge — the legacy Job adapter from PR #1389
    is gone. catalyst-ui never goes through Catalyst's legacy Job model
    again; the SSE stream IS the source of truth.
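
The "capped exponential reconnect" mentioned above can be sketched as a
pure delay function. The base delay, the cap, and the function name are
assumptions; the commit does not state the actual values.

```typescript
// Delay before reconnect attempt N (0-based): base * 2^N, never above cap.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000, // assumed base delay
  capMs: number = 30000, // assumed upper bound
): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

A hook would typically reset the attempt counter to 0 once a frame
arrives, so a single blip does not leave the stream on the longest
delay.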

## D. FlowPage.tsx rewired

  • Drives `FlowCanvas` from `@openova/flow-canvas` directly off the
    new hook.
  • Multi-region support comes for free: per-region adapter-flux tags
    every emitted FlowNode with `region: '<location-code>'`; the
    canvas's swimlane layout buckets by `region`. Single-region
    provisions render identically to before via a synthetic
    fallback descriptor.
  • Embedded mode preserved for JobDetail.

## E. Containerfile preserves CI build

  • COPY products/openova-flow/{core,canvas}/{package.json,src/}
    BEFORE `npm ci` so `file:` deps validate. Subsequent
    `COPY products/` layers the rest (CONTRACT.md etc.) in.

# Tests

  • 23 new tests, 0 regressions on adjacent areas:
    - `openflow-adapter-sse.test.ts` (6) — reducer covers all 6
      FlowMessage variants including delete-nodes' rel-prune cascade
      AND a multi-region merge case (fsn1 + hel1 both install
      bp-gateway-api).
    - `flow-bridge.test.ts` (10) — palette completeness, Map→Array
      ordering, region descriptor derivation/fallback, status rollup
      including group-exclusion and terminal-failure detection.
    - `FlowPage.test.tsx` (7) — empty-state mount, StatusStrip, no
      legacy mode toggle, embedded variant.
  • flow-core: 20/20 passing; flow-canvas: 9/9 passing.
  • Vitest full suite: 1130 pass / 87 fail (87 fails are pre-existing
    on main and unrelated — PinInput6, ProvisionPage, etc.). Baseline
    on main is 1052 pass / 88 fail / 27 failed files; this PR brings
    78 new passing tests and lowers failing files from 27 → 18.

# Constraints honoured (Rule 7)

  • NO `vite build` / `next build` / `npm run build` / `npx playwright
    test` / `npx playwright install`. Only `tsc --noEmit` + `vitest
    run` + `npm install --package-lock-only`.
  • NO `kubectl apply` / chart manifests touched (Rule 11).
  • NO hardcoded URLs / regions / k3s flags. Endpoint composed from
    `API_BASE`; regions derived from live FlowNode tags; deploymentId
    from `useParams` (Rule 18).
  • Two-repo discipline: openova-io/openova only (Rule 21).
  • Conventional commit + Claude co-author footer (Rule 20).
  • isolation:"worktree" — work landed in a dedicated worktree.

# Canonical-seam citations (ARCHITECT-FIRST)

  1. PR #1389's `flow-bridge.ts` — reference for the shape of a
     catalyst-ui→@openova/flow contract layer. NOT conflated: that
     bridge translated legacy Catalyst Jobs into FlowNodes; this one
     consumes the new SSE FlowMessage stream directly with no Job
     intermediary.
  2. `useDeploymentEvents.ts` (line 526+, `openStream` + `onerror`
     reconnect + capped retry) — canonical SSE consumer pattern in
     this codebase. `useFlowStream` mirrors it (capped exponential
     backoff, idempotent reducer over replayed buffered events).

# Definition of Done — post-merge verification plan

  1. CI green (catalyst-build builds the new Containerfile path).
  2. `curl -k -b /tmp/cz-cookie-prov27.txt
     'https://console.openova.io/sovereign/api/v1/flows/5a175e0a88c99cec/snapshot' | jq`
     → nodes[] contains BOTH `fsn1/bp-gateway-api` AND `hel1/bp-gateway-api`.
  3. Browser test: navigate to
     `https://console.openova.io/sovereign/provision/5a175e0a88c99cec/jobs/install-gateway-api`
     → expect TWO bubbles (one per region).
  4. If snapshot is empty, inspect emitter DaemonSets:
     `kubectl --context=omantel get pods -n openova-flow`.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:59:07 +04:00
github-actions[bot]
07ec0ee61c deploy: update catalyst images to 22855e6 2026-05-11 12:03:26 +00:00
e3mrah
22855e62d8
feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396)
Final integration piece for OpenovaFlow infrastructure path —
catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID
+ SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits
distinct region tags on every FlowNode and the snapshot returns 2× per
HR on a multi-region Sovereign.

Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go
server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst-ui
temporary revert until npm workspaces land), PR #1395 (chart no-op).

## Scope vs original Agent #3 brief

The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire +
runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred:
PR #1394 reverted Agent #1's UI wiring because the Docker UI build has
no node_modules for the cross-workspace canvas source. Founder note on
#1394: "Agent #3 (or a follow-up) will re-wire them properly once npm
workspaces are configured at repo root."

This PR ships the infrastructure half (proxy + cloud-init + runbook).
The canvas-side rewire is a separate follow-up PR that needs npm
workspaces, not surgical edits to FlowPage.

## What ships

### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events}

products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go:
- GET /snapshot — JSON pass-through, headers + status forwarded
- GET /stream — unbuffered SSE pass-through using http.Flusher (NOT
  httputil.ReverseProxy; that buffers and breaks text/event-stream)
- POST /events — body forwarded byte-for-byte
- Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign
  in-cluster Service DNS)

Routes registered in cmd/api/main.go inside the auth-gated chi.Group.

11 table-driven tests cover snapshot/events/stream pass-through, upstream
404/400/unreachable propagation, empty-deploymentId guard, SSE frames
arrive AS EMITTED, and env-default fallback.

### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY

- infra/hetzner/cloudinit-control-plane.tftpl — two new postBuild.
  substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP
- infra/hetzner/main.tf — primary CP renders var.region as region key;
  secondary CP renders each.key (e.g. "hel1-1") from for_each over
  local.secondary_regions
- infra/hetzner/variables.tf — new sovereign_deployment_id var (string,
  default "" for tofu mocks)
- provisioner.go writeTfvars — writes vars["sovereign_deployment_id"]
  = req.DeploymentID
- bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal
  "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY}
  envsubst keys

### 3. Deployment record flag

handler/deployments.go State() — emits `openovaFlowEnabled: true` on
every deployment. The catalyst-ui rewire (follow-up PR) will read this
to enable the openova-flow-server adapter; legacy provisions without
the flag will keep the bridge once the rewire lands.

### 4. Verification runbook

docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body
(multi-region cpx42 fsn1+hel1, qaTestEnabled=true,
sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual
canvas checks (gated on the follow-up UI rewire), and a failure-class
triage table.

## Canonical-seam citations

1. SSE pattern — products/catalyst/bootstrap/api/internal/handler/
   deployments.go:1244-1287 (StreamLogs): identical Content-Type +
   Cache-Control + X-Accel-Buffering header set; identical
   http.Flusher.Flush() after each write; identical r.Context().Done()
   cancel path.

2. postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893
   (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var}
   form, dual emission at primary + secondary CP for_each in main.tf.

## Verification

```
$ go build ./...
(clean)

$ go vet ./...
(clean)

$ go test ./internal/handler/ -run TestFlowProxy -count=1 -race
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler   1.410s

$ go test ./internal/provisioner/... -count=1
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner  0.025s
```

3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields,
TestHandleWhoami_PinSessionRBACClaims,
TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on
main HEAD without this PR — unrelated baseline state.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:01:09 +04:00
github-actions[bot]
cdd8743177 deploy: update catalyst images to 2d54ced 2026-05-11 11:49:34 +00:00
e3mrah
2d54cedb78
revert(catalyst-ui): unwire @openova/flow-* until proper workspaces land (#1394)
PR #1389 wired the new @openova/flow-core + @openova/flow-canvas
packages into catalyst-ui via Vite alias + tsconfig paths. Build-image
tsc then tried to typecheck the canvas source
(`products/openova-flow/canvas/src/`), which has no sibling
node_modules, so bare imports for react/d3-* fell off the resolution
chain and the Docker UI build broke on 16ec3399 with ~120 TS2307 errors.

PR #1392 attempted to add explicit paths for react/d3-* but pointed
at runtime .js dirs (no .d.ts), which broke ALL of catalyst-ui's
type resolution.

Cleanest emergency revert: undo the FlowPage refactor, restore vite
alias + tsconfig paths to pre-#1389 state, delete flow-bridge.{ts,test.ts}.
The new openova-flow/{core,canvas} source packages remain on disk —
Agent #3 (or a follow-up) will re-wire them properly once npm
workspaces are configured at repo root. Until then catalyst-ui uses
the legacy flowLayoutOrganic + FlowCanvasOrganic stack and builds
cleanly.

Multi-region rendering goal is unblocked: Agent #2's openova-flow-server
+ adapter-flux still deploy via bp-openova-flow-{server,emitter} HRs;
the canvas-side rewiring is the follow-up.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:47:20 +04:00
e3mrah
783b405f67
fix(openova-flow): tsc paths for cross-workspace canvas source (#1392)
Build-ui failed on 16ec3399 with TS2307 'Cannot find module react/d3-*'
when typechecking ../../../openova-flow/canvas/src/FlowCanvas.tsx.

Vite's bundler-mode module resolution starts from the imported file's
location. Canvas source lives at products/openova-flow/canvas/src/
with no sibling node_modules — bare-spec imports for react / react-dom /
d3-force / d3-drag / d3-selection fall off the resolution chain.

Fix: extend catalyst-ui tsconfig.app.json with explicit `paths` entries
mapping those bare specs to catalyst-ui's installed node_modules. Mirrors
the vite.config.ts alias additions Agent #1 introduced; both resolvers
now agree on the path. Also expands `include` to typecheck the canvas +
core sources from catalyst-ui's compilation root, so future regressions
land at PR-CI time, not build-image time.

Workspaces will eventually supersede this — Agent #2+#3 plan to land
real npm workspaces. Until then, paths is the canonical seam.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:42:22 +04:00
e3mrah
aaaaadf8bc
feat(openova-flow): server (HTTP+SSE event router) + flux adapter (K8s informer sidecar) (#1390)
Agent #2 of 3 for OpenovaFlow. Ships the Go backend independently of
Agent #1's TS packages (@openova/flow-core + @openova/flow-canvas);
the FlowMessage JSON contract is locked between agents.

Two Go modules (separate go.mod each so the dep graphs stay decoupled):

- products/openova-flow/server/ — stateless HTTP+SSE event router.
  Map<flowId, RingBuffer<FlowMessage>>, in-memory, no DB. Endpoints:
  POST /v1/flows/{flowId}/events, GET /v1/flows/{flowId}/snapshot,
  GET /v1/flows/{flowId}/stream (SSE with 15s heartbeats + Last-Event-ID
  seq stamping), DELETE /v1/flows/{flowId}, GET /healthz, /readyz.
  Zero external Go deps (stdlib net/http). Ring cap default 4096
  (env-overridable). Locked schema validation rejects unknown envelope
  variants with 400.

- products/openova-flow/adapter-flux/ — DaemonSet sidecar that watches
  helm.toolkit.fluxcd.io/v2.HelmRelease + HelmChart CRs via
  client-go's dynamicinformer.NewFilteredDynamicSharedInformerFactory
  (canonical seam: products/catalyst/bootstrap/api/internal/k8scache/factory.go),
  maps each event to FlowMessage via a pure-transform mapper, POSTs to
  the configured openova-flow-server with exponential-backoff retry.
  Status mapping: Ready=True → succeeded, InstallFailed/UpgradeFailed/
  RetriesExhausted → failed, Progressing/Unknown/other-False → running,
  no Ready yet → pending. FlowNode.id format "{REGION_KEY}/{hrName}"
  so multi-region renders correctly. Region-aware: synthetic region
  parent FlowNode emitted on bootstrap; dependsOn entries fan-out to
  finish-to-start relationships.
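
The status mapping and node-id format described for adapter-flux can be sketched as a pure function. This is a minimal illustration, not the mapper's real code: `readyCondition`, `mapStatus` and `nodeID` are hypothetical names, and treating the failure reasons as applying only when Ready is False is an assumption drawn from the "other-False" wording.

```go
package main

import "fmt"

// readyCondition is a hypothetical, trimmed-down view of a HelmRelease
// Ready condition. A nil pointer means no Ready condition exists yet.
type readyCondition struct {
	Status string // "True", "False" or "Unknown"
	Reason string
}

// mapStatus applies the mapping from the commit message:
// no Ready yet -> pending, Ready=True -> succeeded,
// False with a terminal failure reason -> failed, anything else -> running.
func mapStatus(c *readyCondition) string {
	if c == nil {
		return "pending"
	}
	if c.Status == "True" {
		return "succeeded"
	}
	if c.Status == "False" {
		switch c.Reason {
		case "InstallFailed", "UpgradeFailed", "RetriesExhausted":
			return "failed"
		}
	}
	return "running"
}

// nodeID builds the "{REGION_KEY}/{hrName}" identifier so the same
// HelmRelease name in two regions yields two distinct nodes.
func nodeID(regionKey, hrName string) string {
	return regionKey + "/" + hrName
}

func main() {
	fmt.Println(nodeID("fsn1", "bp-gateway-api"), mapStatus(&readyCondition{Status: "True"}))
	fmt.Println(nodeID("hel1", "bp-gateway-api"), mapStatus(nil))
}
```

Keeping the mapping free of I/O is what allows a table-driven test to cover every Ready transition without a cluster.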

Two wrapper charts under platform/openova-flow-{server,emitter}/chart/
(canonical seam: platform/qa-app/chart/ for the simple
Deployment+Service+SA shape; platform/k8s-ws-proxy/chart/ for the
DaemonSet+ClusterRole+ClusterRoleBinding shape). MIRROR-EVERYTHING:
image refs go through harbor.openova.io/proxy-ghcr/openova-io/...
Image tag + required runtime config fail-fast at chart render via
_helpers.tpl so silent ImagePullBackOff / boot crash is impossible.

Two bootstrap-kit HRs added (slots 56 + 57):
- 56-bp-openova-flow-server (dependsOn: bp-cilium, bp-cert-manager) —
  installs on primary cluster only; Cilium Gateway HTTPRoute at
  openova-flow.<sovereignFQDN> for cross-cluster ingest.
- 57-bp-openova-flow-emitter (dependsOn: bp-flux) — DaemonSet, runs
  on every cluster (mother + Sovereign + every secondary region).

scripts/expected-bootstrap-deps.yaml updated; check-bootstrap-deps.sh
audit passes (drift=0, cycles=0).

Tests (all green):
- server contract_test.go — every FlowMessage variant round-trips JSON,
  unknown/malformed variants reject. Cross-flow Triggerer/ToFlowID
  preserved.
- server server_test.go — full HTTP surface, including SSE replay+tail
  with a real httptest.Server.
- adapter mapper_test.go — every HelmRelease.status.conditions[Ready]
  transition + multi-dependsOn fan-out + family-label/heuristic + region
  fallback.
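
The SSE replay+tail behaviour exercised by server_test.go rests on the per-flow ring buffer with sequence stamping. A minimal sketch of that idea follows, under stated assumptions: the type and method names are hypothetical, the eviction strategy is a simple slice shift rather than whatever the server actually uses, and the capacity of 3 is only for demonstration (the commit states a default of 4096).

```go
package main

import "fmt"

// event is a hypothetical stand-in for a sequence-stamped FlowMessage.
type event struct {
	Seq  uint64
	Data string
}

// ring keeps the most recent `capacity` events for one flow.
type ring struct {
	buf      []event
	capacity int
	next     uint64 // sequence number assigned to the next pushed event
}

func newRing(capacity int) *ring {
	return &ring{capacity: capacity, next: 1}
}

// push stamps the event with the next sequence number and evicts the
// oldest entry when the buffer is full (O(n) shift, fine for a sketch).
func (r *ring) push(data string) uint64 {
	e := event{Seq: r.next, Data: data}
	r.next++
	if len(r.buf) == r.capacity {
		copy(r.buf, r.buf[1:])
		r.buf[len(r.buf)-1] = e
	} else {
		r.buf = append(r.buf, e)
	}
	return e.Seq
}

// since returns the buffered events newer than lastEventID, which is what
// a reconnecting SSE client would replay before tailing live events.
func (r *ring) since(lastEventID uint64) []event {
	var out []event
	for _, e := range r.buf {
		if e.Seq > lastEventID {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	r := newRing(3)
	for _, d := range []string{"a", "b", "c", "d", "e"} {
		r.push(d)
	}
	fmt.Println(r.since(0)) // events 3..5 survive; 1 and 2 were evicted
	fmt.Println(r.since(4)) // only event 5 is newer than Last-Event-ID 4
}
```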

Verification done locally:
- (cd products/openova-flow/server && go build ./... && go test ./...) — PASS
- (cd products/openova-flow/adapter-flux && go build ./... && go test ./...) — PASS
- helm template platform/openova-flow-server/chart/ — renders cleanly
- helm template platform/openova-flow-emitter/chart/ — renders cleanly
- bash scripts/check-bootstrap-deps.sh — PASS (drift=0)

Agent #3 follow-ups (called out in slot 57's HelmRelease comments):
- Thread SOVEREIGN_DEPLOYMENT_ID + REGION_KEY into the
  postBuild.substitute env in infra/hetzner/cloudinit-control-plane.tftpl
  so the emitter's flowId/regionKey become per-deployment + per-region
  automatically. Today the slot uses SOVEREIGN_FQDN as the flowId
  fallback and "primary" as the regionKey default; per-Sovereign overlays
  can override pre-Agent-#3.
- catalyst-api proxy at /sovereign/api/v1/flows/{id}/stream so the
  Sovereign Console canvas hits a single in-tree origin.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:36:54 +04:00
e3mrah
16ec3399e9
feat(openova-flow): extract flow-core + flow-canvas packages (drop parentId, adopt PMI temporal types) (#1389)
* feat(openova-flow): extract flow-core + flow-canvas packages (drop parentId, adopt PMI temporal types)

OpenovaFlow Foundation — Agent #1 of 3. Splits flow visualisation out
of Catalyst into two standalone packages:

  • @openova/flow-core: plugin-shaped contract (FlowInstance, FlowNode,
    Relationship, FlowMessage, FlowAdapter) + pure layout engine.
  • @openova/flow-canvas: React SVG canvas, zero OpenOva imports,
    theme-decoupled via CSS variables.

Founder-locked design adopted:

  • FlowInstance is first-class (definitionId / parentFlowId /
    triggeredBy) — DAG vs DAG-run distinction works for Argo,
    Temporal, Flux, custom.
  • Node hierarchy moves from FlowNode.parentId to
    Relationship{type:'contains'}. The legacy parentId field is gone
    from the new contract (the bridge still adapts legacy Job.parentId
    so catalyst-ui keeps working against today's catalyst-api).
  • Edge types follow the PMI temporal taxonomy: finish-to-start (FS),
    start-to-start (SS), finish-to-finish (FF), start-to-finish (SF)
    + 'triggers' (event-driven) + 'contains' (hierarchy). Failure-
    conditioned edges render as overlays and are NOT counted toward
    depth.

Layout engine port:
  • Verbatim cycle-safety + parent-elision + MAX_VISIBLE_DEPTH cap
    invariants from products/catalyst/.../flowLayoutOrganic.ts.
  • Adds component-detection (weak connected components on the
    blocking-DAG graph) so future UIs can paint gutters.

Catalyst-ui refactor:
  • New products/catalyst/bootstrap/ui/src/lib/flow-bridge.ts adapts
    legacy Job[] → FlowNode + Relationship[]. Single-responsibility
    seam — the only place that still knows about the legacy shape.
  • FlowPage now drives @openova/flow-canvas via the bridge.
  • Legacy lib/flowLayoutOrganic.ts + sovereign/FlowCanvasOrganic.tsx
    remain in place for non-FlowPage consumers (JobDetail breadcrumbs,
    JobsTable rollups) until Agent #3 retires them with the real
    catalyst-api FlowAdapter.

Tests:
  • core: 20 tests (cycle-safety, parent-elision, RelType tagging,
    component detection, defaultFoldedAtDepth) — all passing.
  • canvas: 9 tests (render shape, RelType edge attrs, host/selection
    rings, single-click debounce, fold toggle, navigate) — all passing.
  • catalyst-ui: bridge 11 tests + FlowPage 9 tests (testid updated
    flow-job-* → flow-node-* to match new contract) — all passing.
  • tsc --noEmit: clean on all three workspaces.

Constraints honoured:
  • Two-repo discipline: lands entirely in openova-io/openova (public).
  • No npm run build / playwright install / playwright test.
  • No kubectl apply / chart manifests touched.
  • No hardcoded URLs, regions, k3s flags, chart versions.
  • vitest --pool=threads --maxWorkers=2 --no-isolate everywhere.

Canonical-seam citations (ARCHITECT-FIRST):
  • Monorepo packages alias via tsconfig + vite resolve (no top-level
    `workspaces:` field exists in this monorepo today). Pattern
    mirrors core/console + products/axon path-mapping style.
  • CSS-variable theming follows the data-theme="light/dark" pattern
    already in catalyst-ui's globals.css (line 87+).

Agents #2/#3 (out of scope for this PR):
  • Agent #2: catalyst-api server that emits FlowMessage events on
    a SSE endpoint per CONTRACT.md.
  • Agent #3: replace lib/flow-bridge.ts with a real FlowAdapter
    against catalyst-api, then delete legacy flowLayoutOrganic +
    FlowCanvasOrganic.

Prov #34 readiness: the bridge forwards Job.region (when catalyst-api
begins emitting it) opaquely; perNodeHints feed region descriptors
to the new layout. Multi-region rendering is shape-ready end-to-end —
the catalyst-api just needs to emit region per job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(openova-flow): resolve react/d3-* from ui node_modules — restore /wizard rendering

The flow-core/flow-canvas alias targets in products/openova-flow/{core,canvas}/src/
have no sibling node_modules tree (workspaces wiring lands with Agent #2), so
Vite/Rolldown could not resolve their peer-dependency imports (react, react-dom,
d3-force, d3-drag, d3-selection) from those source files. The production build
failed with "Rolldown failed to resolve import 'react' from .../FlowLogFeed.tsx",
no dist/ was emitted, and the CI Playwright smoke lane therefore got 404 on
/wizard (which itself does NOT use FlowPage, but the whole bundle was missing).

Fix: alias each peer dep bare-spec to this package's local node_modules, and
add resolve.dedupe for react/react-dom. Also reorders @openova/* entries above
the '@' prefix entry — both are correct in @rollup/plugin-alias today since
matching is whole-name not prefix, but reordering follows the documented
"longer key first" convention defensively.

Verified:
- `npx vite build --mode production` succeeds (3.5s, dist/index.html + asset
  chunks emitted, wizard route in bundle).
- `npx vitest run` flow-related tests: src/lib/flow-bridge.test.ts +
  src/pages/sovereign/FlowPage.test.tsx → 2 files / 21 tests / all pass
  (baseline pre-fix had FlowPage.test.tsx failing).
- Other vitest failures present in baseline are pre-existing and flaky
  across runs; not introduced by this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(openova-flow): clarify alias-matching comment — the bare-spec react/d3 aliases are the real /wizard fix

The previous fix commit (3b19501) shipped two changes bundled together:

  1. Reorder `@openova/flow-core` + `@openova/flow-canvas` above the
     `@` alias (claimed: "@ would otherwise shadow @openova/...").
  2. Add bare-spec aliases for react / react-dom / d3-force / d3-drag /
     d3-selection pointing at this package's local node_modules.

Reading Vite's alias matcher (node_modules/vite/dist/node/chunks/node.js
line ~27349, function `matches`) shows that the `@` alias is matched
with EXACT equality OR `startsWith(@ + '/')` — so `@/foo` matches but
`@openova/flow-core` does NOT. The reorder was harmless but the comment
explaining it was misleading.
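
The matching rule described above (exact equality, or the alias key followed by `/`) can be restated as a small predicate. This is a Go sketch of the rule as the commit describes it, not Vite's source; `aliasMatches` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// aliasMatches reports whether a string alias key applies to an import
// specifier: either the whole specifier equals the key, or the specifier
// starts with the key followed by a slash.
func aliasMatches(key, importee string) bool {
	return importee == key || strings.HasPrefix(importee, key+"/")
}

func main() {
	fmt.Println(aliasMatches("@", "@/foo"))              // true
	fmt.Println(aliasMatches("@", "@openova/flow-core")) // false
}
```

Under this rule the order of the `@` and `@openova/*` entries cannot change which alias wins, which is why the reorder was harmless.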

The bare-spec aliases (#2) ARE the actual fix. The aliased
`@openova/flow-{core,canvas}` source files live OUTSIDE this package
and have no sibling node_modules tree (workspace wiring lands with
Agent #2). Vite resolution from inside those source files would walk
up the filesystem looking for `node_modules/d3-drag`, find nothing,
and throw "Failed to resolve import 'd3-drag'" — which surfaces as a
white-screen wizard at `/wizard`. The aliases redirect bare imports
to the absolute paths under catalyst-ui's own node_modules.

Verification on this commit:

  • `npx tsc --noEmit` from products/catalyst/bootstrap/ui — clean.
  • `npx vitest run --pool=threads --maxWorkers=2 --no-isolate
     src/pages/sovereign/FlowPage.test.tsx src/lib/flow-bridge.test.ts`
     — 2 files / 21 tests / all pass.
  • Reverting the prior fix and re-running the same vitest produces:
     "Failed to resolve import 'd3-drag' from
     ../../../openova-flow/canvas/src/FlowCanvas.tsx" — proves the
     aliases are load-bearing.
  • `vite build` / `vite dev` / playwright NOT run locally (Rule 7);
     CI on this push exercises the dev-server path the Playwright
     smoke uses.

No behavior change vs 3b19501 — this commit only rewrites the inline
comment block so the next maintainer sees the real reason the aliases
exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:36:51 +04:00
github-actions[bot]
1d0f810162 deploy: update catalyst images to b5181ec 2026-05-11 10:46:59 +00:00
e3mrah
b5181ec5d6
fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184) (#1388)
* fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184)

Raise the catalyst-gitea-token-mint pre-install hook's Gitea-API wait
loop from a hardcoded 60x5s (300s = 5m) budget to a values-driven knob
(giteaWait.iterations x giteaWait.intervalSeconds, default 168x5 =
840s = 14m). Pairs with HR install.timeout=15m to leave 60s slack for
the rest of the umbrella install action.

Root-cause trace (4-layer) on prov #33 (multi-region fsn1+hel1, cpx42
workerCount=0+autoscaler):

  bp-catalyst-platform HR (15m HR-timeout)
    -> Helm pre-install hook Job: catalyst-gitea-token-mint
         -> pod runs alpine/k8s curl loop:
              while ! curl gitea-http.gitea.svc.cluster.local; do
                sleep 5; i=$((i+1))
              done
         -> Hook gave up at iter 60 (= 5 min wall-time)
         -> Meanwhile gitea Pod is Pending: autoscaler-hcloud still
            scaling up workers in fsn1/hel1 (Fix #157 sizing default
            workerCount=0 means cold start).

Budget arithmetic (post-Fix #184 default):
  hook_wait_time = iterations x intervalSeconds = 168 x 5 = 840s (14 min)
  HR install.timeout =                                       900s (15 min)
  slack within HR budget =                                    60s ( 1 min)

The hook MUST complete strictly before HR remediates; the 60s slack
absorbs regular release resources rolling + post-install hooks after
the pre-install Job.

Canonical-seam citations:
- The hook lives at
  products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml
  (line ~303 pre-Fix), in the catalyst-gitea-token-mint Job's `args`
  block.
- Prior pattern: bp-keycloak chart 1.4.5 (Fix #146) introduced
  keycloakConfigCli.availabilityCheck.timeout as a values knob -
  same shape (chart-internal hook timing knob, distinct from the
  outer HR timeout). See platform/keycloak/chart/values.yaml:413.
- The HR's install.timeout=15m lives at
  clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:484 -
  the chart-internal wait budget MUST stay strictly less than this.

Recurring class: same family as Fix #127 (bp-cutover HR 15m),
Fix #131 (bp-gitea HR 15m), Fix #150 (bp-harbor HR 15m), Fix #154
(HR-timeout audit). Those bumped the HelmRelease install.timeout.
This bumps the chart-INTERNAL wait loop budget inside the
pre-install hook Job, which is a different (lower) seam.

Per INVIOLABLE-PRINCIPLES #4 (never hardcode) the budget is fully
runtime-configurable via .Values.giteaWait. Operators may shorten on
known-warm-cluster overlays or extend on air-gapped Sovereigns.

Changes:
- products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml:
  replace hardcoded `seq 1 60` + `sleep 5` with templated
  ITERATIONS/INTERVAL vars driven by .Values.giteaWait.{iterations,
  intervalSeconds}.
- products/catalyst/chart/values.yaml: add giteaWait block with
  defaults (iterations: 168, intervalSeconds: 5 = 14m budget).
- products/catalyst/chart/Chart.yaml: bump 1.4.139 -> 1.4.140 with
  changelog entry capturing the 4-layer trace + budget arithmetic.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
  HelmRelease pin 1.4.138 -> 1.4.140 (skip 1.4.139 which is a no-op
  packaging bump on main).

Verification:
- helm template renders cleanly (2799 lines, exit 0).
- Force-render with lookup gate bypassed shows ITERATIONS=168 +
  INTERVAL=5 substituted into the rendered Job args.
- --set giteaWait.iterations=240 --set giteaWait.intervalSeconds=10
  override confirmed to emit ITERATIONS=240 + INTERVAL=10.

Test plan (post-merge, on prov #34):
- kubectl logs -n catalyst-system catalyst-gitea-token-mint-* should
  emit `waiting for gitea api ($i/168)` instead of `($i/60)`.
- bp-catalyst-platform HR reaches Ready=True within the 15m HR
  budget (previously installFailures: 2 on prov #33).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-deps): reconcile pre-existing dep-graph audit drift

Two pre-existing drift items surfaced when dep-graph-audit ran on the
Fix #184 PR — both are in `main` already, not introduced here, but the
gate blocks any PR until the expected DAG matches the actual HRs.

1. `bp-catalyst-platform` (slot 13) — actual HR file declares
   `bp-crossplane-claims` as an additional dependsOn edge (added in
   chart-roll-rca iter-15, 2026-05-10, for the XRD-ordering race that
   caused the omantel.biz 90-min wedge). Update expected-deps to
   include it.

2. `bp-hcloud-ccm` (slot 55) — present on disk but absent from
   expected-deps. Cloud-provider seam, no upstream dependencies.
   Added with empty depends_on.

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-11 14:44:54 +04:00
github-actions[bot]
8af9ef6f34 deploy: update catalyst images to 4e6bec7 2026-05-11 09:09:43 +00:00
github-actions[bot]
fd42c2c44e deploy: update catalyst images to 957dcb3 2026-05-11 08:51:08 +00:00
e3mrah
957dcb3be1
fix(catalyst-ui): delete malformed import type from react line (Fix #181) (#1384)
Fix #180 (PR #1383) was merged with a `sed -i` error that produced
`import type  from 'react'` (an empty import binding), which is a syntax
error and broke the main build. This PR removes the malformed line entirely.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:49:06 +04:00
e3mrah
dfe0588fc6
fix(catalyst-ui): remove unused ReactNode import in DeploymentsList.test.tsx (#180) (#1383)
Fix #178 (PR #1382) introduced a new test file but left an unused
`ReactNode` import. The Containerfile's `tsc -b` (strict mode) fails with
TS6133, so the CI Build & Deploy Catalyst workflow was blocked and the
Fix #178 features (sortable cols + 2-mode delete) never reached production.

Caught live: `npx tsc --noEmit` (Fix Author's local check) does NOT
enforce TS6133, but production `tsc -b` does.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:47:38 +04:00
e3mrah
67eae51587
feat(catalyst): sortable deployments list + two-mode delete (Fix #178) (#1382)
Adds operator-friendly admin controls to /sovereign/deployments:

* Sortable column headers — click any of FQDN / Status / Started /
  Finished / Region to sort the table; second click toggles ASC↔DESC.
  Default is Started DESC (newest first). Sort is client-side; the
  list is small enough that round-tripping via ?sort= would only add
  latency without operator benefit.

* Per-row Delete button → opens DeleteDeploymentModal with TWO modes
  via a radio group:
  1. "Delete record only (mother)" — DELETE /api/v1/deployments/{id}.
     Removes the catalyst-api row (in-memory map + on-disk store +
     kubeconfig file) but LEAVES THE HETZNER SOVEREIGN RUNNING.
  2. "Delete record AND wipe Sovereign (kill the kid)" — POSTs to
     the existing /wipe endpoint (tofu destroy + Hetzner orphan
     purge + PDM release + record cleanup in one pass).

  Both modes require typing the deployment FQDN to confirm (same
  safety pattern WipeDeploymentModal uses, per Fix #46 / #914).
  Deep-delete additionally requires the Hetzner token, which flows
  straight through to the wipe handler (S3 + Hetzner creds never
  logged, per principle #10).

Backend:
* New DeleteDeployment handler (record-only). Refuses adopted (422)
  + in-flight (409) + unknown (404, matching the issue #689
  anti-enumeration posture). Idempotent: a second DELETE on a
  vanished row returns 404 cleanly.
* Route wired in cmd/api/main.go alongside the existing /wipe and
  /release-subdomain endpoints, inside the session-required group.
* 5 unit tests covering happy path / adopted / in-flight / unknown /
  terminal-wiped paths.

Frontend:
* DeploymentsList now mounts the new modal and invalidates the
  React Query cache (`catalyst, deployments, list`) on success so
  the table refreshes without a hard reload.
* 8 unit tests covering default sort order, header-click sort
  switching, ASC↔DESC toggle, status sort, delete button rendering
  (enabled for terminal rows, disabled for in-flight), modal open
  with both radios, conditional Hetzner-token field per mode.

Files:
* products/catalyst/bootstrap/api/internal/handler/deployments_delete.go
* products/catalyst/bootstrap/api/internal/handler/deployments_delete_test.go
* products/catalyst/bootstrap/api/cmd/api/main.go (route)
* products/catalyst/bootstrap/ui/src/components/CrudModals/DeleteDeploymentModal.tsx
* products/catalyst/bootstrap/ui/src/components/CrudModals/index.ts (export)
* products/catalyst/bootstrap/ui/src/pages/sovereign/DeploymentsList.tsx
* products/catalyst/bootstrap/ui/src/pages/sovereign/DeploymentsList.test.tsx

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:33:52 +04:00