Two bugs the operator hit on /sovereign/provision/<id>/jobs:
1) Phase-1 install-* Jobs rendered DISCONNECTED on the canvas —
helmwatch.Bridge doesn't persist Job.DependsOn (only the Phase-0
tofu chain + cluster-bootstrap is wired today). Pull HR.spec.dependsOn
from the live Watcher's informer cache via SnapshotComponents()
(ComponentSnapshot.DependsOn already populated by extractDependsOn)
at snapshot-time and emit finish-to-start edges from upstream
install-<dep> to install-<self>. Also add provisioner→bootstrap-kit
group-to-group finish-to-start so the Phase-0/Phase-1 ordering is
visible on the canvas.
2) Clicking a canvas node → "404 page not found" because
FlowPage.handleNodeDoubleClick passed the full
"<deploymentId>:install-X" id verbatim. The backend Store.GetJob
keys by bare jobName ("install-X"), so the colon-prefixed id missed
exact-match and JobDetail returned 404. Mirror useJobLinkBuilder
(JobsTable.tsx line 364): strip the "<deploymentId>:" prefix and
encodeURIComponent the remainder before pushing to the router.
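A minimal sketch of the id handling described in bug 2, assuming the canvas node id is either a bare jobName or "<deploymentId>:<jobName>". The helper name `jobPathSegment` is hypothetical and is not the real FlowPage or useJobLinkBuilder code.

```typescript
// Hypothetical helper: turn a canvas node id into the path segment the
// backend can resolve by bare jobName.
function jobPathSegment(nodeId: string, deploymentId: string): string {
  const prefix = deploymentId + ":";
  // Strip "<deploymentId>:" only when the id actually starts with it.
  const bare = nodeId.indexOf(prefix) === 0 ? nodeId.slice(prefix.length) : nodeId;
  // Escape characters that are unsafe inside a single path segment.
  return encodeURIComponent(bare);
}
```

Ids without the prefix pass through unchanged apart from escaping, so bare ids such as "cluster-bootstrap" keep working.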
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template
Without this env the proxy resolveFlowServerURL() falls back to
per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>) which
only exists on Sovereigns that already installed bootstrap-kit slot 56
with httproute=enabled. Every other catalyst-api deployment (mothership
contabo + Sovereigns that haven't reached cutover yet) returns 502 on
/api/v1/flows/{deploymentId}/snapshot, which is the live regression the
founder saw at console.openova.io: "No nodes to render."
The env points at the in-cluster Service DNS for the LOCAL openova-flow-
server. Both the mothership (catalyst-system or catalyst namespace) and
each Sovereign chroot run the bp-openova-flow-server chart with a local
Service, so this URL is correct for every cluster catalyst-api runs in.
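The resolution order described above can be sketched as follows. This is a TypeScript restatement for illustration only; the real resolveFlowServerURL() lives in the Go proxy, and its signature and the way the env is read here are assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Illustrative only: prefer the explicit env value, otherwise fall back to
// the per-deployment FQDN lookup the commit message describes.
function resolveFlowServerURL(env: Env, sovereignFQDN: string): string {
  const fromEnv = (env["OPENOVA_FLOW_SERVER_URL"] || "").trim();
  if (fromEnv !== "") {
    return fromEnv; // in-cluster Service DNS, no public route needed
  }
  return "https://openova-flow." + sovereignFQDN;
}
```

With the env set, the proxy never depends on the public HTTPRoute existing; with it unset, behavior is the old FQDN fallback.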
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy
Mothership canvas at /sovereign/provision/<id>/jobs was empty for the
first ~30 minutes of every fresh provision because the snapshot
endpoint went straight to https://openova-flow.<sovereignFQDN> which
can't serve until cilium + cert-manager + the HTTPRoute TLS cert are
all up on the chroot. The Phase-0 + Phase-1 lifecycle Jobs catalyst-api
ALREADY owns (tofu-init/plan/apply/output, flux-bootstrap,
install-bp-<chart>, ...) were invisible the whole time.
This change adds flowSnapshotFromJobs which assembles the canonical
FlowMessage envelope from h.jobsStore().ListJobs(deploymentID) — every
Job becomes a FlowNode with the legacy <deploymentId>:<jobName> id form
the canvas drill-down already expects, every Job.DependsOn becomes a
finish-to-start Relationship, every Job.ParentID becomes a contains
Relationship. HandleFlowSnapshot checks the local store first and
returns immediately when it has data; otherwise falls through to the
existing upstream proxy path.
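A rough sketch of the Job-to-envelope mapping, written in TypeScript for illustration (the real flowSnapshotFromJobs is Go). The field names on `JobRecord`, `SnapshotNode`, and `SnapshotRel` are assumptions, as is the premise that DependsOn and ParentID hold bare job names.

```typescript
interface JobRecord { jobName: string; status: string; dependsOn?: string[]; parentId?: string; }
interface SnapshotNode { id: string; label: string; status: string; }
interface SnapshotRel { from: string; to: string; type: "finish-to-start" | "contains"; }
interface Snapshot { type: "snapshot"; nodes: SnapshotNode[]; relationships: SnapshotRel[]; }

function snapshotFromJobs(deploymentId: string, jobs: JobRecord[]): Snapshot {
  // Legacy id form the canvas drill-down expects: "<deploymentId>:<jobName>".
  const idOf = (name: string): string => deploymentId + ":" + name;
  const nodes: SnapshotNode[] = [];
  const relationships: SnapshotRel[] = [];
  for (const job of jobs) {
    const id = idOf(job.jobName);
    nodes.push({ id: id, label: job.jobName, status: job.status });
    // Each dependency becomes an upstream -> self finish-to-start edge.
    for (const dep of job.dependsOn || []) {
      relationships.push({ from: idOf(dep), to: id, type: "finish-to-start" });
    }
    // A parent becomes a parent -> child "contains" edge.
    if (job.parentId) {
      relationships.push({ from: idOf(job.parentId), to: id, type: "contains" });
    }
  }
  return { type: "snapshot", nodes: nodes, relationships: relationships };
}
```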
HandleFlowStream gets the same treatment via flowStreamLocal: emit a
snapshot frame on connect AND every 3 seconds thereafter, plus a 15s
heartbeat. The OpenovaFlow consumer's reducer is idempotent on
snapshot replay so re-emitting an unchanged envelope is harmless;
in exchange the canvas reflects Job state transitions within ~3s
of when helmwatch.Bridge writes them.
No FE change required: the same /api/v1/flows/<id>/snapshot and
/stream endpoints serve the same envelope shape the chroot adapter
emits (products/openova-flow/adapter-flux/internal/types/flow.go),
using the same named SSE events, including 'snapshot' and 'heartbeat'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.
Two lock-step changes widen both bounds:
1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
chart genuinely needs >15m worst case when the full SME + Catalyst
service stack rolls cold.
2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
watch never terminates while helm-controller still has remediation
attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
was already wired (issue #538 baseline) — chart template now
declares the explicit "120m" value so the runtime knob is
discoverable for capacity-bounded environments. Per INVIOLABLE-
PRINCIPLES.md #4 the knob remains runtime-configurable.
New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override + field-
override tests still pass unchanged.
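The budget arithmetic behind the lock-step change, as a small sketch. The function names are illustrative, and "3 attempts" is taken from the 30m × 3 figure above.

```typescript
// Worst case: every remediation attempt runs to its full HR timeout.
function worstCaseInnerMinutes(hrTimeoutMin: number, attempts: number): number {
  return hrTimeoutMin * attempts;
}

// The outer watch must strictly outlast the inner retry chain.
function outerBudgetCoversRetries(outerMin: number, hrTimeoutMin: number, attempts: number): boolean {
  return outerMin > worstCaseInnerMinutes(hrTimeoutMin, attempts);
}
```

Raising the HR timeout to 30m while leaving the outer watch at 60m would fail this check (60 is not greater than 90), which is why both bounds move together.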
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (autoscaler pod log, prov #43 chroot):
W orchestrator.go:626 Node group workers is not ready for scaleup -
backoff with status: Scale-up timed out for node group workers after
15m2.273255226s
Hetzner API confirms autoscaler-spawned workers come up PUBLIC-ONLY:
workers-77439321e2047e3e public_net.ipv4=178.105.102.237 private_net=[]
workers-a6410e81b24cced public_net.ipv4=178.105.73.210 private_net=[]
The worker cloud-init (identical to Phase-0 user_data) issues
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.1.2:6443 ... sh -
against the CP's PRIVATE 10.0.1.2 IP. Without the 10.0.0.0/16 attachment
that URL is unreachable → k3s agent install silent-fails → node never
registers with apiserver → autoscaler 15m timeout → backoff → bp-catalyst-
platform Pending Pods never schedulable → chroot canvas tests blocked.
Fix: wire HCLOUD_NETWORK / HCLOUD_FIREWALL / HCLOUD_SSH_KEY env vars on
the cluster-autoscaler deployment so the Hetzner provider attaches every
scale-up VM to the SAME private network + firewall + ssh-key the Phase-0
Tofu module created (resource names: catalyst-<sov-fqdn-with-dashes>-net /
-fw / catalyst-<sov-fqdn-with-dashes>). Names flow:
Tofu (hcloud_network.main.name + hcloud_firewall.main.name +
hcloud_ssh_key.main.name)
→ cloudinit-control-plane.tftpl (3 new template vars)
→ /var/lib/catalyst/cloud-credentials-secret.yaml (3 new keys)
→ flux-system/cloud-credentials Secret
→ bp-cluster-autoscaler-hcloud HelmRelease valuesFrom (3 optional entries
with targetPath: cluster-autoscaler.extraEnv.HCLOUD_*)
→ upstream chart's deployment env
Chart bumped 1.2.0 → 1.3.0. New smoke-test gates (Cases 5+6) prevent
regression of the three env-var slots in chart values.yaml.
Reaffirms canonical seam: values flow through Tofu → cloud-init →
flux-system Secret → Flux valuesFrom → chart values → upstream env.
Never via kubectl patch, never via bespoke Go API calls.
Refs: prov #38/#39/#41/#43 omantel.biz scale-up backoff.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):
bp-catalyst-platform HR install.timeout=15m
→ Helm pre-install hook: qa-finalizer-strip Job (weight -99)
→ Pod requests 50m CPU + 64Mi memory (tiny)
→ BUT no tolerations → scheduler restricted to worker
→ worker cpx32 (8vCPU/16GB) at 99% CPU requests
(7980m of 8000m allocated) after bootstrap-kit fan-out
→ FailedScheduling: "0/2 nodes are available: 1
Insufficient cpu, 1 node(s) had untolerated taint
{node-role.kubernetes.io/control-plane: true}"
→ autoscaler triggers scale-up worker 2→3 → "1 in backoff
after failed scale-up" → still Pending → 15m timeout
→ InstallFailed → Flux uninstall+rollback → installFailures: 3
→ Flux gives up entirely
Live evidence quoted from chroot kubeconfig on prov #41:
- bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
message="Running 'install' action with timeout of 15m0s"`
- HR `Released=False, reason=InstallFailed, message="Helm install
failed for release catalyst-system/catalyst-platform with chart
bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
* timed out waiting for the condition"`
- Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
`Warning FailedScheduling 108s default-scheduler 0/2 nodes are
available: 1 Insufficient cpu, 1 node(s) had untolerated taint
{node-role.kubernetes.io/control-plane: true}`
- Worker `Allocated cpu 7980m (99%) of 8000m capacity`
- Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)
Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
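A deliberately simplified model of why the hook was unschedulable and why the toleration fixes it. It only considers CPU requests and NoSchedule taints (no memory, affinity, or preemption), and all type and function names are hypothetical. The node figures are the prov #41 numbers quoted above.

```typescript
interface Taint { key: string; effect: string; }
interface NodeInfo { name: string; capacityMilli: number; requestedMilli: number; taints: Taint[]; }
interface PodSpec { cpuMilli: number; toleratedKeys: string[]; }

// Returns the names of nodes where the pod fits on CPU and tolerates
// every NoSchedule taint.
function schedulableNodes(pod: PodSpec, nodes: NodeInfo[]): string[] {
  return nodes
    .filter(function (n) {
      const fitsCpu = n.capacityMilli - n.requestedMilli >= pod.cpuMilli;
      const toleratesAll = n.taints.every(function (t) {
        return t.effect !== "NoSchedule" || pod.toleratedKeys.indexOf(t.key) !== -1;
      });
      return fitsCpu && toleratesAll;
    })
    .map(function (n) { return n.name; });
}

const prov41Nodes: NodeInfo[] = [
  { name: "worker", capacityMilli: 8000, requestedMilli: 7980, taints: [] },
  {
    name: "control-plane", capacityMilli: 8000, requestedMilli: 635,
    taints: [{ key: "node-role.kubernetes.io/control-plane", effect: "NoSchedule" }],
  },
];
```

Under this model a 50m pod with no tolerations fits nowhere (the worker has 20m free, the control-plane is tainted), while the same pod tolerating the control-plane taint fits on the control-plane node.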
Why prior fixes didn't suffice:
- Fix#114 introduced this hook to break a finalizer-deadlock loop
on prov #9. Correct fix for that wedge; never anticipated worker
saturation as a scheduling failure mode for the hook itself.
- Fix#138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
qa-cnpg-status-seed hooks (weight 0/post-install) to regular
release resources to break a circular DAG dep. Different hook
surface.
- Fix#184 (chart 1.4.140) raised the gitea-token-mint pre-install
hook (weight +10) wait budget for cold-start autoscaler. That
hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
never starts, the +10 hook never runs.
Recurring class: same family as Fix#114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:
- Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
redirect for deprecated Bitnami images, 2025-08 cutover
documented at platform/self-sovereign-cutover/chart/values.yaml:
252) → harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4 —
the canonical alpine-based kubectl image already used by sibling
hook catalyst-gitea-token-mint (Fix#163). MIRROR-EVERYTHING +
ARCHITECT-FIRST rules.
Coordinator follow-up tickets:
- Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
(qa-cnpgpair-status-seed) still reference bitnamilegacy/kubectl
:1.29.3 — same Bitnami-deprecation class. Out of scope for this
Fix (not part of the recurrence cluster); flagged for a sweep.
- Worker cpx32 sizing may be undersized for the bootstrap-kit fan-
out on omantel.biz — separate sizing ticket, not blocking.
Changes:
- products/catalyst/chart/templates/qa-fixtures/pre-install-
finalizer-strip.yaml: add tolerations + priorityClassName;
switch image to alpine/k8s:1.31.4. Inline doc comments explain
the 4-layer trace and the Fix #114/#138/#184 history.
- products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
changelog entry capturing root cause + budget arithmetic.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
bump HR pin 1.4.140 → 1.4.141.
Verification:
- helm template renders cleanly (exit 0, ~6700 lines).
- kubectl apply --dry-run=client validates the rendered Job
manifest (job.batch/qa-finalizer-strip created (dry run)).
- Rendered Job contains tolerations[control-plane Exists NoSchedule],
priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix#144 raised zoneBootstrap.activeDeadlineSeconds 300s → 840s after
prov #22 hit a 5m DeadlineExceeded on the bp-powerdns post-install hook.
That fix was insufficient: prov #37 + #38 (chroot omantel.biz, 2026-05-12)
both wedged on the SAME chart slot with `BackoffLimitExceeded`, NOT
`DeadlineExceeded`. The deadline never got a chance to fire.
Trace from prov #38 chroot (`KUBECONFIG=/tmp/prov38.kubeconfig kubectl
get hr bp-powerdns -o yaml`):
status:
Helm install failed for release powerdns/powerdns with chart
bp-powerdns@1.2.2: failed post-install: 1 error occurred:
* job powerdns-zone-bootstrap failed: BackoffLimitExceeded
Pod events for powerdns-zone-bootstrap-tq7qq:
59m Started container zone-bootstrap
56m Back-off restarting failed container zone-bootstrap
55m Job has reached the specified backoff limit
Root cause walked end-to-end (per CLAUDE.md TRACE rule):
TEST: bp-powerdns HR Ready=True
↑
HR: Helm install succeeds (post-install Job exits 0)
↑
Zone-bootstrap Job: curl POST succeeds
↑
powerdns:8081 Service: reachable (has Ready endpoints)
↑
powerdns Deployment: Pods Ready (3 replicas) ← Pending, blocked here
↑
CNPG cluster: pdns-pg-app Secret exists
↑
pdns-pg-1-initdb Pod: scheduled, Running, Completed ← Pending too
↑
Worker node has capacity ← 99% CPU requested
The zone-bootstrap container curl'd `http://powerdns:8081`, hit
"connection refused" (empty Service endpoints), exited 7, container
restarted under `restartPolicy: OnFailure`. After 6 Kubernetes-level
backoffs (≈10min wall-time with exponential delay), the Job declared
`BackoffLimitExceeded` — well before activeDeadlineSeconds=840s
(14min) could even consider firing.
Fix#144 was directionally right (the upstream IS slow on cold k3s) but
operated on the wrong knob. The container's outer-loop retry budget is
bounded by backoffLimit × backoff-delay, not by activeDeadlineSeconds.
Bumping only the deadline left the BackoffLimit ceiling unchanged.
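To make the ceiling concrete, here is a sketch of the cumulative backoff delay. It assumes delays that start at 10s, double on each retry, and cap at 360s; the commit does not state the exact schedule, so treat these parameters as an assumption.

```typescript
// Sum of exponential backoff delays across `retries` attempts.
function totalBackoffSeconds(retries: number, baseSeconds: number, capSeconds: number): number {
  let total = 0;
  let delay = baseSeconds;
  for (let i = 0; i < retries; i++) {
    total += Math.min(delay, capSeconds);
    delay *= 2;
  }
  return total;
}
```

Under that assumed schedule, six retries add up to 10 + 20 + 40 + 80 + 160 + 320 = 630s, which is in line with the ≈10min figure above and below the 840s deadline, so the backoff limit is reached first.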
Architectural fix (this commit):
1. Move the wait-for-API loop INSIDE the container (one Pod, one inner
poll loop, restartPolicy=Never). The inner loop polls
GET /api/v1/servers every 10s until HTTP 200, bounded by new
`apiReadyTimeoutSeconds` (default 600s = 10min). Now ONE container
run owns the full wait budget instead of N short-lived containers
racing the backoff timer.
2. restartPolicy: OnFailure → Never. The container script handles its
own retry; Kubernetes-level backoff is reserved for genuinely
transient pod failures (image-pull, OS eviction) where the Job-level
backoffLimit=6 still triggers a fresh Pod.
3. Surface POWERDNS_API_READY_TIMEOUT_S env var so operators on slower
clusters can raise the inner deadline without forking the chart
(per docs/INVIOLABLE-PRINCIPLES.md #4).
4. New value `zoneBootstrap.apiReadyTimeoutSeconds` (default 600s).
Sits below activeDeadlineSeconds (840s) so the zone-creation phase
keeps ≥240s of headroom AFTER the API comes Ready.
Curl status handling in the wait loop:
200 → API up, proceed to bootstrap
401|403 → auth failure, FATAL (no retry — operator misconfig)
000|5xx|... → transient, sleep & retry until inner deadline
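The wait loop's decision table can be sketched as below. The real loop is a shell script inside the container; this TypeScript version is illustrative, uses an injected `probe` function in place of curl (returning 0 for a connection failure), and counts virtual seconds instead of sleeping.

```typescript
type Outcome = "ready" | "fatal" | "timeout";

// Mirrors the status table: 200 proceeds, 401/403 aborts, everything else retries.
function classifyStatus(code: number): "ready" | "fatal" | "retry" {
  if (code === 200) return "ready";
  if (code === 401 || code === 403) return "fatal";
  return "retry";
}

// Polls until ready or fatal, or until the next sleep would exceed the budget.
function waitForApi(probe: () => number, timeoutS: number, intervalS: number): { outcome: Outcome; waitedS: number } {
  let waitedS = 0;
  while (true) {
    const verdict = classifyStatus(probe());
    if (verdict === "ready" || verdict === "fatal") {
      return { outcome: verdict, waitedS: waitedS };
    }
    if (waitedS + intervalS > timeoutS) {
      return { outcome: "timeout", waitedS: waitedS };
    }
    waitedS += intervalS; // stands in for the script's sleep between polls
  }
}
```

One container run owns the whole budget here, which is the point of the change: a refused connection costs one poll interval rather than one container restart.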
Files changed:
- platform/powerdns/chart/Chart.yaml 1.2.2 → 1.2.3 + history
- platform/powerdns/chart/values.yaml + apiReadyTimeoutSeconds knob
- platform/powerdns/chart/templates/
zone-bootstrap-job.yaml inner wait-for-API loop;
restartPolicy: Never
- clusters/_template/bootstrap-kit/
11-powerdns.yaml pin to 1.2.3 + HR comment
Why this is sufficient where Fix#144 was not:
Fix#144 worked the chart-level deadline. This commit works the
inner-loop ownership — the wait budget is now owned by the script
inside the container, not by the Job spec arithmetic
(backoffLimit × backoff-delay). The Job's outer activeDeadlineSeconds
still caps the worst-case runtime (no runaway poll), but the script
now actually GETS to use it.
Verification:
- helm template renders cleanly (deps build OK, empty-zones short-
circuit preserved, non-empty zones render Job + RBAC + Audit CM)
- kubectl create --dry-run=client --validate=false: 5/5 resources
created (sa, role, rb, cm, job)
- chart 1.2.3 pinned in clusters/_template/bootstrap-kit/11-powerdns.yaml
Companion infrastructure note (NOT addressed by this commit, flagged
for Coordinator):
The deepest layer of the trace stack is worker capacity. Prov #38's
single cpx32 worker (8 vCPU / 16 GB) is at 99% CPU requested. The
cluster-autoscaler attempted 2→3 scale-up but is in backoff because
two unscheduled pods (gitea/gitea-* PV affinity conflict from a
previous wedged install; trivy-system/node-collector NodeAffinity)
poison the autoscaler's "can the template node fit" check. Even with
this chart fix in place, the powerdns Deployment cannot become Ready
until either:
(a) the worker autoscales successfully (gitea PV migrated / trivy
taints relaxed), or
(b) worker_count is bumped from 2 to 3 in the provisioning body, or
(c) qa_worker_size is bumped to cpx42.
This chart fix ensures bp-powerdns survives a slow CNPG cold-start.
It does NOT fix a fundamentally undersized cluster. Coordinator next
step: reprov with worker_count=3 OR qa_worker_size=cpx42 + this chart
landed. Either should converge.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chroot bp-openova-flow-emitter posts to
http://openova-flow-server.catalyst-system.svc.cluster.local:8080
but the bp-openova-flow-server chart's Service listens on port 80
(Service port 80 forwards to the container's targetPort 8080).
Result: every event POST from the chroot emitter times out on dial, the
chroot's openova-flow data plane never populates, and canvas pages
viewing the chroot render empty.
Same fix as PR #124 on mothership emitter-helmrelease.yaml (private
repo). Slot 57 in the bootstrap-kit template was missed in that round.
Live regression on prov #37 (2026-05-11): chroot has 38 bp-* HRs True
but openova-flow snapshot is empty because emitter can't reach server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TC-035 (iter-2, 2026-05-11): OpenovaFlow rows merged into JobsPage
(PR #1413) lost their region-prefixed identity in the URL. The link
builder sliced the "<prefix>:" segment off every id with a colon —
intended to strip the legacy "<deploymentId>:install-keycloak" form,
but it also stripped "contabo:bp-openova-flow-server" → bare
"bp-openova-flow-server" in the href. The matrix asserts the
verbatim form "/jobs/contabo:bp-openova-flow-server" must appear in
the rendered DOM.
Fix: stop slicing. `encodeURIComponent` still escapes unsafe path
chars (`/` for live K8s job ids like "job/syft-grype/..."), then we
restore `:` because RFC 3986 permits it as a path-segment `pchar`.
FlowPage canvas navigation (PR #1411) and JobDetail flow-fallback
(PR #1412) already pass on the colon-present form, so this round-
trips end-to-end. Legacy "bp-cilium" / "cluster-bootstrap" hrefs are
unchanged (no `:` to encode). The previously-stripped legacy form
"<deploymentId>:install-keycloak" now lands as the full id in the
URL, and JobDetail's `jobsById` lookup is already keyed by BOTH the
canonical id AND the bare jobName (JobDetail.tsx:124-131), so the
resolution path is preserved.
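A minimal sketch of the encode-then-restore step. The function name `jobHref` and the bare "/jobs/" prefix are simplifications for illustration; the real link builder produces a longer route.

```typescript
// Escape the whole id as one path segment, then put ":" back, since
// RFC 3986 allows ":" inside a path segment.
function jobHref(jobId: string): string {
  return "/jobs/" + encodeURIComponent(jobId).replace(/%3A/g, ":");
}
```

Ids without a colon are unaffected by the restore step, and "/" is still escaped so a live K8s job id stays a single segment.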
Test coverage: new Case 4 in JobsPage.flow-merge.test.tsx asserts
the openova-flow row's anchor `href` contains
`/jobs/contabo:bp-openova-flow-server` and is NOT the bare-jobName
form. All 4 flow-merge cases PASS. The 3 pre-existing failures in
JobsPage.test.tsx (back-to-apps href, canonical-columns header,
Show-as-Flow button) are the documented iter-2 baseline — untouched
by this change.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TC-035 iter-1 FAIL (2026-05-11): /sovereign/provision/12e194090631a885/jobs
asserts rows for the openova-flow-server + openova-flow-emitter HRs but the
JobsTable only sourced from /api/v1/deployments/<id>/jobs (legacy event
stream) — verified live: GET /v1/flows/<id>/snapshot returns 2 leaf nodes
(contabo:bp-openova-flow-server, contabo:bp-openova-flow-emitter) whose ids
NEVER appear in the legacy /jobs payload. Sovereigns whose state lives only
in the OpenovaFlow snapshot silently drop these rows.
Fix: wire `useFlowStream({deploymentId})` alongside the existing legacy
reducer + live-jobs backfill. Synthesize a Job stub per FlowNode via
`synthesizeJobFromFlowNode` (PR #1412 — same adapter JobDetail's
flow-fallback path uses) and append the rows whose ids are absent from the
legacy set. Legacy wins dedup on id collisions because it carries real
execution timeline / appId / parentId / dependsOn — the flow synth is
intentionally a minimal stub.
Behavior unchanged for Sovereigns without an active flow stream: empty
FlowNode map → empty `flowJobs` → `legacyMerged` passes through untouched.
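The merge rule can be sketched as follows. `RowLike` and `mergeRows` are hypothetical names, and the rows carry only the fields needed to show the dedup.

```typescript
interface RowLike { id: string; source: "legacy" | "flow"; }

// Legacy rows are kept as-is; flow rows are appended only when their id
// is absent from the legacy set, so legacy wins on collisions.
function mergeRows(legacy: RowLike[], flow: RowLike[]): RowLike[] {
  // Prototype-free object so ids like "constructor" cannot collide.
  const seen: Record<string, boolean> = Object.create(null);
  for (const row of legacy) {
    seen[row.id] = true;
  }
  const extras = flow.filter(function (row) { return !seen[row.id]; });
  return legacy.concat(extras);
}
```

With an empty flow list the result is the legacy list unchanged, which is the no-behavior-change case.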
Test coverage (JobsPage.flow-merge.test.tsx — 3 cases, all PASS):
1. Legacy 5 / flow empty → 5 rows, no behavior change.
2. Legacy 5 / flow has 2 distinct ids → 7 rows with the contabo:bp-*
ids present.
3. Legacy 5 / flow has 1 id-collision + 1 new → 6 rows, legacy wins
dedup (DOM scan asserts the colliding testid appears exactly once).
Validation:
vitest: 3/3 PASS on new file; 13 prior tests in JobsPage.test.tsx
unchanged from origin/main baseline (3 unrelated pre-existing failures
in chrome/columns/Show-as-Flow tests, untouched by this fix).
tsc --noEmit -p tsconfig.app.json: 27 errors, ALL pre-existing in
@openova/flow-canvas + @openova/flow-core workspaces — zero new errors
introduced.
Canonical seam reused (no new code paths):
- @/lib/openflow-adapter-sse → useFlowStream (FlowPage / JobDetail share)
- @/lib/synthesizeJobFromFlowNode (PR #1412 helper)
- @/lib/jobs.types → Job (single source of truth)
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JobDetail built `jobsById` from the legacy useDeploymentEvents reducer
+ useLiveJobsBackfill polling. For Sovereigns whose state lives ONLY in
the openova-flow snapshot (post-flux-only flow, fresh chroot before the
catalyst-api event bridge has emitted any rows), that lookup misses and
JobDetail short-circuited to "Job not found" — never mounting FlowPage,
the very surface that would have painted the node.
Verified live this turn against deployment 12e194090631a885:
GET /api/v1/flows/12e194090631a885/snapshot → 200, 2 leaf nodes
GET /api/v1/deployments/12e194090631a885/jobs/<nodeId> → 404
This blocks 21 of 26 iter-1 FAILs on the OpenovaFlow canvas test
matrix (TC-019/020/021/023/024/025/027/028/033/034/036/037/038/039/040
/041/042/053/054/060/064).
Fix:
• JobDetail now reads the same useFlowStream hook FlowPage uses.
• When `jobsById[jobId]` is undefined, look up the node in the flow
snapshot's nodes Map. If found, synthesize a flat Job stub from the
FlowNode (id, label, status) so the canvas mounts with the right
hostJobId.
• Behaviour for Sovereigns WITH an active event stream is unchanged
— the legacy lookup wins and the synth stub is never read.
• "Job not found" panel renders ONLY when BOTH lookups miss.
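A sketch of the two-step lookup. The names are hypothetical (this is not the PR #1412 helper), the status vocabulary is an assumed placeholder, and plain records stand in for the snapshot's nodes Map.

```typescript
type StubStatus = "pending" | "running" | "succeeded" | "failed"; // assumed vocabulary
interface StubJob { id: string; jobName: string; status: StubStatus; startedAt: string | null; }
interface NodeLite { id: string; label?: string; status?: string; }

const KNOWN_STATUSES: StubStatus[] = ["pending", "running", "succeeded", "failed"];

// Map an arbitrary node status onto the job vocabulary, defaulting to "pending".
function coerceStatus(raw: string | undefined): StubStatus {
  for (const s of KNOWN_STATUSES) {
    if (s === raw) return s;
  }
  return "pending";
}

// Legacy lookup first; fall back to a minimal stub built from the flow node.
function resolveJob(
  jobId: string,
  jobsById: Record<string, StubJob | undefined>,
  flowNodes: Record<string, NodeLite | undefined>
): StubJob | null {
  const legacy = jobsById[jobId];
  if (legacy) return legacy;
  const node = flowNodes[jobId];
  if (!node) return null; // both lookups missed: render "Job not found"
  // Absent timestamps stay null rather than being filled with invented data.
  return { id: node.id, jobName: node.label || node.id, status: coerceStatus(node.status), startedAt: null };
}
```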
Tests:
Added JobDetail.flow-fallback.test.tsx (vitest, 3 cases):
1. Legacy has the job → FlowPage renders, no fallback.
2. Legacy empty, flow snapshot has the node → FlowPage renders
via synth job (the iter-1 FAIL scenario).
3. Both empty → "Job not found" panel.
All 3 new + 5 existing JobDetail tests pass.
No tsc regressions (27 → 27 baseline errors, all pre-existing
in flow-canvas/flow-core packages).
Refs INVIOLABLE-PRINCIPLES.md:
#1 (waterfall): target-state fallback, no MVP "show loading" stub.
#2 (no compromise): no field is faked with plausible data; absent
timestamps land as null / 0 so fmtTime renders "—".
#4 (never hardcode): the synth helper coerces FlowNode.status into
the JobStatus vocabulary; the label falls back to the node id when
`label` is empty.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of live crash 'TypeError: t.relationships is not iterable':
the Go server uses omitempty JSON tags on FlowMessage so empty slices
are dropped from the wire (snapshot with 2 nodes + 0 rels arrives as
'{"type":"snapshot","nodes":[...]}' with no 'relationships' key).
The reducer iterates msg.relationships, msg.nodes, msg.ids, msg.pairs
without nullish guards → crashes on first frame.
Fix: add a defensive (?? []) to every reducer iteration, so an absent
key is treated as an empty list. State shape is unchanged and snapshot
replay stays idempotent.
Observed bundle: index-CEnQMVBy.js@2285:51356.
Snapshot proven empty-rel: GET /v1/flows/12e194090631a885/snapshot
returns {type:'snapshot',nodes:[2 items]} with relationships key absent.
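A reduced sketch of the guarded iteration, assuming a snapshot frame whose `nodes` and `relationships` keys may both be absent. The state shape and names here are illustrative, not the real reducer.

```typescript
interface WireNode { id: string; }
interface WireRel { id: string; }
interface SnapshotFrame { type: "snapshot"; nodes?: WireNode[]; relationships?: WireRel[]; }
interface CanvasState { nodes: Record<string, WireNode>; rels: Record<string, WireRel>; }

// Rebuild state from a snapshot; a missing key behaves like an empty list.
function applySnapshot(frame: SnapshotFrame): CanvasState {
  const state: CanvasState = { nodes: {}, rels: {} };
  for (const n of frame.nodes ?? []) {
    state.nodes[n.id] = n;
  }
  for (const r of frame.relationships ?? []) {
    state.rels[r.id] = r;
  }
  return state;
}
```

Without the guards, the second loop would throw on a frame like the one observed above, because the key is dropped on the wire when the slice is empty.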
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI workflows built and pushed both images on the PR #1411 merge; the
chart-tag bump commits didn't auto-land. Bump both manually so Flux
rolls the adapter with the synthetic-phase-removal logic and the
server keeps consistent versioning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manual bump — Build & Deploy Catalyst workflow's deploy job lost the
push race twice on PR #1411 merge. Images exist in GHCR; this commit
lands the template+values bump so Flux on contabo-mkt reconciles and
the natural-view canvas restore (FlowCanvasOrganic + fold badges +
depth chip) takes effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Founder rejected the lane-layout + synthetic-phase scaffolding shipped
via PR #1399/#1400/#1407. This commit restores the founder-tuned
natural view (FlowCanvasOrganic) and adds the per-bubble fold-
disclosure badge + top-right depth chip on top of it.
Adapter (products/openova-flow/adapter-flux/):
- mapper.go: BuildFromHR now returns ONE leaf FlowNode + finish-to-
start edges from spec.dependsOn only. Deleted BuildRegionNode,
BuildPhaseNodes, BuildPhaseEdges, phaseLabels, phaseSortKey,
AllPhaseSuffixes, PhaseSuffix* constants, derivePhase, PhaseLabel,
PhaseSortKey. Node-id separator changed "/" → ":" so ids do not
collide with URL routing (founder hit "Not Found" drilling into
contabo/phase-0).
- hr_informer.go: dropped bootstrap(), tracker, nodeGroups,
reemitGroups(), buildGroupNode(). handle() is now single-leaf
upsert + dependsOn edges.
- rollup.go: deleted entirely (StatusTracker only existed for
synthetic group rollups).
- mapper_synthetic_test.go + rollup_test.go: deleted; mapper_test.go
updated for the ":" separator + no-synthetic-rels assertions.
UI (products/catalyst/bootstrap/ui/):
- FlowPage.tsx: switched from @openova/flow-canvas's FlowCanvas back
to FlowCanvasOrganic. Dropped lane-layout (regionDescriptorsFromFlow),
defaultFoldedAtDepth from @openova/flow-core, FoldControls chrome
strip. Kept useFlowStream + ?folded=/?depth= URL contract.
- flowStreamToOrganic.ts (new): bridges live SSE state to the Job[]
+ hints + region/family descriptors flowLayoutOrganic expects.
Treats `contains` rels as parent-child and FS/SS/FF/SF/triggers as
dependsOn.
- FlowCanvasOrganic.tsx: ADDITIVE optional props onFoldToggle,
badgeCounts, nodeActions, onNodeAction. Renders per-bubble "⊕ K"/
"⊖" disclosure badge on group bubbles when wired; right-click
opens a small action menu. Existing call sites are unchanged.
- Depth chip: ◀ L<n>/<max> ▶ pinned top-right of canvas host,
visible only when real groups exist in the data. Esc clears
manual fold overrides.
Verification:
- go build ./... in adapter-flux: clean
- go test ./... in adapter-flux: PASS (12 tests)
- tsc --noEmit on bootstrap/ui: clean
- vitest FlowPage + FlowCanvasOrganic.bounded: 25/25 PASS
- vitest JobDetail + distribution + flowLayoutOrganic + flow-bridge:
27/27 PASS
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(openova-flow-canvas): fold UX + lane layout + actions menu + cross-flow nav (Agent #9)
Wires the 6 founder-locked canvas views agreed 2026-05-11:
• Lane layout — `meta.layout: 'lane-vertical' | 'lane-horizontal'`
on a `contains`-parent renders the group as a rounded-rect
swim-lane; children pack inside (L→R horizontal, T→B vertical).
Lanes nest: region (vertical) → phase (horizontal) → HR bubbles.
Falls back to organic d3-force when no group declares a layout
hint, so single-region provisions look unchanged.
• Child-count badge `[N]` on every foldable parent — recursive
descendant count through `contains` edges, surfaced via
PositionedNode.descendantCount. Renders independent of fold
state per the founder-locked View 4 ASCII (region keeps `[43]`
even when expanded to phases only).
• Hover dim — onMouseEnter/Leave on a node dims non-neighbor
nodes + non-incident edges to 35% opacity. Selection / host /
neighbor rings keep full opacity per spec precedence.
• Right-click → adapter actions menu — new `actions` +
`onNodeAction` props on FlowCanvasProps. Renders the supplied
NodeAction[] (filtered by per-action `enabled` predicate) in a
NodeActionsMenu (click-outside + Esc dismissal, mirrors
ProfileMenu's canonical seam).
• `triggeredBy` cross-flow badge — when FlowInstance.triggeredBy
is non-empty, a top-left banner lists the parent flows with a
`[↗ open flow]` button → onNavigateFlow callback.
• Cross-flow edges — when a Relationship's `toFlowId` references a
flow not in the current canvas, the source node renders a
"→ flow" tag that calls onNavigateFlow.
FlowPage wires onNodeAction to POST /api/v1/flows/{id}/nodes/{nodeId}
/actions/{actionId} and onNavigateFlow to the router. Default action
list (Retry/Suspend/View logs) supplied by FlowPage; adapters can
override.
Canonical seam citations (per ARCHITECT-FIRST):
• core/src/layout.ts (Agent #1) — pure layout function. Extended
with LaneDescriptor[] + descendantCount, cycle-safe lane-depth
walks reusing the existing visited-set pattern. Lane geometry
stays in canvas (the layout is pure topology).
• widgets/auth/ProfileMenu.tsx — canonical click-outside + ESC
dismissal pattern. NodeActionsMenu mirrors this verbatim so we
stay consistent without a new radix/headless-ui dependency.
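The recursive child-count with a visited set can be sketched like this. It is an illustrative standalone function over a flat `contains` edge list, not the actual code in core/src/layout.ts.

```typescript
interface ContainsEdge { parent: string; child: string; }

// Counts every node reachable from `root` through contains edges.
// The visited set makes the walk terminate even if the edges form a cycle.
function descendantCount(root: string, edges: ContainsEdge[]): number {
  const visited: Record<string, boolean> = Object.create(null);
  visited[root] = true;
  const stack: string[] = [root];
  let count = 0;
  while (stack.length > 0) {
    const current = stack.pop() as string;
    for (const edge of edges) {
      if (edge.parent === current && !visited[edge.child]) {
        visited[edge.child] = true;
        count++;
        stack.push(edge.child);
      }
    }
  }
  return count;
}
```

Because the count depends only on the edge list, it is independent of fold state, which matches the badge staying constant when a region is expanded to phases only.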
Tests: 25 core (was 20, +5 for lanes + descendantCount) + 22 canvas
(was 9, +13 for lane layout, badge math, hover dim, action menu,
triggeredBy banner, cross-flow tag). FlowPage tests still 8/8 green.
No vite/next builds (Rule 7). No kubectl writes (Rule 11). Lane
geometry has zero domain knowledge — the canvas never reads "phase"
or "region" as words; everything is `meta.layout` + `meta.isGroup`
+ `contains` edges driven by the adapter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(openflow-adapter-sse): subscribe to NAMED SSE events not just onmessage
Root cause of canvas "No nodes to render": the openova-flow-server
emits SSE frames with named event types per the contract:
event: snapshot
event: upsert-nodes
event: upsert-rels
...
EventSource's `onmessage` handler ONLY fires for the default
("message") event type. addEventListener with the explicit name is
required for named events. The hook only had `next.onmessage = onMessage`
so EVERY frame the server emitted was silently dropped; the local state
stayed at the initial empty value and FlowCanvas rendered the empty
fallback message.
Verified live: in-browser test showed onmessage_count=0,
addEventListener('snapshot') count=1 — exactly one snapshot frame
arrived but the hook ignored it.
Fix: register addEventListener for every event name in the contract
(snapshot, upsert-flow, upsert-nodes, upsert-rels, delete-nodes,
delete-rels, heartbeat). onmessage retained as defensive default.
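A sketch of the registration pattern, written against a minimal `EventSourceLike` interface (an assumption made here so the example is testable without a browser). The callback shape is also illustrative.

```typescript
interface FrameEvent { data: string; }
interface EventSourceLike {
  onmessage: ((ev: FrameEvent) => void) | null;
  addEventListener(type: string, listener: (ev: FrameEvent) => void): void;
}

// Event names listed in the commit message above.
const FLOW_EVENT_NAMES: string[] = [
  "snapshot", "upsert-flow", "upsert-nodes", "upsert-rels",
  "delete-nodes", "delete-rels", "heartbeat",
];

function subscribeFlowEvents(source: EventSourceLike, onFrame: (name: string, data: string) => void): void {
  // Named events are only delivered to listeners registered under that name.
  for (const name of FLOW_EVENT_NAMES) {
    source.addEventListener(name, function (ev) { onFrame(name, ev.data); });
  }
  // Unnamed frames still arrive through onmessage; kept as a defensive default.
  source.onmessage = function (ev) { onFrame("message", ev.data); };
}
```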
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #34 live regression 2026-05-11: emitter DaemonSet posted to
https://openova-flow.omantel.biz, every POST EOF'd at the TLS layer
because bp-catalyst-platform InstallFailed → no wildcard *.omantel.biz
cert was issued → no Cilium Gateway listener for the host.
The emitter lives IN the cluster — same k3s as openova-flow-server. It
must use the cluster-local Service DNS. The public HTTPRoute exists
for the MOTHERSHIP catalyst-api proxy (Agent #8 PR #1405), not for the
in-cluster DaemonSet.
Bootstrap-kit slot 57 now overrides flowServerUrl to
http://openova-flow-server.catalyst-system.svc.cluster.local:8080. No
TLS, no public DNS, no Gateway dependency — just a cluster-internal
hop. The chart default (`""`) is unchanged so out-of-cluster emitters
(future Temporal/Argo adapters running on different infra) can supply
their own URL.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the 6 founder-locked canvas views agreed 2026-05-11:
• Lane layout — `meta.layout: 'lane-vertical' | 'lane-horizontal'`
on a `contains`-parent renders the group as a rounded-rect
swim-lane; children pack inside (L→R horizontal, T→B vertical).
Lanes nest: region (vertical) → phase (horizontal) → HR bubbles.
Falls back to organic d3-force when no group declares a layout
hint, so single-region provisions look unchanged.
• Child-count badge `[N]` on every foldable parent — recursive
descendant count through `contains` edges, surfaced via
PositionedNode.descendantCount. Renders independent of fold
state per the founder-locked View 4 ASCII (region keeps `[43]`
even when expanded to phases only).
• Hover dim — onMouseEnter/Leave on a node dims non-neighbor
nodes + non-incident edges to 35% opacity. Selection / host /
neighbor rings keep full opacity per spec precedence.
• Right-click → adapter actions menu — new `actions` +
`onNodeAction` props on FlowCanvasProps. Renders the supplied
NodeAction[] (filtered by per-action `enabled` predicate) in a
NodeActionsMenu (click-outside + Esc dismissal, mirrors
ProfileMenu's canonical seam).
• `triggeredBy` cross-flow badge — when FlowInstance.triggeredBy
is non-empty, a top-left banner lists the parent flows with a
`[↗ open flow]` button → onNavigateFlow callback.
• Cross-flow edges — when a Relationship's `toFlowId` references a
flow not in the current canvas, the source node renders a
"→ flow" tag that calls onNavigateFlow.
FlowPage wires onNodeAction to
POST /api/v1/flows/{id}/nodes/{nodeId}/actions/{actionId} and
onNavigateFlow to the router. The default action list (Retry/Suspend/View
logs) is supplied by FlowPage; adapters can override it.
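The child-count badge above relies on a recursive, cycle-safe walk over `contains` edges. A rough sketch of that idea follows; the `Rel` shape, the `from` = parent convention and the function name are assumptions for illustration, not the real core/src/layout.ts types.

```typescript
// Hypothetical minimal relationship shape; `from` is assumed to be the parent.
interface Rel { from: string; to: string; type: string }

// Count every node reachable from `rootId` through `contains` edges.
// The visited set keeps the walk finite on cyclic data and counts a node
// with two parents (for example region + phase) only once.
function descendantCount(rootId: string, rels: Rel[]): number {
  const children = new Map<string, string[]>();
  for (const r of rels) {
    if (r.type !== 'contains') continue;
    const list = children.get(r.from) ?? [];
    list.push(r.to);
    children.set(r.from, list);
  }
  const visited = new Set<string>([rootId]);
  const stack: string[] = [rootId];
  while (stack.length > 0) {
    const id = stack.pop()!;
    for (const child of children.get(id) ?? []) {
      if (visited.has(child)) continue;
      visited.add(child);
      stack.push(child);
    }
  }
  return visited.size - 1; // exclude the root itself
}
```

Because the count depends only on topology, it stays the same whether the parent is folded or expanded, which matches the badge behaviour described above.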
Canonical seam citations (per ARCHITECT-FIRST):
• core/src/layout.ts (Agent #1) — pure layout function. Extended
with LaneDescriptor[] + descendantCount, cycle-safe lane-depth
walks reusing the existing visited-set pattern. Lane geometry
stays in canvas (the layout is pure topology).
• widgets/auth/ProfileMenu.tsx — canonical click-outside + ESC
dismissal pattern. NodeActionsMenu mirrors this verbatim so we
stay consistent without a new radix/headless-ui dependency.
Tests: 25 core (was 20, +5 for lanes + descendantCount) + 22 canvas
(was 9, +13 for lane layout, badge math, hover dim, action menu,
triggeredBy banner, cross-flow tag). FlowPage tests still 8/8 green.
No vite/next builds (Rule 7). No kubectl writes (Rule 11). Lane
geometry has zero domain knowledge — the canvas never reads "phase"
or "region" as words; everything is `meta.layout` + `meta.isGroup`
+ `contains` edges driven by the adapter.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #34's chroot Flux pulled bp-openova-flow-{server,emitter} 0.1.0
at install time. PR #1404 republished the same 0.1.0 tag with the
ghcr.io image repo, but OCI HelmRepository sources don't re-pull a
tag they've already cached by digest — even when the bytes change.
Bump Chart.yaml version + bootstrap-kit HR pins to 0.1.1 so Flux
detects the new version and pulls cleanly. No semantic change vs PR
#1404 — same repo, same templates, just a fresh tag.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mothership catalyst-api serves /sovereign/api/v1/flows/{deploymentId}/* for
every Sovereign's user-facing job view, but the previous resolver only knew
about OPENOVA_FLOW_SERVER_URL (or the in-cluster Service DNS default). On
the mothership both point at a Service name that does not exist in its
cluster DNS, so prov #34 hit:
HTTP/2 502 openova-flow-server unreachable:
Get "http://openova-flow-server.catalyst-system.svc.cluster.local:8080/v1/flows/.../snapshot":
dial tcp: lookup openova-flow-server.catalyst-system.svc.cluster.local: no such host
Resolution order is now:
1. OPENOVA_FLOW_SERVER_URL env override — wins (chroot catalyst-api).
2. h.deployments.Load(deploymentId) → Request.SovereignFQDN → build
`https://openova-flow.<sovereignFQDN>` (HTTPRoute pattern documented
in platform/openova-flow-server/chart/values.yaml comment + the
bootstrap-kit overlay
clusters/_template/bootstrap-kit/56-bp-openova-flow-server.yaml,
which sets `hostname: openova-flow.${SOVEREIGN_FQDN}`).
3. No deployment in store (and no env): return 404 instead of silently
dialing a Service URL the mothership can't reach.
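The handler itself is Go; the following TypeScript is only a sketch of the decision order listed above. `resolveFlowServerUrl`, `FqdnLookup` and the `Resolution` type are hypothetical stand-ins for the Go resolver and the deployment store.

```typescript
type Resolution =
  | { ok: true; baseUrl: string }
  | { ok: false; status: 404 };

// Hypothetical lookup: the Sovereign FQDN for a deployment, or undefined
// when the deployment is not in the store.
type FqdnLookup = (deploymentId: string) => string | undefined;

function resolveFlowServerUrl(
  envOverride: string | undefined,
  lookupFqdn: FqdnLookup,
  deploymentId: string,
): Resolution {
  // 1. An explicit env override wins (chroot catalyst-api).
  if (envOverride !== undefined && envOverride.trim() !== '') {
    return { ok: true, baseUrl: envOverride.trim() };
  }
  // 2. Otherwise derive the public host from the per-deployment FQDN.
  const fqdn = lookupFqdn(deploymentId);
  if (fqdn !== undefined && fqdn.trim() !== '') {
    return { ok: true, baseUrl: `https://openova-flow.${fqdn.trim()}` };
  }
  // 3. No env and no usable record: 404 rather than dialing an
  //    unreachable in-cluster Service URL.
  return { ok: false, status: 404 };
}
```

Returning a tagged result keeps the 404 case explicit for the caller instead of hiding it behind a default URL.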
Canonical patterns cited (ARCHITECT-FIRST rule):
- PDM-by-deploymentId lookup: deployments.go GetDeployment lines 1201-1216
(h.deployments.Load(id) → (*Deployment).Request.SovereignFQDN). The
chrootEnsureDeployment fallback (jobs.go lines 53-86) covers the
chroot case; on the mothership it returns nil and surfaces 404.
- Self-signed TLS skip-verify: deployment_handover_export.go line 62
(&tls.Config{InsecureSkipVerify: true} with nolint:gosec, gated by
explicit operator opt-in). Gated here on
OPENOVA_FLOW_TLS_SKIP_VERIFY=true so qa-loop Sovereigns minting
LE-staging "Fake LE Intermediate X1" certs are reachable, while
production stays strict.
SSE streaming logic is unchanged. Per docs/INVIOLABLE-PRINCIPLES.md #4
the only hostname literal added is the chart-documented prefix
`openova-flow.`; the FQDN suffix itself comes from the per-deployment
record at runtime.
Tests:
- TestFlowProxy_EnvOverride_TakesPrecedence — chroot path
- TestFlowProxy_DerivesURLFromDeploymentFQDN — mother path
- TestFlowProxy_DerivedURL_NotFoundReturns404
- TestFlowProxy_DerivedURL_EmptyFQDNReturns404
- TestFlowProxy_DerivedURL_PathAssembly
All 15 TestFlowProxy_* tests pass (go test ./internal/handler -run TestFlowProxy).
go vet ./... clean. go build ./cmd/api clean. The two pre-existing
TestHandleWhoami_* failures on origin/main are unrelated.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #34 chroot has bp-openova-flow-server + bp-openova-flow-emitter
HRs Ready=True but pods stuck in ImagePullBackOff:
Failed to pull image
"harbor.openova.io/proxy-ghcr/openova-io/openova/openova-flow-server:0ac1297":
failed to resolve reference: ... not found
Root cause: mothership Harbor's proxy-ghcr project doesn't carry
ghcr.io auth for openova-io's PRIVATE container packages. Direct
`harbor.openova.io/...` references bypass containerd's transparent
ghcr.io→harbor rewrite (registries.yaml v1) and force Harbor to
upstream-pull, which 404s on private images.
catalyst-api + catalyst-ui (also private) work fine because their
charts reference `ghcr.io/openova-io/openova/...` directly. containerd
rewrites at the wire (MIRROR-EVERYTHING preserved), and kubelet auths
with the `ghcr-pull` imagePullSecret (reflected into every namespace
by bp-reflector).
Switch openova-flow-{server,emitter} charts to the same pattern.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build-ui on 841b6133 surfaced TS2304 "Cannot find name 'global'" in
several layout tests after the workspace-root npm ci fix exposed
errors that the prior react/d3-* failures had masked. The tests use
`global.fetch = vi.fn(...)` which requires @types/node ambient types.
tsconfig.app.json restricted `types` to ["vite/client"], so node
types weren't auto-loaded. Add "node" so the existing @types/node
devDep (^24.12.0) is in scope.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1399 (Agent #5) added npm workspaces at the repo root, but the
Containerfile still ran `npm ci` from /repo/products/catalyst/bootstrap/ui/
which bypasses workspace activation. Cross-workspace bare-spec imports
(react / d3-force / d3-drag / d3-selection) from the canvas package
source couldn't resolve, breaking the Docker build with ~120 TS2307
errors on commit 2c6595a3 (2026-05-11).
Fix: COPY the workspace-root package.json + package-lock.json + each
workspace's package.json BEFORE installing. Run `npm ci --workspaces
--include-workspace-root` from /repo. Then WORKDIR into the leaf for
the Vite build. This is the canonical npm workspaces flow.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenovaFlow's FlowNode is deliberately domain-agnostic — Phase 0/1/2/3
+ multi-region structure are conveyed via synthetic group nodes,
contains relationships, and adapter-supplied meta.layout hints (same
primitives a Temporal/Argo/Airflow adapter would use for their own
concepts). Catalyst-specific knowledge stays in the adapter.
What this PR ships
==================
products/openova-flow/adapter-flux:
- mapper.go: phase-suffix constants, BuildPhaseNodes, BuildPhaseEdges,
derivePhase (slot-label / component-label driven, no hardcoded
HR-name → phase table). BuildFromHR now returns two `contains` rels
per leaf (region row + phase column). BuildRegionNode carries
meta.layout=lane-vertical + isGroup.
- rollup.go (new): StatusTracker + RollupStatus (worst-of:
failed > running > pending > succeeded). Mirrors the same worst-of
rollup the catalyst-api status-projection uses for the Sovereign
Console progress widget.
- hr_informer.go: bootstrap emits region + 4 phase nodes + 3 FS edges
per region; HR upserts/deletes update the StatusTracker and re-emit
affected synthetic parents with fresh rolled-up status.
- test/mapper_synthetic_test.go (new): 9 cases — phase nodes,
phase edges, slot/component/name-fallback derivation, 43-mock-HR
acceptance, region-scoped IDs, default region fallback.
- test/rollup_test.go (new): 9 cases — rollup palette, tracker
lifecycle, per-group isolation.
- test/mapper_test.go: updated existing assertions for the new
contains-edge count (2 per HR, was 1).
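The worst-of rollup in rollup.go is Go; as a language-neutral illustration, a sketch of the same ordering in TypeScript might look like this. The empty-group result (`pending`) is an assumption, since the commit message does not say what an empty group reports.

```typescript
type Status = 'failed' | 'running' | 'pending' | 'succeeded';

// Severity order from the commit message: failed > running > pending > succeeded.
const SEVERITY: Record<Status, number> = {
  failed: 3,
  running: 2,
  pending: 1,
  succeeded: 0,
};

// Return the most severe status among the children of a synthetic group.
function rollupStatus(children: Status[]): Status {
  // Assumption: a group with no children yet is reported as pending.
  if (children.length === 0) return 'pending';
  return children.reduce((worst, s) => (SEVERITY[s] > SEVERITY[worst] ? s : worst));
}
```

With this ordering a phase only shows `succeeded` once every HR under it has succeeded, and a single failed HR marks the whole phase and region as failed.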
clusters/_template/bootstrap-kit/*.yaml (45 HRs):
- Added catalyst.openova.io/slot=<NN> label per HR (chart-level slot
surface so the adapter doesn't hardcode HR-name → phase). Mirrors
the existing catalyst.openova.io/component label pattern in
platform/external-secrets-stores/chart/templates/*.yaml +
platform/openclaw/chart/templates/*.yaml.
- 06a-bp-self-sovereign-cutover.yaml + 13-bp-catalyst-platform.yaml
also get catalyst.openova.io/component={cutover,catalyst-platform}
so their phase derivation is explicit, not name-fallback.
Canonical patterns cited
========================
1. catalyst.openova.io/component label on platform/* charts
(platform/external-secrets-stores, platform/openclaw) — same label
vocabulary, extended with slot.
2. worst-of-children rollup matches the existing catalyst-api
status-projection pattern (Sovereign Console progress widget).
Tests
=====
go test ./test/... → 31 PASS, 0 FAIL.
go vet ./... → clean.
Definition of Done (after Build & Deploy + emitter reconcile)
=============================================================
GET /sovereign/api/v1/flows/<deploymentId>/snapshot returns:
- N region root nodes (1 per adapter sidecar)
- 4 phase nodes per region (8 total for 2-region prov)
- N HR nodes per region with TWO `contains` edges each
- 3 phase-FS edges per region
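For a quick sanity check of a snapshot against the counts above, the arithmetic can be sketched as below. It assumes every region carries the same number of HRs and counts only the per-HR `contains` edges; `expectedSnapshotCounts` is a hypothetical helper, not part of the adapter.

```typescript
interface ExpectedCounts {
  regionNodes: number;
  phaseNodes: number;
  hrNodes: number;
  hrContainsEdges: number;
  phaseFsEdges: number;
}

// Expected node/edge counts for `regions` regions with `hrsPerRegion` HRs each.
function expectedSnapshotCounts(regions: number, hrsPerRegion: number): ExpectedCounts {
  const hrNodes = regions * hrsPerRegion;
  return {
    regionNodes: regions,         // 1 region root per adapter sidecar
    phaseNodes: 4 * regions,      // 4 phase nodes per region
    hrNodes,
    hrContainsEdges: 2 * hrNodes, // region row + phase column per HR
    phaseFsEdges: 3 * regions,    // 3 finish-to-start edges per region
  };
}
```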
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the OpenovaFlow Foundation end-to-end so the catalyst-ui FlowPage
consumes the new openova-flow-server's merged multi-region SSE stream
(`GET /api/v1/flows/{deploymentId}/stream`) and renders the per-region
adapter-flux emissions directly via @openova/flow-canvas. Closes the
revert from PR #1394 and unblocks the prov #34 multi-region 2-bubble
demo (fsn1 + hel1 each install bp-gateway-api → two bubbles).
# What ships
## A. npm workspaces at repo root
• New `package.json` declares `openova-monorepo` private root with
three workspaces: products/openova-flow/{core,canvas} +
products/catalyst/bootstrap/ui.
• Root `package-lock.json` resolves @openova/flow-* as workspace
symlinks into the hoisted node_modules tree.
• react / react-dom / d3-* are now hoisted into the monorepo's root
node_modules, so flow-canvas's bare `import 'react'` resolves via
standard upward-walking node_modules — no per-package sibling
node_modules required (the root cause of PR #1389's build break).
## B. Catalyst-ui consumes @openova/flow-* via file: deps
• catalyst-ui's `package.json` adds `@openova/flow-core` and
`@openova/flow-canvas` as `file:../../../openova-flow/{core,canvas}`
deps so `npm ci` from within catalyst-ui (today's CI path) keeps
working without needing root-level `npm ci -ws`.
• Vite `resolve.alias` + tsconfig `paths` bind `@openova/flow-core`
and `@openova/flow-canvas` to the source-only `./src/index.ts`
entry points. `dedupe: ['react', 'react-dom']` guards against
double-instancing.
• `tsconfig.app.json` `include` adds the two flow-package src trees
so tsc covers them with catalyst-ui's strict settings (instead of
each package's standalone `tsc -p tsconfig.json`, which lacks the
React/d3 node_modules siblings).
## C. New SSE consumer + bridge
• `src/lib/openflow-adapter-sse.ts` — `useFlowStream` React hook +
pure `reduceFlowMessage` reducer. Consumes the contract verbatim
(snapshot / upsert-flow / upsert-nodes / upsert-rels / delete-nodes
/ delete-rels). Owns the EventSource lifecycle, GET /snapshot
pre-paint, capped exponential reconnect.
• `src/lib/flow-bridge.ts` — catalyst-specific glue:
`CATALYST_STATUS_PALETTE` (mirrors `--bubble-*` CSS tokens onto
`StatusTone`), `flowStateToArrays` (Map→Array materialiser),
`regionDescriptorsFromFlow` (derives FlowCanvas regions from live
region tags + optional wizard-store augmentation), and
`rollupFlowStatus` (provisioning-status rollup on the new
contract).
• NOT a Job-shape bridge — the legacy Job adapter from PR #1389
is gone. catalyst-ui never goes through Catalyst's legacy Job model
again; the SSE stream IS the source of truth.
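A rough sketch of what a pure reducer over these messages could look like, including the delete-nodes rel-prune cascade mentioned in the tests below. The message and state shapes are hypothetical (the real contract lives in the flow packages), and `upsert-flow` is left out for brevity.

```typescript
// Hypothetical shapes; the real contract types may differ.
interface FlowNode { id: string; region?: string; status?: string }
interface Relationship { id: string; from: string; to: string; type: string }
interface FlowState {
  nodes: Map<string, FlowNode>;
  rels: Map<string, Relationship>;
}

type FlowMessage =
  | { type: 'snapshot'; nodes: FlowNode[]; rels: Relationship[] }
  | { type: 'upsert-nodes'; nodes: FlowNode[] }
  | { type: 'upsert-rels'; rels: Relationship[] }
  | { type: 'delete-nodes'; ids: string[] }
  | { type: 'delete-rels'; ids: string[] };

// Pure: copies the maps, never mutates `state`. Upserts are keyed by id,
// so replaying the same message after a reconnect is harmless.
function reduceFlowMessageSketch(state: FlowState, msg: FlowMessage): FlowState {
  const nodes = new Map(state.nodes);
  const rels = new Map(state.rels);
  switch (msg.type) {
    case 'snapshot':
      nodes.clear();
      rels.clear();
      msg.nodes.forEach((n) => nodes.set(n.id, n));
      msg.rels.forEach((r) => rels.set(r.id, r));
      break;
    case 'upsert-nodes':
      msg.nodes.forEach((n) => nodes.set(n.id, n));
      break;
    case 'upsert-rels':
      msg.rels.forEach((r) => rels.set(r.id, r));
      break;
    case 'delete-nodes': {
      const gone = new Set(msg.ids);
      msg.ids.forEach((id) => nodes.delete(id));
      // Cascade: drop every relationship that touches a deleted node.
      rels.forEach((r, id) => {
        if (gone.has(r.from) || gone.has(r.to)) rels.delete(id);
      });
      break;
    }
    case 'delete-rels':
      msg.ids.forEach((id) => rels.delete(id));
      break;
  }
  return { nodes, rels };
}
```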
## D. FlowPage.tsx rewired
• Drives `FlowCanvas` from `@openova/flow-canvas` directly off the
new hook.
• Multi-region support comes for free: per-region adapter-flux tags
every emitted FlowNode with `region: '<location-code>'`; the
canvas's swimlane layout buckets by `region`. Single-region
provisions render identically to before via a synthetic
fallback descriptor.
• Embedded mode preserved for JobDetail.
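As an illustration of the region bucketing and the single-region fallback described above, a minimal sketch follows. `TaggedNode`, `RegionDescriptor`, `regionDescriptorsSketch` and the `'default'` fallback id are assumptions, not the actual `regionDescriptorsFromFlow` signature.

```typescript
// Hypothetical shapes for illustration only.
interface TaggedNode { id: string; region?: string }
interface RegionDescriptor { id: string; label: string }

// Derive one descriptor per distinct region tag, in first-seen order.
// When no node carries a region tag, return a single synthetic descriptor
// so a single-region provision still gets one swimlane.
function regionDescriptorsSketch(
  nodes: TaggedNode[],
  fallbackId = 'default',
): RegionDescriptor[] {
  const seen = new Set<string>();
  const out: RegionDescriptor[] = [];
  for (const n of nodes) {
    if (!n.region || seen.has(n.region)) continue;
    seen.add(n.region);
    out.push({ id: n.region, label: n.region });
  }
  return out.length > 0 ? out : [{ id: fallbackId, label: fallbackId }];
}
```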
## E. Containerfile preserves CI build
• COPY products/openova-flow/{core,canvas}/{package.json,src/}
BEFORE `npm ci` so `file:` deps validate. Subsequent
`COPY products/` layers the rest (CONTRACT.md etc.) in.
# Tests
• 23 new tests, 0 regressions on adjacent areas:
- `openflow-adapter-sse.test.ts` (6) — reducer covers all 6
FlowMessage variants including delete-nodes' rel-prune cascade
AND a multi-region merge case (fsn1 + hel1 both install
bp-gateway-api).
- `flow-bridge.test.ts` (10) — palette completeness, Map→Array
ordering, region descriptor derivation/fallback, status rollup
including group-exclusion and terminal-failure detection.
- `FlowPage.test.tsx` (7) — empty-state mount, StatusStrip, no
legacy mode toggle, embedded variant.
• flow-core: 20/20 passing; flow-canvas: 9/9 passing.
• Vitest full suite: 1130 pass / 87 fail (87 fails are pre-existing
on main and unrelated — PinInput6, ProvisionPage, etc.). Baseline
on main is 1052 pass / 88 fail / 27 failed files; this PR brings
78 new passing tests and lowers failing files from 27 → 18.
# Constraints honoured (Rule 7)
• NO `vite build` / `next build` / `npm run build` / `npx playwright
test` / `npx playwright install`. Only `tsc --noEmit` + `vitest
run` + `npm install --package-lock-only`.
• NO `kubectl apply` / chart manifests touched (Rule 11).
• NO hardcoded URLs / regions / k3s flags. Endpoint composed from
`API_BASE`; regions derived from live FlowNode tags; deploymentId
from `useParams` (Rule 18).
• Two-repo discipline: openova-io/openova only (Rule 21).
• Conventional commit + Claude co-author footer (Rule 20).
• isolation:"worktree" — work landed in a dedicated worktree.
# Canonical-seam citations (ARCHITECT-FIRST)
1. PR #1389's `flow-bridge.ts` — reference for the shape of a
catalyst-ui→@openova/flow contract layer. NOT conflated: that
bridge translated legacy Catalyst Jobs into FlowNodes; this one
consumes the new SSE FlowMessage stream directly with no Job
intermediary.
2. `useDeploymentEvents.ts` (line 526+, `openStream` + `onerror`
reconnect + capped retry) — canonical SSE consumer pattern in
this codebase. `useFlowStream` mirrors it (capped exponential
backoff, idempotent reducer over replayed buffered events).
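The capped exponential backoff mentioned in item 2 can be sketched as a small pure function. The 1 s base and 30 s cap are illustrative assumptions, not the values used by `useDeploymentEvents.ts` or `useFlowStream`.

```typescript
// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt,
// never exceeding `maxMs`. Defaults are assumed values for illustration.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const exp = Math.max(0, Math.floor(attempt));
  return Math.min(maxMs, baseMs * Math.pow(2, exp));
}
```

A caller would typically reset `attempt` to 0 once a frame arrives, so a brief outage does not leave the stream on the longest delay.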
# Definition of Done — post-merge verification plan
1. CI green (catalyst-build builds the new Containerfile path).
2. `curl -k -b /tmp/cz-cookie-prov27.txt
'https://console.openova.io/sovereign/api/v1/flows/5a175e0a88c99cec/snapshot' | jq`
→ nodes[] contains BOTH `fsn1/bp-gateway-api` AND `hel1/bp-gateway-api`.
3. Browser test: navigate to
`https://console.openova.io/sovereign/provision/5a175e0a88c99cec/jobs/install-gateway-api`
→ expect TWO bubbles (one per region).
4. If snapshot is empty, inspect emitter DaemonSets:
`kubectl --context=omantel get pods -n openova-flow`.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the two missing GitHub Actions build pipelines for the OpenovaFlow Go
binaries so prov #34 has real images to install. Both auto-bump their
chart's values.yaml `image.tag` on every main-branch push and dispatch
blueprint-release for chart re-publish.
Workflows shipped:
- .github/workflows/build-openova-flow-server.yaml
· Triggers on push to products/openova-flow/server/** or the chart
· `go vet` + `go test -race` + Buildx push to
ghcr.io/openova-io/openova/openova-flow-server:<sha> + :latest
· cosign keyless sign + SBOM attest
· awk-bumps platform/openova-flow-server/chart/values.yaml
flowServer.image.tag, commits to main with [skip ci]
· Dispatches blueprint-release.yaml for chart re-publish
- .github/workflows/build-openova-flow-adapter-flux.yaml
· Same shape; bumps platform/openova-flow-emitter/chart/values.yaml
flowEmitter.image.tag
Chart defaults (`tag: "latest"`) already shipped in PR #1397 — no
values.yaml changes needed in this PR.
Canonical patterns cited (ARCHITECT-FIRST):
- Build shape mirrors .github/workflows/build-application-controller.yaml
(Go vet + test + Buildx + cosign + SBOM + values.yaml awk-bump +
blueprint-release dispatch).
- awk-sed bump pattern mirrors catalystApi/catalystUi tag-bump in
.github/workflows/catalyst-build.yaml `deploy` job (with the
`[skip ci]` + explicit blueprint-release dispatch fix from #712).
Per docs/INVIOLABLE-PRINCIPLES.md:
- #4a (GitHub Actions is the only build path)
- event-driven (no cron triggers, only push/PR/workflow_dispatch)
MIRROR-EVERYTHING: image refs in chart values point at
harbor.openova.io/proxy-ghcr/...; CI pushes to ghcr.io directly and
Harbor proxy-pulls. No direct push to harbor.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final integration piece for OpenovaFlow infrastructure path —
catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID
+ SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits
distinct region tags on every FlowNode and the snapshot returns 2× per
HR on a multi-region Sovereign.
Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go
server + flux adapter + bootstrap-kit slots 56/57), PR #1394
(catalyst-ui temporary revert until npm workspaces land), PR #1395
(chart no-op).
## Scope vs original Agent #3 brief
The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire +
runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred:
PR #1394 reverted Agent #1's UI wiring because the Docker UI build has
no node_modules for the cross-workspace canvas source. Founder note on
#1394: "Agent #3 (or a follow-up) will re-wire them properly once npm
workspaces are configured at repo root."
This PR ships the infrastructure half (proxy + cloud-init + runbook).
The canvas-side rewire is a separate follow-up PR that needs npm
workspaces, not surgical edits to FlowPage.
## What ships
### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events}
products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go:
- GET /snapshot — JSON pass-through, headers + status forwarded
- GET /stream — unbuffered SSE pass-through using http.Flusher (NOT
httputil.ReverseProxy; that buffers and breaks text/event-stream)
- POST /events — body forwarded byte-for-byte
- Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign
in-cluster Service DNS)
Routes registered in cmd/api/main.go inside the auth-gated chi.Group.
11 table-driven tests cover snapshot/events/stream pass-through, upstream
404/400/unreachable propagation, empty-deploymentId guard, SSE frames
arrive AS EMITTED, and env-default fallback.
### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY
- infra/hetzner/cloudinit-control-plane.tftpl: two new
postBuild.substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP
- infra/hetzner/main.tf — primary CP renders var.region as region key;
secondary CP renders each.key (e.g. "hel1-1") from for_each over
local.secondary_regions
- infra/hetzner/variables.tf — new sovereign_deployment_id var (string,
default "" for tofu mocks)
- provisioner.go writeTfvars — writes vars["sovereign_deployment_id"]
= req.DeploymentID
- bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal
"primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY}
envsubst keys
### 3. Deployment record flag
handler/deployments.go State() — emits `openovaFlowEnabled: true` on
every deployment. The catalyst-ui rewire (follow-up PR) will read this
to enable the openova-flow-server adapter; legacy provisions without
the flag will keep the bridge once the rewire lands.
### 4. Verification runbook
docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body
(multi-region cpx42 fsn1+hel1, qaTestEnabled=true,
sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual
canvas checks (gated on the follow-up UI rewire), and a failure-class
triage table.
## Canonical-seam citations
1. SSE pattern: StreamLogs in
products/catalyst/bootstrap/api/internal/handler/deployments.go:1244-1287,
with an identical Content-Type +
Cache-Control + X-Accel-Buffering header set; identical
http.Flusher.Flush() after each write; identical r.Context().Done()
cancel path.
2. postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893
(SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var}
form, dual emission at primary + secondary CP for_each in main.tf.
## Verification
```
$ go build ./...
(clean)
$ go vet ./...
(clean)
$ go test ./internal/handler/ -run TestFlowProxy -count=1 -race
ok github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler 1.410s
$ go test ./internal/provisioner/... -count=1
ok github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner 0.025s
```
3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields,
TestHandleWhoami_PinSessionRBACClaims,
TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on
main HEAD without this PR — unrelated baseline state.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blueprint Release smoke step renders charts with default values, but
both openova-flow-{server,emitter} charts had `image.tag: ""` which
fired the _helpers.tpl image-fail-fast at render. Default to `latest`
so smoke passes. CI's image-build workflow seds in the real short-SHA
on every push to products/openova-flow/{server,adapter-flux}/**, and
the bootstrap-kit HRs override at install time so runtime is always
deterministic. `latest` is only the render-default placeholder.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>