fix(sovereign-tls): tls-restart Job needs list+watch verbs (#1504)

* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
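
A minimal sketch of the canonicalisation described above, assuming a
standalone helper. The names canonicaliseDeps and regionDep are
hypothetical and are not taken from helmwatch_bridge.go or
phase1_watch.go; only the "install-" prefix and the id shapes come from
this message.

```go
package main

import (
	"fmt"
	"strings"
)

// JobNamePrefix mirrors the "install-" prefix named in the commit message.
const JobNamePrefix = "install-"

// canonicaliseDeps (hypothetical name) returns a copy of deps in which every
// entry starts with JobNamePrefix. Entries that already carry the prefix are
// left untouched, so applying it twice gives the same result as applying it once.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// regionDep (hypothetical name) builds the secondary-region form
// "install-<region>:<chart>" described for the phase1_watch.go path.
func regionDep(region, chart string) string {
	return JobNamePrefix + region + ":" + chart
}

func main() {
	deps := []string{"gitea", regionDep("hel1-2", "seaweedfs")}
	once := canonicaliseDeps(deps)
	twice := canonicaliseDeps(once)
	fmt.Println(once)  // [install-gitea install-hel1-2:seaweedfs]
	fmt.Println(twice) // [install-gitea install-hel1-2:seaweedfs]
}
```

Prepending only when the prefix is missing is what lets the bridge accept
both bare primary-region names and entries already canonicalised upstream.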

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.
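
A small sketch of why the final pass is needed, under simplifying
assumptions: resolveDeps below models only the ordering stated above
(existing-store wins when non-empty, otherwise the event-carried value);
the remaining tier is not described here and is left out. All function
names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const jobNamePrefix = "install-"

// canonicalise prepends jobNamePrefix to entries that lack it.
func canonicalise(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// resolveDeps is a reduced model of the resolve step: a non-empty value from
// the existing store is preferred over the value carried by the event.
func resolveDeps(existing, fromEvent []string) []string {
	if len(existing) > 0 {
		return existing
	}
	return fromEvent
}

func main() {
	existing := []string{"hel1-2:gitea"}                // legacy row, no prefix
	fromEvent := canonicalise([]string{"hel1-2:gitea"}) // canonical since the prior fix

	resolved := resolveDeps(existing, fromEvent)
	fmt.Println(resolved)               // [hel1-2:gitea]: the legacy value wins
	fmt.Println(canonicalise(resolved)) // [install-hel1-2:gitea]: the final pass repairs it
}
```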

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries, so the max id
    any region sees is primary + (regions - 1), which stays ≤ 254 for
    up to 3 regions. ID 255 is intentionally avoided (Cilium sentinel).

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).
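
The derivation rules above can be sketched as follows. This is an
illustration under stated assumptions, not the provisioner's code: the
function names are hypothetical, the deployment ID is hashed as its
string form, and the first four hash bytes are read big-endian. If the
real helper differs on either point, it will produce different ids for
the same deployment ID.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveMeshName returns "<first-fqdn-label>-mesh".
func deriveMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveMeshID maps a deployment ID into [1, 252].
// Assumption: big-endian read of the first four SHA-256 bytes.
func deriveMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

// applyMeshDefaults derives values only when both fields are unset and the
// provision spans more than one region; otherwise the inputs pass through.
func applyMeshDefaults(name string, id uint32, fqdn, deploymentID string, regions int) (string, uint32) {
	if name != "" || id != 0 || regions <= 1 {
		return name, id
	}
	return deriveMeshName(fqdn), deriveMeshID(deploymentID)
}

func main() {
	name, id := applyMeshDefaults("", 0, "t105.omani.works", "a6c0f5dfebd63bd0", 3)
	fmt.Println(name, id >= 1 && id <= 252) // t105-mesh true

	// Single-region provisions keep the empty defaults.
	name, id = applyMeshDefaults("", 0, "t105.omani.works", "a6c0f5dfebd63bd0", 1)
	fmt.Printf("%q %d\n", name, id) // "" 0
}
```

Taking the result modulo 252 and adding 1 keeps the primary id away from
the reserved 0 and leaves headroom for the per-secondary increments.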

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads the ConfigMap only once, at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceNames, so the SA still
can't enumerate or watch other workloads.
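
A simplified model of the Role before and after the fix. The policyRule
type and the allows function are illustrative stand-ins written for this
note, not the Kubernetes API types or authorizer. In a real cluster,
list/watch requests are matched against resourceNames only when they
carry a metadata.name field selector; the name argument stands in for
that selector here.

```go
package main

import "fmt"

// policyRule is a reduced stand-in for an RBAC rule.
type policyRule struct {
	Verbs         []string
	Resources     []string
	ResourceNames []string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allows reports whether any rule permits verb on the named resource.
// An empty ResourceNames list means the rule applies to every name.
func allows(rules []policyRule, verb, resource, name string) bool {
	for _, r := range rules {
		nameOK := len(r.ResourceNames) == 0 || contains(r.ResourceNames, name)
		if contains(r.Verbs, verb) && contains(r.Resources, resource) && nameOK {
			return true
		}
	}
	return false
}

// roleRules builds the two rules named in this message with the given verbs.
func roleRules(verbs ...string) []policyRule {
	return []policyRule{
		{Verbs: verbs, Resources: []string{"deployments"}, ResourceNames: []string{"cilium-operator"}},
		{Verbs: verbs, Resources: []string{"daemonsets"}, ResourceNames: []string{"cilium-envoy"}},
	}
}

func main() {
	before := roleRules("get", "patch")
	after := roleRules("get", "list", "watch", "patch")

	fmt.Println(allows(before, "list", "deployments", "cilium-operator")) // false
	fmt.Println(allows(after, "list", "deployments", "cilium-operator"))  // true
	fmt.Println(allows(after, "watch", "daemonsets", "cilium-envoy"))     // true
	fmt.Println(allows(after, "list", "deployments", "coredns"))          // false
}
```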

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-15 21:02:37 +04:00

6.4 KiB


Sandbox — business requirements

The problem we are solving

Developers using modern coding agents (Claude Code, Cursor, Qwen Code, Aider, Opencode) hit three structural ceilings:

The agent dies when the laptop closes. All current agents are local processes bound to a TTY and a workstation. Long-running work (multi-hour refactors, end-to-end provisioning, watching a deploy roll) cannot survive a closed lid, flaky Wi-Fi, or a move between devices.
The agent has zero awareness of the cloud the developer is shipping to. It can edit code competently and run shell commands, but it does not know what Sovereign you operate, what cluster state exists right now, what your DNS topology is, what your secrets vault holds. Every cross-cluster question is a kubectl round-trip the developer has to wire by hand.
The user has no way to use the agent from a phone, a tablet, a Chromebook, or a borrowed machine. SSH + tmux + Termius is the workaround and it is brittle, ugly on mobile, and breaks down on every device boundary.

Beyond developers, there is a fourth problem at the on-ramp: non-technical users cannot provision a Sovereign through a wizard. The current catalyst-ui wizard is excellent for power users and integrators but presents 20+ fields at first contact. We are losing potential customers who would rather describe their need in two sentences than fill a form.

Who Sandbox is for

Audience	What they get	Pricing fit
Sovereign super-admin (the operator of the Sovereign)	Fleet view of every Org's Sandbox usage, agent catalogue control, org-level quotas, audit, cost attribution. Plus their own developer Sandbox in the same UI.	Bundled with Sovereign subscription.
Org admin (e.g., CTO of a customer Org inside a corporate Sovereign)	Org-scoped admin view: invite developers, set per-developer quotas, bind Org-level secrets (Stripe live, Resend prod), publish Org skill packs. Plus their own developer Sandbox.	Per-seat add-on on the Org subscription.
Nova user — developer (an engineer working inside an Org)	One personal Sandbox: persistent sessions, all approved agent brands, PVC-mounted repos, Org-shared build cache, auto preview URLs per PR, native-TUI in browser.	Per-seat. Pre-paid agent-token budget on top.
Prospective Sovereign customer (no Sovereign yet)	A conversational entry point that talks them through provisioning — text, or voice. The same Sandbox shell, scoped to provisioning MCP tools only.	Free conversion path; paid once the Sovereign is up.

Value proposition

Sandbox is not "another IDE." It is the agent host platform that is missing between "raw agent on a laptop" and "managed AI tooling SaaS." The differentiators that matter:

The Sovereign is the sandbox. The agent runs inside the customer's own tenant cloud (vcluster per Org), under their own RBAC, against their own data, signing as them. Nothing leaves the tenant.
Live cluster awareness is native, not retrofitted. OpenovaFlow's existing watcher fabric and JetStream subjects already publish the events. Sandbox makes the agent a first-class subscriber via MCP resources/subscribe. No kubectl get loops.
The same session is on every device. Native Claude Code TUI in the browser via xterm.js + persistent PTY. Same session re-attaches from PC, iPad, or phone. The pty-server fans out multi-client like tmux but with no tmux in the stack.
Preview-per-PR is one click. Every PR the agent opens auto-resolves to a live URL under the Org's marketplace subdomain (which already exists today). Click on phone, ship from laptop.
Conversational provisioning. New customers describe what they need; the agent calls catalyst-api directly. This is the first surface where AI-first replaces forms.

The moat

OpenOva already runs a per-tenant vcluster cloud with Keycloak identity, Gitea, Harbor, SeaweedFS, NATS JetStream, CNPG, marketplace DNS + BYOD, Crossplane reconcilers, Flux GitOps, and a live event fabric. No competitor in the AI-coding tools category has any of that. They have an editor with chat. We have a cloud with an agent host.

Every other "AI coding" tool tries to bolt cluster-awareness onto an editor. Sandbox does the inverse: it bolts an editor experience onto a cluster the user already owns.

Success criteria

We will know Sandbox is the right product when:

A developer on an iPad in a cab can ship a PR to production from a session that started on their desktop that morning, without re-attaching by hand.
A new customer goes from console.openova.io to a working Sovereign by speaking three sentences. No wizard fields touched.
The Sovereign admin can see "Org X spent $42 on agent tokens this week, 8 sessions, 3 production deploys" without writing a query.
A Cursor user and a Claude Code user inside the same Org open each other's PRs and the cards in each session reference the other's diffs because both subscribed to the same JetStream subjects.
The marketing pitch for Sovereign no longer leads with "secure tenant cloud" — it leads with "the only place where your agents can ship for you."

Non-goals

Sandbox is not an editor. It does not replace VS Code / Cursor / IntelliJ. It hosts agents; developers keep their editors.
Sandbox is not a Codespaces clone. Codespaces hosts a dev environment; Sandbox hosts an agent. The dev environment is incidental.
Sandbox does not ship its own model. Users bring their model subscription (Anthropic, OpenAI, Qwen, etc.) — the Sovereign holds the API key in its secret store.
Sandbox does not replace the catalyst-ui wizard for provisioning. The conversational entry is an alternative path for new users. Power users keep the wizard.

Cost surface (for billing design)

Sandbox usage rolls up into the existing JetStream usage stream (catalyst.usage.recorded), tagged with org_id in the payload (the existing convention — core/services/shared/events/nats.go). The cost dimensions:

Compute — vcluster pod CPU/memory hours per Sandbox session.
Storage — PVC GB-hours for repo workdirs and build caches.
Agent tokens — model API spend attributed to the session's owner.
Preview hosting — pod-hours for live preview deployments under the Org's marketplace subdomain.

The super-admin dashboard shows per-Org rollup; the org admin dashboard shows per-developer rollup; the developer sees their own only.
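
As a rough illustration of the three rollup views, the sketch below aggregates usage records by a caller-supplied key. The usageRecord shape, its field names, and the dimension strings are assumptions made for this example; only the org_id tag comes from the existing convention. Costs are held in integer cents to avoid floating-point drift.

```go
package main

import "fmt"

// usageRecord is a hypothetical payload shape for a usage event.
type usageRecord struct {
	OrgID     string
	Developer string
	Dimension string // e.g. "compute", "storage", "agent_tokens", "preview_hosting"
	CostCents int64
}

// rollup sums CostCents per key. The key function returns false to skip a
// record, which is how a view is scoped to one Org or one developer.
func rollup(records []usageRecord, key func(usageRecord) (string, bool)) map[string]int64 {
	totals := make(map[string]int64)
	for _, r := range records {
		if k, ok := key(r); ok {
			totals[k] += r.CostCents
		}
	}
	return totals
}

func main() {
	records := []usageRecord{
		{"org-x", "alice", "agent_tokens", 3000},
		{"org-x", "bob", "compute", 1200},
		{"org-y", "carol", "storage", 500},
	}

	// Super-admin view: one total per Org.
	perOrg := rollup(records, func(r usageRecord) (string, bool) { return r.OrgID, true })
	fmt.Println(perOrg["org-x"], perOrg["org-y"]) // 4200 500

	// Org-admin view: per-developer totals inside a single Org.
	perDev := rollup(records, func(r usageRecord) (string, bool) {
		return r.Developer, r.OrgID == "org-x"
	})
	fmt.Println(perDev["alice"], perDev["bob"], len(perDev)) // 3000 1200 2
}
```

The developer's own view is the same call with a key function that accepts only records where Developer matches the signed-in user.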
