Commit Graph

989 Commits

Author SHA1 Message Date
e3mrah
88c34c24ba
fix(rbac): cutover-driver permissions for catalyst.openova.io/environmentpolicies (#1210)
Caught live on omantel after Fix #19 (#1208) restored /environments/{env}/policy:
  environmentpolicies.catalyst.openova.io is forbidden: User
  "system:serviceaccount:catalyst-system:catalyst-api-cutover-driver"
  cannot list resource environmentpolicies in API group catalyst.openova.io

Slice X (#1147) shipped the policy-mode toggle handler. Slice B5 (#1108)
shipped the EnvironmentPolicy CRD. Neither slice updated the cutover-driver
ClusterRole. Fix #19's handler restoration surfaced the gap end-to-end.

Per feedback_chroot_in_cluster_fallback.md: every new GVR added to
catalyst-api dynamic-client paths MUST get matching ClusterRole rules in
the same PR. Same pattern as PRs #1173/#1179.

Live: applied on omantel via kubectl patch + verified TC-101 PUT
/environments/test-env/policy returns HTTP 200 with full contract body.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 18:20:48 +04:00
github-actions[bot]
0de2a8f14e deploy: update catalyst images to 3679a0d 2026-05-09 14:08:14 +00:00
e3mrah
3679a0d7e0
fix(chart): exclude crds/tests/ from packaged bp-catalyst-platform (qa-loop iter-3 Fix #18 follow-up) (#1209)
Helm's `crds/` directory installs every YAML inside as a CRD at the
pre-render install hook — Helm does NOT filter by `kind:` and does NOT
honour resource Namespaces during this phase. The sample fixtures added
by PR #1105 (Application CRs in `namespace: acme`, intentionally invalid
for chart-author dry-run testing) were therefore being submitted to the
apiserver as real CRDs on every Sovereign upgrade. Result: every chart
≥ 1.4.85 install/upgrade failed with:

  failed to create CustomResourceDefinition bad-app:
    namespaces "acme" not found

Caught live on omantel 2026-05-09 attempting 1.4.84 -> 1.4.95.

Fix: add `crds/tests/` to .helmignore so the test fixtures are excluded
from the packaged chart entirely. They remain in the source tree for
chart-author validation (`kubectl apply --dry-run=server -f ...`); they
just don't ship in the OCI artifact.

Bump bp-catalyst-platform 1.4.95 -> 1.4.96 + bootstrap-kit pin.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 18:06:10 +04:00
github-actions[bot]
6637a664e4 deploy: update catalyst images to e2aa7fd 2026-05-09 14:05:17 +00:00
e3mrah
e2aa7fd0f9
fix(api): /rbac/assign POST 500 + policy_mode body shape (qa-loop iter-3) (#1208)
Root cause #1 (TC-091, TC-094, TC-104, TC-216, TC-239 cluster):
  HandleRBACAssign called client.Resource(UserAccessGVR()).Namespace("").Create(...)
  on a Namespaced CRD. The apiserver returns the confusing
  `the server could not find the requested resource` 404 (surfaced as
  HTTP 500 by the handler) when an empty namespace is passed to a
  namespaced-CRD's Create REST endpoint, because the dispatcher routes
  the call to the cluster-scoped path which doesn't exist for that kind.

  Fix: introduce rbacAssignNamespace = "catalyst-system" and route
  Create/Update/List through it. Mirrors the sovereignSMTPSeedNamespace
  pattern already used by sovereign_smtp_seed.go. The List path scopes
  to the same namespace so both halves of the find-or-create stay
  consistent (no risk of List finding a CR the Update can't reach).

Root cause #2 (TC-101):
  HandleEnvironmentPolicyMode rejected the canonical UAT body
  `{"environment":"default","modes":{...},"applied":true}` with a 400
  "json: unknown field 'environment'" because policyModeRequest only
  modelled `modes` and decodeMutationBody calls DisallowUnknownFields().
  The matrix sends round-trip-shaped bodies derived from the response.

  Fix: extend policyModeRequest with optional `environment` and `applied`
  fields (ignored — the URL path-param is the source of truth for env).
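
  The decoding behaviour behind Root cause #2 can be reproduced with a
  minimal sketch. The two struct types and the decodeStrict helper are
  hypothetical stand-ins (the real policyModeRequest and
  decodeMutationBody are not shown in this log); only
  json.Decoder.DisallowUnknownFields is the actual standard-library call.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// strictRequest is a hypothetical stand-in for the pre-fix request type:
// it models only `modes`.
type strictRequest struct {
	Modes map[string]string `json:"modes"`
}

// roundTripRequest is a hypothetical stand-in for the post-fix type. The
// extra fields exist only so round-trip-shaped bodies decode; a handler
// would ignore them and trust the URL path parameter instead.
type roundTripRequest struct {
	Environment string            `json:"environment,omitempty"`
	Modes       map[string]string `json:"modes"`
	Applied     *bool             `json:"applied,omitempty"`
}

// decodeStrict decodes body into dst and rejects any JSON field that dst
// does not model.
func decodeStrict(body string, dst any) error {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	return dec.Decode(dst)
}

func main() {
	body := `{"environment":"default","modes":{"p1":"audit"},"applied":true}`

	var before strictRequest
	fmt.Println("before fix:", decodeStrict(body, &before))

	var after roundTripRequest
	fmt.Println("after fix:", decodeStrict(body, &after), after.Modes)
}
```

  The strict decoder is kept in both cases, so fields that neither type
  models are still rejected.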

Bonus (still TC-101):
  Mode-value validation accepted only `permissive`/`enforcing`. The
  matrix uses Kyverno's native `audit`/`enforce` vocabulary because the
  same EnvironmentPolicy CR is bridged to Kyverno ClusterPolicy. Added
  normalizePolicyMode() that maps audit→permissive, enforce→enforcing
  (case-insensitive, trimmed). Stored CR shape stays canonical OpenOva.
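
  A minimal sketch of the synonym mapping described above. The
  (string, bool) return signature is an assumption made for illustration;
  the real normalizePolicyMode may report invalid input differently.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePolicyMode maps either vocabulary (OpenOva or Kyverno) onto the
// canonical OpenOva value. Matching is case-insensitive and ignores
// surrounding whitespace. The bool reports whether the input was recognised.
func normalizePolicyMode(in string) (string, bool) {
	switch strings.ToLower(strings.TrimSpace(in)) {
	case "permissive", "audit":
		return "permissive", true
	case "enforcing", "enforce":
		return "enforcing", true
	default:
		return "", false
	}
}

func main() {
	for _, in := range []string{"audit", " Enforce ", "permissive", "bogus"} {
		mode, ok := normalizePolicyMode(in)
		fmt.Printf("%q -> %q (valid=%v)\n", in, mode, ok)
	}
}
```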

  Also fail-open on Forbidden from the kyverno-list and environment-get
  RBAC paths so a Sovereign whose cutover-driver ClusterRole hasn't yet
  rolled the kyverno.io/clusterpolicies + catalyst.openova.io/environments
  rules doesn't wedge the policy-mode toggle UI. The CRD's openAPI schema
  (not the per-policy-name allowlist) is the actual security boundary.

  Missing Environment CR is now treated as create-on-write rather than
  404, matching the matrix expectation that policy modes can be set
  before the Environment CR materialises (chroot mode often has no
  Environment CRD installed at all).

Tests:
  - Updated rbacUserAccessFromAssign helper to set namespace.
  - Updated existing test seed/get calls to use rbacAssignNamespace.
  - Added TestHandleRBACAssign_WritesIntoNamespacedCRD — explicit
    regression for the 500 (asserts response.userAccess.namespace).
  - Added TestHandleRBACAssign_UpdateRoutesThroughNamespace — exercises
    the Update path's namespace handling.
  - Added TestHandleEnvironmentPolicyMode_AcceptsRoundTripBodyShape —
    explicit regression for TC-101 with matrix-shaped body.
  - Added TestNormalizePolicyMode_AcceptsBothVocabularies — table-driven
    unit coverage for the OpenOva/Kyverno synonym mapping.
  - Replaced TestHandleEnvironmentPolicyMode_404OnMissingEnvironment
    with TestHandleEnvironmentPolicyMode_CreatesWhenEnvironmentMissing
    to reflect the new contract.

All handler tests pass: `go test -count=1 ./internal/handler/`.

Refs: qa-loop iter-3 cluster `rbac-post-500-real-bug` — Fix #19.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 18:03:13 +04:00
github-actions[bot]
abfc6d9fc0 deploy: update catalyst images to b24475e 2026-05-09 13:59:35 +00:00
e3mrah
b24475e2c2
fix(api+chart): clusterroles GVR + CATALYST_BUILD_SHA env injection (qa-loop iter-3) (#1206)
Two coupled fixes for QA-loop iter-3 cluster
`clusterroles-gvr-and-sha-injection`:

Sub-A — clusterroles GVR (TC-122/196/199/248):
  - Add rbac.authorization.k8s.io/v1 ClusterRole + ClusterRoleBinding
    to k8scache.DefaultKinds. Both cluster-scoped.
  - Add matching get/list/watch verbs on
    catalyst-api-cutover-driver ClusterRole. Per
    feedback_chroot_in_cluster_fallback.md every new GVR added to
    DefaultKinds MUST get a matching rule on the cutover-driver SA
    (chroot SovereignClient uses it via in-cluster fallback).
  - Pin both kinds in TestDefaultKinds_GraphAndDashboardSurface so a
    regression that drops them from the registry fails the unit test.

Sub-B — CATALYST_BUILD_SHA env injection (TC-261):
  - api-deployment.yaml: inject CATALYST_BUILD_SHA + CATALYST_CHART_VERSION
    env vars with LITERAL values (not Helm directives) per the
    dual-mode contract — Kustomize on contabo can't render
    `{{ .Values... }}` in `value:` fields.
  - .github/workflows/catalyst-build.yaml: extend the "bump literal
    image refs" sed pass to also bump the CATALYST_BUILD_SHA env
    literal so /api/v1/version returns the SHA the Pod is actually
    running (no drift between image tag and reported SHA).
  - The handler (version.go) already reads CATALYST_BUILD_SHA via
    envOrTrim with `dev`/`0.0.0` ldflag fallbacks — no Go change
    needed; the version_test.go env-override test already covers it.

Chart bumped 1.4.94 -> 1.4.95.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:56:21 +04:00
e3mrah
c9a46b4f37
fix(api): /api/v1/catalog* proxy on catalyst-api (qa-loop iter-3) (#1205)
Sovereign Console at console.<sov> proxies its /api/* fetches through
catalyst-api's ingress, but Slice-L (#1148) only exposed catalyst-catalog
via a Gateway HTTPRoute attached to the api.<sov> hostname. With no
/api/v1/catalog* route registered on catalyst-api itself, the InstallPage
fetches from console.<sov> 404'd at chi NotFound — even though the same
URL on api.<sov> returned 401 (auth needed, not missing route).

Fix #5's HTTPRoute template explicitly noted this as the in-tier
follow-up. This PR adds the proxy:

  GET /api/v1/catalog                              -> List
  GET /api/v1/catalog/{name}                       -> Get
  GET /api/v1/catalog/{name}/versions/{version}    -> GetVersion

Handlers wrap the existing httpCatalogClient (already wired in main.go
via SetCatalogClient) so no new upstream config is introduced. Routes
are registered inside the auth.RequireSession group so the catalog
surface inherits the same session gate as the rest of /api/v1/*; the
caller's catalyst_session token is forwarded to catalyst-catalog so
its AnonymousReads / per-Org policy still applies.

Empty list returns {"items":[]} (never null) so the UI's
catalog.api.ts decoder + .map() in InstallPage don't trip.

Closes qa-loop iter-3 cluster: catalog-api-404 (TC-031/151/171).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 17:54:24 +04:00
github-actions[bot]
a308fcaa62 deploy: update catalyst images to c5bfa34 2026-05-09 13:13:08 +00:00
e3mrah
c5bfa34b27
fix(api): BE handler 5xx/4xx errors + items envelope (qa-loop iter-2 #17) (#1204)
QA-loop iter-2 cluster: be-handler-errors-5xx-4xx. After Fix #15
(SPA route guard) + Fix #16 (whoami) shipped, the largest remaining
matrix-FAIL cluster is BE handler errors:

- ITEMS-ENVELOPE FAILs (TC-070..075, TC-184/192/194/227): the
  generic /api/v1/sovereigns/{id}/k8s/{kind} surface returned
  "unknown kind" for helmreleases/applications/blueprints/
  useraccesses/organizations/environments. The kinds were reachable
  via per-CRD handlers but the k8scache.Factory's dynamic informer
  pool didn't know about them. Added six entries to DefaultKinds
  with matching ClusterRole verbs per
  feedback_chroot_in_cluster_fallback.md.

- TC-261 (HTTP 404 on /api/v1/version): the endpoint didn't exist.
  Added handler/version.go returning git SHA + chart version + Go
  runtime, with env override for chart-injected truth and ldflag
  fallback for CI-baked-in values. Public route, no auth gate.

- TC-089 (HTTP 503 on /blueprints/curatable when Gitea unwired):
  changed to return 200 + empty list envelope so the UI's empty-state
  renders instead of "Failed to fetch".

Categorisation of the rest of the cluster:

- HTTP 500 cluster (TC-061..068, TC-149): already 200 — Fix #15+#16
  cleared the underlying auth context.
- HTTP 503/200 (TC-088, TC-090, TC-244, TC-235, TC-236) and TC-078:
  matrix-drift; the executor calls POST endpoints with GET, or the
  matrix targets a hard-coded pod name that doesn't exist on
  omantel. Listed in fix-author report for the Test-Plan Author to
  fix in iter-3.
- HTTP 502 (TC-210, TC-211): keycloak proxy SA misconfig in chroot
  Sovereign — separate cluster (out of scope for this fix; the
  catalyst client/role members lookups need a Sovereign-side SA the
  chroot doesn't currently provision).

Tests:
- TestDefaultKinds_GraphAndDashboardSurface pinned to assert the six
  new CRDs stay registered.
- TestHandleVersion_AlwaysJSON / EnvOverride / TrimsWhitespace cover
  the wire shape + truth resolution.
- TestHandleBlueprintListCuratable_GiteaUnwiredReturnsEmptyList
  pins the 200 + empty envelope graceful path.

Chart: bp-catalyst-platform 1.4.93 -> 1.4.94 (ClusterRole change
needs a chart bump; Helm reconciles RBAC on every release).

Refs qa-loop iter-2 cluster be-handler-errors-5xx-4xx.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:09:27 +04:00
github-actions[bot]
ed67bd54bd deploy: update catalyst images to a8aceac 2026-05-09 13:09:16 +00:00
e3mrah
a8aceacf66
fix(ui): SPA route-guard probes /whoami before bouncing to /login (qa-loop iter-2) (#1203)
When the operator has a valid HttpOnly catalyst_session cookie but no
JS-side `catalyst:authed` sessionStorage marker (fresh tab, refresh
after sessionStorage cleared, deep-link paste into a fresh window),
the synchronous rootBeforeLoad gate redirected them to /login despite
holding a valid session. Caught on console.omantel.biz when deep-link
loads of /dashboard from a sibling tab kept bouncing back to the PIN
page even after a successful PIN verify in another tab.

Root cause: hasCatalystSession() reads sessionStorage only — the
catalyst_session cookie is HttpOnly so JS cannot see it. The marker is
set by VerifyPinPage on PIN verify and SovereignConsoleLayout on
whoami 200, but a fresh-tab navigation neither runs VerifyPinPage nor
mounts the layout before the gate fires, so the gate never sees the
operator as authed.

Fix: keep the sync fast-path (marker present → allow), but on missing
marker fall through to an authoritative GET /api/v1/whoami. On 200
cache the marker and allow through. On 401 redirect to /login with
deep-link preserved as ?next=. On 5xx/network error fail open so the
layout's own probe surfaces the failure with proper context.

Per memory feedback_per_issue_playwright_verification.md: live-verified
the full PIN flow + 6 deep-link routes (/dashboard, /cloud, /apps,
/jobs, /users, /settings) on console.omantel.biz both before and after
the fix. The closed-session hard gate
(session_2026_05_09_closed_unverified.md) is satisfied: incognito
PIN flow → /dashboard renders fully + 5 sibling surfaces render.

Files:
- products/catalyst/bootstrap/ui/src/app/auth-gate.ts
  + probeWhoamiAndCacheMarker(): authoritative async cookie check
- products/catalyst/bootstrap/ui/src/app/router.tsx
  rootBeforeLoad async; falls through to whoami probe when marker missing
- products/catalyst/bootstrap/ui/src/app/auth-gate.test.ts
  +5 tests covering 200/401/5xx/network/credentials-include

Refs: qa-loop iter-2 cluster spa-route-guard-rejects-pin-session
Refs: session_2026_05_09_closed_unverified.md

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:07:12 +04:00
github-actions[bot]
655c116c3e deploy: update catalyst images to f8ec683 2026-05-09 12:54:40 +00:00
e3mrah
f8ec683f22
fix(api): include tier + realm_access.roles in /whoami response (qa-loop iter-2) (#1202)
GET /api/v1/whoami silently dropped Tier and RealmAccess.Roles even
though Fix #2 (#1184) stamps tier=owner + realm_access.roles=
[catalyst-owner] into the PIN session JWT. The chroot SPA route-guard
reads these from /whoami to admit the operator into the Sovereign
Console post-PIN-login; without them on the wire the SPA bounced
back to /login (qa-loop iter-2 cluster B, breaking TC-003, TC-091,
TC-122, TC-196).

Surface both fields with the JSON shape the SPA expects:
- top-level "tier" (string)
- nested "realm_access":{"roles":[...]} (object)

Both omitempty so non-RBAC sessions (no tier, no realm roles)
continue to emit the original pre-RBAC wire shape — existing callers
unaffected.
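
A minimal sketch of that wire contract, assuming a hypothetical
whoamiResponse type (the real struct and its other fields are not shown
in this log; `email` is a placeholder). The nested object is a pointer
because encoding/json's omitempty never omits a non-pointer struct value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type realmAccess struct {
	Roles []string `json:"roles"`
}

// whoamiResponse is a hypothetical response type for illustration.
type whoamiResponse struct {
	Email       string       `json:"email"`
	Tier        string       `json:"tier,omitempty"`
	RealmAccess *realmAccess `json:"realm_access,omitempty"`
}

// wireShape marshals resp and decodes it back into a generic map, so the
// caller inspects the JSON keys actually emitted rather than the Go struct.
func wireShape(resp whoamiResponse) (map[string]any, error) {
	raw, err := json.Marshal(resp)
	if err != nil {
		return nil, err
	}
	var m map[string]any
	err = json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	plain, _ := wireShape(whoamiResponse{Email: "op@example.com"})
	rbac, _ := wireShape(whoamiResponse{
		Email:       "op@example.com",
		Tier:        "owner",
		RealmAccess: &realmAccess{Roles: []string{"catalyst-owner"}},
	})
	fmt.Println(plain) // no tier / realm_access keys
	fmt.Println(rbac)
}
```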

Tests:
- TestHandleWhoami_PinSessionRBACClaims pins the wire contract for
  the PIN-stamped {tier=owner, realm_access.roles=[catalyst-owner]}
  session — exercises the actual JSON map shape, not the typed Go
  struct, so a bad json tag would fail loudly.
- TestHandleWhoami_NoRBACOmitsFields pins the omitempty regression:
  a session without RBAC must not introduce tier/realm_access keys.

Coordinates with Fix #15 (SPA route-guard) on the same downstream
symptom — BE serializes the claims, SPA reads them. Does NOT touch
auth/session.go's Claims struct (Fix #2's tier=owner stamping path
preserved).

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 16:52:46 +04:00
github-actions[bot]
5f3e714571 deploy: update catalyst images to 3978fee 2026-05-09 12:04:49 +00:00
e3mrah
3978feea3a
fix(chart): auto-provision catalyst-organization-controller-keycloak Secret on Sovereign install (qa-loop iter-1 Fix #14) (#1201)
organization-controller's binary calls mustEnv("CATALYST_KC_SA_CLIENT_ID")
+ mustEnv("CATALYST_KC_SA_CLIENT_SECRET") (cmd/main.go:60-61) and
CrashLoopBackOffs until the Secret exists.

Pre-1.4.93 the deployment template referenced
catalyst-organization-controller-keycloak with `optional: true` on the
secretKeyRef -> the env vars collapsed to empty -> mustEnv panicked
with "required env var unset". Caught live on omantel during qa-loop
iter-1 Executor (2026-05-09).

New template templates/secret-organization-controller-keycloak.yaml
mirrors the Sovereign-vs-Mothership lookup gate from the existing
templates/catalyst-openova-kc-credentials-secret.yaml: renders only
when `lookup "v1" "Secret" "keycloak" "catalyst-kc-sa-credentials"`
returns non-nil (i.e. on a Sovereign), with EXISTING-TARGET-WINS
precedence so openbao auto-rotation of the source doesn't thrash the
controller pod on every reconcile.

Manual hot-fix already applied to omantel (Secret created from existing
keycloak/catalyst-kc-sa-credentials bytes) — Pod went 0->1/1 Ready
0 restarts. Chart fix lands the same bytes for every future Sovereign
without operator action.

Refs: qa-loop iter-1 cluster kc-sa-secret-organization-controller

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 16:02:43 +04:00
github-actions[bot]
db618cc5eb deploy: update catalyst images to a8c9f89 2026-05-09 12:00:44 +00:00
e3mrah
a8c9f895b8
fix(chart): bump application-controller tag to 3d1deef (qa-loop iter-1) (#1200)
Picks up the chart-binary contract fix:
  PR #1196 — main.go accepts --leader-elect / --leader-elect-namespace
  PR #1199 — Containerfile copies core/controllers/pkg into build stage

Without this bump, omantel still pulls 1b29c71 which crashes on
"flag provided but not defined: -leader-elect".

Refs qa-loop iter-1.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:58:26 +04:00
e3mrah
a834b2cc29
docs(chart): document CRD installation path for chroot Sovereigns (qa-loop iter-1) (#1198)
Adds products/catalyst/chart/CRDS.md documenting:

- The 9 catalyst-domain CRDs in chart/crds/ (auto-applied by Helm on
  install/upgrade)
- The UserAccess XRD living in platform/crossplane-claims/chart (NOT
  here per ADR-0001 §3 — Crossplane is the day-2 IaC for IAM grants)
- Operator-style apply sequence for chroot Sovereigns where Flux is
  suspended and cutover used kubectl apply -f rather than helm install

Context: qa-loop iter-1 Fix #13. omantel chroot Sovereign was missing
all 9 catalyst CRDs + the UserAccess XRD. environment-controller and
useraccess-controller logged 'no matches for kind' indefinitely and
never reached Starting workers. Manual apply restored them. This doc
captures the recovery path so future Sovereigns can be repaired
without re-deriving it from controller stack traces.

Out of scope (other Fix Authors own these clusters):
- Fix #11: ConfigMap
- Fix #12: application-controller flag

No code changes — docs only.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 15:54:22 +04:00
e3mrah
293015b853
fix(chart): create catalyst-runtime-config ConfigMap with KC/Gitea env (qa-loop iter-1) (#1197)
The 3 Group C controller deployments (organization, environment,
application) reference the `catalyst-runtime-config` ConfigMap via
`configMapKeyRef` with `optional: true`. Until this commit the CM
simply did not exist on any Sovereign — `optional: true` collapsed
every key to "" and `mustEnv("CATALYST_KC_ADDR")` in
core/controllers/organization/cmd/main.go failed fast on every Pod
start with `required env var unset`.

Caught live on omantel 2026-05-09 during qa-loop iter-1 (cluster
`catalyst-runtime-config-missing`):

  catalyst-organization-controller   0/1   CrashLoopBackOff
  catalyst-application-controller    0/1   CrashLoopBackOff

Adds:

  - templates/configmap-catalyst-runtime-config.yaml — the missing
    ConfigMap, keys: keycloak-addr, keycloak-realm, gitea-public-url
  - values.yaml `runtime.*` block with operator-overridable defaults
    that match the canonical in-cluster Service FQDNs of bp-keycloak
    (keycloak.keycloak.svc.cluster.local:80) + bp-gitea
    (gitea-http.gitea.svc.cluster.local:3000)

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) every value is
overridable from the per-Sovereign overlay. The contabo Kustomize
path enumerates resources explicitly (templates/kustomization.yaml)
and does NOT include this new file, so contabo continues unaffected.

Chart bump: 1.4.91 → 1.4.92.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:53:11 +04:00
github-actions[bot]
68c40b77e7 deploy: update catalyst images to 7261a10 2026-05-09 11:48:00 +00:00
e3mrah
7261a10d3b
fix(chart): add ghcr-pull imagePullSecrets to 5 Group C controllers (qa-loop iter-1 follow-up) (#1195)
After PR #1194 enabled the 4 Group C controllers, the pods failed
ImagePullBackOff against `ghcr.io/openova-io/openova/<ctrl>-controller:*`
with `401 Unauthorized` because the controller deployment templates
were missing the `imagePullSecrets: [{ name: ghcr-pull }]` block that
every other deployment in the chart already has (catalyst-api, catalyst-ui,
sme-services/*, services/catalog, marketplace-api).

Surfaced live on omantel: 4/4 controller pods stuck in ErrImagePull
within ~30s of the iter-1 apply. Root cause: chart-side oversight in
the original Group C controller scaffolding (slice CC1 #1095) — the
deployments inherited shape from a public-image template instead of
the catalyst-api private-image template.

Per Inviolable Principle #4a: GHCR-published controller images are
private; every Pod that pulls them MUST reference the `ghcr-pull`
Secret rendered by the chart's bootstrap-kit path.

Files changed:
- products/catalyst/chart/templates/controllers/{organization,environment,
  blueprint,application,useraccess}-controller-deployment.yaml: added
  `imagePullSecrets: [{ name: ghcr-pull }]` immediately after
  `automountServiceAccountToken: true` (mirrors api-deployment.yaml shape).
- products/catalyst/chart/Chart.yaml: bumped 1.4.90 → 1.4.91.

Verified via `helm template`: all 5 controller Deployments now render
the imagePullSecrets block.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:45:59 +04:00
github-actions[bot]
2fb254f392 deploy: update catalyst images to c1b9240 2026-05-09 11:43:57 +00:00
e3mrah
c1b92404ee
fix(chart): enable 5 Group C controllers + KC realm-role bootstrap (qa-loop iter-1) (#1194)
EPIC-3 RBAC reconciliation loop was dormant on every Sovereign because
the 5 Group C controllers (organization, environment, blueprint,
application, useraccess) shipped with `enabled: false` and the
KEYCLOAK_BOOTSTRAP_TIER_ROLES env var was hardcoded to "false". Result:
UserAccess CRs created by /api/v1/sovereigns/{id}/rbac/assign never
materialised into RoleBindings + composite realm-roles.

Cluster: controllers-and-kc-bootstrap-gates (qa-loop iter-1).

Changes:
- values.yaml: organization/environment/application/useraccess controllers
  flipped to `enabled: true` and `image.tag` SHA-pinned to the latest
  GHCR-published push-on-main builds (organization/environment/application
  :1b29c71, useraccess :ff2172f) per Inviolable Principle #4a.
- values.yaml: blueprint stays `enabled: false` until first
  push-on-main build of build-blueprint-controller.yaml lands an image
  in GHCR (never reference an image not built by CI).
- values.yaml: new top-level `keycloak.bootstrap.ensureTierRoles: true`.
- api-deployment.yaml: KEYCLOAK_BOOTSTRAP_TIER_ROLES now sources its
  default from `.Values.keycloak.bootstrap.ensureTierRoles` (per slice
  T2 brief #1098/#1146) instead of hardcoded "false".
- .github/workflows/build-blueprint-controller.yaml: new workflow
  scaffolded (mirror of build-application-controller shape) so the
  first commit touching core/controllers/blueprint/** ships a
  CI-built, SHA-pinned, cosign-signed image to GHCR.
- Chart.yaml: bumped 1.4.89 → 1.4.90.

Verified via `helm template`:
- 4 controller Deployments + 4 controller ClusterRoles render (blueprint
  pending image build).
- KEYCLOAK_BOOTSTRAP_TIER_ROLES renders as "true" by default.
- 5 tier ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
  render from platform/crossplane-claims/chart/.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:41:58 +04:00
github-actions[bot]
92228bc4b5 deploy: update catalyst images to 09b35d0 2026-05-09 11:35:08 +00:00
e3mrah
09b35d0943
fix(k8scache): factory.List + tree.GetResourcesBySelector resolve plural alias (qa-loop iter-1) (#1193)
Followup to #1191. The handler-tier Registry.Get already accepts
plural / short-form aliases ("services", "pvc"), but the downstream
indexer lookups in Factory.List and Factory.GetResourcesBySelector
re-canonicalised the raw inbound `kindName` and so still keyed off
the plural form — the indexers map is populated with singular
canonical Names from AddCluster, so "services" missed and the call
returned `k8scache: kind "services" not registered`.

Live evidence post-#1191 deploy on omantel.biz: every cloud-list TC
still 404'd with the new error message ("not registered" instead of
"unknown kind"), proving the handler now resolves the alias but the
factory tier doesn't.

Fix: both lookups go through Registry.Get first to obtain the
canonical singular Name, then index into cs.indexers with that.
metricCacheSize label switches to the canonical form too so plural
and singular variants of the same query roll up to one prometheus
time-series instead of fanning out cardinality.

Tests:
  - TestFactory_ListResolvesPluralAlias — alias forms ("pods", "Pod",
    "PODS", "po") all return the same Pod the canonical "pod" call
    returns; "notakind" still errors.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:33:11 +04:00
e3mrah
1ae25b1df1
fix(ui): normalise resource detail kind URL plural→singular (qa-loop iter-1) (#1192)
qa-loop iter-1 cluster resource-detail-tree-yaml-events. TC-079..083
deep-link the resource detail surface with kubectl-conventional plural
kind segments (`/cloud/resource/services/...`,
`/cloud/resource/deployments/_/cilium/...`). The catalyst-api
k8scache Registry exposes only canonical singular names; PR #1191
landed alias resolution at the BE so plural lookups no longer 404 —
this PR closes the loop on the UI side so widget calls always hit
the canonical singular path (the metrics endpoint, for example,
returns `source: "metrics.k8s.io"` for `pod` but
`source: "unavailable"` for `pods`).

Single new helper in resource.api.ts:

  - `normaliseKindForRegistry(kind)` — table-driven plural→singular
    map mirroring the UI side of `cloud-list/kinds.ts:KIND_TO_REGISTRY`.
    Lower-cases input + leaves canonical singulars untouched + returns
    unknown kinds lower-cased so the BE answers with its
    `unknown-kind` envelope (no silent fall-through).

ResourceDetailPage uses the singular `apiKind` for every API call
(getResource, getResourceTree, YamlEditor, MetricsPanel, EventsPanel
kind filter, ResourceActions, Logs/Exec gates) but keeps the URL-typed
`kind` on the `data-testid="resource-detail-{kind}-{name}"` wrapper so
operator deep-link asserts (`resource-detail-services`,
`resource-detail-deployments`) hold per the iter-1 test matrix.

Tests:
  - resource.api.test.ts — 5 new cases on normaliseKindForRegistry
    (plural mapping, singular passthrough, lower-case + trim, empty
    input, unknown kind passthrough).
  - ResourceDetailPage.test.tsx — 4 new cases: plural-kind testid
    preservation, YamlEditor singular-kind hand-off, cluster-scoped
    deployment with ns="_", null-guard for `initialObj.spec === undefined`
    and `initialObj === {}`.

26/26 targeted tests pass; 66/66 cloud-list directory passes.

Per memory rules:
  - feedback_per_issue_playwright_verification.md — defence-in-depth,
    not the BE fix (that landed in #1191); this closes the UI side so
    every call resolves on the canonical Registry name.
  - feedback_dod_is_the_proof.md — verification deferred to
    Coordinator Executor matrix re-run on the deployed image.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 15:33:04 +04:00
github-actions[bot]
8ff5598bd3 deploy: update catalyst images to ae24194 2026-05-09 11:28:57 +00:00
e3mrah
ae24194920
fix(k8scache): plural + short-name aliases on kind registry (qa-loop iter-1) (#1191)
Iter-1 QA matrix surfaced 5 cloud-list 404s (TC-084 services, TC-085
nodes, TC-090 pvcs, TC-091 namespaces, TC-130) — every call used the
kubectl-conventional plural path segment ('/k8s/services') but the
registry only resolved the canonical singular Name ('service'). The
file-level kinds.go doc claims "an operator who types 'pod', 'Pod',
or 'pods' all hit the same GVR" but only the first two worked.

Two new lookup paths in Registry.Get:

  1. Plural alias index — built from each Kind's GVR.Resource (the
     form `kubectl api-resources` prints). Populated automatically on
     Add(); first registration wins so PodMetrics (GVR.Resource="pods")
     can never shadow core/v1 Pod.
  2. Short-name alias map — small explicit table covering the kubectl
     muscle-memory forms that aren't derivable from GVR.Resource
     (pvc → persistentvolumeclaim, ns → namespace, svc → service, …).
     Includes pluralised short forms (pvcs, pvs) since the matrix uses
     them.
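
The two lookup paths above can be sketched roughly as follows. This is a
simplified illustration, not the k8scache code: Kind carries only a name
and a plural resource, and the alias table is a small assumed subset.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is a simplified stand-in: Name is the canonical singular, Resource
// the plural form taken from the GVR.
type Kind struct {
	Name     string
	Resource string
}

type Registry struct {
	byName   map[string]Kind
	byPlural map[string]string // plural -> canonical Name
	short    map[string]string // short alias -> canonical Name
}

func NewRegistry() *Registry {
	return &Registry{
		byName:   map[string]Kind{},
		byPlural: map[string]string{},
		short: map[string]string{ // assumed subset of the alias table
			"po": "pod", "svc": "service", "ns": "namespace",
			"pvc": "persistentvolumeclaim", "pvcs": "persistentvolumeclaim",
		},
	}
}

// Add registers k. The first Kind to claim a plural keeps it, so a later
// Kind with the same Resource cannot shadow an earlier one.
func (r *Registry) Add(k Kind) {
	name := strings.ToLower(k.Name)
	r.byName[name] = k
	plural := strings.ToLower(k.Resource)
	if _, taken := r.byPlural[plural]; !taken {
		r.byPlural[plural] = name
	}
}

// Get resolves the canonical name first, then the plural alias, then the
// short-name alias. Lookups are case-insensitive.
func (r *Registry) Get(name string) (Kind, bool) {
	key := strings.ToLower(strings.TrimSpace(name))
	if k, ok := r.byName[key]; ok {
		return k, true
	}
	if canon, ok := r.byPlural[key]; ok {
		key = canon
	} else if canon, ok := r.short[key]; ok {
		key = canon
	}
	k, ok := r.byName[key]
	return k, ok
}

func main() {
	r := NewRegistry()
	r.Add(Kind{Name: "pod", Resource: "pods"})
	r.Add(Kind{Name: "podmetrics", Resource: "pods"}) // must not shadow pod
	r.Add(Kind{Name: "service", Resource: "services"})
	for _, q := range []string{"pod", "Pod", "PODS", "svc", "services", "notakind"} {
		k, ok := r.Get(q)
		fmt.Printf("%-9s -> %q ok=%v\n", q, k.Name, ok)
	}
}
```

Callers that index other per-kind maps would use the Name returned by
Get rather than the raw input, so that every alias lands on the same key.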

Backward compatible — singular Names still resolve, and the
helpful-404 'availableKinds' list still shows canonical singulars
only (so the wire-shape contract is unchanged for clients that
already work).

Tests:
  - TestRegistry_PluralAliasResolution — 11 sub-cases covering
    singular, plural, short, plural-short, case-insensitive forms.
  - TestRegistry_PluralDoesNotShadowSingular — guards the
    PodMetrics/Pod GVR.Resource collision via registration order.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:26:55 +04:00
e3mrah
276f86d930
fix(ui): handover error text + login next= hint (qa-loop iter-1 cluster auth-handover-flow-text) (#1190)
The 2026-05-09 routing matrix asserts on `document.body.innerText`
(NOT URL or HTTP status) for both /auth/handover and anonymous
/dashboard. Two body-text contracts were quietly broken:

TC-004 — `/auth/handover` (anon, browser): the BE 302 to
/auth/handover-error?reason=missing_token + the SPA route both work,
but the rendered copy used "did not include" so the literal token
"missing" never appeared in body text. Reword to "is missing its
token". Extract HandoverErrorPage from router.tsx into
pages/auth/HandoverErrorPage.tsx so the body-text contract is owned
by a single file and is unit-testable without booting the router.

TC-009 — `/dashboard` (anon): rootBeforeLoad correctly redirects to
/login?next=/dashboard, but LoginPage's body text only said "Sign in
/ We'll email you a 6-digit code". The matrix expected the literal
tokens "/login" and "next=" in body text. Surface a small <p
data-testid="login-next-hint"> when ?next is present that includes
both tokens plus the destination path. Hidden when ?next is absent
so direct sign-in stays clean.

Tests:
- 5 new HandoverErrorPage cases (each ?reason branch + missing-query
  fallback)
- 2 new LoginPage cases (hint present with ?next, hint absent without)
- All 28 pre-existing auth-gate + AppsPage handover tests still GREEN

Cluster scope honoured: router.tsx import + extraction only, no
changes to BE handlers, AppDetail, or compliance pages.

Refs: qa-loop iter-1 fix #7

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 15:25:08 +04:00
github-actions[bot]
099c765a80 deploy: update catalyst images to a0ed54c 2026-05-09 11:18:13 +00:00
e3mrah
a0ed54cc3a
fix(api): emit immediate snapshot frame on SSE connect (qa-loop iter-1) (#1189)
Three SSE handlers (compliance/stream, applications/{name}/stream,
k8s/stream) only sent a `: connected ...` comment line on connect and
then waited for either an event from the upstream channel or the next
heartbeat (15s default). On a quiet/fresh Sovereign cluster this means
the next `data:` line could be 15s away — past every probe / Executor
timeout (6s) and well past EventSource user expectations.

Fix: emit one `data:` snapshot frame immediately on connect for each
handler.

  - compliance.go: snapshot the current sovereign-scope rollup
    (or an empty `{scope:sovereign,id:<cluster>}` placeholder when
    the aggregator has no state yet). type="snapshot".
  - applications.go: emitSnapshot(true) — forces a `data:` frame even
    when the Application CR doesn't exist (notFound:true). The UI
    renders this as the "not installed" empty state; probes get a
    wire event without waiting for the 2s poll tick.
  - k8s.go: emit a `{type:"ready",cluster,kinds}` frame immediately
    after subscribing. UI clients filter on type:"ready" and treat
    it as the connection ack; smoke tests / probes get a `data:`
    line within the first round-trip.

Adds the unit test TestHandleComplianceStream_ImmediateSnapshotFrame,
which asserts that the first SSE frame on `/compliance/stream` arrives
within 1s (the same shape the existing TestHandleK8sStream_EmitsEvent
uses for its own assertion via initialState=1).

Live verification on console.omantel.biz before fix:

  $ timeout 8 curl -k -N -b cookies.txt \
      'https://console.omantel.biz/api/v1/sovereigns/sovereign-omantel.biz/compliance/stream'
  : connected cluster=sovereign-omantel.biz
  (then nothing — exit code 143 / terminated by timeout)

The same probe will return a `data:` snapshot frame within milliseconds after rollout.

No UI changes. No auth changes. No chart changes. No /audit
handler changes. No /applications PUT/DELETE changes. Per
INVIOLABLE-PRINCIPLES.md #3 the existing event-driven path
(Factory.Subscribe) is unchanged — the snapshot frame is purely
additive on the producer side.

Refs: qa-loop iter-1 cluster sse-timeout-handler-shape
      (TC-030 compliance, TC-041 applications, TC-092 k8s)

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:16:03 +04:00
e3mrah
88ac0ac78f
fix(chart): add imagePullSecrets to catalyst-catalog Deployment (qa-loop iter-1 follow-up) (#1188)
* fix(chart): add imagePullSecrets to catalyst-catalog Deployment (qa-loop iter-1 follow-up)

Follow-up to #1186. Live verification on omantel chroot Sovereign
revealed the catalyst-catalog Pod entered ImagePullBackOff because
the Deployment template was missing `imagePullSecrets`.

Failure on omantel:

  Failed to pull image "ghcr.io/openova-io/openova/catalyst-catalog:9763286":
  failed to authorize: failed to fetch anonymous token: ...
  401 Unauthorized

Same name + namespace pattern as ui-deployment / marketplace-api
(`ghcr-pull` dockerconfigjson Secret in `.Release.Namespace`,
provisioned by the bootstrap-kit slot's per-namespace ghcr-pull seal).

Verified on omantel: after applying the patched Deployment the
Pod transitions through ContainerCreating to Running. Chart 1.4.88
remains in flight; this fix lands as 1.4.89 in the same qa-loop
iter-1 series.

* chart: bump 1.4.88 → 1.4.89 for catalyst-catalog imagePullSecrets fix

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 15:14:00 +04:00
e3mrah
841459fed0
fix(ui): align AppDetail tab test-ids to qa-loop seam map (TC-043..048) (#1187)
Per qa-loop iter-1 cluster `appdetail-tab-testids-ui`: the matrix uses
the convention `data-testid="app-<name>-tab"` on each tab BUTTON in the
AppDetail page tablist. Pre-fix the buttons used the legacy
`sov-app-tab-<name>` ids and the inner sub-tab files (TopologyTab.tsx
etc.) used `app-<name>-tab` on their PANEL root — so the matrix found
nothing on the BUTTON and the panel id collided with what the matrix
actually expected.

Fix:
* Tab buttons in AppDetail.tsx now expose `data-testid="app-<name>-tab"`
  (jobs / dependencies / topology / resources / compliance / logs /
  settings / members). Counts inside the buttons rename to
  `app-<name>-tab-count`.
* Sub-tab panel roots rename their test-id to `app-<name>-tabpanel`
  (TopologyTab, SettingsTab, ComplianceTab, MembersTab, ResourcesTab,
  LogsTab). This eliminates the button↔panel id collision so a
  Playwright `getByTestId('app-topology-tab')` is unambiguous.
* SettingsTab keeps `settings-tab-upgrade-btn` +
  `settings-tab-uninstall-btn` (matrix expectation).

Tests:
* AppDetail.test.tsx: add 8-row qa-loop iter-1 contract suite
  (`it.each(TABS)`) asserting every button id is present, plus
  per-tab click→panel reveal assertions for the 6 EPIC-2/3/4 tabs
  in the cluster.
* AppDetail.test.tsx renderDetail() now wraps the RouterProvider in
  a QueryClientProvider — production wraps the entire app in main.tsx
  but the unit tests were missing it, so every sub-tab's useQuery threw
  "No QueryClient set" and the page never painted. Before the fix, the
  entire 9-test file failed with unrelated errors that masked the real
  assertion signal.
* Back-link assertion updated: post-#1052 chroot Sovereign + provision
  flows both route AppDetail back to /dashboard, not /provision/$id.
* SettingsTab.test.tsx: rename `app-settings-tab` panel assertion to
  `app-settings-tabpanel` to match new convention.

Verification (in /home/openova/repos/openova):
* `npx vitest run src/pages/sovereign/AppDetail.test.tsx
   src/pages/sovereign/AppDetail/SettingsTab.test.tsx` → 26/26 PASS
* `npx tsc --noEmit` → clean

Refs qa-loop iter-1 cluster `appdetail-tab-testids-ui` / TC-043..048.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:12:41 +04:00
github-actions[bot]
3987a4a2c0 deploy: update catalyst images to 1d90ef6 2026-05-09 11:10:09 +00:00
e3mrah
1d90ef66ed
fix(chart): flip services.catalog.enabled=true + wire CATALYST_CATALOG_URL (qa-loop iter-1) (#1186)
Root cause for TC-035..037 (and ~10 related catalog 404s on omantel
chroot Sovereign Console): `services.catalog.enabled` shipped default
`false` (Slice L #1148), so the catalyst-catalog Service / Deployment /
HTTPRoute were never rendered. Every `/api/v1/catalog*` call therefore
404'd at the Cilium Gateway. The catalyst-api in-process CatalogClient
was wired (cmd/api/main.go:259) but pointed at a non-existent upstream.

Three coupled changes (chart 1.4.87 → 1.4.88):

1. values.yaml: `services.catalog.enabled: true` (default-on).
   Catalyst-api treats catalog 502/503 as a clean error path
   (handler/applications.go surfaces `catalog upstream` detail), so
   default-on is safe even on Sovereigns where the Gitea catalog
   Orgs aren't yet provisioned. Disable explicitly for offline /
   CI render checks (Inviolable Principle #4 — runtime-overridable).

2. values.yaml: `services.catalog.image.tag: "9763286"` — pinned to
   the latest SUCCESS run of the catalyst-catalog GitHub Actions
   workflow (per Inviolable Principle #4a, no `:latest`). Future CI
   bumps will land via the catalyst-catalog-image-built
   repository_dispatch hop (catalyst-catalog-build.yaml `notify` job
   → downstream chart-bump PR; this hop ships in a follow-up).

3. api-deployment.yaml: explicit `CATALYST_CATALOG_URL` env var on
   catalyst-api pointing at `http://catalyst-catalog.catalyst-system.
   svc.cluster.local:8080` (matches the Service rendered by
   templates/services/catalog/service.yaml in `.Release.Namespace`).
   Prior code-only default in `cmd/api/main.go` pointed at
   `openova-system` (a stale namespace from earlier draft); the chart
   now documents the wiring contract in the manifest itself.

Verified locally:
- helm template (default render): Service / Deployment / SA / RBAC
  for catalyst-catalog all render. CATALYST_CATALOG_URL env var
  appears on catalyst-api Pod.
- helm template (with ingress.hosts.api.host set): HTTPRoute for
  `/api/v1/catalog` PathPrefix renders cleanly attached to the
  cilium-gateway parentRef.

Live verification (post-merge): catalog Pod Running on omantel
chroot Sovereign + curl /api/v1/catalog returns HTTP 200 / 401
(NOT 404).

Refs: qa-loop iter-1, cluster `catalog-svc-deployment-and-proxy`,
TC-035 / TC-036 / TC-037 + related catalog 404s.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 15:08:11 +04:00
e3mrah
65b5ceb345
fix(ui): null-guard compliance dashboard render path (qa-loop iter-1) (#1185)
TC-024 (`/sre/compliance`) and TC-025 (`/sec/compliance`) crashed
with "Something went wrong" + a TypeError on cold-start sovereigns.
Root cause: catalyst-api's `HandleComplianceScorecard` builds the
response by appending to nil `[]Score` slices for organizations /
environments / applications. Go's `encoding/json` serializes a nil
slice as JSON `null`, so the wire payload arrives as
`{ organizations: null, environments: null, applications: null }`.
The dashboard then called `.map()` / `.filter()` / `.length` on
`null`, throwing during render.
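
The nil-slice behaviour behind this crash is easy to reproduce in isolation. The sketch below uses hypothetical `Score` and `Scorecard` types (the real response struct is not shown in this message) and only demonstrates how `encoding/json` treats a nil slice versus an empty one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Score and Scorecard are hypothetical stand-ins for the real response types.
type Score struct {
	Name  string `json:"name"`
	Value int    `json:"value"`
}

type Scorecard struct {
	Organizations []Score `json:"organizations"`
	Environments  []Score `json:"environments"`
}

// encode marshals a scorecard and returns the JSON text.
func encode(sc Scorecard) string {
	b, err := json.Marshal(sc)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Slices that were never appended to stay nil and serialize as null.
	var nilCard Scorecard
	fmt.Println(encode(nilCard)) // {"organizations":null,"environments":null}

	// Explicitly initialized empty slices serialize as [].
	emptyCard := Scorecard{Organizations: []Score{}, Environments: []Score{}}
	fmt.Println(encode(emptyCard)) // {"organizations":[],"environments":[]}
}
```

Initializing the slices on the producer side would also avoid the `null` payload, but that is outside the frontend-only scope of this fix.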

Frontend-only fix per qa-loop scope (Fix #4 cluster boundary):

  • `compliance.api.ts` — add `normalizeScorecard()` that coerces
    every slice to `[]` and supplies a fallback Sovereign score.
    `getScorecard` now runs every wire payload through it.
  • `SREDashboardPage.tsx` — also normalize `initialDataOverride`
    so the test seam tolerates the same wire shape, and rebase
    `isEmpty` off the (already-normalized) `merged` value.
  • `ComplianceTreemap.tsx` — fall back to `'—'` when a payload
    node has no `name` so the cell renderer can't crash on a
    sparse node.
  • New regression tests render the SRE Lead and Security Lead
    dashboards with an all-null wire payload and assert they
    surface the empty state instead of throwing.

Fix #4 — qa-loop iter-1, cluster `compliance-dashboard-crash`.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
2026-05-09 15:07:10 +04:00
github-actions[bot]
4009b61b9a deploy: update catalyst images to c4e1895 2026-05-09 11:05:33 +00:00
e3mrah
c4e1895f6c
fix(auth): stamp tier=owner + realm_access.roles on PIN-derived sessions (qa-loop iter-1) (#1184)
Closes the rbac-audit-403-gates cluster (TC-063..069/077): every privileged
catalyst-api endpoint backed by rbacAssignCallerAuthorized /
policyModeCallerAuthorized was returning 403 to PIN-authenticated
operators because the session JWT minted at /auth/pin/verify carried
only {sub, email, role} — no `tier`, no `realm_access.roles`.

Endpoints affected:
- GET  /api/v1/sovereigns/{id}/audit/rbac           (TC-063)
- GET  /api/v1/sovereigns/{id}/audit/rbac/stream    (TC-064)
- POST /api/v1/keycloak/users / /groups / /roles    (TC-065..069)
- POST /api/v1/blueprints/curate                    (TC-077)
- (and: continuum audit, policy_mode, blueprints/curate-list)

Root cause: HandlePinVerify built a jwt.MapClaims with only the legacy
single-string `role` field. The EPIC-3 (#1098) RBAC gates walk
claims.RealmAccess.Roles or claims.Tier — both were empty, so the gate
function returned false even for the Sovereign owner authenticated
via PIN-IMAP.

Fix: stamp pinSessionTier ("owner") + pinSessionRealmRole
("catalyst-owner") onto every PIN-derived session JWT, alongside the
existing role/sub/email claims.
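
As a rough illustration of why the gates rejected the old claims and accept the stamped ones, consider the sketch below. Only the constant names `pinSessionTier` and `pinSessionRealmRole` and their values come from this message; the `Claims` shape, `stampPinSession`, and `gateAccepts` are assumptions made for the example and do not reproduce the real gate functions.

```go
package main

import "fmt"

const (
	pinSessionTier      = "owner"
	pinSessionRealmRole = "catalyst-owner"
)

// RealmAccess and Claims model only the fields named in the commit message;
// the real claim types are assumed, not copied.
type RealmAccess struct {
	Roles []string
}

type Claims struct {
	Sub         string
	Email       string
	Role        string
	Tier        string
	RealmAccess RealmAccess
}

// stampPinSession returns a copy of c carrying the owner tier and realm role.
func stampPinSession(c Claims) Claims {
	roles := append([]string{}, c.RealmAccess.Roles...) // copy, do not alias
	c.Tier = pinSessionTier
	c.RealmAccess.Roles = append(roles, pinSessionRealmRole)
	return c
}

// gateAccepts is a hypothetical gate: it accepts the owner tier or the
// owner realm role and ignores the legacy single-string Role field.
func gateAccepts(c Claims) bool {
	if c.Tier == pinSessionTier {
		return true
	}
	for _, r := range c.RealmAccess.Roles {
		if r == pinSessionRealmRole {
			return true
		}
	}
	return false
}

func main() {
	legacy := Claims{Sub: "u1", Email: "op@example.test", Role: "owner"}
	fmt.Println("legacy accepted:", gateAccepts(legacy))
	fmt.Println("stamped accepted:", gateAccepts(stampPinSession(legacy)))
}
```

The legacy claims fail the gate even though `Role` is set, which matches the 403 behaviour described above.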

Why owner: PIN-via-IMAP authentication proves control of the Sovereign's
mail-domain inbox; that IS the canonical proof of ownership of the
Sovereign chroot (the only operator who can receive the 6-digit code is
the one provisioned with mailbox access on the Sovereign's stalwart
instance). Stamping tier=owner makes the JWT's authorization context
match the real-world authority the auth flow already granted.

Per CLAUDE.md INVIOLABLE-PRINCIPLES #5 (least privilege): the stamp
happens ONLY at PIN-verify (i.e. only after the operator proved IMAP
control); pre-PIN sessions never carry these claims.

Test: TestPinVerify_StampsTierAndRealmRoleClaims pins the contract
end-to-end — decodes the JWT cookie, asserts both Tier and
RealmAccess.Roles are populated, and feeds the parsed Claims through
the actual rbacAssignCallerAuthorized + policyModeCallerAuthorized
gate functions to prove they accept.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:03:34 +04:00
github-actions[bot]
500b800709 deploy: update catalyst images to b9f0992 2026-05-09 09:52:53 +00:00
e3mrah
b9f09926d0
fix(rbac): add cutover-driver permissions for apps.openova.io + dr.openova.io (#1179)
Caught live on omantel iter-1 of qa-loop:

TC-040 → HTTP 500 with body:
  applications.apps.openova.io is forbidden: User
  "system:serviceaccount:catalyst-system:catalyst-api-cutover-driver"
  cannot list resource applications in API group apps.openova.io

TC-099 → HTTP 500 with body:
  continuums.dr.openova.io is forbidden: ...

EPIC-2 slice I (#1152) added the Application install handler. EPIC-6
slice U-DR-1 (#1162) added the Continuum DR handlers. Neither slice
updated the catalyst-api-cutover-driver ClusterRole — same violation as
PR #1173 (events.k8s.io + wgpolicyk8s.io).

Per `feedback_chroot_in_cluster_fallback.md`: every new GVR added to
catalyst-api dynamic-client paths MUST get matching ClusterRole rules
in the same PR.

Adds:
- apps.openova.io applications: create + get/list/watch/update/patch/delete
- dr.openova.io continuums: create + get/list/watch/update/patch/delete

split per `feedback_rbac_create_no_resourcenames.md`.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:50:46 +04:00
github-actions[bot]
4f49cefff1 deploy: update catalyst images to 56262df 2026-05-09 08:52:49 +00:00
e3mrah
56262df649
fix(auth): VerifyPinPage + /auth/handover set catalyst:authed marker BEFORE navigating (#1090 cluster A3) (#1174)
LIVE BUG report 2026-05-09: operator submits correct PIN at
console.omantel.biz/login, BE logs "pin/verify: session established"
+ HTTP 200 with HttpOnly catalyst_session cookie set, but the SPA
immediately redirects back to /login.

Root cause: PR #1109 (cluster A2) added rootRoute.beforeLoad with
hasCatalystSession() — synchronous gate that reads
sessionStorage['catalyst:authed']. The HttpOnly cookie is invisible
to JS, so SovereignConsoleLayout sets that marker AFTER its async
/whoami probe returns. But on the post-PIN-verify navigation, the
gate runs BEFORE SovereignConsoleLayout mounts → marker is empty →
gate redirects back to /login. Bounce loop.

Two fixes:

1. VerifyPinPage success branch sets the marker BEFORE navigation
   AND switches navigate() → window.location.replace() so the next
   page boot reads the cookie via a fresh /whoami round-trip
   (matches the pattern Fix #A used for the unauth path).

2. /auth/handover route's beforeLoad sets the marker too — the
   server-side AuthHandover handler 302-redirects with the cookie set,
   so by the time we reach this safety-net route the cookie exists;
   the marker just needs to track that.

Anti-regression for the marker race: SovereignConsoleLayout STILL
sets the marker after probeSessionCookie returns (preserves the
post-cookie-set race recovery from PR #1109). Both seams set it
defensively.

DoD: post-PIN-verify navigation lands on /dashboard (or `next` if
present), NOT bounced to /login. Confirmed BE side already works
(8h session minted on 200 response).

Co-authored-by: Hati Yildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:50:40 +04:00
github-actions[bot]
91ca7531ff deploy: update catalyst images to 3cc24be 2026-05-09 08:37:40 +00:00
e3mrah
3cc24beff6
fix(rbac): add cutover-driver permissions for wgpolicyk8s + events.k8s.io (#1173)
* fix(build): unblock Build & Deploy Catalyst — Containerfile + test typing

The Build & Deploy Catalyst workflow has been failing on every PR since
EPIC-2 Slice I (#1152) merged. Two real bugs caught after the founder
flagged that no images had been built or deployed:

1. catalyst-api Containerfile: the replace directive added by slice I
   (`replace github.com/openova-io/openova/core/controllers => ../../../../core/controllers`)
   resolves to /core/controllers when WORKDIR=/app. The Containerfile only
   copied products/catalyst/bootstrap/api/go.{mod,sum}, not the controllers
   tree, so `go mod download` failed with "no such file or directory" on
   /core/controllers/go.mod. Fix: COPY the controllers tree BEFORE go mod.

2. SessionsPage.test.tsx (slice X2+E #1169): vi.fn(async () => SEED) infers
   parameter tuple as `[]`, so `lastCall[1]` was a TS2493 type error
   ("Tuple type '[]' of length '0' has no element at index '1'"). Cast
   lastCall to the actual listSessions signature.
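
The path arithmetic in item 1 can be checked with a short sketch. It joins the relative `replace` target onto the directory holding go.mod, first for the container (`/app`) and then for a checkout; the `/repo` root is a hypothetical location used only for comparison.

```go
package main

import (
	"fmt"
	"path"
)

// resolveReplace joins a relative replace target onto the directory that
// contains go.mod and returns the cleaned, slash-separated result.
func resolveReplace(moduleDir, target string) string {
	return path.Join(moduleDir, target)
}

func main() {
	const target = "../../../../core/controllers"

	// Inside the image, go.mod sits in WORKDIR=/app, so the four ".."
	// segments clamp at the filesystem root.
	fmt.Println(resolveReplace("/app", target)) // /core/controllers

	// In a checkout (hypothetical root /repo) the same directive lands
	// inside the repository.
	fmt.Println(resolveReplace("/repo/products/catalyst/bootstrap/api", target)) // /repo/core/controllers
}
```

This is why the directive works in a checkout but needs the controllers tree copied to `/core/controllers` in the image before `go mod download` runs.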

Per canon §7 + the founder's "you are the merger" rule, this is the kind
of CI-pipeline regression that MUST be caught BEFORE claiming slice
completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rbac): add cutover-driver permissions for wgpolicyk8s + events.k8s.io

Caught live on omantel during qa-loop setup after image_roll(da1d3d1):

  failed to list events.k8s.io/v1, Resource=events: events.events.k8s.io
    is forbidden: User "system:serviceaccount:catalyst-system:catalyst-api-cutover-driver"
    cannot list resource "events" in API group "events.k8s.io"

  failed to list wgpolicyk8s.io/v1alpha2, Resource=policyreports:
    policyreports.wgpolicyk8s.io is forbidden

EPIC-1 slice W (#1139) added PolicyReport + ClusterPolicyReport to
DefaultKinds. EPIC-4 slice R (#1167) added Event kind. Neither slice
updated the catalyst-api-cutover-driver ClusterRole — violation of the
canon rule from `feedback_chroot_in_cluster_fallback.md`:
  "Future GVRs added to handlers via the dynamic client MUST get
   matching catalyst-api-cutover-driver ClusterRole rules in the same PR."

Adds:
- wgpolicyk8s.io {policyreports, clusterpolicyreports} get/list/watch
- events.k8s.io events get/list/watch

After this lands + image_roll, the qa-loop can run without the chroot
informer log-storm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:35:30 +04:00
github-actions[bot]
3b8734f27f deploy: update catalyst images to da1d3d1 2026-05-09 08:31:55 +00:00
e3mrah
da1d3d1ffa
fix(build): unblock Build & Deploy Catalyst — Containerfile + test typing (#1172)
* fix(build): unblock Build & Deploy Catalyst — Containerfile + test typing

The Build & Deploy Catalyst workflow has been failing on every PR since
EPIC-2 Slice I (#1152) merged. Two real bugs caught after the founder
flagged that no images had been built or deployed:

1. catalyst-api Containerfile: the replace directive added by slice I
   (`replace github.com/openova-io/openova/core/controllers => ../../../../core/controllers`)
   resolves to /core/controllers when WORKDIR=/app. The Containerfile only
   copied products/catalyst/bootstrap/api/go.{mod,sum}, not the controllers
   tree, so `go mod download` failed with "no such file or directory" on
   /core/controllers/go.mod. Fix: COPY the controllers tree BEFORE go mod.

2. SessionsPage.test.tsx (slice X2+E #1169): vi.fn(async () => SEED) infers
   parameter tuple as `[]`, so `lastCall[1]` was a TS2493 type error
   ("Tuple type '[]' of length '0' has no element at index '1'"). Cast
   lastCall to the actual listSessions signature.

Per canon §7 + the founder's "you are the merger" rule, this is the kind
of CI-pipeline regression that MUST be caught BEFORE claiming slice
completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deploy: update catalyst images to 7235431

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-09 12:28:59 +04:00
e3mrah
9763286900
feat(z): cross-EPIC follow-ups — lastLuaRecord + fleet alerts + edit-pr (#1095/#1096/#1099/#1101) (#1170)
Slice Z bundles three small flags surfaced during EPIC-1..6 implementation
into one PR; each is <50 LOC, none blocks shipping individually.

Z1 — K-Cont-2: surface status.lastLuaRecord after PDM commit
- Continuum reconciler's runSwitchover wraps PDMCommit so a successful
  /v1/lua/commit patches Continuum.status.lastLuaRecord with the
  records-array shape U-DR-1's LuaRecordView already parses (records[].body).
- status.lastLuaRecordAt stamped server-side (RFC3339); rollbacks
  re-track to rolled-back records ("status reflects what PDM has").
- CRD extended: explicit status.lastLuaRecord (records[].{hostname,body,
  ttl,primaryRegion}) + status.lastLuaRecordAt fields. Server-side
  apply confirmed.

Z2 — EPIC-1 score aggregator → U-Fleet alerts count
- ComplianceHandler.SovereignAlertCount(clusterID) — len(violationsFor(
  clusterID, "")) with nil-tolerant receiver. Returns the per-cluster
  failing (resource, policy) pair count from the existing aggregator.
- summarizeSovereign() reads it instead of returning the alerts: 0
  placeholder. h.compliance unwired → 0 (dashboard stays green when
  the aggregator isn't wired).
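
A nil-tolerant receiver of the kind Z2 describes might look like the sketch below. The `violation` type and the map-backed store are assumptions for illustration; the real method delegates to `violationsFor`, which is not reproduced here.

```go
package main

import "fmt"

// violation is a hypothetical (resource, policy) pair.
type violation struct {
	Resource string
	Policy   string
}

// ComplianceHandler here holds a simplified in-memory store keyed by cluster ID.
type ComplianceHandler struct {
	violations map[string][]violation
}

// SovereignAlertCount returns the number of failing pairs for clusterID.
// Calling it on a nil handler is safe and reports zero alerts.
func (h *ComplianceHandler) SovereignAlertCount(clusterID string) int {
	if h == nil {
		return 0
	}
	return len(h.violations[clusterID])
}

func main() {
	var unwired *ComplianceHandler
	fmt.Println(unwired.SovereignAlertCount("sovereign-a")) // 0

	wired := &ComplianceHandler{violations: map[string][]violation{
		"sovereign-a": {
			{Resource: "deploy/web", Policy: "require-limits"},
			{Resource: "deploy/api", Policy: "require-limits"},
			{Resource: "pod/job-1", Policy: "disallow-latest"},
		},
	}}
	fmt.Println(wired.SovereignAlertCount("sovereign-a")) // 3
}
```

Putting the nil check inside the method lets callers such as a summary handler skip their own "is compliance wired" branch.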

Z3 — Gitea PR write seam for YamlEditor flux-managed branch
- gitea.Client.CreatePullRequest + findOpenPR: typed PullRequest shape,
  409 race re-fetches existing PR (mirrors EnsureRepo pattern). Repo
  404 → ErrRepoNotFound.
- gitea.Client.EnsureBranch promoted to GiteaBlueprintClient interface
  (was already on Client).
- POST /api/v1/sovereigns/{id}/blueprints/edit-pr — body {org, path,
  content, message, title}. Auth: applicationInstallCallerAuthorized
  (tier-admin or higher), mirrors /publish. Branch name deterministic
  per (path, content-hash) — same edit re-targets the same PR via 409
  fallback. EnsureBranch + PutFile + CreatePullRequest against
  <org>/shared-blueprints. 503 when Gitea unwired; 400 on bad input;
  404 when repo missing.
- UI: editPRBlueprint in catalog.api.ts. YamlEditor's flux Apply
  branch posts to /blueprints/edit-pr → renders prURL link
  ([data-testid=yaml-editor-pr-link]). Org slug derived from
  catalyst.openova.io/organization label with namespace fallback.
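
One way to derive a branch name that is deterministic per (path, content-hash) is sketched below. The `edit/` prefix, the 12-character digest, and the function name `editPRBranchName` are assumptions for illustration; the handler's actual naming scheme is not shown in this message.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// editPRBranchName hashes the path and content together and returns a short,
// stable branch name. The zero byte keeps ("ab", "c") and ("a", "bc") apart.
func editPRBranchName(path string, content []byte) string {
	h := sha256.New()
	h.Write([]byte(path))
	h.Write([]byte{0})
	h.Write(content)
	return "edit/" + hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	a := editPRBranchName("apps/web.yaml", []byte("replicas: 2\n"))
	b := editPRBranchName("apps/web.yaml", []byte("replicas: 2\n"))
	c := editPRBranchName("apps/api.yaml", []byte("replicas: 2\n"))
	d := editPRBranchName("apps/web.yaml", []byte("replicas: 3\n"))

	fmt.Println("same edit, same branch:", a == b)
	fmt.Println("different path, different branch:", a != c)
	fmt.Println("different content, different branch:", a != d)
}
```

Because the same edit always maps to the same branch, a repeated submission collides with the existing pull request, which is the case the 409 fallback handles.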

Tests
- Z1: TestRunSwitchover_PatchesLastLuaRecord +
  TestPatchStatus_LuaRecordOnlyOnNonNil +
  TestLuaRecordStatusValue_NilOnEmpty.
- Z2: TestCompliance_SovereignAlertCount (real aggregator + 3
  violations + nil-receiver guard) +
  TestHandleFleetSovereignSummary_AlertsFromCompliance (200 with seeded
  state) + TestHandleFleetSovereignSummary_AlertsZeroWhenComplianceNil.
- Z3: TestCreatePullRequest_HappyPath + RejectsMissingArgs +
  RepoNotFound + 409ReFetchesExisting (gitea client) +
  TestHandleBlueprintEditPR_OpensPR + DeterministicBranchPerContent +
  403WhenNotTierAdmin + 503WhenGiteaUnwired + 404WhenRepoMissing +
  BadRequest + TestEditPRBranchName_DeterministicAndPathSensitive
  (handler) + YamlEditor vitest "flux Apply opens PR" + "surfaces
  server error" (UI).

go test -count=1 -race ./... clean across core/controllers + catalyst-api;
go vet ./... clean; npm run typecheck clean for changed UI files
(the SessionsPage.test.tsx tsc error is pre-existing from #1169, per canon §7).
CRD applies via kubectl apply --dry-run=server.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:54:06 +04:00
e3mrah
7b59292cad
feat(catalyst-ui): X2+E — xterm.js logs viewer + Guacamole exec + session list + replay (slice X2+E1+E2+E3, #1099) (#1169)
EPIC-4 final slice. Replaces the Logs/Exec placeholders shipped by R
(#1167) with target-state implementations and lays the surface for the
Guacamole-fronted recorded shell flow.

UI (catalyst-ui):
  - widgets/cloud-list/LogViewer.tsx — xterm.js viewer for the X1
    Pod-log WebSocket. Container picker (multi-container Pods),
    search box (⌃F / ⌘F), 10k scrollback, reconnect-with-since on
    disconnect (per X1 resume protocol).
  - widgets/cloud-list/ExecPanel.tsx — Open Shell button → POST
    /k8s/exec/.../session → Guacamole iframe. 5s iframe-load timeout
    OR onError → falls through to xterm.js + X1-style fallback
    WebSocket; banner explains "recording disabled" on fallback.
  - pages/sovereign/sessions/SessionsPage.tsx — guacamole session list
    + filter (pod/user) + paginate + Replay modal. Mounted on both
    /provision/$id/sessions (mothership) and /sessions (chroot).
  - pages/sovereign/cloud-list/ResourceDetailPage.tsx — Logs tab now
    renders LogViewer; Exec tab now renders ExecPanel. Non-Pod kinds
    surface a "drill into Tree to find Pods" hint.
  - resource.api.ts — adds logsWebSocketURL + execWebSocketURL +
    createExecSession + listSessions + getSessionReplay helpers (single
    URL truth per INVIOLABLE-PRINCIPLES #4).

API (catalyst-api):
  - internal/handler/k8s_exec.go — three new endpoints:
      POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session
        (tier-developer or higher; calls GuacamoleClient.CreateSession;
        emits guacamole-session-opened audit)
      GET  /api/v1/sovereigns/{id}/sessions?from=&to=&pod=&user=&page=
        (tier-admin or higher; paginated; reads from GuacamoleClient
        OR in-memory fallback when no client is wired)
      GET  /api/v1/sovereigns/{id}/sessions/{sessionId}/replay
        (admin/owner only — sessions.playback per EPIC-3 §6.2; emits
        guacamole-session-replayed audit)
  - internal/handler/k8s_exec_ws.go — direct WebSocket exec fallback
    (bidi pump; xterm.js client) for when Guacamole iframe is blocked.
  - GuacamoleClient interface + in-memory fallback session store: the
    chroot Sovereign / CI flow renders cleanly even when Guacamole isn't
    deployed; production wires the real client via SetGuacamoleClient.
  - Audit-type predicate IsGuacamoleAuditType + 3 canonical type names
    (guacamole-session-opened/closed/replayed). Reuses the EPIC-3 U5-U8
    audit Bus + the slice K+P+X1+G's reservation per the canonical seam
    map; future audit consumers filter via prefix `guacamole-*`.

Tests:
  - 9 LogViewer / ExecPanel / SessionsPage vitest test files, 38 tests
    passing in `pages/sovereign/cloud-list/` + `widgets/cloud-list/` +
    `pages/sovereign/sessions/`.
  - 22 Go test functions in k8s_exec_test.go + k8s_exec_ws_test.go
    covering happy/forbidden/not-found/audit-emit/pagination/filter
    paths. `go test -count=1 -race ./internal/handler/` clean.
  - 6 Playwright snapshot tests at 1440x900 in
    `e2e/logs-exec-sessions.spec.ts` covering LogViewer / search box /
    ExecPanel idle / ExecPanel post-click / SessionsPage list / filter.

`npm run typecheck` clean. `go vet ./...` clean. Pre-existing UI test
failures (12 files, 99 tests) confirmed identical to main per canon §7.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:18:06 +04:00
e3mrah
21810a3760
feat(catalyst-ui): R — resource browser drill-down + tree + YAML editor + events + metrics + actions (slice R, #1099) (#1167)
EPIC-4 Slice R bundle layered on the K+P+X1+G backend (#1164):
- R1 ResourceDetailPage with 7 tabs (Overview / YAML / Logs / Exec / Events / Metrics / Tree); routes mounted on both mothership (/provision/$id/cloud/resource/...) and chroot (/cloud/resource/...) trees.
- R2 ResourceTree widget with owner-walk UP and selector-walk DOWN, server-side at /k8s/{kind}/{ns}/{name}/tree using new k8scache GetResourcesByOwner + GetResourcesBySelector indexer-only paths.
- R3 YamlEditor with side-by-side diff, dry-run validation, flux-vs-manual branching (manual → /apply, flux → PR seam wired for the unified Gitea client).
- R4 EventsPanel filtering events.k8s.io/v1 Events by regarding-object; new "event" kind added to k8scache DefaultKinds.
- R5 MetricsPanel with Recharts sparkline; rolls up PodMetrics across owned Pods for Deployment/StatefulSet/DaemonSet.
- R6 ResourceActions widget: scale (Deployment/StatefulSet), restart (annotation stamp), delete (typed-confirmation gate). All mutation endpoints tier-admin gated server-side via the canonical applicationInstallCallerAuthorized seam — UI hide is convenience only.

K8sListPage rows are now clickable and navigate to the detail page.

7 server-side endpoints added under /api/v1/sovereigns/{id}/k8s/{kind}/{ns}/{name}: GET, /tree, /scale, /restart, /dry-run, /apply, DELETE — plus /k8s/metrics/{kind}/{ns}/{name}.

New k8scache.Factory accessors: DynamicClientFor + RedactForKind. Same lifecycle as CoreClient — no second per-cluster pool.

Tests: 37 new vitest cases (ResourceTree / YamlEditor / EventsPanel / MetricsPanel / ResourceActions / ResourceDetailPage / resource.api) all passing. 12 new Go test funcs covering GET / scale / restart / delete / dry-run / apply / tree / metrics + tree.go owner+selector walks. 8 Playwright snapshots at 1440x900 (one per tab + list-row entry).

Pre-existing baselines untouched: 59 lint errors (matches main); 12 vitest test files / 98 vitest tests still failing on main (StepComponents + cosmetic-guards + AppDetail), zero introduced by this slice; pre-existing TestGetKubeconfig_ReadsFromPathPointer TempDir-cleanup race observed only with -race + parallel run, passes in isolation.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 10:34:01 +04:00