openova

Author	SHA1	Message	Date
e3mrah	115c58885b	fix(cilium-gateway): allow world ingress to reserved:ingress (unblocks Sovereign public surfaces) (#1482) * fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled. On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks Ready=False — Cilium Gateway has no valid TLS secret → envoy listener never binds → public TLS handshake to console.<fqdn> dies with SSL_ERROR_SYSCALL. Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ? staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}. Default behaviour unchanged for non-QA (production) Sovereigns — they still resolve to letsencrypt-dns01-prod-powerdns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and a default-deny CCNP is present, every public request to a Sovereign host (console, auth, gitea, registry, api, ...) hits the gateway listener and gets DENIED at envoy's cilium.l7policy filter with: cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy. Root cause: Cilium creates a special endpoint with identity reserved:ingress (8) representing the gateway listener. By default this endpoint has policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace endpointSelector does NOT cover this endpoint (it has no io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes are Programmed, backends are healthy in-cluster, but every request 403s. Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork fix (#1480) finally activated host-bind on :30443. Verified by: - envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443 - cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1] - transiently applying the same CCNP via kubectl: console.omantel.biz → 200 Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world, cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver, plus egress to all so envoy can forward to any backend service. This is the canonical Cilium hostNetwork Gateway-API zero-trust pattern. Chart bump: catalyst 1.4.142 → 1.4.143. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <catalyst@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>	2026-05-14 18:50:34 +04:00
github-actions[bot]	fb99ae5fd0	deploy: update catalyst images to `a88e132`	2026-05-14 14:27:51 +00:00
github-actions[bot]	5752fc751f	deploy: update catalyst images to `bdceb3a`	2026-05-14 12:45:34 +00:00
github-actions[bot]	0e4cb67319	deploy: update catalyst images to `690d588`	2026-05-14 12:40:44 +00:00
github-actions[bot]	195c6b5bc5	deploy: update catalyst images to `13d79c7`	2026-05-14 12:35:31 +00:00
github-actions[bot]	5527652b49	deploy: update catalyst images to `f334950`	2026-05-14 12:29:07 +00:00
github-actions[bot]	fb8303766e	deploy: update catalyst images to `587a985`	2026-05-14 10:18:12 +00:00
github-actions[bot]	bb2726bcf9	deploy: update catalyst images to `f110a54`	2026-05-14 06:51:04 +00:00
github-actions[bot]	b4c96a6d0d	deploy: update catalyst images to `df1dfed`	2026-05-14 06:30:40 +00:00
github-actions[bot]	331e6b2834	deploy: update catalyst images to `b4c2f54`	2026-05-14 06:12:28 +00:00
github-actions[bot]	2f5b1cd0ee	deploy: update catalyst images to `4814c68`	2026-05-14 05:55:28 +00:00
github-actions[bot]	f5929e6114	deploy: update catalyst images to `2626d40`	2026-05-14 04:27:53 +00:00
github-actions[bot]	edf8e6fd18	deploy: update catalyst images to `c267ab5`	2026-05-14 04:20:59 +00:00
e3mrah	c267ab5338	fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72) (#1465) * fix(flow_snapshot): region-scope dep edges (no cross-region wiring) Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's install-* nodes all rendered dep arrows pointing at PRIMARY's install nodes — cross-region edges where NAMING-CONVENTION §1.3 demands independent fault domains (no cross-region wiring). Root cause: helmwatch.Bridge persists secondary-region Jobs with bare dep names ("install-cilium") because HR.spec.dependsOn carries chart names without region context. The snapshot composer's normaliser turned `install-cilium` → `<depID>:install-cilium` which IS the primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`. Every secondary install therefore drew a phantom cross-region edge. Fix: in flow_snapshot_local.go, region-scope dep names when the source Job is regional: jobRegion=="hel1-2" + dep="install-cilium" → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium" Same fix applied to the Layer-2 hrDeps derivation path (per-AppID lookup also gets bare chart names from the primary watcher). hrDeps lookup is now done with the unprefixed AppID so it actually hits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud-init): wait for private NIC before k3s install (prov #71) Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks BEFORE the NIC is ready, renders netplan with only eth0, and the private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN. Effect on secondary CPs: k3s server starts with --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2 and fatals on "listen tcp 10.0.11.2:2380: bind: cannot assign requested address" then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service restart counter reached 5394, kubeconfig never PUT back to mothership, canvas showed secondary region as a permanent black hole. Diagnosed via Hetzner rescue mode SSH 2026-05-14. Primary CP works by luck of faster fsn1 zone NIC attach. Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for the expected private IP (control plane) or a route to it (worker). If the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true and `netplan apply`. Bail loudly if the IP/route never appears — failures surface in cloud-init.log instead of disguising as a slow boot. Symmetric fix in worker template covers autoscaler-spawned secondary workers when worker_count > 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72) The qa-fixtures chart's `default-deny` CiliumClusterwideNetworkPolicy excluded `catalyst-system` from its NotIn list but FORGOT `catalyst` (where bp-self-sovereign-cutover's Jobs live: auto-trigger, gitea-mirror, harbor-projects, registry-pivot) and `newapi` (where bp-newapi's Application pods live). Effect on prov #72: - bp-self-sovereign-cutover-auto-trigger Job stuck 20m+ on HTTP 000000 curling http://catalyst-api.catalyst-system.svc → DNS resolution + TCP egress both denied by default-deny. Cutover never fires → handover blocked → bp-catalyst-platform's --wait never completes. - newapi-bp-newapi pod gets `secret newapi-oidc not found` but its inability to resolve apiserver compounds the issue. - qa-omantel cnpg cluster-primary/replica stuck "Setting up primary" for 18m because initdb's `dial tcp 10.43.0.1:443 i/o timeout` — the ClusterIP-rewritten kube-apiserver address has no allow-egress. Fixes: 1. Add `catalyst` + `newapi` to $excludedNamespaces — they're first-party blueprint namespaces analogous to catalyst-system. 2. Add `allow-kube-apiserver` CNP in qa-omantel using Cilium's canonical `toEntities: [kube-apiserver]` directive so cnpg initdb can reach the apiserver regardless of whether traffic resolves to ClusterIP, node IP, or Service VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 08:18:54 +04:00
github-actions[bot]	5f2298c550	deploy: update catalyst images to `a75463f`	2026-05-14 03:42:19 +00:00
github-actions[bot]	af3a1e6375	deploy: update catalyst images to `410a3db`	2026-05-13 18:05:18 +00:00
github-actions[bot]	3c38565951	deploy: update catalyst images to `4a14bbf`	2026-05-13 16:34:30 +00:00
github-actions[bot]	cd5ace8dcb	deploy: update catalyst images to `32e0b40`	2026-05-13 15:42:13 +00:00
github-actions[bot]	55edb953d5	deploy: update catalyst images to `44913d8`	2026-05-13 14:40:02 +00:00
github-actions[bot]	b6e6470ccf	deploy: update catalyst images to `5f4f9f2`	2026-05-13 14:01:04 +00:00
e3mrah	6fac1481d3	fix(catalyst-api): bump memory limit 1Gi → 4Gi for multi-region snapshot load (#1456) prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during phase-1 watch on a 3-region Sovereign. The in-memory state has grown substantially since the 1Gi limit was set: - 1 primary helmwatch.Watcher (45 HRs + informer cache) - N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each with its own informer cache) - jobs.Store backed by on-disk + in-memory tree - per-/snapshot poll: composes per-region region groups across all Job rows + cross-references hrDeps from the live primary watcher Combined steady-state exceeds 1Gi on cpx32-equivalent loads. Bumped limits to 4Gi (request 512Mi up from 128Mi). The mothership node has 8GB+ resident, no other tight constraint. Future fix: persist region in Job rows so secondary watchers don't need to be retained post phase-1 (orthogonal cleanup). Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:20:00 +04:00
github-actions[bot]	2c6374b200	deploy: update catalyst images to `8518bb1`	2026-05-13 12:48:59 +00:00
github-actions[bot]	ed4f66438f	deploy: update catalyst images to `d9d7fa2`	2026-05-13 12:26:59 +00:00
github-actions[bot]	6f50bc0a4a	deploy: update catalyst images to `3a08c23`	2026-05-13 12:05:56 +00:00
github-actions[bot]	16f41bef56	deploy: update catalyst images to `68372d7`	2026-05-12 16:13:41 +00:00
github-actions[bot]	1c6e82b83b	deploy: update catalyst images to `be47815`	2026-05-12 16:03:56 +00:00
github-actions[bot]	034da82c00	deploy: update catalyst images to `cdcc50a`	2026-05-12 15:58:30 +00:00
github-actions[bot]	fc71800a52	deploy: update catalyst images to `19a847e`	2026-05-12 12:30:55 +00:00
github-actions[bot]	bc0f56eb4e	deploy: update catalyst images to `4923938`	2026-05-12 12:15:30 +00:00
github-actions[bot]	effd75e4a7	deploy: update catalyst images to `c5d891a`	2026-05-12 11:26:54 +00:00
github-actions[bot]	5fb99be8e8	deploy: update catalyst images to `bd5d439`	2026-05-12 10:00:04 +00:00
github-actions[bot]	064fc3073f	deploy: update catalyst images to `0fe0cac`	2026-05-12 09:32:31 +00:00
github-actions[bot]	c80d43c6d8	deploy: update catalyst images to `2c1f767`	2026-05-12 09:27:06 +00:00
github-actions[bot]	fe337d571c	deploy: update catalyst images to `bb1bff2`	2026-05-12 08:42:18 +00:00
github-actions[bot]	24a2b13870	deploy: update catalyst images to `9da662c`	2026-05-12 08:36:45 +00:00
github-actions[bot]	41787d66c6	deploy: update catalyst images to `5e96d30`	2026-05-12 08:33:55 +00:00
github-actions[bot]	732949bc73	deploy: update catalyst images to `f980356`	2026-05-12 08:14:36 +00:00
github-actions[bot]	1a0333a43f	deploy: update catalyst images to `93c3e81`	2026-05-12 07:27:29 +00:00
github-actions[bot]	9011d1b635	deploy: update catalyst images to `048a4d8`	2026-05-12 06:46:54 +00:00
github-actions[bot]	7e4f38ec62	deploy: update catalyst images to `e3771f6`	2026-05-12 06:38:32 +00:00
github-actions[bot]	59b6940c18	deploy: update catalyst images to `2fbab45`	2026-05-12 06:08:41 +00:00
github-actions[bot]	4ceb74067f	deploy: update catalyst images to `50bf7a5`	2026-05-12 04:12:24 +00:00
e3mrah	50bf7a59ed	fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428) prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs True. F1-F7 are correct and live on main (qa-finalizer-strip Completed, autoscaler workers joined). The remaining wall is total bootstrap-kit install time exceeding the outer watch budget on a fresh cpx42×1 Sovereign without a warm Harbor proxy-cache. Two lock-step changes widen both bounds: 1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella chart genuinely needs >15m worst case when the full SME + Catalyst service stack rolls cold. 2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go: DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the watch never terminates while helm-controller still has remediation attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path was already wired (issue #538 baseline) — chart template now declares the explicit "120m" value so the runtime knob is discoverable for capacity-bounded environments. Per INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable. New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the F8 floor against future regression. Existing env-var override + field-override tests still pass unchanged. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 08:10:24 +04:00
github-actions[bot]	dd095b8597	deploy: update catalyst images to `b743b64`	2026-05-12 02:13:30 +00:00
github-actions[bot]	d4d05f16f6	deploy: update catalyst images to `8c7d326`	2026-05-12 00:38:43 +00:00
e3mrah	8c7d32616e	fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185, prov #38/#39/#41 recurrence) (#1426) Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC): bp-catalyst-platform HR install.timeout=15m → Helm pre-install hook: qa-finalizer-strip Job (weight -99) → Pod requests 50m CPU + 64Mi memory (tiny) → BUT no tolerations → scheduler restricted to worker → worker cpx32 (8vCPU/16GB) at 99% CPU requests (7980m of 8000m allocated) after bootstrap-kit fan-out → FailedScheduling: "0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: true}" → autoscaler triggers scale-up worker 2→3 → "1 in backoff after failed scale-up" → still Pending → 15m timeout → InstallFailed → Flux uninstall+rollback → installFailures: 3 → Flux gives up entirely Live evidence quoted from chroot kubeconfig on prov #41: - bp-catalyst-platform HR `Reconciling=True, reason=Progressing, message="Running 'install' action with timeout of 15m0s"` - HR `Released=False, reason=InstallFailed, message="Helm install failed for release catalyst-system/catalyst-platform with chart bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred: * timed out waiting for the condition"` - Pod `qa-finalizer-strip-m2hdb` status=Pending; events: `Warning FailedScheduling 108s default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: true}` - Worker `Allocated cpu 7980m (99%) of 8000m capacity` - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle) Fix: add tolerations for the control-plane NoSchedule taint + priorityClassName: system-cluster-critical so the qa-finalizer-strip Job can ALWAYS schedule regardless of worker-node CPU saturation. The hook is a defense-in-depth cleanup that runs in seconds on a clean cluster; it legitimately belongs anywhere with free capacity including the control-plane node (which on prov #41 had 7365m CPU free vs. the hook's 50m request). Why prior fixes didn't suffice: - Fix #114 introduced this hook to break a finalizer-deadlock loop on prov #9. Correct fix for that wedge; never anticipated worker saturation as a scheduling failure mode for the hook itself. - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed + qa-cnpg-status-seed hooks (weight 0/post-install) to regular release resources to break a circular DAG dep. Different hook surface. - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install hook (weight +10) wait budget for cold-start autoscaler. That hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook never starts, the +10 hook never runs. Recurring class: same family as Fix #114 (hook scheduling failure wedges entire HR install). 3 consecutive recurrences (prov #38, #39, #41) on chart pin 1.4.140 trigger the category-level audit threshold (CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene swept in same commit: - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub redirect for deprecated Bitnami images, 2025-08 cutover documented at platform/self-sovereign-cutover/chart/values.yaml:252) → harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4 — the canonical alpine-based kubectl image already used by sibling hook catalyst-gitea-token-mint (Fix #163). MIRROR-EVERYTHING + ARCHITECT-FIRST rules. Coordinator follow-up tickets: - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml (qa-cnpgpair-status-seed) still reference bitnamilegacy/kubectl:1.29.3 — same Bitnami-deprecation class. Out of scope for this Fix (not part of the recurrence cluster); flagged for a sweep. - Worker cpx32 sizing may be undersized for the bootstrap-kit fan-out on omantel.biz — separate sizing ticket, not blocking. Changes: - products/catalyst/chart/templates/qa-fixtures/pre-install-finalizer-strip.yaml: add tolerations + priorityClassName; switch image to alpine/k8s:1.31.4. Inline doc comments explain the 4-layer trace and the Fix #114/#138/#184 history. - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with changelog entry capturing root cause + budget arithmetic. - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump HR pin 1.4.140 → 1.4.141. Verification: - helm template renders cleanly (exit 0, ~6700 lines). - kubectl apply --dry-run=client validates the rendered Job manifest (job.batch/qa-finalizer-strip created (dry run)). - Rendered Job contains tolerations[control-plane Exists NoSchedule], priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4. Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 04:36:35 +04:00
github-actions[bot]	5fdd33b7c0	deploy: update catalyst images to `0ba87bb`	2026-05-11 18:32:08 +00:00
github-actions[bot]	5c987309b5	deploy: update catalyst images to `5332ed0`	2026-05-11 17:56:31 +00:00
github-actions[bot]	1f05e52e77	deploy: update catalyst images to `36d1f56`	2026-05-11 17:47:04 +00:00
github-actions[bot]	0a869c3805	deploy: update catalyst images to `1863a25`	2026-05-11 16:54:46 +00:00
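
Note on commit c267ab5338 (region-scoped dep edges): the rewrite rule quoted in that message, where a bare dep name from a regional Job is pointed at the same region's install node, can be sketched as below. This is an illustration only, not the code in flow_snapshot_local.go. The function names `scope_dep` and `dep_job_id` are hypothetical, and the sketch assumes bare dep names always look like `install-<chart>` and that an empty region string stands for the primary.

```python
def scope_dep(dep: str, job_region: str) -> str:
    """Rewrite a bare dep name so it stays inside the source Job's region.

    Assumption: bare names have the form "install-<chart>"; names that
    already carry a region segment ("install-<region>/<chart>") are left
    untouched, as are deps of primary Jobs (empty job_region).
    """
    prefix = "install-"
    if not job_region or not dep.startswith(prefix):
        return dep
    chart = dep[len(prefix):]
    if "/" in chart:
        # Already region-scoped; do not prefix twice.
        return dep
    return f"{prefix}{job_region}/{chart}"


def dep_job_id(dep_id: str, dep: str, job_region: str) -> str:
    """Build the full JobID, mirroring the "<depID>:<name>" shape in the commit."""
    return f"{dep_id}:{scope_dep(dep, job_region)}"
```

With the example from the commit message, `scope_dep("install-cilium", "hel1-2")` yields `install-hel1-2/cilium`, while a primary Job (empty region) keeps the bare name, so no edge crosses a region boundary.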

631 Commits
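
Note on commit 50bf7a59ed (F8 timeouts): the budget relationship stated there, that the outer phase-1 watch must outlast the worst-case inner HelmRelease retry chain, can be checked with a few lines of arithmetic. The durations are taken from the commit message. The function names are hypothetical, and modelling the retry chain as simply timeout × attempts is an assumption made for illustration.

```python
from datetime import timedelta


def worst_case_retry_chain(hr_timeout: timedelta, attempts: int) -> timedelta:
    """Longest time the HelmRelease can spend retrying (assumed: timeout x attempts)."""
    return hr_timeout * attempts


def watch_outlasts_retries(watch_timeout: timedelta,
                           hr_timeout: timedelta,
                           attempts: int) -> bool:
    """True when the outer watch budget is strictly larger than the retry chain."""
    return watch_timeout > worst_case_retry_chain(hr_timeout, attempts)


# Values from the commit: HR timeout 30m, 3 attempts, watch budget 120m.
chain = worst_case_retry_chain(timedelta(minutes=30), 3)
ok_after = watch_outlasts_retries(timedelta(minutes=120), timedelta(minutes=30), 3)
# Raising only the HR timeout while keeping the old 60m watch would not be enough.
ok_mixed = watch_outlasts_retries(timedelta(minutes=60), timedelta(minutes=30), 3)
```

Here `chain` is 90 minutes, `ok_after` is true, and `ok_mixed` is false, which is why the commit describes the two changes as lock-step.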