Commit Graph

631 Commits

Author SHA1 Message Date
e3mrah
115c58885b
fix(cilium-gateway): allow world ingress to reserved:ingress (unblocks Sovereign public surfaces) (#1482)
* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the
sovereign-tls Kustomization postBuild.substitute.
cilium-gateway-cert.yaml references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.
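
Illustration only (not the repository's tofu code): the selection rule above
is a two-way switch followed by a string substitution. A minimal Python
sketch, assuming a staging issuer named letsencrypt-dns01-staging-powerdns
(hypothetical; this commit only names the prod issuer) and a plain boolean
standing in for the tofu variable:

```python
# Sketch of the issuer selection described above; names marked
# "hypothetical" do not come from the commit.
PROD_ISSUER = "letsencrypt-dns01-prod-powerdns"        # named in the commit
STAGING_ISSUER = "letsencrypt-dns01-staging-powerdns"  # hypothetical name


def wildcard_cert_issuer(qa_test_session_enabled: bool) -> str:
    """Pick the issuer for the wildcard Certificate."""
    return STAGING_ISSUER if qa_test_session_enabled else PROD_ISSUER


def substitute(template: str, issuer: str) -> str:
    """Mimic a postBuild-style ${WILDCARD_CERT_ISSUER} substitution."""
    return template.replace("${WILDCARD_CERT_ISSUER}", issuer)
```

With qa_test_session_enabled false, substitute("name: ${WILDCARD_CERT_ISSUER}",
wildcard_cert_issuer(False)) yields the prod issuer, matching the
"default behaviour unchanged" claim.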

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.
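
Illustration only: a rough shape of the policy described above, built as a
Python dict and printed as JSON (which kubectl accepts). The policy name is
hypothetical and the manifest actually shipped in the chart may differ; the
selector and entity names follow the description in this message.

```python
import json


def gateway_ingress_policy(name: str = "allow-world-to-gateway-ingress") -> dict:
    """Build a CCNP scoped to the reserved:ingress endpoint.

    The default name is a placeholder, not the name used in the chart.
    """
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumClusterwideNetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            # Select only the special gateway-listener endpoint.
            "endpointSelector": {
                "matchExpressions": [
                    {"key": "reserved:ingress", "operator": "Exists"}
                ]
            },
            # Sources listed in the commit message.
            "ingress": [
                {
                    "fromEntities": [
                        "world",
                        "cluster",
                        "host",
                        "remote-node",
                        "kube-apiserver",
                    ]
                }
            ],
            # Let envoy forward to any backend service.
            "egress": [{"toEntities": ["all"]}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(gateway_ingress_policy(), indent=2))
```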

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-14 18:50:34 +04:00
github-actions[bot]
fb99ae5fd0 deploy: update catalyst images to a88e132 2026-05-14 14:27:51 +00:00
github-actions[bot]
5752fc751f deploy: update catalyst images to bdceb3a 2026-05-14 12:45:34 +00:00
github-actions[bot]
0e4cb67319 deploy: update catalyst images to 690d588 2026-05-14 12:40:44 +00:00
github-actions[bot]
195c6b5bc5 deploy: update catalyst images to 13d79c7 2026-05-14 12:35:31 +00:00
github-actions[bot]
5527652b49 deploy: update catalyst images to f334950 2026-05-14 12:29:07 +00:00
github-actions[bot]
fb8303766e deploy: update catalyst images to 587a985 2026-05-14 10:18:12 +00:00
github-actions[bot]
bb2726bcf9 deploy: update catalyst images to f110a54 2026-05-14 06:51:04 +00:00
github-actions[bot]
b4c96a6d0d deploy: update catalyst images to df1dfed 2026-05-14 06:30:40 +00:00
github-actions[bot]
331e6b2834 deploy: update catalyst images to b4c2f54 2026-05-14 06:12:28 +00:00
github-actions[bot]
2f5b1cd0ee deploy: update catalyst images to 4814c68 2026-05-14 05:55:28 +00:00
github-actions[bot]
f5929e6114 deploy: update catalyst images to 2626d40 2026-05-14 04:27:53 +00:00
github-actions[bot]
edf8e6fd18 deploy: update catalyst images to c267ab5 2026-05-14 04:20:59 +00:00
e3mrah
c267ab5338
fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72) (#1465)
* fix(flow_snapshot): region-scope dep edges (no cross-region wiring)

Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.
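
Illustration only (the real change is in Go, in flow_snapshot_local.go): the
rewrite rule above, sketched in Python. It assumes primary-region Jobs carry
an empty region string and that bare dep names always start with "install-";
both are assumptions, and the function names are hypothetical.

```python
PREFIX = "install-"


def region_scope_dep(job_region: str, dep: str) -> str:
    """Rewrite a bare dep name so it stays inside the source Job's region."""
    if not job_region or not dep.startswith(PREFIX):
        return dep  # primary-region Job, or not an install-* name
    chart = dep[len(PREFIX):]
    if "/" in chart:
        return dep  # already region-scoped, leave untouched
    return f"{PREFIX}{job_region}/{chart}"


def dep_job_id(dep_id: str, job_region: str, dep: str) -> str:
    """Compose the full JobID the dependency edge should point at."""
    return f"{dep_id}:{region_scope_dep(job_region, dep)}"
```

For example, dep_job_id("<depID>", "hel1-2", "install-cilium") gives
"<depID>:install-hel1-2/cilium", while an empty region leaves the primary's
"<depID>:install-cilium" unchanged.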

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-init): wait for private NIC before k3s install (prov #71)

Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server
create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks
BEFORE the NIC is ready, renders netplan with only eth0, and the
private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN.

Effect on secondary CPs: k3s server starts with
  --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2
and fatals on
  "listen tcp 10.0.11.2:2380: bind: cannot assign requested address"
then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service
restart counter reached 5394, kubeconfig never PUT back to mothership,
canvas showed secondary region as a permanent black hole. Diagnosed via
Hetzner rescue mode SSH 2026-05-14. The primary CP works only by luck:
the NIC attaches faster in the fsn1 zone.

Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for
the expected private IP (control plane) or a route to it (worker). If
the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true
and run `netplan apply`. Bail loudly if the IP/route never appears, so failures
surface in cloud-init.log instead of masquerading as a slow boot.
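
Illustration only (the real fix is shell in cloud-init runcmd): the
poll-then-bail pattern above, sketched in Python. The check callable stands
in for "is the expected private IP or route present"; the 2s interval and
the injectable clock/sleep parameters are assumptions made for testability.

```python
import time
from typing import Callable


def wait_for(
    check: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll check() until it returns True or timeout_s elapses.

    Raises TimeoutError instead of returning silently, so the caller
    fails loudly rather than proceeding to the k3s install.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        sleep(interval_s)
```

Raising on timeout is the point of the design: the error lands in the log
at the step that failed, rather than surfacing later as a k3s crashloop.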

Symmetric fix in worker template covers autoscaler-spawned secondary
workers when worker_count > 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72)

The qa-fixtures chart's `default-deny` CiliumClusterwideNetworkPolicy
excluded `catalyst-system` from its NotIn list but FORGOT `catalyst`
(where bp-self-sovereign-cutover's Jobs live: auto-trigger,
gitea-mirror, harbor-projects, registry-pivot) and `newapi` (where
bp-newapi's Application pods live).

Effect on prov #72:
- bp-self-sovereign-cutover-auto-trigger Job stuck 20m+ on HTTP 000000
  curling http://catalyst-api.catalyst-system.svc → DNS resolution + TCP
  egress both denied by default-deny. Cutover never fires → handover
  blocked → bp-catalyst-platform's --wait never completes.
- newapi-bp-newapi pod fails with `secret newapi-oidc not found`, and its
  inability to resolve the apiserver compounds the issue.
- qa-omantel cnpg cluster-primary/replica stuck "Setting up primary"
  for 18m because initdb fails with `dial tcp 10.43.0.1:443 i/o timeout`:
  the ClusterIP-rewritten kube-apiserver address has no allow-egress.

Fixes:
1. Add `catalyst` + `newapi` to $excludedNamespaces — they're first-party
   blueprint namespaces analogous to catalyst-system.
2. Add `allow-kube-apiserver` CNP in qa-omantel using Cilium's canonical
   `toEntities: [kube-apiserver]` directive so cnpg initdb can reach the
   apiserver regardless of whether traffic resolves to ClusterIP, node
   IP, or Service VIP.
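
Illustration only: the two fixes above, sketched as Python dicts. The
excluded list shows only the three namespaces named in this message (the
chart's real list may be longer), and the empty endpointSelector (all pods
in the namespace) is an assumption about the shipped CNP.

```python
# Namespaces named in this commit; the chart's list may contain more.
EXCLUDED_NAMESPACES = ["catalyst-system", "catalyst", "newapi"]


def default_deny_selector(excluded: list) -> dict:
    """endpointSelector for default-deny: every pod NOT in `excluded`."""
    return {
        "matchExpressions": [
            {
                "key": "io.kubernetes.pod.namespace",
                "operator": "NotIn",
                "values": list(excluded),
            }
        ]
    }


def is_subject_to_default_deny(namespace: str, excluded: list) -> bool:
    """True when pods in `namespace` fall under the default-deny policy."""
    return namespace not in excluded


def allow_kube_apiserver(namespace: str = "qa-omantel") -> dict:
    """Namespaced CNP allowing egress to the kube-apiserver entity."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": "allow-kube-apiserver", "namespace": namespace},
        "spec": {
            "endpointSelector": {},  # assumption: all pods in the namespace
            "egress": [{"toEntities": ["kube-apiserver"]}],
        },
    }
```

Under this sketch, catalyst and newapi are no longer subject to
default-deny, while qa-omantel still is and relies on the explicit
kube-apiserver allow.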

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:18:54 +04:00
github-actions[bot]
5f2298c550 deploy: update catalyst images to a75463f 2026-05-14 03:42:19 +00:00
github-actions[bot]
af3a1e6375 deploy: update catalyst images to 410a3db 2026-05-13 18:05:18 +00:00
github-actions[bot]
3c38565951 deploy: update catalyst images to 4a14bbf 2026-05-13 16:34:30 +00:00
github-actions[bot]
cd5ace8dcb deploy: update catalyst images to 32e0b40 2026-05-13 15:42:13 +00:00
github-actions[bot]
55edb953d5 deploy: update catalyst images to 44913d8 2026-05-13 14:40:02 +00:00
github-actions[bot]
b6e6470ccf deploy: update catalyst images to 5f4f9f2 2026-05-13 14:01:04 +00:00
e3mrah
6fac1481d3
fix(catalyst-api): bump memory limit 1Gi → 4Gi for multi-region snapshot load (#1456)
prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during
phase-1 watch on a 3-region Sovereign. The in-memory state has grown
substantially since the 1Gi limit was set:

- 1 primary helmwatch.Watcher (45 HRs + informer cache)
- N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each
  with its own informer cache)
- jobs.Store backed by on-disk + in-memory tree
- per-/snapshot poll: composes per-region groups across all
  Job rows + cross-references hrDeps from the live primary watcher

Combined steady-state usage exceeds 1Gi on cpx32-equivalent loads. Bumped
the limit to 4Gi (request 512Mi, up from 128Mi). The mothership node has
8GB+ of memory and no other tight constraint. Future fix: persist region
in Job rows so secondary watchers don't need to be retained post
phase-1 (orthogonal cleanup).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:20:00 +04:00
github-actions[bot]
2c6374b200 deploy: update catalyst images to 8518bb1 2026-05-13 12:48:59 +00:00
github-actions[bot]
ed4f66438f deploy: update catalyst images to d9d7fa2 2026-05-13 12:26:59 +00:00
github-actions[bot]
6f50bc0a4a deploy: update catalyst images to 3a08c23 2026-05-13 12:05:56 +00:00
github-actions[bot]
16f41bef56 deploy: update catalyst images to 68372d7 2026-05-12 16:13:41 +00:00
github-actions[bot]
1c6e82b83b deploy: update catalyst images to be47815 2026-05-12 16:03:56 +00:00
github-actions[bot]
034da82c00 deploy: update catalyst images to cdcc50a 2026-05-12 15:58:30 +00:00
github-actions[bot]
fc71800a52 deploy: update catalyst images to 19a847e 2026-05-12 12:30:55 +00:00
github-actions[bot]
bc0f56eb4e deploy: update catalyst images to 4923938 2026-05-12 12:15:30 +00:00
github-actions[bot]
effd75e4a7 deploy: update catalyst images to c5d891a 2026-05-12 11:26:54 +00:00
github-actions[bot]
5fb99be8e8 deploy: update catalyst images to bd5d439 2026-05-12 10:00:04 +00:00
github-actions[bot]
064fc3073f deploy: update catalyst images to 0fe0cac 2026-05-12 09:32:31 +00:00
github-actions[bot]
c80d43c6d8 deploy: update catalyst images to 2c1f767 2026-05-12 09:27:06 +00:00
github-actions[bot]
fe337d571c deploy: update catalyst images to bb1bff2 2026-05-12 08:42:18 +00:00
github-actions[bot]
24a2b13870 deploy: update catalyst images to 9da662c 2026-05-12 08:36:45 +00:00
github-actions[bot]
41787d66c6 deploy: update catalyst images to 5e96d30 2026-05-12 08:33:55 +00:00
github-actions[bot]
732949bc73 deploy: update catalyst images to f980356 2026-05-12 08:14:36 +00:00
github-actions[bot]
1a0333a43f deploy: update catalyst images to 93c3e81 2026-05-12 07:27:29 +00:00
github-actions[bot]
9011d1b635 deploy: update catalyst images to 048a4d8 2026-05-12 06:46:54 +00:00
github-actions[bot]
7e4f38ec62 deploy: update catalyst images to e3771f6 2026-05-12 06:38:32 +00:00
github-actions[bot]
59b6940c18 deploy: update catalyst images to 2fbab45 2026-05-12 06:08:41 +00:00
github-actions[bot]
4ceb74067f deploy: update catalyst images to 50bf7a5 2026-05-12 04:12:24 +00:00
e3mrah
50bf7a59ed
fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428)
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.

Two lock-step changes widen both bounds:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
   install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
   chart genuinely needs >15m worst case when the full SME + Catalyst
   service stack rolls cold.

2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
   DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
   now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
   watch never terminates while helm-controller still has remediation
   attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
   was already wired (issue #538 baseline) — chart template now
   declares the explicit "120m" value so the runtime knob is
   discoverable for capacity-bounded environments. Per
   INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable.
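
Illustration only (the real constant lives in helmwatch.go): the budget
arithmetic above as a small Python check. The function names are
hypothetical; the figures (30m HR timeout, 3 attempts, 60m and 120m
budgets) come from this message.

```python
from datetime import timedelta


def worst_case_retry_chain(hr_timeout: timedelta, attempts: int) -> timedelta:
    """Longest time helm-controller may spend across all attempts."""
    return hr_timeout * attempts


def watch_budget_is_safe(
    hr_timeout: timedelta, attempts: int, watch_timeout: timedelta
) -> bool:
    """The outer watch must outlast the inner retry chain."""
    return watch_timeout > worst_case_retry_chain(hr_timeout, attempts)
```

With a 30m HR timeout and 3 attempts the chain is 90m, so the old 60m
budget fails this check and the new 120m budget passes it.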

New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override + field-
override tests still pass unchanged.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 08:10:24 +04:00
github-actions[bot]
dd095b8597 deploy: update catalyst images to b743b64 2026-05-12 02:13:30 +00:00
github-actions[bot]
d4d05f16f6 deploy: update catalyst images to 8c7d326 2026-05-12 00:38:43 +00:00
e3mrah
8c7d32616e
fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185, prov #38/#39/#41 recurrence) (#1426)
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):

  bp-catalyst-platform HR install.timeout=15m
    → Helm pre-install hook: qa-finalizer-strip Job (weight -99)
      → Pod requests 50m CPU + 64Mi memory (tiny)
        → BUT no tolerations → scheduler restricted to worker
          → worker cpx32 (8vCPU/16GB) at 99% CPU requests
            (7980m of 8000m allocated) after bootstrap-kit fan-out
            → FailedScheduling: "0/2 nodes are available: 1
              Insufficient cpu, 1 node(s) had untolerated taint
              {node-role.kubernetes.io/control-plane: true}"
            → autoscaler triggers scale-up worker 2→3 → "1 in backoff
              after failed scale-up" → still Pending → 15m timeout
              → InstallFailed → Flux uninstall+rollback → installFailures: 3
              → Flux gives up entirely

Live evidence quoted from chroot kubeconfig on prov #41:
  - bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
    message="Running 'install' action with timeout of 15m0s"`
  - HR `Released=False, reason=InstallFailed, message="Helm install
    failed for release catalyst-system/catalyst-platform with chart
    bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
    * timed out waiting for the condition"`
  - Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
    `Warning  FailedScheduling 108s default-scheduler 0/2 nodes are
    available: 1 Insufficient cpu, 1 node(s) had untolerated taint
    {node-role.kubernetes.io/control-plane: true}`
  - Worker `Allocated cpu 7980m (99%) of 8000m capacity`
  - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)

Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
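
Illustration only: a heavily simplified model of the scheduling decision
above, in Python. It considers only CPU requests and Exists-style
tolerations matched on key and effect; the real kube-scheduler evaluates
far more. Node figures are the ones quoted from prov #41.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CONTROL_PLANE_KEY = "node-role.kubernetes.io/control-plane"


@dataclass
class Node:
    name: str
    capacity_m: int   # CPU capacity in millicores
    allocated_m: int  # CPU already requested by other pods
    taints: List[Tuple[str, str]] = field(default_factory=list)  # (key, effect)


@dataclass
class Toleration:
    key: str
    effect: str


def schedulable(node: Node, request_m: int, tolerations: List[Toleration]) -> bool:
    """True if every taint is tolerated and the CPU request fits."""
    for key, effect in node.taints:
        if not any(t.key == key and t.effect == effect for t in tolerations):
            return False  # untolerated taint
    return node.capacity_m - node.allocated_m >= request_m
```

In this model a 50m request fits neither node without tolerations (the
worker has 20m free, the control-plane is tainted) and fits the
control-plane once the NoSchedule toleration is added.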

Why prior fixes didn't suffice:
  - Fix #114 introduced this hook to break a finalizer-deadlock loop
    on prov #9. Correct fix for that wedge; never anticipated worker
    saturation as a scheduling failure mode for the hook itself.
  - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
    qa-cnpg-status-seed hooks (weight 0/post-install) to regular
    release resources to break a circular DAG dep. Different hook
    surface.
  - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install
    hook (weight +10) wait budget for cold-start autoscaler. That
    hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
    never starts, the +10 hook never runs.
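
Illustration only: why a stuck -99 hook blocks the +10 hook, as a simplified
Python model of weight-ordered hook execution. It models ordering by weight
and stop-on-failure only; the function name and the completes callback are
hypothetical.

```python
from typing import Callable, List, Tuple


def run_hooks(
    hooks: List[Tuple[str, int]], completes: Callable[[str], bool]
) -> List[str]:
    """Start hooks in ascending weight order; stop at the first that
    does not complete. Returns the names of hooks that were started."""
    started = []
    for name, _weight in sorted(hooks, key=lambda h: h[1]):
        started.append(name)
        if not completes(name):
            break  # later hooks never run
    return started
```

With hooks [("gitea-token-mint", 10), ("qa-finalizer-strip", -99)] and a
qa-finalizer-strip that never completes, only qa-finalizer-strip is
started, which is the wedge that Fix #184 could not address.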

Recurring class: same family as Fix #114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:

  - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
    redirect for deprecated Bitnami images, 2025-08 cutover documented
    at platform/self-sovereign-cutover/chart/values.yaml:252) →
    harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4, the
    canonical alpine-based kubectl image already used by sibling
    hook catalyst-gitea-token-mint (Fix #163). Per the
    MIRROR-EVERYTHING + ARCHITECT-FIRST rules.

Coordinator follow-up tickets:
  - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
    (qa-cnpgpair-status-seed) still reference
    bitnamilegacy/kubectl:1.29.3, the same Bitnami-deprecation class.
    Out of scope for this Fix (not part of the recurrence cluster);
    flagged for a sweep.
  - The cpx32 worker may be undersized for the bootstrap-kit
    fan-out on omantel.biz; separate sizing ticket, not blocking.

Changes:
  - products/catalyst/chart/templates/qa-fixtures/pre-install-finalizer-strip.yaml:
    add tolerations + priorityClassName; switch image to
    alpine/k8s:1.31.4. Inline doc comments explain the 4-layer trace
    and the Fix #114/#138/#184 history.
  - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
    changelog entry capturing root cause + budget arithmetic.
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
    bump HR pin 1.4.140 → 1.4.141.

Verification:
  - helm template renders cleanly (exit 0, ~6700 lines).
  - kubectl apply --dry-run=client validates the rendered Job
    manifest (job.batch/qa-finalizer-strip created (dry run)).
  - Rendered Job contains tolerations[control-plane Exists NoSchedule],
    priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 04:36:35 +04:00
github-actions[bot]
5fdd33b7c0 deploy: update catalyst images to 0ba87bb 2026-05-11 18:32:08 +00:00
github-actions[bot]
5c987309b5 deploy: update catalyst images to 5332ed0 2026-05-11 17:56:31 +00:00
github-actions[bot]
1f05e52e77 deploy: update catalyst images to 36d1f56 2026-05-11 17:47:04 +00:00
github-actions[bot]
0a869c3805 deploy: update catalyst images to 1863a25 2026-05-11 16:54:46 +00:00