* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.
Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the
sovereign-tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.
Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint
When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:
    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY
Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.
Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.
Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200
Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.
Chart bump: catalyst 1.4.142 → 1.4.143.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
* fix(flow_snapshot): region-scope dep edges (no cross-region wiring)
Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).
Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.
Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:
    jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"
Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.
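
A minimal sketch of the region-scoping rule described above, for illustration only. The helper names (regionScopeDep, jobID) are hypothetical and are not taken from flow_snapshot_local.go. It also assumes that a primary-region Job carries an empty region string, which the message above does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// regionScopeDep rewrites a bare dep name such as "install-cilium" into
// "install-<region>/cilium" when the source Job is regional.
// Hypothetical helper; an empty jobRegion is ASSUMED to mean "primary".
func regionScopeDep(jobRegion, dep string) string {
	const prefix = "install-"
	if jobRegion == "" || !strings.HasPrefix(dep, prefix) {
		return dep // primary Job, or not an install-* dep: leave untouched
	}
	chart := strings.TrimPrefix(dep, prefix)
	if strings.Contains(chart, "/") {
		return dep // already region-scoped
	}
	return prefix + jobRegion + "/" + chart
}

// jobID mirrors the "<depID>:<job name>" shape quoted in the message.
func jobID(depID, name string) string {
	return depID + ":" + name
}

func main() {
	// Secondary region: the edge now stays inside hel1-2.
	fmt.Println(jobID("<depID>", regionScopeDep("hel1-2", "install-cilium")))
	// Primary region: the bare name is kept as-is.
	fmt.Println(jobID("<depID>", regionScopeDep("", "install-cilium")))
}
```

The early return for names that already contain a "/" keeps the rewrite idempotent, so applying it on both the dependsOn path and the hrDeps path cannot double-prefix a dep.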
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cloud-init): wait for private NIC before k3s install (prov #71)
Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server
create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks
BEFORE the NIC is ready, renders netplan with only eth0, and the
private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN.
Effect on secondary CPs: k3s server starts with
    --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2
and fatals on
    "listen tcp 10.0.11.2:2380: bind: cannot assign requested address"
then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service
restart counter reached 5394, kubeconfig never PUT back to mothership,
canvas showed the secondary region as a permanent black hole. Diagnosed via
Hetzner rescue-mode SSH on 2026-05-14. The primary CP works only by luck:
NIC attach happens to be faster in the fsn1 zone.
Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for
the expected private IP (control plane) or a route to it (worker). If
the NIC appears DOWN with no netplan stanza, generate one with dhcp4: true
and run `netplan apply`. Bail loudly if the IP/route never appears, so that
failures surface in cloud-init.log instead of masquerading as a slow boot.
Symmetric fix in worker template covers autoscaler-spawned secondary
workers when worker_count > 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72)
The qa-fixtures chart's `default-deny` CiliumClusterwideNetworkPolicy
excluded `catalyst-system` from its NotIn list but FORGOT `catalyst`
(where bp-self-sovereign-cutover's Jobs live: auto-trigger,
gitea-mirror, harbor-projects, registry-pivot) and `newapi` (where
bp-newapi's Application pods live).
Effect on prov #72:
- bp-self-sovereign-cutover-auto-trigger Job stuck 20m+ on HTTP 000000
curling http://catalyst-api.catalyst-system.svc → DNS resolution + TCP
egress both denied by default-deny. Cutover never fires → handover
blocked → bp-catalyst-platform's --wait never completes.
- newapi-bp-newapi pod fails with `secret newapi-oidc not found`, and its
inability to resolve the apiserver compounds the issue.
- qa-omantel cnpg cluster-primary/replica stuck at "Setting up primary"
for 18m because initdb fails with `dial tcp 10.43.0.1:443 i/o timeout`: the
ClusterIP-rewritten kube-apiserver address has no allow-egress.
Fixes:
1. Add `catalyst` + `newapi` to $excludedNamespaces — they're first-party
blueprint namespaces analogous to catalyst-system.
2. Add `allow-kube-apiserver` CNP in qa-omantel using Cilium's canonical
`toEntities: [kube-apiserver]` directive so cnpg initdb can reach the
apiserver regardless of whether traffic resolves to ClusterIP, node
IP, or Service VIP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during
phase-1 watch on a 3-region Sovereign. The in-memory state has grown
substantially since the 1Gi limit was set:
- 1 primary helmwatch.Watcher (45 HRs + informer cache)
- N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each
with its own informer cache)
- jobs.Store backed by on-disk + in-memory tree
- per-/snapshot poll: composes per-region region groups across all
Job rows + cross-references hrDeps from the live primary watcher
Combined steady-state usage exceeds 1Gi on cpx32-equivalent loads. Bumped
the memory limit to 4Gi (request 512Mi, up from 128Mi). The mothership node has
8GB+ of memory and no other tight constraint. Future fix: persist region
in Job rows so secondary watchers don't need to be retained post
phase-1 (orthogonal cleanup).
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.
Two lock-step changes widen both bounds:
1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
chart genuinely needs >15m worst case when the full SME + Catalyst
service stack rolls cold.
2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
watch never terminates while helm-controller still has remediation
attempts left. The CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
was already wired (issue #538 baseline); the chart template now
declares the explicit "120m" value so the runtime knob is
discoverable for capacity-bounded environments. Per
INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable.
New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override + field-
override tests still pass unchanged.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):
bp-catalyst-platform HR install.timeout=15m
→ Helm pre-install hook: qa-finalizer-strip Job (weight -99)
→ Pod requests 50m CPU + 64Mi memory (tiny)
→ BUT no tolerations → scheduler restricted to worker
→ worker cpx32 (8vCPU/16GB) at 99% CPU requests
(7980m of 8000m allocated) after bootstrap-kit fan-out
→ FailedScheduling: "0/2 nodes are available: 1
Insufficient cpu, 1 node(s) had untolerated taint
{node-role.kubernetes.io/control-plane: true}"
→ autoscaler triggers scale-up worker 2→3 → "1 in backoff
after failed scale-up" → still Pending → 15m timeout
→ InstallFailed → Flux uninstall+rollback → installFailures: 3
→ Flux gives up entirely
Live evidence quoted from chroot kubeconfig on prov #41:
- bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
message="Running 'install' action with timeout of 15m0s"`
- HR `Released=False, reason=InstallFailed, message="Helm install
failed for release catalyst-system/catalyst-platform with chart
bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
* timed out waiting for the condition"`
- Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
`Warning FailedScheduling 108s default-scheduler 0/2 nodes are
available: 1 Insufficient cpu, 1 node(s) had untolerated taint
{node-role.kubernetes.io/control-plane: true}`
- Worker `Allocated cpu 7980m (99%) of 8000m capacity`
- Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)
Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
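
A deliberately simplified model of the scheduling decision, using the prov #41 figures quoted above. It considers only CPU requests and the control-plane taint. The types and function names are hypothetical, and priorityClassName preemption is not modelled.

```go
package main

import "fmt"

// node is a reduced view of a schedulable node: CPU in millicores plus a
// flag for the node-role.kubernetes.io/control-plane NoSchedule taint.
type node struct {
	name              string
	capacityMilli     int
	allocatedMilli    int
	controlPlaneTaint bool
}

// fits reports whether a pod requesting reqMilli CPU could land on n.
func fits(n node, reqMilli int, toleratesControlPlane bool) bool {
	if n.controlPlaneTaint && !toleratesControlPlane {
		return false // untolerated taint
	}
	return n.capacityMilli-n.allocatedMilli >= reqMilli
}

// schedulable returns the names of all nodes the pod fits on.
func schedulable(nodes []node, reqMilli int, tolerates bool) []string {
	var out []string
	for _, n := range nodes {
		if fits(n, reqMilli, tolerates) {
			out = append(out, n.name)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{"worker", 8000, 7980, false},      // 20m free
		{"control-plane", 8000, 635, true}, // 7365m free, tainted
	}
	fmt.Println(schedulable(nodes, 50, false)) // []
	fmt.Println(schedulable(nodes, 50, true))  // [control-plane]
}
```

Without the toleration the 50m hook has nowhere to go (20m free on the worker, taint on the control plane); with it, the idle control-plane node accepts the pod.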
Why prior fixes didn't suffice:
- Fix#114 introduced this hook to break a finalizer-deadlock loop
on prov #9. Correct fix for that wedge; never anticipated worker
saturation as a scheduling failure mode for the hook itself.
- Fix#138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
qa-cnpg-status-seed hooks (weight 0/post-install) to regular
release resources to break a circular DAG dep. Different hook
surface.
- Fix#184 (chart 1.4.140) raised the gitea-token-mint pre-install
hook (weight +10) wait budget for cold-start autoscaler. That
hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
never starts, the +10 hook never runs.
Recurring class: same family as Fix#114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:
- Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
redirect for deprecated Bitnami images, 2025-08 cutover
documented at platform/self-sovereign-cutover/chart/values.yaml:252)
→ harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4,
the canonical alpine-based kubectl image already used by sibling
hook catalyst-gitea-token-mint (Fix#163). Follows the MIRROR-EVERYTHING
and ARCHITECT-FIRST rules.
Coordinator follow-up tickets:
- Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
(qa-cnpgpair-status-seed) still reference
bitnamilegacy/kubectl:1.29.3, the same Bitnami-deprecation class.
Out of scope for this Fix (not part of the recurrence cluster);
flagged for a sweep.
- Worker cpx32 may be undersized for the bootstrap-kit fan-out
on omantel.biz; separate sizing ticket, not blocking.
Changes:
- products/catalyst/chart/templates/qa-fixtures/pre-install-finalizer-strip.yaml:
add tolerations + priorityClassName;
switch image to alpine/k8s:1.31.4. Inline doc comments explain
the 4-layer trace and the Fix #114/#138/#184 history.
- products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
changelog entry capturing root cause + budget arithmetic.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
bump HR pin 1.4.140 → 1.4.141.
Verification:
- helm template renders cleanly (exit 0, ~6700 lines).
- kubectl apply --dry-run=client validates the rendered Job
manifest (job.batch/qa-finalizer-strip created (dry run)).
- Rendered Job contains tolerations[control-plane Exists NoSchedule],
priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>