fix(cilium-gateway): allow world ingress to reserved:ingress (unblocks Sovereign public surfaces) (#1482)

* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprovisioning cycles this hits the Let's Encrypt
PROD rate limit (5 certificates per 168h; caught on prov #76 at
13:45 UTC, retry-after 16:49 UTC) and the wildcard Certificate sticks
at Ready=False. Cilium Gateway then has no valid TLS secret → envoy
listener never binds → public TLS handshake to console.<fqdn> dies
with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.
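
A minimal sketch of how the substitution could be wired, assuming Flux's
postBuild.substitute on the sovereign-tls Kustomization and a cert-manager
Certificate in cilium-gateway-cert.yaml. Only WILDCARD_CERT_ISSUER and
letsencrypt-dns01-prod-powerdns come from this change; the namespaces, path,
sourceRef, Certificate and Secret names, dnsNames, and the ClusterIssuer kind
are illustrative assumptions, not taken from the repository.

```yaml
# Flux Kustomization for sovereign-tls (sketch; every field except
# postBuild.substitute.WILDCARD_CERT_ISSUER is hypothetical).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sovereign-tls
  namespace: flux-system                    # assumed
spec:
  interval: 10m                             # assumed
  prune: true
  path: ./clusters/example/sovereign-tls    # hypothetical path
  sourceRef:
    kind: GitRepository
    name: flux-system                       # assumed
  postBuild:
    substitute:
      # Rendered by tofu from local.wildcard_cert_issuer: the staging
      # issuer when qa_test_session_enabled is true, otherwise
      # letsencrypt-dns01-prod-powerdns (shown here).
      WILDCARD_CERT_ISSUER: letsencrypt-dns01-prod-powerdns
---
# cilium-gateway-cert.yaml (sketch)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cilium-gateway-cert                 # hypothetical
  namespace: kube-system                    # assumed
spec:
  secretName: cilium-gateway-tls            # hypothetical
  dnsNames:
    - "*.example.com"                       # placeholder wildcard
  issuerRef:
    kind: ClusterIssuer                     # assumed
    name: ${WILDCARD_CERT_ISSUER}           # replaced by Flux after the kustomize build
```

With this shape the Certificate manifest stays identical across QA and
production Sovereigns; only the substituted value differs.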

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
e3mrah 2026-05-14 18:50:34 +04:00 committed by GitHub
parent fb99ae5fd0
commit 115c58885b
3 changed files with 50 additions and 2 deletions


@@ -473,7 +473,7 @@ spec:
# from bitnamilegacy/kubectl:1.29.3 → alpine/k8s:1.31.4 in same
# commit (rule-17 MIRROR-EVERYTHING hygiene; bitnamilegacy is
# the Docker-Hub redirect for deprecated Bitnami 2025-08 cutover).
-version: 1.4.142
+version: 1.4.143
sourceRef:
kind: HelmRepository
name: bp-catalyst-platform


@@ -1058,7 +1058,7 @@ name: bp-catalyst-platform
# Fix #154 (HR-timeout audit). Those bumped the HelmRelease
# install.timeout. This bumps the chart-INTERNAL wait loop budget
# inside the pre-install hook Job, which is a different seam.
-version: 1.4.142
+version: 1.4.143
appVersion: 1.4.94
# 1.4.141 (qa-loop Fix #185, prov #38/#39/#41 recurrence — pre-install
# hook unscheduable on saturated worker):


@@ -67,6 +67,54 @@ spec:
egress:
- {}
---
# 1b/12 — Allow external traffic into Cilium Gateway (reserved:ingress).
#
# Root cause: Cilium Gateway API (gatewayAPI.hostNetwork.enabled=true)
# creates a special endpoint with identity `reserved:ingress` (8) that
# represents the gateway listener. By default this endpoint has
# policy-enabled=both, allowed-ingress-identities=[1 (host)], and an
# empty L4 rule set — i.e. world traffic that arrives at the gateway
# is dropped by cilium.l7policy with a 403 "Access denied" before any
# HTTPRoute is evaluated.
#
# Symptom: every public Sovereign host (console, auth, gitea, api, …)
# returns `HTTP/1.1 403 Forbidden` body=`Access denied` server=envoy
# even though the HTTPRoutes are Programmed, the Gateway is Accepted,
# and the backend services are healthy in-cluster. Caught live on
# prov #80 (omantel.biz, 2026-05-14): TLS handshake OK with the
# correct cert, envoy reachable on :30443, but every request 403'd.
# Confirmed via `cilium-dbg endpoint get 3282 -o json` showing
# `l4.ingress: []` and `allowed-ingress-identities: [1]` only.
#
# Fix: a CCNP scoped to the `reserved:ingress` endpoint that allows
# ingress from `world`, `cluster`, `host`, `remote-node` (multi-region
# CP-to-CP), and `kube-apiserver`, plus egress to `all` so envoy can
# forward to any backend service. This is the canonical Cilium pattern
# for hostNetwork Gateway-API zero-trust — without it the gateway
# becomes a black hole the moment a default-deny CCNP is present.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-gateway-world-ingress
  labels:
    openova.io/managed-by: qa-fixtures
    openova.io/policy-tier: gateway-allow
spec:
  description: "Allow world + cluster traffic to reach the Cilium Gateway listener; default-deny would otherwise drop all public requests at the gateway."
  endpointSelector:
    matchLabels:
      reserved.ingress: ""
  ingress:
    - fromEntities:
        - world
        - cluster
        - host
        - remote-node
        - kube-apiserver
  egress:
    - toEntities:
        - all
---
# 2/12 — qa-omantel: allow DNS egress (kube-dns)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy