openova/products/catalyst/chart/templates/sovereign-wildcard-certs.yaml
e3mrah 90aa2767da
fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123, LE rate-limit bypass) (#1339)
Root cause (qa-loop iter-1 wedge, 2026-05-10):
  Let's Encrypt production hit the 5-certs/168h rate limit on
  *.omantel.biz (retry after 2026-05-11 22:08 UTC). Cilium-envoy
  could not get a wildcard cert -> console.omantel.biz TLS handshake
  failed -> iter-1 Test Executor could not run. Customer Sovereigns
  are unaffected (one cert per registered domain in their lifetime),
  but QA Sovereigns wipe + re-provision dozens of times in a session
  and exhaust the production ceiling within hours.

Fix (target-state, NOT workaround):
  - bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer
    (letsencrypt-dns01-staging-powerdns) alongside the existing
    production one. Same DNS-01 webhook config (same PowerDNS endpoint,
    same API key) -> only the ACME directory URL + account key differ.
    Both ClusterIssuers are real cert-manager resources; LE treats them
    as wholly independent issuers so a rate-limit hit on production
    does NOT block staging issuance.
  - bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool,
    default false). When true, sovereign-wildcard-certs.yaml renders
    Certificate(s) with issuerRef.name pointing at the staging issuer
    instead of production.
  - bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst,
    same passthrough pattern as QA_FIXTURES_ENABLED.
  - catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA
    Sovereigns (Request.QATestEnabled=true) so the per-Sovereign
    overlay flips both QA fixtures + staging certs from one wizard
    toggle.
  - tofu var wildcard_cert_use_staging propagates through main.tf
    into the cloudinit postBuild.substitute block on both primary +
    secondary regions.
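
As an illustration of the toggle described above, a per-Sovereign overlay might set values like the following. This is a sketch under assumptions, not the shipped overlay: the overlay shape is assumed, the zone reuses `omantel.biz` from the incident, and the commented excerpt assumes `issuerNameStaging` is left at its chart default.

```yaml
# Sketch of per-Sovereign values for bp-catalyst-platform (assumed overlay shape).
parentZones:
  - name: omantel.biz
wildcardCert:
  enabled: true
  useStaging: true   # chart default is false

# Excerpt of the Certificate the template would render for these values:
#   metadata:
#     name: sovereign-wildcard-tls-omantel-biz
#   spec:
#     secretName: sovereign-wildcard-tls-omantel-biz
#     issuerRef:
#       name: letsencrypt-dns01-staging-powerdns
#       kind: ClusterIssuer
#     dnsNames:
#       - "*.omantel.biz"
#       - "omantel.biz"
```

Setting `useStaging` back to false (or omitting it) points the same Certificate at `letsencrypt-dns01-prod-powerdns` with no other change.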

Result:
  cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard
  cert in <2min (no production rate limit). curl -sk + Playwright
  (ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run
  within minutes of provision. Customer Sovereigns (QATestEnabled=
  false) keep getting real-trusted production certs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL
+ issuer name is values-overridable. Operators wiring a private
staging ACME (e.g. internal Smallstep CA) override via per-Sovereign
overlay without rebuilding any Blueprint. Staging is the documented
LE pattern (https://letsencrypt.org/docs/staging-environment/), not a
band-aid.
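
For the private-ACME case mentioned above, the override on the bp-catalyst-platform side could look like the sketch below. `smallstep-dns01-staging` is a hypothetical ClusterIssuer name; that ClusterIssuer must already exist in the cluster, and how it is created is outside this sketch.

```yaml
wildcardCert:
  useStaging: true
  # Hypothetical name of an operator-provided ClusterIssuer.
  issuerNameStaging: smallstep-dns01-staging
  # issuerName (production) stays at its default,
  # letsencrypt-dns01-prod-powerdns, and is not used while useStaging is true.
```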

_None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_

Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 01:08:07 +04:00

{{- /*
Per-zone wildcard Certificate(s) for the Cilium Gateway listener.
Issue #827 (parent epic #825): a franchised Sovereign now supports
N parent zones, NOT one. The operator brings 1+ parent domains at
signup (`omani.works` for own use, `omani.trade` for the SME pool,
etc.) and may add more post-handover via the admin console (#829).
This template renders one cert-manager.io/v1.Certificate resource
per entry in `.Values.parentZones`, each requesting `*.<zone>` plus
the apex from the `letsencrypt-dns01-prod-powerdns` ClusterIssuer
(shipped by bp-cert-manager-powerdns-webhook, bootstrap-kit slot
49). Each Certificate renews independently — a stalled DNS-01
challenge on one zone does not block another zone's renewal.
Single-zone Sovereigns: when `parentZones` is empty this template
renders NOTHING, even if `global.sovereignFQDN` is set. The legacy
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml path (owned
by Kustomization/sovereign-tls) keeps issuing the single-zone
`sovereign-wildcard-tls` Certificate covering `*.<sovereignFQDN>` +
apex, so single-zone Sovereigns keep working without per-overlay
edits during the cutover window. Certificates rendered here are named
`sovereign-wildcard-tls-<sanitised-zone>` and never collide with the
legacy resource. The legacy file is kept until every active Sovereign
has moved to a multi-zone overlay.
Skip-render guards (per the chart-default-render contract used
across bp-*; see e.g. bp-cert-manager-powerdns-webhook's
clusterissuer.yaml skip-render pattern):
  1. .Values.wildcardCert.enabled: operator opt-out.
  2. parentZones non-empty: with no zones the range below emits
     nothing, so the chart never renders a Certificate with an
     empty hostname (cert-manager would reject it at admission
     anyway).
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode) every
operationally-meaningful value flows from values.yaml — issuer,
namespace, duration, renewBefore, secret-name, and the zones list
itself are all operator-overridable.
Resource naming:
  - When parentZones is non-empty: each Certificate is named
    `sovereign-wildcard-tls-<sanitised-name>` (default) or the
    explicit `secretName` from the entry. The Secret name MUST
    match the resource name so the Gateway listener's
    certificateRefs block can resolve it.
  - When parentZones is empty: nothing is rendered here. The
    single-zone name `sovereign-wildcard-tls` stays with the legacy
    Kustomization, preserving the contract referenced by
    clusters/_template/sovereign-tls/cilium-gateway.yaml's
    `certificateRefs[0].name: sovereign-wildcard-tls`.
*/}}
{{- if .Values.wildcardCert.enabled }}
{{- $ns := .Values.wildcardCert.namespace | default "kube-system" }}
{{/*
Issuer selection (Fix #123, LE rate-limit bypass for QA Sovereigns):
- .Values.wildcardCert.useStaging=true → staging issuer (default
`letsencrypt-dns01-staging-powerdns`, shipped by
bp-cert-manager-powerdns-webhook 1.1.0+ alongside the production
issuer). Hits LE's staging ACME endpoint
(https://acme-staging-v02.api.letsencrypt.org/directory). Cert
chains to LE's untrusted staging root, so browsers reject it
without an explicit exception, but `curl -sk` and Playwright
(ignoreHTTPSErrors:true) accept it. The production rate limit (5
certs/168h per exact set of hostnames) does NOT apply to staging,
which has its own, much higher limits.
- .Values.wildcardCert.useStaging=false → production issuer (default
`letsencrypt-dns01-prod-powerdns`). Real-trusted certs.
Default false on the chart; the bootstrap-kit slot for QA Sovereigns
flips this to true via ${WILDCARD_CERT_USE_STAGING:-false} envsubst.
Per docs/INVIOLABLE-PRINCIPLES.md #4 every issuer name is values-
overridable (e.g. private ACME).
*/}}
{{- $issuer := .Values.wildcardCert.issuerName | default "letsencrypt-dns01-prod-powerdns" }}
{{- if .Values.wildcardCert.useStaging }}
{{- $issuer = .Values.wildcardCert.issuerNameStaging | default "letsencrypt-dns01-staging-powerdns" }}
{{- end }}
{{- $duration := .Values.wildcardCert.duration }}
{{- $renewBefore := .Values.wildcardCert.renewBefore }}
{{- /* Determine the effective zone list.
Render policy (avoids conflict with the legacy
clusters/_template/sovereign-tls/cilium-gateway-cert.yaml which
is still owned by Kustomization/sovereign-tls):
  - parentZones populated → render N Certificates here, each named
    sovereign-wildcard-tls-<sanitised-zone>. These are NEW
    resources, not collisions.
  - parentZones empty → render NOTHING. The legacy sovereign-tls
    Kustomization owns the single-zone Certificate. Once every
    active Sovereign moves to multi-zone overlays, the legacy file
    is deletable.
*/}}
{{- $zones := list }}
{{- if gt (len .Values.parentZones) 0 }}
{{- $zones = .Values.parentZones }}
{{- end }}
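{{- /* Illustrative sketch only (not shipped defaults): a values
fragment showing the fields this template reads from each
`.Values.parentZones` entry. The zone names reuse the examples from
the header comment; the `sme-pool` role value and the
`omani-trade-wildcard` secret name are hypothetical.
  parentZones:
    - name: omani.works        # role label defaults to "primary"
    - name: omani.trade
      role: sme-pool
      secretName: omani-trade-wildcard
With these values the range below would render two Certificates,
`sovereign-wildcard-tls-omani-works` and `omani-trade-wildcard`.
*/}}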
{{- range $i, $z := $zones }}
{{- /* Sanitise the zone name into a DNS-1123-compatible label suffix.
PowerDNS zone names contain dots; K8s resource names cannot.
`sovereign-wildcard-tls-omani.works` -> `sovereign-wildcard-tls-omani-works`.
Each per-zone Certificate uses a UNIQUE secret name (sanitised
zone) so the chart NEVER collides with the legacy
sovereign-tls Kustomization's `sovereign-wildcard-tls` resource.
The Cilium Gateway listener for each zone references the
corresponding sovereign-wildcard-tls-<sanitised-zone> Secret in
its certificateRefs block — operators that ship a multi-zone
Sovereign update the Gateway listener config in their per-cluster
overlay (or rely on the chart's Gateway template once issue #831
lands a multi-listener Gateway). */}}
{{- $sanitised := replace "." "-" $z.name }}
{{- $secretName := default (printf "sovereign-wildcard-tls-%s" $sanitised) $z.secretName }}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: {{ $secretName }}
  namespace: {{ $ns }}
  labels:
    catalyst.openova.io/component: sovereign-wildcard-cert
    catalyst.openova.io/parent-zone: {{ $z.name | quote }}
    catalyst.openova.io/parent-zone-role: {{ default "primary" $z.role | quote }}
    {{- if $.Values.global.sovereignFQDN }}
    catalyst.openova.io/sovereign: {{ $.Values.global.sovereignFQDN | quote }}
    {{- end }}
spec:
  secretName: {{ $secretName }}
  issuerRef:
    name: {{ $issuer }}
    kind: ClusterIssuer
  commonName: "*.{{ $z.name }}"
  dnsNames:
    - "*.{{ $z.name }}"
    - {{ $z.name | quote }}
  {{- with $duration }}
  duration: {{ . }}
  {{- end }}
  {{- with $renewBefore }}
  renewBefore: {{ . }}
  {{- end }}
{{- end }}
{{- end }}