openova/.github/workflows/test-bootstrap-kit.yaml
hatiyildiz 8bfdb80311 feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin
The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml,
bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish
commits land on main, but it cannot detect the "chart bumped but never
published" failure mode: the bootstrap-kit pin points at a chart
version that GHCR never received because blueprint-release.yaml
failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep,
runner cancellation, transient GHCR push 5xx).

Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180
and 1.4.181 were "lost" during the TBD-A20 scanner break window
(21:04Z → 22:07Z). The pin-sync audit reported chart=pin=1.4.181 PASS
while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist
until A58 manually re-fired the workflow via dispatch. Fresh
Sovereigns silently fell back to the last working tag.

What this adds
- scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and
  optional `--ghcr-org <org>`). For every chart pinned in the kit, it
  lists ghcr.io/<org>/<chart> tags via `gh api
  /orgs/<org>/packages/container/<chart>/versions --paginate`, then
  asserts the pinned version appears. Exits 1 on any missing tag.
- A per-chart tag cache avoids redundant paginations.
- .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now
  passes `--check-ghcr` on `push` to main + `workflow_dispatch`
  (PR mode stays `--changed-only` and skips GHCR; PRs cannot publish
  to GHCR anyway). On push and dispatch the job stays
  `continue-on-error: true`, under the same observational umbrella
  as the existing post-merge full sweep, so a transient API blip
  cannot red-flag every chart bump. The missing-tag list still
  surfaces on the run summary for operator attention.
- Job grants `packages: read` so the workflow GITHUB_TOKEN can list
  private package versions.
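
The `--check-ghcr` phase described above can be sketched roughly as
follows. This is a minimal illustration, not the code in
scripts/check-bootstrap-kit-pin-sync.sh: the function names
(`list_ghcr_tags`, `ghcr_has_tag`, `audit_pins`), the file-based cache,
and the "chart version" stdin format are all assumptions made for the
example. The `gh api` endpoint and `--paginate` flag come from the
description above; the `--jq` filter assumes the documented response
shape for container packages.

```shell
#!/usr/bin/env bash
# Sketch only. Function and variable names are hypothetical; they are not
# taken from scripts/check-bootstrap-kit-pin-sync.sh.
set -eu

# Per-run cache directory: one file of tags per chart, so each chart's
# version list is paginated at most once.
TAG_CACHE_DIR="$(mktemp -d)"
trap 'rm -rf "$TAG_CACHE_DIR"' EXIT

# Print every tag published for ghcr.io/<org>/<chart>, one per line.
# Assumes each version object carries its tags under
# .metadata.container.tags.
list_ghcr_tags() {
  gh api "/orgs/$1/packages/container/$2/versions" --paginate \
    --jq '.[].metadata.container.tags[]' < /dev/null
}

# Succeed only if <version> is a published tag of <org>/<chart>.
ghcr_has_tag() {
  local org="$1" chart="$2" version="$3"
  local cache_file="$TAG_CACHE_DIR/$chart.tags"
  if [ ! -f "$cache_file" ]; then
    # Write to a temp name first so a failed listing never leaves a
    # half-written cache file behind.
    list_ghcr_tags "$org" "$chart" > "$cache_file.tmp" || return 2
    mv "$cache_file.tmp" "$cache_file"
  fi
  # -F fixed string, -x whole-line match: 1.4.18 must not match 1.4.181.
  grep -Fxq -- "$version" "$cache_file"
}

# Read "chart version" pairs from stdin; return 1 if any pin is missing.
audit_pins() {
  local org="$1" missing=0 chart version
  while read -r chart version; do
    if ! ghcr_has_tag "$org" "$chart" "$version"; then
      echo "GHCR MISS ${chart}:${version}"
      missing=1
    fi
  done
  return "$missing"
}
```

Under these assumptions, `printf 'bp-catalyst-platform 1.4.181\n' |
audit_pins openova-io` would return 0 when the tag exists and print a
`GHCR MISS` line and return 1 when it does not. Keeping the cache on
disk rather than in a shell variable means it still works when the
check runs inside a pipeline subshell.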

Verification (origin/main snapshot, 2026-05-19)
- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags
  present — PASS exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13
  both set to a non-existent 99.99.99, the script exits 1 with
  `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the
  remediation hint pointing at `gh workflow run
  blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean
  exit 0 with the existing "nothing to check" message.

Refs #1872, #1864, #1856.

Closes #1872

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:11:14 +02:00

name: Test — Bootstrap Kit (kind cluster + Flux)

# Closes #145 — integration test that the 11-component bootstrap kit's
# Flux Kustomizations are well-formed and accepted by a real K8s API
# server. Spins up a kind cluster, installs Flux, and asserts that all
# 11 Kustomizations get registered. Does NOT wait for full reconciliation
# (chart pulls + cloud creds belong to #141 Hetzner E2E).

on:
  push:
    paths:
      - 'tests/e2e/bootstrap-kit/**'
      - 'platform/**/blueprint.yaml'
      - 'platform/**/chart/**'
      - 'products/**/chart/**'
      - 'clusters/**'
      - 'scripts/check-bootstrap-deps.sh'
      - 'scripts/check-bootstrap-kit-pin-sync.sh'
      - 'scripts/expected-bootstrap-deps.yaml'
      - '.github/workflows/test-bootstrap-kit.yaml'
    branches: [main]
  pull_request:
    paths:
      - 'tests/e2e/bootstrap-kit/**'
      - 'platform/**/blueprint.yaml'
      - 'platform/**/chart/**'
      - 'products/**/chart/**'
      - 'clusters/**'
      - 'scripts/check-bootstrap-deps.sh'
      - 'scripts/check-bootstrap-kit-pin-sync.sh'
      - 'scripts/expected-bootstrap-deps.yaml'
      - '.github/workflows/test-bootstrap-kit.yaml'
  workflow_dispatch:

jobs:
  dependency-graph-audit:
    # Audit the bootstrap-kit dependency graph against the expected DAG declared
    # in scripts/expected-bootstrap-deps.yaml. Mechanically verifies every HR's
    # spec.dependsOn matches the design contract in
    # docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §2 + §3, and detects cycles. Runs on
    # every PR that touches a bootstrap-kit HR or the audit data files. Owned by
    # W2.K0; consumed by W2.K1-K4 PRs to validate slot 15-48 additions.
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Install yq
        run: |
          sudo wget -qO /usr/local/bin/yq \
            https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64
          sudo chmod +x /usr/local/bin/yq
          yq --version
      - name: Run bootstrap-kit dependency audit
        run: bash scripts/check-bootstrap-deps.sh

  pin-sync-audit:
    # TBD-A6 regression test. Asserts every Chart.yaml in platform/* or
    # products/* whose chart is pinned in clusters/_template/bootstrap-
    # kit/ has the SAME version on both sides.
    #
    # On `pull_request` we use --changed-only --base <base-ref> so a PR
    # is only blocked on chart→pin pairs IT modified. This keeps the
    # gate effective (every new chart bump must update the pin) without
    # forcing pre-existing drifts (13 charts as of 2026-05-18) to be
    # fixed before any unrelated PR can land. The auto-bump hook in
    # blueprint-release.yaml will heal those drifts on the next bump
    # of each lagging chart.
    #
    # On `push` to main and `workflow_dispatch` we run the FULL sweep
    # so post-merge drift is observable on the run summary even if the
    # PR gate let it through.
    #
    # TBD-A17 mitigation (#1849, 2026-05-18): the full sweep on `push`
    # to main races with the blueprint-release auto-bump hook. When a
    # PR bumps a Chart.yaml version, the merge commit (which is what
    # this push event sees) does NOT yet contain the matching
    # bootstrap-kit pin bump — the auto-bump hook runs in a DIFFERENT
    # workflow (blueprint-release.yaml) and pushes the pin bump as a
    # follow-up bot commit, which (per GITHUB_TOKEN convention) does
    # NOT retrigger this workflow. So the FIRST run on every chart-
    # bumping merge sees `chart=N pin=N-1` drift and would block.
    # The actual desired-state is that the follow-up bot commit heals
    # the drift within ~60s. Push-mode is therefore observational, not
    # blocking; we use `continue-on-error: true` so the workflow stays
    # green while the drift is still visible on the run summary.
    #
    # TBD-A26 (issue #1872, 2026-05-19): full-sweep mode ALSO runs the
    # `--check-ghcr` phase, which verifies every pinned chart version
    # exists as a tag on ghcr.io/openova-io/<chart>. Catches the
    # "chart bumped but never published" failure mode that TBD-A6 +
    # TBD-A20 cannot see (e.g. blueprint-release.yaml failed with
    # startup_failure, race against TBD-A20 lockstep). Stays under the
    # same continue-on-error umbrella — observational on push/dispatch,
    # so a transient GHCR API blip doesn't red-flag every chart bump.
    # The job summary surfaces the missing-tag list for any operator
    # who notices the warning.
    runs-on: ubuntu-latest
    continue-on-error: ${{ github.event_name == 'push' || github.event_name == 'workflow_dispatch' }}
    permissions:
      # `gh api /orgs/<org>/packages/container/<chart>/versions` needs
      # the read:packages scope for private package metadata. The
      # workflow GITHUB_TOKEN inherits this from the `packages: read`
      # block when explicitly requested.
      contents: read
      packages: read
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Need history back to the PR base for the --changed-only diff.
          fetch-depth: 0
      - name: Run pin-sync audit (changed-only on PR, full sweep + --check-ghcr otherwise)
        env:
          # `gh` defers to GH_TOKEN when running on a runner; pass the
          # workflow token explicitly so the package-listing API call
          # picks up the `packages: read` scope granted above.
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          set -euo pipefail
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            base="${{ github.event.pull_request.base.sha }}"
            echo "Running --changed-only against base ${base}"
            bash scripts/check-bootstrap-kit-pin-sync.sh --changed-only --base "${base}"
          else
            echo "Running full sweep + --check-ghcr (event=${{ github.event_name }})"
            bash scripts/check-bootstrap-kit-pin-sync.sh --check-ghcr
          fi

  manifest-validation:
    # Static-only validation: blueprint.yaml + chart Chart.yaml + clusters/_template
    # parsing + dependency order check. Runs on every push.
    runs-on: ubuntu-latest
    needs: dependency-graph-audit
    defaults:
      run:
        working-directory: tests/e2e/bootstrap-kit
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache-dependency-path: tests/e2e/bootstrap-kit/go.sum
      - name: Run static validation
        run: go test -v -count=1

  kind-reconciliation:
    # Kind-cluster reconciliation: brings up kubernetes-in-docker, installs
    # Flux, and verifies the API server accepts our 11 bootstrap-kit
    # Kustomizations. Runs only on main to keep PRs fast — the ticket calls
    # for "all 11 phases install in sequence on a kind cluster (CI)" so this
    # is the long-form gate.
    runs-on: ubuntu-latest
    needs: manifest-validation
    if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
    defaults:
      run:
        working-directory: tests/e2e/bootstrap-kit
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache-dependency-path: tests/e2e/bootstrap-kit/go.sum
      - name: Set up kind
        uses: helm/kind-action@v1
        with:
          cluster_name: bootstrap-kit-test
          version: v0.25.0
          node_image: kindest/node:v1.30.6
      - name: Install Flux CLI
        uses: fluxcd/flux2/action@main
      - name: Run kind-reconciliation test
        env:
          BOOTSTRAP_KIT_KIND_TEST: '1'
          BOOTSTRAP_KIT_GIT_URL: https://github.com/${{ github.repository }}
        run: go test -v -count=1 -run TestBootstrapKit_KindReconciliation -timeout 10m