feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin

The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml,
bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish
commits land on main, but it cannot detect the "chart bumped but never
published" failure mode: the bootstrap-kit pin points at a chart version
that GHCR never received because blueprint-release.yaml failed (e.g.
TBD-A20 YAML scanner break, race with TBD-A20 lockstep, runner
cancellation, transient GHCR push 5xx).

Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180
and 1.4.181 were "lost" during the TBD-A20 scanner break window
(21:04Z → 22:07Z). The pin-sync audit reported chart=pin=1.4.181 PASS
while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist
until A58 manually re-fired the workflow via dispatch. Fresh Sovereigns
silently fell back to the last working tag.

What this adds

- scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and
  optional `--ghcr-org <org>`). For every chart pinned in the kit, it
  lists ghcr.io/<org>/<chart> tags via
  `gh api /orgs/<org>/packages/container/<chart>/versions --paginate`,
  then asserts the pinned version appears. Exits 1 on any missing tag.
- A per-chart tag cache avoids redundant paginations.
- .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now
  passes `--check-ghcr` on `push` to main + `workflow_dispatch` (PR mode
  stays `--changed-only` and skips GHCR, since PRs cannot publish to
  GHCR anyway). The job stays `continue-on-error: true` under the same
  observational umbrella as the existing post-merge full sweep so a
  transient API blip cannot red-flag every chart bump; the missing-tag
  list still surfaces on the run summary for operator attention.
- Job grants `packages: read` so the workflow GITHUB_TOKEN can list
  private package versions.

Verification (origin/main snapshot, 2026-05-19)

- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags present;
  PASS, exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13 both
  set to a non-existent 99.99.99, the script exits 1 with
  `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the
  remediation hint pointing at `gh workflow run blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean
  exit 0 with the existing "nothing to check" message.

Refs #1872, #1864, #1856.
Closes #1872

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
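The tag-presence assertion described in the message can be sketched in isolation. This is a minimal illustration rather than the script's own code: `tag_exists` and `sample_tags` are hypothetical names, and the sample list stands in for the tags returned by the `gh api` call. The sketch uses `grep -F` so that the dots in a version match literally, which is a choice made here and not something the commit states.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds only when the newline-separated tag list in $2
# contains a line exactly equal to the version in $1.
#   -F  treat the version as a fixed string (a "." is not a regex wildcard)
#   -x  match whole lines only (1.4.18 must not match 1.4.180)
#   -q  no output, exit status only
tag_exists() {
  local want="$1" tags="$2"
  grep -Fqx -- "$want" <<<"$tags"
}

# Stand-in for the tag list of one chart (assumed data, not real GHCR output).
sample_tags=$'1.4.179\n1.4.180\n104.181'

for ver in 1.4.180 1.4.181; do
  if tag_exists "$ver" "$sample_tags"; then
    echo "GHCR OK   ${ver}"
  else
    echo "GHCR MISS ${ver}"
  fi
done
```

With the sample data, `1.4.180` is reported as present and `1.4.181` as missing; the `104.181` entry shows why a fixed-string match matters, because a regex match on `1.4.181` would accept it.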
parent 26e4c8e30e
commit 8bfdb80311
29
.github/workflows/test-bootstrap-kit.yaml
vendored
@@ -85,8 +85,26 @@ jobs:
     # the drift within ~60s. Push-mode is therefore observational, not
     # blocking; we use `continue-on-error: true` so the workflow stays
     # green while the drift is still visible on the run summary.
+    #
+    # TBD-A26 (issue #1872, 2026-05-19): full-sweep mode ALSO runs the
+    # `--check-ghcr` phase, which verifies every pinned chart version
+    # exists as a tag on ghcr.io/openova-io/<chart>. Catches the
+    # "chart bumped but never published" failure mode that TBD-A6 +
+    # TBD-A20 cannot see (e.g. blueprint-release.yaml failed with
+    # startup_failure, race against TBD-A20 lockstep). Stays under the
+    # same continue-on-error umbrella — observational on push/dispatch,
+    # so a transient GHCR API blip doesn't red-flag every chart bump.
+    # The job summary surfaces the missing-tag list for any operator
+    # who notices the warning.
     runs-on: ubuntu-latest
     continue-on-error: ${{ github.event_name == 'push' || github.event_name == 'workflow_dispatch' }}
+    permissions:
+      # `gh api /orgs/<org>/packages/container/<chart>/versions` needs
+      # the read:packages scope for private package metadata. The
+      # workflow GITHUB_TOKEN inherits this from the `packages: read`
+      # block when explicitly requested.
+      contents: read
+      packages: read
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -94,7 +112,12 @@ jobs:
           # Need history back to the PR base for the --changed-only diff.
           fetch-depth: 0

-      - name: Run pin-sync audit (changed-only on PR, full sweep otherwise)
+      - name: Run pin-sync audit (changed-only on PR, full sweep + --check-ghcr otherwise)
+        env:
+          # `gh` defers to GH_TOKEN when running on a runner; pass the
+          # workflow token explicitly so the package-listing API call
+          # picks up the `packages: read` scope granted above.
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
           set -euo pipefail
           if [ "${{ github.event_name }}" = "pull_request" ]; then
@@ -102,8 +125,8 @@ jobs:
             echo "Running --changed-only against base ${base}"
             bash scripts/check-bootstrap-kit-pin-sync.sh --changed-only --base "${base}"
           else
-            echo "Running full sweep (event=${{ github.event_name }})"
-            bash scripts/check-bootstrap-kit-pin-sync.sh
+            echo "Running full sweep + --check-ghcr (event=${{ github.event_name }})"
+            bash scripts/check-bootstrap-kit-pin-sync.sh --check-ghcr
           fi

   manifest-validation:
scripts/check-bootstrap-kit-pin-sync.sh
@@ -35,6 +35,24 @@
 # 0 — every bootstrap-kit pin matches its source-tree Chart.yaml version.
 # 1 — at least one pin lags (or, less likely, leads) the source chart.
 # 2 — input/parse/usage error.
+#
+# TBD-A26 (issue #1872, 2026-05-19) — `--check-ghcr` extension.
+#
+# Even when every bootstrap-kit pin equals its source Chart.yaml version,
+# the published OCI artifact at ghcr.io/openova-io/<chart>:<pin-ver> may
+# still NOT EXIST. Concrete failure pattern from the 2026-05-18/19 wave:
+# the TBD-A20 YAML scanner break window (21:04Z → 22:07Z) caused
+# blueprint-release.yaml to fail with `startup_failure / jobs: []` while
+# the bootstrap-kit pin + Chart.yaml bumped normally. Versions 1.4.180 +
+# 1.4.181 of bp-catalyst-platform were "lost" until A58 manually re-fired
+# the workflow via dispatch — pin pointed at a GHCR tag that never landed.
+#
+# `--check-ghcr` adds a third phase: for every chart pinned in the kit,
+# call `gh api /orgs/openova-io/packages/container/<chart>/versions` and
+# assert the pin version appears in the published tags. Requires `gh`
+# authenticated with read:packages scope.
+#
+# Exit code 1 also covers a missing GHCR tag.

 set -euo pipefail

@@ -45,13 +63,17 @@ KIT_DIR="${REPO_ROOT}/clusters/_template/bootstrap-kit"

 CHANGED_ONLY=""
 BASE_REF=""
+CHECK_GHCR=""
+GHCR_ORG="openova-io"

-# Two modes:
+# Modes:
 # - Full sweep (default): check every chart in the working tree.
 # - --changed-only --base <ref>: only check charts whose Chart.yaml
 # was modified between <ref> and HEAD. This is the CI-gate mode —
 # it lets a PR ship without first fixing 13 pre-existing drifts
 # (the auto-bump hook will heal those over time).
+# - --check-ghcr: also verify each pin's GHCR artifact exists
+# (TBD-A26, issue #1872). Composes with both modes above.
 while [ "$#" -gt 0 ]; do
   case "$1" in
     --changed-only)
@@ -62,8 +84,16 @@ while [ "$#" -gt 0 ]; do
       BASE_REF="$2"
       shift 2
       ;;
+    --check-ghcr)
+      CHECK_GHCR=1
+      shift
+      ;;
+    --ghcr-org)
+      GHCR_ORG="$2"
+      shift 2
+      ;;
     -h|--help)
-      sed -n '2,40p' "$0"
+      sed -n '2,60p' "$0"
       exit 0
       ;;
     *)
@@ -174,6 +204,13 @@ fi
 drift=0
 checked=0
 skipped=0
+# TBD-A26: collect (chart-name, pinned-version, pin-file) tuples for the
+# optional --check-ghcr phase. We use three parallel arrays (bash 3.x
+# friendly — GitHub runners default to bash 5 but the script must also
+# work on macOS dev machines with bash 3.2).
+declare -a GHCR_NAMES=()
+declare -a GHCR_VERSIONS=()
+declare -a GHCR_PINS=()

 # Walk every Chart.yaml in platform/* and products/*. Reading from
 # Chart.yaml lets us follow a Chart.yaml `name:` rename without needing
@@ -228,6 +265,14 @@ for chart_yaml in "${CHART_YAMLS[@]}"; do
     echo " DRIFT ${name}: chart=${version} pin=${pinned_version} (file: ${pin_file#${REPO_ROOT}/})"
     drift=$((drift + 1))
   fi
+
+  # Collect the pin tuple for the optional --check-ghcr phase. We
+  # check the PIN version (not the chart version) — the contract is
+  # that whatever the kit installs must exist on GHCR. If drift is
+  # also flagged, both errors are reported.
+  GHCR_NAMES+=("${name}")
+  GHCR_VERSIONS+=("${pinned_version}")
+  GHCR_PINS+=("${pin_file#${REPO_ROOT}/}")
 done

 echo
@@ -249,5 +294,88 @@ if [ "${drift}" -gt 0 ]; then
   exit 1
 fi

+# ──────────────────────────────────────────────────────────────────────
+# TBD-A26 (issue #1872) — GHCR artifact existence check
+# ──────────────────────────────────────────────────────────────────────
+# For every (chart, pinned_version) pair, assert the pin version exists
+# as a tag on ghcr.io/<org>/<chart>. Catches the failure mode where the
+# bootstrap-kit pin and Chart.yaml are in sync (drift=0) but the
+# blueprint-release workflow that should publish the OCI artifact never
+# actually ran (e.g. startup_failure from a YAML scanner break, race
+# with TBD-A20 lockstep) — Sovereigns then pin a tag GHCR never received.
+if [ -n "${CHECK_GHCR}" ]; then
+  echo
+  echo "── TBD-A26: GHCR artifact existence check (${GHCR_ORG}) ──"
+  if ! command -v gh >/dev/null 2>&1; then
+    echo "error: --check-ghcr requires the 'gh' CLI on PATH" >&2
+    exit 2
+  fi
+  if ! command -v jq >/dev/null 2>&1; then
+    echo "error: --check-ghcr requires 'jq' on PATH" >&2
+    exit 2
+  fi
+  ghcr_missing=0
+  ghcr_checked=0
+  # Cache per-chart tag lists so we only paginate once even if a chart
+  # appears in multiple slots (defence-in-depth — the one-slot-per-chart
+  # invariant is enforced above, but the cache costs nothing).
+  declare -A TAG_CACHE=()
+  for idx in "${!GHCR_NAMES[@]}"; do
+    name="${GHCR_NAMES[$idx]}"
+    pin_ver="${GHCR_VERSIONS[$idx]}"
+    pin_path="${GHCR_PINS[$idx]}"
+    if [ -z "${TAG_CACHE[$name]+x}" ]; then
+      # `gh api --paginate` walks every page of the versions list.
+      # `2>/dev/null` suppresses progress noise; a real API error
+      # surfaces as an empty body and a non-zero exit which we treat
+      # as a fail (cannot prove existence ⇒ block).
+      if ! tags_json=$(gh api "/orgs/${GHCR_ORG}/packages/container/${name}/versions" --paginate 2>/dev/null); then
+        echo "::error title=GHCR API error::Failed to list versions for ghcr.io/${GHCR_ORG}/${name}. Check 'gh' auth has read:packages scope and the package exists." >&2
+        ghcr_missing=$((ghcr_missing + 1))
+        TAG_CACHE[$name]=""
+        continue
+      fi
+      # Extract human-readable tags only (exclude cosign .sig/.att
+      # synthetic tags shaped `sha256-…`). One tag per line.
+      tags=$(echo "$tags_json" | jq -r '.[].metadata.container.tags[]?' 2>/dev/null | grep -v '^sha256-' | sort -u || true)
+      TAG_CACHE[$name]="$tags"
+    fi
+    tags="${TAG_CACHE[$name]}"
+    ghcr_checked=$((ghcr_checked + 1))
+    if echo "$tags" | grep -qx "$pin_ver"; then
+      echo " GHCR OK ${name}:${pin_ver} (pin file: ${pin_path})"
+    else
+      echo " GHCR MISS ${name}:${pin_ver} — tag NOT FOUND on ghcr.io/${GHCR_ORG}/${name} (pin file: ${pin_path})"
+      ghcr_missing=$((ghcr_missing + 1))
+    fi
+  done
+  echo
+  echo "GHCR-checked ${ghcr_checked} pin(s); ${ghcr_missing} missing artifact(s)."
+  if [ "${ghcr_missing}" -gt 0 ]; then
+    echo
+    echo "FAIL: ${ghcr_missing} bootstrap-kit pin(s) reference a chart version"
+    echo "that does NOT exist on GHCR. Every fresh Sovereign provision will"
+    echo "fail to install the affected Blueprints at the pinned version and"
+    echo "fall back to the last working release."
+    echo
+    echo "Root cause is usually one of:"
+    echo " - blueprint-release.yaml failed during the publish run that"
+    echo " should have produced the artifact (e.g. startup_failure from"
+    echo " a YAML scanner break — TBD-A20)."
+    echo " - The publish run was cancelled, OOM'd, or hit a transient"
+    echo " GHCR push 5xx."
+    echo
+    echo "Fix: re-fire the publish workflow on the commit that bumped the"
+    echo "chart version, e.g.:"
+    echo " gh workflow run blueprint-release.yaml \\"
+    echo " --field blueprint=<chart-folder> --field tree=<platform|products>"
+    echo "Then re-run this audit to confirm the tag now exists."
+    exit 1
+  fi
+fi
+
 echo "PASS: all bootstrap-kit pins are in sync with their source charts."
+if [ -n "${CHECK_GHCR}" ]; then
+  echo "PASS: every pinned version exists as a GHCR tag."
+fi
 exit 0
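One portability point is worth illustrating. The script's comments say it must also run under bash 3.2 on macOS, while the per-chart cache in the diff uses `declare -A`, and associative arrays only exist from bash 4.0 onward. The sketch below shows one way a cache could be kept with plain indexed arrays; it is an assumption-laden illustration, not part of this commit, and `cache_put`, `cache_get`, `CACHE_KEYS` and `CACHE_VALS` are hypothetical names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two parallel indexed arrays stand in for an associative array:
# CACHE_KEYS[i] holds a chart name, CACHE_VALS[i] its newline-separated tags.
CACHE_KEYS=()
CACHE_VALS=()

# cache_put NAME VALUE: append an entry (no de-duplication in this sketch).
cache_put() {
  CACHE_KEYS[${#CACHE_KEYS[@]}]="$1"
  CACHE_VALS[${#CACHE_VALS[@]}]="$2"
}

# cache_get NAME: print the cached value and return 0 on a hit,
# print nothing and return 1 on a miss. Linear scan over the keys.
cache_get() {
  local i=0
  while [ "$i" -lt "${#CACHE_KEYS[@]}" ]; do
    if [ "${CACHE_KEYS[$i]}" = "$1" ]; then
      printf '%s' "${CACHE_VALS[$i]}"
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage with assumed sample data: the first lookup misses, the second hits.
if ! cache_get "bp-catalyst-platform" >/dev/null; then
  cache_put "bp-catalyst-platform" $'1.4.180\n1.4.181'
fi
if tags=$(cache_get "bp-catalyst-platform"); then
  echo "cache hit for bp-catalyst-platform"
fi
```

A linear scan is adequate at this scale (the commit message reports 50 chart→pin pairs), and it keeps the helper free of bash 4 features.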