Commit Graph

2 Commits

Author SHA1 Message Date
e3mrah
0dbdf3b327
fix(bp-trivy): node-collector tolerates control-plane taint (closes #769) (#772)
PR #755 added `node-role.kubernetes.io/control-plane=true:NoSchedule` to
the CP node when worker_count > 0. Two bootstrap-kit charts have pods
that MUST land on the CP and lacked the matching toleration:

bp-trivy
  • node-collector: Pod pinned to each node via nodeSelector
    `kubernetes.io/hostname=<node>`. The CP-bound collector reads
    /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler,
    /var/lib/kube-controller-manager via hostPath — these only exist
    on the CP. Without the toleration the collector sat Pending forever
    on otech93 (live evidence in #769).
  • scanJobTolerations: per-workload scan jobs the operator spawns may
    target pods on CP-only system DaemonSets (kube-system kube-proxy
    in non-Cilium mode, etc.). The toleration is added here as well so
    that reports are produced for those workloads too.

bp-alloy
  • DaemonSet — one pod MUST land on every node including the CP, so
    CP-local kubelet logs + node metrics flow into the LGTM stack.
    Without the toleration Alloy ran on only 3 of 4 nodes (Ready=N-1)
    on otech93, and CP telemetry was silently lost.

Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP
is untainted in solo mode per PR #755's conditional.
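
The scheduling behaviour described above can be sketched in a few lines of
Python. This is a simplified model of how Kubernetes matches tolerations
against taints, not the scheduler's actual code. The node names, the
4-node layout and the choice of `operator: Exists` are assumptions made
for illustration; the commit only states that a "matching toleration" was
added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key + Exists matches every taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def schedulable(node_taints: list[Taint], tolerations: list[Toleration]) -> bool:
    """A pod fits a node if every blocking taint is tolerated."""
    blocking = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


def ready_count(nodes: dict[str, list[Taint]], tolerations: list[Toleration]) -> int:
    """Number of nodes a DaemonSet pod with these tolerations can land on."""
    return sum(schedulable(taints, tolerations) for taints in nodes.values())


# Taint from PR #755, applied to the CP only when worker_count > 0.
CP_TAINT = Taint("node-role.kubernetes.io/control-plane", "NoSchedule", "true")
# Assumed shape of the added toleration (Exists ignores the taint value).
CP_TOLERATION = Toleration(
    key="node-role.kubernetes.io/control-plane",
    operator="Exists",
    effect="NoSchedule",
)

# Hypothetical clusters: one tainted CP plus three workers, and a solo CP.
MULTI_NODE = {"cp": [CP_TAINT], "worker-1": [], "worker-2": [], "worker-3": []}
SOLO = {"cp": []}

print(ready_count(MULTI_NODE, []))               # 3
print(ready_count(MULTI_NODE, [CP_TOLERATION]))  # 4
print(ready_count(SOLO, []), ready_count(SOLO, [CP_TOLERATION]))  # 1 1
```

In this model the untainted solo CP accepts the pod with or without the
toleration, which is why the change is a no-op when worker_count=0.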

Versions bumped:
  • bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins)
  • bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins)

Out of scope (audited, no change needed):
  • bp-cilium — upstream defaults already tolerate everything (verified
    on otech93: cilium DaemonSet at 4/4 nodes).
  • bp-falco — values.yaml already declares NoSchedule + NoExecute
    Exists tolerations (4/4 on otech93).
  • cnpg/harbor — no kubelet-cert-renew Jobs in current charts.

Verified:
  • `helm template` on both charts renders the expected toleration
    (alloy: pod-spec; trivy: trivy-operator-config ConfigMap consumed
     by the operator at scan-job spawn time).
  • `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift).
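
A check like the `helm template` verification for the alloy pod-spec could
be automated roughly as follows. This is a minimal sketch that assumes the
rendered manifest has already been parsed into a Python dict (for example
by a YAML parser, not shown here). The manifest contents and the
`operator: Exists` form are hypothetical, and the trivy ConfigMap path is
not covered.

```python
from typing import Any


def pod_tolerations(manifest: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the pod-template tolerations of a workload manifest, or []."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("tolerations") or []


def has_toleration(manifest: dict[str, Any], key: str, effect: str) -> bool:
    """True if any toleration in the pod template covers (key, effect)."""
    for tol in pod_tolerations(manifest):
        # An empty key with operator Exists matches every taint key.
        key_ok = tol.get("key") == key or (
            not tol.get("key") and tol.get("operator") == "Exists"
        )
        # A missing or empty effect matches every effect.
        effect_ok = tol.get("effect") in (None, "", effect)
        if key_ok and effect_ok:
            return True
    return False


# Hypothetical rendered DaemonSet, trimmed to the fields that matter here.
rendered = {
    "kind": "DaemonSet",
    "metadata": {"name": "alloy"},
    "spec": {
        "template": {
            "spec": {
                "tolerations": [
                    {
                        "key": "node-role.kubernetes.io/control-plane",
                        "operator": "Exists",
                        "effect": "NoSchedule",
                    }
                ]
            }
        }
    },
}

ok = has_toleration(rendered, "node-role.kubernetes.io/control-plane", "NoSchedule")
print("control-plane toleration rendered:", ok)
```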

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:38:29 +02:00
e3mrah
75128781b3
feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214)
* feat(bp-grafana): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana — visualization layer of the
LGTM observability stack (Loki/Grafana/Tempo/Mimir).

Pinned to grafana/grafana 10.5.15 (appVersion 12.3.1) — current stable
on 2026-04-29. Solo-Sovereign defaults: 1 replica, 10Gi PVC,
ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-loki): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Loki — log aggregation backend
of the LGTM stack. SingleBinary mode by default (solo-Sovereign min);
SimpleScalable/Distributed are values toggles.

Pinned to grafana/loki 7.0.0 (appVersion 3.6.7) on 2026-04-29.
Filesystem storage default; SeaweedFS S3 wiring is per-Sovereign overlay
when scaling out. All observability toggles default false per
BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-mimir): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Mimir — metrics storage tier of
the LGTM stack.

Pinned to grafana/mimir-distributed 6.0.6 (appVersion 3.0.4) on
2026-04-29. Solo-Sovereign defaults: every component scaled to 1
replica, zoneAwareReplication disabled, Kafka ingest-storage disabled.
Bundled MinIO kept enabled as a stop-gap so the chart renders;
SeaweedFS S3 wiring is per-Sovereign overlay. All metaMonitoring
toggles default false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-tempo): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Tempo — distributed tracing
backend of the LGTM stack. Single-binary mode by default
(solo-Sovereign min); microservice mode (tempo-distributed) is a chart
swap toggle.

Pinned to grafana/tempo 1.24.4 (appVersion 2.9.0) on 2026-04-29. Local
PVC storage default; SeaweedFS S3 wiring is per-Sovereign overlay.
Metrics generator disabled by default (depends on bp-mimir).
ServiceMonitor default false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-alloy): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Alloy — unified telemetry
collector for the LGTM stack (logs, metrics, traces; OTLP-native).

Pinned to grafana/alloy 1.8.0 (appVersion v1.16.0) on 2026-04-29.
DaemonSet controller default (one Alloy per node) so node + container
telemetry work out of the box. Empty Alloy config by default;
per-Sovereign overlays populate forwarders to bp-loki/bp-mimir/bp-tempo
once those reconcile. ServiceMonitor + ingress + CRDs default false per
BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-opentelemetry): umbrella chart for observability stack

Catalyst Blueprint umbrella for the OpenTelemetry Collector — vendor-
neutral telemetry collector. Sibling to bp-alloy; per-Sovereign overlays
choose one.

Pinned to open-telemetry/opentelemetry-collector 0.152.0 (appVersion
0.150.1) on 2026-04-29. Uses the contrib distribution
(otel/opentelemetry-collector-contrib:0.150.1) so Loki/Mimir/Tempo
exporters are bundled. Deployment mode default (1 replica); DaemonSet
+ StatefulSet are values toggles. All presets default false; ingress
+ ServiceMonitor + PodMonitor + PrometheusRule + NetworkPolicy default
false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-langfuse): umbrella chart for observability stack

Catalyst Blueprint umbrella for Langfuse — LLM observability platform.
Complements bp-grafana (infrastructure metrics) with AI-specific
telemetry (traces, evaluations, prompts, cost attribution).

Pinned to langfuse/langfuse 1.5.28 (appVersion 3.171.0) on 2026-04-29.

Catalyst convention: ALL bundled Bitnami subcharts are disabled —
PostgreSQL via cnpg.io/Cluster (bp-cnpg), Redis via bp-valkey,
ClickHouse via bp-clickhouse, S3 via bp-seaweedfs. Per-Sovereign
overlays wire external endpoints + Secret references. Telemetry to
Langfuse Inc. defaulted false; signUpDisabled defaulted true.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-velero): umbrella chart for observability stack

Catalyst Blueprint umbrella for Velero — Kubernetes-native backup and
disaster recovery. Per platform/velero/README.md, ALL Velero output
goes to SeaweedFS (Catalyst's unified S3 encapsulation), which
transitions to a cloud archival backend on the cold tier.

Pinned to vmware-tanzu/velero 12.0.1 (appVersion 1.18.0) on 2026-04-29.
Bundled velero-plugin-for-aws:v1.14.0 init container so SeaweedFS S3 is
reachable. backupsEnabled/snapshotsEnabled defaulted false at this
layer (placeholders for backupStorageLocation); per-Sovereign overlays
flip on after wiring SeaweedFS endpoint + credentials. ServiceMonitor +
PodMonitor + PrometheusRule default false per BLUEPRINT-AUTHORING.md
§11.2.

Part of issue #204 observability-stack umbrellas batch.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 22:11:04 +02:00