DevOps & Observability

Containers, Kubernetes, Edge (HTTP/2/3 & gRPC), Metrics/Logs/Tracing, and SRE Playbooks

Status: Concept for v0 / v0.1 · Last updated: YYYY‑MM‑DD


1) Purpose & Scope

This document defines how we run Crumpet components in production:

  • Workloads: CIN Nodes, Pinning Orchestrator, Watchers, Frontend, Authoring Desktop updater API (if any).

  • Edge: HTTP/2/3, gRPC, WebSocket termination (Envoy/Nginx).

  • Kubernetes: charts, autoscaling, secrets, backups.

  • Observability: Prometheus metrics, Loki logs, OpenTelemetry traces, Grafana dashboards.

  • Reliability: SLOs, alerts, playbooks, chaos & DR.

  • Security: TLS, mTLS, RBAC, network policies, rate‑limits, WAF.

Goal: A minimal, reproducible platform that any community can operate, with clear SLOs and fully auditable telemetry.


2) Reference Topology

```
Users ──▶ CDN/WAF ──▶ Envoy/Nginx (edge)
                    ├──▶ apps/web (Next.js) [SSR/ISR]
                    ├──▶ cin-node (HTTP/2 JSON + WS)
                    ├──▶ cin-grpc (gRPC control)
                    └──▶ pin-orchestrator (REST)
Sidecars: OpenTelemetry Collector (OTLP), Promtail
Shared: Prometheus, Loki, Grafana, Tempo/Jaeger, Alertmanager
Storage: IPFS Kubo (local), S3-compatible object store (snapshots/receipts), Postgres (optional)
```

Ports

  • Edge: 443 (HTTP/2 + HTTP/3/QUIC), WebSocket over H2.

  • CIN Node: 8080 (HTTP/2 JSON & WS), 8081 (gRPC), 4001/udp (libp2p QUIC).

  • Pin Orchestrator: 7070 (REST).

  • Kubo: 5001 (API), 4001 (P2P).


3) Containerization

Each component ships an image:

| Image | Purpose | Example tag |
| --- | --- | --- |
| ghcr.io/crumpet/cin-node | Indexer & API | v0.1.0 |
| ghcr.io/crumpet/pin-orchestrator | Multi‑pin quorum | v0.1.0 |
| ghcr.io/crumpet/watcher | Uptime/committee attestations | v0.1.0 |
| ghcr.io/crumpet/web | Next.js dApp | v0.1.0 |

Dockerfile best practices

  • Distroless base image, run as non‑root, read‑only root filesystem (readOnlyRootFilesystem: true in the pod securityContext); see the sketch below.
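
A minimal sketch of how that guidance lands in a pod spec, assuming a Deployment for cin-node (all names and resource values are illustrative):

```yaml
# Illustrative hardening sketch for cin-node: non-root user, read-only root FS,
# no privilege escalation, writable scratch space via an emptyDir.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cin-node
spec:
  replicas: 2
  selector:
    matchLabels: { app: cin-node }
  template:
    metadata:
      labels: { app: cin-node }
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
      containers:
        - name: cin-node
          image: ghcr.io/crumpet/cin-node:v0.1.0
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp        # only writable path
      volumes:
        - name: tmp
          emptyDir: {}
```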


4) Kubernetes (Helm) — Minimal Values
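
The chart values themselves are not reproduced here; as a rough sketch, a minimal values.yaml for charts/cin-node (§17) could look like the following. All key names and numbers are assumptions, with ports taken from §2 and sizing from §12:

```yaml
# Illustrative minimal values.yaml for charts/cin-node (key names assumed).
image:
  repository: ghcr.io/crumpet/cin-node
  tag: v0.1.0
  pullPolicy: IfNotPresent

replicaCount: 2

resources:
  requests: { cpu: "1", memory: 2Gi }
  limits:   { cpu: "2", memory: 4Gi }

service:
  http: 8080   # HTTP/2 JSON + WS
  grpc: 8081   # gRPC control
  p2p:  4001   # libp2p QUIC (UDP)

persistence:
  enabled: true
  size: 50Gi   # index + cache (see §12)

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70

metrics:
  serviceMonitor: true   # Prometheus scrape of /metrics
```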

NetworkPolicy (restrict lateral movement)
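
A sketch of the policy described here, assuming cin-node pods carry the label app: cin-node and the Envoy edge runs in a namespace called edge with label app: envoy-edge (all selectors are assumptions):

```yaml
# Sketch: only Envoy edge pods may reach cin-node's HTTP/gRPC ports (see §11).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cin-node-ingress
spec:
  podSelector:
    matchLabels:
      app: cin-node
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: edge
          podSelector:
            matchLabels:
              app: envoy-edge
      ports:
        - protocol: TCP
          port: 8080   # HTTP/2 JSON + WS
        - protocol: TCP
          port: 8081   # gRPC
```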


5) Edge & Protocols

5.1 HTTP/2 & gRPC (Envoy)

envoy.yaml (excerpt)
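
The full config ships under ops/envoy/ (§17); the following is an illustrative excerpt only, showing one HTTPS listener with WebSocket upgrades and a prefix split between gRPC, the CIN JSON API, and the web app. TLS transport sockets and the cin-node/web clusters are omitted, and all names are assumptions:

```yaml
static_resources:
  listeners:
    - name: https
      address:
        socket_address: { address: 0.0.0.0, port_value: 443 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: edge
                codec_type: AUTO                 # negotiate HTTP/1.1 and HTTP/2
                upgrade_configs:
                  - upgrade_type: websocket      # WS termination at the edge
                route_config:
                  virtual_hosts:
                    - name: crumpet
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/grpc." }   # gRPC control plane
                          route: { cluster: cin-grpc }
                        - match: { prefix: "/v1/" }     # CIN JSON + WS API
                          route: { cluster: cin-node }
                        - match: { prefix: "/" }        # Next.js frontend
                          route: { cluster: web }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: cin-grpc                             # cin-node and web follow the same pattern
      type: STRICT_DNS
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}           # speak HTTP/2 to the gRPC upstream
      load_assignment:
        cluster_name: cin-grpc
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: cin-node, port_value: 8081 }
```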

gRPC‑Web: if some clients need it, add Envoy's gRPC‑Web filter on the /grpc.* routes.

5.2 HTTP/3 (QUIC)

  • Enable in Envoy/Cloud LB; keep HTTP/2 as fallback.

  • For CIN node gossip/data between nodes, prefer libp2p QUIC on a firewall‑opened UDP port with allowlists.


6) Secrets & TLS

  • Certificates: Let’s Encrypt via cert‑manager (ACME) or managed certs from the cloud LB (see the Certificate sketch after this list).

  • Secrets: Kubernetes Secrets + sealed‑secrets or external vault (HashiCorp Vault, AWS KMS).

  • mTLS (optional) between Envoy ↔ CIN gRPC for operator meshes.

  • Key rotation: semi‑annual for TLS; quarterly for watcher/orchestrator ed25519 keys (publish fingerprints).
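
To illustrate the cert‑manager option above, a Certificate resource might look like this (issuer name, namespace, and hostname are assumptions; the referenced ClusterIssuer would be configured for Let’s Encrypt ACME):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: edge-tls
  namespace: edge
spec:
  secretName: edge-tls            # consumed by the Envoy/ingress TLS config
  dnsNames:
    - cin.example.org             # placeholder hostname
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  renewBefore: 720h               # start renewal ~30 days before expiry
```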


7) Observability Stack

7.1 Metrics (Prometheus)

  • Scrape CIN (/metrics), Orchestrator, Watcher, Envoy, and node exporter (a minimal scrape config sketch follows this list).

  • Critical metrics (see dashboards §7.4):

    • cin_query_latency_ms{ranker}, ingest_lag_seconds, snapshot_age_seconds{lang}, preview_mismatch_total, ipfs_fetch_ms, ws_clients, quorum_fail_total.

    • Orchestrator: pin_jobs_total{result}, pin_ack_count, backend_latency_ms{backend}.

    • Watcher: probe_success_total{node}, attest_*_submissions_total.

    • Envoy: http_downstream_rq_xx, upstream_rq_time.
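
A minimal static scrape config for the targets above might look like the following (job names, ports, and targets are assumptions; in‑cluster setups would normally use ServiceMonitors instead):

```yaml
scrape_configs:
  - job_name: cin-node
    metrics_path: /metrics
    static_configs:
      - targets: ["cin-node:8080"]
  - job_name: pin-orchestrator
    static_configs:
      - targets: ["pin-orchestrator:7070"]
  - job_name: watcher
    static_configs:
      - targets: ["watcher:9100"]        # port assumed
  - job_name: envoy
    metrics_path: /stats/prometheus      # Envoy admin endpoint
    static_configs:
      - targets: ["envoy-edge:9901"]     # admin port assumed
```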

7.2 Logs (Loki + Promtail)

  • JSON logs; correlation id header x‑req‑id.

  • Retention: 14 days hot, 90 days cold (object storage).

7.3 Tracing (OpenTelemetry → Tempo/Jaeger)

  • OTLP exporter in services; parent span from edge → service → IPFS/Kasplex RPC.

  • Sample rate: 1% baseline; 100% for errors / long requests > 1s.
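
One way to realize that policy is the OpenTelemetry Collector's tail‑sampling processor; a sketch (pipeline wiring omitted, policy names illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s              # buffer spans before the sampling decision
    policies:
      - name: errors                # keep 100% of error traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests         # keep 100% of traces slower than 1s
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline              # 1% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```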

7.4 Dashboards (Grafana)

  • CIN Overview: QPS, p50/p95 latency by ranker, ingest lag, snapshot staleness by lang, WS clients.

  • Orchestrator: job success rate, acks histogram, per‑backend latency/errors.

  • Watcher: probe pass rate, freshness compliance, chain tx status.

  • Edge: 2xx/4xx/5xx, backend saturation, HTTP/2 streams, HTTP/3 ratio.

  • Node Rewards view: epoch timers, attestation volumes (pulled from on‑chain subgraph or direct RPC).


8) SLOs & Alerts

Service Level Objectives (per public node):

  • Availability: 99.9% monthly (HTTP 200 on /v1/health).

  • Latency (read): p95 ≤ 400 ms for /v1/search (size≤20) at nominal load.

  • Freshness: snapshot_age_seconds{lang} ≤ 900 for primary languages.

  • WebSocket uptime: > 99% of clients remain connected ≥ 1 h without server‑initiated drops.

Alerting (Alertmanager):

  • P1: HTTP 5xx > 5% for 10m, or p95 latency > 1.5s for 15m.

  • P1: snapshot_age_seconds > 1800 (stale index); see the example rule after this list.

  • P1: Orchestrator pin_jobs_total{result="timeout"} / pin_jobs_total > 0.2 for 15m.

  • P2: Watcher probe success rate < 80% for 30m.

  • P2: Disk usage > 80% on index PVC.

  • P3: Cert expiry < 7 days.
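
As an illustration, the stale‑index alert above could be expressed in ops/prometheus/alerts.yaml roughly as follows (group name, for: window, and runbook path are assumptions):

```yaml
groups:
  - name: crumpet-cin
    rules:
      - alert: SnapshotStale
        expr: max by (lang) (snapshot_age_seconds) > 1800
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "CIN snapshot stale for lang {{ $labels.lang }}"
          runbook: "docs/ops/runbooks/R1-snapshot-stale.md"   # filename assumed
```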

Notification routes: PagerDuty/On‑call rotation for P1/P2; Slack/email for P3.


9) CI/CD & Releases

Branching

  • main: protected, release‑ready.

  • develop: integration branch.

  • Feature branches → PR → required checks (lint, tests, e2e).

Pipelines

  • Build containers (SBOM + cosign signature).

  • Unit + integration + e2e suites (docker‑compose for CIN trio with mock RPC/IPFS).

  • Security: Trivy scan, SAST, dependency audit.

  • Tagging: vMAJOR.MINOR.PATCH.

  • Release: Helm chart version bump, changelog, GitHub Release.

Deploy

  • Staging: canary at 10% traffic for 1 h; observe SLOs; auto‑roll back on error‑budget burn.

  • Prod: progressive rollout (Argo Rollouts; see the Rollout sketch after this list); blue/green for Envoy if config‑heavy.

  • Config via ConfigMap/Secret; immutable images.
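
A sketch of that progressive rollout using an Argo Rollouts canary that mirrors the staging policy (10% for 1 h, then full promotion). Resource names are assumptions, and automated rollback on SLO burn would additionally require an AnalysisTemplate (not shown):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: cin-node
spec:
  replicas: 4
  selector:
    matchLabels: { app: cin-node }
  template:
    metadata:
      labels: { app: cin-node }
    spec:
      containers:
        - name: cin-node
          image: ghcr.io/crumpet/cin-node:v0.1.0
  strategy:
    canary:
      steps:
        - setWeight: 10              # shift ~10% toward the canary (exact weighting needs trafficRouting, omitted)
        - pause: { duration: 1h }    # observe SLOs / error budget before promoting
      # after the final step the Rollout promotes to 100% automatically
```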


10) Backups & DR

  • Indices (Tantivy): rebuildable from chain + IPFS, but snapshot nightly to S3 for fast restore (optional).

  • Config & receipts: back up Secrets/ConfigMaps and S3 buckets daily (versioned).

  • IPFS pins: rely on ≥3 independent providers; monitor with orchestrator repair.

  • DR runbook:

    1. Scale new region from IaC (Terraform).

    2. Restore Helm releases (same versions).

    3. Bootstrap IPFS peers; warm caches for last 7 days.

    4. Re‑ingest chain from checkpoint; switch DNS.

  • Targets: RTO ≤ 2 h; RPO ≤ 1 h.


11) Security Hardening

  • RBAC: least privilege ServiceAccounts; no cluster‑admin pods.

  • Pod Security (PSA, successor to PSP): runAsNonRoot, readOnlyRootFilesystem, drop capabilities.

  • NetworkPolicies: only Envoy may talk to CIN HTTP/GRPC; CIN egress limited to RPC endpoints, IPFS gateways, known peers.

  • WAF/CDN: basic rules—SQLi/XSS patterns; rate limit /v1/search per IP.

  • CORS: allow specific community frontends only; expose paramsHash header.

  • DDoS: CDN autoscale; edge rate limiting and token bucket per IP/API key.

  • Secrets: pinning‑service (PSA) tokens stored outside the repo; rotate quarterly.

  • Supply chain: cosign verify on deploy; pin images by digest.


12) Capacity Planning

CIN Node (per lang ~10k docs):

  • CPU: 2 vCPU; Mem: 4 GB; Disk: 20–50 GB (index + cache).

  • Throughput: ~200 RPS search on midsize VM with p95 < 400 ms.

  • Scale horizontally by language; isolate high‑traffic langs.

Pin Orchestrator:

  • CPU: 1 vCPU; Mem: 512 MB; ~20 concurrent jobs.

  • Scale with HPA on pin_jobs_inflight.
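
A sketch of that HPA, assuming pin_jobs_inflight is exposed as a pods metric through a custom‑metrics adapter (adapter setup not shown; thresholds illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pin-orchestrator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pin-orchestrator
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: pin_jobs_inflight
        target:
          type: AverageValue
          averageValue: "15"        # scale out above ~15 in-flight jobs per pod
```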

Watcher:

  • CPU: 0.5 vCPU; Mem: 256 MB; network egress modest.

Edge (Envoy):

  • Start 2 vCPU/2 GB; autoscale on upstream_rq_time and downstream_cx_active.


13) Runbooks (SRE Playbooks)

R1: Snapshot Stale

  • Alert: snapshot_age_seconds > 1800.

  • Steps:

    1. Check ingest_lag_seconds. If high, inspect RPC provider health.

    2. Check IPFS gateway errors; failover to alternate gateways.

    3. Manually trigger cin snapshot --lang <L>.

    4. If renderer mismatch errors rise, roll back renderer version.

R2: Elevated 5xx at Edge

  • Check Envoy upstream health; look for CIN pod OOM/CPU throttle.

  • Scale CIN; check DB/disk IOPS.

  • Enable read‑only mode banner on frontend if needed.

R3: Pinning Timeouts

  • Check PSA status pages; temporarily reduce requireAcks (DAO decision; staging only).

  • Reroute to alternate PSA; run orchestrator repair after recovery.

R4: Watcher Attest Failures

  • Inspect watcher logs for TLS and other errors; verify allowlist keys in the config.

  • Reissue signatures if epochId mismatch (misconfigured epoch length).

R5: Suspected Index Discrepancy Across Nodes

  • Use the SDK’s “quorum differences” endpoint (surfaced in the UI) to list mismatched CIDs.

  • Fetch snapshot roots and compare paramsHash; roll back the node with the divergent analyzer version.


14) Chaos & Load Testing

  • Chaos: kill CIN pods during snapshot; ensure recovery within 2m; inject 10% IPFS gateway errors.

  • Load: k6/Gatling script: GET /v1/search, ramping 0→300 RPS; assert the p95 latency SLO.

  • Network: simulate 500ms latency to RPC to test ingest backlog & SLOs.


15) Compliance & Retention

  • Logs: redact IPs by default; keep last 2 octets only if necessary for abuse controls.

  • PII: none collected by default; analytics opt‑in.

  • Retention: metrics 13 months (roll‑up), logs 90 days, traces 14 days.


16) Acceptance Criteria

  • One‑click deploy via Helm installs edge + apps + observability; all dashboards populate.

  • SLOs enforced with alerts and working escalation routes.

  • Blue/green or canary releases complete with automated rollback on SLO burn.

  • Secrets managed via sealed‑secrets or Vault; TLS auto‑renewing.

  • Runbooks validated in game days; DR drill passes with RTO ≤ 2h.


17) Deliverables

  • Helm charts: charts/cin-node, charts/pin-orchestrator, charts/watcher, charts/envoy-edge.

  • Grafana dashboards JSONs under ops/grafana/.

  • Alert rules under ops/prometheus/alerts.yaml.

  • Envoy configs under ops/envoy/.

  • Terraform examples for cloud infra (optional).

  • Runbooks under docs/ops/runbooks/.


18) Future Enhancements

  • Service mesh (mTLS & traffic policy) with Linkerd/Istio (optional).

  • Autoscaling by latency via custom metrics adapter.

  • Provenance attestation (SLSA) and SBOM in release artifacts.

  • TEE‑based remote attestation for builder binaries (advanced).
