DevOps & Observability
Containers, Kubernetes, Edge (HTTP/2/3 & gRPC), Metrics/Logs/Tracing, and SRE Playbooks
Status: Concept for v0 / v0.1
Last updated: YYYY‑MM‑DD
1) Purpose & Scope
This document defines how we run Crumpet components in production:
Workloads: CIN Nodes, Pinning Orchestrator, Watchers, Frontend, Authoring Desktop updater API (if any).
Edge: HTTP/2/3, gRPC, WebSocket termination (Envoy/Nginx).
Kubernetes: charts, autoscaling, secrets, backups.
Observability: Prometheus metrics, Loki logs, OpenTelemetry traces, Grafana dashboards.
Reliability: SLOs, alerts, playbooks, chaos & DR.
Security: TLS, mTLS, RBAC, network policies, rate‑limits, WAF.
Goal: A minimal, reproducible platform that any community can operate, with clear SLOs and fully auditable telemetry.
2) Reference Topology
```
Users ──▶ CDN/WAF ──▶ Envoy/Nginx (edge)
                          ├──▶ apps/web (Next.js) [SSR/ISR]
                          ├──▶ cin-node (HTTP/2 JSON + WS)
                          ├──▶ cin-grpc (gRPC control)
                          └──▶ pin-orchestrator (REST)
Sidecars: OpenTelemetry Collector (OTLP), Promtail
Shared: Prometheus, Loki, Grafana, Tempo/Jaeger, Alertmanager
Storage: IPFS Kubo (local), S3-compatible object store (snapshots/receipts), Postgres (optional)
```
Ports
Edge: 443 (HTTP/2 + HTTP/3/QUIC), WebSocket over H2.
CIN Node: 8080 (HTTP/2 JSON & WS), 8081 (gRPC), 4001/udp (libp2p QUIC).
Pin Orchestrator: 7070 (REST).
Kubo: 5001 (API), 4001 (P2P).
3) Containerization
Each component ships an image:

| Image | Purpose | Version |
|---|---|---|
| ghcr.io/crumpet/cin-node | Indexer & API | v0.1.0 |
| ghcr.io/crumpet/pin-orchestrator | Multi‑pin quorum | v0.1.0 |
| ghcr.io/crumpet/watcher | Uptime/committee attestations | v0.1.0 |
| ghcr.io/crumpet/web | Next.js dApp | v0.1.0 |
Dockerfile best practices
Distroless base, non‑root, read‑only FS (`readOnlyRootFilesystem: true` in the pod spec).
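A pod-spec fragment matching these practices, as a sketch (uid 65532 is distroless's `nonroot` user; the container name and writable `/tmp` mount are assumptions):

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 65532                  # distroless "nonroot" uid
  seccompProfile: { type: RuntimeDefault }
containers:
  - name: cin-node
    image: ghcr.io/crumpet/cin-node:v0.1.0
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities: { drop: ["ALL"] }
    volumeMounts:
      - name: tmp
        mountPath: /tmp             # writable scratch, since the root FS is read-only
volumes:
  - name: tmp
    emptyDir: {}
```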
4) Kubernetes (Helm) — Minimal Values
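For orientation, minimal values for `charts/cin-node` might look like the sketch below; field names are chart-specific assumptions, and sizing follows §12:

```yaml
image:
  repository: ghcr.io/crumpet/cin-node
  tag: v0.1.0
replicaCount: 2
resources:
  requests: { cpu: "2", memory: 4Gi }
persistence:
  enabled: true
  size: 50Gi            # index + cache (§12)
service:
  http: 8080            # HTTP/2 JSON + WS
  grpc: 8081
```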
NetworkPolicy (restrict lateral movement)
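A sketch restricting CIN ingress to the edge proxy only (the pod labels `app: cin-node` and `app: envoy-edge` are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cin-node-from-edge-only
spec:
  podSelector:
    matchLabels: { app: cin-node }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: envoy-edge }
      ports:
        - port: 8080   # HTTP/2 JSON + WS
        - port: 8081   # gRPC
```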
5) Edge & Protocols
5.1 Envoy (recommended; H2/H3, gRPC, WS)
envoy.yaml (excerpt)
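An illustrative excerpt; cluster names, upstream DNS names, and the omitted TLS `transport_socket` (see §6) are assumptions:

```yaml
static_resources:
  listeners:
    - name: edge
      address:
        socket_address: { address: 0.0.0.0, port_value: 443 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: edge
                codec_type: AUTO              # negotiate HTTP/1.1 or HTTP/2
                upgrade_configs:
                  - upgrade_type: websocket   # allow WS upgrades
                route_config:
                  virtual_hosts:
                    - name: default
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/grpc" }
                          route: { cluster: cin_grpc }
                        - match: { prefix: "/v1/" }
                          route: { cluster: cin_http }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: cin_http
      type: STRICT_DNS
      load_assignment:
        cluster_name: cin_http
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: cin-node, port_value: 8080 }
    - name: cin_grpc
      type: STRICT_DNS
      # gRPC upstreams must speak HTTP/2:
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: cin_grpc
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: cin-node, port_value: 8081 }
```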
gRPC‑Web
If some clients need it, add Envoy's gRPC‑Web filter (`envoy.filters.http.grpc_web`) on the `/grpc.*` routes.
5.2 HTTP/3 (QUIC)
Enable in Envoy/Cloud LB; keep HTTP/2 as fallback.
For CIN node gossip/data between nodes, prefer libp2p QUIC on a firewall‑opened UDP port with allowlists.
6) Secrets & TLS
Certificates: Let’s Encrypt via cert‑manager (ACME) or managed certs from cloud LB.
Secrets: Kubernetes Secrets + sealed‑secrets or external vault (HashiCorp Vault, AWS KMS).
mTLS (optional) between Envoy ↔ CIN gRPC for operator meshes.
Key rotation: semi‑annual for TLS; quarterly for watcher/orchestrator ed25519 keys (publish fingerprints).
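A cert-manager issuer sketch for the Let's Encrypt path above (the email and the `envoy` solver class are assumptions):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.org              # assumption: operator contact
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: envoy                # assumption: matches the edge ingress class
```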
7) Observability Stack
7.1 Metrics (Prometheus)
Scrape CIN (`/metrics`), Orchestrator, Watcher, Envoy, Node exporter.
Critical metrics (see dashboards §7.4):
CIN: `cin_query_latency_ms{ranker}`, `ingest_lag_seconds`, `snapshot_age_seconds{lang}`, `preview_mismatch_total`, `ipfs_fetch_ms`, `ws_clients`, `quorum_fail_total`.
Orchestrator: `pin_jobs_total{result}`, `pin_ack_count`, `backend_latency_ms{backend}`.
Watcher: `probe_success_total{node}`, `attest_*_submissions_total`.
Envoy: `http_downstream_rq_xx`, `upstream_rq_time`.
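A minimal scrape config consistent with the ports in §2 (in-cluster service DNS names and the watcher's metrics port are assumptions; Envoy's admin port exposes `/stats/prometheus`):

```yaml
scrape_configs:
  - job_name: cin-node
    metrics_path: /metrics
    static_configs:
      - targets: ["cin-node:8080"]
  - job_name: pin-orchestrator
    static_configs:
      - targets: ["pin-orchestrator:7070"]
  - job_name: watcher
    static_configs:
      - targets: ["watcher:9102"]       # metrics port is an assumption
  - job_name: envoy
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ["envoy-edge:9901"]    # Envoy admin port
```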
7.2 Logs (Loki + Promtail)
JSON logs; correlation id header `x‑req‑id`.
Retention: 14 days hot, 90 days cold (object storage).
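A Promtail excerpt for the JSON pipeline, as a sketch (the Loki service name and the `req_id` log field carrying the correlation id are assumptions; pod-log relabeling is omitted for brevity):

```yaml
server:
  http_listen_port: 9080
positions:
  filename: /run/promtail/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    # relabel_configs mapping pod metadata to log paths omitted for brevity
    pipeline_stages:
      - json:                 # parse structured JSON log lines
          expressions:
            level: level
            req_id: req_id    # assumption: services log x-req-id as `req_id`
      - labels:
          level:              # promote level to a Loki label for filtering
```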
7.3 Tracing (OpenTelemetry → Tempo/Jaeger)
OTLP exporter in services; parent span from edge → service → IPFS/Kasplex RPC.
Sample rate: 1% baseline; 100% for errors / long requests > 1s.
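One way to express this policy with the OpenTelemetry Collector's tail-sampling processor (the `tempo:4317` endpoint is an assumption):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors            # 100% of error traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow              # 100% of requests > 1s
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline               # 1% of everything else
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
exporters:
  otlp:
    endpoint: tempo:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

Policies are ORed: a trace kept by any policy is exported, yielding the 1% baseline plus full capture of errors and slow requests.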
7.4 Dashboards (Grafana)
CIN Overview: QPS, p50/p95 latency by ranker, ingest lag, snapshot staleness by lang, WS clients.
Orchestrator: job success rate, acks histogram, per‑backend latency/errors.
Watcher: probe pass rate, freshness compliance, chain tx status.
Edge: 2xx/4xx/5xx, backend saturation, H/2 streams, H/3 ratio.
Node Rewards view: epoch timers, attestation volumes (pulled from on‑chain subgraph or direct RPC).
8) SLOs & Alerts
Service Level Objectives (per public node):
Availability: 99.9% monthly (HTTP 200 on `/v1/health`).
Latency (read): p95 ≤ 400 ms for `/v1/search` (size ≤ 20) at nominal load.
Freshness: `snapshot_age_seconds{lang}` ≤ 900 for primary languages.
WebSocket uptime: > 99% of clients stay connected ≥ 1h without server‑initiated drops.
Alerting (Alertmanager):
P1: HTTP 5xx > 5% for 10m, or p95 latency > 1.5s for 15m.
P1: `snapshot_age_seconds > 1800` (stale index).
P1: Orchestrator `pin_jobs_total{result="timeout"} / pin_jobs_total > 0.2` for 15m.
P2: Watcher probe success rate < 80% for 30m.
P2: Disk usage > 80% on index PVC.
P3: Cert expiry < 7 days.
Notification routes: PagerDuty/On‑call rotation for P1/P2; Slack/email for P3.
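Expressed as Prometheus rules, the first three P1 alerts might look like the following sketch (metric names follow §7.1; `envoy_response_code_class` is Envoy's standard status-class label):

```yaml
groups:
  - name: crumpet-p1
    rules:
      - alert: EdgeHighErrorRate
        expr: |
          sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m]))
            / sum(rate(envoy_http_downstream_rq_xx[5m])) > 0.05
        for: 10m
        labels: { severity: P1 }
      - alert: SnapshotStale
        expr: max by (lang) (snapshot_age_seconds) > 1800
        for: 5m
        labels: { severity: P1 }
      - alert: PinTimeoutRatio
        expr: |
          sum(rate(pin_jobs_total{result="timeout"}[15m]))
            / sum(rate(pin_jobs_total[15m])) > 0.2
        for: 15m
        labels: { severity: P1 }
```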
9) CI/CD & Releases
Branching
`main`: protected, release‑ready.
`develop`: integration branch.
Feature branches → PR → required checks (lint, tests, e2e).
Pipelines
Build containers (SBOM + cosign signature).
Unit + integration + e2e suites (docker‑compose for CIN trio with mock RPC/IPFS).
Security: Trivy scan, SAST, dependency audit.
Tagging: `vMAJOR.MINOR.PATCH`.
Release: Helm chart version bump, changelog, GitHub Release.
Deploy
Staging: canary 10% traffic for 1h; observe SLOs; auto‑rollback on error‑budget burn.
Prod: progressive rollout (Argo Rollouts); blue/green for Envoy if config‑heavy.
Config via ConfigMap/Secret; immutable images.
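A sketch of the staging canary with Argo Rollouts (labels and replica counts are assumptions; automated rollback would attach an AnalysisTemplate watching the SLO metrics to the pause step):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: cin-node
spec:
  replicas: 4
  selector:
    matchLabels: { app: cin-node }
  template:
    metadata:
      labels: { app: cin-node }
    spec:
      containers:
        - name: cin-node
          image: ghcr.io/crumpet/cin-node:v0.1.0
  strategy:
    canary:
      steps:
        - setWeight: 10             # 10% of traffic to the canary
        - pause: { duration: 1h }   # observe SLOs before promoting
        - setWeight: 100
```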
10) Backups & DR
Indices (Tantivy): rebuildable from chain + IPFS, but snapshot nightly to S3 for fast restore (optional; see the CronJob sketch after this list).
Config & receipts: back up Secrets/ConfigMaps and S3 buckets daily (versioned).
IPFS pins: rely on ≥3 independent providers; monitor with orchestrator repair.
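A CronJob sketch for the nightly index snapshot backup. The `cin snapshot-backup` subcommand and the bucket URL are assumptions; this document only confirms a `cin snapshot` command (§13 R1):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: index-snapshot-backup
spec:
  schedule: "0 3 * * *"             # nightly, 03:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: ghcr.io/crumpet/cin-node:v0.1.0
              # hypothetical subcommand and destination:
              command: ["cin", "snapshot-backup", "--dest", "s3://crumpet-backups/index/"]
```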
DR runbook:
Scale new region from IaC (Terraform).
Restore Helm releases (same versions).
Bootstrap IPFS peers; warm caches for last 7 days.
Re‑ingest chain from checkpoint; switch DNS.
Targets: RTO ≤ 2h; RPO ≤ 1h.
11) Security Hardening
RBAC: least privilege ServiceAccounts; no cluster‑admin pods.
PSP/PSA: `runAsNonRoot`, `readOnlyRootFilesystem`, drop capabilities.
NetworkPolicies: only Envoy may talk to CIN HTTP/gRPC; CIN egress limited to RPC endpoints, IPFS gateways, known peers.
WAF/CDN: basic rules (SQLi/XSS patterns); rate‑limit `/v1/search` per IP.
CORS: allow specific community frontends only; expose the `paramsHash` header.
DDoS: CDN autoscale; edge rate limiting and token bucket per IP/API key.
Secrets: PSA tokens stored outside repo; rotate quarterly.
Supply chain: cosign verify on deploy; pin images by digest.
12) Capacity Planning
CIN Node (per lang ~10k docs):
CPU: 2 vCPU; Mem: 4 GB; Disk: 20–50 GB (index + cache).
Throughput: ~200 RPS search on a midsize VM with p95 < 400 ms.
Scale horizontally by language; isolate high‑traffic langs.
Pin Orchestrator:
CPU: 1 vCPU; Mem: 512 MB; ~20 concurrent jobs.
Scale with HPA on `pin_jobs_inflight` (sketch below).
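A v2 HPA sketch; it assumes a custom-metrics adapter (e.g. prometheus-adapter) exposing `pin_jobs_inflight` per pod, and the target value is an assumption sized against the ~20-job capacity above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pin-orchestrator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pin-orchestrator
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: pin_jobs_inflight
        target:
          type: AverageValue
          averageValue: "15"   # scale out before the ~20-job ceiling
```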
Watcher:
CPU: 0.5 vCPU; Mem: 256 MB; network egress modest.
Edge (Envoy):
Start 2 vCPU/2 GB; autoscale on `upstream_rq_time` and `downstream_cx_active`.
13) Runbooks (SRE Playbooks)
R1: Snapshot Stale
Alert: `snapshot_age_seconds > 1800`.
Steps:
Check `ingest_lag_seconds`. If high, inspect RPC provider health.
Check IPFS gateway errors; fail over to alternate gateways.
Manually trigger `cin snapshot --lang <L>`.
If renderer mismatch errors rise, roll back the renderer version.
R2: Elevated 5xx at Edge
Check Envoy upstream health; look for CIN pod OOM/CPU throttle.
Scale CIN; check DB/disk IOPS.
Enable read‑only mode banner on frontend if needed.
R3: Pinning Timeouts
Check PSA status pages; reduce `requireAcks` temporarily (DAO decision) only for staging.
Reroute to an alternate PSA; run orchestrator repair after recovery.
R4: Watcher Attest Failures
Inspect watcher logs for TLS errors; verify allowlist keys in the config.
Reissue signatures if `epochId` mismatches (misconfigured epoch length).
R5: Suspected Index Discrepancy Across Nodes
Use the SDK's “quorum differences” endpoint in the UI to list mismatched CIDs.
Fetch snapshot roots and compare `paramsHash`; roll back the node with a divergent analyzer version.
14) Chaos & Load Testing
Chaos: kill CIN pods during snapshot; ensure recovery within 2m; inject 10% IPFS gateway errors.
Load: k6/Gatling script: `GET /v1/search` at ramp 0→300 RPS; assert p95.
Network: simulate 500 ms latency to RPC to test ingest backlog & SLOs.
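One way to script the pod-kill experiment, using Chaos Mesh as an example tool (an assumption; any fault injector works, and the `app: cin-node` label is hypothetical):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-cin-during-snapshot
spec:
  action: pod-kill          # terminate a pod abruptly
  mode: one                 # kill a single random matching pod
  selector:
    labelSelectors:
      app: cin-node
```

Run it while a snapshot is in progress and verify the recovery target (back to healthy within 2m) on the CIN Overview dashboard.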
15) Compliance & Retention
Logs: redact IPs by default; keep last 2 octets only if necessary for abuse controls.
PII: none collected by default; analytics opt‑in.
Retention: metrics 13 months (roll‑up), logs 90 days, traces 14 days.
16) Acceptance Criteria
One‑click deploy via Helm installs edge + apps + observability; all dashboards populate.
SLOs enforced with alerts and working escalation routes.
Blue/green or canary releases complete with automated rollback on SLO burn.
Secrets managed via sealed‑secrets or Vault; TLS auto‑renewing.
Runbooks validated in game days; DR drill passes with RTO ≤ 2h.
17) Deliverables
Helm charts: `charts/cin-node`, `charts/pin-orchestrator`, `charts/watcher`, `charts/envoy-edge`.
Grafana dashboard JSONs under `ops/grafana/`.
Alert rules under `ops/prometheus/alerts.yaml`.
Envoy configs under `ops/envoy/`.
Terraform examples for cloud infra (optional).
Runbooks under `docs/ops/runbooks/`.
18) Future Enhancements
Service mesh (mTLS & traffic policy) with Linkerd/Istio (optional).
Autoscaling by latency via custom metrics adapter.
Provenance attestation (SLSA) and SBOM in release artifacts.
TEE‑based remote attestation for builder binaries (advanced).