DevOps & Observability

Containers, Kubernetes, Edge (HTTP/2/3 & gRPC), Metrics/Logs/Tracing, and SRE Playbooks

Status: Concept for v0 / v0.1 · Last updated: YYYY‑MM‑DD


1) Purpose & Scope

This document defines how we run Crumpet components in production:

  • Workloads: CIN Nodes, Pinning Orchestrator, Watchers, Frontend, Authoring Desktop updater API (if any).

  • Edge: HTTP/2/3, gRPC, WebSocket termination (Envoy/Nginx).

  • Kubernetes: charts, autoscaling, secrets, backups.

  • Observability: Prometheus metrics, Loki logs, OpenTelemetry traces, Grafana dashboards.

  • Reliability: SLOs, alerts, playbooks, chaos & DR.

  • Security: TLS, mTLS, RBAC, network policies, rate‑limits, WAF.

Goal: A minimal, reproducible platform that any community can operate, with clear SLOs and fully auditable telemetry.


2) Reference Topology

```
Users ──▶ CDN/WAF ──▶ Envoy/Nginx (edge)
                    ├──▶ apps/web (Next.js) [SSR/ISR]
                    ├──▶ cin-node (HTTP/2 JSON + WS)
                    ├──▶ cin-grpc (gRPC control)
                    └──▶ pin-orchestrator (REST)
Sidecars: OpenTelemetry Collector (OTLP), Promtail
Shared: Prometheus, Loki, Grafana, Tempo/Jaeger, Alertmanager
Storage: IPFS Kubo (local), S3-compatible object store (snapshots/receipts), Postgres (optional)
```

Ports

  • Edge: 443 (HTTP/2 + HTTP/3/QUIC), WebSocket over H2.

  • CIN Node: 8080 (HTTP/2 JSON & WS), 8081 (gRPC), 4001/udp (libp2p QUIC).

  • Pin Orchestrator: 7070 (REST).

  • Kubo: 5001 (API), 4001 (P2P).


3) Containerization

Each component ships an image:

| Image | Purpose | Example tag |
| --- | --- | --- |
| ghcr.io/crumpet/cin-node | Indexer & API | v0.1.0 |
| ghcr.io/crumpet/pin-orchestrator | Multi‑pin quorum | v0.1.0 |
| ghcr.io/crumpet/watcher | Uptime/committee attestations | v0.1.0 |
| ghcr.io/crumpet/web | Next.js dApp | v0.1.0 |

Dockerfile best practices

  • Distroless base image, run as non‑root, read‑only root filesystem (readOnlyRootFilesystem: true in the pod securityContext); see the sketch below.
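
A minimal sketch of how that guidance lands in a pod spec, assuming a Deployment for cin-node (all names and resource values are illustrative):

```yaml
# Illustrative hardening sketch for cin-node: non-root user, read-only root FS,
# no privilege escalation, writable scratch space via an emptyDir.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cin-node
spec:
  replicas: 2
  selector:
    matchLabels: { app: cin-node }
  template:
    metadata:
      labels: { app: cin-node }
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
      containers:
        - name: cin-node
          image: ghcr.io/crumpet/cin-node:v0.1.0
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp        # only writable path
      volumes:
        - name: tmp
          emptyDir: {}
```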


4) Kubernetes (Helm) — Minimal Values
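
The chart values themselves are not reproduced here; as a rough sketch, a minimal values.yaml for charts/cin-node (§17) could look like the following. All key names and numbers are assumptions, with ports taken from §2 and sizing from §12:

```yaml
# Illustrative minimal values.yaml for charts/cin-node (key names assumed).
image:
  repository: ghcr.io/crumpet/cin-node
  tag: v0.1.0
  pullPolicy: IfNotPresent

replicaCount: 2

resources:
  requests: { cpu: "1", memory: 2Gi }
  limits:   { cpu: "2", memory: 4Gi }

service:
  http: 8080   # HTTP/2 JSON + WS
  grpc: 8081   # gRPC control
  p2p:  4001   # libp2p QUIC (UDP)

persistence:
  enabled: true
  size: 50Gi   # index + cache (see §12)

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70

metrics:
  serviceMonitor: true   # Prometheus scrape of /metrics
```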

NetworkPolicy (restrict lateral movement)
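
A sketch of the policy described here, assuming cin-node pods carry the label app: cin-node and the Envoy edge runs in a namespace called edge with label app: envoy-edge (all selectors are assumptions):

```yaml
# Sketch: only Envoy edge pods may reach cin-node's HTTP/gRPC ports (see §11).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cin-node-ingress
spec:
  podSelector:
    matchLabels:
      app: cin-node
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: edge
          podSelector:
            matchLabels:
              app: envoy-edge
      ports:
        - protocol: TCP
          port: 8080   # HTTP/2 JSON + WS
        - protocol: TCP
          port: 8081   # gRPC
```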


5) Edge & Protocols

5.1 HTTP/2 & gRPC (Envoy)

envoy.yaml (excerpt)
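
The full config ships under ops/envoy/ (§17); the following is an illustrative excerpt only, showing one HTTPS listener with WebSocket upgrades and a prefix split between gRPC, the CIN JSON API, and the web app. TLS transport sockets and the cin-node/web clusters are omitted, and all names are assumptions:

```yaml
static_resources:
  listeners:
    - name: https
      address:
        socket_address: { address: 0.0.0.0, port_value: 443 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: edge
                codec_type: AUTO                 # negotiate HTTP/1.1 and HTTP/2
                upgrade_configs:
                  - upgrade_type: websocket      # WS termination at the edge
                route_config:
                  virtual_hosts:
                    - name: crumpet
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/grpc." }   # gRPC control plane
                          route: { cluster: cin-grpc }
                        - match: { prefix: "/v1/" }     # CIN JSON + WS API
                          route: { cluster: cin-node }
                        - match: { prefix: "/" }        # Next.js frontend
                          route: { cluster: web }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: cin-grpc                             # cin-node and web follow the same pattern
      type: STRICT_DNS
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}           # speak HTTP/2 to the gRPC upstream
      load_assignment:
        cluster_name: cin-grpc
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: cin-node, port_value: 8081 }
```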

gRPC‑Web: if some clients need it, add Envoy's gRPC‑Web filter on the /grpc.* routes.

5.2 HTTP/3 (QUIC)

  • Enable in Envoy/Cloud LB; keep HTTP/2 as fallback.

  • For CIN node gossip/data between nodes, prefer libp2p QUIC on a firewall‑opened UDP port with allowlists.


6) Secrets & TLS

  • Certificates: Let’s Encrypt via cert‑manager (ACME) or managed certs from the cloud LB (see the Certificate sketch after this list).

  • Secrets: Kubernetes Secrets + sealed‑secrets or external vault (HashiCorp Vault, AWS KMS).

  • mTLS (optional) between Envoy ↔ CIN gRPC for operator meshes.

  • Key rotation: semi‑annual for TLS; quarterly for watcher/orchestrator ed25519 keys (publish fingerprints).
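
To illustrate the cert‑manager option above, a Certificate resource might look like this (issuer name, namespace, and hostname are assumptions; the referenced ClusterIssuer would be configured for Let’s Encrypt ACME):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: edge-tls
  namespace: edge
spec:
  secretName: edge-tls            # consumed by the Envoy/ingress TLS config
  dnsNames:
    - cin.example.org             # placeholder hostname
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  renewBefore: 720h               # start renewal ~30 days before expiry
```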


7) Observability Stack

7.1 Metrics (Prometheus)

  • Scrape CIN (/metrics), Orchestrator, Watcher, Envoy, and node exporter (a minimal scrape config sketch follows this list).

  • Critical metrics (see dashboards §7.4):

    • cin_query_latency_ms{ranker}, ingest_lag_seconds, snapshot_age_seconds{lang}, preview_mismatch_total, ipfs_fetch_ms, ws_clients, quorum_fail_total.

    • Orchestrator: pin_jobs_total{result}, pin_ack_count, backend_latency_ms{backend}.

    • Watcher: probe_success_total{node}, attest_*_submissions_total.

    • Envoy: http_downstream_rq_xx, upstream_rq_time.
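
A minimal static scrape config for the targets above might look like the following (job names, ports, and targets are assumptions; in‑cluster setups would normally use ServiceMonitors instead):

```yaml
scrape_configs:
  - job_name: cin-node
    metrics_path: /metrics
    static_configs:
      - targets: ["cin-node:8080"]
  - job_name: pin-orchestrator
    static_configs:
      - targets: ["pin-orchestrator:7070"]
  - job_name: watcher
    static_configs:
      - targets: ["watcher:9100"]        # port assumed
  - job_name: envoy
    metrics_path: /stats/prometheus      # Envoy admin endpoint
    static_configs:
      - targets: ["envoy-edge:9901"]     # admin port assumed
```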

7.2 Logs (Loki + Promtail)

  • JSON logs; correlation id header x‑req‑id.

  • Retention: 14 days hot, 90 days cold (object storage).

7.3 Tracing (OpenTelemetry → Tempo/Jaeger)

  • OTLP exporter in services; parent span from edge → service → IPFS/Kasplex RPC.

  • Sample rate: 1% baseline; 100% for errors / long requests > 1s.
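
One way to realize that policy is the OpenTelemetry Collector's tail‑sampling processor; a sketch (pipeline wiring omitted, policy names illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s              # buffer spans before the sampling decision
    policies:
      - name: errors                # keep 100% of error traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests         # keep 100% of traces slower than 1s
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline              # 1% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```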

7.4 Dashboards (Grafana)

  • CIN Overview: QPS, p50/p95 latency by ranker, ingest lag, snapshot staleness by lang, WS clients.

  • Orchestrator: job success rate, acks histogram, per‑backend latency/errors.

  • Watcher: probe pass rate, freshness compliance, chain tx status.

  • Edge: 2xx/4xx/5xx, backend saturation, HTTP/2 streams, HTTP/3 ratio.

  • Node Rewards view: epoch timers, attestation volumes (pulled from on‑chain subgraph or direct RPC).


8) SLOs & Alerts

Service Level Objectives (per public node):

  • Availability: 99.9% monthly (HTTP 200 on /v1/health).

  • Latency (read): p95 ≤ 400 ms for /v1/search (size≤20) at nominal load.

  • Freshness: snapshot_age_seconds{lang} ≤ 900 for primary languages.

  • WebSocket uptime: > 99% of clients remain connected ≥ 1 h without server‑initiated drops.

Alerting (Alertmanager):

  • P1: HTTP 5xx > 5% for 10m, or p95 latency > 1.5s for 15m.

  • P1: snapshot_age_seconds > 1800 (stale index); see the example rule after this list.

  • P1: Orchestrator pin_jobs_total{result="timeout"} / pin_jobs_total > 0.2 for 15m.

  • P2: Watcher probe success rate < 80% for 30m.

  • P2: Disk usage > 80% on index PVC.

  • P3: Cert expiry < 7 days.
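
As an illustration, the stale‑index alert above could be expressed in ops/prometheus/alerts.yaml roughly as follows (group name, for: window, and runbook path are assumptions):

```yaml
groups:
  - name: crumpet-cin
    rules:
      - alert: SnapshotStale
        expr: max by (lang) (snapshot_age_seconds) > 1800
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "CIN snapshot stale for lang {{ $labels.lang }}"
          runbook: "docs/ops/runbooks/R1-snapshot-stale.md"   # filename assumed
```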

Notification routes: PagerDuty/On‑call rotation for P1/P2; Slack/email for P3.


9) CI/CD & Releases

Branching

  • main: protected, release‑ready.

  • develop: integration branch.

  • Feature branches → PR → required checks (lint, tests, e2e).

Pipelines

  • Build containers (SBOM + cosign signature).

  • Unit + integration + e2e suites (docker‑compose for CIN trio with mock RPC/IPFS).

  • Security: Trivy scan, SAST, dependency audit.

  • Tagging: vMAJOR.MINOR.PATCH.

  • Release: Helm chart version bump, changelog, GitHub Release.

Deploy

  • Staging: canary at 10% traffic for 1 h; observe SLOs; auto‑roll back on error‑budget burn.

  • Prod: progressive rollout (Argo Rollouts; see the Rollout sketch after this list); blue/green for Envoy if config‑heavy.

  • Config via ConfigMap/Secret; immutable images.
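
A sketch of that progressive rollout using an Argo Rollouts canary that mirrors the staging policy (10% for 1 h, then full promotion). Resource names are assumptions, and automated rollback on SLO burn would additionally require an AnalysisTemplate (not shown):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: cin-node
spec:
  replicas: 4
  selector:
    matchLabels: { app: cin-node }
  template:
    metadata:
      labels: { app: cin-node }
    spec:
      containers:
        - name: cin-node
          image: ghcr.io/crumpet/cin-node:v0.1.0
  strategy:
    canary:
      steps:
        - setWeight: 10              # shift ~10% toward the canary (exact weighting needs trafficRouting, omitted)
        - pause: { duration: 1h }    # observe SLOs / error budget before promoting
      # after the final step the Rollout promotes to 100% automatically
```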


10) Backups & DR

  • Indices (Tantivy): rebuildable from chain + IPFS, but snapshot nightly to S3 for fast restore (optional).

  • Config & receipts: back up Secrets/ConfigMaps and S3 buckets daily (versioned).

  • IPFS pins: rely on ≥3 independent providers; monitor with orchestrator repair.

  • DR runbook:

    1. Scale new region from IaC (Terraform).

    2. Restore Helm releases (same versions).

    3. Bootstrap IPFS peers; warm caches for last 7 days.

    4. Re‑ingest chain from checkpoint; switch DNS.

  • Targets: RTO ≤ 2 h; RPO ≤ 1 h.


11) Security Hardening

  • RBAC: least privilege ServiceAccounts; no cluster‑admin pods.

  • Pod Security (PSA, successor to PSP): runAsNonRoot, readOnlyRootFilesystem, drop capabilities.

  • NetworkPolicies: only Envoy may talk to CIN HTTP/GRPC; CIN egress limited to RPC endpoints, IPFS gateways, known peers.

  • WAF/CDN: basic rules—SQLi/XSS patterns; rate limit /v1/search per IP.

  • CORS: allow specific community frontends only; expose paramsHash header.

  • DDoS: CDN autoscale; edge rate limiting and token bucket per IP/API key.

  • Secrets: pinning‑service (PSA) tokens stored outside the repo; rotate quarterly.

  • Supply chain: cosign verify on deploy; pin images by digest.


12) Capacity Planning

CIN Node (per lang ~10k docs):

  • CPU: 2 vCPU; Mem: 4 GB; Disk: 20–50 GB (index + cache).

  • Throughput: ~200 RPS search on midsize VM with p95 < 400 ms.

  • Scale horizontally by language; isolate high‑traffic langs.

Pin Orchestrator:

  • CPU: 1 vCPU; Mem: 512 MB; ~20 concurrent jobs.

  • Scale with HPA on pin_jobs_inflight.
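
A sketch of that HPA, assuming pin_jobs_inflight is exposed as a pods metric through a custom‑metrics adapter (adapter setup not shown; thresholds illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pin-orchestrator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pin-orchestrator
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: pin_jobs_inflight
        target:
          type: AverageValue
          averageValue: "15"        # scale out above ~15 in-flight jobs per pod
```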

Watcher:

  • CPU: 0.5 vCPU; Mem: 256 MB; network egress modest.

Edge (Envoy):

  • Start 2 vCPU/2 GB; autoscale on upstream_rq_time and downstream_cx_active.


13) Runbooks (SRE Playbooks)

R1: Snapshot Stale

  • Alert: snapshot_age_seconds > 1800.

  • Steps:

    1. Check ingest_lag_seconds. If high, inspect RPC provider health.

    2. Check IPFS gateway errors; failover to alternate gateways.

    3. Manually trigger cin snapshot --lang <L>.

    4. If renderer mismatch errors rise, roll back renderer version.

R2: Elevated 5xx at Edge

  • Check Envoy upstream health; look for CIN pod OOM/CPU throttle.

  • Scale CIN; check DB/disk IOPS.

  • Enable read‑only mode banner on frontend if needed.

R3: Pinning Timeouts

  • Check PSA status pages; temporarily reduce requireAcks (DAO decision; staging only).

  • Reroute to alternate PSA; run orchestrator repair after recovery.

R4: Watcher Attest Failures

  • Inspect watcher logs for TLS and other errors; verify allowlist keys in the config.

  • Reissue signatures if epochId mismatch (misconfigured epoch length).

R5: Suspected Index Discrepancy Across Nodes

  • Use the SDK’s “quorum differences” endpoint (surfaced in the UI) to list mismatched CIDs.

  • Fetch snapshot roots and compare paramsHash; roll back the node with the divergent analyzer version.


14) Chaos & Load Testing

  • Chaos: kill CIN pods during snapshot; ensure recovery within 2m; inject 10% IPFS gateway errors.

  • Load: k6/Gatling script: GET /v1/search, ramping 0→300 RPS; assert the p95 latency SLO.

  • Network: simulate 500ms latency to RPC to test ingest backlog & SLOs.


15) Compliance & Retention

  • Logs: redact IPs by default; keep last 2 octets only if necessary for abuse controls.

  • PII: none collected by default; analytics opt‑in.

  • Retention: metrics 13 months (roll‑up), logs 90 days, traces 14 days.


16) Acceptance Criteria

  • One‑click deploy via Helm installs edge + apps + observability; all dashboards populate.

  • SLOs enforced with alerts and working escalation routes.

  • Blue/green or canary releases complete with automated rollback on SLO burn.

  • Secrets managed via sealed‑secrets or Vault; TLS auto‑renewing.

  • Runbooks validated in game days; DR drill passes with RTO ≤ 2h.


17) Deliverables

  • Helm charts: charts/cin-node, charts/pin-orchestrator, charts/watcher, charts/envoy-edge.

  • Grafana dashboards JSONs under ops/grafana/.

  • Alert rules under ops/prometheus/alerts.yaml.

  • Envoy configs under ops/envoy/.

  • Terraform examples for cloud infra (optional).

  • Runbooks under docs/ops/runbooks/.


18) Future Enhancements

  • Service mesh (mTLS & traffic policy) with Linkerd/Istio (optional).

  • Autoscaling by latency via custom metrics adapter.

  • Provenance attestation (SLSA) and SBOM in release artifacts.

  • TEE‑based remote attestation for builder binaries (advanced).
