CIN Node (Layer‑3)

Status: Finalized for v0 / v0.1
Owners: Node Operators WG · Rust Team · DevOps
Last updated: YYYY‑MM‑DD


1) Purpose & Scope

The Crumpet Indexer Node (CIN) is the Layer‑3 daemon that:

  • Ingests on‑chain events (publish, actions, disputes).

  • Fetches & verifies AOM packages (IPFS).

  • Renders & normalizes content deterministically.

  • Indexes documents per language.

  • Publishes snapshots (Merkle‑rooted) and participates in federation.

  • Serves queries over HTTP/2 + JSON and WebSocket streams, with gRPC for node‑to‑node control.

ELI5: Each node is a librarian: it watches the blockchain, grabs the boxes (articles), checks them, files them by language, and answers your searches. Librarians compare notes so nobody can hide books.

Out of scope: on‑chain token logic (contracts), frontend UI design, DAO governance.


2) Responsibilities & Non‑Goals

Responsibilities

  • Reliable event ingestion with reorg handling.

  • AOM validation (hashes, layout, renderer policy).

  • Deterministic rendering (Rust/WASM) when preview.html is missing or untrusted.

  • Per‑language Tantivy indexes with analyzers (ICU/CJK/RTL).

  • Snapshot creation, signing, gossip, and cross‑validation.

  • Low‑latency query API & live updates.

  • Optional on‑chain announcements (Node Rewards “build” credits).

Non‑Goals

  • Global consensus or finality (we verify and cross‑check).

  • Per‑query cryptographic proofs (future v1).

  • Moderation decisions (we display labels, never delete content).


3) Architecture Overview

Language: Rust (stable), with the renderer embedded as WASM.

Core crates (workspace):

  • aom/ — AOM parse, CAR/UnixFS, hashing & constraints.

  • renderer/ — Markdown→HTML, sanitizer, deterministic serializer (WASM/native).

  • indexer/ — Tantivy integration, schema, analyzers.

  • node/ — services: ingest, verifier, snapshotter, federation, APIs.

  • proto/ — gRPC definitions (tonic).

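Data flow (simplified): chain events → ingest → fetch & verify AOM → render & normalize → index per language → snapshot & federation → query APIs.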


4) Connectivity & Transports

  • Node↔Node control: gRPC over HTTP/2 (tonic, TLS), for health, snapshot metadata, attestations.

  • Node↔Node gossip: libp2p (Noise) over QUIC for snapshot announcements & peer discovery.

  • Node↔Node data: HTTP/3 (QUIC) for snapshot bundles (falls back to HTTP/2).

  • Node↔UI:

    • HTTP/2 JSON for queries (/v1/search, /v1/article, etc.).

    • WebSocket for live events (/v1/ws/events).

    • Optional gRPC‑Web (via Envoy) if clients prefer RPC.

Ports (default):

  • 8080 — HTTP/2 (queries & health), upgrades to WS

  • 8081 — gRPC control (mTLS capable)

  • 4001 — libp2p QUIC (gossip)

  • 5001 — local IPFS API (external dependency)


5) Document Model (Index Schema)

Tantivy schema fields (per CID/edition):

  • cid (bytes) — primary key (multihash digest).

  • first_cid (bytes) — lineage root.

  • version (u32).

  • author (bytes) — 20‑byte address.

  • created_at (i64) — seconds since epoch.

  • lang (string) — BCP‑47.

  • title (text, analyzed).

  • subtitle (text, analyzed).

  • tags (string array, keyword filterable).

  • body (text, analyzed) — rendered text content (sanitized).

  • status (u8) — 0: normal, 1: under dispute, 2: action taken (label present).

  • score_up (i64), score_down (i64), score_net (i64) — from on‑chain.

  • params_hash (bytes32) — Config snapshot hash at index time.
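The status field is a small closed set, so it maps naturally onto an enum. The following is a minimal sketch of that mapping; the names DocStatus and from_u8 are hypothetical and do not come from the CIN codebase.

```rust
/// Dispute/label status as stored in the `status` field (u8).
/// Type and variant names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocStatus {
    Normal = 0,
    UnderDispute = 1,
    ActionTaken = 2,
}

impl DocStatus {
    /// Decodes the stored byte; unknown values yield `None` instead of panicking.
    pub fn from_u8(raw: u8) -> Option<DocStatus> {
        match raw {
            0 => Some(DocStatus::Normal),
            1 => Some(DocStatus::UnderDispute),
            2 => Some(DocStatus::ActionTaken),
            _ => None,
        }
    }
}
```

Returning None for unknown bytes lets a reader of an index written by a newer version skip or flag the document rather than abort.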

Analyzers

  • Latin languages: ICU tokenizer + stemming (where applicable).

  • CJK: bigram tokenizer.

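  • RTL and complex scripts (AR/Hindi): ICU word break. Note that Hindi is written left‑to‑right; it is grouped here because it uses the same ICU word‑break path.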

  • Per‑language analyzer selected by lang. Unknown lang → not indexed in text fields but still stored (metadata retrievable).
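To make the analyzer rules above concrete, here is a rough sketch of selection by lang plus a character-bigram tokenizer. The language lists in select_analyzer are illustrative assumptions, not the node's real table, and the real indexer would register Tantivy tokenizers rather than use these hypothetical functions. The bigram function drops whitespace, so bigrams may span word gaps in this simplified version.

```rust
/// Analyzer families named in the spec. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Analyzer {
    IcuStemming,  // Latin-script languages
    CjkBigram,    // Chinese, Japanese, Korean
    IcuWordBreak, // Arabic, Hindi
}

/// Picks an analyzer from the primary subtag of a BCP-47 tag.
/// `None` means: keep the stored metadata, skip text indexing.
/// The language lists are assumptions made for this sketch.
pub fn select_analyzer(lang: &str) -> Option<Analyzer> {
    let primary = lang.split('-').next().unwrap_or("").to_ascii_lowercase();
    match primary.as_str() {
        "en" | "fr" | "de" | "es" => Some(Analyzer::IcuStemming),
        "zh" | "ja" | "ko" => Some(Analyzer::CjkBigram),
        "ar" | "hi" => Some(Analyzer::IcuWordBreak),
        _ => None,
    }
}

/// Overlapping character bigrams; a single character yields itself.
pub fn cjk_bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    if chars.len() < 2 {
        return chars.iter().map(|c| c.to_string()).collect();
    }
    chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

For example, cjk_bigrams("東京都") yields the two tokens 東京 and 京都, which is why a bigram index can match either substring without a dictionary.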


6) Ingestion & Verification

On‑chain sources

  • ArticlePublished(author, cid, version, previousCid, createdAt)

  • Upvoted/Downvoted/Flagged (Actions)

  • DisputeOpened/Resolved (Disputes)

Pipeline

  1. Subscribe to RPC logs from a configured start time (supports backfill).

  2. Finality buffer: delay inclusion in a final snapshot by --publish-finality-seconds (default 60s).

  3. Fetch AOM package (CID) from: local IPFS → peers’ HTTP caches → multi‑gateway race.

  4. Verify:

    • Manifest schema (DAG‑CBOR), hashes.bodyMd, hashes.assets.

    • If preview.html is present, re‑render body.md and compare the hashes; otherwise render body.md.

    • Images: all under /media, within size limits, count ≤ 10.

  5. Normalize: extract text, compute reading metrics, compute params_hash (from Config at ingestion).

  6. Index into the language segment.

  7. Track on‑chain score deltas and dispute status; update document fields.
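Two of the pipeline checks above are simple enough to sketch directly: the image constraints from step 4 and the finality buffer from step 2. This is an illustration under assumptions: the type and function names are hypothetical, paths are assumed to be package-absolute strings starting with /media/, and the byte limit is a parameter because the spec does not state a value.

```rust
/// Limits applied to a package's images. The spec fixes the count (<= 10)
/// and the /media location; `max_bytes` is an assumed, configurable value.
pub struct ImagePolicy {
    pub max_count: usize,
    pub max_bytes: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ImageError {
    TooMany(usize),
    OutsideMedia(String),
    TooLarge(String),
}

/// `images` is a list of (path inside the package, size in bytes).
pub fn check_images(images: &[(&str, u64)], policy: &ImagePolicy) -> Result<(), ImageError> {
    if images.len() > policy.max_count {
        return Err(ImageError::TooMany(images.len()));
    }
    for &(path, size) in images {
        // Reject anything outside /media, including "../" escapes.
        if !path.starts_with("/media/") || path.contains("..") {
            return Err(ImageError::OutsideMedia(path.to_string()));
        }
        if size > policy.max_bytes {
            return Err(ImageError::TooLarge(path.to_string()));
        }
    }
    Ok(())
}

/// Finality buffer: an event may enter a final snapshot only once it is at
/// least `finality_secs` old (the value of --publish-finality-seconds).
pub fn is_final(event_time: i64, now: i64, finality_secs: i64) -> bool {
    now - event_time >= finality_secs
}
```

With the default of 60 seconds, an event stamped at t = 100 becomes eligible at t = 160 and not before.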

Reorg handling

  • Maintain an append‑only event log with chain height/time and a compact doc state.

  • On reorg detection, roll back to a checkpoint before divergence and re‑apply logs.
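The rollback-and-replay idea can be sketched with an in-memory log and periodic checkpoints. All names here (Event, Ledger, checkpoint, reorg) are hypothetical, and the state is reduced to a net score per CID. The sketch assumes the caller takes a checkpoint at height H only after every event at or below H has been applied. For brevity it truncates the in-memory vector; a truly append-only log would presumably mark orphaned entries instead of dropping them.

```rust
use std::collections::BTreeMap;

/// One on-chain score event. Real events carry more fields.
#[derive(Debug, Clone)]
pub struct Event {
    pub height: u64,
    pub cid: String,
    pub score_delta: i64,
}

/// Compact doc state: net score per CID.
pub type State = BTreeMap<String, i64>;

/// State captured after every event at or below `height` was applied.
struct Checkpoint {
    height: u64,
    log_len: usize,
    state: State,
}

#[derive(Default)]
pub struct Ledger {
    log: Vec<Event>,
    checkpoints: Vec<Checkpoint>,
    state: State,
}

impl Ledger {
    /// Applies an event to the state and records it in the log.
    pub fn apply(&mut self, ev: Event) {
        *self.state.entry(ev.cid.clone()).or_insert(0) += ev.score_delta;
        self.log.push(ev);
    }

    pub fn checkpoint(&mut self, height: u64) {
        self.checkpoints.push(Checkpoint {
            height,
            log_len: self.log.len(),
            state: self.state.clone(),
        });
    }

    /// Handles a reorg whose first divergent block is `fork_height`:
    /// restore the newest checkpoint below it, replay surviving events,
    /// then apply the events of the new canonical branch.
    pub fn reorg(&mut self, fork_height: u64, canonical: Vec<Event>) {
        self.checkpoints.retain(|c| c.height < fork_height);
        let (log_len, state) = match self.checkpoints.last() {
            Some(c) => (c.log_len, c.state.clone()),
            None => (0, State::new()),
        };
        let mut tail = self.log.split_off(log_len); // events after the checkpoint
        tail.retain(|e| e.height < fork_height); // drop orphaned events
        self.state = state;
        for ev in tail.into_iter().chain(canonical) {
            self.apply(ev);
        }
    }

    pub fn score(&self, cid: &str) -> i64 {
        self.state.get(cid).copied().unwrap_or(0)
    }
}
```

Checkpoints bound the replay cost: only events recorded after the restored checkpoint are re-applied, not the whole log.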


7) Snapshots, Merkle Roots & Federation

What is a snapshot? A per‑language index bundle plus metadata, containing:

  • DocSet: ordered list of (cid, status flags, score_net) (or a deterministic encoding).

  • Merkle root over the DocSet (e.g., SHA‑256 Merkle tree with fixed leaf encode).

  • Posting lists digest (optional v0.1) — per term group digest for stronger checks.

  • Metadata:

    • lang, fromTime, toTime (seconds, not blocks)

    • rendererVersion, sanitizerVersion, policyHash

    • analyzerVersion, paramsHash (Config)

    • indexerVersion, formatVersion

  • Bundle: compressed archive pinned to IPFS.

  • Signature: ed25519 signature of (lang | root | metaHash | ipfsCid) with the node’s key.
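As an illustration of a Merkle root over a DocSet with a fixed leaf encoding, here is a dependency-free sketch. Several things are assumptions, not the spec's format: the leaf layout (length-prefixed cid, status byte, big-endian score_net), the 0x00/0x01 domain tags, and the rule that an unpaired node is carried up unchanged. Most importantly, FNV-1a stands in for SHA-256 only so the example needs no crates; it is not collision resistant and must not be used in a real node.

```rust
/// One DocSet entry, in snapshot order.
#[derive(Debug, Clone)]
pub struct DocEntry {
    pub cid: Vec<u8>,
    pub status: u8,
    pub score_net: i64,
}

/// Fixed leaf encoding (assumed layout for this sketch).
fn encode_leaf(e: &DocEntry) -> Vec<u8> {
    let mut out = vec![0x00]; // leaf domain tag
    out.extend_from_slice(&(e.cid.len() as u32).to_be_bytes());
    out.extend_from_slice(&e.cid);
    out.push(e.status);
    out.extend_from_slice(&e.score_net.to_be_bytes());
    out
}

/// Computes the root with the supplied hash function.
/// Returns `None` for an empty DocSet.
pub fn merkle_root<F>(docs: &[DocEntry], hash: F) -> Option<Vec<u8>>
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    let mut level: Vec<Vec<u8>> = docs.iter().map(|d| hash(&encode_leaf(d))).collect();
    while level.len() > 1 {
        let mut next = Vec::with_capacity((level.len() + 1) / 2);
        for pair in level.chunks(2) {
            if pair.len() == 2 {
                let mut buf = vec![0x01]; // interior-node domain tag
                buf.extend_from_slice(&pair[0]);
                buf.extend_from_slice(&pair[1]);
                next.push(hash(&buf));
            } else {
                next.push(pair[0].clone()); // unpaired node carried up unchanged
            }
        }
        level = next;
    }
    level.pop()
}

/// STAND-IN ONLY: 64-bit FNV-1a in place of SHA-256.
pub fn standin_hash(data: &[u8]) -> Vec<u8> {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h.to_be_bytes().to_vec()
}
```

Because the hash is a parameter, swapping in a real SHA-256 implementation changes one argument at the call site and nothing in the tree logic. The same ordered DocSet always produces the same root, which is the property the federation checks rely on.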

Lifecycle

  1. Build snapshot at interval or when doc delta > threshold.

  2. Compute root and metaHash, pin the bundle to IPFS to obtain ipfsCid, then sign (lang | root | metaHash | ipfsCid).

  3. Gossip announcement (libp2p): (lang, root, metaHash, ipfsCid, sig).

  4. On‑chain announce (NodeRewards): announceSnapshot(lang, {root, metaHash, ipfsCid}, sig).

  5. Fetch & Validate from peers: a node consumes a foreign‑language snapshot only if ≥2 builders announced matching roots (or a committee attests to the root), then performs random CID spot checks.

  6. Serve foreign language queries from fetched snapshot (read‑only), and re‑validate on new announcements.
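The signed tuple (lang | root | metaHash | ipfsCid) needs an unambiguous byte encoding before it can be signed or verified. The spec does not define one here, so the following is only one plausible sketch: each field is prefixed with its length as a big-endian u32. The struct and function names are hypothetical, and the Ed25519 signing step itself is left out because it would require an external crate.

```rust
/// Fields covered by the snapshot signature, in spec order.
pub struct Announcement<'a> {
    pub lang: &'a str,
    pub root: &'a [u8],
    pub meta_hash: &'a [u8],
    pub ipfs_cid: &'a str,
}

/// Serializes (lang | root | metaHash | ipfsCid) with a u32 big-endian
/// length before each field so that field boundaries cannot shift.
/// The length-prefix scheme is an assumption of this sketch.
pub fn signing_payload(a: &Announcement) -> Vec<u8> {
    let fields: [&[u8]; 4] = [
        a.lang.as_bytes(),
        a.root,
        a.meta_hash,
        a.ipfs_cid.as_bytes(),
    ];
    let mut out = Vec::new();
    for f in fields.iter() {
        out.extend_from_slice(&(f.len() as u32).to_be_bytes());
        out.extend_from_slice(f);
    }
    out
}
```

Plain concatenation would let bytes migrate between adjacent fields without changing the signed message; the length prefixes rule that out.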

Committee fallback

If a language has only one builder, DAO‑allowlisted watchers act as a committee to attest (2‑of‑N signatures) that the snapshot’s Merkle root matches content spot‑checks.


8) Query Processing & Ranking

Endpoints support:

  • Top: BM25‑like (fixed constants) over title + body + subtitle + tags.

  • Trending: trend = score_net * exp(− age / halfLife) with half‑life default 36h, plus optional recent upvote velocity bonus (small).

  • Newest: by created_at descending.
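The trending formula can be sketched directly. One point worth noting: taken literally, exp(− age / halfLife) decays the score to 1/e (about 36.8%) after one halfLife, not to one half, so the parameter behaves as an e-folding time. The first function below follows the formula as written; the second is a hypothetical variant that halves the score every halfLife, shown only for comparison. The optional upvote velocity bonus is omitted because its form is not specified.

```rust
/// Default from the spec: 36 hours.
pub const DEFAULT_HALF_LIFE_HOURS: f64 = 36.0;

/// trend = score_net * exp(-age / halfLife), exactly as written in the spec.
/// Both times are in hours.
pub fn trend(score_net: i64, age_hours: f64, half_life_hours: f64) -> f64 {
    score_net as f64 * (-age_hours / half_life_hours).exp()
}

/// Comparison variant (not from the spec): a strict half-life,
/// score_net * 2^(-age / halfLife).
pub fn trend_strict_half_life(score_net: i64, age_hours: f64, half_life_hours: f64) -> f64 {
    score_net as f64 * 0.5f64.powf(age_hours / half_life_hours)
}
```

For a document with score_net = 100 that is 36 hours old, the spec formula gives roughly 36.8 while the strict variant gives 50, so nodes must agree on which form they implement for rankings to match.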

Filters: lang, author, tag, status, time window (createdAt range).

Pagination: from (offset) + size (limit), size ≤ 50.

Stability

  • Keep ranking constants and half‑life in open config (node reports in /version and snapshot metadata).

  • Clients can request &ranker=top|trending|new.
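The request limits stated in this document (ranker in top|trending|new, size ≤ 50, and q of at most 256 characters) can be enforced before a query reaches the index. This is a minimal sketch with hypothetical names; it assumes out-of-range values are rejected rather than clamped, which the spec states explicitly only for q.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ranker {
    Top,
    Trending,
    New,
}

impl std::str::FromStr for Ranker {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "top" => Ok(Ranker::Top),
            "trending" => Ok(Ranker::Trending),
            "new" => Ok(Ranker::New),
            other => Err(format!("unknown ranker: {}", other)),
        }
    }
}

pub const MAX_SIZE: usize = 50;
pub const MAX_QUERY_CHARS: usize = 256;

#[derive(Debug, PartialEq, Eq)]
pub struct SearchParams {
    pub q: String,
    pub ranker: Ranker,
    pub from: usize,
    pub size: usize,
}

/// Validates raw query-string values. Rejecting an oversized `size`
/// (instead of clamping it) is an assumption of this sketch.
pub fn validate(q: &str, ranker: &str, from: usize, size: usize) -> Result<SearchParams, String> {
    if q.chars().count() > MAX_QUERY_CHARS {
        return Err(format!("q exceeds {} characters", MAX_QUERY_CHARS));
    }
    if size > MAX_SIZE {
        return Err(format!("size exceeds {}", MAX_SIZE));
    }
    let ranker: Ranker = ranker.parse()?;
    Ok(SearchParams { q: q.to_string(), ranker, from, size })
}
```

Counting characters rather than bytes keeps the limit fair for non-Latin queries, where one character may occupy several bytes.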


9) Public APIs (HTTP/2 JSON)

9.1 Search (/v1/search)

9.2 Article

Returns parsed manifest (safe fields), verified hash status, computed metrics, lineage info (prev/next if known), dispute labels/notes (if any), and up/down weights.

9.3 Trending

Returns top N by trending score within the window.

9.4 Snapshot metadata

9.5 Health & version

9.6 WebSocket Stream

Emits JSON lines:


10) gRPC Control (node↔node)

Service (excerpt)

  • mTLS optional between known peers; otherwise Noise (libp2p) on QUIC for gossip channel.


11) Federation Policy

  • Consume foreign language snapshots only if:

    • At least 2 distinct builders announced identical root or

    • Committee attestations (2‑of‑N watchers) validate a single builder’s root.

  • Spot checks: randomly select K CIDs from the foreign snapshot, fetch AOM, verify manifest and presence in DocSet before accepting.

  • Quarantine: mismatches or time‑skewed snapshots are quarantined and reported via /v1/health.
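The acceptance rule can be expressed as a small pure function. This is a sketch under assumptions: builders and watchers are identified by plain strings, signatures and the DAO allowlist are presumed already verified by the caller, and when two different roots each reach a builder quorum the function accepts neither (a conservative choice the spec does not spell out).

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type Root = [u8; 32];

/// Returns the root to accept, if any.
/// `announcements`: (builder id, root). `attestations`: (watcher id, root).
pub fn accepted_root(
    announcements: &[(&str, Root)],
    attestations: &[(&str, Root)],
) -> Option<Root> {
    // Group distinct builders by the root they announced.
    let mut builders: BTreeMap<Root, BTreeSet<&str>> = BTreeMap::new();
    for &(id, root) in announcements {
        builders.entry(root).or_default().insert(id);
    }

    // Rule 1: at least 2 distinct builders announced an identical root.
    let quorum: Vec<Root> = builders
        .iter()
        .filter(|(_, ids)| ids.len() >= 2)
        .map(|(root, _)| *root)
        .collect();
    match quorum.len() {
        1 => return Some(quorum[0]),
        0 => {}
        _ => return None, // conflicting quorums: accept nothing (assumption)
    }

    // Rule 2: 2-of-N watchers attest to a root that some builder announced.
    let mut watchers: BTreeMap<Root, BTreeSet<&str>> = BTreeMap::new();
    for &(id, root) in attestations {
        if builders.contains_key(&root) {
            watchers.entry(root).or_default().insert(id);
        }
    }
    watchers
        .iter()
        .find(|(_, ids)| ids.len() >= 2)
        .map(|(root, _)| *root)
}
```

Using sets keyed by identity means a single builder announcing the same root twice still counts once, which is what "distinct builders" requires.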


12) Operator Guide

12.1 Hardware & OS

  • Builder (2 langs): 2 vCPU, 4–8 GB RAM, 100 GB SSD, Linux x86_64 (Ubuntu 22.04+).

  • Serve‑only: 2 vCPU, 4 GB RAM, 50 GB SSD.

12.2 Dependencies

  • Rust stable, Docker (optional), local IPFS node (Kubo), access to Kasplex RPC endpoint (HTTPS).

12.3 Config file (cin.toml)

12.4 CLI


13) Performance & SLOs

  • Query p95: < 400 ms @ 10k docs per language; scale horizontally by language.

  • Cold fetch to first paint: < 2.0 s (frontend coordinated).

  • Snapshot build: < 5 s for 10k docs/lang (incremental).

  • Availability: 99.9% for public nodes (SLA guidelines for rewards uptime).

Optimizations:

  • Memory‑mapped Tantivy indices; warm caches.

  • Concurrent IPFS gateway race; HTTP/3 preferred.

  • Throttled rendering pool; reuse WASM instances.


14) Security, Abuse & DoS Mitigation

  • Rate limiting per client IP or API token (nginx/envoy front or internal).

  • Query timeouts and size caps; reject q > 256 chars; pagination caps.

  • Backpressure: bounded work queues for render, fetch, index.

  • Sanitizer is strict; invalid packages quarantined.

  • TLS for HTTP/2; mTLS optional for gRPC peer meshes.

  • Key management: ed25519 keys are stored in the OS keyring, or in an HSM if available; rotate them during a scheduled downtime window.
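The backpressure item above can be illustrated with the standard library's bounded channel. This is a sketch only: Job, bounded_queue and try_enqueue are hypothetical names, and a real node would likely use an async runtime's bounded channel instead. The point it shows is that a full queue is reported to the producer immediately instead of letting work pile up.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// A render/fetch/index job; the payload here is just a CID string.
pub struct Job(pub String);

/// Creates a bounded queue holding at most `capacity` pending jobs.
pub fn bounded_queue(capacity: usize) -> (SyncSender<Job>, Receiver<Job>) {
    sync_channel(capacity)
}

/// Tries to enqueue without blocking. Returns false when the queue is full
/// or the worker side has gone away, so the caller can shed load or retry.
pub fn try_enqueue(tx: &SyncSender<Job>, job: Job) -> bool {
    match tx.try_send(job) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false,
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

A worker thread would drain the Receiver with recv(); each job it takes frees one slot, which is what lets producers make progress again.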


15) Observability

  • Metrics (Prometheus):

    • ingest_lag_seconds, ipfs_fetch_ms, preview_mismatch_count, index_size_bytes, query_latency_ms{ranker}, snapshot_age_seconds{lang}, peers_count, spotcheck_failures.

  • Logs (structured, JSON): per request id; redact user IP by default.

  • Tracing (tracing crate): spans for fetch/render/index steps.


16) Testing Matrix

Unit

  • AOM validation (hashes, image constraints).

  • Render determinism & sanitizer policy.

  • Merkle root generation (stable across runs).

  • Analyzers per language.

Integration

  • Ingest from mocked RPC; reorg simulation.

  • IPFS failures & gateway fallback.

  • Federation: accept on 2‑of‑3 roots; quarantine on mismatch.

  • API correctness: same query yields same order (stable ranking).

Property

  • Snapshot idempotence: same corpus → identical root & metaHash.

  • Random spot check selection stable given seed; fails when content missing.
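The spot-check property (stable selection given a seed, failure when content is missing) can be sketched without any crates. The generator below is SplitMix64, chosen here only because it is tiny and deterministic; the spec does not name a generator, and all function names are hypothetical. The modulo step has a slight bias that is ignored in this sketch.

```rust
/// SplitMix64: a small deterministic generator (an assumption of this sketch).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Picks min(k, n) distinct indices in 0..n via a partial Fisher-Yates shuffle.
/// The same (n, k, seed) always yields the same indices.
pub fn pick_indices(n: usize, k: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    let mut idx: Vec<usize> = (0..n).collect();
    let k = k.min(n);
    for i in 0..k {
        let j = i + (rng.next_u64() % (n - i) as u64) as usize;
        idx.swap(i, j);
    }
    idx.truncate(k);
    idx
}

/// Runs `verify` on K sampled CIDs; any failure fails the whole check.
/// `verify` stands in for "fetch AOM, verify manifest".
pub fn spot_check<F>(docset: &[String], k: usize, seed: u64, verify: F) -> bool
where
    F: Fn(&str) -> bool,
{
    pick_indices(docset.len(), k, seed)
        .iter()
        .all(|&i| verify(docset[i].as_str()))
}
```

Fixing the seed makes a failing test reproducible, while production use would presumably draw the seed from a source that peers cannot predict.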

Load

  • 500 RPS mixed queries; p95 under targets.

  • Snapshot build during query load maintains SLOs (priority scheduling).


17) Acceptance Criteria

  • Node can ingest from empty state to latest with no manual intervention.

  • For primary languages, node builds snapshots and signs announcements.

  • For foreign languages, node fetches ≥2 matching peer snapshots, spot‑checks K CIDs, and serves queries.

  • APIs meet latency and correctness targets; health & metrics exposed.

  • Snapshot metadata includes paramsHash, renderer/analyzer versions; UIs display them.


18) Future Extensions

  • Per‑query proofs (inclusion & ranking) with verifiable IR invariants.

  • ZK search proofs for privacy‑preserving verification.

  • Delta snapshots (SMT) to shrink bundles.

  • Peer reputation scoring.

  • Encrypted transport of fetched bundles with peer attestation (TEE optional).


19) Changelog

  • v1: Initial CIN Node spec with federation, gRPC/HTTP2/WS transports, Tantivy indexing, Merkle snapshots, and operator guidance.
