CIN Node (Layer‑3)
Status: Finalized for v0 / v0.1
Owners: Node Operators WG · Rust Team · DevOps
Last updated: YYYY‑MM‑DD
1) Purpose & Scope
The Crumpet Indexer Node (CIN) is the Layer‑3 daemon that:
Ingests on‑chain events (publish, actions, disputes).
Fetches & verifies AOM packages (IPFS).
Renders & normalizes content deterministically.
Indexes documents per language.
Publishes snapshots (Merkle‑rooted) and participates in federation.
Serves queries over HTTP/2 + JSON and WebSocket streams, with gRPC for node‑to‑node control.
ELI5: Each node is a librarian: it watches the blockchain, grabs the boxes (articles), checks them, files them by language, and answers your searches. Librarians compare notes so nobody can hide books.
Out of scope: on‑chain token logic (contracts), frontend UI design, DAO governance.
2) Responsibilities & Non‑Goals
Responsibilities
Reliable event ingestion with reorg handling.
AOM validation (hashes, layout, renderer policy).
Deterministic rendering (Rust/WASM) when preview.html is missing or untrusted.
Per‑language Tantivy indexes with analyzers (ICU/CJK/RTL).
Snapshot creation, signing, gossip, and cross‑validation.
Low‑latency query API & live updates.
Optional on‑chain announcements (Node Rewards “build” credits).
Non‑Goals
Global consensus or finality (we verify and cross‑check).
Per‑query cryptographic proofs (future v1).
Moderation decisions (we display labels, never delete content).
3) Architecture Overview
Language: Rust (stable), with the renderer embedded as WASM.
Core crates (workspace):
aom/ — AOM parse, CAR/UnixFS, hashing & constraints.
renderer/ — Markdown→HTML, sanitizer, deterministic serializer (WASM/native).
indexer/ — Tantivy integration, schema, analyzers.
node/ — services: ingest, verifier, snapshotter, federation, APIs.
proto/ — gRPC definitions (tonic).
Data flow (simplified):
4) Connectivity & Transports
Node↔Node control: gRPC over HTTP/2 (tonic, TLS), for health, snapshot metadata, attestations.
Node↔Node gossip: libp2p (Noise) over QUIC for snapshot announcements & peer discovery.
Node↔Node data: HTTP/3 (QUIC) for snapshot bundles (falls back to HTTP/2).
Node↔UI:
HTTP/2 JSON for queries (/v1/search, /v1/article, etc.).
WebSocket for live events (/v1/ws/events).
Optional gRPC‑Web (via Envoy) if clients prefer RPC.
Ports (default):
8080 — HTTP/2 (queries & health), upgrades to WS
8081 — gRPC control (mTLS capable)
4001 — libp2p QUIC (gossip)
5001 — local IPFS API (external dependency)
5) Document Model (Index Schema)
Tantivy schema fields (per CID/edition):
cid (bytes) — primary key (multihash digest).
first_cid (bytes) — lineage root.
version (u32).
author (bytes) — 20‑byte address.
created_at (i64) — seconds since epoch.
lang (string) — BCP‑47.
title (text, analyzed).
subtitle (text, analyzed).
tags (string array, keyword filterable).
body (text, analyzed) — rendered text content (sanitized).
status (u8) — 0: normal, 1: under dispute, 2: action taken (label present).
score_up (i64), score_down (i64), score_net (i64) — from on‑chain.
params_hash (bytes32) — Config snapshot hash at index time.
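As a rough illustration, the schema fields above could map onto a plain Rust record like the one below. The type names `DocStatus` and `IndexedDoc`, and the choice of Rust types for each field, are assumptions of this sketch; the actual Tantivy schema definition in indexer/ may look different.

```rust
/// Label state of an indexed edition, mirroring the `status` field (u8).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum DocStatus {
    Normal = 0,
    UnderDispute = 1,
    ActionTaken = 2,
}

impl DocStatus {
    /// Decode the stored u8; unknown values are rejected rather than guessed.
    pub fn from_u8(v: u8) -> Option<Self> {
        match v {
            0 => Some(DocStatus::Normal),
            1 => Some(DocStatus::UnderDispute),
            2 => Some(DocStatus::ActionTaken),
            _ => None,
        }
    }
}

/// One indexed document (per CID/edition). Field names follow the schema;
/// the Rust types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndexedDoc {
    pub cid: Vec<u8>,          // primary key (multihash digest)
    pub first_cid: Vec<u8>,    // lineage root
    pub version: u32,
    pub author: [u8; 20],      // 20-byte address
    pub created_at: i64,       // seconds since epoch
    pub lang: String,          // BCP-47 tag
    pub title: String,
    pub subtitle: String,
    pub tags: Vec<String>,
    pub body: String,          // rendered, sanitized text
    pub status: DocStatus,
    pub score_up: i64,
    pub score_down: i64,
    pub score_net: i64,
    pub params_hash: [u8; 32], // Config snapshot hash at index time
}
```

Keeping `status` as an enum with an explicit `from_u8` means an unexpected value read back from storage surfaces as `None` instead of being silently treated as "normal".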
Analyzers
Latin languages: ICU tokenizer + stemming (where applicable).
CJK: bigram tokenizer.
RTL and Indic scripts (Arabic, Hindi): ICU word break.
Per‑language analyzer selected by lang. Unknown lang → not indexed in text fields but still stored (metadata retrievable).
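The selection rule can be sketched as a lookup on the primary subtag of the BCP‑47 tag. The enum `AnalyzerKind`, both function names, and the concrete language lists are assumptions for illustration only; the real analyzers are built on ICU and Tantivy, which this std-only sketch does not call. The bigram function shows the overlapping two-character windows that a CJK bigram tokenizer emits.

```rust
/// Which analysis pipeline a document's text fields go through.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AnalyzerKind {
    IcuStemmed,   // ICU tokenizer + stemming
    CjkBigram,    // overlapping character bigrams
    IcuWordBreak, // ICU word-break rules only
    StoredOnly,   // unknown language: metadata stored, text not indexed
}

/// Pick an analyzer from a BCP-47 tag such as "en-US" or "zh-Hant".
/// The language lists here are illustrative, not the spec's full set.
pub fn select_analyzer(lang: &str) -> AnalyzerKind {
    let primary = lang.split('-').next().unwrap_or("").to_ascii_lowercase();
    match primary.as_str() {
        "en" | "fr" | "de" | "es" | "it" | "pt" => AnalyzerKind::IcuStemmed,
        "zh" | "ja" | "ko" => AnalyzerKind::CjkBigram,
        "ar" | "hi" => AnalyzerKind::IcuWordBreak,
        _ => AnalyzerKind::StoredOnly,
    }
}

/// Split text into overlapping two-character tokens, ignoring whitespace.
/// Inputs shorter than two characters yield the characters themselves.
pub fn cjk_bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    if chars.len() < 2 {
        return chars.iter().map(|c| c.to_string()).collect();
    }
    chars
        .windows(2)
        .map(|pair| pair.iter().collect::<String>())
        .collect()
}
```

For example, `select_analyzer("zh-Hant")` resolves to the bigram analyzer, and `cjk_bigrams("東京都")` yields the tokens 東京 and 京都.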
6) Ingestion & Verification
On‑chain sources
ArticlePublished(author, cid, version, previousCid, createdAt)
Upvoted/Downvoted/Flagged (Actions)
DisputeOpened/Resolved (Disputes)
Pipeline
Subscribe to RPC logs from a configured start time (supports backfill).
Finality buffer: delay inclusion in a final snapshot by --publish-finality-seconds (default 60s).
Fetch AOM package (CID) from: local IPFS → peers’ HTTP caches → multi‑gateway race.
Verify:
Manifest schema (DAG‑CBOR), hashes.bodyMd, hashes.assets.
If preview.html present, re‑render and compare hash; otherwise render body.md.
Images under /media, size limits, count ≤10.
Normalize: extract text, compute reading metrics, compute params_hash (from Config at ingestion).
Index into the language segment.
Track on‑chain score deltas and dispute status; update document fields.
Reorg handling
Maintain an append‑only event log with chain height/time and a compact doc state.
On reorg detection, roll back to a checkpoint before divergence and re‑apply logs.
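One way to picture the rollback is a log that never deletes records, a map of checkpointed states keyed by height, and a compact state derived from both. The sketch below is an assumption-heavy illustration: `EventLog`, `Record`, and `rollback_before` are hypothetical names, the compact state is reduced to a per-CID net score, and heights stand in for whatever position marker the node actually records.

```rust
use std::collections::BTreeMap;

/// A record in the append-only log. Rollbacks are recorded, not erased.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Record {
    Delta { height: u64, cid: String, score_delta: i64 },
    RollbackTo { checkpoint_height: u64 },
}

#[derive(Debug, Default)]
pub struct EventLog {
    records: Vec<Record>,
    checkpoints: BTreeMap<u64, BTreeMap<String, i64>>, // height -> state copy
    state: BTreeMap<String, i64>,                      // cid -> net score
}

impl EventLog {
    /// Append a score change and fold it into the compact state.
    pub fn apply(&mut self, height: u64, cid: &str, score_delta: i64) {
        *self.state.entry(cid.to_string()).or_insert(0) += score_delta;
        self.records.push(Record::Delta { height, cid: cid.to_string(), score_delta });
    }

    /// Remember the compact state as of `height`.
    pub fn checkpoint(&mut self, height: u64) {
        self.checkpoints.insert(height, self.state.clone());
    }

    /// Restore the newest checkpoint strictly below the divergence height.
    /// Returns that height, or None when no such checkpoint exists.
    pub fn rollback_before(&mut self, diverged_at: u64) -> Option<u64> {
        let (height, snapshot) = self
            .checkpoints
            .range(..diverged_at)
            .next_back()
            .map(|(h, s)| (*h, s.clone()))?;
        self.state = snapshot;
        self.checkpoints.retain(|h, _| *h <= height); // drop orphaned checkpoints
        self.records.push(Record::RollbackTo { checkpoint_height: height });
        Some(height)
    }

    pub fn score(&self, cid: &str) -> i64 {
        self.state.get(cid).copied().unwrap_or(0)
    }

    pub fn record_count(&self) -> usize {
        self.records.len()
    }
}
```

After `rollback_before` returns, the caller re-applies the canonical branch's events with `apply`, so the log ends up holding the orphaned deltas, the rollback marker, and the replacement deltas in order.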
7) Snapshots, Merkle Roots & Federation
What is a snapshot? A per‑language index bundle plus metadata, containing:
DocSet: ordered list of (cid, status flags, score_net) (or a deterministic encoding).
Merkle root over the DocSet (e.g., SHA‑256 Merkle tree with fixed leaf encoding).
Posting lists digest (optional v0.1) — per term group digest for stronger checks.
Metadata:
lang, fromTime, toTime (seconds, not blocks)
rendererVersion, sanitizerVersion, policyHash
analyzerVersion, paramsHash (Config)
indexerVersion, formatVersion
Bundle: compressed archive pinned to IPFS.
Signature: ed25519 signature of (lang | root | metaHash | ipfsCid) with the node’s key.
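To make the DocSet root concrete, here is a sketch of a Merkle tree over fixed-encoded leaves. Several things in it are assumptions rather than the spec's format: the leaf layout (length-prefixed CID, status byte, big-endian score), the 0x00/0x01 domain-separation prefixes, and promoting an unpaired node unchanged to the next level. The hash is passed in as a function so the block stays std-only; `fnv1a64` is a non-cryptographic stand-in used purely to make the example runnable, and a real node would plug in SHA‑256 here.

```rust
/// One DocSet entry, in the snapshot's deterministic order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DocEntry {
    pub cid: Vec<u8>,
    pub status: u8,
    pub score_net: i64,
}

/// Fixed leaf encoding (assumed layout): len(cid) as u32 BE, cid, status, score_net BE.
pub fn encode_leaf(e: &DocEntry) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + e.cid.len() + 9);
    out.extend_from_slice(&(e.cid.len() as u32).to_be_bytes());
    out.extend_from_slice(&e.cid);
    out.push(e.status);
    out.extend_from_slice(&e.score_net.to_be_bytes());
    out
}

/// Stand-in hash (FNV-1a, 64-bit). NOT cryptographic; replace with SHA-256.
pub fn fnv1a64(data: &[u8]) -> Vec<u8> {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h.to_be_bytes().to_vec()
}

/// Merkle root over the DocSet; None for an empty set.
pub fn merkle_root<H: Fn(&[u8]) -> Vec<u8>>(docs: &[DocEntry], hash: H) -> Option<Vec<u8>> {
    // Leaf level: hash(0x00 || leaf encoding).
    let mut level: Vec<Vec<u8>> = docs
        .iter()
        .map(|d| {
            let mut buf = vec![0x00];
            buf.extend(encode_leaf(d));
            hash(&buf)
        })
        .collect();
    // Inner levels: hash(0x01 || left || right); an odd node moves up unchanged.
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [left, right] => {
                    let mut buf = vec![0x01];
                    buf.extend_from_slice(left);
                    buf.extend_from_slice(right);
                    hash(&buf)
                }
                [single] => single.clone(),
                _ => unreachable!("chunks(2) yields one or two items"),
            })
            .collect();
    }
    level.pop()
}
```

Because both the entry order and the leaf bytes are fixed, two nodes that index the same corpus compute the same root, which is what the cross-validation step relies on.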
Lifecycle
Build snapshot at interval or when doc delta > threshold.
Sign & Pin snapshot to IPFS; compute root, metaHash.
Gossip announcement (libp2p): (lang, root, metaHash, ipfsCid, sig).
On‑chain announce (NodeRewards): announceSnapshot(lang, {root, metaHash, ipfsCid}, sig).
Fetch & Validate from peers: a node consumes foreign languages only if ≥2 matching roots (or committee attest), then performs random CID spot checks.
Serve foreign language queries from fetched snapshot (read‑only), and re‑validate on new announcements.
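The announcement tuple is what gets signed, so its byte encoding has to be unambiguous. The spec writes the payload as (lang | root | metaHash | ipfsCid) without fixing an encoding; the sketch below assumes a length-prefixed concatenation, which is one common way to avoid two different tuples serializing to the same bytes. `Announcement` and `signing_payload` are hypothetical names, and the ed25519 signing call itself is left out because it needs an external crate.

```rust
/// Fields of a snapshot announcement, before signing.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Announcement {
    pub lang: String,
    pub root: Vec<u8>,
    pub meta_hash: Vec<u8>,
    pub ipfs_cid: String,
}

impl Announcement {
    /// Bytes to hand to the signer: each field as (u32 BE length, bytes).
    /// The length prefixes are an assumption of this sketch.
    pub fn signing_payload(&self) -> Vec<u8> {
        let fields: [&[u8]; 4] = [
            self.lang.as_bytes(),
            self.root.as_slice(),
            self.meta_hash.as_slice(),
            self.ipfs_cid.as_bytes(),
        ];
        let mut out = Vec::new();
        for field in fields {
            out.extend_from_slice(&(field.len() as u32).to_be_bytes());
            out.extend_from_slice(field);
        }
        out
    }
}
```

Without the prefixes, a root of [1, 2] followed by a metaHash of [3] would produce the same bytes as a root of [1] followed by a metaHash of [2, 3]; with them, the two payloads differ.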
Committee fallback
If a language has only one builder, DAO‑allowlisted watchers act as a committee to attest (2‑of‑N signatures) that the snapshot’s Merkle root matches content spot‑checks.
8) Query Processing & Ranking
Endpoints support:
Top: BM25‑like (fixed constants) over title + body + subtitle + tags.
Trending: trend = score_net * exp(− age / halfLife) with half‑life default 36h, plus optional recent upvote velocity bonus (small).
Newest: by created_at descending.
Filters: lang, author, tag, status, time window (createdAt range).
Pagination: from (offset) + size (limit), size ≤ 50.
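The trending formula can be sketched directly. This version assumes age and halfLife are both in seconds and leaves out the optional velocity bonus; `Candidate` and `rank_trending` are hypothetical names. Note that with exp(−age / halfLife) as written, the score falls to 1/e (about 37%) of its initial value after one halfLife rather than to 50%, so the parameter behaves as a decay time constant.

```rust
/// Default halfLife: 36 hours, expressed in seconds (unit is an assumption).
pub const DEFAULT_HALF_LIFE_SECS: f64 = 36.0 * 3600.0;

/// trend = score_net * exp(-age / halfLife)
pub fn trending_score(score_net: i64, age_secs: f64, half_life_secs: f64) -> f64 {
    (score_net as f64) * (-age_secs / half_life_secs).exp()
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub cid: String,
    pub score_net: i64,
    pub created_at: i64, // seconds since epoch
}

/// Sort by trending score, highest first. Ties break on CID so the same
/// inputs always produce the same order.
pub fn rank_trending(mut docs: Vec<Candidate>, now: i64, half_life_secs: f64) -> Vec<Candidate> {
    docs.sort_by(|a, b| {
        let sa = trending_score(a.score_net, (now - a.created_at) as f64, half_life_secs);
        let sb = trending_score(b.score_net, (now - b.created_at) as f64, half_life_secs);
        sb.total_cmp(&sa).then_with(|| a.cid.cmp(&b.cid))
    });
    docs
}
```

With the default constant, an article with score_net 10 published 72 hours ago scores about 1.35 and ranks below a fresh article with score_net 5.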
Stability
Keep ranking constants and half‑life in open config (node reports in /version and snapshot metadata).
Clients can request &ranker=top|trending|new.
9) Public APIs (HTTP/2 JSON)
9.1 Search
Response
9.2 Article
Returns parsed manifest (safe fields), verified hash status, computed metrics, lineage info (prev/next if known), dispute labels/notes (if any), and up/down weights.
9.3 Trending
Returns top N by trending score within the window.
9.4 Snapshot metadata
9.5 Health & version
9.6 WebSocket Stream
Emits JSON lines:
10) gRPC Control (node↔node)
Service (excerpt)
mTLS optional between known peers; otherwise Noise (libp2p) on QUIC for gossip channel.
11) Federation Policy
Consume foreign language snapshots only if:
At least 2 distinct builders announced identical root or
Committee attestations (2‑of‑N watchers) validate a single builder’s root.
Spot checks: randomly select K CIDs from the foreign snapshot, fetch AOM, verify manifest and presence in DocSet before accepting.
Quarantine: mismatches or time‑skewed snapshots are quarantined and reported via /v1/health.
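The acceptance rule above can be read as a small decision function. The sketch below is one interpretation, not the node's actual logic: `Decision` and `decide` are hypothetical names, builders and watchers are identified by plain strings, signatures are assumed to have been verified already, and spot checks and time-skew detection are left to a later step. Where the policy is silent, the sketch quarantines on conflicting roots without a quorum and returns `Pending` when there is simply not enough evidence yet.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    Accept(Vec<u8>), // root to fetch and spot-check
    Pending,         // not enough evidence yet
    Quarantine,      // conflicting announcements
}

/// `announcements`: (builder id, root). `attestations`: (watcher id, root).
pub fn decide(
    announcements: &[(&str, &[u8])],
    attestations: &[(&str, &[u8])],
    allowlisted_watchers: &BTreeSet<&str>,
) -> Decision {
    // Group distinct builders by the root they announced.
    let mut builders_by_root: BTreeMap<&[u8], BTreeSet<&str>> = BTreeMap::new();
    for (builder, root) in announcements {
        builders_by_root.entry(*root).or_default().insert(*builder);
    }
    let quorum: Vec<&[u8]> = builders_by_root
        .iter()
        .filter(|(_, builders)| builders.len() >= 2)
        .map(|(root, _)| *root)
        .collect();

    match (quorum.len(), builders_by_root.len()) {
        // Exactly one root backed by >= 2 distinct builders.
        (1, _) => Decision::Accept(quorum[0].to_vec()),
        (0, 0) => Decision::Pending,
        // Single builder: fall back to 2-of-N allowlisted watchers.
        (0, 1) => {
            let root = *builders_by_root.keys().next().unwrap();
            let watchers: BTreeSet<&str> = attestations
                .iter()
                .filter(|(w, r)| *r == root && allowlisted_watchers.contains(*w))
                .map(|(w, _)| *w)
                .collect();
            if watchers.len() >= 2 {
                Decision::Accept(root.to_vec())
            } else {
                Decision::Pending
            }
        }
        // Several roots and no single quorum: treat as a mismatch.
        _ => Decision::Quarantine,
    }
}
```

Counting distinct builder and watcher ids, rather than raw messages, keeps one node from reaching the threshold by announcing the same root twice.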
12) Operator Guide
12.1 Hardware & OS
Builder (2 langs): 2 vCPU, 4–8 GB RAM, 100 GB SSD, Linux x86_64 (Ubuntu 22.04+).
Serve‑only: 2 vCPU, 4 GB RAM, 50 GB SSD.
12.2 Dependencies
Rust stable, Docker (optional), local IPFS node (Kubo), access to Kasplex RPC endpoint (HTTPS).
12.3 Config file (cin.toml)
12.4 CLI
13) Performance & SLOs
Query p95: < 400 ms @ 10k docs per language; scale horizontally by language.
Cold fetch to first paint: < 2.0 s (frontend coordinated).
Snapshot build: < 5 s for 10k docs/lang (incremental).
Availability: 99.9% for public nodes (SLA guidelines for rewards uptime).
Optimizations:
Memory‑mapped Tantivy indices; warm caches.
Concurrent IPFS gateway race; HTTP/3 preferred.
Throttled rendering pool; reuse WASM instances.
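The gateway race in the list above can be sketched with plain threads and a channel: every source is asked at once and the first successful response wins. This is a std-only illustration under stated assumptions. `Fetcher` and `race_fetch` are hypothetical names, the fetchers are arbitrary closures rather than real HTTP/3 clients, and the node itself would more likely use async tasks. The caller still has to verify the returned bytes against the CID.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// A source that tries to fetch the bytes for a CID.
pub type Fetcher = Box<dyn Fn(&str) -> Result<Vec<u8>, String> + Send>;

/// Ask all fetchers concurrently; return the first Ok result, or None if
/// every fetcher fails or the overall timeout expires.
pub fn race_fetch(cid: &str, fetchers: Vec<Fetcher>, timeout: Duration) -> Option<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    let total = fetchers.len();
    for fetch in fetchers {
        let tx = tx.clone();
        let cid = cid.to_string();
        thread::spawn(move || {
            // The receiver may already be gone if another source won.
            let _ = tx.send(fetch(cid.as_str()));
        });
    }
    drop(tx);

    let deadline = Instant::now() + timeout;
    for _ in 0..total {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(Ok(bytes)) => return Some(bytes), // first success wins
            Ok(Err(_)) => continue,              // that source failed; keep waiting
            Err(_) => return None,               // timed out or all senders gone
        }
    }
    None
}
```

Losing threads are not cancelled in this sketch; they run to completion and their results are discarded, which is acceptable for short requests but is something a production implementation would want to bound.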
14) Security, Abuse & DoS Mitigation
Rate limiting per client IP or API token (nginx/envoy front or internal).
Query timeouts and size caps; reject q > 256 chars; pagination caps.
Backpressure: bounded work queues for render, fetch, index.
Sanitizer is strict; invalid packages quarantined.
TLS for HTTP/2; mTLS optional for gRPC peer meshes.
Key management: ed25519 keys stored via OS keyring or HSM if available; rotate with downtime window.
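Two of the mitigations above, request caps and bounded queues, are easy to show in miniature. The sketch assumes that "chars" means Unicode scalar values and that an oversized page size is rejected rather than clamped; neither is spelled out in the spec. `validate_search`, `QueryError`, and `try_enqueue` are hypothetical names.

```rust
use std::sync::mpsc::SyncSender;

pub const MAX_QUERY_CHARS: usize = 256;
pub const MAX_PAGE_SIZE: usize = 50;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryError {
    QueryTooLong,
    PageSizeTooLarge,
}

/// Reject requests that exceed the caps before any index work is done.
pub fn validate_search(q: &str, size: usize) -> Result<(), QueryError> {
    if q.chars().count() > MAX_QUERY_CHARS {
        return Err(QueryError::QueryTooLong);
    }
    if size > MAX_PAGE_SIZE {
        return Err(QueryError::PageSizeTooLarge);
    }
    Ok(())
}

/// Backpressure: offer a job to a bounded queue without blocking.
/// Returns false when the queue is full (or closed), so the caller can
/// shed load or retry later instead of piling up work.
pub fn try_enqueue<T>(queue: &SyncSender<T>, job: T) -> bool {
    queue.try_send(job).is_ok()
}
```

A bounded queue is created with `std::sync::mpsc::sync_channel(capacity)`; once `capacity` jobs are waiting, `try_enqueue` starts returning false until a worker takes one off the queue.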
15) Observability
Metrics (Prometheus):
ingest_lag_seconds, ipfs_fetch_ms, preview_mismatch_count, index_size_bytes, query_latency_ms{ranker}, snapshot_age_seconds{lang}, peers_count, spotcheck_failures.
Logs (structured, JSON): per request id; redact user IP by default.
Tracing (tracing crate): spans for fetch/render/index steps.
16) Testing Matrix
Unit
AOM validation (hashes, image constraints).
Render determinism & sanitizer policy.
Merkle root generation (stable across runs).
Analyzers per language.
Integration
Ingest from mocked RPC; reorg simulation.
IPFS failures & gateway fallback.
Federation: accept on 2‑of‑3 roots; quarantine on mismatch.
API correctness: same query yields same order (stable ranking).
Property
Snapshot idempotence: same corpus → identical root & metaHash.
Random spot check selection stable given seed; fails when content missing.
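The seed-stability property is easiest to test when selection is driven by a small, fully specified generator. The sketch below uses SplitMix64 with a partial Fisher-Yates shuffle to pick K distinct CIDs; the choice of generator, and the names `select_spot_checks` and `spot_check_passes`, are assumptions for illustration. The modulo step introduces a negligible bias for realistic DocSet sizes.

```rust
/// SplitMix64: a tiny deterministic PRNG, used here so that a given seed
/// always produces the same selection on every platform.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Pick up to `k` distinct CIDs from the DocSet, reproducibly for a seed.
pub fn select_spot_checks(doc_set: &[String], k: usize, seed: u64) -> Vec<String> {
    let mut idx: Vec<usize> = (0..doc_set.len()).collect();
    let k = k.min(idx.len());
    let mut rng = SplitMix64::new(seed);
    // Partial Fisher-Yates: only the first k positions need shuffling.
    for i in 0..k {
        let j = i + (rng.next_u64() % (idx.len() - i) as u64) as usize;
        idx.swap(i, j);
    }
    idx[..k].iter().map(|&i| doc_set[i].clone()).collect()
}

/// The check fails as soon as one selected CID cannot be fetched and verified.
pub fn spot_check_passes<F: Fn(&str) -> bool>(selected: &[String], is_valid: F) -> bool {
    selected.iter().all(|cid| is_valid(cid.as_str()))
}
```

In a property test, `select_spot_checks` is called twice with the same seed and the results compared, then `spot_check_passes` is run with a validator that reports one selected CID as missing and is expected to return false.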
Load
500 RPS mixed queries; p95 under targets.
Snapshot build during query load maintains SLOs (priority scheduling).
17) Acceptance Criteria
Node can ingest from empty state to latest with no manual intervention.
For primary languages, node builds snapshots and signs announcements.
For foreign languages, node fetches ≥2 matching peer snapshots, spot‑checks K CIDs, and serves queries.
APIs meet latency and correctness targets; health & metrics exposed.
Snapshot metadata includes paramsHash, renderer/analyzer versions; UIs display them.
18) Future Extensions
Per‑query proofs (inclusion & ranking) with verifiable IR invariants.
ZK search proofs for privacy‑preserving verification.
Delta snapshots (SMT) to shrink bundles.
Peer reputation scoring.
Encrypted transport of fetched bundles with peer attestation (TEE optional).
19) Changelog
v1: Initial CIN Node spec with federation, gRPC/HTTP2/WS transports, Tantivy indexing, Merkle snapshots, and operator guidance.