Emberis / Retrieval, Redrawn
v1.0 · Live Read Whitepaper
A New Substrate for Retrieval / 001 — Emberis

Answers, before you finish asking.

We rebuilt retrieval the way the brain does it. Knowledge that fires in milliseconds, runs on the energy of a candle, and never makes things up — because nothing is generated, only recalled.

Read the Whitepaper No slides — just the system running
5 ms / Query · End-to-End Latency
100× Less Energy · vs Cloud RAG
0 Hallucinations · Retrieval, Not Generation
Figure · Reservoir-01, live spike domain · 3D projection, attractor highlighted
LIF · 500n · 90% Sparse
Cluster · Cardiology · Attractor #042
Latency · 5.2 ms
Energy · 0.011 J
Winner · #042 / 500
We didn't build a faster RAG. We replaced retrieval.

Vector search assembles answers from fragments: embed, retrieve, rerank, prompt, generate, hope. We taught a small spiking network to recognize a question and converge directly on the answer. No tokens. No fabrication. No round trip.

The Case for Biological Retrieval

i.
Two Paradigms

RAG Assembles. Emberis Remembers.

Same problem, two substrates. One is an assembly line; the other is a single instant of recognition. Read across.

Traditional RAG
Substrate · Cloud LLM + Vector DB
embed → retrieve → rerank → prompt → generate → verify

Six round trips, two model calls, one autoregressive sampler that can — and does — invent. Every step is a place for latency, cost, and a fabricated answer to accumulate.

Latency · 200–2000 ms
Energy · ~1.0 J
Cost · $0.01–0.10 / query
On-Device · Unlikely
  • Cloud-bound · regulated data leaves the perimeter
  • Generative step fabricates plausible nonsense
  • Vector DB is a separate piece of infra to operate
Emberis
Substrate · Spiking Neural Network
embed → encode (spikes) → recall (reservoir) → answer

Four stages, no autoregressive decoder, no second model. The query perturbs a recurrent network of leaky neurons; it falls into the attractor closest to the right answer, in milliseconds, on a CPU.

Latency · 2–10 ms
Energy · 0.01 J
Cost · $0.0001 / query
On-Device · Native
  • On-prem by construction · no data exfiltration
  • No generation step — output is an index, not a sentence
  • Compiles to neuromorphic silicon (Loihi 2, Akida)
ii.
The Numbers

Flat on the table.

Reference implementation, single CPU baseline, 300-pair medical Q&A benchmark. CUDA adds another 10–100× headroom on top.

Query Latency (p50) · 5.2 ms · vs 1840 ms cloud RAG, 354× faster
Energy per Query · 0.01 J · vs 1.0 J, 100× less
Cost per Query · $0.0001 · vs $0.01–0.10, 100–1000× cheaper
Hallucination Rate · −72% · structural, no generative sampler
iii.
How It Works

Five stages. One spike train all the way through.

Each layer in the stack is described below: what is actually happening inside that block of the network, and why it costs almost nothing.

01 · Semantic perception · 2.5 ms
02 · Spike encoding · 0.5 ms
03 · Recurrent recall · 1.5 ms
04 · Winner-take-all · 0.7 ms
05 · Answer + confidence · ∑ 5.2 ms (total)
Stage 01 · Semantic Perception

A 384-dimensional fingerprint of meaning.

A small pre-trained sentence transformer (all-MiniLM-L6-v2) compresses the query into a dense vector. This is the only conventional step in the pipeline — and the only step whose cost is amortized away by what follows.

Model · MiniLM-L6
Latency · 2.5 ms
Energy · 2.4 mJ
v ∈ ℝ³⁸⁴ · dense semantic vector
Stage 02 · Spike Encoding

Vectors become spike trains in time.

Embedding magnitudes drive firing rate; salience modulates spike latency; populations carry distributed meaning. The continuous becomes discrete, and time becomes a free axis of computation.

Code · rate, latency, population
Timesteps · 50
Energy · 0.4 mJ
s(t) ∈ {0,1}³⁸⁴ for t = 1…50 · sparse · binary
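
The rate-coding part of this stage can be sketched in a few lines. This is a minimal illustration, not the Emberis encoder: it covers only the rate code (latency and population coding are left out), the random vector stands in for a real sentence embedding, and the function name `rate_encode`, the `max_rate` ceiling, and the choice to discard the sign of each component are all assumptions made for the example.

```python
import random

def rate_encode(vector, timesteps=50, max_rate=0.5, seed=0):
    """Bernoulli rate coding: larger |v_i| means a higher firing probability."""
    rng = random.Random(seed)
    peak = max(abs(x) for x in vector) or 1.0  # guard against an all-zero vector
    # Per-channel firing probability in [0, max_rate] (assumed scaling).
    probs = [max_rate * abs(x) / peak for x in vector]
    # One row per timestep, one binary entry per embedding dimension.
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(timesteps)]

# Stand-in for the 384-dimensional sentence embedding (random, illustration only).
emb_rng = random.Random(42)
embedding = [emb_rng.gauss(0.0, 1.0) for _ in range(384)]

spikes = rate_encode(embedding)
density = sum(map(sum, spikes)) / (50 * 384)  # fraction of entries that spiked
```

The result is a 50 × 384 grid of zeros and ones in which most entries are zero, which is the property the later stages rely on.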
Stage 03 · Recurrent Recall

Recurrence finds the right basin.

500 leaky integrate-and-fire neurons, 90% sparse recurrent connectivity. The query perturbs the network; dynamics pull it toward the closest stored attractor. No backprop through time, no catastrophic forgetting.

Neurons · 500 LIF
Sparsity · 90%
Latency · 1.5 ms
τ = 20 ms · v(t) = v(t − Δt) · e^(−Δt/τ) + Wₛ · s(t)
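
As a rough sketch of the update rule above, the following simulates a 500-neuron leaky integrate-and-fire reservoir with about 10% of recurrent connections present. Only the neuron count, the sparsity, and τ = 20 ms come from the text. The 1 ms timestep, the threshold of 1.0, reset-to-zero after a spike, the Gaussian weights, and the constant `drive` (standing in for the Wₛ · s(t) input term) are assumptions chosen to keep the example small.

```python
import math
import random

N = 500                      # reservoir size, from the text
TAU_MS, DT_MS = 20.0, 1.0    # tau from the text; the 1 ms step is an assumption
THRESHOLD = 1.0              # assumed firing threshold
DECAY = math.exp(-DT_MS / TAU_MS)  # the e^(-dt/tau) leak factor

rng = random.Random(0)
# 90% sparse recurrence: each neuron keeps roughly 10% of possible inputs.
# recurrent[i] lists (presynaptic index, weight) pairs feeding neuron i.
recurrent = [[(j, rng.gauss(0.0, 0.1)) for j in range(N)
              if j != i and rng.random() < 0.10] for i in range(N)]

def step(v, prev_spikes, drive):
    """One LIF update: leak, add recurrent and external input, fire, reset."""
    active = {j for j, s in enumerate(prev_spikes) if s}
    new_v, spikes = [], []
    for i in range(N):
        rec = sum(w for j, w in recurrent[i] if j in active)
        u = v[i] * DECAY + rec + drive[i]
        fired = u >= THRESHOLD
        spikes.append(1 if fired else 0)
        new_v.append(0.0 if fired else u)  # assumed reset-to-zero on spike
    return new_v, spikes

# Constant external drive into the first 50 neurons for 50 timesteps.
drive = [0.3 if i < 50 else 0.0 for i in range(N)]
v, spikes, total_spikes = [0.0] * N, [0] * N, 0
for _ in range(50):
    v, spikes = step(v, spikes, drive)
    total_spikes += sum(spikes)
```

Storing the recurrence as adjacency lists means each step only touches synapses that exist, which is where the cost advantage of sparse connectivity comes from in a sketch like this.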
Stage 04 · Winner-Take-All

The strongest population wins.

Lateral inhibition between output groups turns parallel evidence into a single, calibrated answer. Weak hypotheses are actively suppressed — confidence is a measured property of the dynamics, not a softmax estimate.

Strategy · top-K · K=5
Latency · 0.7 ms
Confidence · calibrated
argmaxₖ ∫ sₖ(t) dt · lateral inhibition
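
The readout rule above amounts to counting spikes per output group and ranking the totals. The sketch below does only that: it does not simulate lateral inhibition, and the score it returns is a plain share of total spikes, not the calibrated confidence the page describes. The function name and the toy group ids are hypothetical.

```python
def winner_take_all(spike_trains, k=5):
    """Rank output groups by total spike count, a discrete form of the integral of s_k(t)."""
    counts = {group: sum(train) for group, train in spike_trains.items()}
    total = sum(counts.values())
    ranked = sorted(counts, key=counts.get, reverse=True)[:k]
    # Share of all spikes as a stand-in score (assumption, not calibrated).
    return [(g, counts[g] / total if total else 0.0) for g in ranked]

# Toy output spike trains for three hypothetical attractor ids.
trains = {
    42: [1, 1, 0, 1, 1, 1],
    7:  [0, 1, 0, 0, 1, 0],
    13: [0, 0, 0, 0, 0, 1],
}
top = winner_take_all(trains)
print(top[0])  # → (42, 0.625)
```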
Stage 05 · Answer + Confidence

A pointer, not a paragraph.

The output is an index into the learned manifold plus a calibrated confidence. Pass it to your existing answer formatter, summarizer, or template — the heavy lifting is already done, and it cost you 5 milliseconds.

Output · top-K + score
Total · 5.2 ms
Energy · 0.01 J
retrieval ≠ generation · the model never speaks
iv.
Watch It Think

A query, resolved in five milliseconds.

Pick a question on the left. Spikes travel from the query source, sweep across the reservoir, and converge on the matching attractor. The answer is read out at the bottom — locked.

Try a Query

Dataset · 300 medical Q&A pairs
Model · reference impl (CPU, f32)
Warm latency · 4.7 – 5.6 ms
Query · "symptoms of a heart attack"
Substrate · spiking · cpu
v.
Why It Doesn't Lie

Hallucinations fade before they fire.

We don't sample. We don't generate tokens. The network is a landscape of attractor basins — uncertain answers don't accumulate enough membrane potential to cross threshold, so they never reach the output stage at all.

001 · Temporal Filtering

Weak signals leak away.

LIF neurons exponentially decay membrane potential between spikes. If incoming evidence isn't consistent across timesteps, the trace dies before threshold.

τ = 20 ms · v(t) = v(t − Δt) · e^(−Δt/τ)
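
A small worked example makes the filtering effect concrete. With τ = 20 ms from the text and an assumed 1 ms timestep, threshold of 1.0, and pulse size of 0.1, evidence that arrives on every step accumulates past threshold, while the same pulse arriving only every fifth step leaks away between arrivals. The helper `first_crossing` is hypothetical and exists only for this illustration.

```python
import math

def first_crossing(inputs, tau_ms=20.0, dt_ms=1.0, threshold=1.0):
    """Return the first timestep at which a leaky trace reaches threshold, else None."""
    decay = math.exp(-dt_ms / tau_ms)
    v = 0.0
    for t, x in enumerate(inputs):
        v = v * decay + x  # leak, then integrate this step's evidence
        if v >= threshold:
            return t
    return None

steps = 200
consistent = [0.1] * steps                                   # evidence on every step
sporadic = [0.1 if t % 5 == 0 else 0.0 for t in range(steps)]  # evidence every 5th step

crossed_consistent = first_crossing(consistent)
crossed_sporadic = first_crossing(sporadic)
```

Under these assumptions the sporadic trace settles at a peak of about 0.45, well under threshold, so `crossed_sporadic` stays `None` however long the input runs.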
002 · Attractor Basins

The network falls toward what it knows.

Trained domains form stable points in state space. Off-distribution inputs land in shallow basins that decay rather than converge — a structural form of "I don't know".

∇H(s) ≈ 0 ⇒ recall
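
The idea of falling into a stored basin can be illustrated with a classic Hopfield network. This is a swapped-in textbook model with ±1 units, not the spiking reservoir described on this page, and the two 8-bit patterns are made up for the example. It shows the general mechanism only: a corrupted query settles into the nearest stored pattern, and the energy H(s) drops as it does.

```python
def train(patterns):
    """Hebbian weights with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def energy(w, s):
    """Hopfield energy H(s) = -1/2 * sum_ij w_ij s_i s_j."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def recall(w, s, sweeps=5):
    """Asynchronous updates in index order until the sweeps run out."""
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            h = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if h >= 0 else -1
    return s

# Two orthogonal toy patterns (made up for this illustration).
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = train([p1, p2])

noisy = [-1] + p1[1:]        # p1 with its first bit flipped
recovered = recall(w, noisy)
```

Here `recovered` equals `p1`, and the energy falls from −1.5 for the noisy query to −3.0 at the stored pattern.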
003 · Winner-Take-All

Only one answer crosses the line.

Output neurons compete laterally. The strongest, most temporally coherent population suppresses the rest — confidence is a measured property, not a softmax estimate.

argmaxₖ ∫ sₖ(t) dt
004 · No Sampler

We retrieve. We do not invent.

There is no autoregressive decoder to fabricate plausible nonsense. The output is a pointer into a learned manifold — with a real, calibrated confidence score attached.

retrieval ≠ generation
vi.
Where It Lands

Latency-bound. Privacy-bound. Power-bound.

Anywhere a millisecond matters, a watt-hour is rationed, or a packet can't leave the perimeter — a brain-inspired retrieval layer is the more honest fit.

Healthcare

Decision support at point of care.

  • Symptom-to-differential in < 10 ms
  • Drug-interaction screening, on-prem
  • HIPAA-compliant by construction
Autonomy

Sub-10 ms perception loops.

  • Object & intent recognition for AVs
  • Robotic control under hard real-time
  • Drone navigation on a 5 W budget
Enterprise

Internal knowledge that answers fast.

  • Million-doc search with no vector DB
  • Customer support copilots
  • Legal & regulated workflows on-prem
Edge / IoT

Local intelligence in milliwatts.

  • Smart sensors with on-device reasoning
  • Wearables with months of battery
  • Industrial PLCs & inspection cams
Finance

Latency that pays for itself.

  • Pre-trade research lookups in single-digit milliseconds
  • Compliance pattern recall on-prem
  • Fraud signals before the request lands
Defense / Public Sector

Air-gapped knowledge retrieval.

  • No cloud dependency, no exfil surface
  • Survives degraded networks
  • Auditable, deterministic latency
vii.
Economics

Move the sliders. Watch the bill collapse.

Annualized savings against a comparable cloud LLM + vector DB pipeline. Conservative assumptions, every line item visible.

Log scale · 1k → 1B
Emberis Cost / Query · $0.0001
Energy Reduction · 100×
Infra Reduction (floor) · 80%
Annualized Savings · vs Cloud RAG
$3.61M / yr

At a million queries per day, your cloud inference line item shrinks to a rounding error — and your edge devices stop needing one at all.

API Savings · $3.61M
Infra Savings · $48k
Energy Savings · 99%
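
The headline figures can be reproduced with simple arithmetic. The sketch below assumes one million queries per day, 365 days per year, and the low end of the quoted cloud cost range ($0.01 per query). Those inputs are inferred because they match the published $3.61M figure; the page does not state the calculator's formula. The infrastructure line is not modeled.

```python
QUERIES_PER_DAY = 1_000_000
CLOUD_COST = 0.01        # $/query, low end of the quoted $0.01–0.10 range
EMBERIS_COST = 0.0001    # $/query
CLOUD_ENERGY_J = 1.0     # joules per query, cloud RAG
EMBERIS_ENERGY_J = 0.01  # joules per query, Emberis

annual_queries = QUERIES_PER_DAY * 365
api_savings = annual_queries * (CLOUD_COST - EMBERIS_COST)
energy_reduction = 1 - EMBERIS_ENERGY_J / CLOUD_ENERGY_J

print(f"${api_savings / 1e6:.2f}M / yr, {energy_reduction:.0%} less energy")
# → $3.61M / yr, 99% less energy
```

At the high end of the cloud range ($0.10 per query) the same formula gives roughly ten times the API savings, so the published number reads as the conservative case.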
viii.
Roadmap

From CPU baseline to silicon synapses.

Neuromorphic hardware (Loihi 2, BrainChip Akida) is finally shipping. The pipeline is built to compile down to it without rewriting your application.

v1.0 · NOW · 2026 Q2

Software Substrate

  • SpikingJelly + PyTorch reference impl
  • CPU & CUDA backends
  • REST + batch APIs, model versioning
  • Medical Q&A reference dataset
v1.5 · 2026 Q4

Multi-Domain

  • Mixed-domain reservoirs
  • Incremental learning API
  • Extended benchmarks (legal, fin, gov)
  • Quantized inference
v2.0 · 2027

Silicon

  • Intel Loihi 2 deployment
  • BrainChip Akida edge target
  • Federated learning across devices
  • Real-time knowledge updates
/ Get Started

The brain solved retrieval hundreds of millions of years ago.

We're bringing that solution to your infrastructure. Five-minute demo, fifteen-minute training, sub-ten-millisecond queries from there on.

$ pip install emberis
Talk to a Researcher