A New Substrate for Retrieval / 001 — Emberis

Answers, before you finish asking.

We rebuilt retrieval the way the brain does it. Knowledge that fires in milliseconds, runs on the energy of a candle, and never makes things up — because nothing is generated, only recalled.

Read the Whitepaper

No slides, just the system running

5 ms / Query · End-to-End Latency
100× Less Energy · vs Cloud RAG
0 Hallucinations · Retrieval, Not Generation

[Live demo · Reservoir-01 · 3D projection of the spike domain: 500 LIF neurons, 90% sparse. Highlighted attractor #042 (cluster: Cardiology) · latency 5.2 ms · energy 0.011 J · winner #042 / 500]

i.

Two Paradigms

RAG Assembles. Emberis Remembers.

Same problem, two substrates. One is an assembly line; the other is a single instant of recognition. Read across.

Traditional RAG

Substrate · Cloud LLM + Vector DB

embed → retrieve → rerank → prompt → generate → verify

Six round trips, two model calls, one autoregressive sampler that can — and does — invent. Every step is a place for latency, cost, and a fabricated answer to accumulate.

Latency · 200–2000 ms
Energy · ~1.0 J
Cost · $0.01–0.10
On-Device · Unlikely

Cloud-bound · regulated data leaves the perimeter
Generative step fabricates plausible nonsense
Vector DB is a separate piece of infra to operate

Emberis

Substrate · Spiking Neural Network

embed → encode (spikes) → recall (reservoir) → answer

Four stages, no autoregressive decoder, no second model. The query perturbs a recurrent network of leaky neurons; it falls into the attractor closest to the right answer, in milliseconds, on a CPU.

Latency · 2–10 ms
Energy · 0.01 J
Cost · $0.0001
On-Device · Native

On-prem by construction · no data exfiltration
No generation step — output is an index, not a sentence
Compiles to neuromorphic silicon (Loihi 2, Akida)

ii.

The Numbers

Flat on the table.

Reference implementation, single CPU baseline, 300-pair medical Q&A benchmark. CUDA adds another 10–100× headroom on top.

Query Latency (p50) · 5.2 ms · vs 1840 ms cloud RAG, 354× faster
Energy per Query · 0.01 J · vs 1.0 J, 100× less
Cost per Query · $0.0001 · vs $0.01–0.10, 100–1000× cheaper
Hallucination Rate · −72% · structural, no generative sampler
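The ratios quoted above follow from the raw figures in this section. A quick arithmetic check, using only the numbers listed here (the variable names are illustrative):

```python
# Figures as quoted in this section.
emberis_latency_ms = 5.2
cloud_latency_ms = 1840.0
emberis_energy_j = 0.01
cloud_energy_j = 1.0
emberis_cost_usd = 0.0001
cloud_cost_usd_low, cloud_cost_usd_high = 0.01, 0.10

# Each ratio is simply cloud figure / Emberis figure.
speedup = cloud_latency_ms / emberis_latency_ms
energy_ratio = cloud_energy_j / emberis_energy_j
cost_ratio_low = cloud_cost_usd_low / emberis_cost_usd
cost_ratio_high = cloud_cost_usd_high / emberis_cost_usd

print(f"latency: {speedup:.0f}x faster")
print(f"energy:  {energy_ratio:.0f}x less")
print(f"cost:    {cost_ratio_low:.0f}x to {cost_ratio_high:.0f}x cheaper")
```

The latency ratio works out to roughly 353.8, which rounds to the quoted 354×.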

iii.

How It Works

Five stages. One spike train all the way through.

Each layer in the stack is described below: what is actually happening inside that block of the network, and why it costs almost nothing.

Semantic perception · 2.5 ms
Spike encoding · 0.5 ms
Recurrent recall · 1.5 ms
Winner-take-all · 0.7 ms
Answer + confidence · ∑ 5.2 ms

Stage 01 · Semantic Perception

A 384-dimensional fingerprint of meaning.

A small pre-trained sentence transformer (all-MiniLM-L6-v2) compresses the query into a dense vector. This is the only conventional step in the pipeline — and the only step whose cost is amortized away by what follows.

Model · MiniLM-L6
Latency · 2.5 ms
Energy · 2.4 mJ

v ∈ ℝ³⁸⁴ · dense semantic vector

Stage 02 · Spike Encoding

Vectors become spike trains in time.

Embedding magnitudes drive firing rate; salience modulates spike latency; populations carry distributed meaning. The continuous becomes discrete, and time becomes a free axis of computation.

Code · rate, latency, population
Timesteps · 50
Energy · 0.4 mJ

s(t) ∈ {0,1}³⁸⁴ for t = 1 … 50 · sparse · binary
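A minimal sketch of the rate-coding part of this stage, assuming a simple Bernoulli spike generator. The function name `rate_encode`, the `max_rate` cap, and the random stand-in vector are illustrative assumptions, not the Emberis encoder. Latency and population coding are not shown, and the sign of each component is discarded for simplicity.

```python
import random

def rate_encode(vector, timesteps=50, max_rate=0.5, seed=0):
    """Bernoulli rate coding: larger |x| gives a higher per-step firing probability.

    Returns a list of `timesteps` rows, each a list of 0/1 spikes (one per dimension).
    """
    rng = random.Random(seed)
    peak = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero
    probs = [max_rate * abs(x) / peak for x in vector]
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(timesteps)]

# Stand-in for a real 384-dimensional sentence embedding (assumption).
rng = random.Random(1)
embedding = [rng.uniform(-1.0, 1.0) for _ in range(384)]
embedding[0] = 0.0  # a zero component should never fire

spikes = rate_encode(embedding)
density = sum(map(sum, spikes)) / (len(spikes) * len(spikes[0]))
print(f"{len(spikes)} timesteps x {len(spikes[0])} channels, spike density {density:.2f}")
```

The result is a binary array of 50 timesteps by 384 channels in which stronger components spike more often and silent ones not at all.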

Stage 03 · Recurrent Recall

Recurrence finds the right basin.

500 leaky integrate-and-fire neurons, 90% sparse recurrent connectivity. The query perturbs the network; dynamics pull it toward the closest stored attractor. No backprop through time, no catastrophic forgetting.

Neurons · 500 LIF
Sparsity · 90%
Latency · 1.5 ms

τ = 20 ms · v(t + Δt) = v(t) · e^(−Δt/τ) + Wₛ · s(t)
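The leaky update above can be sketched in plain Python. This is an illustration of the dynamics only: the weights are random and untrained, so no attractors are stored, and the weight scales, the 1 ms step, the threshold of 1.0, and the reset-to-zero rule are all assumptions rather than details of the reference implementation.

```python
import math
import random

def make_sparse_weights(n_pre, n_post, density, scale, rng):
    """For each presynaptic unit, a list of (post_index, weight) pairs."""
    return [
        [(j, rng.gauss(0.0, scale)) for j in range(n_post) if rng.random() < density]
        for _ in range(n_pre)
    ]

def run_reservoir(input_spikes, n_neurons=500, sparsity=0.9, tau_ms=20.0,
                  dt_ms=1.0, threshold=1.0, seed=0):
    """Drive a sparse LIF reservoir with binary input rows; return spike counts."""
    rng = random.Random(seed)
    density = 1.0 - sparsity  # 90% sparse means 10% of connections exist
    w_in = make_sparse_weights(len(input_spikes[0]), n_neurons, density, 0.5, rng)
    w_rec = make_sparse_weights(n_neurons, n_neurons, density, 0.1, rng)
    decay = math.exp(-dt_ms / tau_ms)
    v = [0.0] * n_neurons
    counts = [0] * n_neurons
    fired = []  # neurons that spiked on the previous step
    for s_t in input_spikes:
        v = [x * decay for x in v]           # leak: v * e^(-dt/tau)
        for i, bit in enumerate(s_t):        # input drive: W_s * s(t)
            if bit:
                for j, w in w_in[i]:
                    v[j] += w
        for i in fired:                      # recurrent drive from last step
            for j, w in w_rec[i]:
                v[j] += w
        fired = [j for j, x in enumerate(v) if x >= threshold]
        for j in fired:
            v[j] = 0.0                       # reset after a spike
            counts[j] += 1
    return counts

rng = random.Random(42)
demo_input = [[1 if rng.random() < 0.2 else 0 for _ in range(384)] for _ in range(50)]
counts = run_reservoir(demo_input)
print(sum(1 for c in counts if c > 0), "of", len(counts), "neurons fired")
```

Because each update touches only the connections of units that actually spiked, the work per step scales with activity rather than with the full weight matrix.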

Stage 04 · Winner-Take-All

The strongest population wins.

Lateral inhibition between output groups turns parallel evidence into a single, calibrated answer. Weak hypotheses are actively suppressed — confidence is a measured property of the dynamics, not a softmax estimate.

Strategy · top-K, K=5
Latency · 0.7 ms
Confidence · calibrated

argmax_k ∫ s_k(t) dt · lateral inhibition
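For discrete spikes, the integral in the readout rule reduces to a spike count per output group. The sketch below ranks groups by count and returns the top K. It does not simulate lateral inhibition itself, only its end result, and the confidence shown (the winner's share of all output spikes) is a hypothetical stand-in for whatever calibration Emberis actually uses.

```python
def winner_take_all(spike_counts, k=5):
    """Rank output groups by total spikes; return (top-k indices, winner's share)."""
    total = sum(spike_counts)
    if total == 0:
        return [], 0.0  # nothing crossed threshold: no answer
    ranked = sorted(range(len(spike_counts)),
                    key=lambda g: spike_counts[g], reverse=True)
    top_k = ranked[:k]
    confidence = spike_counts[top_k[0]] / total  # assumption: share of spikes
    return top_k, confidence

# Illustrative spike counts for eight output groups over one query window.
group_counts = [3, 1, 12, 0, 4, 2, 0, 1]
top_k, confidence = winner_take_all(group_counts)
print(top_k, round(confidence, 3))
```

With these counts, group 2 wins with 12 of the 23 output spikes.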

Stage 05 · Answer + Confidence

A pointer, not a paragraph.

The output is an index into the learned manifold plus a calibrated confidence. Pass it to your existing answer formatter, summarizer, or template — the heavy lifting is already done, and it cost you 5 milliseconds.

Output · top-K + score
Total · 5.2 ms
Energy · 0.01 J

retrieval ≠ generation · the model never speaks
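Since the output is an index plus a score, the calling application only needs a lookup and a cutoff. A minimal sketch, assuming a hypothetical `resolve` helper, a dictionary of stored entries keyed by attractor index, and an arbitrary 0.5 confidence floor (none of these come from the Emberis API):

```python
def resolve(top_k, confidence, answers, min_confidence=0.5):
    """Return the stored entry for the winning index, or None when unsure."""
    if not top_k or confidence < min_confidence:
        return None  # the caller decides how to say "I don't know"
    return answers[top_k[0]]

# Placeholder store: index -> pre-written answer text.
answers = {42: "stored answer #042", 7: "stored answer #007"}

confident = resolve([42, 7], 0.91, answers)
uncertain = resolve([42, 7], 0.20, answers)
print(confident, uncertain)
```

Whatever is returned was written ahead of time, so formatting or templating can happen downstream without a generative step.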

iv.

Watch It Think

A query, resolved in five milliseconds.

Pick a question on the left. Spikes travel from the query source, sweep across the reservoir, and converge on the matching attractor. The answer is read out at the bottom — locked.

Try a Query

Dataset · 300 medical Q&A pairs

Model · reference impl (CPU, f32)

Warm latency · 4.7 – 5.6 ms

Query · "symptoms of a heart attack"

Substrate · spiking · cpu

v.

Why It Doesn't Lie

Hallucinations fade before they fire.

We don't sample. We don't generate tokens. The network is a landscape of attractor basins — uncertain answers don't accumulate enough membrane potential to cross threshold, so they never reach the output stage at all.

001 · Temporal Filtering

Weak signals leak away.

LIF neurons exponentially decay membrane potential between spikes. If incoming evidence isn't consistent across timesteps, the trace dies before threshold.

τ = 20 ms · v(t + Δt) = v(t) · e^(−Δt/τ)
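A small worked example of this filtering effect, using τ = 20 ms from the text. The input weight of 0.2, the threshold of 1.0, and the 1 ms step are assumptions chosen for illustration, and spike reset is left out so the peak potential is visible.

```python
import math

TAU_MS = 20.0
THRESHOLD = 1.0

def peak_potential(input_times_ms, weight=0.2, duration_ms=100):
    """Peak membrane potential of one leaky neuron, stepping 1 ms at a time."""
    decay = math.exp(-1.0 / TAU_MS)  # leak per 1 ms step
    inputs = set(input_times_ms)
    v = peak = 0.0
    for t in range(duration_ms):
        v *= decay
        if t in inputs:
            v += weight
        peak = max(peak, v)
    return peak

consistent = peak_potential(range(0, 100))      # evidence on every step
sporadic = peak_potential(range(0, 100, 10))    # evidence once every 10 ms

print(f"consistent input peaks at {consistent:.2f} (crosses: {consistent >= THRESHOLD})")
print(f"sporadic input peaks at   {sporadic:.2f} (crosses: {sporadic >= THRESHOLD})")
```

The same input strength either crosses threshold or never gets close, depending only on how consistently it arrives.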

002 · Attractor Basins

The network falls toward what it knows.

Trained domains form stable points in state space. Off-distribution inputs land in shallow basins that decay rather than converge — a structural form of "I don't know".

∇H(s) ≈ 0 ⇒ recall
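Attractor recall can be illustrated with a classical Hopfield network, used here as a stand-in for the spiking reservoir rather than a description of it. Two orthogonal patterns are stored with a Hebbian rule, and a corrupted copy of the first is pulled back to the original. The pattern values and helper names are illustrative, and this toy does not model the shallow-basin "I don't know" behaviour.

```python
def train_hebbian(patterns):
    """Hebbian weights with zero diagonal for patterns of +1/-1 values."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, max_steps=10):
    """Synchronous updates until the state stops changing (a fixed point)."""
    for _ in range(max_steps):
        new_state = [
            1 if sum(wij * s for wij, s in zip(row, state)) >= 0 else -1
            for row in w
        ]
        if new_state == state:
            break
        state = new_state
    return state

stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
w = train_hebbian(stored)

noisy = list(stored[0])
noisy[2] = -noisy[2]  # corrupt one element of the first pattern

recalled = recall(w, noisy)
print(recalled == stored[0])
```

Each stored pattern is a fixed point of the update, which is the sense in which the state "falls toward what it knows".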

003 · Winner-Take-All

Only one answer crosses the line.

Output neurons compete laterally. The strongest, most temporally coherent population suppresses the rest — confidence is a measured property, not a softmax estimate.

argmax_k ∫ s_k(t) dt

004 · No Sampler

We retrieve. We do not invent.

There is no autoregressive decoder to fabricate plausible nonsense. The output is a pointer into a learned manifold — with a real, calibrated confidence score attached.

retrieval ≠ generation

vi.

Where It Lands

Latency-bound. Privacy-bound. Power-bound.

Anywhere a millisecond matters, a watt-hour is rationed, or a packet can't leave the perimeter — a brain-inspired retrieval layer is the more honest fit.

Healthcare

Decision support at point of care.

Symptom-to-differential in < 10 ms
Drug-interaction screening, on-prem
HIPAA-compliant by construction

Autonomy

Sub-10 ms perception loops.

Object & intent recognition for AVs
Robotic control under hard real-time
Drone navigation on a 5 W budget

Enterprise

Internal knowledge that answers fast.

Million-doc search with no vector DB
Customer support copilots
Legal & regulated workflows on-prem

Edge / IoT

Local intelligence in milliwatts.

Smart sensors with on-device reasoning
Wearables with months of battery
Industrial PLCs & inspection cams

Finance

Latency that pays for itself.

Pre-trade research in µs–ms
Compliance pattern recall on-prem
Fraud signals before the request lands

Defense / Public Sector

Air-gapped knowledge retrieval.

No cloud dependency, no exfil surface
Survives degraded networks
Auditable, deterministic latency

vii.

Economics

Move the sliders. Watch the bill collapse.

Annualized savings against a comparable cloud LLM + vector DB pipeline. Conservative assumptions, every line item visible.

Queries / Day · 1,000,000 (log scale, 1k → 1B)
Cloud Cost / Query · $0.010
Infra ($ / Month) · $5,000
Emberis Cost / Query · $0.0001
Energy Reduction · 100×
Infra Reduction (floor) · 80%

Annualized Savings · vs Cloud RAG

$3.61M / yr

At a million queries per day, your cloud inference line item shrinks to a rounding error — and your edge devices stop needing one at all.

API Savings · $3.61M
Infra Savings · $48k
Energy · 99%
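The line items above can be reproduced from the slider values. The page does not publish its formula, so the one below is an assumption: it happens to match the displayed API, infra, and energy figures.

```python
def annual_savings(queries_per_day, cloud_cost, emberis_cost,
                   infra_per_month, infra_reduction):
    """Return (API savings, infra savings) in dollars per year."""
    api = queries_per_day * 365 * (cloud_cost - emberis_cost)
    infra = infra_per_month * 12 * infra_reduction
    return api, infra

api, infra = annual_savings(
    queries_per_day=1_000_000,
    cloud_cost=0.010,
    emberis_cost=0.0001,
    infra_per_month=5_000,
    infra_reduction=0.80,
)
energy_saved = 1 - 1 / 100  # a 100x reduction leaves 1% of the original energy

print(f"API savings:   ${api:,.0f} / yr")
print(f"Infra savings: ${infra:,.0f} / yr")
print(f"Energy saved:  {energy_saved:.0%}")
```

Under this formula the API line comes to about $3.61M per year and the infra line to $48k, with 99% of the energy saved.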

viii.

Roadmap

From CPU baseline to silicon synapses.

Neuromorphic hardware (Loihi 2, BrainChip Akida) is finally shipping. The pipeline is built to compile down to it without rewriting your application.

v1.0 · NOW · 2026 Q2

Software Substrate

SpikingJelly + PyTorch reference impl
CPU & CUDA backends
REST + batch APIs, model versioning
Medical Q&A reference dataset

v1.5 · 2026 Q4

Multi-Domain

Mixed-domain reservoirs
Incremental learning API
Extended benchmarks (legal, fin, gov)
Quantized inference

v2.0 · 2027

Silicon

Intel Loihi 2 deployment
BrainChip Akida edge target
Federated learning across devices
Real-time knowledge updates

The brain solved retrieval hundreds of millions of years ago.
