Emberis · Whitepaper · Retrieval, Redrawn

Abstract

A spiking network that recalls, instead of a transformer that guesses.

We replace the retrieve-rerank-generate pipeline used by modern large-language-model applications with a single, brain-inspired spiking neural network.

Queries are projected into a 384-dimensional embedding space, encoded as sparse spike trains, and presented to a recurrent network of 500 leaky integrate-and-fire (LIF) neurons. Trained domains form stable attractor basins in the network's state space; a query perturbs the system, recurrent dynamics pull it toward the closest attractor, and a winner-take-all readout returns a calibrated answer in approximately 5 milliseconds on a single CPU core.

Across our 300-pair medical question-answering benchmark, the system achieves a 354× reduction in p50 latency, a roughly 100× reduction in per-query energy (91× as measured), and a 72% reduction in observed hallucination rate compared to a cloud retrieval-augmented generation pipeline. The architecture is designed to compile unchanged onto neuromorphic silicon (Intel Loihi 2, BrainChip Akida), where we project an additional 10–100× energy headroom.

Latency (p50): 5.2 ms · vs 1840 ms · 354×
Energy / query: 0.01 J · vs 1.0 J · 100×
Cost / query: $0.0001 · vs $0.01–0.10
Hallucinations: −72% · structural

This whitepaper describes the architecture in full, our training procedure, the empirical evidence behind the figures above, and a roadmap to deployment on commercially available neuromorphic hardware. All code referenced is open under Apache 2.0; the medical Q&A benchmark is open under CC-BY.

ii.

The Problem

RAG is an assembly line, and the line is breaking.

Vector search plus a generative model plus a reranker plus a prompt is a small pipeline of small failures, every step a place for latency, cost, and a fabricated answer to accumulate.

Retrieval-augmented generation in its standard form is six discrete operations strung together: (1) embed the query, (2) retrieve top-K documents from a vector database, (3) optionally rerank them with a cross-encoder, (4) assemble a prompt, (5) call a large language model to generate an answer, and (6) optionally verify or critique the answer with a second model call. Each step has its own latency tax, cost line item, and failure surface.

The four budgets RAG breaks

Latency. A vector DB round-trip plus a generative call typically costs 200–2000 ms p50. Real-time agents, autonomous vehicles, trading desks, and clinical decision support cannot wait.
Energy. A modern transformer inference call dissipates on the order of 1 J per query at the GPU. Across hyperscale workloads this adds up to a measurable share of data-center electricity demand.
Privacy. Cloud dependency means clinical, legal, financial, and defense documents leave the perimeter. The compliance answer is no.
Hallucination. The terminal generative step is an autoregressive sampler. It will produce plausible nonsense whenever its internal evidence is thin. Better retrieval reduces the rate; it does not eliminate the failure mode.

The substrate itself is the problem. You cannot iterate your way out of an architecture that fabricates by design.

What we tried first

Before designing the Emberis substrate, we tried the obvious things. We tuned vector indices, swapped rerankers, cached embeddings, distilled the language model down to 7 billion parameters, and quantized aggressively. We got the median latency on a single workload from 1840 ms to 210 ms. The bill still grew with traffic, the energy per query stayed within a factor of two, and the hallucination rate moved only marginally. None of these fixes touches the substrate, so none of them scales.

iii.

Biology

The brain solved this hundreds of millions of years ago.

A human associatively recalls a memory in roughly 200 ms while consuming under 20 watts. Information is not retrieved from a database — it is reconstructed by spike traffic across a sparse, recurrent network.

Three properties of biological cortex are particularly relevant to retrieval, and absent from the standard RAG pipeline:

Spikes, not numbers

Cortical neurons communicate via discrete, asynchronous events called action potentials. Information is carried in spike timing, firing rate, and population code — not in continuous-valued activations sampled every layer. The network is event-driven: a neuron only consumes energy when it fires, and most neurons most of the time do not. This is the original sparse computation.

Attractor dynamics

Learned patterns occupy stable points in the high-dimensional state space of a recurrent network. Once trained, the network behaves like a landscape with valleys: any input that lands inside a basin rolls down toward the attractor at its bottom. Recall is a physical fall, not a search.
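The basin picture can be illustrated with a classical Hopfield network, which is a deliberately simpler stand-in and not the Emberis spiking reservoir. The sketch below stores two patterns with a Hebbian rule and shows a corrupted input settling back onto the stored pattern it is closest to. The pattern values, network size, and function names are illustrative assumptions.

```python
def train_hopfield(patterns):
    """Hebbian outer-product weights for a list of +1/-1 patterns (no self-connections)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def recall(w, state, max_sweeps=10):
    """Update units one at a time until nothing changes, i.e. a fixed point is reached."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            drive = sum(w[i][j] * state[j] for j in range(n))
            new_value = 1 if drive >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two orthogonal patterns act as the "valleys" of the landscape.
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([pattern_a, pattern_b])

# Corrupt one element of pattern_a; the dynamics pull the state back into its basin.
noisy_a = list(pattern_a)
noisy_a[0] = -1
recovered = recall(weights, noisy_a)
print(recovered == pattern_a)
```

No search over stored items happens here: the corrupted state is repaired purely by repeated local updates, which is the sense in which recall is a fall rather than a lookup.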

Temporal integration

A leaky neuron integrates evidence over time, decaying any partial accumulation that is not refreshed by new input. Weak or inconsistent signals leak away before they can produce a spike. This is the structural source of the brain's selectivity — and, as we will show, of Emberis's resistance to hallucination.
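A single leaky unit is enough to see this filtering effect. The sketch below is a minimal illustration, assuming a 1 ms timestep and an input amplitude of 0.3 (both chosen for the example); the 20 ms time constant and the threshold of 1.0 match the values given later for the reservoir.

```python
import math


def count_spikes(inputs, tau_ms=20.0, dt_ms=1.0, v_th=1.0):
    """Integrate inputs with exponential leak; count threshold crossings."""
    decay = math.exp(-dt_ms / tau_ms)
    v = 0.0
    spikes = 0
    for x in inputs:
        v = v * decay + x      # leak, then add new evidence
        if v >= v_th:
            spikes += 1
            v = 0.0            # reset after a spike
    return spikes


# Consistent evidence: the same small input on every timestep.
steady = [0.3] * 50
# Inconsistent evidence: the same amplitude, but only every tenth timestep.
sporadic = [0.3 if t % 10 == 0 else 0.0 for t in range(50)]

print(count_spikes(steady), count_spikes(sporadic))
```

With these assumed values the steady input drives the unit over threshold repeatedly, while the sporadic input of identical amplitude leaks away between arrivals and never produces a spike.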

If you can encode meaning as spikes, recurrence does the rest.

iv.

Architecture

Five stages. One spike train all the way through.

The substrate is small. The pipeline below is the entire production system — no vector database, no second model, no reranker.

Semantic Perception

A pre-trained sentence transformer (all-MiniLM-L6-v2) maps the query to a 384-dimensional dense embedding. This is the one conventional step; we keep it because the literature on sentence embeddings is mature and the cost is amortized away by what follows.

2.5 ms

Spike Encoding

The 384-dim embedding is converted to a sparse spike train across 50 timesteps. We use a combination of rate coding (embedding magnitude → firing rate), latency coding (most-informative dimensions fire first), and population coding (each dimension is represented by a small bank of neurons). The output is a binary tensor s ∈ {0,1}^(384×50) with ~5% density.

0.5 ms
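As a rough illustration of the rate-coding component only, the sketch below turns embedding magnitudes into per-timestep firing probabilities, scaled so the average density lands near 5%. Latency and population coding are omitted, the sign of each dimension is discarded, and the random vector is a stand-in for a real sentence embedding; the function name and scaling rule are assumptions, not the reference encoder.

```python
import random


def rate_encode(embedding, timesteps=50, target_density=0.05, seed=0):
    """Bernoulli rate coding: larger |x_i| means a higher firing probability."""
    rng = random.Random(seed)
    magnitudes = [abs(x) for x in embedding]
    mean_mag = sum(magnitudes) / len(magnitudes)
    if mean_mag == 0.0:
        return [[0] * timesteps for _ in embedding]
    # Scale so the mean firing probability equals target_density (clipped at 1).
    probs = [min(1.0, target_density * m / mean_mag) for m in magnitudes]
    return [[1 if rng.random() < p else 0 for _ in range(timesteps)] for p in probs]


# Stand-in for a 384-dimensional sentence embedding (assumption for the example).
source = random.Random(42)
embedding = [source.gauss(0.0, 1.0) for _ in range(384)]

spikes = rate_encode(embedding)
density = sum(sum(row) for row in spikes) / (len(spikes) * len(spikes[0]))
print(len(spikes), len(spikes[0]), round(density, 3))
```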

Recurrent Recall

500 leaky integrate-and-fire neurons with 90%-sparse recurrent connectivity form the reservoir. Training has shaped the connectivity into a set of stable attractor basins; the spike train perturbs the network, and the recurrent dynamics pull it toward the nearest attractor. The dynamics are governed by the LIF update rule below.

1.5 ms

Winner-Take-All Decoding

A bank of K=5 output populations competes via lateral inhibition. The population that accumulates the highest temporally-integrated firing rate wins; the others are suppressed. The integrated firing rate of the winner relative to its closest rival is reported as a calibrated confidence score.

0.7 ms
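The readout can be sketched as follows. Lateral inhibition is not simulated here; a plain argmax over integrated rates stands in for it. The confidence formula, a margin between the winner and its closest rival normalised by their sum, is one plausible reading of "relative to its closest rival" and is an assumption, as are the spike counts in the example.

```python
def winner_take_all(spike_counts, timesteps=50):
    """spike_counts[k] holds per-neuron spike counts for output population k."""
    rates = [sum(pop) / (len(pop) * timesteps) for pop in spike_counts]
    order = sorted(range(len(rates)), key=lambda k: rates[k], reverse=True)
    winner, rival = order[0], order[1]
    total = rates[winner] + rates[rival]
    # Normalised margin in [0, 1]; 0 when nothing fired or the top two tie.
    confidence = (rates[winner] - rates[rival]) / total if total > 0 else 0.0
    return winner, confidence


# K = 5 populations of four neurons each (illustrative counts over 50 timesteps).
counts = [
    [2, 1, 0, 1],
    [10, 12, 9, 9],
    [3, 2, 2, 1],
    [0, 0, 1, 0],
    [5, 4, 6, 5],
]
winner, confidence = winner_take_all(counts)
print(winner, round(confidence, 3))
```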

Answer + Confidence

The output is a pointer into a learned knowledge manifold (an attractor index plus a similarity score), not a generated sentence. Downstream applications use the index to look up a canonical answer, route to a human, or template a response — but the retrieval itself is finished.
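A downstream consumer of that pointer might look like the sketch below. The lookup table, the 0.5 confidence threshold, and the returned dictionary shape are all hypothetical; the point is only that the application decides what to do with an index and a score, and that nothing is generated.

```python
# Hypothetical lookup table: attractor index -> canonical answer text.
CANONICAL_ANSWERS = {
    0: "<canonical answer for attractor 0>",
    1: "<canonical answer for attractor 1>",
}


def resolve(attractor_index, confidence, min_confidence=0.5):
    """Return a canonical answer, or hand off when recall is weak or unknown."""
    if confidence < min_confidence or attractor_index not in CANONICAL_ANSWERS:
        return {
            "action": "route_to_human",
            "attractor": attractor_index,
            "confidence": confidence,
        }
    return {
        "action": "answer",
        "text": CANONICAL_ANSWERS[attractor_index],
        "confidence": confidence,
    }


confident = resolve(1, 0.82)
uncertain = resolve(1, 0.12)
print(confident["action"], uncertain["action"])
```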

∑ 5.2 ms
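The total is simply the sum of the four timed stages; the final stage has no separate latency figure. A short check, using the stage figures as stated above (the dictionary keys are illustrative names):

```python
# Per-stage latencies in milliseconds, as listed for each stage.
stage_ms = {
    "semantic_perception": 2.5,
    "spike_encoding": 0.5,
    "recurrent_recall": 1.5,
    "wta_decoding": 0.7,
}

total_ms = sum(stage_ms.values())
share = {name: ms / total_ms for name, ms in stage_ms.items()}

print(round(total_ms, 1))
for name, fraction in share.items():
    print(f"{name}: {fraction:.0%}")
```

The embedding stage accounts for a little under half of the budget, which is why it is the natural target for further optimisation.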

The LIF update rule

Each neuron in the reservoir integrates synaptic input and decays passively. With membrane time constant τ = 20 ms, threshold v_th = 1.0, and synaptic weights W, the discrete-time update is:

v_i(t+1) = v_i(t) · e^(−Δt/τ) + Σ_j W_ij · s_j(t)

The first term is the leaky integration; the second is the sparse synaptic drive.

A spike s_i(t) = 1 is emitted whenever v_i(t) crosses v_th; the neuron then enters a brief refractory period and its potential resets to zero. The combination of leak, threshold, and refractoriness is what gives the network its temporal-filtering behaviour.
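A minimal sketch of one update step, assuming a 1 ms timestep, a two-step refractory period, and a dictionary-per-neuron sparse weight layout (none of which are specified above). It is written in plain Python for readability rather than with SpikingJelly, and the three-neuron chain is a toy example.

```python
import math


def lif_step(v, refractory, spikes_in, weights, decay, v_th=1.0, refractory_steps=2):
    """Apply one LIF update in place and return the emitted spikes.

    weights[i] maps presynaptic index j to W_ij; absent keys mean no synapse.
    """
    spikes_out = [0] * len(v)
    for i in range(len(v)):
        if refractory[i] > 0:
            refractory[i] -= 1      # still refractory: hold at reset potential
            v[i] = 0.0
            continue
        drive = sum(w for j, w in weights[i].items() if spikes_in[j])
        v[i] = v[i] * decay + drive
        if v[i] >= v_th:
            spikes_out[i] = 1
            v[i] = 0.0
            refractory[i] = refractory_steps
    return spikes_out


decay = math.exp(-1.0 / 20.0)       # dt = 1 ms (assumed), tau = 20 ms
# Toy chain: neuron 0 drives neuron 1 strongly, neuron 1 drives neuron 2 weakly.
weights = [{}, {0: 1.2}, {1: 0.6}]
v = [0.0, 0.0, 0.0]
refractory = [0, 0, 0]

first = lif_step(v, refractory, [1, 0, 0], weights, decay)
second = lif_step(v, refractory, first, weights, decay)
print(first, second, v)
```

In the second step neuron 2 receives a sub-threshold input of 0.6 and stays silent; unless further input arrives, that potential decays on subsequent steps.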

Why 500 neurons

We tested reservoirs of 100, 250, 500, 1000, and 2500 neurons on the medical Q&A benchmark. Recall accuracy plateaus around 500 for our problem size; latency scales linearly, and energy approximately quadratically with neuron count under our connectivity assumptions. 500 was the elbow of the Pareto frontier. Larger reservoirs are appropriate for larger domains; the architecture is otherwise unchanged.
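One way to read the quadratic energy term: at a fixed 90% sparsity the number of recurrent synapses grows with the square of the neuron count. The sketch below tabulates that count for the tested sizes. Treating energy as proportional to synapse count is an assumption made for illustration, not a measured relationship.

```python
sizes = [100, 250, 500, 1000, 2500]

# 90% sparse means 1 in 10 of the N x N possible connections exists.
synapses = {n: (n * n) // 10 for n in sizes}
relative_to_500 = {n: synapses[n] / synapses[500] for n in sizes}

for n in sizes:
    print(n, synapses[n], relative_to_500[n])
```

Under this assumption, doubling the reservoir from 500 to 1000 neurons quadruples the synapse count.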

Embedding model: all-MiniLM-L6-v2 · 384 dim
Encoding: rate + latency + population · 50 timesteps
Reservoir: 500 LIF neurons · 90% sparse recurrent
Membrane τ: 20 ms
Readout: K=5 winner-take-all populations
Reference impl: SpikingJelly 0.0.0.0.14 + PyTorch 2.3
Total latency: 5.2 ms · single CPU core

v.

Training

Two days on a single GPU. 200 epochs over 300 question-answer pairs.

Training is a procedure for shaping attractor basins. We do it with surrogate-gradient backpropagation through time and a small, custom loss that rewards convergence speed in addition to accuracy.

Dataset

The reference dataset is 300 medical question-answer pairs spanning cardiology, endocrinology, pharmacology, intensive care, and triage. We chose this corpus because (a) it is small enough to make the substrate's behaviour visible, (b) it is high-stakes enough that hallucination matters, and (c) it has an unambiguous correctness criterion.

Loss

We compose three terms: a standard cross-entropy on the winner-take-all readout, a latency-to-convergence penalty (we want the right attractor to dominate within 25 timesteps, not 50), and an L2 regularizer on the synaptic weights.

ℒ = ℒ_CE(ŷ, y) + λ₁ · t_conv + λ₂ · ‖W‖²

with λ₁ = 0.02 and λ₂ = 1e-4.
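The three terms can be evaluated by hand as in the sketch below. It is plain Python with no autograd, so it only shows how the terms combine. Treating the readout's integrated firing rates as logits and measuring t_conv in timesteps are assumptions; the example inputs are arbitrary.

```python
import math


def composite_loss(logits, target, t_conv, weights, lam1=0.02, lam2=1e-4):
    """Cross-entropy + convergence-time penalty + L2 on synaptic weights."""
    # Numerically stable log-sum-exp for the cross-entropy term.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    cross_entropy = log_z - logits[target]
    l2 = sum(w * w for row in weights for w in row)
    return cross_entropy + lam1 * t_conv + lam2 * l2


# Two equally active readout populations, convergence at timestep 25,
# and a tiny 1x2 weight matrix (all illustrative).
loss = composite_loss(logits=[0.0, 0.0], target=0, t_conv=25, weights=[[1.0, 2.0]])
print(round(loss, 6))
```

With these inputs the cross-entropy contributes ln 2, the convergence penalty 0.02 × 25 = 0.5, and the regularizer 1e-4 × 5 = 0.0005.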

Optimizer & schedule

Surrogate gradient descent through 50 simulated timesteps, using a fast-sigmoid surrogate for the spike function (slope = 25). AdamW, learning rate 3e-4, cosine schedule, batch size 32, 200 epochs. Wall-clock on a single A100: 47 hours.
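The two ingredients named above can be sketched in isolation. The derivative shown is one common form of the fast-sigmoid surrogate, 1 / (1 + slope · |x|)², and the schedule is a plain cosine decay to zero with no warmup. Both forms are assumptions about details not spelled out here, and the function names are illustrative.

```python
import math


def spike_forward(v_minus_th):
    """Forward pass: a hard threshold (non-differentiable)."""
    return 1.0 if v_minus_th >= 0 else 0.0


def fast_sigmoid_grad(v_minus_th, slope=25.0):
    """Backward pass: smooth stand-in for the threshold's derivative."""
    return 1.0 / (1.0 + slope * abs(v_minus_th)) ** 2


def cosine_lr(epoch, total_epochs=200, base_lr=3e-4):
    """Cosine decay from base_lr at epoch 0 to zero at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


for distance in (0.0, 0.04, 0.4):
    print(distance, spike_forward(distance), round(fast_sigmoid_grad(distance), 4))
for epoch in (0, 100, 200):
    print(epoch, cosine_lr(epoch))
```

The surrogate is largest exactly at threshold and falls off quickly on either side, so gradient flows mainly through neurons that were close to firing.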

What we are not doing

No autoregressive decoder, anywhere in the pipeline.
No vector database — knowledge is encoded in the recurrent weights.
No reinforcement learning from human feedback; nothing to tune the model toward fluency at the expense of fidelity.
No catastrophic-forgetting risk under our incremental-learning extension (see §viii).

vi.

Fidelity

Hallucinations fade before they fire.

We do not sample. We do not generate tokens. The network is a landscape of attractor basins; uncertain inputs land in shallow valleys that decay rather than converge.

Four mechanisms compose to give the system structural, not statistical, robustness against fabrication:

1 · Temporal filtering

LIF neurons leak membrane potential exponentially between spikes. Evidence that is not consistently reinforced across timesteps dies before it can trigger a downstream spike. There is no equivalent in a softmax classifier.

2 · Shallow basins

Off-distribution inputs land in shallow regions of the energy landscape that do not converge. The reservoir reports a low integrated firing rate across all output populations — a measured, calibrated "I don't know".

3 · Lateral inhibition

The winner-take-all readout actively suppresses weaker hypotheses. The confidence score reflects the dynamics of competition, not the probability mass under a softmax — and is therefore well-calibrated by construction.

4 · No sampler

There is no autoregressive decoder to fabricate plausible nonsense. The system's outputs are pointers into a learned manifold. It cannot say something it has not been taught — only fail to recall.

On the 300-pair medical benchmark, we measured a hallucination rate of 2.3% for the Emberis substrate against 8.1% for a comparable cloud RAG configuration. The difference grows under adversarial prompting: a deliberately misleading question increases the cloud RAG rate to 31% while Emberis remains under 4%.
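The relative reductions follow directly from these rates. A short check, using 3.9% for the adversarial Emberis rate as reported in the results section (the text above only bounds it at under 4%):

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


standard = relative_reduction(8.1, 2.3)      # in-distribution benchmark
adversarial = relative_reduction(31.0, 3.9)  # deliberately misleading questions

print(f"{standard:.1%}", f"{adversarial:.1%}")
```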

vii.

Results

The numbers, flat on the table.

All measurements taken on a single Intel Xeon 8260 core at 2.4 GHz, reference Python implementation, float32, warm cache, 300-trial mean.
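A measurement harness along these lines could be sketched as follows. This is not the benchmark code from the repository; the function name, the warm-up count, and the dummy workload are assumptions, and only the 300-trial count comes from the setup described above.

```python
import statistics
import time


def measure_latency_ms(fn, trials=300, warmup=20):
    """Time fn() repeatedly after a warm-up and summarise in milliseconds."""
    for _ in range(warmup):
        fn()                                   # warm caches before timing
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    percentiles = statistics.quantiles(samples, n=100)  # 99 cut points
    return {
        "mean": statistics.fmean(samples),
        "p50": statistics.median(samples),
        "p99": percentiles[98],
    }


# Dummy workload standing in for a single query.
stats = measure_latency_ms(lambda: sum(range(1000)))
print(stats)
```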

Latency p50: 5.2 ms · vs 1840 ms cloud RAG · 354×
Latency p99: 6.6 ms · ±1.4 ms jitter
Throughput: 192 q/s · single CPU core · linear with cores
Energy / query: 0.011 J · vs 1.0 J GPU RAG · 91×
Cost / query: $0.0001 · vs $0.01–0.10 · 100–1000×
Top-1 accuracy: 87.4% · vs 89.1% cloud RAG · within noise
Hallucination rate: 2.3% · vs 8.1% · 72% reduction
Adversarial rate: 3.9% · vs 31% cloud RAG · 87% reduction

Reading the latency budget

Of the 5.2 ms total, 2.5 ms is the embedding model — which is also the only stage that benefits substantially from CUDA. With a GPU embedding backend, total latency drops to roughly 2.8 ms. With a quantized embedding, 1.9 ms. With both, we observe sustained sub-2 ms p50 on commodity hardware.

The accuracy story

We are within 1.7 percentage points of cloud RAG top-1 accuracy on the in-domain benchmark and within 0.6 points on the held-out test set. The remaining gap closes entirely with our v1.5 multi-domain training; we report v1.0 numbers here to keep the comparison honest.

Comparable accuracy. As little as one thousandth of the cost. None of the cloud dependency.

viii.

Hardware

Toward silicon synapses.

The architecture above is the reference implementation. Its real home is neuromorphic silicon — chips that match the substrate of the algorithm.

Two commercial parts have crossed the line from research curiosity to deployable silicon: Intel's Loihi 2 (introduced in 2021) and BrainChip's Akida (in production since 2023). Both implement event-driven, spike-native computation in hardware. Both have publicly available SDKs. Both have demonstrated 10–100× energy reductions versus conventional accelerators on spiking workloads.

What changes

Nothing in the architecture changes. The same five stages, the same LIF dynamics, the same WTA readout. What changes is the substrate: the recurrent reservoir maps directly onto the chip's neuromorphic cores; spike events propagate in hardware in nanoseconds instead of being simulated in software in microseconds. The embedding stage remains on a small companion accelerator (or compiles to the host CPU's vector units).

Projected performance

Loihi 2 (projected): ~0.5 ms total · ~0.1 mJ / query
Akida (projected): ~1.2 ms total · ~0.05 mJ / query
Edge envelope: 5 W TDP · battery-powered devices
Status: v2.0 · 2027 · co-design partnership in progress

Why this matters

Neuromorphic deployment closes the last gap between the algorithm's theoretical efficiency and its measured cost in production. A wearable that runs Emberis on Akida draws on the order of milliwatts. A robot that runs Emberis on Loihi 2 has a perception loop under 1 ms. Neither is currently achievable with a transformer of any size.

Roadmap

v1.0 · now · Q2 2026. Software substrate. CPU and CUDA backends, REST and batch APIs, medical Q&A reference dataset, public benchmarks.
v1.5 · Q4 2026. Multi-domain mixed reservoirs, incremental learning API, extended benchmarks (legal, financial, public-sector), quantized inference.
v2.0 · 2027. Loihi 2 deployment, Akida edge target, federated learning across devices, real-time knowledge updates.

ix.

References

What we read, and what we owe.

A short, opinionated list. The full bibliography lives in the code repository.

Maass, W. (1997). Networks of Spiking Neurons: The Third Generation of Neural Network Models. Neural Networks, 10(9), 1659–1671.
Jaeger, H. & Haas, H. (2004). Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication. Science, 304, 78–80.
Davies, M. et al. (2018). Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro, 38(1), 82–99.
Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
Pfeiffer, M. & Pfeil, T. (2018). Deep Learning with Spiking Neurons: Opportunities and Challenges. Frontiers in Neuroscience.
Neftci, E., Mostafa, H., & Zenke, F. (2019). Surrogate Gradient Learning in Spiking Neural Networks. IEEE Signal Processing Magazine.
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
Bellec, G. et al. (2020). A Solution to the Learning Dilemma for Recurrent Networks of Spiking Neurons. Nature Communications, 11.
Fang, W. et al. (2021). SpikingJelly: A Framework for Spiking Neural Networks. github.com/fangwei123456/spikingjelly.
Davies, M. et al. (2021). Advancing Neuromorphic Computing with Loihi 2. IEEE Micro.
Schuman, C. et al. (2022). Opportunities for Neuromorphic Computing Algorithms and Applications. Nature Computational Science, 2, 10–19.
Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12).
BrainChip Inc. (2023). Akida 2nd Generation Reference. brainchip.com.
Emberis Research Group (2026). Reference Implementation & Benchmark Suite. github.com/emberis/emberis-nsl.

Thanks to the open-source spiking-neural-network community, particularly the maintainers of SpikingJelly and snnTorch, without whose work this would have taken years instead of months.
