Prompt Lookup Decoding in MLX-LM: N-gram Memory vs Two-Model Speculation

Engineering and empirically evaluating prompt-lookup decoding, rolling-hash speculative memory, and two-model speculative decoding in MLX-LM on Apple Silicon

Introduction

Speculative decoding is typically framed around a two-model setup: a smaller draft model proposes candidate tokens and a larger target model verifies them in parallel. When draft-token acceptance is high, the target model produces multiple tokens per forward pass and throughput increases substantially. When acceptance is low, the overhead of the draft model dominates and throughput can fall below baseline.

This post documents an implementation of prompt-lookup decoding (PLD) in MLX-LM, including both ngram-simple and ngram-mod. Unlike draft-model speculative decoding, these approaches do not require a second neural model. Instead, speculative drafts are generated directly from previously observed token trajectories.

An n-gram simply refers to a sequence of n consecutive tokens, such as a 3-token or 16-token window.

Two speculative strategies were implemented:

ngram-simple: exact prompt-history lookup
ngram-mod: rolling-hash associative memory with process-global shared state

The work extends a generalized DraftStrategy abstraction added to MLX-LM and evaluates the practical behavior of both approaches on Apple Silicon across short-turn editing workloads, long-context conversations, a direct comparison against classic two-model speculative decoding, and cross-model warm-start experiments.

The most interesting result was not the raw throughput gain itself, but that token-space reuse could outperform a much higher-acceptance two-model draft setup on edit-heavy workloads, while also enabling runtime-learned speculative memory shared across requests and reusable across models with compatible tokenizers.

Architecture

Prompt Lookup Decoding

The simpler of the two implementations, ngram-simple, performs exact prompt-history lookup.

The mechanism is straightforward: the decoder maintains a running token history, extracts the most recent n-token window, and searches backward through the earlier history to find a prior occurrence of that same window. When a match is found, it copies up to num_draft_tokens tokens that followed the earlier occurrence as a speculative continuation and submits those tokens to the verifier, which accepts the matching prefix in bulk and falls back to normal generation at the first mismatch.
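The lookup described above can be sketched in a few lines. This is a minimal illustration, not the MLX-LM code: the function name `prompt_lookup_draft` and its parameter names are hypothetical, and it assumes the search returns the most recent earlier match.

```python
from typing import List, Sequence


def prompt_lookup_draft(
    history: Sequence[int],
    ngram_size: int = 3,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Return up to num_draft_tokens tokens that followed the most recent
    earlier occurrence of the trailing n-gram, or [] if there is no match."""
    if len(history) <= ngram_size:
        return []
    tail = list(history[-ngram_size:])
    # Scan backward over earlier start positions, skipping the tail itself.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if list(history[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(history[follow:follow + num_draft_tokens])
    return []


# The trailing window [1, 2, 3] also appears at the start of the history,
# so the tokens that followed it there become the draft.
draft = prompt_lookup_draft([1, 2, 3, 4, 5, 6, 1, 2, 3], ngram_size=3, num_draft_tokens=4)
# draft == [4, 5, 6, 1]
```

The backward scan is linear in the history length, which is the cost that the hash-table variant described later avoids.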

In effect, the draft comes from reusing previously observed token trajectories rather than from a second neural model, which makes the approach especially effective when the prompt contains recurring patterns.

For repetitive workloads such as code editing, structured generation, or iterative refinement, token trajectories recur frequently enough that the verifier can often accept several speculative tokens in bulk.

Unlike draft-model speculative decoding, no neural forward pass is required to produce the draft itself. The speculative tokens are retrieved directly from prior token history.
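The verification step (accept the matching prefix, fall back at the first mismatch) can be sketched as follows. This assumes greedy decoding, where the verifier's choice at each position is a single token; the name `accept_draft` and its signature are hypothetical and not part of MLX-LM.

```python
from typing import List, Sequence, Tuple


def accept_draft(draft: Sequence[int], target: Sequence[int]) -> Tuple[List[int], int]:
    """Compare drafted tokens with the verifier's own choices.

    target[i] is the token the verifier picks at position i given the
    context plus draft[:i]; one forward pass over the draft yields
    len(draft) + 1 such choices.
    """
    accepted = 0
    for drafted_token, verifier_token in zip(draft, target):
        if drafted_token != verifier_token:
            break
        accepted += 1
    # Emit the accepted prefix plus the verifier's token at the first
    # mismatch (or the extra token when the whole draft matched).
    emitted = list(target[:accepted + 1])
    return emitted, accepted


emitted, accepted = accept_draft([4, 5, 6, 1], [4, 5, 9, 2, 3])
# emitted == [4, 5, 9], accepted == 2
```

Because every emitted token is one the verifier would have chosen anyway, a bad draft costs time but never changes the output under this greedy assumption.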

Rolling-Hash Associative Memory

ngram-mod replaces exact prompt lookup with a lightweight associative memory. Instead of scanning prior token history to find a matching window, it performs an O(1) hash-table lookup from the most recent n-token context to a predicted next token. This avoids repeated backward searches through the prompt, makes lookup cost effectively independent of context length, and allows the learned transition table to be shared across requests.

Concretely, the implementation stores rolling n-gram hashes, each mapped to the token that followed that n-gram:

hash([token1, token2, token3, ...]) -> next_token

Generation becomes iterative: on each step, the decoder hashes the most recent token window, uses that hash to look up a predicted next token in the table, appends the predicted token to the sequence, and then slides the window forward to repeat the same process. Over time, this turns the table into a lightweight, runtime-learned next-token predictor over recent n-gram contexts.
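The hash, look up, append, slide loop can be sketched as below. The multiplier is the one listed in the default configuration, but the hash construction itself (a multiplicative hash recomputed over the window rather than rolled incrementally), the `EMPTY` sentinel, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence

MULTIPLIER = 6364136223846793005  # hash multiplier from the default configuration
MASK64 = (1 << 64) - 1
EMPTY = -1  # hypothetical sentinel for an unused slot


def ngram_hash(window: Sequence[int], table_size: int) -> int:
    """Map a token window to a table slot (assumed hash, not the real one)."""
    h = 0
    for tok in window:
        h = (h * MULTIPLIER + tok + 1) & MASK64
    return h % table_size


def insert_ngrams(table: List[int], tokens: Sequence[int], n: int) -> None:
    """Record window -> next-token transitions for every position in tokens."""
    for i in range(n, len(tokens)):
        table[ngram_hash(tokens[i - n:i], len(table))] = tokens[i]


def draft_from_table(table: List[int], tokens: Sequence[int], n: int,
                     num_draft_tokens: int) -> List[int]:
    """Chain table lookups, sliding the window over each predicted token."""
    if len(tokens) < n:
        return []
    window = list(tokens[-n:])
    draft: List[int] = []
    for _ in range(num_draft_tokens):
        nxt = table[ngram_hash(window, len(table))]
        if nxt == EMPTY:
            break  # nothing learned for this context yet
        draft.append(nxt)
        window = window[1:] + [nxt]
    return draft


table = [EMPTY] * 1024
insert_ngrams(table, [10, 11, 12, 13, 14, 15], n=3)
draft = draft_from_table(table, [10, 11, 12], n=3, num_draft_tokens=3)
# draft == [13, 14, 15]
```

Each drafted token costs one hash and one array read over a fixed-size window, so the work per draft does not grow with the length of the conversation.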

The implementation separates shared memory from request-local decoding state:

Layer	Class	Lifetime	Responsibility
Memory	`NgramModTable`	process-global	shared hash table, occupancy, reset
Strategy	`NgramModStrategy`	per-request	runtime decoding state

This separation is important because the speculative memory itself is shared across requests, while runtime bookkeeping remains request-local.

The shared table stores only token transitions. It does not store prompts, completions, embeddings, or semantic representations. The memory is entirely tokenizer-space associative state.

Collision Model

The implementation intentionally uses a lossy collision policy.

Hash collisions silently overwrite prior entries:

hash(ngram_A) == hash(ngram_B)

results in the newer continuation replacing the older one.

No collision chaining or tag verification is performed. Incorrect speculative continuations are rejected by the verifier model during verification.

This design choice keeps the memory extremely lightweight and cache-friendly at the cost of approximate retrieval quality.
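The overwrite-on-collision policy is small enough to show directly. This is an illustrative sketch with a deliberately tiny table; `store`, `lookup`, and the slot numbers are hypothetical.

```python
from typing import List

EMPTY = -1  # hypothetical sentinel for an unused slot


def store(table: List[int], ngram_hash: int, next_token: int) -> None:
    # No chaining and no tag check: whatever was in the slot is replaced.
    table[ngram_hash % len(table)] = next_token


def lookup(table: List[int], ngram_hash: int) -> int:
    return table[ngram_hash % len(table)]


table = [EMPTY] * 8
store(table, 3, 100)   # n-gram A hashes to 3
store(table, 11, 200)  # n-gram B hashes to 11, which is also slot 3 here
stale = lookup(table, 3)
# stale == 200: a lookup for A now returns B's continuation
```

Nothing in the table can tell the two n-grams apart, so correctness rests entirely on the verifier rejecting the wrong continuation.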

Adaptive Gating

Speculative decoding only improves throughput when token trajectories are sufficiently repetitive. On open-ended or highly stochastic prompts, speculative overhead can outweigh the savings from accepted tokens.

To avoid regressions, the implementation ports the adaptive speculative gate from mlx-serve. A 3-gram repetition score is computed over the prompt after tokenization. If repetition falls below a threshold, speculative decoding is disabled for that request unless explicitly forced.

This is particularly important for ngram-mod, whose cold-start behavior can otherwise regress below baseline throughput.
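A gate of this kind could be sketched as follows. The exact score used by mlx-serve is not given here, so this assumes one plausible definition (the fraction of prompt 3-grams that repeat an earlier one), and the threshold value of 0.1 is a placeholder rather than the real default.

```python
from typing import Sequence


def trigram_repetition(tokens: Sequence[int]) -> float:
    """Fraction of 3-grams in the prompt that duplicate another 3-gram."""
    total = len(tokens) - 2
    if total <= 0:
        return 0.0
    unique = len({tuple(tokens[i:i + 3]) for i in range(total)})
    return 1.0 - unique / total


def should_speculate(tokens: Sequence[int], threshold: float = 0.1,
                     force: bool = False) -> bool:
    """Enable drafting only for sufficiently repetitive prompts, unless forced."""
    return force or trigram_repetition(tokens) >= threshold


repetitive = [1, 2, 3, 1, 2, 3, 1, 2, 3]
novel = list(range(10))
# should_speculate(repetitive) is True; should_speculate(novel) is False
```

The score is computed once per request over the tokenized prompt, so its cost is negligible next to a single forward pass.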

Experimental Setup

All experiments were run on a MacBook Pro with an M5 chip and 16 GB of RAM using MLX-LM.

Benchmarks 1 and 2 were run using:

mlx-community/Llama-3.2-3B-Instruct-4bit

This work was implemented directly in MLX-LM and raised as an upstream PR to the core repository. The implementation extends the generation path with a generalized draft strategy interface, adds token-only draft sources for ngram-simple and ngram-mod, and exposes runtime strategy selection through CLI and server arguments:

--draft-type {none,ngram-simple,ngram-mod}

with additional controls for draft length, n-gram size, adaptive gating, and server-side shared state.

The default ngram-mod configuration used:

Parameter	Value
Hash multiplier	`6364136223846793005`
Table size	`4 * 1024 * 1024` entries
Default ngram size	`16`
Chunked insert gate	`i_last + 32 < cur_len`
Low acceptance threshold	`< 0.5`
Reset streak	`3`

The shared table is allocated lazily and persists for the lifetime of the running server process. Restarting the process resets the speculative memory entirely.
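One plausible reading of the last three rows of the configuration table is sketched below: new n-grams are inserted only once more than 32 tokens have accumulated since the previous insert, and the table is reset after three consecutive low-acceptance observations. The class and function names are hypothetical, and what counts as one "observation" (a draft, a turn, or a request) is an assumption.

```python
LOW_ACCEPTANCE = 0.5  # "Low acceptance threshold" from the table
RESET_STREAK = 3      # "Reset streak" from the table
INSERT_CHUNK = 32     # from the gate `i_last + 32 < cur_len`


class ResetTracker:
    """Counts consecutive low-acceptance observations."""

    def __init__(self) -> None:
        self.streak = 0

    def observe(self, drafted: int, accepted: int) -> bool:
        """Return True when the caller should reset the shared table."""
        if drafted == 0:
            return False
        if accepted / drafted < LOW_ACCEPTANCE:
            self.streak += 1
        else:
            self.streak = 0  # any good observation breaks the streak
        if self.streak >= RESET_STREAK:
            self.streak = 0
            return True
        return False


def should_insert(i_last: int, cur_len: int) -> bool:
    """Chunked insert gate: wait until enough new tokens have accumulated."""
    return i_last + INSERT_CHUNK < cur_len
```

Batching inserts keeps table writes off the per-token path, while the reset streak gives the memory a way to discard transitions that have stopped predicting well.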

Benchmark 1: Short Multi-Turn Editing

The first benchmark evaluated short iterative code edits.

Workload:

4-turn conversation
edits to a small process_user(name) function
31–53 generated tokens per turn

Results:

config	tok/s	acc%	speedup
baseline	48.63	—	1.00x
ngram-simple nd=4	73.44	55.2%	1.51x
ngram-mod nd=4	60.56	41.4%	1.25x
ngram-mod nd=6	60.95	44.2%	1.25x

The principal result was that ngram-simple decisively outperformed ngram-mod on short editing workloads.

This appears to be primarily a consequence of window size. ngram-simple uses small exact lookups (n=3), while ngram-mod operates on 16-token rolling windows. Small edits invalidate every longer window touching the modification, which dramatically reduces speculative reuse in short contexts.

For workloads dominated by iterative edits to small snippets, exact local lookup appears to be the better strategy.

Benchmark 2: Long Multi-Turn Editing

The second benchmark expanded the workload substantially.

Workload:

edits to a multi-method EmailValidator class
270–341 generated tokens per turn
~1257 generated tokens total

Results:

config	tok/s	acc%	speedup
baseline	54.09	—	1.00x
ngram-simple nd=4	91.57	62.7%	1.69x
ngram-simple nd=6	89.01	67.8%	1.65x
ngram-mod nd=6	84.69	59.4%	1.57x
ngram-mod nd=8	82.69	61.7%	1.53x

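The gap narrowed considerably in the longer context: the best ngram-mod configuration reached 1.57x against 1.69x for ngram-simple, compared with 1.25x against 1.51x in the short-edit benchmark.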

Per-turn breakdown revealed the more interesting behavior:

turn	simple nd=6	mod nd=6
T1	1.14x	0.92x
T2	2.01x	2.16x
T3	2.07x	2.24x
T4	2.12x	2.35x

ngram-mod loses the initial turn because the associative memory begins empty. Once sufficient token history accumulates, the rolling-hash approach overtakes exact lookup.

This behavior is consistent with the underlying mechanisms. ngram-simple performs exact local retrieval and benefits immediately from nearby repetition. ngram-mod instead behaves like a runtime-learned associative predictor whose effectiveness scales with accumulated token trajectories.

The cold-start penalty on the first turn is what holds ngram-mod's overall throughput below ngram-simple's (1.57x versus 1.65x at nd=6); once enough history has accumulated, the speculative memory becomes increasingly effective.

Benchmark 3: PLD vs Two-Model Speculative Decoding

This benchmark compares PLD-style token reuse against conventional two-model speculative decoding on the same synthetic dataset. The target model was mlx-community/llama-3.1-8b-instruct-4bit; the draft model was mlx-community/llama-3.2-3b-instruct-4bit.

config	tok/s	acc%	speedup
baseline	25.66	—	1.00x
ngram-simple nd=2 n=3	43.08	48.6%	1.68x
ngram-simple nd=4 n=3	42.13	58.9%	1.64x
ngram-simple nd=6 n=3	42.07	63.4%	1.64x
ngram-simple nd=8 n=3	37.42	66.0%	1.46x
ngram-mod nd=2 n=16	34.37	45.8%	1.34x
ngram-mod nd=4 n=16	40.49	54.9%	1.58x
ngram-mod nd=6 n=16	41.00	58.9%	1.60x
ngram-mod nd=8 n=16	40.35	61.1%	1.57x
draft-model nd=2	31.13	65.3%	1.21x
draft-model nd=4	29.59	78.3%	1.15x
draft-model nd=6	28.70	83.8%	1.12x
draft-model nd=8	25.82	86.8%	1.01x

The result is notable because the two-model draft setup achieved much higher acceptance rates, reaching 86.8% at nd=8, but lower throughput. The additional cost of running the 3B draft model outweighed the benefit of higher acceptance on this workload. In contrast, ngram-simple and ngram-mod generated drafts without a second neural forward pass, so lower acceptance still translated into higher end-to-end throughput.
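The trade-off can be made concrete with a deliberately simple cost model: each speculation cycle pays for producing the draft plus one verifier pass, and yields the accepted tokens plus one token from the verifier. The function below and every number passed to it are illustrative assumptions, not measurements from these benchmarks, and the model ignores that verifying a longer draft costs somewhat more than a single-token pass.

```python
def tokens_per_second(mean_accepted: float, num_draft_tokens: int,
                      draft_cost_s: float, verify_cost_s: float) -> float:
    """Throughput of one speculation cycle under a toy cost model."""
    cycle_time = num_draft_tokens * draft_cost_s + verify_cost_s
    return (mean_accepted + 1.0) / cycle_time


# Illustrative inputs only, not measured values.
# Token lookup: lower acceptance, but the draft is essentially free.
lookup_tps = tokens_per_second(mean_accepted=2.4, num_draft_tokens=4,
                               draft_cost_s=0.0, verify_cost_s=0.040)
# Draft model: higher acceptance, but each drafted token costs a forward pass.
model_tps = tokens_per_second(mean_accepted=3.4, num_draft_tokens=4,
                              draft_cost_s=0.012, verify_cost_s=0.040)
# lookup_tps > model_tps under these assumed costs
```

In this model the draft cost sits in the denominator on every cycle, whether or not the drafted tokens are accepted, which is why a cheaper draft with lower acceptance can still come out ahead.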

For this edit-heavy synthetic workload, prompt lookup decoding outperformed classic two-model speculative decoding despite accepting fewer draft tokens.

Benchmark 4: Cross-Model Warm-Start

The final benchmark evaluated whether speculative memory learned by one model could improve another model’s cold-start behavior.

Two models from the Llama-3 tokenizer family were used:

mlx-community/Llama-3.2-3B-Instruct-4bit
mlx-community/Llama-3.1-8B-Instruct-4bit

The models differ substantially in parameter count but share the same tokenizer vocabulary and token-ID assignments, which is required because the associative memory operates entirely in token-ID space.

The shared table was first populated using the 3B model and then reused by the 8B model.

Results:

config	tok/s	acc%	speedup
8B baseline	23.33	—	1.00x
8B cold table	40.33	60.5%	1.73x
8B warm table	43.58	66.5%	1.87x

Warm-starting improved:

throughput by 8.1%
acceptance by 6 percentage points

The most significant effect occurred during the first turn:

turn	cold	warm
T1	3% acceptance	27% acceptance

This demonstrates that speculative memory learned from one model can improve the cold-start behavior of another model, provided the tokenizer mappings are identical.

The effect diminishes rapidly after the first turn. By T2 and beyond, the conversation’s own history dominates speculative reuse and cross-model transfer contributes comparatively little.

An interactive dashboard with the full sweep is available here, and the synthetic dataset used for these sweep runs can be found here.


Memory Occupancy

One unexpected result was ngram-mod hash-table occupancy.

Even after multiple long conversations across two models, occupancy remained below 0.0003, or approximately 0.03% utilization.

The default 4M-entry ngram-mod table therefore appears dramatically oversized for these workloads. The next step is to rerun the same benchmark sweep with a smaller table size and identify the smallest configuration that preserves throughput and acceptance. That result should determine the recommended default for average interactive conversations.
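To put the occupancy figure in absolute terms, the short calculation below converts it to a slot count using the default table size from the configuration. The variable names are illustrative.

```python
TABLE_SIZE = 4 * 1024 * 1024  # default number of entries
MAX_OCCUPANCY = 0.0003        # observed upper bound on occupancy

# Upper bound on how many slots were ever filled.
occupied_upper_bound = TABLE_SIZE * MAX_OCCUPANCY

print(f"{TABLE_SIZE:,} slots, at most about {occupied_upper_bound:,.0f} in use")
```

That is roughly 1,258 filled slots out of 4,194,304, which gives a sense of how much headroom a smaller table would still have on these workloads.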

Conclusion

Several operational patterns emerged consistently across the experiments.

First, ngram-simple is the stronger strategy for short iterative editing workloads. Small exact lookups are highly effective for localized code modifications and avoid the cold-start penalty inherent to rolling-hash memory.
Second, ngram-mod scales with accumulated history. Once sufficient repeated trajectories exist, the associative memory begins outperforming exact lookup on a per-turn basis.
Third, PLD-style token reuse can outperform classic two-model speculative decoding on edit-heavy workloads. In the synthetic multi-turn benchmark, the draft-model setup achieved higher acceptance rates, but the extra cost of running a 3B draft model reduced end-to-end throughput.
Fourth, adaptive gating is essential for production deployment. The worst-case behavior of ngram-mod occurs during low-repetition cold-start prompts, where speculative overhead can exceed the savings from accepted tokens. Gating removes most of these regressions automatically.
Finally, speculative memory transfer across models is possible when tokenizers match exactly. Because the memory operates entirely in token-ID space, tokenizer compatibility, not architecture compatibility, determines whether cross-model warm-starting is meaningful.

The resulting system behaves less like traditional prompt lookup and more like a lightweight runtime-learned speculative memory layered on top of autoregressive decoding. The broader takeaway is that not all speculative speedups require another neural model: for workloads with repeated token trajectories, simple token-space memory can be cheaper and faster than a draft model.