Signals Between the Tokens

Concrete mechanisms for teaching models to reason

In a companion post1, I argued that RL for reasoning should target the inclination to do reasoning (a meta-cognitive disposition) rather than domain-specific reasoning performance. The core claim was that it’s not the reasoning tokens themselves that are valuable, but the ‘understanding’ that the process of reasoning creates useful context for itself. RLVR rewards the final answer, but what we actually want is for models to develop the process of building good context.

That post also noted that RL has already abandoned the speed-up inherent in fully parallel training. The RLVR process involves multiple serial token-wise rollouts, which means we could, in principle, incorporate richer signals that aren’t available during standard transformer training.

This post explores what those richer signals could be.

Better Reward Signals

The Power of Consensus

Multiple RL rollouts from a good base LLM tend to agree on the true answer2. Over many rollouts, bad reasoning leads to many different answers; good reasoning converges to a single one.

In the previous post, I discussed consensus in the context of epistemic humility: a model with good self-calibration would internalise the information that consensus-across-rollouts provides externally. Here the focus is different. Consensus can also be used directly as a training signal.

One self-improvement direction is to do many rollouts and use the consensus answer as a reasoning target, letting the model lever itself towards better reasoning overall. The risk is model collapse: if wrong answers start to dominate the consensus, the model can spiral downward. But when it works, this gives a richer signal than binary right/wrong from RLVR.
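As a minimal sketch of consensus-target selection, the majority answer across rollouts picks out which reasoning traces to train on, with an agreement threshold as a crude guard against the collapse risk mentioned above (all names here are illustrative, not a reference implementation):

```python
from collections import Counter

def consensus_targets(rollouts, min_agreement=0.5):
    """Pick training targets by majority vote over rollout answers.

    `rollouts` is a list of (reasoning_text, final_answer) pairs.
    Returns the rollouts whose answer matches the consensus, or an
    empty list if no answer clears the agreement threshold (skipping
    prompts where the answer distribution is too dispersed to trust).
    """
    answers = [ans for _, ans in rollouts]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(rollouts) < min_agreement:
        return []  # no reliable consensus: skip this prompt entirely
    return [(text, ans) for text, ans in rollouts if ans == answer]

# Example: three of four rollouts converge on "42"
rollouts = [("r1", "42"), ("r2", "42"), ("r3", "17"), ("r4", "42")]
targets = consensus_targets(rollouts)
```

The threshold is doing real work: training only when consensus is strong trades coverage for a lower risk of reinforcing confidently wrong answers.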

Multiple Independent Rollouts

Multiple independent rollouts have the computational benefit of being trivially parallelisable. Two practical issues are worth noting, though.

First, rollouts can have very different lengths, which means wasting compute waiting on the longest one. Second, if rollouts take different amounts of time, some may be generated under stale model weights, creating async training issues that need to be managed.

These are engineering problems rather than conceptual ones, but they matter for making any consensus- or rollout-based technique practical at scale.

Distillation (On-Policy Preferred)

RLVR provides one bit of feedback (correct or incorrect) for an entire reasoning chain. Distillation can provide token-wise signals, which is a dramatically higher information density.

The setup: a Student rolls out draft reasoning on-policy. A Teacher is then applied to the Student’s reasoning tokens, producing “wiser” logits. The Student learns from the gap between its own predictions and the Teacher’s.

What makes this interesting is the range of possible Teachers:

  • A larger model. Evaluation is fast since the Teacher processes the Student’s tokens in parallel, not autoregressively.
  • A model with privileged information, such as ground-truth answers, or even an example ground-truth reasoning path.
  • A model that sees successful rollouts: partial reasoning steps from other rollouts that reached the correct answer, used to guide “what should come next.”
  • A non-causal model: the Teacher views two copies of the Student’s response, allowing backward attention (more on this below).

Irrespective of the choice above, the overarching benefit is that instead of one reward signal per rollout, the Student gets dense supervision at every token. It doesn’t just learn that it got the answer wrong. It learns where its reasoning diverged from better reasoning.
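The density difference can be made concrete. A sketch of the per-token signal, assuming we already have Teacher and Student logits over the Student's own rollout (the small-vocab lists stand in for real logit tensors):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each position of the Student's rollout:
    a dense, token-wise training signal instead of one end-of-rollout
    reward. Inputs are [seq_len][vocab_size] lists of logits."""
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        losses.append(sum(pi * math.log(pi / qi)
                          for pi, qi in zip(p, q) if pi > 0))
    return losses

# Position 0: teacher and student agree (loss ~ 0).
# Position 1: they diverge, localising where the reasoning went off.
losses = per_token_kl([[1.0, 2.0], [0.0, 0.0]],
                      [[1.0, 2.0], [3.0, 0.0]])
```

Positions with near-zero loss are where the Student already matches the wiser distribution; large values localise exactly where its reasoning diverged.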

Architectural Ideas: Going Beyond Token-Level Learning

Why We Can Afford Richer Architectures Now

Pre-training and SFT use fully parallel, token-level training. This maps well onto GPU hardware and is fast. RL for reasoning, on the other hand, is already serial: while backpropagation of signals can be parallel, the RLVR process itself involves multiple serial token-wise rollouts.

Since we’re already paying the serial cost, we can incorporate signals that regular transformer training can’t use. Two directions seem worth exploring.

Thinking “Between the Tokens”

Standard autoregressive generation conditions each token on the previous token’s output. What if we conditioned each step on information derived from the entire model state at the previous step - not just the output token, but representations across all layers?

Concretely: create a new layer (or something akin to a LoRA scheme) to calculate an inter-token representation, conditioned on representations at all (or selected) layers from the previous step. This new layer would be initialised so that its contribution starts at zero, preserving the base model’s behaviour at the start of training. The rest of the network could be frozen.

The inter-token layer would be trained to output an overlay for the next token, or for the next reasoning span until some end-of-chunk indicator (as simple as \n or \n\n). The overlay could be the pure output of an MLP. Or, more interestingly, it could be vector-quantised into a dictionary of “atomic thinking strategies.”

This connects to the decomposition ideas from the previous post. If reasoning involves switching between a finite set of cognitive moves, the inter-token layer could learn to select the appropriate one. Stop-gradients would keep training efficient.
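A toy sketch of the inter-token layer described above, with a zero-initialised output projection and an optional vector-quantisation step. The class and its internals are hypothetical, and plain lists stand in for real tensors:

```python
class InterTokenOverlay:
    """Hypothetical inter-token layer: pools the previous step's hidden
    states across layers, projects them through weights that start at
    zero (so the frozen base model is untouched at init), and optionally
    snaps the result to the nearest entry in a codebook of
    'atomic thinking strategies' (vector quantisation)."""

    def __init__(self, d_model, codebook=None):
        self.d_model = d_model
        # Zero-initialised output weights: the overlay begins as a no-op.
        self.w_out = [[0.0] * d_model for _ in range(d_model)]
        self.codebook = codebook  # list of d_model vectors, or None

    def __call__(self, layer_states):
        # Mean-pool the previous step's per-layer representations.
        pooled = [sum(col) / len(layer_states) for col in zip(*layer_states)]
        overlay = [sum(w * x for w, x in zip(row, pooled))
                   for row in self.w_out]
        if self.codebook is not None:
            # Vector-quantise: nearest code by squared distance.
            overlay = min(self.codebook,
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(c, overlay)))
        return overlay  # added to the next token's input representation

overlay = InterTokenOverlay(2)
out = overlay([[1.0, 2.0], [3.0, 4.0]])  # all zeros at init
```

In training, gradients would flow only through `w_out` (and a straight-through estimator for the codebook), which is where the stop-gradients mentioned above come in.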

Thinking “Layer-Backwards”

A different direction: let lower layers in the next token observe higher-order information gathered from previous steps.

This has a family resemblance to Universal Transformers3 and Looped Transformers4, but the intention is different. The “higher-order information” here means the model’s own internal signals and statistics, not a recurrence of the full computation. For example:

  • Logit entropy: how confident was the model at the previous step?5
  • Layer-wise dynamics: the size and correlation of updates to the representation as it passes through the transformer’s layers. This could serve as a reasoning-quality measure, using internal layer dynamics as the signal6.

The idea is to give the model access to a cheap self-assessment of its own processing quality, so that lower layers can adjust their behaviour based on how well or poorly the previous step went. This is, in effect, an architectural implementation of the “self-monitoring” meta-tool from the previous post.
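Both signals are cheap to read off a forward pass that has already happened. A sketch, with plain Python standing in for tensor ops and the function names purely illustrative:

```python
import math

def logit_entropy(logits):
    """Entropy of the next-token distribution: a cheap 'how confident
    was the previous step?' signal (low entropy = high confidence)."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def layer_update_norms(hidden_states):
    """Size of the residual-stream update at each layer: one candidate
    'layer-wise dynamics' statistic. `hidden_states` is a
    [n_layers + 1][d_model] list of representations for one token."""
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, curr)))
        for prev, curr in zip(hidden_states, hidden_states[1:])
    ]
```

Either statistic could be embedded and fed to the lower layers at the next step, giving them a scalar summary of how the previous step went.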

Allowing for Non-Causal Reasoning

Causal (left-to-right) attention is a fundamental constraint of autoregressive generation. But reasoning isn’t always left-to-right. Sometimes a later insight recontextualises an earlier step.

One mechanism for getting around this: allow a Teacher to view two concatenated copies of the Student’s response. This gives the Teacher non-causal “backward attention,” where later tokens can inform the understanding of earlier ones. The Teacher’s improved logits then serve as a training signal for the Student.

This isn’t purely speculative: Google’s “Prompt Repetition” work7 showed that simply repeating the prompt gives models a form of backward attention that improves performance. A Teacher explicitly allowed to condition on later tokens (implemented by attending back to the first copy) could provide a much stronger version of this signal.
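To see why two causal copies yield backward attention, consider which positions are visible when the Teacher scores token t from the second copy (a small illustrative helper, not from the cited work):

```python
def two_copy_teacher_views(n):
    """For an n-token response repeated twice under ordinary causal
    attention, return the positions visible when the Teacher scores
    token t from the second copy. Token t sits at position n + t and
    attends to positions 0..n + t: the entire first copy plus the
    second copy's prefix. Reading logits off the second copy therefore
    conditions every target on the whole response, not just its prefix."""
    return [list(range(0, n + t + 1)) for t in range(n)]

views = two_copy_teacher_views(3)
```

No mask surgery is needed: plain causal attention over the concatenation already gives every second-copy position a full view of the response, which is exactly the non-causal signal the Student can't produce for itself.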

Connecting Mechanisms to Meta-Tools

The mechanisms described above aren’t arbitrary architectural proposals. They each map onto one of the mental meta-tools from the previous post:

  • Decomposition: vector-quantised inter-token overlays as “atomic thinking strategies”
  • Strategy switching: the inter-token layer as meta-controller, selecting thinking modes based on layer-wise dynamics
  • Self-monitoring: layer-backwards signals providing cheap self-assessment of reasoning quality
  • Self-prompting: distillation with teacher-student collapse (DINO-style8), teaching the model to lay track in its own context
  • Epistemic calibration: logit entropy and consensus signals as uncertainty awareness; risk-managed reward structures

The common thread is that all of these operate above the token level. They’re about giving the model access to information about its own reasoning process, and the ability to steer that process, rather than just predicting the next token well.

What to Try First

As for how to sequence these experiments, a reasonable path runs from the most straightforward to the most novel:

Consensus-based self-improvement is probably the lowest-effort starting point. Take a strong base model, generate many rollouts, and use consensus answers as targets. Naturally, confirm first that reasoning quality improves on the validation set, then, critically, check whether that improvement transfers to unseen domains. That transfer test is a direct check on the meta-learning hypothesis.

Dense distillation with a privileged Teacher is the next step. Train a Student with a Teacher that has access to ground-truth, and compare learning speed and generalisation against RLVR alone. The question here is whether higher-density reward signals change what the model learns, not just how fast it learns.

Inter-token overlays are the most architecturally novel proposal. Start simple: an MLP overlay initialised at zero, trained on reasoning tasks with the base model frozen. See if the overlay learns interpretable “thinking modes.” If vector-quantised, check whether the discrete codes correspond to recognisable reasoning strategies.

A non-causal Teacher would test a different assumption entirely. Implement the two-copy Teacher and measure whether it produces meaningfully different training signals from a standard causal Teacher. If backward attention helps the Teacher produce better targets, that’s evidence that left-to-right token generation is genuinely limiting reasoning quality.

Each experiment is self-contained but builds toward the broader goal from the previous post: giving models the ability to direct their own reasoning, rather than just generating plausible reasoning tokens.


This is the second of two posts. The first, “Learning to Think, Not Learning to Solve”, provides the conceptual argument for why these mechanisms matter.

Footnotes

  1. “Learning to Think, Not Learning to Solve”

  2. Denny Zhou, presentation: “Teach Language Models to Reason”

  3. Dehghani et al., 2018: “Universal Transformers”

  4. Saunshi et al., 2025: “Reasoning with Latent Thoughts: On the Power of Looped Transformers”

  5. See the Entropix project for entropy-based sampling approaches.

  6. Zhu et al., 2025: “Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens”

  7. Leviathan et al., 2025: “Prompt Repetition Improves Non-Reasoning LLMs”

  8. Caron et al., 2021: “Emerging Properties in Self-Supervised Vision Transformers”