Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Overview diagram comparing existing multimodal chain-of-thought prompting with Rationale-Enhanced Decoding.

Existing multimodal CoT often lets the final answer rely heavily on the image while underusing the generated rationale. RED decouples image-conditioned and rationale-conditioned predictions, then combines them at the logit level.

Abstract

Large vision-language models (LVLMs) are commonly prompted to generate intermediate rationales before answering visual questions. However, this paper finds that standard multimodal chain-of-thought (CoT) can fail to ground the final answer in those rationales: replacing a rationale with an irrelevant one can leave performance almost unchanged.

Rationale-Enhanced Decoding (RED) addresses this failure at inference time, without additional training or architectural changes. RED composes two next-token distributions: one conditioned on the image and query, and one conditioned on the rationale and query. The resulting decoder encourages tokens that are supported by both visual evidence and the generated rationale.

Key idea

Instead of decoding with a single distribution \(p(y_i \mid y_{\lt i}, x, r, q)\), RED uses a product-of-experts form:

\[\hat{p}_\theta(y_i) \propto p_\theta(y_i \mid y_{\lt i}, x, q) \; p_\theta(y_i \mid y_{\lt i}, r, q)^\lambda\]

Here \(\lambda\) is the weight given to the rationale-conditional term. This form is the optimal solution of a KL-constrained reward maximization objective in which the rationale-conditional likelihood acts as the reward.
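As a sketch of where this form comes from, consider a per-token objective with a KL penalty toward the image-conditional distribution. The symbols \(\pi\) and \(\beta\) are introduced here for illustration and are assumptions, not necessarily the paper's notation:

\[\max_{\pi} \; \mathbb{E}_{y_i \sim \pi}\left[\log p_\theta(y_i \mid y_{\lt i}, r, q)\right] - \beta \, \mathrm{KL}\left(\pi \,\|\, p_\theta(\cdot \mid y_{\lt i}, x, q)\right)\]

The maximizer of an objective of this shape is the reference distribution tilted by the exponentiated reward:

\[\pi^\ast(y_i) \propto p_\theta(y_i \mid y_{\lt i}, x, q) \, \exp\left(\tfrac{1}{\beta} \log p_\theta(y_i \mid y_{\lt i}, r, q)\right) = p_\theta(y_i \mid y_{\lt i}, x, q) \; p_\theta(y_i \mid y_{\lt i}, r, q)^{1/\beta}\]

Setting \(\lambda = 1/\beta\) recovers the product form above: a weaker KL penalty (smaller \(\beta\)) corresponds to a larger rationale weight.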

Why do we need RED?

1. LVLMs pay little attention to rationales in standard CoT

Attention contribution analysis shows that when image and rationale tokens are both present, LVLMs can focus mostly on the image and reduce the influence of rationale tokens.

2. LVLMs are not faithful to rationales

Swapping in rationales from other examples often preserves standard CoT performance, suggesting that the final answer is not semantically grounded in the rationale.

3. Existing methods require additional training or architectural changes

As a result, they cannot be applied directly to off-the-shelf LVLMs or to arbitrary rationale formats, such as free-form descriptions and scene graphs.
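The rationale-swapping check from point 2 can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `swap_rationales` is a hypothetical helper, the example rationales are made up, and a cyclic shift is just one simple way to guarantee that no example keeps its own rationale.

```python
import random


def swap_rationales(rationales, seed=0):
    """Return a copy in which every example receives another example's rationale."""
    n = len(rationales)
    if n < 2:
        raise ValueError("need at least two examples to swap rationales")
    rng = random.Random(seed)
    # A non-zero cyclic shift maps no index to itself.
    shift = rng.randrange(1, n)
    return [rationales[(i + shift) % n] for i in range(n)]


# Made-up rationales standing in for model-generated ones.
rationales = [
    "the cup is to the left of the plate",
    "the sign reads STOP",
    "two dogs are running on grass",
]
swapped = swap_rationales(rationales)
```

Evaluating the same model once with `rationales` and once with `swapped` gives the comparison described above: if accuracy barely moves, the final answers are not grounded in the rationale text.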

Method

RED first generates a rationale using a standard multimodal CoT prompt. During answer generation, it runs two conditional predictions: an image-conditional branch \(p_\theta(y_i \mid y_{\lt i}, x, q)\) and a rationale-conditional branch \(p_\theta(y_i \mid y_{\lt i}, r, q)\).

The log-softmax outputs of the two branches are summed, with the rationale branch weighted by \(\lambda\):

\[\hat{\ell}_\theta(y_i) = \log \mathrm{softmax}(\ell_\theta(y_i \mid y_{\lt i}, x, q)) + \lambda \log \mathrm{softmax}(\ell_\theta(y_i \mid y_{\lt i}, r, q))\]

The next token is sampled or greedily selected from \(\mathrm{softmax}(\hat{\ell}_\theta)\). This simple change makes the output prefer tokens jointly supported by image evidence and rationale evidence.
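To make the role of the rationale weight concrete, here is a small pure-Python sketch of the fusion rule on a toy three-token vocabulary. The logit values and the function name `red_distribution` are illustrative assumptions; a real LVLM would supply the two logit vectors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def red_distribution(logits_img, logits_rat, lam):
    """Fuse the two branches in log space and renormalize to probabilities."""
    lp_img = log_softmax(logits_img)
    lp_rat = log_softmax(logits_rat)
    fused = [a + lam * b for a, b in zip(lp_img, lp_rat)]
    return [math.exp(v) for v in log_softmax(fused)]


# Toy logits (assumed values): the image branch prefers token 0,
# the rationale branch strongly prefers token 1.
logits_img = [2.0, 1.5, 0.0]
logits_rat = [0.0, 3.0, 0.0]

for lam in (0.0, 0.5, 1.0):
    probs = red_distribution(logits_img, logits_rat, lam)
    print(lam, [round(p, 3) for p in probs])
```

With `lam = 0.0` the rule reduces to image-only decoding. Increasing `lam` shifts probability mass toward tokens that the rationale branch also supports.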

Algorithm sketch

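# Step 1: generate the rationale r with a standard multimodal CoT prompt
r = generate(model, image, query)
y = []
# Step 2: decode the answer token by token from the fused scores
while not finished(y):
    logits_img = model(image=image, query=query, y=y)   # image-conditional branch
    logits_rat = model(rationale=r, query=query, y=y)   # rationale-conditional branch
    # lam is the rationale weight λ ("lambda" is a reserved word in Python)
    red_logits = log_softmax(logits_img) + lam * log_softmax(logits_rat)
    y.append(decode(red_logits))                        # greedy or sampled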
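The loop can be made runnable by treating each branch as a callable that maps the tokens generated so far to next-token logits. The sketch below makes several assumptions that are not from the paper: `red_greedy_decode`, `eos_id`, `max_new_tokens`, and the two toy branches are hypothetical stand-ins for real LVLM forward passes, and only greedy selection is shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def red_greedy_decode(img_branch, rat_branch, lam=1.0, eos_id=0, max_new_tokens=16):
    """Greedy decoding from the sum of image and lam-weighted rationale log-probs."""
    y = []
    for _ in range(max_new_tokens):
        lp_img = log_softmax(img_branch(y))  # conditioned on image and query
        lp_rat = log_softmax(rat_branch(y))  # conditioned on rationale and query
        scores = [a + lam * b for a, b in zip(lp_img, lp_rat)]
        token = max(range(len(scores)), key=scores.__getitem__)
        y.append(token)
        if token == eos_id:
            break
    return y


# Toy branches over the vocabulary {0: EOS, 1, 2}. At the first step the image
# branch slightly prefers token 1 while the rationale branch prefers token 2.
def toy_img_branch(y):
    return [0.0, 2.0, 1.5] if not y else [3.0, 0.0, 0.0]


def toy_rat_branch(y):
    return [0.0, 0.0, 3.0] if not y else [3.0, 0.0, 0.0]


answer = red_greedy_decode(toy_img_branch, toy_rat_branch, lam=1.0)
image_only = red_greedy_decode(toy_img_branch, toy_rat_branch, lam=0.0)
```

In this toy setup the fused decoder picks the token backed by the rationale at the first step, whereas `lam=0.0` follows the image branch alone. With a real model, both branches would share the same weights and differ only in their prompts, so each decoding step costs two forward passes.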

Results

Across general visual reasoning, text-rich VQA, mathematical reasoning, hallucination benchmarks, and larger LVLMs, RED consistently improves over standard CoT and competitive plug-and-play decoding baselines.

+275.35 MME cognition delta for Gemma-3-4B with CoT + RED

+20.24 SEED-I delta for Qwen2.5-VL-7B with CCoT + RED

61.6% MMMU accuracy with Qwen2.5-VL-7B and CoT + RED

Intervention analysis (plot: effect of swapping rationales on GQA accuracy for standard CoT and RED). RED benefits from high-quality GPT-4 rationales and degrades with random rationales, indicating stronger rationale grounding.

Scaling analysis (plot: GQA accuracy over model sizes for baseline, CCoT, and CCoT + RED). RED unlocks stronger performance as LVLM size increases, while standard CCoT does not consistently scale.

Qualitative examples

When the rationale contains the decisive evidence, standard CoT can still answer incorrectly. RED uses the rationale content during decoding, producing answers aligned with the generated reasoning.

BibTeX

@inproceedings{Yamaguchi_CVPR26_RED,
  title     = {Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought},
  author    = {Yamaguchi, Shin'ya and Nishida, Kosuke and Chijiwa, Daiki},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}