Mar 6, 2026 · Foundations

Why Does Chain-of-Thought Improve Model Inference Ability?

A formal analysis of how chain-of-thought prompting expands effective computation depth in transformers, with information-theoretic bounds and empirical evidence from reasoning benchmarks.


Abstract

Chain-of-thought (CoT) prompting has emerged as a technique that improves the reasoning capabilities of large language models on complex tasks. This post provides a formal analysis of why CoT works from computational and information-theoretic perspectives. We formalize the distinction between direct prediction \(P(y | x)\) and chain-of-thought prediction \(P(y | x, z_1, \ldots, z_n)\), where the \(z_i\) represent intermediate reasoning tokens. We present the expressiveness argument showing how CoT effectively increases the serial computation depth available to transformer architectures, derive the role of autoregressive generation in enabling intermediate variable storage, and analyze the conditions under which CoT succeeds or fails. Empirical evidence from arithmetic, symbolic reasoning, and mathematical word problem benchmarks supports the theoretical framework. The analysis suggests that CoT's effectiveness stems from its function as external working memory that circumvents architectural limitations on serial computation depth.

The Inference Problem

*Figure: Direct prediction vs. chain-of-thought computation depth.*

Consider a language model tasked with answering a question. Given input \(x\), the model must produce output \(y\). In the standard paradigm, this requires computing: \[ y^* = \arg\max_y P(y | x) \] For a transformer with \(T\) layers, each forward pass performs \(T\) sequential computational steps. The hidden state at layer \(\ell\) can be written as: \[ h^{(\ell)} = f^{(\ell)}(h^{(\ell-1)}, x) \] where \(f^{(\ell)}\) represents the computation at layer \(\ell\) (self-attention followed by a feed-forward network). The final prediction depends on \(h^{(T)}\), which has undergone exactly \(T\) serial transformations of the input.

This creates a fundamental constraint: any computation the model performs must be expressible within \(T\) sequential steps. While each layer has substantial width (the hidden dimension \(d\)) allowing parallel computation, the depth is fixed at \(T\). Many reasoning tasks, however, require computations with depth exceeding \(T\). Consider multiplying two \(n\)-digit numbers: the standard algorithm requires \(O(n^2)\) sequential operations in the worst case, or \(O(n \log n)\) with advanced methods. For sufficiently large \(n\), this exceeds what a fixed-depth transformer can compute in a single forward pass.

The fundamental question becomes: how can a fixed-depth architecture solve problems requiring arbitrary computational depth? Chain-of-thought provides one answer.

Direct Prediction vs. Chain-of-Thought

Direct Prediction

In direct prediction, the model maps input directly to output: \[ P(y | x) = \prod_{i=1}^{|y|} P(y_i | x, y_1, \ldots, y_{i-1}) \] Each token of the output is generated conditioned on the input and previously generated output tokens. The key observation is that generating each \(y_i\) involves exactly one forward pass through the \(T\)-layer network. For problems where the answer length \(|y|\) is short (e.g., a single number), the model has minimal opportunity to perform iterative computation. The entire reasoning must occur within \(T\) layers operating in parallel across positions.

Chain-of-Thought Prediction

Chain-of-thought introduces intermediate reasoning tokens \(z_1, \ldots, z_n\) before the final answer: \[ P(y | x) = \sum_{z_1, \ldots, z_n} P(y | x, z_1, \ldots, z_n) \cdot P(z_1, \ldots, z_n | x) \] In practice, we don't marginalize over all possible chains; we sample or greedily decode a single chain. The factorization becomes: \[ P(y, z_1, \ldots, z_n | x) = P(y | x, z_1, \ldots, z_n) \cdot \prod_{i=1}^{n} P(z_i | x, z_1, \ldots, z_{i-1}) \] Each intermediate token \(z_i\) is generated with a full forward pass, and subsequent tokens can attend to it. This means the computation of \(y\) now has access to the results of \(n + 1\) forward passes rather than just one.
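As a numeric sanity check on this factorization, here is a minimal sketch (the per-step probabilities are invented for illustration, not taken from any real model): the joint probability of a decoded chain and its answer is simply the product of the conditional factors, usually accumulated in log space.

```python
import math

# Hypothetical conditional probabilities for one greedily decoded chain
# (illustrative numbers only).
factors = [
    0.90,  # P(z1 | x)
    0.85,  # P(z2 | x, z1)
    0.80,  # P(z3 | x, z1, z2)
    0.95,  # P(y  | x, z1, z2, z3)
]

# Product of the factors, accumulated in log space for numerical stability.
log_joint = sum(math.log(p) for p in factors)
joint = math.exp(log_joint)
```

Note that the joint probability of a long chain shrinks multiplicatively even when every step is individually likely, which is why decoding heuristics matter in practice.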

The Conditioning Effect

A subtle but important point: the chain-of-thought tokens don't just provide more computation; they also condition the final prediction. Consider the conditional probability: \[ P(y | x, z_1, \ldots, z_n) \] If the reasoning tokens \(z_i\) contain the correct intermediate results, the conditional distribution \(P(y | x, z_1, \ldots, z_n)\) becomes sharply peaked around the correct answer. The model no longer needs to perform complex reasoning; it merely needs to read off the conclusion from its scratchpad. This is analogous to the difference between:

  1. Solving \(23 \times 47\) from scratch
  2. Computing \(161 + 920\) given that these are the partial products

The second task is computationally far simpler because it conditions on the intermediate work.

Worked Example: Arithmetic

Consider computing \(23 \times 47\). Direct prediction requires the model to output "1081" after a single forward pass (per token) that can only attend to "23 × 47". Chain-of-thought might produce:

  • \(z_1\): "First, 23 × 7 = 161"
  • \(z_2\): "Then, 23 × 40 = 920"
  • \(z_3\): "Finally, 161 + 920 = 1081"
  • \(y\): "1081"

For the single decoded chain, the joint probability decomposes as: \[ P(\text{"1081"}, z_1, z_2, z_3 | x) = P(\text{"1081"} | x, z_1, z_2, z_3) \cdot P(z_3 | x, z_1, z_2) \cdot P(z_2 | x, z_1) \cdot P(z_1 | x) \] Each factor involves a forward pass where the model can attend to all preceding reasoning steps. The intermediate results (161, 920) are explicitly stored in the context and accessible for subsequent computation.
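The same chain can be written as straight-line code, each step depending only on the input and on previously stored intermediates; this is a sketch of the dependency structure, not of how the model internally computes:

```python
# Each z_i mirrors one reasoning step; later steps read earlier stored
# results rather than recomputing them.
z1 = 23 * 7    # "First, 23 x 7 = 161"
z2 = 23 * 40   # "Then, 23 x 40 = 920"
z3 = z1 + z2   # "Finally, 161 + 920 = 1081" (reads stored z1 and z2)
y = z3
```

The final step never touches the original operands; it only combines stored intermediates, which is exactly the conditioning effect described above.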

A Second Example: Multi-Step Word Problems

Consider: "Alice has 3 apples. She buys 5 more, then gives half to Bob. How many does Alice have?" Direct prediction: the model must compute \((3 + 5) / 2 = 4\) in a single forward pass. Chain-of-thought:

  • \(z_1\): "Alice starts with 3 apples"
  • \(z_2\): "After buying 5 more: 3 + 5 = 8 apples"
  • \(z_3\): "After giving half to Bob: 8 / 2 = 4 apples"
  • \(y\): "4"

The decomposition isolates each arithmetic operation to a separate forward pass. When generating \(z_3\), the model can attend to "8 apples" from \(z_2\), reducing the problem to a single division.
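The word problem's chain can likewise be sketched as a scratchpad of single-operation steps, each reading the most recently stored value:

```python
# A toy scratchpad: each entry is one reasoning step's stored result.
scratchpad = [3]                        # z1: Alice starts with 3 apples
scratchpad.append(scratchpad[-1] + 5)   # z2: 3 + 5 = 8 apples
scratchpad.append(scratchpad[-1] // 2)  # z3: 8 / 2 = 4 apples
answer = scratchpad[-1]                 # y: "4"
```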

The Expressiveness Argument

*Figure: How CoT scales effective computation depth with the number of reasoning steps.*

Computational Complexity of Transformers

A transformer with \(T\) layers, hidden dimension \(d\), and context length \(L\) can be viewed as a computational model. Following the analysis of Merrill and Sabharwal (2023), a single forward pass computes a function in the complexity class \(\mathsf{TC}^0\): constant-depth threshold circuits with polynomial size. More precisely, for a fixed architecture, the transformer computes a function \(f: \Sigma^* \to \Sigma^*\) where the circuit depth is \(O(T)\) and the width is \(O(L \cdot d)\). The class \(\mathsf{TC}^0\) is powerful (it includes addition, multiplication of fixed-precision numbers, and sorting), but it is not believed to include all polynomial-time computations. Crucially, under standard complexity-theoretic conjectures, \(\mathsf{TC}^0\) does not contain problems requiring a super-constant number of sequential steps that cannot be parallelized. This includes:

  • Evaluating Boolean formulas of unbounded depth
  • Computing iterated functions \(f^n(x) = f(f(\ldots f(x) \ldots))\)
  • Simulating arbitrary circuits with depth \(\omega(1)\)

Depth vs. Width Tradeoffs

Circuit complexity theory establishes fundamental tradeoffs between depth and width. A circuit with depth \(D\) and polynomial width can simulate computations requiring \(D\) sequential steps. However, increasing width alone cannot always compensate for insufficient depth. The intuition is that depth represents sequential computation time, while width represents parallel resources. Some computations are inherently sequential: they require results from step \(i\) before step \(i+1\) can proceed. For example, iterated composition of functions, computing \(f(f(\cdots f(x) \cdots))\) with \(k\) applications, requires depth \(\Omega(k)\) regardless of width, assuming each application of \(f\) requires at least one computational step. No amount of parallelism can accelerate strictly sequential dependencies.
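Iterated composition makes the sequential dependency concrete. In this sketch (the particular \(f\) is arbitrary), each application must wait for the previous result, so \(k\) applications take \(k\) dependent steps no matter how much parallel hardware is available:

```python
def f(x):
    # An arbitrary illustrative step function.
    return 3 * x + 1

def iterate(f, x, k):
    # Computes f^k(x): step i+1 cannot begin until step i has
    # produced its result, so the loop body is inherently serial.
    for _ in range(k):
        x = f(x)
    return x
```

No rearrangement exposes parallelism here: each call consumes the previous call's output.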

CoT as Depth Extension

Here's the key insight: autoregressive generation with \(K\) intermediate tokens effectively transforms a depth-\(T\) computation into a depth-\(KT\) computation. Let \(g_T: \mathcal{X} \to \mathcal{Y}\) denote the function computed by a \(T\)-layer transformer in one forward pass. When generating \(K\) tokens autoregressively, the model computes: \[ z_1 = g_T(x), \quad z_2 = g_T(x, z_1), \quad \ldots, \quad z_K = g_T(x, z_1, \ldots, z_{K-1}) \] The final output is a function of \(K\) sequential applications of \(g_T\), where each application can use the results of all previous applications. This yields an effective computational depth of \(K \cdot T\).

Theorem (informal): A transformer with \(T\) layers generating \(K\) tokens autoregressively can simulate any computation requiring \(O(KT)\) sequential steps, given sufficient hidden dimension and appropriate training.

This explains why CoT helps with tasks requiring deep sequential reasoning: it provides the necessary computational depth that a single forward pass cannot.
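A toy model of the construction, under the purely illustrative assumption that a single "forward pass" can perform only one addition: chaining K passes, each able to read everything generated so far, computes a sum that no single pass could.

```python
def g_T(context):
    # Stand-in for one bounded-depth forward pass: it may perform only
    # ONE addition, combining the running total (last generated token)
    # with the next unprocessed input.
    inputs, generated = context
    return generated[-1] + inputs[len(generated)]

def cot_sum(xs):
    generated = [xs[0]]              # z_1 just copies the first input
    while len(generated) < len(xs):
        z = g_T((xs, generated))     # one bounded-depth pass
        generated.append(z)          # z is stored and attendable later
    return generated[-1]             # the final token is the answer
```

Summing n numbers needs n - 1 dependent additions, so the single-operation "pass" must be invoked n - 1 times; the context plays the role of the growing chain of thought.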

Formal Statement

Let \(\mathsf{AUT}(T, K)\) denote the class of functions computable by a \(T\)-layer autoregressive transformer generating \(K\) tokens. We have: \[ \mathsf{TC}^0 \subsetneq \mathsf{AUT}(T, K) \text{ for sufficiently large } K \] In fact, \(\mathsf{AUT}(T, K)\) with unbounded \(K\) is Turing complete: it can simulate any computation given enough generation steps. This follows from the ability to simulate a Turing machine's tape using the context window while performing one computational step per token generated. Feng et al. (2023) formalize this precisely: for problems solvable by a Turing machine in \(t\) steps with tape size \(s\), a transformer with CoT can solve them using \(O(t)\) reasoning tokens and context length \(O(s)\).

The Role of Attention in Depth Extension

Attention mechanisms are essential for depth extension to work. When generating token \(z_{k+1}\), the model attends to all of \((x, z_1, \ldots, z_k)\). This allows:

  1. Information retrieval: Earlier computation results stored in \(z_i\) can be retrieved
  2. State tracking: The model can track multiple variables across generation steps
  3. Conditional branching: Different attention patterns can implement control flow

Without attention (e.g., in a pure recurrent model with bounded state), the model would be limited by its hidden state size. The context window essentially provides unbounded external memory accessible via attention.

Information-Theoretic View

Effective Compute and Serial Bottlenecks

From an information-theoretic perspective, consider the mutual information between input and output. For direct prediction: \[ I(X; Y) \leq \min(H(X), H(Y), C_{\text{forward}}) \] where \(C_{\text{forward}}\) represents the information-processing capacity of a single forward pass, bounded by the depth-width product of the network. For chain-of-thought: \[ I(X; Y) \leq I(X; Z_1, \ldots, Z_n, Y) \] The intermediate tokens can carry information that exceeds the capacity of a single forward pass, effectively increasing the channel capacity between input and output.

Data Processing Inequality and Its Circumvention

The data processing inequality states that for a Markov chain \(X \to Z \to Y\): \[ I(X; Y) \leq I(X; Z) \] Processing cannot create information. So how does CoT help? The key is that CoT doesn't violate the data processing inequality; it changes the computational structure. Without CoT, information flows as \(X \to h^{(T)} \to Y\), where \(h^{(T)}\) is the final hidden state. With CoT, information flows as: \[ X \to h_1^{(T)} \to z_1 \to h_2^{(T)} \to z_2 \to \cdots \to h_{n}^{(T)} \to z_n \to h_{n+1}^{(T)} \to Y \] Each \(h_i^{(T)}\) has access to all previous \(z_j\), creating a richer information pathway. The bottleneck is no longer a single forward pass but the accumulated computation across \(n+1\) passes.

The Scratchpad Hypothesis

Consider the human analogy: when solving a complex problem, we write intermediate results on paper rather than holding everything in working memory. The written notes serve as external storage that offloads cognitive burden. For transformers, the context window functions analogously. Each generated token becomes part of the model's "working memory," accessible via attention in subsequent forward passes. The chain-of-thought tokens form an external scratchpad.

Formally, let \(M\) denote the model's internal state capacity (bounded by hidden dimension \(d\)) and \(S\) denote the scratchpad size (number of reasoning tokens times vocabulary embedding dimension). The effective memory available for computation is: \[ M_{\text{effective}} = M + S \] For tasks requiring intermediate storage exceeding \(M\), the scratchpad is necessary for correct computation. Consider a task requiring tracking of \(k\) variables, each with \(b\) bits of precision. The total state requirement is \(kb\) bits. If the model's hidden state can only reliably encode \(M < kb\) bits, direct computation fails. But with a scratchpad, each variable can be explicitly written to context, enabling arbitrarily large \(k\).
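A toy illustration of the capacity argument, with a hypothetical internal limit of M stored values: a bounded "hidden state" silently evicts old variables once capacity is exceeded, while an append-only scratchpad retains every write.

```python
M = 4  # hypothetical internal capacity, measured in stored values

def track_bounded(updates):
    # Bounded "hidden state": evicts the oldest entry past capacity M.
    state = {}
    for name, value in updates:
        state[name] = value
        if len(state) > M:
            state.pop(next(iter(state)))  # the oldest variable is lost
    return state

def track_scratchpad(updates):
    # Scratchpad: every intermediate is written out; on retrieval the
    # last write for each variable wins.
    return dict(updates)
```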

Hidden State Propagation

When generating token \(z_{i+1}\), the model has access to all previous tokens \((x, z_1, \ldots, z_i)\) through attention. This means information computed during the generation of \(z_i\) can influence \(z_{i+1}\) in two ways:

  1. Explicit: The token \(z_i\) itself is attended to
  2. Implicit: The hidden states at each layer can propagate information through attention patterns

The explicit pathway is more robust: the token \(z_i\) is discretized and stored in a fixed vocabulary representation. The implicit pathway through attention provides continuous information flow but is more difficult to control and interpret. This dual pathway explains why well-structured chain-of-thought (with clearly stated intermediate results) outperforms vague or disorganized reasoning: explicit storage is more reliable and less prone to degradation across many steps.

When CoT Fails

Chain-of-thought is not universally beneficial. Understanding failure modes clarifies the mechanism.

Pure Recall Tasks

For factual recall ("What is the capital of France?"), CoT provides no benefit and may harm performance. The answer exists in model parameters and requires no sequential computation. Mathematically, if \(y = f(x)\) where \(f\) is directly stored in model weights and computable in one forward pass, then: \[ P(y | x) \geq P(y | x, z_1, \ldots, z_n) \cdot P(z_1, \ldots, z_n | x) \] The chain-of-thought introduces additional points of failure (incorrect reasoning tokens) without providing computational benefit.

Highly Parallel Problems

Tasks with inherently parallel structure don't benefit from serial decomposition. Consider checking whether each element of a list satisfies some property independently. This requires \(O(1)\) depth with \(O(n)\) width, exactly what a transformer provides in one pass. CoT would serialize inherently parallel checks:

  • Check element 1: satisfies
  • Check element 2: satisfies
  • ...

This increases computation time without improving accuracy, since the parallel checks don't benefit from intermediate results. The key diagnostic: if a problem decomposes into independent subproblems with no information flow between them, CoT adds latency without benefit.

Insufficient Model Capability

CoT requires the model to:

  1. Identify the appropriate reasoning steps
  2. Execute each step correctly
  3. Integrate results across steps

If the model cannot perform step (1), problem decomposition, no amount of computation helps. This explains why CoT shows larger benefits for larger models: they have better problem decomposition abilities. Formally, let \(\pi(x)\) denote the "reasoning plan" for input \(x\). The model must learn \(P(\pi | x)\). If this distribution is poorly learned (small models, out-of-distribution problems), CoT fails at the planning stage before computation even begins.

Reasoning Token Errors Propagate

A critical failure mode: errors in intermediate tokens propagate to the final answer. If \(z_i\) is incorrect: \[ P(y_{\text{correct}} | x, z_1, \ldots, z_{i-1}, z_i^{\text{wrong}}, \ldots) < P(y_{\text{correct}} | x, z_1, \ldots, z_{i-1}, z_i^{\text{correct}}, \ldots) \] For chains of length \(n\) where each step has accuracy \(p\), the overall accuracy degrades as approximately \(p^n\) in the worst case (when errors are independent and fatal). This explains why self-consistency (sampling multiple chains and voting; Wang et al., 2023) helps: it provides robustness against intermediate errors by effectively ensembling over reasoning paths.
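Both effects can be sketched numerically. The model below assumes step errors are independent and fatal, and (optimistically) that incorrect chains never converge on the same wrong answer:

```python
import math

def chain_accuracy(p, n):
    # Worst-case accuracy of an n-step chain with per-step accuracy p:
    # every step must be correct.
    return p ** n

def self_consistency(q, m):
    # Probability that a majority of m independent chains, each correct
    # with probability q, votes for the right answer (assumes wrong
    # chains scatter across different wrong answers).
    return sum(math.comb(m, i) * q**i * (1 - q)**(m - i)
               for i in range(m // 2 + 1, m + 1))
```

When the per-chain accuracy exceeds 0.5 the vote amplifies it (five chains at 70% vote correctly about 84% of the time); below 0.5, voting amplifies the error instead.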

Length Generalization Failures

CoT can fail when problems require longer chains than seen during training. The model may have learned to reason for \(k\) steps but fails at \(k+m\) steps. This is a form of distributional shift in the "reasoning length" dimension.

Empirical Evidence

*Figure: GSM8K accuracy, direct vs. chain-of-thought, by model size.*

Arithmetic and Symbolic Reasoning

Nye et al. (2021) demonstrated that "scratchpad" prompting dramatically improves arithmetic performance. On 8-digit addition:

  • Direct: ~0% accuracy
  • With scratchpad: >95% accuracy

The scratchpad allows the model to perform carry operations sequentially, storing intermediate sums in the context. Without the scratchpad, the model must somehow compute all carries in parallel within a single forward pass—a task exceeding its architectural capacity. Similar patterns emerge in symbolic reasoning tasks:

  • Parity of a bit string: CoT allows iterative XOR
  • Last letter concatenation: CoT enables sequential character extraction
  • Coin flip tracking: CoT maintains explicit state
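The parity case is the cleanest of these to sketch: one XOR per step, each reading the previously stored running parity, in contrast to folding all the bits inside a single pass.

```python
def parity_cot(bits):
    # CoT-style parity: the trace plays the role of generated reasoning
    # tokens, each step performing a single XOR against the stored state.
    trace = [0]
    for b in bits:
        trace.append(trace[-1] ^ b)
    return trace[-1], trace
```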

GSM8K

The GSM8K benchmark (Cobbe et al., 2021) contains grade-school math word problems requiring multi-step reasoning. Results show clear CoT benefits:

| Method | Accuracy |
|--------|----------|
| Direct answer (175B) | 18% |
| Chain-of-thought (175B) | 57% |
| CoT + self-consistency | 74% |
| CoT + verifier (trained) | 78% |

The problems require multi-step arithmetic wrapped in natural language: exactly the setting where CoT provides computational depth benefits. The jump from 18% to 57% cannot be explained by prompt engineering alone; it reflects a fundamental computational capability gained through intermediate generation.

BIG-Bench Hard

Suzgun et al. (2022) evaluated CoT on challenging BIG-Bench tasks:

  • Average improvement with CoT: +13.5 percentage points
  • Largest gains on: multi-step arithmetic, logical deduction, tracking shuffled objects
  • Smallest gains on: sarcasm detection, movie recommendations (no sequential computation needed)

The pattern matches theoretical predictions: CoT helps tasks requiring serial computation depth. The correlation between "number of reasoning steps required" and "CoT improvement" is strong across the benchmark.

Scaling Analysis

Wei et al. (2022) showed that CoT benefits emerge with scale:

  • Models <10B parameters: minimal CoT benefit
  • Models 10-100B parameters: moderate CoT benefit
  • Models >100B parameters: substantial CoT benefit

This supports the interpretation that CoT requires baseline capabilities in problem decomposition that emerge at scale. Small models cannot identify the correct reasoning steps; thus, additional computation from CoT cannot help. The emergence pattern suggests a capability threshold: below some model size, CoT is useless; above it, CoT provides increasing returns.

Ablation Studies

Several ablation studies isolate the mechanisms:

  • Token content matters: Replacing reasoning tokens with "...." (same length, no content) eliminates benefits. This confirms that explicit information storage, not just additional forward passes, drives improvement.
  • Structure matters: Unstructured reasoning ("hmm, let me think...") underperforms structured step-by-step decomposition. This supports the explicit storage hypothesis.
  • Relevance matters: Irrelevant reasoning tokens can hurt performance, consistent with the error propagation analysis.

Practical Implications

When to Use CoT

Based on the analysis, CoT is most beneficial when:

  1. The task requires multi-step reasoning with dependencies between steps
  2. Intermediate results need to be stored and referenced
  3. The model has sufficient capability to decompose the problem
  4. The reasoning chain is relatively short (limiting error propagation)
  5. The task is not purely parallel or purely recall-based

Prompt Engineering

Effective CoT prompts should:

  • Explicitly state intermediate results (leverage the explicit storage pathway)
  • Use consistent notation for values being tracked
  • Decompose into steps that are individually tractable for the model
  • Demonstrate the target reasoning structure via few-shot examples

The format of reasoning matters. "Let x = 5. Then x + 3 = 8." is more reliable than "thinking about numbers... probably 8."

Architectural Implications

The analysis suggests architectural research directions:

  • Adaptive computation: Allow models to allocate more compute to harder problems dynamically
  • External memory: Provide explicit scratchpad mechanisms beyond context windows
  • Error correction: Build in mechanisms to detect and recover from reasoning errors (verification heads, backtracking)
  • Latent reasoning: Enable "thinking" in continuous space rather than discrete tokens to reduce error propagation

Computational Cost

CoT increases inference cost linearly with chain length. For a chain of \(K\) tokens, cost scales as \(O(K \cdot L \cdot T)\) where \(L\) is context length and \(T\) is model depth. This is acceptable when improved accuracy justifies the additional compute, but not for latency-sensitive applications requiring simple computations. A practical heuristic: estimate the "reasoning depth" required and use CoT only when it exceeds the model's layer count divided by some constant factor.
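As a rough back-of-the-envelope, the scaling above can be written as a cost proxy (constants, KV caching, and the quadratic attention term are deliberately ignored, so treat it as an order-of-magnitude estimate):

```python
def inference_cost(K, L, T):
    # Proxy for total compute: K generated tokens, each one forward
    # pass of depth T over a context of length ~L.
    return K * L * T

# Under this proxy, a 40-token chain costs ~40x a one-token direct answer.
ratio = inference_cost(40, 512, 48) / inference_cost(1, 512, 48)
```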

Conclusion

Chain-of-thought prompting improves model inference by addressing a fundamental architectural limitation: fixed computational depth in transformer forward passes. By generating intermediate reasoning tokens, CoT effectively extends the serial computation available from \(T\) layers to \(KT\) effective steps, where \(K\) is the number of reasoning tokens.

The mechanism operates through two pathways: explicit storage of intermediate results in generated tokens, and implicit information flow through attention patterns. The explicit pathway is more robust and benefits from well-structured reasoning chains.

CoT is not universally beneficial: it fails for pure recall tasks, highly parallel problems, and when models lack decomposition capabilities. Error propagation through reasoning chains remains a practical limitation, motivating techniques like self-consistency and learned verifiers.

The theoretical framework, viewing CoT as computation depth extension and external working memory, explains empirical patterns across arithmetic, symbolic reasoning, and mathematical word problem benchmarks. It also suggests that future architectures with adaptive computation or explicit memory mechanisms may reduce reliance on lengthy reasoning chains while preserving their benefits.

Understanding why CoT works enables more principled application: use it for tasks requiring serial depth, structure chains for explicit intermediate storage, and consider self-consistency for robustness against reasoning errors.

References

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*.

Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., & Wang, L. (2023). Towards revealing the mystery behind chain of thought: A theoretical perspective. *Advances in Neural Information Processing Systems*, 36.

Merrill, W., & Sabharwal, A. (2023). The parallelism tradeoff: Limitations of log-precision transformers. *Transactions of the Association for Computational Linguistics*, 11, 531-545.

Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Biber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., & Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models. *arXiv preprint arXiv:2112.00114*.

Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., & Wei, J. (2022). Challenging BIG-bench tasks and whether chain-of-thought can solve them. *arXiv preprint arXiv:2210.09261*.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. *International Conference on Learning Representations*.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. *Advances in Neural Information Processing Systems*, 35, 24824-24837.