Mar 6, 2026 · Foundations

Why Does Chain-of-Thought Improve Model Inference Ability?

A formal analysis of how chain-of-thought prompting expands effective computation depth in transformers, with information-theoretic bounds and empirical evidence from reasoning benchmarks.


Abstract

Chain-of-thought (CoT) prompting has emerged as a technique that improves the reasoning capabilities of large language models on complex tasks. This post provides a formal analysis of why CoT works from computational and information-theoretic perspectives. We formalize the distinction between direct prediction \(P(y | x)\) and chain-of-thought prediction \(P(y | x, z_1, \ldots, z_n)\), where the \(z_i\) represent intermediate reasoning tokens. We present the expressiveness argument showing how CoT effectively increases the serial computation depth available to transformer architectures, derive the role of autoregressive generation in enabling intermediate variable storage, and analyze the conditions under which CoT succeeds or fails. Empirical evidence from arithmetic, symbolic reasoning, and mathematical word problem benchmarks supports the theoretical framework. The analysis suggests that CoT's effectiveness stems from its function as external working memory that circumvents architectural limitations on serial computation depth.

The Inference Problem

*Figure: Direct prediction vs. chain-of-thought computation depth.*

Consider a language model tasked with answering a question. Given input \(x\), the model must produce output \(y\). In the standard paradigm, this requires computing: \[ y^* = \arg\max_y P(y | x) \] For a transformer with \(T\) layers, each forward pass performs \(T\) sequential computational steps. The hidden state at layer \(\ell\) can be written as: \[ h^{(\ell)} = f^{(\ell)}(h^{(\ell-1)}, x) \] where \(f^{(\ell)}\) represents the computation at layer \(\ell\) (self-attention followed by a feed-forward network). The final prediction depends on \(h^{(T)}\), which has undergone exactly \(T\) serial transformations of the input.

This creates a fundamental constraint: any computation the model performs must be expressible within \(T\) sequential steps. While each layer has substantial width (the hidden dimension \(d\)) allowing parallel computation, the depth is fixed at \(T\). Many reasoning tasks, however, require computations with depth exceeding \(T\). Consider multiplying two \(n\)-digit numbers: the standard algorithm requires \(O(n^2)\) sequential operations in the worst case, or \(O(n \log n)\) with advanced methods. For sufficiently large \(n\), this exceeds what a fixed-depth transformer can compute in a single forward pass.

The fundamental question becomes: how can a fixed-depth architecture solve problems requiring arbitrary computational depth? Chain-of-thought provides one answer.

Direct Prediction vs. Chain-of-Thought

Direct Prediction

In direct prediction, the model maps input directly to output: \[ P(y | x) = \prod_{i=1}^{|y|} P(y_i | x, y_1, \ldots, y_{i-1}) \] Each token of the output is generated conditioned on the input and previously generated output tokens. The key observation is that generating each \(y_i\) involves exactly one forward pass through the \(T\)-layer network. For problems where the answer length \(|y|\) is short (e.g., a single number), the model has minimal opportunity to perform iterative computation. The entire reasoning must occur within \(T\) layers operating in parallel across positions.

Chain-of-Thought Prediction

Chain-of-thought introduces intermediate reasoning tokens \(z_1, \ldots, z_n\) before the final answer: \[ P(y | x) = \sum_{z_1, \ldots, z_n} P(y | x, z_1, \ldots, z_n) \cdot P(z_1, \ldots, z_n | x) \] In practice, we don't marginalize over all possible chains; we sample or greedily decode a single chain. The factorization becomes: \[ P(y, z_1, \ldots, z_n | x) = P(y | x, z_1, \ldots, z_n) \cdot \prod_{i=1}^{n} P(z_i | x, z_1, \ldots, z_{i-1}) \] Each intermediate token \(z_i\) is generated with a full forward pass, and subsequent tokens can attend to it. This means the computation of \(y\) now has access to the results of \(n + 1\) forward passes rather than just one.
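As a numeric sanity check on this factorization, here is a minimal sketch (the per-step probabilities are invented for illustration, not taken from any real model): the joint probability of a decoded chain and its answer is simply the product of the conditional factors, usually accumulated in log space.

```python
import math

# Hypothetical conditional probabilities for one greedily decoded chain
# (illustrative numbers only).
factors = [
    0.90,  # P(z1 | x)
    0.85,  # P(z2 | x, z1)
    0.80,  # P(z3 | x, z1, z2)
    0.95,  # P(y  | x, z1, z2, z3)
]

# Product of the factors, accumulated in log space for numerical stability.
log_joint = sum(math.log(p) for p in factors)
joint = math.exp(log_joint)
```

Note that the joint probability of a long chain shrinks multiplicatively even when every step is individually likely, which is why decoding heuristics matter in practice.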

The Conditioning Effect

A subtle but important point: the chain-of-thought tokens don't just provide more computation; they also condition the final prediction. Consider the conditional probability: \[ P(y | x, z_1, \ldots, z_n) \] If the reasoning tokens \(z_i\) contain the correct intermediate results, the conditional distribution \(P(y | x, z_1, \ldots, z_n)\) becomes sharply peaked around the correct answer. The model no longer needs to perform complex reasoning; it merely needs to read off the conclusion from its scratchpad. This is analogous to the difference between:

  1. Solving \(23 \times 47\) from scratch
  2. Computing \(161 + 920\) given that these are the partial products

The second task is computationally far simpler because it conditions on the intermediate work.

Worked Example: Arithmetic

Consider computing \(23 \times 47\). Direct prediction requires the model to output "1081" after a single forward pass (per token) that can only attend to "23 × 47". Chain-of-thought might produce:

  • \(z_1\): "First, 23 × 7 = 161"
  • \(z_2\): "Then, 23 × 40 = 920"
  • \(z_3\): "Finally, 161 + 920 = 1081"
  • \(y\): "1081"

For the single decoded chain, the joint probability decomposes as: \[ P(\text{"1081"}, z_1, z_2, z_3 | x) = P(\text{"1081"} | x, z_1, z_2, z_3) \cdot P(z_3 | x, z_1, z_2) \cdot P(z_2 | x, z_1) \cdot P(z_1 | x) \] Each factor involves a forward pass where the model can attend to all preceding reasoning steps. The intermediate results (161, 920) are explicitly stored in the context and accessible for subsequent computation.
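The same chain can be written as straight-line code, each step depending only on the input and on previously stored intermediates; this is a sketch of the dependency structure, not of how the model internally computes:

```python
# Each z_i mirrors one reasoning step; later steps read earlier stored
# results rather than recomputing them.
z1 = 23 * 7    # "First, 23 x 7 = 161"
z2 = 23 * 40   # "Then, 23 x 40 = 920"
z3 = z1 + z2   # "Finally, 161 + 920 = 1081" (reads stored z1 and z2)
y = z3
```

The final step never touches the original operands; it only combines stored intermediates, which is exactly the conditioning effect described above.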

A Second Example: Multi-Step Word Problems

Consider: "Alice has 3 apples. She buys 5 more, then gives half to Bob. How many does Alice have?" Direct prediction: the model must compute \((3 + 5) / 2 = 4\) in a single forward pass. Chain-of-thought:

  • \(z_1\): "Alice starts with 3 apples"
  • \(z_2\): "After buying 5 more: 3 + 5 = 8 apples"
  • \(z_3\): "After giving half to Bob: 8 / 2 = 4 apples"
  • \(y\): "4"

The decomposition isolates each arithmetic operation to a separate forward pass. When generating \(z_3\), the model can attend to "8 apples" from \(z_2\), reducing the problem to a single division.
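The word problem's chain can likewise be sketched as a scratchpad of single-operation steps, each reading the most recently stored value:

```python
# A toy scratchpad: each entry is one reasoning step's stored result.
scratchpad = [3]                        # z1: Alice starts with 3 apples
scratchpad.append(scratchpad[-1] + 5)   # z2: 3 + 5 = 8 apples
scratchpad.append(scratchpad[-1] // 2)  # z3: 8 / 2 = 4 apples
answer = scratchpad[-1]                 # y: "4"
```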

The Expressiveness Argument

*Figure: How CoT scales effective computation depth with the number of reasoning steps.*

Computational Complexity of Transformers

A transformer with \(T\) layers, hidden dimension \(d\), and context length \(L\) can be viewed as a computational model. Following the analysis of Merrill and Sabharwal (2023), a single forward pass computes a function in the complexity class \(\mathsf{TC}^0\): constant-depth threshold circuits with polynomial size. More precisely, for a fixed architecture, the transformer computes a function \(f: \Sigma^* \to \Sigma^*\) where the circuit depth is \(O(T)\) and the width is \(O(L \cdot d)\). The class \(\mathsf{TC}^0\) is powerful (it includes addition, multiplication of fixed-precision numbers, and sorting), but it is not believed to include all polynomial-time computations. Crucially, under standard complexity-theoretic conjectures, \(\mathsf{TC}^0\) does not contain problems requiring a super-constant number of sequential steps that cannot be parallelized. This includes:

  • Evaluating Boolean formulas of unbounded depth
  • Computing iterated functions \(f^n(x) = f(f(\ldots f(x) \ldots))\)
  • Simulating arbitrary circuits with depth \(\omega(1)\)

Depth vs. Width Tradeoffs

Circuit complexity theory establishes fundamental tradeoffs between depth and width. A circuit with depth \(D\) and polynomial width can simulate computations requiring \(D\) sequential steps. However, increasing width alone cannot always compensate for insufficient depth. The intuition is that depth represents sequential computation time, while width represents parallel resources. Some computations are inherently sequential: they require results from step \(i\) before step \(i+1\) can proceed. For example, iterated composition of functions, computing \(f(f(\cdots f(x) \cdots))\) with \(k\) applications, requires depth \(\Omega(k)\) regardless of width, assuming each application of \(f\) requires at least one computational step. No amount of parallelism can accelerate strictly sequential dependencies.
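Iterated composition makes the sequential dependency concrete. In this sketch (the particular \(f\) is arbitrary), each application must wait for the previous result, so \(k\) applications take \(k\) dependent steps no matter how much parallel hardware is available:

```python
def f(x):
    # An arbitrary illustrative step function.
    return 3 * x + 1

def iterate(f, x, k):
    # Computes f^k(x): step i+1 cannot begin until step i has
    # produced its result, so the loop body is inherently serial.
    for _ in range(k):
        x = f(x)
    return x
```

No rearrangement exposes parallelism here: each call consumes the previous call's output.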

CoT as Depth Extension

Here's the key insight: autoregressive generation with \(K\) intermediate tokens effectively transforms a depth-\(T\) computation into a depth-\(KT\) computation. Let \(g_T: \mathcal{X} \to \mathcal{Y}\) denote the function computed by a \(T\)-layer transformer in one forward pass. When generating \(K\) tokens autoregressively, the model computes: \[ z_1 = g_T(x), \quad z_2 = g_T(x, z_1), \quad \ldots, \quad z_K = g_T(x, z_1, \ldots, z_{K-1}) \] The final output is a function of \(K\) sequential applications of \(g_T\), where each application can use the results of all previous applications. This yields an effective computational depth of \(K \cdot T\).

Theorem (informal): A transformer with \(T\) layers generating \(K\) tokens autoregressively can simulate any computation requiring \(O(KT)\) sequential steps, given sufficient hidden dimension and appropriate training.

This explains why CoT helps with tasks requiring deep sequential reasoning: it provides the necessary computational depth that a single forward pass cannot.
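A toy model of the construction, under the purely illustrative assumption that a single "forward pass" can perform only one addition: chaining K passes, each able to read everything generated so far, computes a sum that no single pass could.

```python
def g_T(context):
    # Stand-in for one bounded-depth forward pass: it may perform only
    # ONE addition, combining the running total (last generated token)
    # with the next unprocessed input.
    inputs, generated = context
    return generated[-1] + inputs[len(generated)]

def cot_sum(xs):
    generated = [xs[0]]              # z_1 just copies the first input
    while len(generated) < len(xs):
        z = g_T((xs, generated))     # one bounded-depth pass
        generated.append(z)          # z is stored and attendable later
    return generated[-1]             # the final token is the answer
```

Summing n numbers needs n - 1 dependent additions, so the single-operation "pass" must be invoked n - 1 times; the context plays the role of the growing chain of thought.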

Formal Statement

Let \(\mathsf{AUT}(T, K)\) denote the class of functions computable by a \(T\)-layer autoregressive transformer generating \(K\) tokens. We have: \[ \mathsf{TC}^0 \subsetneq \mathsf{AUT}(T, K) \text{ for sufficiently large } K \] In fact, \(\mathsf{AUT}(T, K)\) with unbounded \(K\) is Turing complete: it can simulate any computation given enough generation steps. This follows from the ability to simulate a Turing machine's tape using the context window while performing one computational step per token generated. Feng et al. (2023) formalize this precisely: for problems solvable by a Turing machine in \(t\) steps with tape size \(s\), a transformer with CoT can solve them using \(O(t)\) reasoning tokens and context length \(O(s)\).

The Role of Attention in Depth Extension

Attention mechanisms are essential for depth extension to work. When generating token \(z_{k+1}\), the model attends to all of \((x, z_1, \ldots, z_k)\). This allows:

  1. Information retrieval: Earlier computation results stored in \(z_i\) can be retrieved
  2. State tracking: The model can track multiple variables across generation steps
  3. Conditional branching: Different attention patterns can implement control flow

Without attention (e.g., in a pure recurrent model with bounded state), the model would be limited by its hidden state size. The context window essentially provides unbounded external memory accessible via attention.

Information-Theoretic View

Effective Compute and Serial Bottlenecks

From an information-theoretic perspective, consider the mutual information between input and output. For direct prediction: \[ I(X; Y) \leq \min(H(X), H(Y), C_{\text{forward}}) \] where \(C_{\text{forward}}\) represents the information-processing capacity of a single forward pass, bounded by the depth-width product of the network. For chain-of-thought: \[ I(X; Y) \leq I(X; Z_1, \ldots, Z_n, Y) \] The intermediate tokens can carry information that exceeds the capacity of a single forward pass, effectively increasing the channel capacity between input and output.

Data Processing Inequality and Its Circumvention

The data processing inequality states that for a Markov chain \(X \to Z \to Y\): \[ I(X; Y) \leq I(X; Z) \] Processing cannot create information. So how does CoT help? The key is that CoT doesn't violate the data processing inequality; it changes the computational structure. Without CoT, information flows as \(X \to h^{(T)} \to Y\), where \(h^{(T)}\) is the final hidden state. With CoT, information flows as: \[ X \to h_1^{(T)} \to z_1 \to h_2^{(T)} \to z_2 \to \cdots \to h_{n}^{(T)} \to z_n \to h_{n+1}^{(T)} \to Y \] Each \(h_i^{(T)}\) has access to all previous \(z_j\), creating a richer information pathway. The bottleneck is no longer a single forward pass but the accumulated computation across \(n+1\) passes.

The Scratchpad Hypothesis

Consider the human analogy: when solving a complex problem, we write intermediate results on paper rather than holding everything in working memory. The written notes serve as external storage that offloads cognitive burden. For transformers, the context window functions analogously. Each generated token becomes part of the model's "working memory," accessible via attention in subsequent forward passes. The chain-of-thought tokens form an external scratchpad.

Formally, let \(M\) denote the model's internal state capacity (bounded by hidden dimension \(d\)) and \(S\) denote the scratchpad size (number of reasoning tokens times vocabulary embedding dimension). The effective memory available for computation is: \[ M_{\text{effective}} = M + S \] For tasks requiring intermediate storage exceeding \(M\), the scratchpad is necessary for correct computation. Consider a task requiring tracking of \(k\) variables, each with \(b\) bits of precision. The total state requirement is \(kb\) bits. If the model's hidden state can only reliably encode \(M < kb\) bits, direct computation fails. But with a scratchpad, each variable can be explicitly written to context, enabling arbitrarily large \(k\).
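A toy illustration of the capacity argument, with a hypothetical internal limit of M stored values: a bounded "hidden state" silently evicts old variables once capacity is exceeded, while an append-only scratchpad retains every write.

```python
M = 4  # hypothetical internal capacity, measured in stored values

def track_bounded(updates):
    # Bounded "hidden state": evicts the oldest entry past capacity M.
    state = {}
    for name, value in updates:
        state[name] = value
        if len(state) > M:
            state.pop(next(iter(state)))  # the oldest variable is lost
    return state

def track_scratchpad(updates):
    # Scratchpad: every intermediate is written out; on retrieval the
    # last write for each variable wins.
    return dict(updates)
```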

Hidden State Propagation

When generating token \(z_{i+1}\), the model has access to all previous tokens \((x, z_1, \ldots, z_i)\) through attention. This means information computed during the generation of \(z_i\) can influence \(z_{i+1}\) in two ways:

  1. Explicit: The token \(z_i\) itself is attended to
  2. Implicit: The hidden states at each layer can propagate information through attention patterns

The explicit pathway is more robust: the token \(z_i\) is discretized and stored in a fixed vocabulary representation. The implicit pathway through attention provides continuous information flow but is more difficult to control and interpret. This dual pathway explains why well-structured chain-of-thought (with clearly stated intermediate results) outperforms vague or disorganized reasoning: explicit storage is more reliable and less prone to degradation across many steps.

When CoT Fails

Chain-of-thought is not universally beneficial. Understanding failure modes clarifies the mechanism.

Pure Recall Tasks

For factual recall ("What is the capital of France?"), CoT provides no benefit and may harm performance. The answer exists in model parameters and requires no sequential computation. Mathematically, if \(y = f(x)\) where \(f\) is directly stored in model weights and computable in one forward pass, then: \[ P(y | x) \geq P(y | x, z_1, \ldots, z_n) \cdot P(z_1, \ldots, z_n | x) \] The chain-of-thought introduces additional points of failure (incorrect reasoning tokens) without providing computational benefit.

Highly Parallel Problems

Tasks with inherently parallel structure don't benefit from serial decomposition. Consider checking whether each element of a list satisfies some property independently. This requires \(O(1)\) depth with \(O(n)\) width, exactly what a transformer provides in one pass. CoT would serialize inherently parallel checks:

  • Check element 1: satisfies
  • Check element 2: satisfies
  • ...

This increases computation time without improving accuracy, since the parallel checks don't benefit from intermediate results. The key diagnostic: if a problem decomposes into independent subproblems with no information flow between them, CoT adds latency without benefit.

Insufficient Model Capability

CoT requires the model to:

  1. Identify the appropriate reasoning steps
  2. Execute each step correctly
  3. Integrate results across steps

If the model cannot perform step (1), problem decomposition, no amount of computation helps. This explains why CoT shows larger benefits for larger models: they have better problem decomposition abilities. Formally, let \(\pi(x)\) denote the "reasoning plan" for input \(x\). The model must learn \(P(\pi | x)\). If this distribution is poorly learned (small models, out-of-distribution problems), CoT fails at the planning stage before computation even begins.

Reasoning Token Errors Propagate

A critical failure mode: errors in intermediate tokens propagate to the final answer. If \(z_i\) is incorrect: \[ P(y_{\text{correct}} | x, z_1, \ldots, z_{i-1}, z_i^{\text{wrong}}, \ldots) < P(y_{\text{correct}} | x, z_1, \ldots, z_{i-1}, z_i^{\text{correct}}, \ldots) \] For chains of length \(n\) where each step has accuracy \(p\), the overall accuracy degrades as approximately \(p^n\) in the worst case (when errors are independent and fatal). This explains why self-consistency (sampling multiple chains and voting; Wang et al., 2023) helps: it provides robustness against intermediate errors by effectively ensembling over reasoning paths.
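Both effects can be sketched numerically. The model below assumes step errors are independent and fatal, and (optimistically) that incorrect chains never converge on the same wrong answer:

```python
import math

def chain_accuracy(p, n):
    # Worst-case accuracy of an n-step chain with per-step accuracy p:
    # every step must be correct.
    return p ** n

def self_consistency(q, m):
    # Probability that a majority of m independent chains, each correct
    # with probability q, votes for the right answer (assumes wrong
    # chains scatter across different wrong answers).
    return sum(math.comb(m, i) * q**i * (1 - q)**(m - i)
               for i in range(m // 2 + 1, m + 1))
```

When the per-chain accuracy exceeds 0.5 the vote amplifies it (five chains at 70% vote correctly about 84% of the time); below 0.5, voting amplifies the error instead.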

Length Generalization Failures

CoT can fail when problems require longer chains than seen during training. The model may have learned to reason for \(k\) steps but fails at \(k+m\) steps. This is a form of distributional shift in the "reasoning length" dimension.

Empirical Evidence

*Figure: GSM8K accuracy, direct vs. chain-of-thought, by model size.*

Arithmetic and Symbolic Reasoning

Nye et al. (2021) demonstrated that "scratchpad" prompting dramatically improves arithmetic performance. On 8-digit addition:

  • Direct: ~0% accuracy
  • With scratchpad: >95% accuracy

The scratchpad allows the model to perform carry operations sequentially, storing intermediate sums in the context. Without the scratchpad, the model must somehow compute all carries in parallel within a single forward pass—a task exceeding its architectural capacity. Similar patterns emerge in symbolic reasoning tasks:

  • Parity of a bit string: CoT allows iterative XOR
  • Last letter concatenation: CoT enables sequential character extraction
  • Coin flip tracking: CoT maintains explicit state
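The parity case is the cleanest of these to sketch: one XOR per step, each reading the previously stored running parity, in contrast to folding all the bits inside a single pass.

```python
def parity_cot(bits):
    # CoT-style parity: the trace plays the role of generated reasoning
    # tokens, each step performing a single XOR against the stored state.
    trace = [0]
    for b in bits:
        trace.append(trace[-1] ^ b)
    return trace[-1], trace
```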

GSM8K

The GSM8K benchmark (Cobbe et al., 2021) contains grade-school math word problems requiring multi-step reasoning. Results show clear CoT benefits:

| Method | Accuracy |
|--------|----------|
| Direct answer (175B) | 18% |
| Chain-of-thought (175B) | 57% |
| CoT + self-consistency | 74% |
| CoT + verifier (trained) | 78% |

The problems require multi-step arithmetic wrapped in natural language: exactly the setting where CoT provides computational depth benefits. The jump from 18% to 57% cannot be explained by prompt engineering alone; it reflects a fundamental computational capability gained through intermediate generation.

BIG-Bench Hard

Suzgun et al. (2022) evaluated CoT on challenging BIG-Bench tasks:

  • Average improvement with CoT: +13.5 percentage points
  • Largest gains on: multi-step arithmetic, logical deduction, tracking shuffled objects
  • Smallest gains on: sarcasm detection, movie recommendations (no sequential computation needed)

The pattern matches theoretical predictions: CoT helps tasks requiring serial computation depth. The correlation between "number of reasoning steps required" and "CoT improvement" is strong across the benchmark.

Scaling Analysis

Wei et al. (2022) showed that CoT benefits emerge with scale:

  • Models <10B parameters: minimal CoT benefit
  • Models 10-100B parameters: moderate CoT benefit
  • Models >100B parameters: substantial CoT benefit

This supports the interpretation that CoT requires baseline capabilities in problem decomposition that emerge at scale. Small models cannot identify the correct reasoning steps; thus, additional computation from CoT cannot help. The emergence pattern suggests a capability threshold: below some model size, CoT is useless; above it, CoT provides increasing returns.

Ablation Studies

Several ablation studies isolate the mechanisms:

  • Token content matters: Replacing reasoning tokens with "...." (same length, no content) eliminates benefits. This confirms that explicit information storage, not just additional forward passes, drives improvement.
  • Structure matters: Unstructured reasoning ("hmm, let me think...") underperforms structured step-by-step decomposition. This supports the explicit storage hypothesis.
  • Relevance matters: Irrelevant reasoning tokens can hurt performance, consistent with the error propagation analysis.

Practical Implications

When to Use CoT

Based on the analysis, CoT is most beneficial when:

  1. The task requires multi-step reasoning with dependencies between steps
  2. Intermediate results need to be stored and referenced
  3. The model has sufficient capability to decompose the problem
  4. The reasoning chain is relatively short (limiting error propagation)
  5. The task is not purely parallel or purely recall-based

Prompt Engineering

Effective CoT prompts should:

  • Explicitly state intermediate results (leverage the explicit storage pathway)
  • Use consistent notation for values being tracked
  • Decompose into steps that are individually tractable for the model
  • Demonstrate the target reasoning structure via few-shot examples

The format of reasoning matters. "Let x = 5. Then x + 3 = 8." is more reliable than "thinking about numbers... probably 8."

Architectural Implications

The analysis suggests architectural research directions:

  • Adaptive computation: Allow models to allocate more compute to harder problems dynamically
  • External memory: Provide explicit scratchpad mechanisms beyond context windows
  • Error correction: Build in mechanisms to detect and recover from reasoning errors (verification heads, backtracking)
  • Latent reasoning: Enable "thinking" in continuous space rather than discrete tokens to reduce error propagation

Computational Cost

CoT increases inference cost linearly with chain length. For a chain of \(K\) tokens, cost scales as \(O(K \cdot L \cdot T)\) where \(L\) is context length and \(T\) is model depth. This is acceptable when improved accuracy justifies the additional compute, but not for latency-sensitive applications requiring simple computations. A practical heuristic: estimate the "reasoning depth" required and use CoT only when it exceeds the model's layer count divided by some constant factor.
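As a rough back-of-the-envelope, the scaling above can be written as a cost proxy (constants, KV caching, and the quadratic attention term are deliberately ignored, so treat it as an order-of-magnitude estimate):

```python
def inference_cost(K, L, T):
    # Proxy for total compute: K generated tokens, each one forward
    # pass of depth T over a context of length ~L.
    return K * L * T

# Under this proxy, a 40-token chain costs ~40x a one-token direct answer.
ratio = inference_cost(40, 512, 48) / inference_cost(1, 512, 48)
```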

Conclusion

Chain-of-thought prompting improves model inference by addressing a fundamental architectural limitation: fixed computational depth in transformer forward passes. By generating intermediate reasoning tokens, CoT effectively extends the serial computation available from \(T\) layers to \(KT\) effective steps, where \(K\) is the number of reasoning tokens.

The mechanism operates through two pathways: explicit storage of intermediate results in generated tokens, and implicit information flow through attention patterns. The explicit pathway is more robust and benefits from well-structured reasoning chains.

CoT is not universally beneficial: it fails for pure recall tasks, highly parallel problems, and when models lack decomposition capabilities. Error propagation through reasoning chains remains a practical limitation, motivating techniques like self-consistency and learned verifiers.

The theoretical framework, viewing CoT as computation depth extension and external working memory, explains empirical patterns across arithmetic, symbolic reasoning, and mathematical word problem benchmarks. It also suggests that future architectures with adaptive computation or explicit memory mechanisms may reduce reliance on lengthy reasoning chains while preserving their benefits.

Understanding why CoT works enables more principled application: use it for tasks requiring serial depth, structure chains for explicit intermediate storage, and consider self-consistency for robustness against reasoning errors.

References

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*.

Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., & Wang, L. (2023). Towards revealing the mystery behind chain of thought: A theoretical perspective. *Advances in Neural Information Processing Systems*, 36.

Merrill, W., & Sabharwal, A. (2023). The parallelism tradeoff: Limitations of log-precision transformers. *Transactions of the Association for Computational Linguistics*, 11, 531-545.

Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Biber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., & Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models. *arXiv preprint arXiv:2112.00114*.

Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., & Wei, J. (2022). Challenging BIG-bench tasks and whether chain-of-thought can solve them. *arXiv preprint arXiv:2210.09261*.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. *International Conference on Learning Representations*.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. *Advances in Neural Information Processing Systems*, 35, 24824-24837.