Have you ever noticed that LLMs sometimes seem more likely to “hallucinate” or generate nonsensical information when their context window is packed with information? This post dives into a toy model to explore a potential reason for this. We’ll focus on how a specific mechanism—where the influence of individual pieces of information is scaled down by $1/N$ as more (say, $N$) pieces arrive—can lead to a less “peaked” (flatter) probability distribution for the next word, possibly making the model more uncertain and prone to hallucination.

The Toy Model: Core Components

Let’s lay out the building blocks of our simplified model.

Context Richness and Logit Influence Vectors

Imagine the LLM processing $N$ distinct “contextual features” or pieces of information. As the input context gets richer (e.g., a longer prompt or conversation history), $N$ naturally increases. Each piece of information, let’s call it feature $k$ (for $k = 1, \dots, N$), tries to nudge the model’s prediction for the next word. We represent this nudge as an “effective logit influence vector” $\mathbf{p}_k$. This vector has a value for every word in the LLM’s vocabulary (size $V$), telling us how feature $k$ wants to increase or decrease the pre-softmax logit for each potential next word.

The Key Mechanism: Postulating the $1/N$ Scaling

The heart of our toy model is how these individual influences combine. We can motivate the specific form of this combination by considering a common pattern in neural architectures: the normalized aggregation of information from multiple sources. For instance, attention mechanisms weight and sum value vectors, and Mixture of Experts (MoE) models combine expert outputs using gating weights that often sum to one (or are normalized).

Let’s model a scenario where $N$ distinct contextual cues have been processed into intermediate representations (e.g., hidden states from different attention heads or parallel processing pathways), denoted $\mathbf{h}_k^{\text{proc}} \in \mathbb{R}^D$. These are then combined using weights $\alpha_k$ before a final linear projection $W_{out} \in \mathbb{R}^{V \times D}$ maps them to the logit space:

$$\mathbf{L}^{(N)}_{\text{raw}} = W_{out} \left( \sum_{k=1}^N \alpha_k \mathbf{h}_k^{\text{proc}} \right) = \sum_{k=1}^N \alpha_k (W_{out} \mathbf{h}_k^{\text{proc}})$$

The weights $\alpha_k$ could be generated by a softmax normalization of scores $s_k$ for each cue $k$ (i.e., $\alpha_k = \exp(s_k) / \sum_j \exp(s_j)$). If, for the specific set of $N$ broad contextual cues being aggregated, their scores $s_k$ are approximately equal (e.g., they are deemed undifferentiated in their relevance for this particular aggregation step, or the scoring mechanism is operating in a regime of low sensitivity for these cues), then $\alpha_k \approx 1/N$ for all $k$.
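
As a quick sanity check of this near-uniform-weighting regime, here is a minimal NumPy sketch (the scores and their spread are made-up illustrative values, not anything derived from a real model) showing that softmax weights over nearly equal scores all land close to $1/N$:

```python
import numpy as np

def softmax(scores):
    # Standard softmax with max-subtraction for numerical stability.
    z = np.exp(scores - scores.max())
    return z / z.sum()

rng = np.random.default_rng(0)
N = 8
scores = 0.01 * rng.normal(size=N)             # nearly undifferentiated relevance scores
alpha = softmax(scores)

print(alpha)                                   # every weight is close to 1/N = 0.125
print(np.allclose(alpha, 1.0 / N, atol=1e-2))  # True
```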

Under this condition of roughly uniform weighting, the raw combined logit vector becomes:

$$\mathbf{L}^{(N)}_{\text{raw}} \approx \sum_{k=1}^N \frac{1}{N} (W_{out} \mathbf{h}_k^{\text{proc}}) = \frac{1}{N} \sum_{k=1}^N (W_{out} \mathbf{h}_k^{\text{proc}})$$

Based on this, we define the “effective logit influence vector” of cue $k$ as its contribution after processing by $W_{out}$, centered around its mean effect: $\mathbf{p}_k = W_{out} \mathbf{h}_k^{\text{proc}} - \mathbb{E}[W_{out} \mathbf{h}_k^{\text{proc}}]$. Assuming the final logit vector $\mathbf{L}^{(N)}$ that we model is also centered (i.e., its expectation is zero, consistent with using centered $\mathbf{p}_k$ for easier variance analysis later), it is then postulated to be the arithmetic mean of these centered effective logit influence vectors:

$$\mathbf{L}^{(N)} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{p}_k$$
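
To make the construction concrete, here is a minimal sketch of the whole pipeline under the uniform-weighting assumption. The hidden states and output projection are just random placeholders, and the centering uses the empirical mean across cues as a stand-in for the expectation $\mathbb{E}[W_{out}\mathbf{h}_k^{\text{proc}}]$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, V = 16, 64, 1000                         # number of cues, hidden dim, vocabulary size

H = rng.normal(size=(N, D))                    # h_k^proc, one row per contextual cue
W_out = rng.normal(size=(V, D)) / np.sqrt(D)   # placeholder output projection

raw = H @ W_out.T                              # rows are W_out h_k^proc, shape (N, V)
p = raw - raw.mean(axis=0)                     # centered effective influences p_k (empirical proxy)
L_N = p.mean(axis=0)                           # L^(N) = (1/N) * sum_k p_k

print(L_N.shape)                               # (V,)
print(L_N.var())                               # sample variance of the combined logits
```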

This explicit $1/N$ scaling of the centered influences is crucial. It means that as more information comes in, each individual piece of information has proportionally less say in the deviations from the mean logit, preventing any single piece from unduly dominating the variance. If we just summed the centered influences (a $1/1$ scaling), the variance of the sum would typically grow with $N$. If we scaled by $1/\sqrt{N}$ (the normalization that keeps the variance of a sum of independent, identically distributed terms constant), the variance of the combined logits would stay roughly constant if the influences were uncorrelated. The $1/N$ scaling, however, is a stronger form of down-weighting that, as we’ll see, can actively reduce the variance of the combined logits under certain conditions.
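
The difference between these three scalings is easy to see numerically. Below is a small Monte Carlo sketch with independent, zero-mean, unit-variance influence vectors (the vocabulary size and trial count are arbitrary); the plain sum’s logit variance grows roughly like $N$, the $1/\sqrt{N}$ version stays roughly flat, and the $1/N$ version shrinks roughly like $1/N$:

```python
import numpy as np

rng = np.random.default_rng(2)
V, trials = 500, 200

def mean_logit_variance(N, scale):
    # Average (over trials) sample variance across the vocabulary of the
    # combined logit vector, for a given scaling rule applied to the sum.
    variances = []
    for _ in range(trials):
        p = rng.normal(size=(N, V))         # independent influences, unit variance
        L = scale(N) * p.sum(axis=0)
        variances.append(L.var())
    return np.mean(variances)

print("N   sum    1/sqrt(N)   1/N")
for N in (1, 4, 16, 64):
    print(N,
          round(mean_logit_variance(N, lambda n: 1.0), 2),
          round(mean_logit_variance(N, lambda n: 1.0 / np.sqrt(n)), 2),
          round(mean_logit_variance(N, lambda n: 1.0 / n), 3))
```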

Simplifying Assumptions for Logit Influences ($\mathbf{p}_k$)

To make our model tractable, we’ll make a few statistical assumptions about these influence vectors $\mathbf{p}_k$:

  1. Zero Mean Influence: On average, each $\mathbf{p}_k$ is zero ($\mathbb{E}[\mathbf{p}_k] = \mathbf{0}$). This just means we’re looking at influences as deviations from some baseline, which simplifies our math, especially variance calculations.
  2. Common Covariance: The “shape” of the random fluctuation of each $\mathbf{p}_k$ is the same, described by a common covariance matrix: $\mathrm{Cov}(\mathbf{p}_k) = \mathbf{\Sigma}_p = \mathbb{E}[\mathbf{p}_k \mathbf{p}_k^T]$.
  3. Inter-Feature Covariance: The way influences from different features ($k$ and $l$) vary together is described by $\mathrm{Cov}(\mathbf{p}_k, \mathbf{p}_l) = \mathbf{C}_{kl} = \mathbb{E}[\mathbf{p}_k \mathbf{p}_l^T]$ (one simple way to sample vectors satisfying these assumptions is sketched below).
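
One simple way to generate influence vectors consistent with all three assumptions, with identity covariance and a single equal pairwise correlation $\rho$, is a shared-plus-private mixture. This is purely an illustrative construction for simulation, not something the toy model itself requires:

```python
import numpy as np

def sample_influences(N, V, rho, rng):
    """Sample N influence vectors p_k with E[p_k] = 0, Cov(p_k) = I,
    and Cov(p_k, p_l) = rho * I for k != l.

    Each p_k mixes one component shared by all features (which induces the
    cross-feature correlation) with a private component unique to feature k.
    """
    shared = rng.normal(size=V)
    private = rng.normal(size=(N, V))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

rng = np.random.default_rng(3)
P = sample_influences(N=32, V=800, rho=0.3, rng=rng)
print(P.shape)   # (32, 800): one row per contextual feature
```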

Deriving the Expected Variance of Logits

Our main goal is to see how the “spread” of the final logit values $L_i^{(N)}$ across the vocabulary changes as $N$ increases. We measure this spread using the sample variance: $\mathrm{Var}_i(L_i^{(N)}) = \frac{1}{V} \sum_{i=1}^{V} (L_i^{(N)} - \bar{L}^{(N)})^2$, where $\bar{L}^{(N)}$ is the average logit value.

  1. Covariance of the Scaled Logit Vector ($\mathbf{L}^{(N)}$): We start with our formula $\mathbf{L}^{(N)} = \frac{1}{N} \sum \mathbf{p}_k$. Using standard properties of covariance (specifically, its bilinearity and how it behaves with sums of random vectors), we find the covariance matrix of $\mathbf{L}^{(N)}$:

    $$\mathbf{\Sigma}_{\mathbf{L}^{(N)}} = \mathrm{Cov}\left(\frac{1}{N} \sum_{k=1}^{N} \mathbf{p}_k\right) = \frac{1}{N^2} \mathrm{Cov}\left(\sum_{k=1}^{N} \mathbf{p}_k\right)$$

    Expanding the covariance of the sum (akin to $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$, but for vectors):

    $$\mathbf{\Sigma}_{\mathbf{L}^{(N)}} = \frac{1}{N^2} \left( \sum_{k=1}^{N} \mathrm{Cov}(\mathbf{p}_k) + \sum_{k \neq l} \mathrm{Cov}(\mathbf{p}_k, \mathbf{p}_l) \right)$$

    Plugging in our assumptions ($\mathrm{Cov}(\mathbf{p}_k) = \mathbf{\Sigma}_p$ and $\mathrm{Cov}(\mathbf{p}_k, \mathbf{p}_l) = \mathbf{C}_{kl}$):

    $$\mathbf{\Sigma}_{\mathbf{L}^{(N)}} = \frac{1}{N^2} \left( N \mathbf{\Sigma}_p + \sum_{k \neq l} \mathbf{C}_{kl} \right)$$
  2. Expected Sample Variance of Logits: Since we assumed $\mathbb{E}[\mathbf{p}_k] = \mathbf{0}$, it follows that $\mathbb{E}[L_i^{(N)}] = 0$ for any logit $i$. This simplifies things. The expected sample variance of the logits can be computed from the covariance matrix $\mathbf{\Sigma}_{\mathbf{L}^{(N)}}$ using the function $S(\mathbf{A}) = \frac{1}{V} \mathrm{Tr}(\mathbf{A}) - \frac{1}{V^2} \mathbf{1}^T \mathbf{A} \mathbf{1}$. For a zero-mean random vector $\mathbf{X}$ with $\mathrm{Cov}(\mathbf{X}) = \mathbf{A}$, expanding the sample variance $\frac{1}{V} \sum_i (X_i - \bar{X})^2$ and taking expectations gives exactly $S(\mathbf{A})$, so $S(\mathbf{A})$ measures the expected spread of the components of such a vector. So, we have:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = S(\mathbf{\Sigma}_{\mathbf{L}^{(N)}})$$

    Because $S(\cdot)$ is a linear function of its matrix argument (it involves traces and linear combinations), we can write:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = \frac{1}{N^2} \left( N S(\mathbf{\Sigma}_p) + \sum_{k \neq l} S(\mathbf{C}_{kl}) \right)$$
  3. Introducing Simplified Correlation: Let’s define $V_p = S(\mathbf{\Sigma}_p)$ as the inherent expected sample variance if we only had a single feature’s influence. To make the sum of cross-feature terms $S(\mathbf{C}_{kl})$ easier to handle, we make another simplifying assumption: we assume that the “shape” of these cross-covariances is somewhat similar, differing mainly by a scalar factor. Specifically, we let $S(\mathbf{C}_{kl}) = \rho_{kl} V_p$. Here, $\rho_{kl}$ is a scalar “effective correlation” that captures how much the typical spread pattern of influence from feature $k$ aligns with that of feature $l$, normalized by $V_p$. Now, let $\bar{\rho} = \frac{1}{N(N-1)} \sum_{k \neq l} \rho_{kl}$ be the average of these effective correlations over all distinct pairs of features (when $N = 1$ there are no pairs, so the cross term is simply zero). The sum then becomes $\sum_{k \neq l} S(\mathbf{C}_{kl}) = N(N-1)\bar{\rho} V_p$.

  4. Final Result for Expected Logit Variance: Substituting this back into our equation for the expected sample variance:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = \frac{1}{N^2} \left( N V_p + N(N-1)\bar{\rho} V_p \right)$$

    Factoring out $N V_p$ and simplifying the terms within the parentheses leads to:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p \left( \frac{N + N(N-1)\bar{\rho}}{N^2} \right) = V_p \left( \frac{1 + (N-1)\bar{\rho}}{N} \right)$$

    This looks a bit complicated, but it can be rewritten in a more insightful way:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p \left( \frac{1 - \bar{\rho}}{N} + \bar{\rho} \right) \quad (\text{for } N \ge 1)$$

    This is our key result for how the expected spread of logits changes! (A short simulation below checks it numerically.)
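
Here is a minimal Monte Carlo check of this result, reusing the shared-plus-private sampler from earlier. With identity covariance, $V_p = S(\mathbf{I}) = 1 - 1/V$, and the simulated logit variance should track $V_p\left((1-\bar{\rho})/N + \bar{\rho}\right)$; the particular $V$, $\bar{\rho}$, and trial count are arbitrary:

```python
import numpy as np

def sample_influences(N, V, rho, rng):
    # Shared-plus-private construction: Cov(p_k) = I, Cov(p_k, p_l) = rho * I.
    shared = rng.normal(size=V)
    private = rng.normal(size=(N, V))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

rng = np.random.default_rng(4)
V, rho, trials = 1000, 0.3, 300
V_p = 1.0 - 1.0 / V                   # S(I) = Tr(I)/V - 1'I1/V^2 = 1 - 1/V

print("N    simulated  predicted")
for N in (1, 2, 8, 32, 128):
    sims = []
    for _ in range(trials):
        L = sample_influences(N, V, rho, rng).mean(axis=0)   # L^(N)
        sims.append(L.var())          # sample variance across the vocabulary
    predicted = V_p * ((1.0 - rho) / N + rho)
    print(N, round(float(np.mean(sims)), 4), round(predicted, 4))
```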

What the Model Shows

This final equation is quite revealing. It tells us how the expected variance (or “spread”) of the logits across the vocabulary changes based on the number of contextual features ($N$) and their average effective correlation ($\bar{\rho}$):

  1. Independent Features ($\bar{\rho} = 0$): If the influences from different contextual features are effectively uncorrelated ($\bar{\rho} = 0$), then $\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p/N$.

    Interpretation: The logit variance shrinks proportionally to $1/N$. If new information is entirely “fresh” and unrelated to what the model has already processed, its primary effect is to reduce the logit spread. While the logits themselves become more concentrated around their mean, this doesn’t mean the model becomes more certain about a single token. Rather, as we’ll see when connecting to softmax, this reduced variance across the set of all logits leads to them being more similar to each other, resulting in a flatter probability distribution and thus greater uncertainty.

  2. Positively Correlated Features ($0 < \bar{\rho} < 1$): The variance still decreases as $N$ grows, thanks to the $(1-\bar{\rho})/N$ term which still pushes it down. However, as $N$ gets very large, the variance doesn’t go to zero. Instead, it approaches a floor: $\lim_{N\to\infty} \mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p \bar{\rho}$.

    Interpretation: Shared or redundant information (positive correlation) limits how much the $1/N$ scaling can reduce logit variance. While the variance still decreases with $N$ (as long as $\bar{\rho} < 1$), it approaches a floor of $V_p \bar{\rho}$. This means that even with a large amount of information, if it’s partially correlated, the logits won’t become as similar to each other (their differences won’t shrink as much) as they would with purely independent information. The resulting softmax distribution, while potentially becoming flatter and indicating more uncertainty compared to small $N$ (if the initial $V_p$ was high), will not flatten indefinitely as it does when $\bar{\rho} = 0$.

  3. Perfectly Correlated Features ($\bar{\rho} = 1$): If all contextual influences are perfectly correlated in their effect on logit spread ($\bar{\rho} = 1$), then $\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p$.

    Interpretation: If all contextual influences are perfectly correlated ($\bar{\rho} = 1$), the expected logit variance $\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right]$ remains constant at $V_p$, regardless of $N$. The $1/N$ scaling, in this scenario, effectively processes the same underlying signal repeatedly. The logit spread (and thus the similarity between logits) doesn’t change from the single-feature case. This implies that adding more, perfectly redundant information neither increases nor decreases the model’s certainty or uncertainty (as reflected by the flatness of the softmax distribution) compared to having just one piece of that information.

The takeaway is this: as long as new contextual features bring at least some new, uncorrelated influence ($\bar{\rho} < 1$), our $1/N$ scaling mechanism will cause the expected variance of the logits to decrease as context richness $N$ grows.
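
To make the three regimes concrete, here is a tiny calculation of the closed-form expression for a few illustrative values (setting $V_p = 1$ is arbitrary):

```python
def expected_logit_variance(N, rho_bar, V_p=1.0):
    # Closed-form expected logit variance from the toy model:
    # V_p * ((1 - rho_bar) / N + rho_bar)
    return V_p * ((1.0 - rho_bar) / N + rho_bar)

for rho_bar in (0.0, 0.3, 1.0):
    row = [round(expected_logit_variance(N, rho_bar), 3) for N in (1, 4, 16, 64, 256)]
    print(rho_bar, row)

# rho_bar = 0.0 -> variance falls like 1/N toward 0
# rho_bar = 0.3 -> variance falls toward the floor V_p * 0.3
# rho_bar = 1.0 -> variance stays at V_p for every N
```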

Quantifying Flatness of Final Distribution

We focus on the common scenario of partially correlated features ($0 < \bar{\rho} < 1$) to see how increasing context richness ($N$) can lead to a flatter softmax distribution, potentially increasing hallucination. Recall that $\bar{\rho}$ is the effective average correlation between the features’ logit influences.

To quantify this flattening more directly, we can consider the expected Chi-Squared distance, $\mathbb{E}[\chi^2(P,U)]$, between the model’s softmax output distribution $P = (p_1, \dots, p_V)$ and a perfectly uniform distribution $U = (1/V, \dots, 1/V)$.

When the logits $L_i^{(N)}$ are relatively small (which occurs when their variance is small, as for large $N$), a first-order expansion of the softmax around the uniform distribution gives $p_i - 1/V \approx (L_i^{(N)} - \bar{L}^{(N)})/V$. This leads to:

$$\chi^2(P,U) = V \sum_{i=1}^V (p_i - 1/V)^2 \approx V \sum_{i=1}^V \left( \frac{L_i^{(N)} - \bar{L}^{(N)}}{V} \right)^2 = \mathrm{Var}_i(L_i^{(N)})$$
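
The quality of this approximation is easy to check numerically: for logits with a small spread, the exact Chi-Squared distance to the uniform distribution and the sample variance of the logits nearly coincide (the 0.05 scale below is just an arbitrary “small-variance” choice):

```python
import numpy as np

rng = np.random.default_rng(5)
V = 1000

def softmax(L):
    z = np.exp(L - L.max())
    return z / z.sum()

L = 0.05 * rng.normal(size=V)            # low-variance logits, as for large N
P = softmax(L)

chi2 = V * np.sum((P - 1.0 / V) ** 2)    # exact chi-squared distance to uniform
print(chi2, L.var())                     # the two values should nearly match
```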

Thus, the expected deviation from a uniform distribution scales with the expected variance of the logits themselves:

$$\mathbb{E}[\chi^2(P,U)] \approx \mathbb{E}[\mathrm{Var}_i(L_i^{(N)})] = V_p \left( \frac{1 - \bar{\rho}}{N} + \bar{\rho} \right)$$

For the partially correlated case ($0 < \bar{\rho} < 1$), this means:

$$\mathbb{E}[\chi^2(P,U)] \approx \frac{V_p(1-\bar{\rho})}{N} + V_p\bar{\rho}$$

A smaller $\mathbb{E}[\chi^2(P,U)]$ indicates that $P$ is closer to uniform (flatter). As $N$ increases, this value decreases (as long as $\bar{\rho} < 1$), signifying a flatter distribution and thus higher model uncertainty, approaching a limit set by $V_p\bar{\rho}$.

This argument shows that for partially correlated features, increasing context richness ($N$) causes the variance of the logits to decrease (through the $V_p(1-\bar{\rho})/N$ term), making the logits more likely to be numerically close to one another. This, in turn, leads to a flatter softmax probability distribution, indicating higher model uncertainty, although the extent of this flattening is limited by the non-zero correlation (through the $V_p\bar{\rho}$ floor).
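
Putting the pieces together, a short end-to-end sketch (again reusing the shared-plus-private sampler, with unit-variance influences and an arbitrary $\bar{\rho} = 0.3$) shows both the logit variance and the softmax distribution’s distance from uniform shrinking with $N$ toward a correlation-set floor:

```python
import numpy as np

rng = np.random.default_rng(6)
V, rho, trials = 1000, 0.3, 50

def softmax(L):
    z = np.exp(L - L.max())
    return z / z.sum()

def sample_influences(N, V, rho, rng):
    # Shared-plus-private construction: Cov(p_k) = I, Cov(p_k, p_l) = rho * I.
    shared = rng.normal(size=V)
    private = rng.normal(size=(N, V))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

print("N    logit variance   chi^2 to uniform")
for N in (1, 4, 16, 64, 256):
    var_L, chi2 = [], []
    for _ in range(trials):
        L = sample_influences(N, V, rho, rng).mean(axis=0)   # L^(N)
        P = softmax(L)
        var_L.append(L.var())
        chi2.append(V * np.sum((P - 1.0 / V) ** 2))
    print(N, round(float(np.mean(var_L)), 3), round(float(np.mean(chi2)), 3))

# Both quantities fall as N grows but level off near a floor set by the
# correlation, rather than collapsing all the way to zero.
```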

Conclusion

This toy model, centered on the idea of $1/N$ scaling of logit influences from various contextual features, paints a plausible picture of how increasing context richness ($N$) could, somewhat counterintuitively, lead to trouble. The key result, $\mathbb{E}[\mathrm{Var}_i(L_i^{(N)})] = V_p \left( (1 - \bar{\rho})/N + \bar{\rho} \right)$, shows that the expected spread of pre-softmax logits tends to decrease as more features are added, especially if these features are diverse (low average correlation $\bar{\rho}$). This reduction in logit variance makes the logits cluster together, leading to a flatter softmax probability distribution. This flatter distribution signifies higher model uncertainty, offering a mathematical pathway to understanding why an LLM might become more prone to hallucination when its context window is filled with a large amount of information. While a simplification, this model provides a conceptual framework for thinking about the delicate balance of information aggregation in LLMs.