Have you ever noticed that LLMs sometimes seem more likely to “hallucinate” or generate nonsensical information when their context window is packed with information? This post dives into a toy model to explore a potential reason for this. We’ll focus on how a specific mechanism—where the influence of individual pieces of information is scaled down by $1/N$ as more (say, $N$) pieces arrive—can lead to a less “peaked” (flatter) probability distribution for the next word, possibly making the model more uncertain and prone to hallucination.

The Toy Model: Core Components

Let’s lay out the building blocks of our simplified model.

Context Richness and Logit Influence Vectors

Imagine the LLM processing $N$ distinct “contextual features” or pieces of information. As the input context gets richer (e.g., a longer prompt or conversation history), $N$ naturally increases. Each piece of information, let’s call it feature $k$ (for $k = 1, \dots, N$), tries to nudge the model’s prediction for the next word. We represent this nudge as an “effective logit influence vector” $\mathbf{p}_k$. This vector has a value for every word in the LLM’s vocabulary (size $V$), telling us how feature $k$ wants to increase or decrease the pre-softmax logit for each potential next word.

The Key Mechanism: Postulating the $1/N$ Scaling

The heart of our toy model is how these individual influences combine. We can motivate the specific form of this combination by considering a common pattern in neural architectures: the normalized aggregation of information from multiple sources. For instance, attention mechanisms weight and sum value vectors, and Mixture of Experts (MoE) models combine expert outputs using gating weights that often sum to one (or are normalized).

Let’s model a scenario where $N$ distinct contextual cues have been processed into intermediate representations (e.g., hidden states from different attention heads or parallel processing pathways), denoted $\mathbf{h}_k^{\text{proc}} \in \mathbb{R}^D$. These are then combined using weights $\alpha_k$ before a final linear projection $W_{out} \in \mathbb{R}^{V \times D}$ maps them to the logit space:

$$\mathbf{L}^{(N)}_{\text{raw}} = W_{out} \left( \sum_{k=1}^N \alpha_k \mathbf{h}_k^{\text{proc}} \right) = \sum_{k=1}^N \alpha_k (W_{out} \mathbf{h}_k^{\text{proc}})$$

The weights $\alpha_k$ could be generated by a softmax normalization of scores $s_k$ for each cue $k$ (i.e., $\alpha_k = \exp(s_k) / \sum_j \exp(s_j)$). If, for the specific set of $N$ broad contextual cues being aggregated, their scores $s_k$ are approximately equal (e.g., they are deemed undifferentiated in their relevance for this particular aggregation step, or the scoring mechanism is operating in a regime of low sensitivity for these cues), then $\alpha_k \approx 1/N$ for all $k$.
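
As a quick sanity check of this near-uniform-weighting regime, here is a minimal NumPy sketch (the scores and their spread are made-up illustrative values, not anything derived from a real model) showing that softmax weights over nearly equal scores all land close to $1/N$:

```python
import numpy as np

def softmax(scores):
    # Standard softmax with max-subtraction for numerical stability.
    z = np.exp(scores - scores.max())
    return z / z.sum()

rng = np.random.default_rng(0)
N = 8
scores = 0.01 * rng.normal(size=N)             # nearly undifferentiated relevance scores
alpha = softmax(scores)

print(alpha)                                   # every weight is close to 1/N = 0.125
print(np.allclose(alpha, 1.0 / N, atol=1e-2))  # True
```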

Under this condition of roughly uniform weighting, the raw combined logit vector becomes:

$$\mathbf{L}^{(N)}_{\text{raw}} \approx \sum_{k=1}^N \frac{1}{N} (W_{out} \mathbf{h}_k^{\text{proc}}) = \frac{1}{N} \sum_{k=1}^N (W_{out} \mathbf{h}_k^{\text{proc}})$$

Based on this, we define the “effective logit influence vector” of cue $k$ as its contribution after processing by $W_{out}$, centered around its mean effect: $\mathbf{p}_k = W_{out} \mathbf{h}_k^{\text{proc}} - \mathbb{E}[W_{out} \mathbf{h}_k^{\text{proc}}]$. Assuming the final logit vector $\mathbf{L}^{(N)}$ that we model is also centered (i.e., its expectation is zero, consistent with using centered $\mathbf{p}_k$ for easier variance analysis later), it is then postulated to be the arithmetic mean of these centered effective logit influence vectors:

$$\mathbf{L}^{(N)} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{p}_k$$
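
To make the construction concrete, here is a minimal sketch of the whole pipeline under the uniform-weighting assumption. The hidden states and output projection are just random placeholders, and the centering uses the empirical mean across cues as a stand-in for the expectation $\mathbb{E}[W_{out}\mathbf{h}_k^{\text{proc}}]$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, V = 16, 64, 1000                         # number of cues, hidden dim, vocabulary size

H = rng.normal(size=(N, D))                    # h_k^proc, one row per contextual cue
W_out = rng.normal(size=(V, D)) / np.sqrt(D)   # placeholder output projection

raw = H @ W_out.T                              # rows are W_out h_k^proc, shape (N, V)
p = raw - raw.mean(axis=0)                     # centered effective influences p_k (empirical proxy)
L_N = p.mean(axis=0)                           # L^(N) = (1/N) * sum_k p_k

print(L_N.shape)                               # (V,)
print(L_N.var())                               # sample variance of the combined logits
```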

This explicit $1/N$ scaling of the centered influences is crucial. It means that as more information comes in, each individual piece of information has proportionally less say in the deviations from the mean logit, preventing any single piece from unduly dominating the variance. If we just summed the centered influences (a $1/1$ scaling), the variance of the sum would typically grow with $N$. If we scaled by $1/\sqrt{N}$ (the normalization that keeps the variance of a sum of independent, identically distributed terms constant), the variance of the combined logits would stay roughly constant if the influences were uncorrelated. The $1/N$ scaling, however, is a stronger form of down-weighting that, as we’ll see, can actively reduce the variance of the combined logits under certain conditions.
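
The difference between these three scalings is easy to see numerically. Below is a small Monte Carlo sketch with independent, zero-mean, unit-variance influence vectors (the vocabulary size and trial count are arbitrary); the plain sum’s logit variance grows roughly like $N$, the $1/\sqrt{N}$ version stays roughly flat, and the $1/N$ version shrinks roughly like $1/N$:

```python
import numpy as np

rng = np.random.default_rng(2)
V, trials = 500, 200

def mean_logit_variance(N, scale):
    # Average (over trials) sample variance across the vocabulary of the
    # combined logit vector, for a given scaling rule applied to the sum.
    variances = []
    for _ in range(trials):
        p = rng.normal(size=(N, V))         # independent influences, unit variance
        L = scale(N) * p.sum(axis=0)
        variances.append(L.var())
    return np.mean(variances)

print("N   sum    1/sqrt(N)   1/N")
for N in (1, 4, 16, 64):
    print(N,
          round(mean_logit_variance(N, lambda n: 1.0), 2),
          round(mean_logit_variance(N, lambda n: 1.0 / np.sqrt(n)), 2),
          round(mean_logit_variance(N, lambda n: 1.0 / n), 3))
```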

Simplifying Assumptions for Logit Influences ($\mathbf{p}_k$)

To make our model tractable, we’ll make a few statistical assumptions about these influence vectors $\mathbf{p}_k$:

  1. Zero Mean Influence: On average, each $\mathbf{p}_k$ is zero ($\mathbb{E}[\mathbf{p}_k] = \mathbf{0}$). This just means we’re looking at influences as deviations from some baseline, which simplifies our math, especially variance calculations.
  2. Common Covariance: The “shape” of the random fluctuation of each $\mathbf{p}_k$ is the same, described by a common covariance matrix: $\mathrm{Cov}(\mathbf{p}_k) = \mathbf{\Sigma}_p = \mathbb{E}[\mathbf{p}_k \mathbf{p}_k^T]$.
  3. Inter-Feature Covariance: The way influences from different features ($k$ and $l$) vary together is described by $\mathrm{Cov}(\mathbf{p}_k, \mathbf{p}_l) = \mathbf{C}_{kl} = \mathbb{E}[\mathbf{p}_k \mathbf{p}_l^T]$ (one simple way to sample vectors satisfying these assumptions is sketched below).
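
One simple way to generate influence vectors consistent with all three assumptions, with identity covariance and a single equal pairwise correlation $\rho$, is a shared-plus-private mixture. This is purely an illustrative construction for simulation, not something the toy model itself requires:

```python
import numpy as np

def sample_influences(N, V, rho, rng):
    """Sample N influence vectors p_k with E[p_k] = 0, Cov(p_k) = I,
    and Cov(p_k, p_l) = rho * I for k != l.

    Each p_k mixes one component shared by all features (which induces the
    cross-feature correlation) with a private component unique to feature k.
    """
    shared = rng.normal(size=V)
    private = rng.normal(size=(N, V))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

rng = np.random.default_rng(3)
P = sample_influences(N=32, V=800, rho=0.3, rng=rng)
print(P.shape)   # (32, 800): one row per contextual feature
```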

Deriving the Expected Variance of Logits

Our main goal is to see how the “spread” of the final logit values $L_i^{(N)}$ across the vocabulary changes as $N$ increases. We measure this spread using the sample variance: $\mathrm{Var}_i(L_i^{(N)}) = \frac{1}{V} \sum_{i=1}^{V} (L_i^{(N)} - \bar{L}^{(N)})^2$, where $\bar{L}^{(N)}$ is the average logit value.

  1. Covariance of the Scaled Logit Vector ($\mathbf{L}^{(N)}$): We start with our formula $\mathbf{L}^{(N)} = \frac{1}{N} \sum \mathbf{p}_k$. Using standard properties of covariance (specifically, its bilinearity and how it behaves with sums of random vectors), we find the covariance matrix of $\mathbf{L}^{(N)}$:

    $$\mathbf{\Sigma}_{\mathbf{L}^{(N)}} = \mathrm{Cov}\left(\frac{1}{N} \sum_{k=1}^{N} \mathbf{p}_k\right) = \frac{1}{N^2} \mathrm{Cov}\left(\sum_{k=1}^{N} \mathbf{p}_k\right)$$

    Expanding the covariance of the sum (akin to $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$, but for vectors):

    $$\mathbf{\Sigma}_{\mathbf{L}^{(N)}} = \frac{1}{N^2} \left( \sum_{k=1}^{N} \mathrm{Cov}(\mathbf{p}_k) + \sum_{k \neq l} \mathrm{Cov}(\mathbf{p}_k, \mathbf{p}_l) \right)$$

    Plugging in our assumptions ($\mathrm{Cov}(\mathbf{p}_k) = \mathbf{\Sigma}_p$ and $\mathrm{Cov}(\mathbf{p}_k, \mathbf{p}_l) = \mathbf{C}_{kl}$):

    $$\mathbf{\Sigma}_{\mathbf{L}^{(N)}} = \frac{1}{N^2} \left( N \mathbf{\Sigma}_p + \sum_{k \neq l} \mathbf{C}_{kl} \right)$$
  2. Expected Sample Variance of Logits: Since we assumed $\mathbb{E}[\mathbf{p}_k] = \mathbf{0}$, it follows that $\mathbb{E}[L_i^{(N)}] = 0$ for any logit $i$. This simplifies things. The expected sample variance of the logits can be computed from the covariance matrix $\mathbf{\Sigma}_{\mathbf{L}^{(N)}}$ using the function $S(\mathbf{A}) = \frac{1}{V} \mathrm{Tr}(\mathbf{A}) - \frac{1}{V^2} \mathbf{1}^T \mathbf{A} \mathbf{1}$. For a zero-mean random vector $\mathbf{X}$ with $\mathrm{Cov}(\mathbf{X}) = \mathbf{A}$, expanding the sample variance $\frac{1}{V} \sum_i (X_i - \bar{X})^2$ and taking expectations gives exactly $S(\mathbf{A})$, so $S(\mathbf{A})$ measures the expected spread of the components of such a vector. So, we have:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = S(\mathbf{\Sigma}_{\mathbf{L}^{(N)}})$$

    Because $S(\cdot)$ is a linear function of its matrix argument (it involves traces and linear combinations), we can write:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = \frac{1}{N^2} \left( N S(\mathbf{\Sigma}_p) + \sum_{k \neq l} S(\mathbf{C}_{kl}) \right)$$
  3. Introducing Simplified Correlation: Let’s define $V_p = S(\mathbf{\Sigma}_p)$ as the inherent expected sample variance if we only had a single feature’s influence. To make the sum of cross-feature terms $S(\mathbf{C}_{kl})$ easier to handle, we make another simplifying assumption: we assume that the “shape” of these cross-covariances is somewhat similar, differing mainly by a scalar factor. Specifically, we let $S(\mathbf{C}_{kl}) = \rho_{kl} V_p$. Here, $\rho_{kl}$ is a scalar “effective correlation” that captures how much the typical spread pattern of influence from feature $k$ aligns with that of feature $l$, normalized by $V_p$. Now, let $\bar{\rho} = \frac{1}{N(N-1)} \sum_{k \neq l} \rho_{kl}$ be the average of these effective correlations over all distinct pairs of features (when $N = 1$ there are no pairs, so the cross term is simply zero). The sum then becomes $\sum_{k \neq l} S(\mathbf{C}_{kl}) = N(N-1)\bar{\rho} V_p$.

  4. Final Result for Expected Logit Variance: Substituting this back into our equation for the expected sample variance:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = \frac{1}{N^2} \left( N V_p + N(N-1)\bar{\rho} V_p \right)$$

    Factoring out $N V_p$ and simplifying the terms within the parentheses leads to:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p \left( \frac{N + N(N-1)\bar{\rho}}{N^2} \right) = V_p \left( \frac{1 + (N-1)\bar{\rho}}{N} \right)$$

    This looks a bit complicated, but it can be rewritten in a more insightful way:

    $$\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p \left( \frac{1 - \bar{\rho}}{N} + \bar{\rho} \right) \quad (\text{for } N \ge 1)$$

    This is our key result for how the expected spread of logits changes! (A short simulation below checks it numerically.)
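
Here is a minimal Monte Carlo check of this result, reusing the shared-plus-private sampler from earlier. With identity covariance, $V_p = S(\mathbf{I}) = 1 - 1/V$, and the simulated logit variance should track $V_p\left((1-\bar{\rho})/N + \bar{\rho}\right)$; the particular $V$, $\bar{\rho}$, and trial count are arbitrary:

```python
import numpy as np

def sample_influences(N, V, rho, rng):
    # Shared-plus-private construction: Cov(p_k) = I, Cov(p_k, p_l) = rho * I.
    shared = rng.normal(size=V)
    private = rng.normal(size=(N, V))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

rng = np.random.default_rng(4)
V, rho, trials = 1000, 0.3, 300
V_p = 1.0 - 1.0 / V                   # S(I) = Tr(I)/V - 1'I1/V^2 = 1 - 1/V

print("N    simulated  predicted")
for N in (1, 2, 8, 32, 128):
    sims = []
    for _ in range(trials):
        L = sample_influences(N, V, rho, rng).mean(axis=0)   # L^(N)
        sims.append(L.var())          # sample variance across the vocabulary
    predicted = V_p * ((1.0 - rho) / N + rho)
    print(N, round(float(np.mean(sims)), 4), round(predicted, 4))
```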

What the Model Shows

This final equation is quite revealing. It tells us how the expected variance (or “spread”) of the logits across the vocabulary changes based on the number of contextual features ($N$) and their average effective correlation ($\bar{\rho}$):

  1. Independent Features ($\bar{\rho} = 0$): If the influences from different contextual features are effectively uncorrelated ($\bar{\rho} = 0$), then $\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p/N$.

    Interpretation: The logit variance shrinks proportionally to $1/N$. If new information is entirely “fresh” and unrelated to what the model has already processed, its primary effect is to reduce the logit spread. While the logits themselves become more concentrated around their mean, this doesn’t mean the model becomes more certain about a single token. Rather, as we’ll see when connecting to softmax, this reduced variance across the set of all logits leads to them being more similar to each other, resulting in a flatter probability distribution and thus greater uncertainty.

  2. Positively Correlated Features ($0 < \bar{\rho} < 1$): The variance still decreases as $N$ grows, thanks to the $(1-\bar{\rho})/N$ term which still pushes it down. However, as $N$ gets very large, the variance doesn’t go to zero. Instead, it approaches a floor: $\lim_{N\to\infty} \mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p \bar{\rho}$.

    Interpretation: Shared or redundant information (positive correlation) limits how much the $1/N$ scaling can reduce logit variance. While the variance still decreases with $N$ (as long as $\bar{\rho} < 1$), it approaches a floor of $V_p \bar{\rho}$. This means that even with a large amount of information, if it’s partially correlated, the logits won’t become as similar to each other (their differences won’t shrink as much) as they would with purely independent information. The resulting softmax distribution, while potentially becoming flatter and indicating more uncertainty compared to small $N$ (if the initial $V_p$ was high), will not flatten indefinitely as it does when $\bar{\rho} = 0$.

  3. Perfectly Correlated Features ($\bar{\rho} = 1$): If all contextual influences are perfectly correlated in their effect on logit spread ($\bar{\rho} = 1$), then $\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right] = V_p$.

    Interpretation: If all contextual influences are perfectly correlated ($\bar{\rho} = 1$), the expected logit variance $\mathbb{E}\left[\mathrm{Var}_i(L_i^{(N)})\right]$ remains constant at $V_p$, regardless of $N$. The $1/N$ scaling, in this scenario, effectively processes the same underlying signal repeatedly. The logit spread (and thus the similarity between logits) doesn’t change from the single-feature case. This implies that adding more, perfectly redundant information neither increases nor decreases the model’s certainty or uncertainty (as reflected by the flatness of the softmax distribution) compared to having just one piece of that information.

The takeaway is this: as long as new contextual features bring at least some new, uncorrelated influence ($\bar{\rho} < 1$), our $1/N$ scaling mechanism will cause the expected variance of the logits to decrease as context richness $N$ grows.
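
To make the three regimes concrete, here is a tiny calculation of the closed-form expression for a few illustrative values (setting $V_p = 1$ is arbitrary):

```python
def expected_logit_variance(N, rho_bar, V_p=1.0):
    # Closed-form expected logit variance from the toy model:
    # V_p * ((1 - rho_bar) / N + rho_bar)
    return V_p * ((1.0 - rho_bar) / N + rho_bar)

for rho_bar in (0.0, 0.3, 1.0):
    row = [round(expected_logit_variance(N, rho_bar), 3) for N in (1, 4, 16, 64, 256)]
    print(rho_bar, row)

# rho_bar = 0.0 -> variance falls like 1/N toward 0
# rho_bar = 0.3 -> variance falls toward the floor V_p * 0.3
# rho_bar = 1.0 -> variance stays at V_p for every N
```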

Quantifying Flatness of Final Distribution

We focus on the common scenario of partially correlated features ($0 < \bar{\rho} < 1$) to see how increasing context richness ($N$) can lead to a flatter softmax distribution, potentially increasing hallucination. Recall that $\bar{\rho}$ is the effective average correlation between the features’ logit influences.

To quantify this flattening more directly, we can consider the expected Chi-Squared distance, $\mathbb{E}[\chi^2(P,U)]$, between the model’s softmax output distribution $P = (p_1, \dots, p_V)$ and a perfectly uniform distribution $U = (1/V, \dots, 1/V)$.

When the logits $L_i^{(N)}$ are relatively small (which occurs when their variance is small, as for large $N$), a first-order expansion of the softmax around the uniform distribution gives $p_i - 1/V \approx (L_i^{(N)} - \bar{L}^{(N)})/V$. This leads to:

$$\chi^2(P,U) = V \sum_{i=1}^V (p_i - 1/V)^2 \approx V \sum_{i=1}^V \left( \frac{L_i^{(N)} - \bar{L}^{(N)}}{V} \right)^2 = \mathrm{Var}_i(L_i^{(N)})$$
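
The quality of this approximation is easy to check numerically: for logits with a small spread, the exact Chi-Squared distance to the uniform distribution and the sample variance of the logits nearly coincide (the 0.05 scale below is just an arbitrary “small-variance” choice):

```python
import numpy as np

rng = np.random.default_rng(5)
V = 1000

def softmax(L):
    z = np.exp(L - L.max())
    return z / z.sum()

L = 0.05 * rng.normal(size=V)            # low-variance logits, as for large N
P = softmax(L)

chi2 = V * np.sum((P - 1.0 / V) ** 2)    # exact chi-squared distance to uniform
print(chi2, L.var())                     # the two values should nearly match
```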

Thus, the expected deviation from a uniform distribution scales with the expected variance of the logits themselves:

$$\mathbb{E}[\chi^2(P,U)] \approx \mathbb{E}[\mathrm{Var}_i(L_i^{(N)})] = V_p \left( \frac{1 - \bar{\rho}}{N} + \bar{\rho} \right)$$

For the partially correlated case ($0 < \bar{\rho} < 1$), this means:

$$\mathbb{E}[\chi^2(P,U)] \approx \frac{V_p(1-\bar{\rho})}{N} + V_p\bar{\rho}$$

A smaller $\mathbb{E}[\chi^2(P,U)]$ indicates that $P$ is closer to uniform (flatter). As $N$ increases, this value decreases (as long as $\bar{\rho} < 1$), signifying a flatter distribution and thus higher model uncertainty, approaching a limit set by $V_p\bar{\rho}$.

This argument shows that for partially correlated features, increasing context richness ($N$) causes the variance of the logits to decrease (through the $V_p(1-\bar{\rho})/N$ term), making the logits more likely to be numerically close to one another. This, in turn, leads to a flatter softmax probability distribution, indicating higher model uncertainty, although the extent of this flattening is limited by the non-zero correlation (through the $V_p\bar{\rho}$ floor).
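
Putting the pieces together, a short end-to-end sketch (again reusing the shared-plus-private sampler, with unit-variance influences and an arbitrary $\bar{\rho} = 0.3$) shows both the logit variance and the softmax distribution’s distance from uniform shrinking with $N$ toward a correlation-set floor:

```python
import numpy as np

rng = np.random.default_rng(6)
V, rho, trials = 1000, 0.3, 50

def softmax(L):
    z = np.exp(L - L.max())
    return z / z.sum()

def sample_influences(N, V, rho, rng):
    # Shared-plus-private construction: Cov(p_k) = I, Cov(p_k, p_l) = rho * I.
    shared = rng.normal(size=V)
    private = rng.normal(size=(N, V))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private

print("N    logit variance   chi^2 to uniform")
for N in (1, 4, 16, 64, 256):
    var_L, chi2 = [], []
    for _ in range(trials):
        L = sample_influences(N, V, rho, rng).mean(axis=0)   # L^(N)
        P = softmax(L)
        var_L.append(L.var())
        chi2.append(V * np.sum((P - 1.0 / V) ** 2))
    print(N, round(float(np.mean(var_L)), 3), round(float(np.mean(chi2)), 3))

# Both quantities fall as N grows but level off near a floor set by the
# correlation, rather than collapsing all the way to zero.
```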

Conclusion

This toy model, centered on the idea of $1/N$ scaling of logit influences from various contextual features, paints a plausible picture of how increasing context richness ($N$) could, somewhat counterintuitively, lead to trouble. The key result, $\mathbb{E}[\mathrm{Var}_i(L_i^{(N)})] = V_p \left( (1 - \bar{\rho})/N + \bar{\rho} \right)$, shows that the expected spread of pre-softmax logits tends to decrease as more features are added, especially if these features are diverse (low average correlation $\bar{\rho}$). This reduction in logit variance makes the logits cluster together, leading to a flatter softmax probability distribution. This flatter distribution signifies higher model uncertainty, offering a mathematical pathway to understanding why an LLM might become more prone to hallucination when its context window is filled with a large amount of information. While a simplification, this model provides a conceptual framework for thinking about the delicate balance of information aggregation in LLMs.