Towards a Testable Theory of Emergence

Tags: ai, machine-learning, transformers, emergence, theory
A proposal for testing emergence as heldout but reachable coherence under recursive retrieval.
Author: Soma S Dhavala

Published: June 30, 2026

Abstract. What exactly are large language models doing? One answer says they are memory devices: vast compressors of their training data, retrieving and remixing what they have already seen. That answer is partly right, but it creates a puzzle. If these systems are “just” memory and compression, why do they generalize to new compositions, concepts, and tasks that were not stored as exact training examples? This essay works toward a testable theory of emergence, but not emergence defined merely by model size. A capability is retrieval-emergent when a target behavior was not stored as a joint point, but is nevertheless reachable from dense constituent neighborhoods through context-sensitive retrieval. Model size matters because it can supply enough memory, dimension, and depth for those neighborhoods to become dense and separable; size is a precondition, not the definition. The argument has three parts. First, a simple analytical result shows what would be required for a heldout composite to be recovered from constituent coverage when the composition map is smooth enough. Second, the core operations of transformers can be read as retrieval or regression: matrix multiplication scores similarities, attention is Nadaraya-Watson kernel regression, and MLP blocks are key-value memories. Third, integer-only alien-symbol experiments test the definition in a controlled setting: a proof sanity check calibrates the geometry, a routing baseline marks what does not count as composition, and two single-label tasks show heldout composition when constituent support is dense. These experiments are supportive, not decisive. The conclusion is deliberately middle-ground: these systems are not classical symbolic reasoners, but they are not merely stochastic parrots either. They may be geometry-shaped retrieval engines whose reach can, in principle, be measured.

The puzzle

The question motivating this essay is simple: what exactly are LLMs doing?

One plausible answer is that they are memory systems. They compress a huge training corpus into weights and activations, then answer new prompts by retrieving, interpolating, and recombining patterns from that compressed store. Much of the transformer supports this interpretation: dot products score similarity, attention retrieves values from context, and feed-forward layers behave like key-value memories.

The compression view is not just a metaphor. Shannon’s source-coding theory ties probability, prediction, and optimal code length together (Shannon, 1948). The minimum-description-length tradition treats learning regularities in data as finding shorter descriptions of that data (Rissanen, 1978). More recently, Delétang et al. (2023) argue directly that language modeling is compression: a good next-token predictor can be turned into a lossless compressor, and strong language models are therefore strong general-purpose predictors/compressors. Cheng et al. (2023) add the geometric side of the same story, connecting information-theoretic compression in language models to low-dimensional structure in representation space. So the starting point of this essay is not to deny the compression view. It is to ask why compression can be productive.

But if that is all we say, we inherit a harder question. If LLMs are memory devices and compressors of the training data, why do they appear to generalize? Why can they respond to new compositions, solve variants of tasks, follow unfamiliar instructions, or produce useful behavior in regions that were not stored as exact examples?

There are two bad answers. One is to say the models must therefore be classical reasoners, manipulating explicit symbols and rules behind the scenes. The other is to say there is no real generalization at all, only stochastic parroting over memorized text (Bender et al., 2021). The first answer overstates the mechanism. The second understates the consequence of learning a rich high-dimensional geometry.

The proposal here is the middle answer: generalization can arise when retrieval reaches a coherent region that was not itself stored as a point. The model is still a retrieval engine, but retrieval over a learned geometry can do more than copy. It can interpolate into new but reachable regions.

That is the phenomenon this essay tries to make testable.

What emergence means here

In this essay, emergence is not used to mean magic, consciousness, irreducibility, or a sudden jump in a benchmark curve. It is used in a narrower, task-relative sense.

A behavior is emergent when it appears at the level of a composite system even though it is not present as an explicitly stored joint item in the parts. For LLMs, the relevant “parts” are not clean symbolic atoms. They are distributed regions, directions, memories, and activations shaped by many contexts. A new behavior is therefore emergent, in the sense used here, when a query reaches a coherent target region that was not stored as a single point, but that becomes reachable because the constituent regions are sufficiently dense, separable, and contextually aligned.

This notion has three important consequences.

  1. Emergence is relative to a decomposition. We must say what the constituents are, what the context is, and what target behavior counts as success.
  2. Emergence is not the same as scale. Scale may supply enough memory and representational dimension for the behavior to become reachable, but the scale threshold is evidence, not the definition.
  3. Emergence should be testable. If the claimed behavior is genuinely emergent in this sense, then exact joint examples should be absent, constituent support should matter, correct context should help, and wrong-context or random-label controls should fail.

So this essay’s working notion is deliberately modest: emergence is heldout but reachable coherence. The rest of the essay tries to make that phrase precise.

The Proposal

The central proposal is:

Some emergence is recursive, context-gated retrieval reaching a coherent region that was not explicitly stored as a point.

This is a candidate theory of emergence first and a theory of approximate retrieval second. Approximate retrieval is the proposed mechanism. Emergence is the phenomenon to be explained: coherent novelty produced by retrieval over a learned representation space.

The distinction matters because “retrieval” can sound deflationary. If a model retrieves, perhaps nothing new is happening. The claim here is different. Memorization is retrieval of a stored point. Retrieval-emergence is retrieval into a region that was not stored as a point, but that becomes reachable because the learned geometry contains enough support for the parts.

The running example is deliberately small: red triangle. A model may treat it literally, as a triangle with red bound as an attribute, or symbolically, as a sign-like object in the neighborhood of road signs and warnings. In a controlled setting, the exact phrase could be held out while “red” and “triangle” remain well represented. If the model still reaches a coherent representation of the combination, the important question is not whether it has “reasoned” in a classical sense. The question is whether its geometry makes the heldout composition reachable.

Scale is a symptom, not a definition

Wei et al. (2022) define an emergent ability as one that is “not present in smaller models but is present in larger models.” That definition was useful because it made scale plots operational. But it names a behavioral pattern, not the underlying phenomenon.

Model size can be necessary. A small model may lack the capacity to store enough neighborhoods, separate the relevant directions, or run enough retrieval refinements. A larger model may cross a threshold where the same query suddenly lands in a coherent region. But the threshold is evidence that something in the geometry changed. It is not, by itself, the definition of emergence. Some reported jumps may even be artifacts of harsh, discontinuous metrics rather than discontinuities in the model, since smoother metrics can make the same curve gradual (Schaeffer et al., 2023). That is consistent with the present view, in which the underlying quantity (geometric reachability) changes smoothly while a thresholded score snaps.

Here is the definition I want to test instead.

Let \(h(x) \in \mathbb{R}^d\) be the representation of an input. Let

\[ S = \{(k_i, v_i)\}_{i=1}^n \]

be an implicit memory: keys and values supplied by weights, context tokens, cached representations, or some combination of these. For a context \(c\), define a retrieval operator

\[ R_c(q;S)=\sum_i w_i(q,c)v_i, \qquad w_i(q,c)=\frac{\kappa_c(q,k_i)}{\sum_j \kappa_c(q,k_j)} . \]

The kernel \(\kappa_c\) is allowed to depend on context. That dependence is important: the same phrase can retrieve from different neighborhoods in a geometry of shapes, road signs, mathematics, or metaphors.
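A minimal sketch of the retrieval operator may help fix the notation. It assumes an exponential kernel and models the context as a hypothetical per-coordinate gate; neither choice comes from the definition above, which leaves \(\kappa_c\) open. The names `retrieve`, `gate`, and the toy memory are illustrative only.

```python
import math

def retrieve(query, gate, memory):
    """Kernel-weighted average of stored values, R_c(q; S).

    `gate` is a hypothetical stand-in for the context c: one non-negative
    weight per coordinate, so kappa_c(q, k) = exp(sum_d gate_d * q_d * k_d).
    """
    scores = [sum(g * q * k for g, q, k in zip(gate, query, key))
              for key, _ in memory]
    top = max(scores)                      # subtract the max for stability
    kernel = [math.exp(s - top) for s in scores]
    total = sum(kernel)
    weights = [k / total for k in kernel]  # w_i(q, c); these sum to 1
    dim = len(memory[0][1])
    return [sum(w * value[d] for w, (_, value) in zip(weights, memory))
            for d in range(dim)]

# Two stored (key, value) pairs; the values are one-hot markers for a
# "literal" neighborhood and a "sign" neighborhood (assumed toy data).
memory = [((1.0, 0.0), (1.0, 0.0)),
          ((0.0, 1.0), (0.0, 1.0))]
query = (1.0, 1.0)                         # matches both keys equally

literal = retrieve(query, (8.0, 0.0), memory)  # context weights coordinate 0
sign = retrieve(query, (0.0, 8.0), memory)     # context weights coordinate 1
print(literal, sign)
```

The same query lands almost entirely on the first value under one gate and on the second under the other, which is the sense in which context selects the retrieval neighborhood.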

For the rest of the definition, \(a\) and \(b\) denote the constituent concepts being composed, and \(c\) denotes the context that gates the retrieval neighborhood. In the running example, \(a\) could be red, \(b\) could be triangle, and \(c\) could be a visual-attribute context or a road-sign context. The notation \(T_{a,b,c} \subset \mathbb{R}^d\) denotes the acceptable target region for that composite under that context: not one exact vector, but the region of representations that decode to the intended behavior.

A capability for a composite query \(q=(a,b,c)\) is retrieval-emergent when four conditions hold:

  1. Joint holdout. The target composite is not stored as a point. There is no memory item within radius \(\epsilon\) of the target region \(T_{a,b,c}\).

  2. Constituent support. The constituents \(a\) and \(b\) are well represented. Their neighborhoods are dense, stable, and separable enough to retrieve from.

  3. Reachability. Recursive retrieval under context \(c\), written \(R_c^{(m)}\) for \(m\) successive retrieval steps, lands within a tolerance \(\delta\) of the acceptable target region:

    \[ \operatorname{dist}(R_c^{(m)}(q;S), T_{a,b,c}) \le \delta . \]

  4. Coherence. The decoded behavior is usable and stable under controls. Correct constituent and context interventions move it predictably; wrong-constituent and random controls do not.

This definition does not mention parameter count. Parameter count enters indirectly: it can reduce retrieval error, increase representational dimension, improve separation, and allow more retrieval steps. Those are mechanisms by which an emergent capability can become possible.

A small analytical result

The definition above is not merely metaphorical. In a simplified setting, we can show that emergence by retrieval is possible.

Let \(A\) and \(B\) be compact metric spaces of constituents. Let \(C\) be a context space. Suppose the desired composite representation is generated by a Lipschitz map

\[ \phi: A \times B \times C \to \mathbb{R}^d . \]

For a fixed context \(c\), assume

\[ \|\phi(a,b,c)-\phi(a',b',c)\| \le L_A d_A(a,a') + L_B d_B(b,b') . \]

Here \(d_A\) and \(d_B\) are the metrics on the constituent spaces, and \(L_A\) and \(L_B\) are the corresponding Lipschitz constants. They measure how sensitive the composite representation is to perturbing the first constituent versus the second while holding the context fixed.

Assume the model’s memory contains \(\epsilon_A\)- and \(\epsilon_B\)-nets for the constituent spaces. That is, for every \(a \in A\) there is a stored \(\hat a\) with \(d_A(a,\hat a) \le \epsilon_A\), and for every \(b \in B\) there is a stored \(\hat b\) with \(d_B(b,\hat b) \le \epsilon_B\). Also assume that the memory contains no stored examples of the joint composite \(\phi(a,b,c)\).

If retrieval recovers the nearest constituent representatives \(\hat a\) and \(\hat b\), and the composition map \(\phi\) is applied to them, then the Lipschitz assumption gives

\[ \|\phi(\hat a,\hat b,c)-\phi(a,b,c)\| \le L_A d_A(a,\hat a) + L_B d_B(b,\hat b) \le L_A\epsilon_A + L_B\epsilon_B . \]

Therefore, if the task accepts any representation within margin

\[ \delta > L_A\epsilon_A + L_B\epsilon_B , \]

then the heldout composite succeeds even though the composite itself was never stored.

This establishes a narrow but useful possibility: a retrieval engine can produce a coherent heldout composition when constituent coverage is dense enough and the composition map is smooth enough. The role of scale is now explicit. More memory and more capacity can shrink \(\epsilon_A\) and \(\epsilon_B\) and can create enough dimension to keep the relevant neighborhoods separated. A scale threshold occurs when the retrieval error crosses the task margin \(\delta\).

The proof does not show that every large-model ability works this way. It does not prove that transformers learn the required \(\phi\). It shows something more basic: the proposed mechanism is analytically possible, and it gives measurable quantities to look for.

Why transformers are plausible retrieval engines

With that positioning in place, the mechanism can be stated more directly. Transformers supply three concrete retrieval mechanisms.

Matrix multiplication scores similarity

For a matrix multiplication \(y = Wx\), each coordinate is

\[ y_i = \langle w_i, x\rangle . \]

The rows of \(W\) score the query \(x\) by inner product. If we only care about the largest scores, this is maximum inner-product search, a standard retrieval problem (Shrivastava & Li, 2014). If we use the scores to form a weighted sum of values, it becomes regression over a memory.
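As a small illustration of both readings, the sketch below scores a query against the rows of a toy matrix, picks the best-scoring row (the maximum inner-product search reading), and then forms a score-weighted average of values (the regression reading). The matrix, the values, and the choice to normalize raw scores by their sum are assumptions made for the example.

```python
def matvec(W, x):
    """y_i = <w_i, x> for each row w_i of W."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Toy memory: each row of W is a stored key (assumed data).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.6, 0.8, 0.0]]
x = [0.5, 0.9, 0.1]

scores = matvec(W, x)

# Retrieval reading: keep only the row with the largest inner product.
best_row = max(range(len(scores)), key=lambda i: scores[i])

# Regression reading: weight one value per row by its score.
# Normalizing by the score sum is one simple choice; it needs positive scores.
values = [10.0, 20.0, 30.0]
prediction = sum(s * v for s, v in zip(scores, values)) / sum(scores)

print(scores, best_row, prediction)
```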

There are two sharper readings.

First, ridge regression has a dual form. The learned weight vector can be written as

\[ w = \sum_i \alpha_i x_i , \]

so prediction is

\[ \hat y = \langle w,x\rangle = \sum_i \alpha_i \langle x_i,x\rangle . \]

That is a similarity-weighted sum over stored training inputs. This is exact for kernel and linear ridge settings (Schölkopf et al., 2001), and suggestive for learned layers.
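The dual form can be checked numerically on a toy problem. The sketch below assumes two training inputs with three features and a ridge penalty `lam`; it solves the 2x2 dual system in closed form, builds \(w = \sum_i \alpha_i x_i\), and confirms that this \(w\) satisfies the primal normal equations and that both prediction formulas agree. All data are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed toy data: n = 2 training inputs, d = 3 features.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
y = [3.0, 1.0]
lam = 0.5  # ridge penalty

# Dual solution: alpha = (X X^T + lam I)^{-1} y, via the 2x2 inverse.
k11 = dot(X[0], X[0]) + lam
k12 = dot(X[0], X[1])
k22 = dot(X[1], X[1]) + lam
det = k11 * k22 - k12 * k12
alpha = [(k22 * y[0] - k12 * y[1]) / det,
         (-k12 * y[0] + k11 * y[1]) / det]

# The weight vector is a combination of the stored training inputs.
w = [sum(a * xi[j] for a, xi in zip(alpha, X)) for j in range(3)]

# Check the primal normal equations (X^T X + lam I) w = X^T y.
residual = max(
    abs(sum(xi[j] * dot(xi, w) for xi in X) + lam * w[j]
        - sum(xi[j] * yi for xi, yi in zip(X, y)))
    for j in range(3)
)

# Prediction two ways: weight vector vs. similarity-weighted sum over memory.
x_new = [2.0, -1.0, 0.5]
pred_primal = dot(w, x_new)
pred_dual = sum(a * dot(xi, x_new) for a, xi in zip(alpha, X))
print(residual, pred_primal, pred_dual)
```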

Second, every matrix has an SVD:

\[ Wx = \sum_r \sigma_r u_r \langle v_r,x\rangle . \]

The right singular vectors behave like keys, the left singular vectors like values, and the singular values like gains. This is not a high-capacity associative memory, because the SVD basis is orthogonal and rank-limited, but it shows that even a plain linear map has a key-value reading.

Attention is kernel regression

A single attention head (Vaswani et al., 2017) computes

\[ \operatorname{attn}(q;K,V) = \sum_j \frac{\exp(\langle q,k_j\rangle/\sqrt d)} {\sum_\ell \exp(\langle q,k_\ell\rangle/\sqrt d)} v_j . \]

This is the Nadaraya-Watson kernel regression estimator with an exponential kernel (Nadaraya, 1964; Watson, 1964; Tsai et al., 2019). In a sharp, low-temperature limit, it approaches nearest-neighbor lookup. In a smooth, high-temperature limit, it averages over a neighborhood. Attention therefore lives on a dial between retrieval and interpolation.
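The dial can be made concrete with a scalar-valued sketch. The `temperature` argument is an added assumption (the formula above corresponds to `temperature=1.0`), and the keys, values, and query are toy data.

```python
import math

def attention(query, keys, values, temperature=1.0):
    """Nadaraya-Watson estimate with kernel exp(<q, k> / (temperature * sqrt(d)))."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / (temperature * math.sqrt(d))
              for key in keys]
    top = max(logits)                       # stabilise the exponentials
    kernel = [math.exp(l - top) for l in logits]
    total = sum(kernel)
    return sum(k / total * v for k, v in zip(kernel, values))

keys = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
values = [1.0, 5.0, 3.0]
query = (0.8, 0.6)                          # closest to the second key

sharp = attention(query, keys, values, temperature=0.01)    # near nearest-neighbor
smooth = attention(query, keys, values, temperature=100.0)  # near plain average
print(sharp, smooth)
```

At low temperature the output is essentially the value stored under the best-matching key; at high temperature it approaches the unweighted mean of all three values.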

MLP blocks are key-value memories

A transformer MLP has the form

\[ f(x)=W_{\text{out}}\sigma(W_{\text{in}}x). \]

The rows of \(W_{\text{in}}\) act as keys. The activation vector gives match strengths. The columns of \(W_{\text{out}}\) supply values. Geva et al. (2021) analyze feed-forward layers in exactly this key-value-memory language, and model-editing work such as ROME locates factual associations in related middle-layer MLP computations (Meng et al., 2022).
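The key-value reading is an exact rewriting of the matrix form, which a few lines can confirm. The sketch assumes a ReLU for \(\sigma\) and uses small hand-picked weights; both are illustrative choices rather than anything specified above.

```python
def relu(z):
    return max(0.0, z)

# Assumed toy weights: 3 hidden units, input dim 2, output dim 2.
W_in = [[1.0, 0.0],
        [0.0, 1.0],
        [-1.0, -1.0]]          # rows are keys
W_out = [[2.0, 0.0, 1.0],
         [0.0, 3.0, 1.0]]      # columns are values

def mlp(x):
    """Matrix form: W_out @ relu(W_in @ x)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_out]

def key_value_read(x):
    """Memory form: sum over slots of match_strength(key_i, x) * value_i."""
    out = [0.0] * len(W_out)
    for i, key in enumerate(W_in):
        strength = relu(sum(k * xi for k, xi in zip(key, x)))
        value = [row[i] for row in W_out]   # i-th column of W_out
        out = [o + strength * v for o, v in zip(out, value)]
    return out

x = [0.5, -2.0]
print(mlp(x), key_value_read(x))
```

Only the slots whose keys match the input contribute; here the second key is silenced by the ReLU and its value is never read.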

These readings are not isolated curiosities; several independent derivations converge on the same conclusion. Prof. Richard Baraniuk and collaborators show that deep networks with piecewise-linear nonlinearities are max-affine spline operators that partition the input space and act affinely on each region (Balestriero & Baraniuk, 2018), and separately derive self-attention as the support-vector expansion of a support-vector regression problem (Nguyen et al., 2024). Prof. Yi Ma and collaborators derive attention- and projection-like operators from a principle of compression toward parsimonious, low-dimensional structure (Chan et al., 2022; Yu et al., 2023). That spline, regression, and compression arguments all land on the same retrieval-and-regression reading is part of why I treat it as more than analogy.

Together these mechanisms justify the phrase approximate retrieval engine. But a testable theory of emergence needs one further step: composition across depth. Each layer retrieves, transforms, and passes a new query upward. The strong conjecture is that depth performs recursive retrieval over intermediate representations. Emergence occurs when that recursion lands in a coherent region that no single stored point occupied.

A small intuition: the red triangle

“Red” is not a single coordinate. It is a region shaped by apples, roses, carpets, warning lights, red tape, red lines, and many other contexts. “Triangle” is another region, shaped by geometry, diagrams, signs, pyramids, and mathematical language.

The phrase “red triangle” can be literal or symbolic:

  1. In a shapes-and-colors context, it should retrieve a visual-attribute composition: triangle plus red.
  2. In a roads-and-signs context, it should retrieve a conventional sign region: warning, yield, hazard, give way.

In a pretrained model, the phrase was almost certainly seen. So this example is not evidence of holdout by itself. It is a small, inspectable intuition for what the theory means: a query can land in different coherent regions depending on constituent support and context.

The controlled experiments below replace human-language words with alien integer symbols so that the holdout condition is real. The same logic suggests four controls:

  1. heldout joint composites should be absent from training,
  2. constituents should be well supported in other contexts and combinations,
  3. correct context should improve reachability,
  4. wrong-context and random-label controls should fail.

What the experiments show

The experiments below are organized as an evidence ladder:

  1. a deterministic proof sanity check,
  2. a learned context-dependent routing baseline,
  3. a fixed-functional composition experiment,
  4. a constituent-modulated composition experiment.

The distinction matters. The first check only calibrates the proof. The routing baseline removes language confounds but does not establish composition. The last two experiments use a single output label that depends jointly on both constituents, so they are the relevant evidence for the emergence claim.

The abstraction is intentional. If the question is whether a model can produce a new composite behavior without storing that composite as a training example, then natural language is a dangerous first test bed: familiar words, pretrained tokenizers, and hidden corpus facts all make it hard to know what was actually new. The experiments therefore strip the problem down to its ingredients. There are constituents, a context, a heldout joint input, and a target behavior. Composition means that the target behavior depends on the constituents separately and on the context that binds them.

This makes the experimental objective narrower but cleaner. The experiments do not try to show that a toy model is an LLM. They ask whether the proposed definition has operational content: can we hold out a joint behavior, keep its parts well supported, and then recover the correct behavior only when the context and constituent structure are coherent?

1. Proof sanity check

This check builds two continuous constituent spaces, \(A = B = [0,1]\), and a binary context \(c \in \{0,1\}\). It should be read as a sanity check for the analytical proof, not as evidence that anything has been learned. The composite representation is the deterministic, context-gated map \(\phi:[0,1]^2\times\{0,1\}\to\mathbb{R}^4\),

\[ \phi(a,b,c) = \Big(\,a,\;\; b,\;\; \mathbb{1}[c=0]\,(a+b),\;\; \mathbb{1}[c=1]\,(a\,b)\,\Big), \]

so context \(0\) composes the constituents additively and context \(1\) multiplicatively. Both branches are Lipschitz on \([0,1]^2\) with constants \(L_A = L_B = \sqrt{2}\). The experiment does not learn addition or multiplication, and it does not establish emergence in a model; it only checks the geometry the proof assumes.

The memory is an \(\varepsilon\)-net over the constituents: a uniform grid \(G_m = \{0,\tfrac{1}{m-1},\dots,1\}\) of \(m\) points per space, with covering radius \(\varepsilon = \tfrac{1}{2(m-1)}\). Crucially it stores constituent representatives only – no joint composite \(\phi(a,b,c)\) is ever stored. Retrieval snaps a query to its nearest grid points and applies the known map, and we score the reconstruction against the analytically generated truth:

\[ \hat a = \arg\min_{g\in G_m}|a-g|,\qquad \hat b = \arg\min_{g\in G_m}|b-g|,\qquad e = \big\lVert\,\phi(\hat a,\hat b,c)-\phi(a,b,c)\,\big\rVert . \]

The bound the proof predicts is then

\[ e \;\le\; L_A\,\varepsilon + L_B\,\varepsilon \;=\; \frac{\sqrt{2}}{m-1}, \]

and a query counts as solved when \(e \le \delta\) for the task margin \(\delta = 0.12\). Two controls mirror the later experiments: a wrong-context reconstruction \(\phi(\hat a,\hat b,\,1-c)\) and random-constituent draws, both of which should fail.

The result is intentionally unsurprising. With a coarse grid, the nearest stored representatives \(\hat a\) and \(\hat b\) can be far from the true constituents \(a\) and \(b\), so even exact addition or multiplication can produce \(\phi(\hat a,\hat b,c)\) outside the task margin around \(\phi(a,b,c)\). As the grid becomes finer, the covering radius shrinks, the approximation error falls, and the deterministic map produces outputs closer to the true composite. That is just the Lipschitz-style argument in numerical form.

This check remains useful only as calibration. It verifies that the experiments implement the definitions consistently: constituent support density is varied, reachability is measured as distance to the target representation, and wrong-context/random-constituent controls fail. It should not be treated as a substantive finding. The substantive question starts when the composition rule is not supplied at evaluation time, but must be learned from data.

Proof calibration - task margin delta = 0.12

| support | eps | bound | mean err | p95 err | success | wrong ctx | random | guaranteed |
|---|---|---|---|---|---|---|---|---|
| 3 | 0.250 | 0.707 | 0.246 | 0.432 | 0.115 | 0.007 | 0.013 | False |
| 4 | 0.167 | 0.471 | 0.160 | 0.288 | 0.289 | 0.007 | 0.018 | False |
| 6 | 0.100 | 0.283 | 0.098 | 0.174 | 0.733 | 0.007 | 0.018 | False |
| 8 | 0.071 | 0.202 | 0.070 | 0.124 | 0.940 | 0.005 | 0.025 | False |
| 12 | 0.045 | 0.129 | 0.044 | 0.077 | 1.000 | 0.009 | 0.031 | False |
| 16 | 0.033 | 0.094 | 0.032 | 0.058 | 1.000 | 0.007 | 0.028 | True |
| 24 | 0.022 | 0.061 | 0.021 | 0.037 | 1.000 | 0.007 | 0.022 | True |
| 32 | 0.016 | 0.046 | 0.016 | 0.028 | 1.000 | 0.008 | 0.025 | True |
| 48 | 0.011 | 0.030 | 0.010 | 0.018 | 1.000 | 0.008 | 0.032 | True |
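The eps, bound, and guaranteed columns follow from the grid size alone, so they can be recomputed independently. The sketch below does that and also measures reconstruction error on uniformly sampled queries. The sampling scheme, trial count, and seed are assumptions, so the measured error statistics need not match the table; this is not the code that produced it.

```python
import math
import random

DELTA = 0.12                 # task margin from the text
LIP = math.sqrt(2.0)         # L_A = L_B = sqrt(2)

def phi(a, b, c):
    """Context-gated map: additive branch for c = 0, multiplicative for c = 1."""
    return (a, b, (a + b) if c == 0 else 0.0, (a * b) if c == 1 else 0.0)

def snap(x, m):
    """Nearest point of the uniform grid {0, 1/(m-1), ..., 1}."""
    return round(x * (m - 1)) / (m - 1)

def calibrate(m, trials=2000, seed=0):
    rng = random.Random(seed)            # hypothetical seed
    eps = 1.0 / (2 * (m - 1))            # covering radius
    bound = 2 * LIP * eps                # L_A * eps + L_B * eps
    worst, solved = 0.0, 0
    for _ in range(trials):
        a, b, c = rng.random(), rng.random(), rng.randrange(2)
        err = math.dist(phi(snap(a, m), snap(b, m), c), phi(a, b, c))
        worst = max(worst, err)
        solved += err <= DELTA
    return {"eps": eps, "bound": bound, "worst_err": worst,
            "success": solved / trials, "guaranteed": bound <= DELTA}

for m in (3, 4, 6, 8, 12, 16, 24, 32, 48):
    row = calibrate(m)
    print(m, round(row["eps"], 3), round(row["bound"], 3),
          round(row["success"], 3), row["guaranteed"])
```

The bound is conservative: a grid can solve every sampled query before the bound itself drops below the margin, which is the situation in the support-12 row.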

2. Learned integer alien-symbol routing baseline

The second experiment removes the remaining language confound. The model sees only integer token IDs. The input format is [context, A-token, B-token], where the two constituent families and two contexts are all synthetic. A deterministic set of joint (A-token, B-token) pairs is held out from training for both contexts. In the dense condition, every heldout constituent still appears in many other non-heldout combinations: the audit reports 438 dense training examples, 74 heldout examples, full constituent support coverage, and 13-14 examples per constituent/context cell.

Concretely, there are \(N_A = N_B = 16\) alien constituents and two contexts \(c\in\{0,1\}\). Each input is a triple of integer token IDs

\[ x = \big(c,\;\; \tau_A(a),\;\; \tau_B(b)\big),\qquad \tau_A(a)=2+a,\;\; \tau_B(b)=18+b,\;\; a,b\in\{0,\dots,15\}. \]

A joint pair is held out, in both contexts, by a deterministic predicate

\[ H(a,b) = \big[\,(3a+5b)\bmod 7 = 0\,\big], \]

which removes the exact triples while leaving every constituent supported elsewhere. The target has two output slots, and context gates each one independently:

\[ y_1(a,c)=\begin{cases} a, & c=0,\\[2pt] 16+\pi_A(a), & c=1,\end{cases} \qquad y_2(b,c)=\begin{cases} b, & c=0,\\[2pt] 16+\pi_B(b), & c=1,\end{cases} \]

with \(\pi_A,\pi_B\) fixed random permutations of \(\{0,\dots,15\}\).

This is the crucial caveat. Because \(y_1\) depends only on \((a,c)\) and \(y_2\) only on \((b,c)\), the target factorizes: context \(0\) is the identity on each constituent index, and context \(1\) is an independent permutation of each. That is context-gated routing, not composition.

That means the current experiment does not establish nontrivial composition between \(A\) and \(B\). If there were only a single identity context, the heldout pair would not matter much: the model could succeed by learning to copy or classify each constituent independently. With two contexts, the task is stronger because the interpretation is context-gated, but it is still separable. Success shows that a model can learn two context-dependent constituent lookups and apply them simultaneously to a jointly heldout input. It does not show that the model has learned an interaction between the constituents.

The baseline is still useful, but only in a limited way. It verifies that the no-language routing setup works: the symbols are made up, the model starts from random weights, the exact joint triples are absent from training, sparse support fails, wrong-context evaluation fails, and random labels fail. This is a necessary control before a stronger experiment, not the stronger experiment itself.

The model is tiny and trained from scratch. It has an embedding table and two context-gated constituent paths: one path reads (context, A-token) and predicts the first output slot; the other reads (context, B-token) and predicts the second output slot. This architecture is intentionally aligned with the routing baseline. That alignment is useful for checking the support story, but it also explains why this is not yet a strong composition test.

The main results are:

Vocabulary = 34, classes = 32. Dense train = 438, heldout = 74, coverage = 1.000, mean support = 13.69. A/B support per context ranges 13-14 / 13-14. Holdout audit passed: no heldout joint triple appears in training, and every constituent is supported.
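The audit figures can be recomputed from the token map and the holdout predicate alone. The sketch below enumerates all triples, splits them with \(H(a,b)\), and counts per-context constituent support in the dense training set. The permutations and their seed are hypothetical stand-ins for the fixed random \(\pi_A,\pi_B\); the counts do not depend on them.

```python
import random
from collections import Counter

N = 16

def tau_a(a):
    return 2 + a          # A tokens occupy IDs 2..17

def tau_b(b):
    return 18 + b         # B tokens occupy IDs 18..33

def is_heldout(a, b):
    return (3 * a + 5 * b) % 7 == 0

rng = random.Random(0)                    # hypothetical seed
pi_a = rng.sample(range(N), N)            # stand-in for pi_A
pi_b = rng.sample(range(N), N)            # stand-in for pi_B

def target(a, b, c):
    """Two output slots; each depends on one constituent and the context."""
    y1 = a if c == 0 else 16 + pi_a[a]
    y2 = b if c == 0 else 16 + pi_b[b]
    return y1, y2

train, heldout = [], []
for c in (0, 1):
    for a in range(N):
        for b in range(N):
            triple = (c, tau_a(a), tau_b(b))
            (heldout if is_heldout(a, b) else train).append(triple)

# Support: how many training triples contain each (context, token) cell.
support_a = Counter((c, ta) for c, ta, _ in train)
support_b = Counter((c, tb) for c, _, tb in train)
vocabulary = {tok for triple in train + heldout for tok in triple}

print(len(train), len(heldout), len(vocabulary))
print(sorted(set(support_a.values())), sorted(set(support_b.values())))
```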

Routing baseline - heldout performance by regime

| regime | train | coverage | mean sup | seen | heldout | margin | wrong ctx |
|---|---|---|---|---|---|---|---|
| untrained | 0 | 0.000 | 0.000 | | 0.000 | -0.329 | |
| sparse support | 32 | 0.688 | 1.455 | 1.000 | 0.446 | -8.934 | |
| dense support | 438 | 1.000 | 13.688 | 1.000 | 1.000 | 11.826 | 0.000 |
| random labels | 438 | 1.000 | 13.688 | 0.032 | 0.000 | -6.000 | |

This is the pattern the routing baseline predicts. The untrained model has no heldout competence. Sparse support can fit its small training set but has negative heldout margin and only partial heldout accuracy. Dense support reaches perfect heldout performance, although the exact joint triples were never seen. Wrong-context evaluation collapses, showing that context gates the lookup. Random labels collapse despite identical input density, showing that density alone is not enough; there must be compressible constituent structure.

The sparse condition matters because it separates training fit from support. The model can memorize its small training set, but many constituent/context cells are weakly supported, so heldout performance is unreliable. The dense condition supplies the missing constituent support without adding the heldout joint triples. The random-label condition then checks that support alone is not magic: when the labels do not share constituent structure, dense exposure does not produce heldout performance.

3. Fixed-functional composition

The first nontrivial composition experiment removes the factorization. There are \(N_A = N_B = 64\) constituents over the field \(\mathbb{F}_{17}\). Each alien \(A\) token carries a hidden value \(u_A(a) \in \mathbb{F}_{17}\) and each alien \(B\) token a hidden value \(v_B(b) \in \mathbb{F}_{17}\); the model never sees these values, only the permuted integer token IDs \(\tau_A(a),\tau_B(b)\) – a fixed random relabelling that carries no arithmetic. The single output label is the interaction passed through a fixed label permutation \(\sigma\) of \(\mathbb{F}_{17}\),

\[ y(a,b,c) = \sigma\big(F_c(u_A(a),\,v_B(b))\big), \]

with the context selecting the interaction \(F_c\):

\[ F_0(u,v)=uv \pmod {17}, \]

\[ F_1(u,v)=u(2v+3)+5 \pmod {17}. \]

The joint holdout is again deterministic, now also gated by context,

\[ H(a,b,c) = \big[\,(7a+11b+3c)\bmod 10 = 0\,\big], \]

so the tested triples never appear in training. Now the target cannot be decomposed into one prediction for \(A\) and one prediction for \(B\). \(A\) alone is insufficient, \(B\) alone is insufficient, and the heldout pair is absent from training. Dense support means that each constituent appears in many other joint combinations, not that the tested composite was seen.

The main results are:

Output classes = 17, chance = 0.059. Dense train = 7372, heldout = 820, coverage = 1.000, mean support = 57.59. One-sided controls: A-only = 0.063, B-only = 0.061.
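The split sizes and the chance rate follow directly from the definitions, and the interaction itself is short to write down. In the sketch below the hidden values and the label permutation are drawn at random from a hypothetical seed, since the text does not say how they were assigned. It also works on constituent indices and omits the token relabelling \(\tau_A,\tau_B\), which affects neither the labels nor the counts.

```python
import random

P = 17                 # field size
N = 64                 # constituents per family

rng = random.Random(0)                       # hypothetical seed
u_a = [rng.randrange(P) for _ in range(N)]   # hidden value of each A token
v_b = [rng.randrange(P) for _ in range(N)]   # hidden value of each B token
sigma = list(range(P))
rng.shuffle(sigma)                           # fixed label permutation

def interaction(u, v, c):
    """F_0(u, v) = u*v and F_1(u, v) = u*(2v + 3) + 5, both mod 17."""
    if c == 0:
        return (u * v) % P
    return (u * (2 * v + 3) + 5) % P

def label(a, b, c):
    return sigma[interaction(u_a[a], v_b[b], c)]

def is_heldout(a, b, c):
    return (7 * a + 11 * b + 3 * c) % 10 == 0

triples = [(a, b, c) for c in (0, 1) for a in range(N) for b in range(N)]
heldout = [t for t in triples if is_heldout(*t)]
train = [t for t in triples if not is_heldout(*t)]

print(len(train), len(heldout), round(1 / P, 3))
print(label(0, 0, 0), label(0, 0, 1))
```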

Heldout performance by regime

| regime | train | coverage | mean sup | seen | heldout | margin | wrong ctx |
|---|---|---|---|---|---|---|---|
| untrained | 0 | 0.000 | 0.000 | | 0.057 | -0.120 | |
| sparse density | 368 | 0.973 | 2.956 | 1.000 | 0.126 | -13.111 | |
| dense support | 7372 | 1.000 | 57.594 | 1.000 | 1.000 | 12.235 | 0.063 |
| random labels | 7372 | 1.000 | 57.594 | 1.000 | 0.061 | -12.923 | |

Support sweep

| frac | train | coverage | mean sup | heldout | margin |
|---|---|---|---|---|---|
| 0.050 | 368 | 0.973 | 2.956 | 0.126 | -13.111 |
| 0.100 | 737 | 1.000 | 5.758 | 0.133 | -12.054 |
| 0.200 | 1474 | 1.000 | 11.516 | 0.143 | -10.056 |
| 0.400 | 2948 | 1.000 | 23.031 | 0.323 | -3.356 |
| 0.700 | 5160 | 1.000 | 40.312 | 0.896 | 7.061 |
| 1.000 | 7372 | 1.000 | 57.594 | 1.000 | 12.235 |

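One-sided controls fail: the \(A\)-only majority baseline reaches 0.063 heldout accuracy and the \(B\)-only baseline reaches 0.061, both near the chance rate of \(1/17 \approx 0.059\). The support sweep shows the threshold pattern: heldout accuracy stays between 0.126 and 0.143 for training fractions up to 0.2, rises to 0.323 at 0.4 and 0.896 at 0.7, and reaches 1.000 at full dense support. The heldout margin turns positive between fractions 0.4 and 0.7.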

This is the cleanest version of the weak claim: a global interaction law can be learned from dense constituent support and applied to heldout joint inputs.

4. Constituent-modulated composition

The stronger experiment keeps the same protocol – permuted token IDs \(\tau_A,\tau_B\), a permuted output label \(y=\sigma\big(F_c(\cdot)\big)\), and the same context-gated joint holdout \(H(a,b,c)=\big[(7a+11b+3c)\bmod 10=0\big]\) – but makes the functional itself depend on a constituent. Each \(A_i\) is operator-like, carrying hidden parameters \((\alpha_i,\beta_i,\gamma_i)\in\mathbb{F}_{17}^3\); each \(B_j\) carries a hidden value \(x_j \in \mathbb{F}_{17}\). Context selects the family of interpretation:

\[ F_0(A_i,B_j)=\alpha_i x_j+\beta_i \pmod {17}, \]

\[ F_1(A_i,B_j)=\alpha_i x_j^2+\beta_i x_j+\gamma_i \pmod {17}. \]

The model receives only integer token IDs and predicts a single permuted output label. This is closer to the intended theory: a constituent is not merely an argument to a fixed function; it helps parameterize the functional applied to the other constituent.
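A short sketch shows what "operator-like" means here. The parameter triples and the \(B\) value are made-up examples, not the hidden values used in the experiment.

```python
P = 17  # arithmetic is over the field F_17

def f0(params, x):
    """Context 0: affine operator alpha*x + beta (gamma unused)."""
    alpha, beta, _ = params
    return (alpha * x + beta) % P

def f1(params, x):
    """Context 1: quadratic operator alpha*x^2 + beta*x + gamma."""
    alpha, beta, gamma = params
    return (alpha * x * x + beta * x + gamma) % P

# Two hypothetical A tokens and one hypothetical B value.
A_i = (2, 3, 5)
A_k = (4, 0, 1)
x_j = 6

# The same B value is sent to different outputs by different A tokens,
# and the context decides which operator family each A token selects.
print(f0(A_i, x_j), f0(A_k, x_j))
print(f1(A_i, x_j), f1(A_k, x_j))

# Each A token defines a whole function over F_17, not a single number.
operator_table = [f0(A_i, x) for x in range(P)]
print(operator_table)
```

Because each \(A\) token induces its own function table, predicting the label from \(A\) alone or from \(B\) alone cannot work in general: the \(A\) token fixes the operator and the \(B\) token fixes its argument.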

The results are:

Output classes = 17, chance = 0.059. Dense train = 7372, heldout = 820, coverage = 1.000, mean support = 57.59. One-sided controls: A-only = 0.005, B-only = 0.070.

Heldout performance by regime

| regime | train | coverage | mean sup | seen | heldout | margin | wrong ctx |
|---|---|---|---|---|---|---|---|
| untrained | 0 | 0.000 | 0.000 | | 0.038 | -0.126 | |
| sparse density | 368 | 0.973 | 2.956 | 1.000 | 0.070 | -13.478 | |
| dense support | 7372 | 1.000 | 57.594 | 1.000 | 0.923 | 5.092 | 0.054 |
| random labels | 7372 | 1.000 | 57.594 | 1.000 | 0.062 | -13.062 | |

Support sweep

| frac | train | coverage | mean sup | heldout | margin |
|---|---|---|---|---|---|
| 0.050 | 368 | 0.973 | 2.956 | 0.070 | -13.478 |
| 0.100 | 737 | 1.000 | 5.758 | 0.076 | -13.430 |
| 0.200 | 1474 | 1.000 | 11.516 | 0.071 | -12.742 |
| 0.400 | 2948 | 1.000 | 23.031 | 0.159 | -8.693 |
| 0.700 | 5160 | 1.000 | 40.312 | 0.428 | -2.185 |
| 1.000 | 7372 | 1.000 | 57.594 | 0.923 | 5.092 |

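The one-sided controls again fail: \(A\)-only reaches 0.005 and \(B\)-only reaches 0.070. The support sweep is more demanding than in the fixed-functional case: at a training fraction of 0.7, heldout accuracy is only 0.428 (against 0.896 for the fixed functional) and the margin is still negative. Only full dense support reaches 0.923 heldout accuracy with a positive margin.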

This is the strongest result of the four experiments. The exact joint triples are absent, the target is non-factorized, one-sided predictors fail, wrong context fails, random labels fail, and the heldout margin becomes positive only when constituent support is dense. That is the definition’s empirical signature in a small controlled setting.

But this should still be read conservatively. These experiments do not falsify the theory; they show that the definition can be satisfied in small, controlled systems without relying on language priors or memorized joint examples. That is useful, but it is not conclusive evidence for the theory in general. The final task is also deliberately simple: the constituent-dependent functional is low-degree, and its parameters enter in a structured way. It is closer to an affine or polynomial operator family than to the open-ended functional modulation seen in natural language. So the correct claim is modest: the experiments have found no contradiction so far, and they establish a clean foothold for stronger tests.

The next evidence should come from harder ablations: less factorized architectures, harder operator families, more random seeds, adversarial holdouts, explicit capacity sweeps, and internal reachability measurements. A convincing case would show the same threshold pattern when the model has fewer architectural hints and the functional family is richer.

What the definition asks us to test

The related work establishes that compression, memory, retrieval, and geometry are all legitimate lenses on transformers. The proposed definition adds a test: look for heldout but reachable coherence.

  1. Is the joint behavior absent as a stored point?
  2. Are the constituents densely represented?
  3. Does recursive retrieval land near the right target region?
  4. Do causal interventions move the behavior in the predicted directions?
  5. Does the geometric signal improve smoothly before the thresholded behavior appears?
  6. Does the result survive a non-factorized target, where the answer cannot be decomposed into independent predictions for \(A\) and \(B\)?
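
Questions 1 and 6 can be probed before training anything, using a majority-vote lookup table that sees only part of the input. The sketch below is an illustrative baseline, not the one-sided control used in the experiments, so its numbers are not expected to match the reported ones. It assumes 64 tokens per side, draws its own random hidden parameters, and skips the label permutation, which does not change accuracy.

```python
import random
from collections import Counter, defaultdict

P, N = 17, 64  # field size from the text; N tokens per side is an assumption

rng = random.Random(0)
params = [tuple(rng.randrange(P) for _ in range(3)) for _ in range(N)]  # (alpha, beta, gamma)
values = [rng.randrange(P) for _ in range(N)]                           # hidden x per B token

def target(a, b, c):
    """F_0 (affine) for c == 0, F_1 (quadratic) for c == 1, both mod 17."""
    al, be, ga = params[a]
    x = values[b]
    return (al * x + be) % P if c == 0 else (al * x * x + be * x + ga) % P

def held_out(a, b, c):
    return (7 * a + 11 * b + 3 * c) % 10 == 0

rows = [(a, b, c, target(a, b, c)) for a in range(N) for b in range(N) for c in (0, 1)]
train = [r for r in rows if not held_out(*r[:3])]
held = [r for r in rows if held_out(*r[:3])]

def lookup_accuracy(key):
    """Heldout accuracy of a majority-vote table indexed by key(row)."""
    votes = defaultdict(Counter)
    for r in train:
        votes[key(r)][r[3]] += 1
    # keys never seen in training have an empty Counter and count as misses
    hits = sum(votes[key(r)].most_common(1)[0][0] == r[3] for r in held if votes[key(r)])
    return hits / len(held)

a_only = lookup_accuracy(lambda r: (r[0], r[2]))        # sees A and context
b_only = lookup_accuracy(lambda r: (r[1], r[2]))        # sees B and context
joint = lookup_accuracy(lambda r: (r[0], r[1], r[2]))   # exact triple lookup
print(f"A-only={a_only:.3f}  B-only={b_only:.3f}  joint lookup={joint:.3f}  chance={1 / P:.3f}")
```

The joint lookup scores exactly zero because no heldout triple appears in training, which is what question 1 asks for. If either one-sided table scored well above chance, the target would be partly factorized and heldout success would not be evidence of composition.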

This framing also sits between two unhelpful extremes.

It is weaker than saying LLMs are classical reasoning systems. No symbolic planner is being posited. The mechanism is retrieval, interpolation, and recursive refinement over learned representations.

It is stronger than saying LLMs are merely stochastic parrots. A model that can retrieve into a jointly-heldout but coherent region is doing more than copying a stored phrase. It is using a geometry built from many stored contexts to reach a new, usable combination.

Falsifiable Predictions

A testable theory needs predictions that can fail. This proposal makes the following ones.

  1. Constituent density. Performance on a novel composite should track the density and separation of its constituents, not merely exposure to the composite itself.
  2. Reachability. A geometric distance-to-target measure should predict success before a discrete task metric flips from failure to success.
  3. Context gating. Context and activation steering should move the query toward different neighborhoods and produce different readings.
  4. Coherence controls. Wrong-context, random-label, and one-sided controls should fail even when the dense-support condition succeeds.
  5. Scale as error reduction. Increasing capacity or support density should shrink retrieval error. Behavioral emergence should appear when that error crosses the task margin.

These predictions can fail. If novel-composite performance is independent of constituent density, the definition is wrong or incomplete. If reachability margin does not track heldout success, the mechanism is weak. If wrong-context and random-label controls succeed as well as the dense-support condition, the theory is not isolating emergence.
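
As a small worked check of the first two predictions, the support sweep numbers reported for the constituent-dependent experiment can be rank-correlated with mean support. This is a sketch on six points from one run, so it illustrates the shape of the test rather than providing evidence. The "margin" here is the heldout margin from the table, used as a stand-in for a proper reachability measure, which is an assumption.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via the no-ties formula."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# columns copied from the support sweep table
support = [2.956, 5.758, 11.516, 23.031, 40.312, 57.594]
heldout = [0.070, 0.076, 0.071, 0.159, 0.428, 0.923]
margin = [-13.478, -13.430, -12.742, -8.693, -2.185, 5.092]

rho_acc = spearman(support, heldout)
rho_margin = spearman(support, margin)
print(f"support vs heldout accuracy: {rho_acc:.3f}")
print(f"support vs margin:           {rho_margin:.3f}")
```

The margin rises monotonically with support across all six points, while accuracy is flat and slightly non-monotone over the first three. That is the pattern prediction 2 describes: the continuous signal moves before the discrete metric does. A failed prediction would show up here as a correlation near zero.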

Stronger Tests

The current alien-symbol experiments are deliberately small and structured. That is useful for isolating the definition, but it is not yet a claim about arbitrary transformers. The next stronger experiment is architectural: replace the compact MLP with a tiny transformer, keep the same integer-only non-factorized tasks, sweep model capacity and constituent density, and inspect whether an internal reachability margin predicts heldout success.
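
A minimal sketch of what that architectural swap could look like follows. The class name, vocabulary sizes, and hyperparameters are all hypothetical; this is one plausible tiny transformer for three-token integer inputs, not the model used in the experiments.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Hypothetical stand-in for the compact MLP: tokens (a, b, c) in, 17 logits out."""

    def __init__(self, n_a=64, n_b=64, n_c=2, n_out=17, d_model=32, n_heads=2, n_layers=2):
        super().__init__()
        # one shared embedding table: A tokens first, then B tokens, then contexts
        self.offsets = (0, n_a, n_a + n_b)
        self.embed = nn.Embedding(n_a + n_b + n_c, d_model)
        self.pos = nn.Parameter(torch.zeros(3, d_model))  # learned position vectors
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, dropout=0.0, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_out)

    def forward(self, a, b, c):
        # a, b, c: integer tensors of shape (batch,)
        tokens = torch.stack(
            [a + self.offsets[0], b + self.offsets[1], c + self.offsets[2]], dim=1
        )
        hidden = self.encoder(self.embed(tokens) + self.pos)
        return self.head(hidden.mean(dim=1))  # mean-pool the three positions

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# capacity sweep skeleton: vary width, keep the task fixed
for width in (16, 32, 64):
    model = TinyTransformer(d_model=width)
    print(width, count_params(model))

a = torch.randint(0, 64, (5,))
b = torch.randint(0, 64, (5,))
c = torch.randint(0, 2, (5,))
logits = TinyTransformer()(a, b, c)
print(logits.shape)
```

Unlike an MLP with separate input slots, this model has to route information between the three positions through attention, which removes one architectural hint. An internal reachability margin could then be read from the pooled hidden state, though how to define it is left open here.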

A later bridge back to LLMs should keep the same discipline: invented tokens, explicit joint holdouts, constituent-density sweeps, wrong-context controls, random-label controls, one-sided controls, and a pre-registered reachability metric. Without those controls, it is too easy to confuse memorized corpus facts with emergence.

Is this just generalization?

Perhaps yes. But then the interesting question becomes: what kind of generalization?

In traditional machine learning, generalization already means success on inputs not seen exactly during training. In that sense, the emergence described here is not a separate miracle. It is a structured form of generalization: interpolation over a learned geometry, constrained by constituent support and context.

The reason it feels like emergence is that this particular form of generalization cannot happen immediately. The interpolated region may exist only weakly at first. The constituents may not be dense enough, the relevant neighborhoods may not be separable enough, the context may not gate retrieval sharply enough, or the recursive computation may not have enough depth to land inside the coherent target region. As scale, data, capacity, or depth increase, the underlying geometric quantity may improve gradually. But the behavior becomes visible only when the retrieval error crosses the task margin.

That is why it can look like an “emergent capability.” The model may be changing smoothly underneath, while the measured behavior snaps from failure to success because the metric only notices whether the output is coherent enough.
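
A toy calculation makes the snap concrete. Assume, purely for illustration, that each retrieval step has Gaussian error with standard deviation `sigma`, that a step succeeds when the error falls inside a fixed margin, and that a task needs `n_steps` independent steps to all succeed. None of these assumptions comes from the experiments; they only show how a smooth quantity can produce a sharp metric.

```python
import math

def step_success(sigma, margin=1.0):
    """P(|e| < margin) for retrieval error e ~ N(0, sigma^2)."""
    return math.erf(margin / (sigma * math.sqrt(2)))

def task_success(sigma, n_steps=20, margin=1.0):
    """Probability that all n_steps retrievals land inside the margin (independence assumed)."""
    return step_success(sigma, margin) ** n_steps

# sigma shrinks smoothly, as it might with more capacity or denser support
for sigma in (2.0, 1.0, 0.7, 0.5, 0.4, 0.3):
    print(f"sigma={sigma:.1f}  per-step={step_success(sigma):.3f}  task={task_success(sigma):.3f}")
```

Per-step success improves gradually as `sigma` falls, while task-level success stays close to zero for the larger values and then rises steeply over a narrow range. The underlying change is continuous; only the all-or-nothing metric makes it look like a jump.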

So perhaps emergence and generalization are not two different things. Perhaps emergence is what generalization looks like when the unseen input is a meaningful composite, the parts are richly represented, and the metric only notices success after the model crosses a coherence threshold.

The open question is then not whether LLMs generalize or retrieve. They do both. The sharper question is: when does interpolation over compressed memory become a new capability?

Reproducibility

The continuous retrieval check uses numpy; the learned alien-symbol experiments use torch. No pretrained model, tokenizer, or external corpus is used. The code cells (hidden in the rendered view) run end-to-end on CPU in roughly a minute with fixed random seeds; re-running regenerates the tables above.

References

  • Balestriero, R., & Baraniuk, R. G. (2018). A Spline Theory of Deep Networks. ICML 2018, PMLR 80, 374-383. https://proceedings.mlr.press/v80/balestriero18b.html
  • Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021, 610-623. https://doi.org/10.1145/3442188.3445922
  • Chan, K. H. R., Yu, Y., You, C., Qi, H., Wright, J., & Ma, Y. (2022). ReduNet: A White-Box Deep Network from the Principle of Maximizing Rate Reduction. Journal of Machine Learning Research, 23(114), 1-103. https://jmlr.org/papers/v23/21-0631.html
  • Cheng, E., Kervadec, C., & Baroni, M. (2023). Bridging Information-Theoretic and Geometric Compression in Language Models. arXiv:2310.13620. https://arxiv.org/abs/2310.13620
  • Choraria, M., Gerogiannis, A., Jayaraman, V., Mani, A., & Varshney, L. R. (2026). Context-Gated Associative Retrieval: From Theory to Transformers. arXiv:2605.10970. https://arxiv.org/abs/2605.10970
  • Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600. https://arxiv.org/abs/2309.08600
  • Delétang, G., Ruoss, A., Duquenne, P.-A., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchison, M., Orseau, L., Hutter, M., & Veness, J. (2023). Language Modeling Is Compression. arXiv:2309.10668. https://arxiv.org/abs/2309.10668
  • Elhage, N., Hume, T., Olsson, C., Schiebinger, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread. https://transformer-circuits.pub/2022/toy_model/index.html
  • Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 28(1-2), 3-71. https://doi.org/10.1016/0010-0277(88)90031-5
  • Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021, 5484-5495. https://aclanthology.org/2021.emnlp-main.446/
  • Hopfield, J. J. (1982). Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proceedings of the National Academy of Sciences, 79(8), 2554-2558. https://doi.org/10.1073/pnas.79.8.2554
  • Kambhampati, S. (2024a). Can Large Language Models Reason and Plan? Annals of the New York Academy of Sciences, 1534(1), 15-18. https://doi.org/10.1111/nyas.15125
  • Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., & Murthy, A. (2024b). Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks. ICML 2024, PMLR 235, 22895-22907. https://proceedings.mlr.press/v235/kambhampati24a.html
  • Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., & Lewis, M. (2020). Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020. https://arxiv.org/abs/1911.00172
  • Krotov, D., & Hopfield, J. J. (2016). Dense Associative Memory for Pattern Recognition. NIPS 2016. https://arxiv.org/abs/1606.01164
  • Lake, B. M., & Baroni, M. (2018). Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. ICML 2018. https://arxiv.org/abs/1711.00350
  • Marks, S., & Tegmark, M. (2024). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. COLM 2024. https://arxiv.org/abs/2310.06824
  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022. https://arxiv.org/abs/2202.05262
  • Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability & Its Applications, 9(1), 141-142. https://doi.org/10.1137/1109020
  • Nguyen, T. M., Nguyen, T., Ho, N., Bertozzi, A. L., Baraniuk, R. G., & Osher, S. J. (2024). A Primal-Dual Framework for Transformers and Neural Networks. arXiv:2406.13781. https://arxiv.org/abs/2406.13781
  • Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., et al. (2022). In-Context Learning and Induction Heads. arXiv:2209.11895. https://arxiv.org/abs/2209.11895
  • Park, K., Choe, Y. J., & Veitch, V. (2024). The Linear Representation Hypothesis and the Geometry of Large Language Models. ICML 2024. https://arxiv.org/abs/2311.03658
  • Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., et al., & Hochreiter, S. (2021). Hopfield Networks is All You Need. ICLR 2021. https://arxiv.org/abs/2008.02217
  • Rissanen, J. (1978). Modeling by Shortest Data Description. Automatica, 14(5), 465-471.
  • Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. https://arxiv.org/abs/2304.15004
  • Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A Generalized Representer Theorem. COLT/EuroCOLT 2001, LNCS 2111, 416-426. https://doi.org/10.1007/3-540-44581-1_27
  • Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423; 27(4), 623-656. https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
  • Shrivastava, A., & Li, P. (2014). Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). NIPS 2014. https://papers.neurips.cc/paper_files/paper/2014/hash/c98e7c4b8f20d384e3ad857d0ee226cc-Abstract.html
  • Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., & Salakhutdinov, R. (2019). Transformer Dissection: An Unified Understanding for Transformer’s Attention via the Lens of Kernel. EMNLP-IJCNLP 2019, 4344-4353. https://aclanthology.org/D19-1443/
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NIPS 2017. https://arxiv.org/abs/1706.03762
  • Watson, G. S. (1964). Smooth Regression Analysis. Sankhya: The Indian Journal of Statistics, Series A, 26(4), 359-372. https://www.jstor.org/stable/25049340
  • Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., et al., & Fedus, W. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682
  • Yu, Y., Buchanan, S., Pai, D., Chu, T., Wu, Z., Tong, S., Haeffele, B., & Ma, Y. (2023). White-Box Transformers via Sparse Rate Reduction. NeurIPS 2023. https://arxiv.org/abs/2306.01129
  • Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405. https://arxiv.org/abs/2310.01405