World Models May Unlock Genuine Scientific Discovery Where Language Models Cannot
Every major scientific breakthrough shares a hidden mechanism: someone recognized that the formal structure of one field mapped precisely onto an unsolved problem in another. World models, which learn abstract representations rather than surface patterns, may be the first architecture capable of doing this at combinatorial scale — but only if we build them to verify structural truth, not just generate beautiful correspondences.
Beyond the Next Token
Ask a large language model to find connections between thermodynamics and information theory. It will produce a fluent, confident paragraph about entropy appearing in both fields, perhaps noting that disorder in physical systems resembles uncertainty in communication channels. The paragraph will sound like a genuine insight. It is not. The model has matched tokens (words that appear in similar contexts across its training corpus) without checking whether the mathematical structures underneath are actually preserved. It has no way to verify that the second law of thermodynamics maps onto a channel capacity theorem, or under what boundary conditions the analogy breaks. It has produced what Hermann Hesse, writing in 1943, would have recognized immediately: a move in the Glass Bead Game.
Hesse’s final novel describes Castalia, a future province devoted to an intellectual practice that synthesizes music, mathematics, and philosophy into a single symbolic language. The game’s players compose elaborate correspondences between Bach fugues and algebraic proofs. The audience applauds. Nothing is discovered. Castalia’s achievement is aesthetic synthesis: pattern-matching across domains that produces elegance without generating testable claims about the world. The novel’s protagonist eventually abandons the game, recognizing that beautiful correspondences are sterile when divorced from empirical reality.1
Large language models are Glass Bead Machines. They find cross-domain correspondences faster than any human researcher and across more literatures than any team could read in a lifetime, but they operate on token-level representations that encode surface features (distributional similarity, lexical co-occurrence) rather than the relational structures that would make those correspondences scientifically productive. This is an architectural limitation, not an intelligence limitation. The representation layer is wrong for the task.
Proof techniques transfer; metaphors do not
The task itself, however, is real and valuable. Look at where the biggest scientific breakthroughs actually came from. Claude Shannon built information theory by recognizing that Boltzmann’s entropy formalism, developed for statistical mechanics, mapped onto the problem of quantifying information in a communication channel. The mapping was not metaphorical. Shannon’s entropy formula is Boltzmann’s entropy formula with different variables. The proof techniques transferred: results about thermodynamic efficiency yielded results about channel capacity.2
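The identity at the heart of Shannon's transfer can be written down directly. With $k_B$ the Boltzmann constant and $p_i$ the probability of microstate (or symbol) $i$, the two formulas differ only in the constant and the base of the logarithm:

```latex
S = -k_B \sum_i p_i \ln p_i \qquad \text{(Boltzmann--Gibbs entropy)}
H = -\sum_i p_i \log_2 p_i \qquad \text{(Shannon entropy)}
```

Same functional, different units: joules per kelvin in one domain, bits in the other.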
Black and Scholes derived their options pricing formula by recognizing that the problem of valuing a financial derivative was governed by the same partial differential equation as heat diffusion — with one critical difference. In the heat equation, information propagates forward from a known initial condition. In the financial problem, information propagates backward from a known terminal payoff. Same PDE, reversed boundary conditions, transferred solution technique. The Gaussian kernel that solves heat diffusion became the cumulative normal distribution in the Black-Scholes formula. The Chicago Board Options Exchange opened the month the paper was published, and traders were using the equation within weeks.3
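The transferred solution is compact enough to state in a few lines. A minimal standard-library sketch of the resulting formula, with the cumulative normal N(·) (the integrated Gaussian kernel) written via the error function; the parameter values in the last line are arbitrary:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Cumulative standard normal, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, r, sigma, T):
    """European call price: the heat-kernel solution after the
    backward-in-time change of variables described in the text."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

price = black_scholes_call(S=100, K=100, r=0.05, sigma=0.2, T=1.0)
```

For S = K = 100, r = 5%, σ = 20%, and one year to expiry, this gives approximately 10.45, the standard textbook value.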
These were not lucky analogies. They were precise structural transfers: the relational architecture of one domain (entities, transformations, composition rules, constraints) mapped onto another domain with sufficient fidelity that proof techniques and solution methods carried across.

For every Black-Scholes, though, there are a hundred structural analogies that looked promising and went nowhere. Econophysics, the wholesale import of physics models into economics, has generated far more false correspondences than productive ones. Long-Term Capital Management, whose founders included Scholes himself, collapsed in 1998 partly because the structural mapping between physical and financial systems breaks down at the tails of the distribution, where volatility is not constant and markets are not continuous. The analogy was structural enough to work most of the time and wrong enough to be catastrophic when it mattered most. The failure was not in finding the bridge but in not checking where it breaks.

Dedre Gentner’s structure-mapping research, running since 1983, has demonstrated experimentally that productive analogies preserve systems of relations (A causes B, B inhibits C) rather than object attributes (both things are round, both things are hot). Her work shows that people can be trained to attend to relational structure over surface similarity, and that doing so measurably improves the quality of analogical transfer.4
The question for AI is whether this mechanism — structural transfer at the level of formal relations — can be systematized and scaled. Not whether AI can find analogies. It already finds them by the thousand, and most are worthless. The question is whether AI can find analogies that are structurally real: mappings where the compositional relationships are preserved, where a transformation in domain A corresponds to a transformation in domain B, and where the correspondence generates predictions that survive empirical testing.
Tokens encode co-occurrence, not composition
Current LLMs cannot do this natively because they lack the right representation layer. Token prediction learns which words follow which other words. It captures distributional patterns that are often correlated with structural relationships but are not the same thing. An LLM trained on physics and finance papers might learn that “entropy” and “diffusion” appear in both literatures. It cannot verify that the second-order partial differential equation governing heat flow is structurally isomorphic to the equation governing option value evolution, because it does not represent equations as mathematical objects with compositional properties. It represents them as sequences of tokens.
World models offer an architectural path past this limitation. The Joint-Embedding Predictive Architecture (JEPA) family, developed by Yann LeCun’s research group and now the basis for his new venture AMI Labs (launched December 2025, seeking €500M at a €3B valuation), operates on a different principle. Rather than predicting the next token in a sequence, JEPA learns to predict relationships between abstract representations. It encodes inputs into a latent space where surface-level details are discarded and higher-order structure is preserved, then predicts what should come next in that abstract space rather than in pixel or token space.5
V-JEPA, the video variant, demonstrates why this matters for scientific synthesis. Trained entirely through self-supervised learning on unlabeled video, V-JEPA developed the capacity to detect violations of intuitive physics — it showed measurably higher prediction error when presented with physically impossible events, outperforming models that predict pixels or tokens. The model learned something about the relational structure of how objects behave, not just the surface patterns of how videos look.6 If a model can learn abstract relational representations of physical domains from raw data, it could in principle learn abstract relational representations of scientific domains from domain-specific corpora — not summaries of the domain, but its formal structure.
Five stages from skeleton to prediction
The architecture this implies for scientific synthesis is a pipeline, not a monolithic model. It has five stages.
Stage one: relational skeleton extraction. A domain-specific world model learns abstract representations from the domain’s literature, experimental data, and simulation outputs. The output is a relational schema, a graph of entities, transformations, and constraints, stripped of domain-specific vocabulary. This is the automated version of what Gentner calls “structural alignment”: representing a domain in terms of its relational architecture rather than its surface features.
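What such a schema might look like as a data structure: a hypothetical sketch, with the field names and the toy heat-diffusion example invented for illustration (no published system defines this format).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    kind: str     # relation type, e.g. "drives", "inhibits", "updates"
    source: str   # abstract entity id, stripped of domain vocabulary
    target: str

@dataclass
class RelationalSchema:
    domain: str
    entities: frozenset     # abstract entity ids
    relations: frozenset    # set of Relation triples
    constraints: tuple = () # e.g. conservation laws, boundary conditions

# Toy example: heat diffusion reduced to its relational skeleton
heat = RelationalSchema(
    domain="heat_diffusion",
    entities=frozenset({"field", "gradient", "flux"}),
    relations=frozenset({
        Relation("drives", "gradient", "flux"),
        Relation("updates", "flux", "field"),
    }),
)
```

The point of the frozen, vocabulary-free representation is that two schemas from different fields become directly comparable objects.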
Stage two: combinatorial bridge detection. Relational schemas from different domains are compared for structural isomorphisms using formal tools: graph homomorphism algorithms, representation similarity analysis, or category-theoretic functors that check whether morphisms (transformations) in one domain map onto morphisms in another. A human researcher can compare perhaps ten domains pairwise. A system operating natively in representation space can compare thousands.
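For toy schemas, bridge detection can be sketched as a brute-force search for a label-preserving bijection between entity sets. Both schemas below are invented for illustration; a real system would use dedicated graph-matching libraries rather than this exponential search:

```python
from itertools import permutations

def find_isomorphism(entities_a, edges_a, entities_b, edges_b):
    """Search for a bijection between entity sets that maps every
    labeled edge (relation triple) in A exactly onto the edge set of B.
    Brute force: fine for toy schemas, exponential in general."""
    a, b = sorted(entities_a), sorted(entities_b)
    if len(a) != len(b):
        return None
    for perm in permutations(b):
        mapping = dict(zip(a, perm))
        mapped = {(kind, mapping[s], mapping[t]) for kind, s, t in edges_a}
        if mapped == set(edges_b):
            return mapping
    return None

# Heat diffusion vs. option pricing, as labeled relation triples
heat = {("drives", "gradient", "flux"), ("updates", "flux", "field")}
finance = {("drives", "delta", "hedge"), ("updates", "hedge", "value")}
bridge = find_isomorphism({"gradient", "flux", "field"}, heat,
                          {"delta", "hedge", "value"}, finance)
```

Because relation labels must match, the only admissible bijection here sends gradient to delta, flux to hedge, and field to value; mismatched schemas return `None`.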
Stage three: morphism verification. For each candidate bridge, the system checks whether compositional structure is preserved. If A→B→C in domain one maps to X→Y→Z in domain two, does the composite mapping A→C correspond to X→Z? Where does the correspondence break? Black-Scholes required noticing that the heat equation mapping held except for a time-reversal in the boundary conditions. A bridge that doesn’t check composition is an aesthetic match, not a structural one.
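The composition check can be made concrete. In this sketch (all names and the toy example illustrative), morphisms are dicts over finite carrier sets, and a bridge passes only if it also sends the source composite onto the target domain's own composite, not merely if the individual morphisms match:

```python
def compose(g, f):
    # (g ∘ f)(x) = g(f(x)); morphisms as dicts over finite sets
    return {x: g[f[x]] for x in f}

def intertwines(translate, src, tgt):
    # translate(src(x)) == tgt(translate(x)) for every x in the carrier
    return all(translate[src[x]] == tgt[translate[x]] for x in src)

def bridge_respects_composition(translate, f, g, fp, gp, hp):
    """f: A→B and g: B→C in the source map to fp and gp in the target;
    hp is the target domain's own composite X→Z. A structural bridge
    must send g∘f onto hp as well."""
    return (intertwines(translate, f, fp)
            and intertwines(translate, g, gp)
            and intertwines(translate, compose(g, f), hp))

translate = {0: "x", 1: "y"}
swap_src, swap_tgt = {0: 1, 1: 0}, {"x": "y", "y": "x"}
ident_tgt = {"x": "x", "y": "y"}

# Two swaps compose to the identity: the bridge holds only if the
# claimed target composite is the identity, not another swap.
ok = bridge_respects_composition(translate, swap_src, swap_src,
                                 swap_tgt, swap_tgt, ident_tgt)
bad = bridge_respects_composition(translate, swap_src, swap_src,
                                  swap_tgt, swap_tgt, swap_tgt)
```

The `bad` case is exactly the aesthetic match the text warns about: each individual correspondence looks right, but the composite lands on the wrong target morphism.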
Stage four: directional prediction. For each verified bridge, the system identifies results, theorems, or techniques that exist in the source domain but have not been applied in the target domain. This is where the synthesis becomes generative rather than taxonomic. The prediction is asymmetric: one domain has a tool the other domain needs.
Stage five: empirical routing. Predictions are sent to domain-specific simulators, automated laboratories, or human experts for evaluation. A prediction scored as “novel and testable” by a target-domain expert is the system’s output: not a correspondence, not an analogy, but a falsifiable scientific claim generated through structural transfer.
Each validated cycle feeds back: successful bridges update the relational schemas, and new predictions open bridging opportunities that were not previously visible. This is where compounding occurs. The aesthetic version of synthesis does not compound; it produces more analogies at the same depth. Structural synthesis, because it builds verified bridges that constrain and extend the representation space, can accelerate.
Start with LLMs; scale with world models
Here is the honest assessment of what exists today. No one has built this full pipeline. The components are scattered across research communities that do not typically interact. Computational structure-mapping engines have existed since the 1980s.7 Causal world models that build and revise internal representations through active experimentation are an active research frontier. A recent paper proposes “Scientific AI” agents that construct transferable internal models through recursive epistemic loops of hypothesis generation, causal inference, and calibration.8 JEPA architectures are producing abstract representations that capture relational structure from raw data.5 Category-theoretic approaches to knowledge representation provide formal languages for cross-domain mapping.9 But the integrated system (world-model-based skeleton extraction, structural bridge detection, morphism verification, prediction generation, empirical test) does not exist.
What does exist, and what practitioners can build today, is a meaningful approximation using current LLMs as orchestrated components within a pipeline that compensates for their structural blindness. The LLM handles the parts it does well: generating candidate relational schemas from domain literature (a structured prompting task), and articulating known results in source domains that might transfer (a retrieval and articulation task). Formal tools handle the parts that require structural rigor: graph comparison libraries for bridge detection, symbolic algebra systems (SymPy, SageMath) for morphism verification, constraint solvers for checking compositional preservation. An agent framework like LangGraph can orchestrate this as a multi-step pipeline where the LLM generates candidates and formal tools verify them.
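One concrete example of what the formal verification layer does: check that the closed-form Black-Scholes price actually satisfies its PDE, i.e. that the transferred heat-equation solution survives the boundary-condition reversal. SymPy would do this symbolically; the standard-library sketch below approximates the derivatives by finite differences instead. The formula is standard; the harness itself is illustrative:

```python
from math import log, sqrt, exp, erf

def N(x):
    # Cumulative standard normal
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def call(S, tau, K=100.0, r=0.05, sigma=0.2):
    # Closed-form European call; tau = time to expiry
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * N(d1) - K * exp(-r * tau) * N(d2)

def pde_residual(S, tau, r=0.05, sigma=0.2, h=1e-2, k=1e-5):
    """Black-Scholes PDE in time-to-expiry form:
    -dC/dtau + r*S*dC/dS + 0.5*sigma^2*S^2*d2C/dS2 - r*C = 0
    for the true solution; central finite differences throughout."""
    dC_dtau = (call(S, tau + k) - call(S, tau - k)) / (2 * k)
    dC_dS = (call(S + h, tau) - call(S - h, tau)) / (2 * h)
    d2C_dS2 = (call(S + h, tau) - 2 * call(S, tau) + call(S - h, tau)) / h**2
    return (-dC_dtau + r * S * dC_dS
            + 0.5 * sigma**2 * S**2 * d2C_dS2 - r * call(S, tau))

residual = pde_residual(S=105.0, tau=0.5)
```

The residual comes out near zero, limited only by finite-difference error: the closed-form price really is a solution of the transferred equation, which is precisely the kind of claim an LLM alone cannot certify.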
This LLM-based version has two limitations that world models would resolve. First, the relational schemas it produces are inferred from text, surface representations of structure rather than structure learned directly from phenomena. A world model trained on domain data would learn relational structure from the phenomena themselves, with higher fidelity. Second, the LLM pipeline requires human-guided iteration and does not scale combinatorially. It is a research tool, not a discovery engine. World models operating natively in representation space are what take this from “useful prototype” to “systematic synthesis at the scale where breakthrough becomes expected rather than serendipitous.”
What the sceptics get right
Three serious counterarguments deserve direct engagement.
First: emergence might handle this. Sufficiently large models trained on sufficiently diverse data might develop internal representations that capture relational structure, making special-purpose architecture unnecessary. This has some empirical support; large models do appear to develop something like internal world models. But even if internal representations become relational, the evaluation layer in current systems does not check for structural preservation. A model might internally represent domain structure and still produce outputs that are aesthetic rather than structural, because nothing in the training objective or inference pipeline rewards morphism-checking over surface coherence. The architecture proposed here addresses the evaluation loop, not just the representation.
Second: formal representation does not scale. Structure-mapping engines require hand-coded relational schemas, reintroducing a knowledge-engineering bottleneck. The hybrid approach, using LLMs or world models to generate candidate schemas and then applying structural checking as a verification layer, avoids this bottleneck while preserving rigor. The neural component does the breadth. The formal component does the depth.
Third: no benchmark exists for this. How do you evaluate a system that generates novel cross-domain predictions? Existing benchmarks test within-domain performance. A retrospective benchmark is buildable: encode historical cases of successful structural transfer (Shannon, Black-Scholes, Hodgkin-Huxley) as held-out test cases, provide the system with source and target domain formalizations, and measure whether it generates the known bridging prediction. This is imperfect and backward-looking, but it is a start, and the absence of a benchmark is a research opportunity, not a refutation.
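Such a retrospective benchmark could be sketched in a few lines. The case encodings and scoring scheme below are invented for illustration, not an existing dataset:

```python
# Hypothetical held-out cases: historical structural transfers,
# encoded as (source domain, target domain, known bridge). The
# identifiers are illustrative placeholders.
HELD_OUT_CASES = [
    {"source": "statistical_mechanics", "target": "communication_theory",
     "known_bridge": "entropy_functional_transfer"},       # Shannon 1948
    {"source": "heat_diffusion", "target": "option_pricing",
     "known_bridge": "backward_parabolic_pde_solution"},   # Black-Scholes 1973
    {"source": "circuit_theory", "target": "neurophysiology",
     "known_bridge": "conductance_based_membrane_model"},  # Hodgkin-Huxley 1952
]

def score(system, cases):
    """Fraction of historical bridges rediscovered. `system` is any
    callable (source, target) -> set of proposed bridge identifiers."""
    hits = sum(case["known_bridge"] in system(case["source"], case["target"])
               for case in cases)
    return hits / len(cases)

def baseline(src, tgt):
    # Null system: proposes no bridges at all
    return set()
```

The obvious weakness (systems trained on post-1973 literature have seen the answers) would need held-out formalizations or corpus cutoffs, but the harness shape is this simple.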
Solutions living in the wrong department
The unsolved problems waiting for this approach are not abstract. Antibiotic resistance in bacterial populations exhibits threshold dynamics that parallel percolation phase transitions in materials science, a field with mature predictive tools for locating critical thresholds in network systems. Protein design (the inverse folding problem) shares structural architecture with adjoint optimization in aerospace engineering, where engineers routinely solve the problem of finding an input geometry that produces a desired output behavior. Earthquake dynamics and neural avalanches share the skeleton of self-organized criticality, and perturbation-response techniques used to detect when neural systems drift from criticality might transfer to seismic early warning. In each case, the formal tools exist in one department and the unsolved problem sits in another, three buildings away on the same university campus.
The default trajectory (scaling language models on ever more scientific papers, adding retrieval augmentation, and hoping structural understanding emerges) will produce Castalia at industrial scale. More correspondences. More analogies. More elegant recombinations. And very little genuine discovery, because the system never checks whether the beautiful pattern it found is a functor or a coincidence.
World models, by operating in abstract representation space, offer the first plausible architecture for playing the game at the right level of abstraction. But there is a subtler point that the Hesse analogy illuminates. Joseph Knecht did not leave Castalia because the Glass Bead Game was worthless. He left because it was almost valuable — close enough to real synthesis that its practitioners could not see the gap. The most dangerous outcome for AI-driven science is not failure. It is a system that produces a thousand beautiful, structurally unverified correspondences per hour, each one plausible enough that researchers spend years chasing false bridges instead of building the verification pipeline that would tell them which ones are real.

You could start building that pipeline tomorrow. The LLM prototype is a weekend project for anyone with a LangGraph setup and access to a symbolic algebra library. The world model version is a research program. The choice between them is being made right now, by default, every time someone ships a science copilot that matches tokens instead of structures.
Notes

1. Hesse, H. (1943). Das Glasperlenspiel. Fretz & Wasmuth. Published in English as The Glass Bead Game (trans. R. Winston & C. Winston, 1969). Holt, Rinehart and Winston.
2. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. Shannon’s use of the term “entropy” and its relationship to Boltzmann’s thermodynamic entropy is discussed in the paper’s introduction and in subsequent correspondence with John von Neumann.
3. Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654. The relationship between the Black-Scholes equation and the heat diffusion equation is treated in detail in Wilmott, P. (2006). Paul Wilmott on Quantitative Finance (2nd ed.). John Wiley & Sons.
4. Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2), 155–170. For experimental evidence on training structural alignment, see Gentner, D., & Markman, A. B. (1997). Structure mapping in analogy and similarity. American Psychologist, 52(1), 45–56.
5. LeCun, Y. (2022). A path towards autonomous machine intelligence (Version 0.9.2). OpenReview. https://openreview.net/pdf?id=BZ5a1r-kVsf. On AMI Labs: LeCun departed Meta in December 2025 to found AMI Labs in Paris, seeking €500M at a €3B pre-launch valuation, with the explicit goal of building AI systems that understand physics and maintain persistent memory rather than predicting token sequences.
6. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Balestriero, R. (2024). V-JEPA: Latent video prediction for visual representation learning. Meta AI Research. The intuitive physics evaluation demonstrated higher prediction error for physically impossible events compared to pixel-prediction and text-based models.
7. Falkenhainer, B., Forbus, K. D., & Gentner, D. (1989). The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41(1), 1–63.
8. Scientific AI: Toward recursive epistemic agents for causal discovery and general intelligence. (2025). Preprints.org. https://www.preprints.org/manuscript/202507.1154. The paper defines intelligence as causal discovery and proposes agents that construct transferable internal models through recursive hypothesis generation and calibration loops.
9. Spivak, D. I. (2014). Category Theory for the Sciences. MIT Press. For applied category theory in knowledge representation, see also Fong, B., & Spivak, D. I. (2019). An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press.