Seminar | Mathematics and Computer Science

World Models, User Models and Self Models in Artificial Intelligence Systems

CS Seminar

Abstract: Modern language models (LMs) are increasingly capable, yet they still suffer from persistent failures: they hallucinate facts, adapt poorly to users and produce unfaithful explanations. Rather than viewing these failures as inevitable outcomes of neural networks, we present evidence that LMs learn to build structured internal models of the world, the user and themselves, and that these can be leveraged to build more reliable agents.

First, we use interpretability techniques to show that LMs indeed build latent representations of world state, and we characterize the algorithms they use to track state changes. Next, we augment LMs with external Bayesian frameworks for interactive user modeling, enabling them to proactively elicit and track user preferences. Finally, we develop training methods to equip LMs with self-models, enabling them to produce faithful explanations of their own computations. Together, these lines of work allow future artificial intelligence (AI) systems to maintain coherent and updatable beliefs, to adapt to individual users and to communicate their reasoning transparently to humans, pointing towards a future of collaborative AI systems that augment rather than replace human capabilities.

Bio: Belinda Z. Li recently completed her Ph.D. at the Massachusetts Institute of Technology. Belinda is a recipient of a Rising Stars in Electrical Engineering and Computer Sciences Award, a Clare Boothe Luce Fellowship and a National Defense Science and Engineering Graduate Fellowship.

See upcoming and previous presentations at CS Seminar Series.