Join us for the reading group on ML and cryptography organized by Shafi Goldwasser, Yael Kalai, Jonathan Shafer and Vinod Vaikuntanathan. For questions, feel free to contact Jonathan.
Large language models (LLMs) sometimes generate statements that are plausible but factually incorrect—a phenomenon commonly called “hallucination.” We argue that these errors are not mysterious failures of architecture or reasoning, but rather predictable consequences of standard training and evaluation incentives.
We show (i) that hallucinations can be viewed as classification errors: when pretrained models cannot reliably distinguish a false statement from a true one, they may produce the false option rather than saying “I don’t know”; (ii) that optimizing for benchmark performance encourages guessing rather than abstaining, since most evaluation metrics penalize expressions of uncertainty; and (iii) that a possible mitigation path lies in revising existing benchmarks to reward calibrated abstention, thus realigning the incentives that shape model development.
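To make the incentive argument in (ii) concrete, here is a toy scoring sketch (an illustration with hypothetical numbers, not an artifact of the talk): under standard 0/1 grading, a guess that is correct with any probability p > 0 has higher expected score than abstaining, whereas a grading scheme that penalizes wrong answers only rewards guessing when p exceeds a penalty-dependent threshold c/(1+c).

```python
# Toy illustration of benchmark incentives (hypothetical numbers, not from the talk):
# expected score of guessing vs. abstaining under two grading schemes.

def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering: +1 if correct, -wrong_penalty if wrong.
    Abstaining ("I don't know") always scores 0."""
    return p_correct - (1.0 - p_correct) * wrong_penalty

for p in (0.1, 0.3, 0.5):
    print(f"p={p:.1f}: "
          f"0/1 grading -> guess {expected_score(p):+.2f} vs abstain +0.00; "
          f"penalty 0.5 -> guess {expected_score(p, 0.5):+.2f} vs abstain +0.00")

# Under 0/1 grading the guess always wins (expected score p > 0), so hallucinated
# guesses are rewarded; with a wrong-answer penalty c, guessing pays only when
# p > c / (1 + c).
```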
Joint work with Santosh Vempala (Georgia Tech) and Ofir Nachum & Edwin Zhang (OpenAI).
Previous Talks
Statistically Undetectable Backdoors in Deep Neural Networks
Neekon Vafa (MIT)
October 21, 2025
In this talk, I will show how an adversarial model trainer can plant backdoors in a large class of deep, feedforward neural networks. These backdoors are statistically undetectable in the white-box setting, meaning that the backdoored and honestly trained models are close in total variation distance, even given the full descriptions of the models (e.g., all of the weights). The backdoor provides access to (invariance-based) adversarial examples for every input. However, without the backdoor, no one can generate any such adversarial examples, assuming the worst-case hardness of shortest vector problems on lattices. Our main technical tool is a cryptographic perspective on the ubiquitous Johnson-Lindenstrauss lemma.
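For readers who want a concrete handle on the lemma mentioned at the end, here is a minimal NumPy sketch of the classical Johnson-Lindenstrauss random projection itself; the cryptographic perspective and the backdoor construction are the subject of the talk and are not reproduced here.

```python
# Minimal sketch of the classical Johnson-Lindenstrauss random projection
# (the lemma only; the cryptographic backdoor construction is not shown here).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 2_000, 400                    # n points in R^d, projected down to R^k

X = rng.normal(size=(n, d))                 # arbitrary high-dimensional points
A = rng.normal(size=(k, d)) / np.sqrt(k)    # JL map: scaled i.i.d. Gaussian matrix
Y = X @ A.T                                 # projected points in R^k

def pairwise_distances(Z: np.ndarray) -> np.ndarray:
    diffs = Z[:, None, :] - Z[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

mask = ~np.eye(n, dtype=bool)
ratios = pairwise_distances(Y)[mask] / pairwise_distances(X)[mask]
print(f"pairwise-distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
# With high probability all ratios are close to 1: pairwise distances are
# nearly preserved even though the dimension dropped from d to k.
```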
This talk is based on upcoming work with Andrej Bogdanov and Alon Rosen.
What Can Cryptography Tell Us About AI?
Greg Gluch (Simons Institute, UC Berkeley)
October 14, 2025
I will present three results that use cryptographic assumptions to characterize both the limits and the possibilities of AI safety. First, we show that AI alignment cannot be achieved using only black-box filters of harmful content. Second, we prove a separation between mitigation and detection at inference time, where mitigation uses additional inference-time computation to refine an LLM’s output into a safer or more accurate result. Third, we conduct a meta-analysis of watermarks, adversarial defenses, and transferable attacks, showing that for every learning task, at least one of these three schemes must exist.
Each result carries a broader message: the first argues for the necessity of weight access in AI auditing; the second provides a rule of thumb for allocating inference-time resources when safety is the goal; and the third offers an explanation for why adversarial examples often transfer across different LLMs.
Bio. Greg Gluch is a postdoctoral researcher at the Simons Institute, UC Berkeley, working with Shafi Goldwasser, and currently a long-term visitor at MIT. He received his PhD from EPFL, advised by Rüdiger Urbanke and Michael Kapralov. His research spans AI safety—covering topics such as adversarial examples, alignment, and verifiability—and quantum interactive proofs and their links to physics.
Suppose an untrusted but powerful data analyst claims to have drawn many samples from an unknown distribution, run some complicated analysis over those samples, and now reveals their conclusion to us. Suppose further that we are granted only a few samples from the same distribution. Can we verify that the results of the analysis are approximately correct?
In this talk we review a recent line of work showing that the answer is positive, and present constructions of proof systems that allow a probabilistic verifier to ascertain that the results of an analysis are approximately correct, while drawing fewer samples and using fewer computational resources than would be needed to replicate the analysis. We focus on distribution testing problems: verifying that an unknown distribution is close to having a claimed property.
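As a toy baseline for intuition (emphatically not one of the proof systems from this line of work, which handle much richer distribution properties with interaction and sublinear resources), one can already spot-check a single claimed statistic using only a few fresh samples:

```python
# Toy baseline (not the proof systems from the talk): spot-checking one claimed
# statistic with a few fresh samples, via a Hoeffding confidence interval.
import math
import random

def verify_claimed_probability(claimed_p, sample, tolerance, confidence=0.99):
    """Accept iff the empirical frequency of the event in `sample` (a list of
    0/1 indicators over fresh i.i.d. draws) is within `tolerance` plus a
    Hoeffding slack of the analyst's claimed probability `claimed_p`."""
    m = len(sample)
    slack = math.sqrt(math.log(2 / (1 - confidence)) / (2 * m))  # Hoeffding bound
    return abs(sum(sample) / m - claimed_p) <= tolerance + slack

# Hypothetical usage: the analyst claims Pr[X > 0.9] = 0.1 for an unknown
# distribution; we check the claim with 200 fresh samples.
rng = random.Random(0)
fresh = [1 if rng.random() > 0.9 else 0 for _ in range(200)]
print(verify_claimed_probability(0.1, fresh, tolerance=0.05))  # True w.h.p.
```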
Moving from data science to AI: Can similar tools and interactive proofs provide a theoretical framework for addressing questions of verification in the context of AI systems? We will open the floor for discussion and ideas.
A Survey of Cryptographic Watermarks for AI-Generated Content
Generative AI watermarks are hidden patterns embedded in AI-generated content to facilitate its detection. A recent line of work draws on cryptography to define desired properties of watermarks and realize these properties with surprisingly simple constructions. In this talk, I will survey common definitions and approaches, focusing on those from the cryptography community. I'll end with some open questions that I find interesting.
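To make the “hidden pattern” idea concrete, here is a minimal sketch of a keyed green-list detector in the spirit of red/green-list schemes from the watermarking literature (an illustration only, not necessarily one of the constructions surveyed in the talk): a secret key pseudorandomly marks roughly half of the vocabulary as “green” for each context, generation is biased toward green tokens, and detection counts how many observed tokens are green.

```python
# Minimal sketch of a keyed "green-list" watermark detector, in the spirit of
# red/green-list schemes from the watermarking literature. This illustrates the
# general idea only; it is not any specific construction surveyed in the talk.
import hashlib
import hmac
import math

KEY = b"hypothetical-secret-watermark-key"   # shared by the embedder and the detector

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly mark about half of all (context, token) pairs as 'green',
    keyed by the secret. A watermarking sampler would bias generation toward green."""
    digest = hmac.new(KEY, f"{prev_token}|{token}".encode(), hashlib.sha256).digest()
    return digest[0] % 2 == 0

def detection_z_score(tokens: list[str]) -> float:
    """In unwatermarked text each token is green with probability ~1/2, so the
    green count is ~Binomial(n, 1/2); a large z-score is evidence of the watermark."""
    n = len(tokens) - 1
    greens = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return (greens - n / 2) / math.sqrt(n / 4)

# Usage (hypothetical): text from a green-biased sampler yields a z-score well
# above the ~3 one would use as a detection threshold; ordinary text stays near 0.
```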