Stuart M. Shieber

Area Chair, Computer Science · Verified

Harvard University · Computer Science

Active 1982–2025

h-index: 54

Citations: 12.0k

Papers: 237 (18 in the last 5 years)

Funding: $1.1M



About

Stuart M. Shieber is the James O. Welch, Jr. and Virginia B. Welch Professor of Computer Science at Harvard University. He serves as the Area Chair for Computer Science and is an affiliate of the Department of Linguistics and the Department of Philosophy. His primary teaching area is Computer Science. Shieber's research areas include applied mathematics, artificial intelligence, machine learning, computational and data science, computational linguistics, and natural-language processing. He was named an ACL Fellow in recognition of his contributions to computational linguistics. He is based at Harvard's School of Engineering and Applied Sciences, located at 150 Western Ave, Allston, MA.

Research topics

Computer Science
Artificial Intelligence
Natural Language Processing
Machine Learning
Mathematics
Psychology
Political Science
Sociology
Social psychology
Data science
Statistics
Econometrics
Cognitive psychology

Selected publications

Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
ArXiv.org · 2025-09-30
preprint · Open access
Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to "cache" and "retrieve" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
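As a rough illustration of the long-range dependency this abstract describes, the sketch below multiplies two integers digit by digit in plain Python. Each output digit needs every pairwise partial product a_i * b_j with i + j = k, plus a carry that depends on all lower positions. The per-position "running sum" here is a simplified reading of the term, and the function name is hypothetical; none of this is taken from the paper's code.

```python
def multiply_by_running_sum(a: int, b: int) -> tuple[int, list[int]]:
    """Schoolbook multiplication that exposes the per-position running sums."""
    a_digits = [int(d) for d in reversed(str(a))]  # least-significant digit first
    b_digits = [int(d) for d in reversed(str(b))]
    n_positions = len(a_digits) + len(b_digits)

    running_sums = []
    out_digits = []
    carry = 0
    for k in range(n_positions):
        # All pairwise partial products that land on output position k.
        partial = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        running = partial + carry  # depends on every lower position via the carry
        running_sums.append(running)
        out_digits.append(running % 10)
        carry = running // 10

    product = sum(d * 10**k for k, d in enumerate(out_digits))
    return product, running_sums
```

For example, `multiply_by_running_sum(12, 34)` yields the product 408 together with the running sums `[8, 10, 4, 0]`, one per output position.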
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
arXiv (Cornell University) · 2024-05-23 · 3 citations
preprint · Open access · Senior author
When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate if models can be taught to internalize these CoT steps. To this end, we propose a simple yet effective method for internalizing CoT steps: starting with a model trained for explicit CoT reasoning, we gradually remove the intermediate steps and finetune the model. This process allows the model to internalize the intermediate reasoning steps, thus simplifying the reasoning process while maintaining high performance. Our approach enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training cannot solve beyond 4-by-4 multiplication. Furthermore, our method proves effective on larger language models, such as Mistral 7B, achieving over 50% accuracy on GSM8K without producing any intermediate steps.
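The gradual removal of intermediate steps can be pictured as a curriculum over training targets. The sketch below is only an assumption-laden illustration: the abstract says steps are removed gradually, but the choice to drop them from the front, one step per stage, and the function names `build_target` and `curriculum` are hypothetical.

```python
def build_target(cot_steps: list[str], answer: str, n_removed: int) -> list[str]:
    """Training target after dropping the first n_removed CoT steps (assumed order)."""
    n_removed = max(0, min(n_removed, len(cot_steps)))
    return cot_steps[n_removed:] + [answer]


def curriculum(cot_steps: list[str], answer: str) -> list[list[str]]:
    """One target per stage, from the full explicit CoT down to the answer alone."""
    return [build_target(cot_steps, answer, k) for k in range(len(cot_steps) + 1)]
```

In a real training loop the model would be finetuned on each stage's targets before moving to the next, so that the final stage produces the answer with no visible intermediate steps.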
string2string: A Modern Python Library for String-to-String Algorithms
2024-01-01 · 3 citations
article · Open access
string2string: A Modern Python Library for String-to-String Algorithms
arXiv (Cornell University) · 2023-04-27
preprint · Open access
We introduce string2string, an open-source library that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. It includes traditional algorithmic solutions as well as recent advanced neural approaches to tackle various problems in string alignment, distance measurement, lexical and semantic search, and similarity analysis -- along with several helpful visualization tools and metrics to facilitate the interpretation and analysis of these methods. Notable algorithms featured in the library include the Smith-Waterman algorithm for pairwise local alignment, the Hirschberg algorithm for global alignment, the Wagner-Fischer algorithm for edit distance, BARTScore and BERTScore for similarity analysis, the Knuth-Morris-Pratt algorithm for lexical search, and Faiss for semantic search. Besides, it wraps existing efficient and widely-used implementations of certain frameworks and metrics, such as sacreBLEU and ROUGE, whenever it is appropriate and suitable. Overall, the library aims to provide extensive coverage and increased flexibility in comparison to existing libraries for strings. It can be used for many downstream applications, tasks, and problems in natural-language processing, bioinformatics, and computational social sciences. It is implemented in Python, easily installable via pip, and accessible through a simple API. Source code, documentation, and tutorials are all available on our GitHub page: https://github.com/stanfordnlp/string2string.
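To give a sense of one algorithm named in this abstract, here is a minimal standalone sketch of the Wagner-Fischer dynamic program for edit distance. It is written in plain Python and deliberately does not use the string2string API; unit costs for insertion, deletion, and substitution are an assumption, and the function name is hypothetical.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via the Wagner-Fischer dynamic program (unit costs)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into an empty target costs i
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

Keeping only two rows of the table reduces memory to O(len(target)) while the running time stays proportional to the product of the two string lengths.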
Implicit Chain of Thought Reasoning via Knowledge Distillation
arXiv (Cornell University) · 2023-11-02 · 1 citation
preprint · Open access · Senior author
To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.
Design Galleries: A General Approach to Setting Parameters for Computer Graphics and Animation
ACM eBooks · 2023-08-01 · 57 citations
book-chapter · Open access · Senior author
Image rendering maps scene parameters to output pixel values; animation maps motion-control parameters to trajectory values. Because these mapping functions are usually multidimensional, nonlinear, and discontinuous, finding input parameters that yield desirable output values is often a painful process of manual tweaking. Interactive evolution and inverse design are two general methodologies for computer-assisted parameter setting in which the computer plays a prominent role. In this paper we present another such methodology. Design Gallery™ (DG) interfaces present the user with the broadest selection, automatically generated and organized, of perceptually different graphics or animations that can be produced by varying a given input-parameter vector. The principal technical challenges posed by the DG approach are dispersion, finding a set of input-parameter vectors that optimally disperses the resulting output-value vectors, and arrangement, organizing the resulting graphics for easy and intuitive browsing by the user. We describe the use of DG interfaces for several parameter-setting problems: light selection and placement for image rendering, both standard and image-based; opacity and color transfer-function specification for volume rendering; and motion control for particle-system and articulated-figure animation.
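The dispersion problem described in this abstract can be approximated in many ways. The sketch below uses a greedy farthest-point (max-min) heuristic, which is a swapped-in technique chosen for brevity and not necessarily the method used in the paper. The function name `disperse` and the use of Euclidean distance between output vectors are assumptions.

```python
import math


def disperse(outputs: list[tuple[float, ...]], k: int) -> list[int]:
    """Greedily pick indices of up to k mutually distant output vectors."""
    if not outputs or k <= 0:
        return []
    chosen = [0]  # arbitrary seed: the first candidate
    # Distance from each candidate to its nearest chosen vector so far.
    nearest = [math.dist(v, outputs[0]) for v in outputs]
    while len(chosen) < min(k, len(outputs)):
        best = max(range(len(outputs)), key=nearest.__getitem__)
        if nearest[best] == 0.0:
            break  # every remaining candidate duplicates a chosen one
        chosen.append(best)
        nearest = [
            min(d, math.dist(v, outputs[best])) for d, v in zip(nearest, outputs)
        ]
    return chosen
```

Each returned index identifies an input-parameter vector whose output is far from the outputs already selected, which is the property a gallery of perceptually different results needs.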
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
arXiv (Cornell University) · 2022-07-08 · 14 citations
preprint · Open access · Senior author
Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
arXiv (Cornell University) · 2022 · 548 citations
Computer Science · Artificial Intelligence
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
2021-01-01
preprint · Open access
Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, Yonatan Belinkov. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
arXiv (Cornell University) · 2021-06-10 · 1 citation
preprint · Open access
Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models' preferences for grammatical inflections, as well as whether neurons process subject-verb agreement similarly across sentences with different syntactic structures. We uncover similarities and differences across architectures and model sizes -- notably, that larger models do not necessarily learn stronger preferences. We also observe two distinct mechanisms for producing subject-verb agreement depending on the syntactic structure of the input sentence. Finally, we find that language models rely on similar sets of neurons when given sentences with similar syntactic structure.

Recent grants

Synchronous Grammars and the Syntax-Semantics Interface
NSF · $79k · 2008–2010
CRI: Infrastructure for Multi-Agent Decision-Making Research
NSF · $703k · 2005–2010
Human-Centered Compression for Collaborative Text Input
NSF · $322k · 2003–2007

Frequent coauthors

Yonatan Belinkov
46 shared
Sebastian Gehrmann
35 shared
Tal Linzen
28 shared
Aaron Mueller
27 shared
Matthew Finlayson
27 shared
Alexander M. Rush
24 shared
Joe Marks
Harvard University Press
21 shared
Fernando C. N. Pereira
17 shared

Labs

Stuart M. Shieber Lab · PI

Awards & honors

ACL Fellow (2017)
Siebel Scholars Program (2017)
