Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Stuart M. Shieber

Stuart M. Shieber

· Area Chair, Computer ScienceVerified

Harvard University · Computer Science

Active 1982–2025

h-index54
Citations12.0k
Papers23718 last 5y
Funding$1.1M
See your match with Stuart M. Shieber — sign in to PhdFit.Sign in

About

Stuart M. Shieber is the James O. Welch, Jr. and Virginia B. Welch Professor of Computer Science at Harvard University. He serves as the Area Chair for Computer Science and is an affiliate of the Department of Linguistics and the Department of Philosophy. His primary teaching area is Computer Science. Shieber's research areas include applied mathematics, artificial intelligence, machine learning, computational and data science, computational linguistics, and natural-language processing. He has been recognized for his contributions to the field of computational linguistics, notably being named an ACL Fellow for his work. His academic and research activities are based at Harvard's School of Engineering and Applied Sciences, located at 150 Western Ave, Allston, MA.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Natural Language Processing
  • Machine Learning
  • Mathematics
  • Psychology
  • Political Science
  • Sociology
  • Social psychology
  • Data science
  • Statistics
  • Econometrics
  • Cognitive psychology

Selected publications

  • Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

    ArXiv.org · 2025-09-30

    preprintOpen access

    Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via \emph{implicit chain-of-thought}, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to ``cache'' and ``retrieve'' pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the ``running sum'' via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.

  • From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

    arXiv (Cornell University) · 2024-05-23 · 3 citations

    preprintOpen accessSenior author

    When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate if models can be taught to internalize these CoT steps. To this end, we propose a simple yet effective method for internalizing CoT steps: starting with a model trained for explicit CoT reasoning, we gradually remove the intermediate steps and finetune the model. This process allows the model to internalize the intermediate reasoning steps, thus simplifying the reasoning process while maintaining high performance. Our approach enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training cannot solve beyond 4-by-4 multiplication. Furthermore, our method proves effective on larger language models, such as Mistral 7B, achieving over 50% accuracy on GSM8K without producing any intermediate steps.

  • string2string: A Modern Python Library for String-to-String Algorithms

    2024-01-01 · 3 citations

    articleOpen access
  • string2string: A Modern Python Library for String-to-String Algorithms

    arXiv (Cornell University) · 2023-04-27

    preprintOpen access

    We introduce string2string, an open-source library that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. It includes traditional algorithmic solutions as well as recent advanced neural approaches to tackle various problems in string alignment, distance measurement, lexical and semantic search, and similarity analysis -- along with several helpful visualization tools and metrics to facilitate the interpretation and analysis of these methods. Notable algorithms featured in the library include the Smith-Waterman algorithm for pairwise local alignment, the Hirschberg algorithm for global alignment, the Wagner-Fisher algorithm for edit distance, BARTScore and BERTScore for similarity analysis, the Knuth-Morris-Pratt algorithm for lexical search, and Faiss for semantic search. Besides, it wraps existing efficient and widely-used implementations of certain frameworks and metrics, such as sacreBLEU and ROUGE, whenever it is appropriate and suitable. Overall, the library aims to provide extensive coverage and increased flexibility in comparison to existing libraries for strings. It can be used for many downstream applications, tasks, and problems in natural-language processing, bioinformatics, and computational social sciences. It is implemented in Python, easily installable via pip, and accessible through a simple API. Source code, documentation, and tutorials are all available on our GitHub page: https://github.com/stanfordnlp/string2string.

  • Implicit Chain of Thought Reasoning via Knowledge Distillation

    arXiv (Cornell University) · 2023-11-02 · 1 citations

    preprintOpen accessSenior author

    To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.

  • Design Galleries: A General Approach to Setting Parameters for Computer Graphics and Animation

    ACM eBooks · 2023-08-01 · 57 citations

    book-chapterOpen accessSenior author

    Image rendering maps scene parameters to output pixel values; animation maps motion-control parameters to trajectory values. Because these mapping functions are usually multidimensional, nonlinear, and discontinuous, finding input parameters that yield desirable output values is often a painful process of manual tweaking. Interactive evolution and inverse design are two general methodologies for computer-assisted parameter setting in which the computer plays a prominent role. In this paper we present another such methodology. Design Gallery<sup>TM</sup> (DG) interfaces present the user with the broadest selection, automatically generated and organized, of perceptually different graphics or animations that can be produced by varying a given input-parameter vector. The principal technical challenges posed by the DG approach are dispersion, finding a set of input-parameter vectors that optimally disperses the resulting output-value vectors, and arrangement, organizing the resulting graphics for easy and intuitive browsing by the user. We describe the use of DG interfaces for several parameter-setting problems: light selection and placement for image rendering, both standard and image-based; opacity and color transfer-function specification for volume rendering; and motion control for particle-system and articulated-figure animation

  • The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

    arXiv (Cornell University) · 2022-07-08 · 14 citations

    preprintOpen accessSenior author

    Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.

  • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    arXiv (Cornell University) · 2022 · 548 citations

    • Computer Science
    • Artificial Intelligence
    • Computer Science

    Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

  • Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

    2021-01-01

    preprintOpen access

    Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, Yonatan Belinkov. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

  • Causal Analysis of Syntactic Agreement Mechanisms in Neural Language\n Models

    arXiv (Cornell University) · 2021-06-10 · 1 citations

    preprintOpen access

    Targeted syntactic evaluations have demonstrated the ability of language\nmodels to perform subject-verb agreement given difficult contexts. To elucidate\nthe mechanisms by which the models accomplish this behavior, this study applies\ncausal mediation analysis to pre-trained neural language models. We investigate\nthe magnitude of models' preferences for grammatical inflections, as well as\nwhether neurons process subject-verb agreement similarly across sentences with\ndifferent syntactic structures. We uncover similarities and differences across\narchitectures and model sizes -- notably, that larger models do not necessarily\nlearn stronger preferences. We also observe two distinct mechanisms for\nproducing subject-verb agreement depending on the syntactic structure of the\ninput sentence. Finally, we find that language models rely on similar sets of\nneurons when given sentences with similar syntactic structure.\n

Recent grants

Frequent coauthors

  • Yonatan Belinkov

    46 shared
  • Sebastian Gehrmann

    35 shared
  • Tal Linzen

    28 shared
  • Aaron Mueller

    27 shared
  • Matthew Finlayson

    27 shared
  • Alexander M. Rush

    24 shared
  • Joe Marks

    Harvard University Press

    21 shared
  • Fernando C. N. Pereira

    17 shared

Labs

  • Stuart M. Shieber LabPI

Awards & honors

  • ACL Fellow (2017)
  • Siebel Scholars Program (2017)
  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Stuart M. Shieber

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup