David Smith

Adjunct Associate Professor · Verified

University of Massachusetts Amherst · International Relations

Active 1960–2026

h-index: 37
Citations: 6.3k
Papers: 255 · 20 last 5y

About

David A. Smith is an Associate Professor at the Khoury College of Computer Sciences at Northeastern University. His research focuses on natural language processing and computational linguistics, with applications to machine translation, information retrieval, and the social sciences and humanities. He is a founding member of the NULab for Texts, Maps, and Networks, Northeastern's research center for digital humanities and computational social science. Professor Smith has contributed to projects funded by the Mellon Foundation, the Institute of Museum and Library Services, and the NEH, on topics including historical and multilingual OCR in the humanities, Arabic-script OCR, tracking news and ideas across countries and languages, and analyzing text reuse in historical and political texts. His work has been featured in Wired and The Economist. He previously held positions at the University of Massachusetts Amherst, Johns Hopkins University, and Tufts University, where he worked on projects in natural language processing, digital libraries, and machine translation. He has published widely in conferences and journals and has a forthcoming book on virality in nineteenth-century newspapers.

Research topics

  • Political Science
  • Computer Science
  • Artificial Intelligence
  • Information Retrieval
  • Data Mining
  • Natural Language Processing
  • Machine Learning
  • Law
  • Economics
  • Macroeconomics
  • Market economy
  • International trade
  • Archaeology
  • Commerce
  • Library science
  • History
  • Programming language
  • International economics

Selected publications

  • Review: The Battle of Manila: Poisoned Victory in the Pacific War, by Nicholas Evan Sarantakes

    Pacific Historical Review · 2026-01-01

    article · 1st author · Corresponding
  • Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training

    2025-01-01 · 2 citations

    article · Open access

    Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large-language model (LLM) training. Beyond this, PII may be added or removed from training datasets due to evolving dataset curation techniques, because they were newly scraped for retraining, or because they were included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII significantly (in our settings, as much as 7.5); and (3) removing PII can lead to other PII being memorized. Model creators should consider these first- and second-order privacy risks when training models to avoid the risk of new PII regurgitation.

  • Through the Lens of History: Methods for Analyzing Temporal Variation in Content and Framing of State-run Chinese Newspapers

    2025-01-01

    article · Open access · Senior author

    Shijia Liu, David A. Smith. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.

  • Copyright, Privacy, and Public Access in News Archives: a proof of concept on the Boston Globe photograph morgue

    AI & Society · 2025-03-19 · 1 citation

    article · Open access · Senior author

    Whether supplementing written articles in newspapers or playing a leading role in photo-reporting, photography has achieved an influential role in the delivery of information and framing of narratives to mass audiences. Photojournalism archives represent a unique source of historical data and public records about local, national, and international events, political movements, demonstrations, and urban development. This paper outlines a data archaeology project that leverages artificial intelligence (AI) for organizing and searching through photojournalism collections, based on the Boston Globe photograph morgue. While the primary goal of the project is to foster public access to news archives, the large-scale digitization and recovery of information from photojournalism collections raises ethical questions about intellectual property and the right to identity protection when records are made available online. We present a proof of concept that tackles these issues by means of AI, while still offering equitable access to journalism archives that are often kept inaccessible within private media institutions. The first part of the paper discusses how machine learning can resolve the lack of resources to parse through data on digital surrogates. After providing an introduction to the use of ML to facilitate access to information in the Boston Globe photograph morgue, we outline two partially automated computational tasks: (1) an AI toolkit for transcribing archivists’ notes and recovering photographers’ names and creation dates, which can be used by librarians and archivists to assess copyright on records; (2) a pipeline for face detection and blurring that detects areas where identifiable people are present and allows for anonymization. As news archiving is confronted with challenges derived from the “digital heap” of orphaned data, privatization, and other barriers to journalism records, this report explores an ethical approach to structuring data in news archives for public access, by preserving intellectual property and privacy.

  • Detecting Manuscript Annotations in Historical Print: Negative Evidence and Evaluation Metrics

    2024-01-01 · 1 citation

    article · Open access · Senior author
  • Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

    Lecture Notes in Computer Science · 2024-01-01

    book-chapter · Senior author
  • MONSTERMASH: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting

    Lecture Notes in Computer Science · 2024-01-01 · 2 citations

    book-chapter · Senior author
  • Labor organizing at chokepoints along Amazon’s supply chain: Locating geo-strategic nodes

    Environment and Planning A: Economy and Space · 2024-02-23 · 9 citations

    article

    Amazon seems to be creating a new hybrid model of capitalism combining some elements of classical Fordist vertical integration, or even the over hundred-year-old “Taylorism” of scientific management, with 21st century elements of labor “flexibility” and reliance on gig labor and subcontracting. This hybrid model offers opportunities for organized labor to gain a foothold within some of Amazon’s vertically integrated nodes as the firm lengthens its corporate commodity chain to grow increasingly close to consumers. Building on earlier work on opportunities for, and constraints on, labor in a variety of global commodity chains, our empirical cases examine how Amazon’s corporate strategies may open opportunities for labor in three illustrative cases ensconced in fulfillment centers—the Fordist vertical integration side of the model—in the Inland Empire and Otay Mesa (both in southern California) and Northern Kentucky.

  • Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

    arXiv (Cornell University) · 2024-06-28

    preprint · Open access · Senior author

    Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.

  • The Political Economy of Trade, Work, and Economy: De-globalization – or Re-globalization?

    New Global Studies · 2024-04-05 · 2 citations

    article · Open access · 1st author · Corresponding

    What forces will shape the global future? We begin with discussion of the central roles of globalization and the ecologically destructive Anthropocene and then move on to more current popular and political debates about questions of unchallengeable globalization versus de-globalization and re-globalization. We side with the former. The broad story is how historical global capitalism, with different leading core states or hegemons, inexorably pushed global society into an increasingly tightly connected world-economy, meshed together by commodity webs and supply chains that linked increasingly far-flung locations, geologies, landscapes, and ecosystems. The vision is one of a world-system, embedded to a large degree in market and nation-state capitalism and political power, conflict, and cooperation, that grew more and more tightly integrated, spatially widespread, and ecologically destructive as it expanded for six hundred years. We disagree with a fundamental “break” from the old political economy view. In fact, we are confident that today’s Anthropocene global consciousness remains, with major concern about climate change and worldwide pandemics. There is little doubt that worldwide globalization is not only needed but essentially inescapable.

Labs

  • David A. Smith Lab · PI

Education

  • Ph.D., Natural Language Processing

    Johns Hopkins University

  • M.S., Perseus Project

    Tufts University

  • B.S.

    University of Massachusetts, Amherst
