David Smith

Adjunct Associate Professor · Verified

University of Massachusetts Amherst · International Relations

Active 1960–2026

h-index: 37

Citations: 6.3k

Papers: 255 (20 in the last 5 years)

Funding: —

About

David A. Smith is an Associate Professor at the Khoury College of Computer Sciences at Northeastern University. His research focuses on natural language processing and computational linguistics, with applications to machine translation, information retrieval, and the social sciences and humanities. He is a founding member of the NULab for Texts, Maps, and Networks, Northeastern's research center dedicated to digital humanities and computational social sciences. Professor Smith has contributed to projects funded by organizations such as the Mellon Foundation, the Institute of Museum and Library Services, and the NEH, on topics including historical and multilingual OCR in the humanities, Arabic-script OCR, tracking news and ideas across countries and languages, and analyzing text reuse in historical and political texts. His work has been featured in outlets such as Wired and The Economist. He previously held positions at the University of Massachusetts Amherst, Johns Hopkins University, and Tufts University, where he worked on projects related to natural language processing, digital libraries, and machine translation. His scholarly contributions include numerous conference and journal publications, and he has authored a forthcoming book on virality in nineteenth-century newspapers.

Research topics

Political Science
Computer Science
Artificial Intelligence
Information Retrieval
Data Mining
Natural Language Processing
Machine Learning
Law
Economics
Macroeconomics
Market economy
International trade
Archaeology
Commerce
Library science
History
Programming language
International economics

Selected publications

Review: The Battle of Manila: Poisoned Victory in the Pacific War, by Nicholas Evan Sarantakes
Pacific Historical Review · 2026-01-01
article · 1st author · Corresponding
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
2025-01-01 · 2 citations
article · Open access
Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large-language model (LLM) training. Beyond this, PII may be added or removed from training datasets due to evolving dataset curation techniques, because they were newly scraped for retraining, or because they were included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII significantly (in our settings, as much as 7.5); and (3) removing PII can lead to other PII being memorized. Model creators should consider these first- and second-order privacy risks when training models to avoid the risk of new PII regurgitation.
Through the Lens of History: Methods for Analyzing Temporal Variation in Content and Framing of State-run Chinese Newspapers
2025-01-01
article · Open access · Senior author
Shijia Liu, David A. Smith. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.
Copyright, Privacy, and Public Access in News Archives: a proof of concept on the Boston Globe photograph morgue
AI & Society · 2025-03-19 · 1 citation
article · Open access · Senior author
Whether supplementing written articles in newspapers or playing a leading role in photo-reporting, photography has achieved an influential role in the delivery of information and framing of narratives to mass audiences. Photojournalism archives represent a unique source of historical data and public records about local, national, and international events, political movements, demonstrations, and urban development. This paper outlines a data archaeology project that leverages artificial intelligence (AI) for organizing and searching through photojournalism collections, based on the Boston Globe photograph morgue. While the primary goal of the project is to foster public access to news archives, the large-scale digitization and recovery of information from photojournalism collections raises ethical questions about intellectual property and the right to identity protection when records are made available online. We present a proof of concept that tackles these issues by means of AI, while still offering equitable access to journalism archives that are often kept inaccessible within private media institutions. The first part of the paper discusses how machine learning can resolve the lack of resources to parse through data on digital surrogates. After providing an introduction to the use of ML to facilitate access to information in the Boston Globe photograph morgue, we outline two partially automated computational tasks: (1) an AI toolkit for transcribing archivists’ notes and recovering photographers’ names and creation dates, which can be used by librarians and archivists to assess copyright on records; (2) a pipeline for face detection and blurring that detects areas where identifiable people are present and allows for anonymization.
As news archiving is confronted with challenges derived from the “digital heap” of orphaned data, privatization, and other barriers to journalism records, this report explores an ethical approach to structuring data in news archives for public access, by preserving intellectual property and privacy.
Detecting Manuscript Annotations in Historical Print: Negative Evidence and Evaluation Metrics
2024-01-01 · 1 citation
article · Open access · Senior author
Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
Lecture Notes in Computer Science · 2024-01-01
book-chapter · Senior author
MONSTERMASH: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting
Lecture Notes in Computer Science · 2024-01-01 · 2 citations
book-chapter · Senior author
Labor organizing at chokepoints along Amazon’s supply chain: Locating geo-strategic nodes
Environment and Planning A: Economy and Space · 2024-02-23 · 9 citations
article
Amazon seems to be creating a new hybrid model of capitalism combining some elements of classical Fordist vertical integration, or even the over hundred-year-old “Taylorism” of scientific management, with 21st century elements of labor “flexibility” and reliance on gig labor and subcontracting. This hybrid model offers opportunities for organized labor to gain a foothold within some of Amazon’s vertically integrated nodes as the firm lengthens its corporate commodity chain to grow increasingly close to consumers. Building on earlier work on opportunities for, and constraints on, labor in a variety of global commodity chains, our empirical cases examine how Amazon’s corporate strategies may open opportunities for labor in three illustrative cases ensconced in fulfillment centers—the Fordist vertical integration side of the model—in the Inland Empire and Otay Mesa (both in southern California) and Northern Kentucky.
Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
arXiv (Cornell University) · 2024-06-28
preprint · Open access · Senior author
Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.
The Political Economy of Trade, Work, and Economy: De-globalization – or Re-globalization?
New Global Studies · 2024-04-05 · 2 citations
article · Open access · 1st author · Corresponding
What forces will shape the global future? We begin with discussion of the central roles of globalization and the ecologically destructive Anthropocene and then move onto more current popular and political debates about questions of unchallengeable globalization versus de-globalization and re-globalization. We side with the former. The broad story is how historical global capitalism, with different leading core states or hegemons, inexorably pushed global society into an increasingly tight related connected world-economy, meshed together by commodity webs and supply chains that linked increasingly far-flung locations, geologies, landscapes, and ecosystems. The vision is one of a world-system, embedded to a large degree on market and nation-state capitalism and political power, conflict, and cooperation, that grows more and more tightly integrated, spatially widespread, and ecologically destructive as it expanded for six hundred years. We disagree with a fundamental “break” from the old political economy view. In fact, we are confident that today’s current Anthropocene global consciousness remains – with major concern with climate change and worldwide pandemics. There is little doubt that worldwide globalization is not only needed but essentially inescapable.

Frequent coauthors

Judith Stepan‐Norris
University of California, Irvine
20 shared
Valerie Jenness
19 shared
Kriste Krstovski
Columbia University
12 shared
Paul S. Ciccantell
8 shared
Michael D. Kennedy
Cedarville University
8 shared
Shaobin Xu
Northeastern University
7 shared
Jason Eisner
6 shared
Sebastian Riedel
6 shared

Labs

David A. Smith Lab · PI

Education

Ph.D., Natural Language Processing
Johns Hopkins University
M.S., Perseus Project
Tufts University
B.S.
University of Massachusetts, Amherst
