
J. Stephen Downie
Professor, Executive Associate Dean, and Co-Director of the HathiTrust Research Center · University of Illinois Urbana-Champaign · Information Sciences
Active 1900–2025
About
J. Stephen Downie is an executive associate dean and a professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. He is also the Illinois co-director of the HathiTrust Research Center. His research focuses on the design and evaluation of information retrieval systems, including multimedia music information retrieval, the political economy of internetworked communication systems, database design, and Web-based technologies. Downie has been an active participant in the digital libraries and digital humanities research domains, helping to establish a vibrant music information retrieval research community. He is best known for having directed the annual Music Information Retrieval Evaluation eXchange (MIREX) since 2005, and he was a founder and the first president of the International Society for Music Information Retrieval (ISMIR). His contributions include fostering research and collaboration in music information retrieval and digital library analysis, with active projects such as the HathiTrust Research Center and various initiatives in knowledge integration, data analysis, and digital scholarship.
Research topics
- Computer Science
- Political Science
- Sociology
- Artificial Intelligence
- Information Retrieval
- Natural Language Processing
- World Wide Web
- Visual arts
- Art
- Media studies
- Public relations
- Multimedia
- Psychology
- Programming language
- Philosophy
- Mathematics
- Computational biology
- Engineering
- Law
- Engineering ethics
- Mathematics education
- History
- Biology
- Epistemology
Selected publications
Building Educational Partnerships to Design Data Science Programs
Proceedings of the ALISE Annual Conference · 2025-10-03
article · Open access · Senior author
Since 2021, the University of Illinois Urbana-Champaign (UIUC) has been one of the universities hosting the Bolashak International Scholarship. The scholarship aims to prepare scholars and professionals to work on priority sectors of Kazakhstan’s economy (Bolashak International Scholarship, 2025). At UIUC, the Bolashak International Scholarship is coordinated by Global Education and Training (GET), Illinois International. In 2024, GET invited the School of Information Sciences (iSchool) to set up an educational partnership to co-host data science Bolashak fellows. The partnership was implemented through the establishment of one team representing GET and one team representing the iSchool. The GET and the iSchool team collaborate to design a one-year data science program composed of academic events, mentoring, and course audit. The data science program concept is taking into consideration the scholars’ needs and interests, and relies on data science instructors’ expertise, and previous literature on teaching data science and data science programs (Wing, 2019; Brunner, Kim, 2016; Kross, Guo, 2019; Rokem et al., 2015; Tang and Sae-Lim, 2016). Although the implementation of the program faces challenges, such as finding more faculty to participate in the program, a lack of partnership policies and definition of roles, appropriate data science curriculum for one year, the partnership is cooperative and is taking shape as a coalition partnership (Tushnet, 1993; Berliner, 1997). Co-hosting Bolashak fellows has driven a research agenda that integrates research, practice, and policies to create a data science program based upon the iSchool’s unique perspective working at the intersection of people, technology, and information.
Comparative analysis of classics book review data created by users across Douban and Goodreads
Digital Scholarship in the Humanities · 2025-08-09
article · Open access · Senior author
While empirical research on online book reviews has advanced our understanding of book reception and everyday reading, more cross-platform and comparative analyses are needed to cross-examine and enrich existing perspectives on these topics. This paper presents our comparative analysis of user-created reviews of “classics” across two platforms: Goodreads based in the U.S., and Douban based in China. With book tagging, rating, and reviewing data collected from the two platforms, we examined how 144 classics identified by users on Goodreads and 141 classics on Douban were characterized and received across platforms. Our analysis reveals how book reviewers’ interests in and opinions about classics differ across platforms and demonstrates the necessity of gaining a more comprehensive understanding of classics. We also contribute a parallel dataset of classics across Goodreads and Douban for future investigations. To the best of our knowledge, this is the first parallel classics dataset across Anglophone and non-Anglophone platforms. In addition, we identify three primary challenges in aligning and comparing cross-platform data regarding metadata quality, rating systems, and multi-lingual data processing. We present these challenges and discuss how we managed to overcome them. Our case study furthers existing conversations on the characterization and reception of classics online and informs future cross-platform online book review studies.
Big Data & Society · 2025-08-03
article · Open access · Senior author
Attracted by the promise of a broader and more egalitarian sample of readers than published book reviews provide, researchers are increasingly scraping social reviewing platforms like Goodreads for data about readers’ behavior. Yet, treating online book reviews as direct proxies for readers and books can be problematic, as they are socially and technically constructed artifacts shaped by platform dynamics, whether between developers and users, or book industry stakeholders and reviewers. To uncover these complexities, we computationally curated 331,211 self-identified incentivized book reviews to understand the growth of incentivized content, and how these purportedly equal-access social reviewing spaces are re-inscribing the inequalities of traditional book reviewing and publishing. Our findings underscore the necessity of critical examination of both online book reviewing and cultural datasets derived from social media platforms. With the growing restrictions on access to platform data for research, this study also demonstrates the potential for a mixed-method analysis of historical scraped datasets; an approach that will likely be of interest to many researchers working with cultural data moderated by black-box algorithms. With this method, our research reveals for the first time the scale of the phenomena of incentivized book reviews that is well known to users of Goodreads but remains largely anecdotal. Additionally, it illuminates the rise of sponsored content while contributing to broader discussions on computational approaches to digital economies of prestige and the responsible use of platform-mediated cultural datasets across disciplines.
Metadata Enrichment of Long Text Documents using Large Language Models
Proceedings of the Association for Information Science and Technology · 2025-10-01
article · Open access · Senior author
In this project, we semantically enriched and enhanced the metadata of long text documents, theses and dissertations, retrieved from the HathiTrust Digital Library in English published from 1920 to 2020 through a combination of manual efforts and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories by introducing additional metadata access points that may not have originally been foreseen to accommodate various content types. This approach is particularly effective for repositories that have significant missing data in their existing metadata fields, enhancing search results, and improving the accessibility of the digital repository.
Making more sense with machines:
UCL Press eBooks · 2025-06-12
book-chapter
UCL Press eBooks · 2025-06-12
book-chapter · Senior author
Metadata Enrichment of Long Text Documents using Large Language Models
ArXiv.org · 2025-06-26
preprint · Open access · Senior author
In this project, we semantically enriched and enhanced the metadata of long text documents, theses and dissertations, retrieved from the HathiTrust Digital Library in English published from 1920 to 2020 through a combination of manual efforts and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories by introducing additional metadata access points that may not have originally been foreseen to accommodate various content types. This approach is particularly effective for repositories that have significant missing data in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.
TORCHLITE: New, Open Analytical Tools and Infrastructure for a Mega‐Scale Digital Library
Proceedings of the Association for Information Science and Technology · 2024-10-01
article
This paper introduces TORCHLITE, an innovative HathiTrust Research Center (HTRC) open analytical and computational framework designed to offer efficient, open, and approachable access to the HTRC Extracted Features (EF) dataset via a well‐documented web‐based API. This poster will summarize project goals and progress, and discuss community engagement, which has played a pivotal role in this project. During a hackathon event held in spring 2024, TORCHLITE fostered collaboration among digital humanities and information science scholars to develop widgets and notebooks utilizing the EF API. Through the hackathon, participants explored the API's capabilities, leading to the creation of over a dozen analytical widgets and interactive programming notebooks (e.g., Jupyter).
Guest editorial: Artificial intelligence for cultural heritage materials
Journal of Documentation · 2024-08-31 · 1 citation
editorial · Senior author
This special issue of the Journal of Documentation is focused on the uses of artificial intelligence (AI) in the provisioning and analysis of digital cultural heritage materials. Such uses may be occasioned by restricted or difficult access (e.g. due to privacy concerns or copyright restrictions or to the sheer and ever-increasing volume of cultural heritage data). Absolutely crucial as well are those uses of AI that researchers have chosen to make not out of necessity, but out of choice, relying on the affordances of new methods in computation and new algorithms to understand old cultural heritage materials in altogether new ways.
The authors of all these articles responded to a call with three principal questions: How can we use AI to make digital cultural heritage collections more accessible? How might we analyze these collections using AI research methods? And can we identify synergies and collaborative avenues among cultural organizations around the world that are engaged in AI-enhanced research and access methods? We invited scholars, curators and other cultural heritage workers in any aligned fields – in the humanities; in the information, data and computer sciences and in the libraries, archives and museums sector – to submit their work for publication in this special issue.
Before we get to those articles, let us put this special issue in context: it is just one of the key research outputs of the Artificial Intelligence for Cultural Organisations (AEOLIAN) Network, an effort funded jointly by the National Endowment for the Humanities (NEH) in the United States of America and by the Arts and Humanities Research Council (AHRC) in the United Kingdom. Lead institutions of this project were the University of Illinois Urbana–Champaign and Loughborough University; project partners included Durham University, Glasgow University, Dublin City University, Auburn University (Georgia, USA), the Frick Collection and Stanford University.
During its three years of activity, the AEOLIAN project also produced a robust and well-attended series of six international workshops (recordings of which are available on the project website, https://www.aeolian-network.net/category/workshops/); a set of five case studies from cultural heritage organizations in the USA and the UK (which form the basis of an edited volume currently under contract with University College London Press); a number of blog posts (e.g. Gooding, 2021; Worthey, 2021) and substantive interviews published in a variety of cultural heritage venues (e.g. Smith, 2021; Dressler, 2021) and original studies such as Jaillant and Aske (2023a, b).
We also commend to your attention another AEOLIAN-sponsored publication: the “Special Issue on Applying Innovative Technologies to Digitized and Born-Digital Archives,” in the Journal on Computing and Cultural Heritage (Vol. 16, No. 4), edited by AEOLIAN’s UK Principal Investigator Dr Lise Jaillant.
The AEOLIAN project is named, of course, for the Aeolian harp, a musical instrument of ancient origins played solely by wind and often producing ethereal, haunting, inhuman melodies that are a source of both inspiration and scientific study. It is our hope that this project’s investigations into the issues involved in the application of AI in the cultural heritage sector will both inspire and haunt. Current public discourse about artificial intelligence – which has increased in both sound and fury over the course of our project – is full of questions and answers, promises, fears and disappointments that all swirl about us in contradictory, unsettled, but still fascinating fashion. The AEOLIAN project has sought to avoid both the utopian and the cataclysmic modes of much current debate around AI (as in the highly reductive and speculative, “Will AI save us or destroy us?” etc.). Instead, we have sought to engage critically and practically with questions of current and potential applications of AI and machine learning in the service of cultural heritage.
As you read and consider the outstanding articles in this special issue of the Journal of Documentation dedicated to AI in the cultural heritage realm, we commend the many other AEOLIAN outcomes as well, all available at the project website, https://www.aeolian-network.net/. Taken together, they offer rich evidence of the complex landscape of AI activity, praxis, thought and critique as they are understood in the very human space of cultural heritage.
Even taken on their own and in their relatively modest number, the articles published in this special issue demonstrate an impressive array of fields and methodologies, which are making use of an impressive set of artificial intelligence approaches, and in this, they reflect well the broad diversity of the AEOLIAN project. Topics covered in these articles range from metadata creation to archival practice to archaeological field work, and even within this broad range they include collection types as broadly divergent as audiovisual, colonial and community archives, as well as national archival research infrastructures. On the metadata front, they include automatic classification for library catalogs, impressively fine-grained enhancements for historical newspaper collections and the recovery of hidden figures in those colonial archives. Equally diverse are the subject areas supported by the work documented in these articles, which span the gamut from archeology to the martial arts and much between and beyond. Finally, seen from the methodological point of view, they include techniques as diverse as computer vision, natural language processing, digitization and crowdsourcing.
In their article “Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections,” Ali et al. (2024) describe a major project at the Royal Library of Belgium that deploys computer vision to automatically enhance the metadata, and thus the accessibility, of article-level content in digitized historical newspaper collections. “Metadata enhancement” may seem like an old-fashioned concern, but this team’s use of AI-enhanced methods goes many steps further, determining from among the mass of digitized newspaper content the boundaries of individual articles, identifying their sub-genres (for example, feuilletons, news stories, etc.), extracting named entities from within those articles and creating a richer and more richly searchable historical collection.
Golub et al.’s (2024) article, “Automated Dewey Decimal Classification of Swedish library metadata using Annif software,” presents the results of a robust evaluation of Annif, a promising open-source package developed at the National Library of Finland, for subject classification. Golub’s team deployed this package to evaluate the performance of five different machine-learning algorithms (lexical algorithm, support vector classifier, fastText, Omikuji Bonsai and an ensemble of the four) against the results of five expert human catalogers. Their results – that no single algorithm can do the job as well as a combination of algorithms – would seem to reaffirm not only the deep complexity and subtlety of a very common and important library function (subject classification) but also the promise and potential of such an automated system to extend (though apparently not replace) limited-supply human labor in order to provide better, more efficient access to library materials.
Luthra et al. (2024) describe a major project that seeks to remediate some crucial omission biases in the historical record. Their article, “Unsilencing colonial archives via automated entity recognition,” describes the use of automatic named entity recognition on the archives of the Dutch East India Company to identify and recover mentions of previously unrecognized and underrepresented people documented in those archival records. In doing so, they focus not only on the rigorously applied and verified statistical and machine learning methods of their project, but more importantly on its ethical dimensions and its ability to rectify the injustices of 17th and 18th century hierarchies of religion, race, gender, class and colonial power.
Naumann and Neuberger (2024) discuss the changing realities of traditional archival organizations in the digital age in their article “User perspectives through cross-connections. The role of archives as part of the German digital research data infrastructure.” They address user perspectives and challenges in a richly interconnected environment where “individual” or “standalone” archives no longer really exist, addressing operational questions such as the role of portal infrastructures that link together different archival institutions. The paper focuses on emerging approaches to inter-institutional connections in Germany and its national data infrastructure, but its lessons may be applicable generally: although institutional cross-connections may not be a new phenomenon, these connections appear significantly different in a digital context. The authors reflect especially on the enhanced need for archivists to remain vigilant to the quality of their data and metadata and especially to seek out institutional systems that prioritize support for cross-connection and interlinking of data.
As the previous articles all demonstrate, AI and machine learning are proving to be valuable tools – though certainly not silver bullets – for the enhancement of description and access for traditional library and archival collections of books, historical newspapers and paper archives. But what of cultural heritage in newer (and thus less familiar), less tangible (and sometimes ephemeral) forms such as multimedia, born-digital and other novel types of cultural heritage? The next set of articles deals precisely with these newer (or more newly appreciated), less tangible types of collections.
Although recorded audiovisual cultural heritage has been with us for well over a century, its preservation and access continue to present vexing difficulties. In the digital age, a new set of challenges and opportunities have arisen. Yang’s (2024) article “Datafication of Audiovisual Archives: from Practice Mapping to a Thinking Model” describes the still-new practice of datafication of these materials, addressing in particular the questions of what sorts of data should be extracted from AV materials and for what purpose. Constructing a model along three broad dimensions of audiovisual content (archival, affective and esthetic and social and historical), Yang proposes mapping the data derivable from such archival materials to the specific purposes to which these data can be put. This exercise leads both to theoretically more justifiable metadata standards and to an enhanced understanding of the multimedia content itself.
Much newer than audiovisual materials (and arguably more ephemeral and problematic) are the born-digital materials that now constitute a vast and growing ocean of human cultural production. Hannaford et al. (2024) describe a complex set of solutions to the vexing problem of preserving and making accessible a particularly critically endangered type of cultural heritage material: community-generated digital content, especially that of marginalized communities. In their article, “Our Heritage, Our Stories: Developing AI Tools to Link and Support Community-Generated Digital Cultural Heritage,” they describe one of the primary challenges in dealing with this unique type of content as follows: the best existing attempts to collect, integrate and preserve community-created content require “bespoke interventionist activities” that are expensive, time-consuming and unsustainable at scale; at the same time, the unsophisticated use of computational methods, meant to deal with the problem of scale, tends to erase the meaning and purpose of both the content and its creators – effectively silencing already marginalized communities. The authors instead rely on a combination of multidisciplinary methods, AI tools and, crucially, a co-design process that includes the community creators themselves.
However, it is not the case that only the new, digital forms of culture require new thinking in the information age: much older forms of cultural heritage are also demanding much newer forms of attention. Hou et al. (2024) take us from the contemporary and disembodied world of digital content to the ancient and profoundly embodied realm of martial arts. Their article, “Unlocking a multimodal archive of Southern Chinese martial arts through embodied cues,” proposes a novel approach to the martial arts as an “authentic carrier of cultural practice.” They combine methods in “movement computing” with domain-specific modeling to enable the search and retrieval of the “embodied cues” inherent in Southern Chinese martial arts. This work allows for the archiving of human movement, the creative expression that it embodies and the cultural contexts in which it is embedded. They use machine learning methods to enhance the archival expressions of such intangible cultural heritage.
Although all of our authors express subtle combinations of hope and healthy skepticism about the promise of AI in its application to cultural heritage, one of the clearest such cautionary tales comes in work related to the oldest cultural expressions treated in our special issue. Sobotkova et al. (2024) describe their thoughtful application of machine-learning methods to a well-defined set of problems in archeology in their article, “Validating Predictions of Burial Mounds with Field Data: the Promise and Reality of Machine Learning.” Here they describe a case study related to burial mounds in the Kazanlak Valley, Bulgaria, documented with high-resolution satellite imagery. Comparing the observations of carefully trained neural networks to those of even a novice human observer, they discover that even the most sophisticated models struggle when faced with the sorts of inconsistencies that occur in real-world landscapes. Importantly, though, the authors do not altogether reject machine learning but rather offer cautions for those many who will want to continue to experiment with AI and suggestions for its improvement.
Most readers will no doubt realize that a lot has happened – many would say an entire revolution – in the world of AI during the three years since we began the AEOLIAN project in 2021 and since the research reported here was done. Fair enough: large language models especially, and most obviously (and most brashly) those that power “generative AI” like the ubiquitous ChatGPT and image-creation programs, have indeed taken the world by storm in the popular press. Much of this press has been hyperbolic: does this or that particular mode of artificial intelligence represent the bright future of humanity or spell its doom? Or are we simply observing a high point in the predictable technology hype cycle, a bubble that will (or perhaps already has, by the time you read these words) burst?
Regardless of the answers to any of those questions, we believe that the solid work presented here, grounded in both curiosity and a respect for human cultural production, will stand the test of time. Although the timeline of scholarship and scholarly publishing in the humanities is painfully long compared with that of fickle public attention, the timeline of human culture is even longer, and it’s our hope that the thoughtful pieces presented here will still offer thoughtful readers something to think about, something to work on, for a long time to come.
Funding: Funding for AEOLIAN (“Artificial Intelligence for Cultural Organisations”) came jointly from the Arts and Humanities Research Council in the UK (Reference: AH/V009443/1) and the National Endowment for the Humanities in the US (Reference: HC-278124-21). The views, findings, conclusions, and recommendations expressed in this article and those that follow, do not necessarily represent those of the National Endowment for the Humanities or the Arts and Humanities Research Council.
arXiv (Cornell University) · 2024-03-27
preprint · Open access · Senior author
In the era of big and ubiquitous data, professionals and students alike are finding themselves needing to perform a number of textual analysis tasks. Historically, the general lack of statistical expertise and programming skills has stopped many with humanities or social sciences backgrounds from performing and fully benefiting from such analyses. Thus, we introduce Coconut Libtool (www.coconut-libtool.com/), an open-source, web-based application that utilizes state-of-the-art natural language processing (NLP) technologies. Coconut Libtool analyzes text data from customized files and bibliographic databases such as Web of Science, Scopus, and Lens. Users can verify which functions can be performed with the data they have. Coconut Libtool deploys multiple algorithmic NLP techniques at the backend, including topic modeling (LDA, Biterm, and BERTopic algorithms), network graph visualization, keyword lemmatization, and sunburst visualization. Coconut Libtool is the people-first web application designed to be used by professionals, researchers, and students in the information sciences, digital humanities, and computational social sciences domains to promote transparency, reproducibility, accessibility, reciprocity, and responsibility in research practices.
Recent grants
EAGER: Structural Analysis of Large Amounts of Music Information (SALAMI)
NSF · $100k · 2010–2013
Doctoral Consortium Support for Joint Conference on Digital Libraries (JCDL) 2015
NSF · $15k · 2015–2017
Toward the Scientific Evaluation of Music Information Retrieval Systems
NSF · $515k · 2004–2008
NSF · $212k · 2019–2022
Workshop on Integrating Digital Library Content with Computational Tools and Services
NSF · $25k · 2009–2012
Frequent coauthors
- 30 shared
David Bainbridge
University of Waikato
- 28 shared
Xiao Hu
University of Hong Kong
- 22 shared
Jin Ha Lee
University of Washington
- 21 shared
Kevin Page
University of Oxford
- 21 shared
Sally Jo Cunningham
- 20 shared
Andreas F. Ehmann
- 17 shared
Jacob Jett
University of Illinois Urbana-Champaign
- 17 shared
Timothy W. Cole
Education
- 1999
PhD, Faculty of Media and Information Studies
University of Western Ontario
- 1993
MLIS, Graduate School of Library and Information Science
University of Western Ontario
- 1988
BA, Faculty of Music
University of Western Ontario
Awards & honors
- Music Information Retrieval Evaluation eXchange (MIREX) (sin…
- Founder of the International Society Music Information Retri…
- First president of ISMIR
- iSchool participation in iConference 2026
- iSchool well represented at ASIS&T 2025