
Hilke Schellmann
Professor · Verified · New York University · History
Active 2022–2026
About
Professor Hilke Schellmann is a faculty member in the Department of Journalism at New York University. She is an Emmy Award-winning reporter who investigates artificial intelligence tools, their impact, and the power they hold over people's lives. Schellmann develops new methods for examining and understanding AI technologies, with an emphasis on their societal implications.
Research topics
- Computer Science
- Machine Learning
- Psychology
- Applied psychology
- Social psychology
- Artificial Intelligence
- Natural Language Processing
- Accounting
- Clinical psychology
- Data science
- Linguistics
- Business
- Philosophy
- Speech recognition
Selected publications
Scientific American · 2026-02-17
article · 1st author · Corresponding
The case for stakeholder-driven AI auditing in automatic speech recognition
Nature Machine Intelligence · 2026-03-16
article
Careless Whisper: Speech-to-Text Hallucination Harms
arXiv (Cornell University) · 2024 · 68 citations
- Computer Science
- Natural Language Processing
- Linguistics
Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI’s Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper’s transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations, a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.
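The "share of non-vocal duration" mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `non_vocal_share` and the input format (a list of voiced `(start, end)` intervals in seconds, as a voice activity detector might produce) are assumptions made for illustration only.

```python
from typing import List, Tuple


def non_vocal_share(voiced: List[Tuple[float, float]], total: float) -> float:
    """Return the fraction of a clip not covered by any voiced interval.

    `voiced` holds hypothetical (start, end) times in seconds; `total` is
    the clip length in seconds. Overlapping intervals are counted once.
    """
    if total <= 0:
        raise ValueError("total duration must be positive")
    covered = 0.0
    cursor = 0.0  # end of the voiced time already counted
    for start, end in sorted(voiced):
        start = max(start, cursor)  # skip time that was already counted
        end = min(end, total)       # clip intervals that run past the end
        if end > start:
            covered += end - start
            cursor = end
    return 1.0 - covered / total
```

A clip with a higher value contains proportionally more silence or other non-speech audio. Under this sketch, the comparison described in the abstract would amount to relating this value to whether a transcription contains hallucinated text.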
An external stability audit framework to test the validity of personality prediction in AI hiring
Data Mining and Knowledge Discovery · 2022 · 28 citations
- Computer Science
- Machine Learning
- Psychology
Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers' resumes or social media profiles. We interrogate the validity of such systems using stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. Crucially, rather than challenging or affirming the assumptions made in psychometric testing - that personality is a meaningful and measurable construct, and that personality traits are indicative of future success on the job - we frame our audit methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves. Our main contribution is the development of a socio-technical framework for auditing the stability of algorithmic systems. This contribution is supplemented with an open-source software library that implements the technical components of the audit, and can be used to conduct similar stability audits of algorithmic systems. We instantiate our framework with the audit of two real-world personality prediction systems, namely, Humantic AI and Crystal. The application of our audit framework demonstrates that both these systems show substantial instability with respect to key facets of measurement, and hence cannot be considered valid testing instruments.
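One way to make "stability of the outputs" concrete is to score the same people twice (for example, from two file formats of the same resume) and check whether their relative ordering is preserved. The sketch below uses Spearman rank correlation as an assumed stability measure. It is not the paper's open-source library, and the function names `average_ranks` and `spearman` are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two score lists for the same people."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

A value near 1 means the two runs rank candidates almost identically, while values well below 1 indicate that the ranking depends on something other than the candidate. As the abstract notes, a high value would show reliability only, which is necessary but not sufficient for validity.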
2022 · 15 citations
- Computer Science
- Artificial Intelligence
- Machine Learning
Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers' resumes or social media profiles. We interrogate the reliability of such systems using stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. We develop a methodology for an external audit of stability of algorithmic personality tests, and instantiate this methodology in an audit of two systems, Humantic AI and Crystal. Rather than challenging or affirming the assumptions made in psychometric testing -- that personality traits are meaningful and measurable constructs, and that they are indicative of future success on the job -- we frame our methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves.
An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring
arXiv (Cornell University) · 2022-01-23 · 2 citations
preprint · Open access
Automated hiring systems are among the fastest-developing of all high-stakes AI systems. Among these are algorithmic personality tests that use insights from psychometric testing, and promise to surface personality traits indicative of future success based on job seekers' resumes or social media profiles. We interrogate the validity of such systems using stability of the outputs they produce, noting that reliability is a necessary, but not a sufficient, condition for validity. Our approach is to (a) develop a methodology for an external audit of stability of predictions made by algorithmic personality tests, and (b) instantiate this methodology in an audit of two systems, Humantic AI and Crystal. Crucially, rather than challenging or affirming the assumptions made in psychometric testing -- that personality is a meaningful and measurable construct, and that personality traits are indicative of future success on the job -- we frame our methodology around testing the underlying assumptions made by the vendors of the algorithmic personality tests themselves. Our main contribution is the development of a socio-technical framework for auditing the stability of algorithmic systems. This contribution is supplemented with an open-source software library that implements the technical components of the audit, and can be used to conduct similar stability audits of algorithmic systems. We instantiate our framework with the audit of two real-world personality prediction systems, namely Humantic AI and Crystal. The application of our audit framework demonstrates that both these systems show substantial instability with respect to key facets of measurement, and hence cannot be considered valid testing instruments.
Frequent coauthors
- Mona Sloane, University of Virginia (5 shared)
- Julia Stoyanovich, New York University (3 shared)
- Alene K. Rhea (3 shared)
- Kelsey Markey, New York University (2 shared)
- Anna Seo Gyeong Choi, Cornell University (2 shared)
- Lauren D’Arinzo, New York University (2 shared)
- Allison Koenecke, Cornell University (2 shared)
- K. Mei, University of Washington (2 shared)