Mark Liberman

Christopher H. Browne Distinguished Professor of Linguistics · Phonetics, prosody, natural language processing, speech communication

University of Pennsylvania · Linguistics

Active 1976–2026

h-index: 41

Citations: 9.4k

Papers: 335 · 124 last 5y

Funding: $1.3M



About

Mark Liberman is the Christopher H. Browne Distinguished Professor of Linguistics and Director of the Linguistic Data Consortium at the University of Pennsylvania. He holds professorships in both the Department of Linguistics and the Department of Computer and Information Science, and serves as Faculty Director of Ware College House. His academic work spans a broad range of topics within linguistics and cognitive science, including corpus-based phonetics, the phonology and phonetics of lexical tone and its relationship to intonation, and formal models for linguistic annotation. Additionally, he applies linguistic analysis to legal, medical, and political domains, demonstrating an interdisciplinary approach to language research. Professor Liberman teaches a variety of courses covering introductory linguistics, computational analysis, phonetics, big data in linguistics, and advanced topics such as deep learning and large language models in linguistic research. His research contributions reflect a commitment to integrating computational methods with linguistic theory and practical applications.

Research topics

Computer Science
Artificial Intelligence
Natural Language Processing
Speech recognition
Psychology
Linguistics

Selected publications

Automatic detection of autism using large vision-language models: A preliminary analysis.
2026-01-01
article
Robert T. Schultz, & Julia Parish-Morris.
Speaker role identification in clinical conversations
Faculty of 1000 Research Ltd · 2025-01-01
other · Open access
Social Context Matters for Turn‐Taking Dynamics: A Comparative Study of Autistic and Typically Developing Children
UNC Libraries · 2025-10-23
article · Open access
Engaging in fluent conversation is a surprisingly complex task that requires interlocutors to promptly respond to each other in a way that is appropriate to the social context. In this study, we disentangled different dimensions of turn-taking by investigating how the dynamics of child-adult interactions changed according to the activity (task-oriented vs. freer conversation) and the familiarity of the interlocutor (familiar vs. unfamiliar). Twenty-eight autistic children (16 male; $M_{age}$ = 10.8 years) and 20 age-matched typically developing children (8 male; $M_{age}$ = 9.6 years) participated in seven task-orientated face-to-face conversations with their caregivers (336 total conversations) and seven more telephone conversations alternately with their caregivers (144 total conversations, 60 with the typical development group) and an experimenter (191 total conversations, 112 with the autism group). By modeling inter-turn response latencies in multi-level Bayesian location-scale models, we found that inter-turn response latencies were consistent across repeated measures within social contexts, but exhibited substantial differences across social contexts. Autistic children exhibited more overlaps, produced faster response latencies and shorter pauses than typically developing children, and these group differences were stronger when conversing with the unfamiliar experimenter. Unfamiliarity also made the relation between individual differences and latencies evident: only in conversations with the experimenter were higher sociocognitive skills and lower social awareness associated with faster responses. Information flow and shared tempo were also influenced by familiarity: children adapted their response latencies to the predictability and tempo of their interlocutor's turn, but only when interacting with their caregivers and not the experimenter. These results highlight the need to construe turn-taking as a multicomponential construct that is shaped by individual differences, interpersonal dynamics, and the affordances of the context.
Relation between Depression Dimensions and Speech Acoustic and Emotion‐based Features
Alzheimer's & Dementia · 2025-12-01
article · Open access
BACKGROUND: Depression is a common and highly heterogeneous disorder in older adults, often linked to faster cognitive decline. While standard questionnaires are subjective, speech analysis may offer a more objective method for characterizing depression. This study investigates the relationship between speech features and depression dimensions in participants at Mount Sinai Alzheimer's Disease Research Center (ADRC). METHOD: Participants included healthy controls (n = 31) and individuals with mild cognitive impairment (MCI, n = 22) and Alzheimer's Disease (AD, n = 16). They described three pictures with neutral, negative and positive themes. Speech was processed with automated pipelines, and the resulting emotion-based variables and acoustic features were used for analysis. Depression dimensions (dysphoria, apathy, hopelessness, and memory complaints) were assessed based on the Geriatric Depression Scale-15 (GDS-15). Mixed model regression was used to assess the relationship of depression dimensions and the emotional nature of the pictures. RESULT: Among the 73 participants (41% male, average age 80.14±8.01), dysphoria and apathy had opposite associations with emotion-based measures. Dysphoria was associated with higher valence (more positive emotion), while apathy was associated with lower valence. Subjective memory complaint was also associated with lower valence words. Further analysis revealed that apathy was associated with lower pitch and slower speech when describing negative pictures, and dysphoria with a wider pitch range and faster speech for negative pictures. Patients with memory complaints used a narrower pitch range in both positive and negative tasks. CONCLUSION: Dysphoria was associated with heightened emotional reactivity, while apathy showed decreased reactivity. Apathy and memory complaints shared similar speech features. Our preliminary results support the use of speech features in distinguishing different depression dimensions even at the subclinical level, offering promising opportunities for use of technology to enhance both understanding and diagnosis of depression.
Social Context Matters for Turn‐Taking Dynamics: A Comparative Study of Autistic and Typically Developing Children
Cognitive Science · 2025-10-01 · 3 citations
article · Open access
Engaging in fluent conversation is a surprisingly complex task that requires interlocutors to promptly respond to each other in a way that is appropriate to the social context. In this study, we disentangled different dimensions of turn‐taking by investigating how the dynamics of child–adult interactions changed according to the activity (task‐oriented vs. freer conversation) and the familiarity of the interlocutor (familiar vs. unfamiliar). Twenty‐eight autistic children (16 male; $M_{age}$ = 10.8 years) and 20 age‐matched typically developing children (8 male; $M_{age}$ = 9.6 years) participated in seven task‐orientated face‐to‐face conversations with their caregivers (336 total conversations) and seven more telephone conversations alternately with their caregivers (144 total conversations, 60 with the typical development group) and an experimenter (191 total conversations, 112 with the autism group). By modeling inter‐turn response latencies in multi‐level Bayesian location‐scale models, we found that inter‐turn response latencies were consistent across repeated measures within social contexts, but exhibited substantial differences across social contexts. Autistic children exhibited more overlaps, produced faster response latencies and shorter pauses than typically developing children, and these group differences were stronger when conversing with the unfamiliar experimenter. Unfamiliarity also made the relation between individual differences and latencies evident: only in conversations with the experimenter were higher sociocognitive skills and lower social awareness associated with faster responses. Information flow and shared tempo were also influenced by familiarity: children adapted their response latencies to the predictability and tempo of their interlocutor's turn, but only when interacting with their caregivers and not the experimenter. These results highlight the need to construe turn‐taking as a multicomponential construct that is shaped by individual differences, interpersonal dynamics, and the affordances of the context.
Age distribution of speech duration measures in healthy individuals with picture description tasks
Alzheimer's & Dementia · 2025-12-01
article · Open access
BACKGROUND: Picture description tasks have been used to gauge communication capabilities of neurodegenerative patients. Previous studies have demonstrated that speech duration measures can help distinguish between healthy and neurodegenerative individuals, or even among patients of different neurodegenerative diseases. Hence, it is crucial to establish the baseline of those measures in healthy individuals in a wide age range, as deviations could indicate neurodegeneration. We collected picture description task responses from healthy volunteers to quantify such a baseline, which can aid early detection of neurodegenerative diseases. METHODS: There were 290 healthy participants with a wide distribution of ages, from 15 to 90 years (M=49.9, SD=18.2), who voluntarily participated in picture description tasks online. The pictures included the Cookie Theft scene, the Picnic scene, and two similar pictures we designed; some participants completed all four, while others completed only some of them (M=2.6, SD=1.3). An in-house speech activity detector program was employed to segment audio files into speech and pause segments automatically. We built linear mixed-effects models to examine the effects of age on quantitative speech duration measures, including mean speech and pause segment durations, speech percentage (proportion of speech in the entire recording time), and pause rate (average pause count per minute). Sex, education levels, and picture types were included as fixed effects, and participant IDs as random effects. RESULTS: Older participants showed lower speech percentages (β=-0.079, SE=0.021, p <.001) and lower mean speech durations (β=-0.004, SE=0.002, p = .022), but the decline failed to replicate in total speech duration (β=-0.051, SE=0.079, p = .512). Meanwhile, pause rates increased with age (β=0.065, SE=0.022, p = .003), as with mean pause durations (β=0.002, SE=0.000, p = .001) and total pause durations (β=0.059, SE=0.023, p = .011). 
Total durations did not yield significant results (β=0.007, SE=0.095, p = .943). CONCLUSION: We demonstrated specific ageing trends across various speech duration measures. The different pictures and the demographic variance of participants support the reliability of our results. These outcomes contribute to the understanding of how speech duration patterns change with age, establishing a baseline to estimate the deviation from healthy ageing at different ages.
Evaluating Speech-to-Text Systems with PennSound
ArXiv.org · 2025-04-08
preprint · Open access
A random sample of nearly 10 hours of speech from PennSound, the world's largest online collection of poetry readings and discussions, was used as a benchmark to evaluate several commercial and open-source speech-to-text systems. PennSound's wide variation in recording conditions and speech styles makes it a good representative for many other untranscribed audio collections. Reference transcripts were created by trained annotators, and system transcripts were produced from AWS, Azure, Google, IBM, NeMo, Rev.ai, Whisper, and Whisper.cpp. Based on word error rate, Rev.ai was the top performer, and Whisper was the top open source performer (as long as hallucinations were avoided). AWS had the best diarization error rates among three systems. However, WER and DER differences were slim, and various tradeoffs may motivate choosing different systems for different end users. We also examine the issue of hallucinations in Whisper. Users of Whisper should be cautioned to be aware of runtime options, and whether the speed vs accuracy trade off is acceptable.
Speaker Role Identification in Clinical Conversations
2025-12-01
article · Open access
Patient-clinician communication research is crucial for understanding interaction dynamics and for predicting outcomes that are associated with clinical discourse. Traditionally, interaction analysis is conducted manually because of challenges such as Speaker Role Identification (SRI), which must reliably differentiate between doctors, medical assistants, patients, and other caregivers in the same room. Although automatic speech recognition with diarization can efficiently create a transcript with separate labels for each speaker, these systems are not able to assign roles to each person in the interaction. Previous SRI studies in task-oriented scenarios have directly predicted roles using linguistic features, bypassing diarization. However, to our knowledge nobody has investigated SRI in clinical settings. We explored whether Large Language Models (LLMs) such as BERT could accurately identify speaker roles in clinical transcripts, with and without diarization. We used veridical turn segmentation and diarization identifiers, fine-tuning each model at varying levels of identifier corruption to assess impact on performance. Our results demonstrate that BERT achieves high performance with linguistic signals alone (82% accuracy/82% F1-score), while incorporating accurate diarization identifiers further enhances accuracy (95%/95%). We conclude that fine-tuned LLMs are effective tools for SRI in clinical settings.
Decoding Dementia from Speech: Acoustic‐Lexical Integration for Detecting Alzheimer's Disease in Older Korean Adults
Alzheimer's & Dementia · 2025-12-01
article · Open access
BACKGROUND: Early detection of mild cognitive impairment due to Alzheimer's disease (MCI d/t AD), as well as AD dementia (ADD), is critical for timely intervention. Speech analysis offers a non-invasive way to detect subtle cognitive deficits. This study explores the utility of acoustic and lexical features in classifying older Korean adults across three clinical scenarios: (1) HC vs. AD (MCI d/t AD & ADD) for screening, (2) Non-dementia (HC & MCI d/t AD) vs. ADD for detecting advanced pathology, and (3) HC vs. ADD for assessing the most divergent clinical states. We aim to demonstrate the feasibility of speech-based methods for supporting more timely interventions. METHOD: We recruited 110 older Korean adults (HC=55, MCI d/t AD=29, ADD=26). Groups did not differ in gender (p = .372) or education (p = .278). However, the MCI d/t AD group was older (77.79±5.27) than the HC (72.51±6.38) and ADD (73.35±7.48) groups (p = .002), whereas there was no significant difference between HC and ADD. Cognitive measures (MMSE, CDR; both p <.001) differed significantly. All MCI d/t AD and ADD patients were beta-amyloid positive in PET scans. Speech was collected via recording from neuropsychological tests and additional tasks (Korean phonemic/semantic fluency, vowel phonation, picture description). Acoustic and lexical features were extracted with openSMILE (emobase, 988-dimensional) and a pretrained Korean RoBERTa model (768-dimensional). Principal component analysis was applied to each feature set. Three classification models were built using (1) acoustic-only, (2) lexical-only, and (3) an ensemble of acoustic and lexical features. Each model was implemented through a multilayer perceptron and evaluated with 5-fold cross-validation. RESULT: In our experiments, ensemble models outperformed single-feature-based models (Table 1). For HC vs. AD, the ensemble model achieved 75.8% accuracy and 0.756 AUC; for Non-dementia vs. ADD, 85.1% accuracy and 0.801 AUC; and for HC vs. 
ADD, 87.0% accuracy and 0.893 AUC. Combining acoustic and lexical features provided complementary information, reflecting vocal characteristics and language-based deficits. CONCLUSION: These findings demonstrate that speech-derived features can detect cognitive impairment in older Korean adults across multiple diagnostic scenarios, enabling earlier and more targeted interventions. Moreover, this non-invasive approach may ease clinical workflows and broaden screening accessibility, particularly in resource-limited settings.
Automated speech and language markers of longitudinal changes in psychosis symptoms
NPP—Digital Psychiatry and Neuroscience · 2025-06-17 · 10 citations
article · Open access
We sought to evaluate the ability of automated speech and language features to longitudinally track fluctuations in the major psychosis domains: Thought Disorder, Negative Symptoms, and Positive Symptoms. Sixty-six participants with psychotic disorders were assessed soon after inpatient admission, at discharge, and at 3 and 6 months. Psychosis symptoms were measured with semi-structured interviews and standardized scales. Recordings were collected from paragraph reading, fluency, picture description, and open-ended tasks. Relationships between psychosis symptoms and 357 automated speech and language features were analyzed using a single component score and as individual features, using linear mixed models. We found that all three domains demonstrated significant longitudinal relationships with the single component score. Thought Disorder was particularly related to features describing more subordinated constructions, less efficient identification of picture elements, and decreased semantic distance between sentences. Negative Symptoms was related to features describing decreased speech complexity. Positive Symptoms domain score did not show relationships with individual features that survived p-value correction, but Suspiciousness was related to decreased use of nouns and Hallucinations was related to greater semantic distances. These relationships were largely robust to interactions with gender and race. Interactions with timepoint revealed variable relationships during different phases of illness (acute vs. stable). In summary, automated speech and language features show promise as scalable, objective markers of psychosis severity. Detailed attention to clinical setting and patient population is needed to optimize clinical translation.

Recent grants

CI-NEW: NIEUW: Novel Incentives and Workflows in Linguistic Data Collection and Annotation
NSF · $1.2M · 2017–2023
EAGER: Mining a Year of Speech
NSF · $100k · 2010–2012

Frequent coauthors

Sunghye Cho
Pennsylvania Academic Library Consortium
144 shared
Naomi Nevler
University of Pennsylvania
115 shared
Murray Grossman
100 shared
Sharon Ash
University of Pennsylvania
89 shared
Christopher Cieri
83 shared
David J. Irwin
University of Pennsylvania
71 shared
Sanjana Shellikeri
University of Pennsylvania
65 shared
Neville Ryant
55 shared

Labs

Department of Linguistics, University of Pennsylvania · PI

Education

Ph.D., Phonetics, prosody, natural language processing, speech communication
MIT
1975
