
Jianjing Kuang
Associate Professor · Phonetics, Laboratory phonology, Speech production and perception · University of Pennsylvania · Linguistics
Active 1988–2026
About
Jianjing Kuang is an Associate Professor of Linguistics in the Linguistics Department at the University of Pennsylvania. She specializes in Phonetics and directs the Penn Phonetics Laboratory. In addition to her primary appointment, she is a faculty member of both MindCORE, the Mind Center for Outreach, Research, and Education, and The Center for East Asian Studies. Her work focuses on the scientific study of speech sounds, contributing to the understanding of phonetic phenomena through research and laboratory work. She is based at 3401-C Walnut Street, Philadelphia, PA, with her office located in Room 318-C on the third floor.
Research topics
- Artificial Intelligence
- Computer Science
- Mathematics
- Acoustics
- Pure mathematics
- Linguistics
- Physics
- Speech recognition
Selected publications
Evaluating prosodic encodings of information structure in generative speech AI
2026-05-14
Article · Open access · 1st author · Corresponding
Recent advances in text-to-speech (TTS) technology have substantially improved the intelligibility and fluency of synthetic speech, yet current systems still struggle with producing human-like prosody. This study evaluates whether state-of-the-art TTS models can encode information structure with proper prosodic focus, and how their performance compares to human speakers. Using a carefully controlled experimental design, we constructed target sentences that were lexically identical and differed only in their information structure. The task for the speech models was to produce prosodically distinct focus conditions in a manner that is both correct and natural. Model performance was assessed using human perceptual judgments as well as acoustic comparisons with human productions. Results show that only a small subset of current TTS systems can vary prosodic focus with any degree of accuracy or naturalness. Even the strongest-performing models (e.g., Gemini) exhibit uneven performance across focus types, with certain conditions proving particularly challenging. These findings highlight a persistent gap between human and synthetic prosody and suggest that achieving robust, context-sensitive prosodic focus remains a key frontier for next-generation speech models.
The Journal of the Acoustical Society of America · 2025-12-01
Article · Senior author
This study explores how the spectral information of vowel production targets changes by pitch in classical singing. Although it is established that singers engage in pitch-dependent vocal tract adjustments, less is known about how spectral envelope contrasts between vowel targets change by pitch and how consistent the patterns are across singers. Seven professional classical singers sang seven vowels across their pitch range. The energy of 16 spectral bands at every 500 Hz interval up to 8000 Hz was measured. Principal component analyses were performed to describe spectral variation. Results show consistent changes in spectral energy distribution as pitch increases despite individual differences in pitch range. A separate set of analyses further uses the center of gravity, standard deviation, skewness, and kurtosis of the spectra as a proxy for spectral shape variation, showing that the summarized spectral envelopes of vowel production targets systematically converge at higher pitches. Overall results suggest that pitch-dependent vocal tract adjustments may be shaped not only by singers' acoustic targets, but also by physiological constraints relative to each singer's overall pitch range. More broadly, this study demonstrates the possibility of using dimensionality reduction methods to characterize spectral patterns in high-pitched singing when formant tracking fails.
ArXiv.org · 2025-11-03
Preprint · Open access · Senior author
Prosody is essential for speech technology, shaping comprehension, naturalness, and expressiveness. However, current text-to-speech (TTS) systems still struggle to accurately capture human-like prosodic variation, in part because existing evaluation methods for prosody remain limited. Traditional metrics like Mean Opinion Score (MOS) are resource-intensive, inconsistent, and offer little insight into why a system sounds unnatural. This study introduces a linguistically informed, semi-automatic framework for evaluating TTS prosody through a two-tier architecture that mirrors human prosodic organization. The method uses quantitative linguistic criteria to evaluate synthesized speech against human speech corpora across multiple acoustic dimensions. By integrating discrete and continuous prosodic measures, it provides objective and interpretable metrics of both event placement and cue realization, while accounting for the natural variability observed across speakers and prosodic cues. Results show strong correlations with perceptual MOS ratings while revealing model-specific weaknesses that traditional perceptual tests alone cannot capture. This approach provides a principled path toward diagnosing, benchmarking, and ultimately improving the prosodic naturalness of next-generation TTS systems.
Vowel perception at high F0 in real and synthetic speech
The Journal of the Acoustical Society of America · 2025-10-01
Article · Senior author
It is known that vowels are less intelligible at higher pitches, but it remains unclear whether this is solely driven by talkers’ articulation or by changes in the acoustic structure. This study disentangles these factors to understand the production–perception mapping mechanism for higher pitched vowels. 102 participants completed a word identification task. They listened to recordings of bVt spanning 51 semitones (C2-D6), and identified the word given five choices (beat, boot, bet, bot, bat). Half the participants listened to sung vowels by four opera singers (Soprano, Mezzo-Soprano, Tenor, Baritone), and the other half listened to Klattgrid synthesized vowels with two sets of fixed formant values (man, woman) across the same F0 ranges as the singers. Results show identification rates declining as F0 approaches F1, with rates for sung vowels beginning to decline at lower pitches than for synthesized vowels. The effect of F0 also varies by vowel; mid-front vowels AE and EH are confusable across both conditions, and IY remains near-ceiling in synthesized but not sung conditions across the F0 range. Overall results suggest that both talkers’ pitch-dependent vocal tract adjustments and changes in the spectrum itself contribute to the decline of vowel intelligibility at high pitches.
Pitch-dependent vowel space adjustments by professional singers
The Journal of the Acoustical Society of America · 2025-04-01
Article · Senior author
Singers are trained to adjust their resonance space depending on pitch target. While existing studies on classical singers have focused on modeling the resonance profile, the articulatory correspondences have been less explored. The goal of this study is to model change in tongue position, a key factor that shapes the resonance space, across singers’ pitch range using articulatory data. Fifteen professional singers sang five sets of English vowels at each semitone across their pitch range. Each set included the target vowels ([i], [ɛ], [æ], [ɑ], [u]) in randomized order, and a filler vowel ([ɔ]) closing the breath group. Midsagittal ultrasound tongue images were collected at 81.5 frames per second. Tongue position was automatically tracked using DeepLabCut. Preliminary results of eight participants modelled using Functional Data Analysis show gradual neutralization of vowel contrast as pitch increases relative to the singers’ range. Specifically, high vowels tended to lower and low vowels tended to raise at higher pitches. These results suggest that articulatory adjustments operate in vowel-specific directions. Analysis by singer gender and voice type is underway. We discuss these findings in relation to vowel–pitch interactions in voice production.
The interplay between vowels, pitch targets, and voice quality in singing
The Journal of the Acoustical Society of America · 2025-04-01
Article · Senior author
One strategy in singing is to change the resonance space depending on the target pitch to bridge register shifts. However, the systematic relationship between resonance space adjustments and timbre remains understudied. We investigate whether pitch-dependent vowel modification contributes to voice quality changes. Twenty-three lay speakers participated in an articulatory experiment with ultrasound tongue imaging and electroglottography (EGG). Participants were asked to sing five sets of English vowels across their pitch range in ascending semitone steps. Midsagittal tongue images were splined with DeepLabCut, and closed quotients (CQ) were extracted from the EGG signals. Functional principal component analysis was applied to the tongue splines to evaluate their relationship with CQ. Preliminary results show a nonlinear relationship between voice quality and pitch height, with more modal voice quality in a large portion of the lower end of the pitch range, gradually moving toward a breathier voice quality before becoming tenser again at higher pitches. Tongue position adjustments also contributed to the gradual change in voice quality as the pitch increases, an effect most apparent for participants with a vocal range larger than one octave. Findings highlight the complex interplay between tongue position, voice quality, and pitch targets, showcasing the coordination between source and filter structures.
Some prosodic consequences of varied discourse functions in a Cantonese sentence-final particle
2024-06-30
Article · Senior author
Ultrasound tongue imaging of vowel spaces across pitches in singing
The Journal of the Acoustical Society of America · 2024-03-01
Article · Senior author
One important technique in singing is vowel modification: the adjustment of the resonance space based on the sung pitch for more efficient voice production. We explore whether vowel modification is a learned technique for enhanced acoustics, or if it is a necessary articulatory adjustment for high pitch production. 16 participants without vocal training participated in a singing experiment with ultrasound tongue imaging. Participants were asked to sing sets of English vowels across their comfortable pitch range, rising by semitone at a steady tempo, resembling a vocal warm-up exercise. Participants sang 5 sets of vowels in total, each set consisting of 5 target vowels ([i], [ɛ], [æ], [ɑ], [u]) in randomized order with 1 filler ([ɔ]) closing each breath group. Images of tongue position were splined using DeepLabCut. Preliminary results show that untrained singers tend not to adjust their tongue positions by pitch, though cases of tongue lowering occasionally occurred, particularly for the participants who sing a wide pitch range. In contrast, additional pilot data from 2 trained operatic singers showed gradual tongue adjustments across their pitch range, neutralizing vowel contrasts at their highest pitches. We discuss findings with respect to vowel-pitch interaction, drawing implications for theories of voice production.
Exploring the accuracy of prosodic encodings in state-of-the-art text-to-speech models
2024-06-30 · 3 citations
Article · Senior author
Otolaryngology · 2024-12-26 · 1 citation
Article · Open access
OBJECTIVE: Clinicians face challenges in managing the growing population of patients with limited English proficiency (LEP) and hearing loss (HL) in the United States. This study seeks to investigate provider perspectives on evaluating, counseling, and treating HL in LEP patients. STUDY DESIGN: Prospective descriptive study. SETTING: Tertiary care center. METHODS: Researchers employed a mixed methods design: (1) structured clinician interviews, (2) cross-sectional, national electronic survey, both regarding perspectives on managing hearing loss in LEP patients. Structured interviews were analyzed using modified grounded theory. RESULTS: Twenty-nine providers participated in interviews (16 otologists, 13 audiologists). The most reported non-English language was Spanish, followed by Chinese languages. Four thematic domains were derived: barriers to care, cochlear implant (CI) candidacy evaluation, counseling, and ideal resources. Major barriers were patient desire (97%; n = 28), and lack of validated tests (72%; n = 21). Methods of CI evaluation included improvising on validated speech perception testing (59%; n = 17) and use of non-speech evaluation (52%; n = 15). One-quarter forgoes speech testing in non-Spanish-speaking patients (24%; n = 7). Suggestions to improve management include in-person interpreters (62%; n = 18) and testing battery in all languages (31%; n = 9). National survey results (n = 87 providers) demonstrated that respondents were significantly less confident in the methods of speech perception testing and in counseling on surgical hearing rehabilitation in LEP. CONCLUSION: Clinicians encounter challenges in managing LEP patients with HL, including limitations in audiometric and CI candidacy assessment, communication barriers, information accessibility, and cultural competency. Opportunities for improving care include developing language-specific test batteries, linguistically and culturally appropriate education materials, and cultural competency training.
Frequent coauthors
- 58 shared
Patricia Keating
UCLA
- 54 shared
Marc Garellek
University of California, San Diego
- 53 shared
Christina M. Esposito
Macalester College
- 53 shared
Sameer ud Dowla Khan
Reed College
- 10 shared
May Pik Yu Chan
University of Pennsylvania
- 9 shared
Mark Liberman
- 8 shared
Nari Rhee
University of California, Berkeley
- 6 shared
Jia Tian
Education
- 2013
Ph.D., Phonetics, Laboratory phonology, Speech production and perception
UCLA