
William W Cohen
Professor · Carnegie Mellon University · Machine Learning Department
Active 1926–2026
About
William W. Cohen is a Professor in the Machine Learning Department at Carnegie Mellon University, with a joint appointment in the Language Technologies Institute. He also holds a 20%-time appointment as a Principal Scientist at Google, where he worked full-time between May 2018 and March 2024. Cohen received his bachelor's degree in Computer Science from Duke University in 1984 and his PhD in Computer Science from Rutgers University in 1990. He worked at AT&T Bell Labs and AT&T Labs-Research from 1990 to 2000, and at Whizbang Labs from 2000 to 2002, with a focus on extracting information from the web. From 2002 to 2018, he was a member of Carnegie Mellon University’s Machine Learning Department. Cohen is a past president of the International Machine Learning Society and has served as an action editor for several journals and book series on AI and machine learning. He has helped organize major conferences, including serving as General Chair of the 2008 International Conference on Machine Learning and co-chairing other events. An AAAI Fellow, Cohen has received several awards for influential papers, including the SIGMOD 'Test of Time' Award, the SIGIR 'Test of Time' Award, and the Semantic Web Science Association's Ten-Year Award. His research interests include question answering, machine learning for NLP tasks, neuro-symbolic reasoning, and statistical relational learning. Cohen holds seven patents related to learning, discovery, information retrieval, and data integration, and has authored more than 300 publications.
Research topics
- Artificial Intelligence
- Information Retrieval
- Computer Science
- Machine Learning
- Natural Language Processing
- Data Mining
- World Wide Web
Selected publications
Multiple-Prediction-Powered Inference
arXiv (Cornell University) · 2026-03-28
preprint · Open access
Statistical estimation often involves tradeoffs between expensive, high-quality measurements and a variety of lower-quality proxies. We introduce Multiple-Prediction-Powered Inference (MultiPPI): a general framework for constructing statistically efficient estimates by optimally allocating resources across these diverse data sources. This work provides theoretical guarantees about the minimax optimality, finite-sample performance, and asymptotic normality of the MultiPPI estimator. Through experiments across three diverse large language model (LLM) evaluation scenarios, we show that MultiPPI consistently achieves lower estimation error than existing baselines. This advantage stems from its budget-adaptive allocation strategy, which strategically combines subsets of models by learning their complex cost and correlation structures.
Semi-structured LLM Reasoners Can Be Rigorously Audited
ArXiv.org · 2025-05-30
preprint · Open access · Senior author
Although Large Language Models (LLMs) have become capable reasoners, the problem of faithfulness persists: their reasoning can contain errors and omissions that are difficult to detect and that may obscure biases in model outputs. To address this issue, we introduce Semi-Structured Reasoning Models (SSRMs), which are trained to produce semi-structured representations of reasoning. SSRMs generate reasoning traces in a non-executable Pythonic syntax that names each reasoning step and marks its inputs and outputs. This structure allows SSRM traces to be automatically audited to identify reasoning flaws. We evaluate three types of audits: hand-crafted structured reasoning audits, written in a domain-specific language (DSL) implemented in Python; LLM-generated structured reasoning audits; and learned typicality audits, which apply probabilistic models over reasoning traces. We show that all of these methods can be used to effectively flag probable reasoning errors. Importantly, the auditability of SSRMs does not appear to compromise overall accuracy: in evaluation on twelve benchmarks and two model families, SSRMs demonstrate strong performance and generalizability relative to other models of comparable size.
Abstract TH146: Gaps in Hypertension Management within Populations Experiencing Food Insecurity
Hypertension · 2025-09-01
article
Background: Hypertension (HTN) is the most prevalent risk factor associated with cardiovascular mortality. Food insecurity is a social determinant of health that plays a key role in determining one’s risk for HTN and CVD. Individuals experiencing food insecurity are more likely to face difficulty accessing affordable, healthy food options and healthcare. This study aims to assess the prevalence of uncontrolled hypertension within populations experiencing food insecurity in Chicago, IL. Methods: The Cardiometabolic Health Initiative (CHI) is a student-led organization that seeks to increase access to cardiometabolic screening within food insecure communities. CHI offers point-of-care cardiovascular screenings and health coaching at food pantries in Chicago, IL. Data collected include self-reported medical history and vitals. Patients' blood pressures (BP) were measured and categorized as normal (systolic blood pressure (SBP) <120 and diastolic blood pressure (DBP) <80), elevated (SBP 120-129 and DBP <80), Stage 1 (SBP 130-139 or DBP 80-89), Stage 2 (SBP >140 or DBP >90), or Hypertensive Crisis (SBP >180 or DBP >120). Results: BPs were recorded for 408 patients, of which 89 (21.81%) were categorized as normal, 31 (7.60%) as elevated, 105 (25.74%) as stage 1 HTN, 173 (42.40%) as stage 2 HTN, and 10 (2.45%) as in hypertensive crisis. Among the 408 patients, 182 (44.61%) patients self-reported not taking medication for HTN while 226 (55.39%) self-reported taking medication for HTN. Within the group of patients who take HTN medication, 19 (61.29%) had an elevated BP, 60 (57.14%) were in the stage 1 HTN range, 71 (41.04%) were in the stage 2 HTN range, and 4 (40.00%) were in hypertensive crisis. Furthermore, of the patients who do not take HTN medication, 12 (38.71%) had elevated BPs, 45 (42.86%) had BPs in the stage 1 HTN category, 102 (58.96%) in the stage 2 HTN category, and 6 (60.00%) were in hypertensive crisis.
Conclusions: These findings suggest a high prevalence of uncontrolled HTN among patients screened at food pantries in Chicago, IL. This underscores key gaps in HTN management among patients experiencing food insecurity. Inequities that impact access to healthcare and levels of health literacy can contribute to difficulties controlling blood pressure among patients experiencing food insecurity, highlighting the need for additional community-based programs, like CHI, to expand access to preventative care and health education within at-risk communities.
Diabetes · 2025-06-13
article
Introduction and Objective: Studies show diabetes is a strong predictor of cardiovascular disease (CVD). Diabetes-related complications include coronary heart disease, cerebrovascular disease, heart failure and peripheral vascular disease. Food insecurity poses key barriers to glycemic management, potentially resulting in worse CVD outcomes. The goal of this study is to examine the association between elevated A1c and ASCVD risk scores in a food insecure population in West Chicago. Methods: The Cardiometabolic Health Initiative (CHI) is a mobile screening clinic that performs comprehensive cardiometabolic health screenings at food pantries in West Chicago. Between August 2023 and December 2024, patients received point of care A1c measurements, lipid panels, and blood pressure readings. Results were used to calculate 10-year Atherosclerotic Cardiovascular Disease (ASCVD) risk scores and provide individualized health coaching, focused on confronting social determinants of health (SDoH). Results: Out of 153 patients, 82 (54%) had a normal A1c (<5.7%), 51 (33%) had a prediabetic A1c (5.7%-6.4%) and 20 (13%) had a diabetic A1c (>6.4%) (Figure 1). The average ASCVD risk score for the total population was 9.1% (SD=11.3; Figure 2). Among patients with a normal A1c, the average ASCVD risk score was 6.6% (SD=7.9), among prediabetic patients, the average ASCVD risk score was 9.8% (SD=10.8) and among diabetic patients, the average ASCVD risk score was 17.6% (SD=18.4, p=0.000; Figure 2). Conclusion: These findings suggest a high prevalence of prediabetes and diabetes within this food insecure population with 46% of screened patients having an elevated A1c. Consistent with previous literature, elevated A1c values may directly correlate with increased ASCVD risk. Community-based preventative screenings and health education programs, like CHI, can help identify these high-risk individuals, address SDoH, and combat disease burden in disadvantaged populations.
Disclosure C. Richter: None. E. Belnap: None. A. McIntosh: None. I. Khosla: None. W. Cohen: None. E. Sullivan: None. R. Garcia: None. A. DeMeo: None. D. Luger: None.
Circulation · 2025-11-03
article
Background: Coronary artery calcium (CAC) scoring is a key tool for risk stratification in preventive cardiology. While clinician-referred patients are typically higher-risk, some institutions adopted low-cost promotional CAC screening to enable risk assessment in individuals not reached by traditional referrals. Research Question: Among a diverse, urban cohort, do comorbidities, baseline preventive therapy, and CAC scores differ between patients undergoing clinician-referred vs. promotional CAC screening? Methods: A retrospective cohort study was conducted at a large urban academic center in Chicago. Adults undergoing CAC screening from Jan 2022 to Dec 2023 via clinician referral or low-cost promotion were included. The primary exposure was referral pathway. The primary outcome was CAC burden, categorized by Agatston score (0 = none, 1–99 = mild, 100–299 = moderate, ≥300 = severe) and stratified by coronary artery territory (Left Main, LAD, LCX, RCA). Baseline comorbidities and use of preventive medications were recorded. Descriptive statistics compared baseline characteristics. CAC coronary distributions were assessed using chi-square tests. Results: 1,743 patients were screened, 932 via clinical referral and 811 through promotion. Compared to referral patients, promotion patients had lower rates of HTN (26.1% vs. 38.4%, p<0.001), dyslipidemia (15.9% vs. 24.7%, p<0.001), and CAD (6.3% vs. 11.4%, p<0.001). Preventative medication use was lower in the promotion group: any statin (41.3% vs. 51.1%, p<0.001), high-intensity statin (14.9% vs. 24.4%, p<0.001), moderate-intensity statin (33.7% vs. 39.8%, p=0.008), aspirin (29.7% vs. 38.0%, p<0.001), and ezetimibe (2.0% vs. 4.6%, p=0.002). Referral patients had higher rates of overall Left Main (17.5% vs. 10.8%, p<0.001) and LCX (25.6% vs. 23.5%, p=0.044) CAC burden. LAD (42.2% vs. 46.9%, p=0.113) and RCA (27% vs. 28.6%, p=0.433) CAC burdens did not differ significantly between promotion and referral groups. Moderate/severe total CAC prevalence was similar between the promotion and referral cohorts (22.4% vs. 25.7%, p=0.098). Discussion: Promotion patients showed a high prevalence of non-zero CAC and similar moderate/severe CAC burden as referred patients, despite fewer comorbidities and lower medication use. These findings support low-cost promotional CAC screening as a practical method for detecting subclinical atherosclerosis and may enhance early risk detection in asymptomatic patients not typically reached by clinician referral.
Circulation · 2025-11-03
article
Background: Coronary artery calcium (CAC) scoring is a non-invasive tool for detecting subclinical atherosclerosis. In 2022, RUSH University Medical Center launched a low-cost CAC screening initiative to broaden access and enhance early cardiovascular risk detection. This study evaluates referral patterns in a diverse urban population, addressing gaps in prior research. Research Question: Among patients referred for CAC screening at a large urban academic center, do referral patterns differ by patient and provider sex and race/ethnicity? Methods: We conducted a retrospective analysis of Chicago-based patients referred for CAC testing during the promotion (January 2022–December 2023). Demographic data, including sex and race/ethnicity, were collected for patients and referring providers. Descriptive statistics summarized distributions, and associations between patient and provider demographics were assessed using contingency tables. The relationship between patient and provider sex was assessed via chi-square (χ²) (p < 0.05). Results: A total of 931 patients underwent CAC testing during the study period. Of these, 59.4% were female and 40.6% male. Race/ethnicity data were available for 380 patients; 41.7% identified as non-White: Black, Hispanic/Latino, Asian, and other groups. Female providers accounted for 53.3% of referrals, and male providers for 46.7%. Patient-provider sex concordance was strong: 69.1% of female patients were referred by female providers; 69.8% of male patients were referred by male providers. Female providers referred predominantly female patients (77.0%); male providers referred predominantly male patients (60.7%). A significant association was observed between patient and provider sex (χ²(1, N = 931) = 136.62, p<0.0001). Patient race also varied by provider sex. Female providers referred a higher proportion of Black patients (27.7% vs. 13.5%); male providers referred more White patients (64.8% vs. 52.8%). This distribution differed significantly by provider sex (χ²(4, N = 898) = 29.59, p<0.0001). Conclusion: Significant associations were found between patient and provider sex and race/ethnicity in referrals for CAC screening during a large promotional initiative. Female providers were more likely to refer female and racially diverse patients, while male providers more often referred male and White patients. Understanding these referral patterns may inform provider education and system-level strategies to promote equitable cardiovascular risk assessment.
ArXiv.org · 2025-03-30
preprint · Open access
Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly assess either text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles -- a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in two formats (text and image), supports adjustable difficulty through prefill ratio control, and offers different evaluation strategies, ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings highlight limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
Prevalence and underdiagnosis of diabetes mellitus in a food insecure population
Scientific Reports · 2025-04-10
article · Open access
Food insecurity is a public health issue and a major risk factor for overall worse health outcomes including hypertension, diabetes, coronary heart disease, congestive heart failure, stroke, chronic kidney disease and obesity. Food-insecure patients are more likely to have both diagnosed and undiagnosed prediabetes and diabetes. This study examines the prevalence and self-awareness of diabetes and prediabetes in an at-risk, food-insecure population. The Cardiometabolic Health Initiative (CHI) is a community service organization that provides comprehensive cardiometabolic screenings at food pantries in West Chicago. Between August 2023 and December 2024, 191 patients were screened using point-of-care A1c tests. The average A1c of the population was 6.04%. Ninety-six patients had a normal A1c (< 5.7%), 66 had a prediabetic A1c (5.7-6.4) and 29 had a diabetic A1c (> 6.4). Forty-two patients self-reported a history of DM. The average A1c for the self-reported DM group was 7.58% and the average A1c for the non-reported group was 5.60%. Among the self-reported DM group, 24 patients had controlled DM (A1c < 7%) and 18 had uncontrolled DM (A1c > 7%). Among the non-reported group, 56 had a prediabetic A1c and 3 had a diabetic A1c. The presented findings suggest a high prevalence of diabetes and prediabetes within a food-insecure population in West Chicago. Further, this study suggests that many diabetic patients struggle to control their A1c levels. Our findings reflect many barriers presented to food insecure patients that can hinder diabetes diagnosis, education, and management.
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation
arXiv (Cornell University) · 2024-06-06 · 2 citations
preprint · Open access · Senior author
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. PPI achieves this by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate -- but potentially biased -- automatic system, in a way that results in tighter confidence intervals for certain parameters of interest (e.g., the mean performance of a language model). In this paper, we propose a method called Stratified Prediction-Powered Inference (StratPPI), in which we show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. Without making any assumptions on the underlying automatic labeling system or data distribution, we derive an algorithm for computing provably valid confidence intervals for population parameters (such as averages) that is based on stratified sampling. In particular, we show both theoretically and empirically that, with appropriate choices of stratification and sample allocation, our approach can provide substantially tighter confidence intervals than unstratified approaches. Specifically, StratPPI is expected to improve in cases where the performance of the autorater varies across different conditional distributions of the target data.
Recent grants
NIH · $458k · 2007
SHF: Large: Collaborative Research: Exploiting the Naturalness of Software
NSF · $667k · 2014–2018
NSF · $499k · 2005–2009
ADAPTIVE PERSONALIZED INFORMATION MANAGEMENT FOR BIOLOGISTS
NIH · $1.1M · 2008–2013
BIGDATA: Small: Big Data for Everyone
NSF · $548k · 2013–2017
Frequent coauthors
- Bhuwan Dhingra · 56 shared
- Ruslan Salakhutdinov · 44 shared
- Zhilin Yang · 34 shared
- Kenneth R. Koedinger · Carnegie Mellon University · 34 shared
- Haitian Sun · Nanjing University of Chinese Medicine · 32 shared
- Kathryn Mazaitis · 23 shared
- Einat Minkov · 22 shared
- Noboru Matsuda · 21 shared
Education
- 1984 · B.S. · Duke University
- 1990 · Ph.D. · Rutgers University
Awards & honors
- AAAI Fellow