Jessica Hullman

Ginni Rometty Professor of Computer Science

Northwestern University · Computer Science

Active through 2026

h-index 29

Citations 3.8k

Papers 143 · 91 last 5y

Funding $1.0M


About

Jessica Hullman is a faculty member in the Department of Computer Science at Northwestern University, located in the McCormick School of Engineering. Her research focuses on challenges and limitations that arise when people theorize and draw inductive inferences from data. She explores how to best align data-driven interfaces and summaries with human reasoning capabilities, examining the role of interactive analysis across different stages of a statistical workflow. Her work involves evaluating data interfaces, developing tools to support reasoning under uncertainty, and understanding how these tools can be applied in domains such as strategic games and privacy. Hullman approaches these problems by drawing on formal models of rational inference to compare and propose solutions.

Research topics

Computer Science
Artificial Intelligence
Political Science
Econometrics
Mathematics
Data science
Statistics
Demography
Economics
Psychology
Engineering
Knowledge management
Human–computer interaction
Microeconomics
Management science

Selected publications

What’s a multiverse good for anyway?
2026-02-04
article · Open access
Multiverse analysis has become a fairly popular approach, as indicated by the present special issue on the matter. Here, we take one step back and ask why one would conduct a multiverse analysis in the first place. We discuss various ways in which a multiverse may be employed – as a tool for reflection and critique, as a persuasive tool, as a serious inferential tool – as well as potential problems that arise depending on the specific purpose. For example, it fails as a persuasive tool when researchers disagree about which variations should be included in the analysis, and it fails as a serious inferential tool when the included analyses do not target a coherent estimand. Then, we take yet another step back and ask what the multiverse discourse has been good for and whether any broader lessons can be drawn. Ultimately, we conclude that the multiverse does remain a valuable tool; however, we urge against taking it too seriously.
ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making
ArXiv.org · 2026-01-01
article · Open access · Senior author
Multi-agent decision pipelines can outperform single agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as reward to output signals that complement existing agent decisions. We validate ComplLLM on synthetic and real-world tasks involving domain experts, demonstrating how the approach recovers known complementary information and produces plausible explanations of complementary signals to support downstream decision-makers.
ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making
arXiv (Cornell University) · 2026-02-23
preprint · Open access · Senior author
Hypothesizing an effect size by considering individual variation
arXiv (Cornell University) · 2026-04-09
article · Open access · Senior author
When designing and evaluating an experiment or observational study, it is useful to have a realistic hypothesis regarding the average treatment effect. We present an approach to conceptualizing this average by first considering a distribution of effects. We demonstrate with examples in medicine, economics, and psychology.
This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
arXiv (Cornell University) · 2026-02-17
article · Open access · 1st author · Corresponding
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
arXiv (Cornell University) · 2026-02-17
preprint · Open access · 1st author · Corresponding
Hypothesizing an effect size by considering individual variation
arXiv (Cornell University) · 2026-04-09
preprint · Open access · Senior author
What’s a multiverse good for anyway?
PsyArXiv (OSF Preprints) · 2026-02-04
preprint · Open access · Senior author
Can AI Do Strategy? A Dialogue and Debate
Strategy Science · 2026-01-01 · 1 citation
article
Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images
ArXiv.org · 2025-02-17 · 1 citation
preprint · Open access
Diffusion model-generated images can appear indistinguishable from authentic photographs, but these images often contain artifacts and implausibilities that reveal their AI-generated provenance. Given the challenge to public trust in media posed by photorealistic AI-generated images, we conducted a large-scale experiment measuring human detection accuracy on 450 diffusion-model generated images and 149 real images. Based on collecting 749,828 observations and 34,675 comments from 50,444 participants, we find that scene complexity of an image, artifact types within an image, display time of an image, and human curation of AI-generated images all play significant roles in how accurately people distinguish real from AI-generated images. Additionally, we propose a taxonomy characterizing artifacts often appearing in images generated by diffusion models. Our empirical observations and taxonomy offer nuanced insights into the capabilities and limitations of diffusion models to generate photorealistic images in 2024.

Recent grants

CAREER: Enhancing Critical Reflection on Data by Integrating Users' Expectations in Visualization Interaction
NSF · $104k · 2018–2019
CHS: Small: Collaborative Research: Representing and Learning Visualization Design Knowledge
NSF · $250k · 2019–2023
CRII: CHS: Facilitating Consumption and Re-expression of Scientific Information in a Journalism Context
NSF · $175k · 2016–2018
CAREER: Enhancing Critical Reflection on Data by Integrating Users' Expectations in Visualization Interaction
NSF · $497k · 2018–2024

Frequent coauthors

Matthew Kay
Northwestern University
28 shared
Alex Kale
20 shared
Priyanka Nanayakkara
14 shared
Yea‐Seul Kim
13 shared
Eytan Adar
University of Michigan–Ann Arbor
13 shared
Jason D. Hartline
Northwestern University
12 shared
Andrew Gelman
Columbia University
11 shared
Nicholas Diakopoulos
10 shared
