Zhuoran Yang

Assistant Professor of Statistics & Data Science · Verified

Yale University · Department of Statistics and Data Science

Active 2014–2026

h-index: 29

Citations: 4.1k

Papers: 256 (180 in the last 5 years)

Funding: —



About

Zhuoran Yang is an Assistant Professor of Statistics and Data Science and Computer Science at Yale University. He is affiliated with the Yale Institute for Foundations of Data Science and the Center for Algorithms, Data, and Market Design at Yale (CADMY). His research interests lie at the intersection of machine learning, statistics, game theory, and optimization. Recently, his work has focused on the foundations of reinforcement learning, particularly in multi-agent systems where agents interact strategically, as well as the foundations of artificial intelligence, with an emphasis on understanding the emergent behaviors of large language models during pre-training and post-training and their relationship with model architecture. His research is supported by NSF DMS 2413243. Before joining Yale, Zhuoran Yang was a postdoctoral researcher at the University of California, Berkeley, working under the supervision of Michael I. Jordan. He earned his Ph.D. from the Department of Operations Research and Financial Engineering at Princeton University, where he was co-advised by Jianqing Fan and Han Liu. He completed his bachelor's degree in Mathematics at Tsinghua University in 2015.

Research topics

Artificial Intelligence
Computer Science
Machine Learning
Mathematics
Mathematical optimization
Engineering
Combinatorics
Discrete mathematics
Management science

Selected publications

Training Language Models for Bilateral Trade with Private Information
ArXiv.org · 2026-04-10
article · Open access · Senior author
Bilateral bargaining under incomplete information provides a controlled testbed for evaluating large language model (LLM) agent capabilities. Bilateral trade demands individual rationality, strategic surplus maximization, and cooperation to realize gains from trade. We develop a structured bargaining environment where LLMs negotiate via tool calls within an event-driven simulator, separating binding offers from natural-language messages to enable automated evaluation. The environment serves two purposes: as a benchmark for frontier models and as a training environment for open-weight models via reinforcement learning. In benchmark experiments, a round-robin tournament among five frontier models (15,000 negotiations) reveals that effective strategies implement price discrimination through sequential offers. Aggressive anchoring, calibrated concession, and temporal patience correlate with the highest surplus share and deal rate. Accommodating strategies that concede quickly disable price discrimination in the buyer role, yielding the lowest surplus capture and deal completion. Stronger models scale their behavior proportionally to item value, maintaining performance across price tiers; weaker models perform well only when wide zones of possible agreement offset suboptimal strategies. In training experiments, we fine-tune Qwen3 (8B, 14B) via supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) against a fixed frontier opponent. These stages optimize competing objectives: SFT approximately doubles surplus share but reduces deal rates, while RL recovers deal rates but erodes surplus gains, reflecting the reward structure. SFT also compresses surplus variation across price tiers, which generalizes to unseen opponents, suggesting that behavioral cloning instills proportional strategies rather than memorized price points.
A G-Code-Driven Modeling and Thermo-Mechanical Coupling Analysis Method for the FDM Process of Complex Lightweight Structures
Materials · 2026-03-18
article · Open access
Accurate prediction of thermo-mechanical behavior in Fused Deposition Modeling (FDM) is often limited by mismatches between idealized Computer-Aided Design (CAD) geometry and path-dependent material deposition. This paper presents a G-code-driven, filament-level modeling and process-simulation workflow for complex geometries and infill strategies, especially toolpaths with in-plane inclinations. Extrusion segments are parsed from slicing G-code to obtain endpoints and process parameters, and each filament is reconstructed as a path-aligned rectangular bead using a dedicated local coordinate system. Progressive deposition is simulated in ANSYS Parametric Design Language (APDL) via an element birth-death method, enhanced by a centroid-based element selection strategy that reduces dependence on strictly aligned hexahedral partitions and improves robustness for complex meshes. A nonlinear transient thermal analysis is performed, and temperatures are mapped to the structural model through an indirect thermo-mechanical coupling scheme to predict warpage and residual stresses. Case studies on square plates with triangular and hexagonal infills (with/without sidewalls and a bottom base) show that the high-temperature zone follows newly deposited paths with peak temperatures near 220 °C, while displacement and von Mises stress accumulate and are strongly affected by infill topology and boundary conditions.
Incremental birth-death element method for thermo-mechanical coupling simulation of fused deposition modeling process
Journal of Manufacturing Processes · 2026-04-06
article · 1st author · Corresponding
Kolmogorov-Arnold Networks with Gumbel Softmax for Recovering Network Structure and Forecasting Complex Systems
SSRN Electronic Journal · 2025-01-01
preprint · Open access · 1st author · Corresponding
Efficient and assured reinforcement learning-based building HVAC control with heterogeneous expert-guided training
Scientific Reports · 2025-03-05 · 14 citations
article · Open access
Building heating, ventilation, and air conditioning (HVAC) systems account for nearly half of building energy consumption and [Formula: see text] of total energy consumption in the US. Their operation is also crucial for ensuring the physical and mental health of building occupants. Compared with traditional model-based HVAC control methods, recent model-free deep reinforcement learning (DRL) based methods have shown good performance without requiring the development of detailed and costly physical models. However, these model-free DRL approaches often suffer from long training times to reach good performance, which is a major obstacle to their practical deployment. In this work, we present a systematic approach to accelerate online reinforcement learning for HVAC control by taking full advantage of the knowledge from domain experts in various forms. Specifically, the algorithm stages include learning expert functions from existing abstract physical models and from historical data via offline reinforcement learning, integrating the expert functions with rule-based guidelines, conducting training guided by the integrated expert function, and performing policy initialization from the distilled expert function. Moreover, to ensure that the learned DRL-based HVAC controller can effectively keep room temperature within the comfortable range for occupants, we design a runtime shielding framework to reduce the temperature violation rate and incorporate the learned controller into it. Experimental results demonstrate up to 8.8X speedup in DRL training from our approach over previous methods, with a low temperature violation rate.
Enhancing performance and chemical stability of Nafion composite proton exchange membranes by the impregnation of phosphotungstic acid functionized CeO2
SSRN Electronic Journal · 2025-01-01
preprint · Open access
The Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability
ArXiv.org · 2025-06-11
preprint · Open access · Senior author
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $ε$-optimal policy with a tight sample complexity of $O(1/ε^2)$.
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
ArXiv.org · 2025-05-21
preprint · Open access · Senior author
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
Learning Task Representations from In-Context Learning
2025-01-01
article · Open access
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL), where models adapt to new tasks through example-based prompts without requiring parameter updates. However, understanding how tasks are internally encoded and generalized remains a challenge. To address some of the empirical and technical gaps in the literature, we introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads within the transformer architecture. This approach computes a single task vector as a weighted sum of attention heads, with the weights optimized causally via gradient descent. Our findings show that existing methods fail to generalize effectively to modalities beyond text. In response, we also design a benchmark to evaluate whether a task vector can preserve task fidelity in functional regression tasks. The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks, demonstrating its generalizability across modalities.

Frequent coauthors

Zhaoran Wang
125 shared
Zhaoran Wang
Shanghai University
72 shared
Kaiqing Zhang
24 shared
Michael I. Jordan
20 shared
Qi Cai
Civil Aviation Administration of China
20 shared
Tamer Başar
19 shared
Xiaohan Wei
University of Edinburgh
17 shared
Mingyi Hong
17 shared
