
Qi Long
University of Pennsylvania · Rehabilitation Medicine
Active 2001–2026
About
Qi Long, Ph.D., is a Professor of Biostatistics in the Department of Biostatistics and Epidemiology at the University of Pennsylvania's Perelman School of Medicine. He also serves as the Associate Director of the Penn Institute for Biomedical Informatics, the Director of the Center for Cancer Data Science, and holds professorships in the Department of Computer and Information Science at the School of Engineering and Applied Science, as well as in the Department of Statistics and Data Science at The Wharton School. Dr. Long's research program bridges innovative data science, informatics, and machine learning/artificial intelligence research with impactful biomedical applications. His work focuses on developing robust statistical and ML/AI methods for advancing precision medicine and population health, including the integration of complex multimodal health data such as -omics, electronic health records, and imaging data, as well as addressing issues like missing data, causal inference, data privacy, algorithmic fairness, and clinical trials. Recently, his research has expanded into large language models, foundation models, and agentic AI for biomedicine. Dr. Long has directed large-scale research networks and clinical studies, supervising multidisciplinary teams, and currently co-directs the Coordinating Center for the Premedical Cancer Immunotherapy Network for Canine Trials as part of NCI’s Cancer Moonshot Initiative. He is a founding director of the Center for Cancer Data Science and holds leadership roles in cancer informatics and quantitative data science at Penn. His methodological research has been supported by major agencies including NIH, PCORI, NSF, and ARPA-H. Dr. Long is an elected fellow of the AAAS, ASA, IMS, and ISI, recognizing his significant contributions to the fields of statistics, data science, and biomedical informatics.
Research topics
- Medicine
- Political Science
- Medical emergency
- Pathology
- Internal medicine
- Intensive care medicine
- Emergency medicine
Selected publications
Digital Repository at the University of Maryland (University of Maryland College Park) · 2026-03-25
article · Open access
Artificial intelligence (AI) is increasingly integrated into catalysis science, enabling agentic workflows in which AI systems perceive inputs, reason under constraints, plan, and autonomously execute in silico or physical experiments with minimal human intervention. While these closed-loop capabilities hold promise to accelerate knowledge generation and technological innovation, they inevitably introduce new sources of variability in data lineage, model specification, and agent policies that can undermine FAIRness, rigor, and reproducibility. These risks are particularly pronounced in heterogeneous catalysis, where subtleties in catalyst synthesis and pretreatment, dynamic restructuring under operating conditions, and transport-mediated local environments can largely determine catalytic outcomes. To address these challenges, we introduce TRACE-AI (Transparent Reporting for Agentic Catalysis Enabled by Artificial Intelligence) as a set of community guidelines paired with a publication checklist. TRACE-AI emphasizes end-to-end traceability across the full lifecycle of an agentic catalysis campaign, linking research objectives to data and models, agent reasoning and action, and the knowledge acquired. By promoting standardized and accountable reporting, TRACE-AI aims to cultivate a shared foundation for accelerating scientific discovery while reinforcing safety and trust as autonomous catalysis laboratories continue to emerge.
On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection
arXiv (Cornell University) · 2025-10-04
preprint · Open access · Senior author
Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
The Impact of Language Mixing on Bilingual LLM Reasoning
ArXiv.org · 2025-07-21
preprint · Open access
Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing, that is, alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We show that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on MATH500. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by 2.92 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.
Communications Medicine · 2025-10-15 · 6 citations
article · Open access
The vast amount of natural language clinical notes about patients with cancer presents a challenge for efficient information extraction, standardization, and structuring. Traditional NLP methods require extensive annotation by domain experts for each type of named entity and necessitate model training, highlighting the need for an efficient and accurate extraction method. This study introduces a tool based on a large language model (LLM) for zero-shot information extraction from cancer-related clinical notes into structured data aligned with the minimal Common Oncology Data Elements (mCODE™) structure. We utilize the zero-shot learning capabilities of LLMs for information extraction, eliminating the need for data annotated by domain experts for training. Our methodology employs advanced hierarchical prompt engineering strategies to overcome common LLM limitations like token hallucination and accuracy issues. We tested the approach on 1,000 synthetic clinical notes representing various cancer types, comparing its performance to a traditional single-step prompting method. Our hierarchical prompt engineering strategy (accuracy = 94%, misidentification and misplacement rate = 5%) outperforms the traditional prompt strategy (accuracy = 87%, misidentification and misplacement rate = 10%) in information extraction. By unifying staging systems (e.g., TNM, FIGO) and specific stage details (e.g., Stage II) into a standardized framework, our approach achieves improved accuracy in extracting cancer stage information. Our approach demonstrates that LLMs, when guided by structured prompting, can accurately extract complex clinical information without the need for expert-labeled data. This method has the potential to harness unstructured data for advancing cancer research.
Clinical notes about cancer patients contain a lot of valuable information, but they are often written in free text, making it hard for computers to use them in research. This study develops a new framework that uses large language models (LLMs) to automatically extract important details from these notes without needing manual labeling by experts. We introduce two advanced prompting techniques, BFOP and 2POP, that hierarchically guide the LLMs step-by-step through the information extraction process. We test BFOP and 2POP on 1,000 synthetic cancer notes, achieving high accuracy and low error rates. Our approaches could help researchers better understand cancer and support more informed clinical decision making by turning hard-to-read notes into structured, standardized data for analysis.
Zhang, Huang et al. introduce mCODEGPT, a zero-shot framework that extracts named entities as structured data from clinical notes of cancer patients using Large Language Models. Their hierarchical prompt engineering approach enables accurate information extraction without relying on expert-labelled training data.
Robust detection of watermarks for large language models under human edits
Journal of the Royal Statistical Society Series B (Statistical Methodology) · 2025-08-20 · 1 citation
article · Open access
Watermarking is an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modelling human edits through mixture model detection, we introduce a new method, a truncated goodness-of-fit test (Tr-GoF), for detecting watermarked text under human edits. We prove that Tr-GoF achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality adaptively without requiring precise knowledge of human edit levels or probabilistic specifications of LLMs, unlike the optimal but impractical Neyman–Pearson likelihood ratio test. Moreover, we establish that Tr-GoF attains the highest detection efficiency rate under moderate text modifications. In contrast, sum-based detection rules used by existing methods fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. We demonstrate Tr-GoF’s competitive and sometimes superior performance on synthetic data and open-source LLMs in the OPT and LLaMA families.
Behavior-Rule Inference Based on Hyponymy–Hypernymy Knowledge Tree
Electronics · 2025-12-05
article · Open access · Senior author
Behavior-rule reasoning aims to infer the applicable rules from specific behaviors and is a type of inductive reasoning that goes from specific cases to general ones. This paper proposes a behavior-rule inference model based on hyponymy–hypernymy knowledge trees, which maps behaviors to corresponding rules through deep learning by processing textual behavior sequences. The primary contributions of this work are threefold: we present a systematic framework that adapts K-BERT to the legal domain by integrating domain-specific hyponymy–hypernymy knowledge trees, addressing the unique challenges of legal text understanding; we conduct comprehensive optimization of key components, including context length, loss function, and base model selection, providing empirical guidelines for applying pre-trained models to legal reasoning tasks; and we propose a practical evaluation metric (tolerance) that mimics real-world legal decision-making processes, providing extensive analysis on the effectiveness of different knowledge types in legal inference.
PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
arXiv (Cornell University) · 2025-05-27
preprint · Open access
The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
2025-11-24
article · Open access · Senior author
Key concepts, examples, and potential solutions for misuses and pitfalls of P values.
Optimal Estimation of Watermark Proportions in Hybrid AI-Human Texts
ArXiv.org · 2025-06-27
preprint · Open access
Text watermarks in large language models (LLMs) are an increasingly important tool for detecting synthetic text and distinguishing human-written content from LLM-generated text. While most existing studies focus on determining whether entire texts are watermarked, many real-world scenarios involve mixed-source texts, which blend human-written and watermarked content. In this paper, we address the problem of optimally estimating the watermark proportion in mixed-source texts. We cast this problem as estimating the proportion parameter in a mixture model based on pivotal statistics. First, we show that this parameter is not even identifiable in certain watermarking schemes, let alone consistently estimable. In stark contrast, for watermarking methods that employ continuous pivotal statistics for detection, we demonstrate that the proportion parameter is identifiable under mild conditions. We propose efficient estimators for this class of methods, which include several popular unbiased watermarks as examples, and derive minimax lower bounds for any measurable estimator based on pivotal statistics, showing that our estimators achieve these lower bounds. Through evaluations on both synthetic data and mixed-source text generated by open-source models, we demonstrate that our proposed estimators consistently achieve high estimation accuracy.
Frontiers in Pharmacology · 2025-11-13 · 1 citation
article · Open access
Background: This study aims to investigate the incidence of tuberculosis (TB) infection following administration of immune checkpoint inhibitors (ICIs) and to explore the risk factors for developing TB in patients treated with ICIs. Research design and methods: We conducted a retrospective review of patients who received ICIs up to June 2023. Patient follow-up continued until death or July 2025. The primary outcome was the incidence of TB infection in patients treated with ICIs. Logistic regression was used to investigate the associations between clinical characteristics and TB infection after ICI initiation. Results: Of the 8,199 patients analyzed, 2.65% had a pre-existing TB diagnosis. The incidence of TB following ICI initiation was 1.96%, with pulmonary TB being the most frequent presentation. Logistic regression revealed that pre-existing TB (OR 3.277; [95% CI, 1.822-5.895]; p < 0.001) and male sex (OR 1.798; [95% CI, 1.173-2.756]; p = 0.007) were significantly associated with TB following ICI initiation. Conclusion: In this large, real-world cohort of cancer patients receiving ICI therapy, we observed a notable incidence of tuberculosis. These findings suggest that enhanced clinical vigilance may be warranted for these high-risk populations, and they highlight the need for prospective, controlled studies to definitively quantify the excess TB risk attributable to ICI therapy. Clinical Trial Registration: https://www.chictr.org.cn, identifier ChiCTR2300075974.
Recent grants
Feature Selection for Genomic Data Using Known and Novel Biological Information
NIH · $154k · 2013–2016
Statistical Modeling of Alzheimer's Disease Progression Integrating Brain Imaging and -Omics Data
NIH · $3.3M · 2021–2027
Advancing Analysis of Multi-omics Data in Alzheimer's Disease Research
NIH · $3.8M · 2019–2024
Statistical Methods for Causal Inference in Observational Studies
NIH · $424k · 2016–2019
NIH · $152k · 2016
Frequent coauthors
- 120 shared
Roberd M. Bostick
Emory University
- 112 shared
W. Dana Flanders
Emory University
- 97 shared
Carrie R. Daniel
University of Houston
- 97 shared
Veronika Fedirko
The University of Texas MD Anderson Cancer Center
- 96 shared
Robin E. Rutherford
Emory University
- 95 shared
Aasma Shaukat
New York University
- 90 shared
Eduard Sidelnikov
Amgen (Switzerland)
- 51 shared
Arshed A. Quyyumi
Emory University
Labs
Qi Long Laboratory · PI
Education
- 2005
PhD, Biostatistics
University of Michigan
- 2003
MS, Department of Biostatistics
University of Michigan
- 1998
BS, Special Class for Gifted Young
University of Science and Technology of China
Awards & honors
- Fellow of the American Association for the Advancement of Sc…
- Fellow of the American Statistical Association (ASA)
- Fellow of the Institute of Mathematical Statistics (IMS)
- Fellow of the International Statistical Institute (ISI)