
Inderjit Dhillon
Professor · University of Texas at Austin · Electrical and Computer Engineering
Active 1987–2026
Research topics
- Computer Science
- Artificial Intelligence
- Natural Language Processing
- Machine Learning
- Engineering
- Mathematics
- Electrical engineering
- Algorithm
- Database
Selected publications
LUCID: Attention with Preconditioned Representations
ArXiv.org · 2026-02-11
Article · Open access · Senior author
Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.
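The dilution effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the LUCID method; it only assumes a hypothetical setting with one relevant key whose logit exceeds all other keys by a fixed gap (the names `mass_on_relevant` and `score_gap` are illustrative), and shows how standard softmax spreads mass away from that key as the number of keys grows.

```python
import math

def mass_on_relevant(n_keys: int, score_gap: float, temperature: float = 1.0) -> float:
    """Softmax probability assigned to a single relevant key.

    Assumed toy setting: the relevant key has logit `score_gap`,
    and the other n_keys - 1 keys all have logit 0.
    """
    top = math.exp(score_gap / temperature)
    return top / (top + (n_keys - 1))

# The probability on the relevant key shrinks as the context grows.
for n in (1_000, 16_000, 128_000):
    print(n, round(mass_on_relevant(n, score_gap=8.0), 4))
```

Lowering `temperature` in this toy setting restores the mass on the relevant key, which is the sharpening strategy the abstract says comes at the cost of vanishing gradients.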
Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
ArXiv.org · 2026-02-12
Article · Open access
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
arXiv (Cornell University) · 2026-05-12
Preprint · Open access
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
ArXiv.org · 2026-01-22
Article · Open access
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Open MIND · 2026-01-22
Preprint
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
arXiv (Cornell University) · 2026-04-17
Preprint · Open access · Senior author
Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
LUCID: Attention with Preconditioned Representations
Open MIND · 2026-02-11
Preprint · Senior author
Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.
Modeling Longitudinal Student Pathways with Explainable Generative Models
2026-04-25
Article · Open access
Student pathways (trajectories spanning academic readiness, course selections, grades, and enrollment outcomes) are critical for understanding educational progress, particularly in community college systems where pathways are highly diverse. Restricted access to student records and a sparsity of coherent pathway data pose barriers to modeling and broader research engagement. This paper contributes a scalable and explainable framework for generating synthetic student pathway data, along with a Bayesian network structure learning approach adapted to the sparsity and temporal complexity of educational trajectories. We present a generative modeling approach that produces realistic synthetic student pathway data by learning a Bayesian network trained on longitudinal student records across multiple tables linked across time. Our model captures complex conditional dependencies across hundreds of variables while remaining interpretable: each parameter encodes transparent relationships that can be inspected or adjusted. We show that our method outperforms an independent sampler in reproducing marginal, conditional, and higher-order patterns in the real data. Our analysis shows that existing k-anonymity rules are infeasible for real or synthetic data, motivating a shift toward model-aware approaches for privacy considerations in student pathway data.
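For readers unfamiliar with k-anonymity, the following minimal sketch shows the standard definition: a table is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The records and field names below are hypothetical and are not drawn from the paper; the sketch only illustrates why the achievable k tends to drop as more variables are treated as quasi-identifiers.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    identical values on all quasi-identifier fields (0 for an empty table)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

# Hypothetical toy pathway records; field names are illustrative only.
records = [
    {"term": 1, "course": "MATH101", "grade": "A"},
    {"term": 1, "course": "MATH101", "grade": "A"},
    {"term": 1, "course": "ENGL101", "grade": "B"},
    {"term": 2, "course": "MATH102", "grade": "B"},
    {"term": 2, "course": "MATH102", "grade": "C"},
]

print(k_anonymity(records, ["term"]))
print(k_anonymity(records, ["term", "course", "grade"]))
```

With only `term` as a quasi-identifier each group contains several records, while adding `course` and `grade` leaves some records unique, which is the kind of pressure that grows quickly with hundreds of variables.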
Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Open MIND · 2026-02-12
Preprint
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
Recent grants
Novel Matrix Problems in Modern Applications
NSF · $230k · 2004–2009
AF: Small: Fast and Memory-Efficient Dimensionality Reduction for Massive Networks
NSF · $360k · 2011–2015
BIGDATA: Collaborative Research: F: Nomadic Algorithms for Machine Learning in the Cloud
NSF · $610k · 2016–2021
Non-Negative Matrix and Tensor Approximations: Algorithms, Software and Applications
NSF · $250k · 2007–2012
AF: Small: Divide-and-Conquer Numerical Methods for Analysis of Massive Data Sets
NSF · $491k · 2013–2017
Frequent coauthors
- Cho‐Jui Hsieh · 75 shared
- Hsiang‐Fu Yu · Amazon (United States) · 75 shared
- Pradeep Ravikumar · 38 shared
- Suvrit Sra · 35 shared
- Sujay Sanghavi · 26 shared
- Kai Zhong · 21 shared
- Si Si · 21 shared
- Nagarajan Natarajan · 20 shared