Irfan Essa
Verified · Georgia Institute of Technology · Computer Science
Active 1990–2026
About
Irfan Essa is a Distinguished Professor in the School of Interactive Computing and a Senior Associate Dean in the College of Computing at the Georgia Institute of Technology. He serves as the Inaugural Executive Director of the Interdisciplinary Research Center for Machine Learning at Georgia Tech (ML@GT) and is a Senior Staff Research Scientist at Google Inc. His research areas include Computer Vision, Machine Learning, Artificial Intelligence, Robotics, Computer Graphics, and Computational Journalism. He works on topics with potential impact on Autonomous Systems, Video Analysis and Production, Human Computer Interaction, and Computational Behavioral/Social Sciences. He has published over 200 scholarly articles in leading journals and conferences, several of which won best paper awards. He received the NSF CAREER award and was elected to the grade of IEEE Fellow. His previous roles include extended research consulting with Disney Research and Google Research, and an adjunct faculty position at Carnegie Mellon’s Robotics Institute. He joined the Georgia Tech faculty in 1996, having earned his MS in 1990 and Ph.D. in 1994 and gained prior research faculty experience at the MIT Media Lab from 1988 to 1996.
Research topics
- Computer Security
- Artificial Intelligence
- Computer Science
- Sociology
- Political Science
- World Wide Web
- Public relations
- Knowledge management
- Engineering
- Data science
Selected publications
MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
arXiv (Cornell University) · 2026-03-24
preprint · Open access
Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
Mamba Fusion: Learning Actions Through Questioning
2025-03-12 · 4 citations
article · Senior author
Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture a more comprehensive understanding of the actions in the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation. The code is available at https://github.com/Dongzhikang/MambaVL.
Africa Health Check: Probing Cultural Bias in Medical LLMs
2025-01-01
article · Open access
Large language models (LLMs) are increasingly deployed in global healthcare, yet their outputs often reflect Western-centric training data and omit indigenous medical systems and region-specific treatments. This study investigates cultural bias in instruction-tuned medical LLMs using a curated dataset of African traditional herbal medicine. We evaluate model behavior across two complementary tasks, namely, multiple-choice questions and fill-in-the-blank completions, designed to capture both treatment preferences and responsiveness to cultural context. To quantify outcome preferences and prompt influences, we apply two complementary metrics: Cultural Bias Score (CBS) and Cultural Bias Attribution (CBA). Our results show that while prompt adaptation can reduce inherent bias and enhance cultural alignment, models vary in how responsive they are to contextual guidance. The persistent default to allopathic (Western) treatments in zero-shot scenarios suggests that many biases remain embedded in model training. These findings underscore the need for culturally informed evaluation strategies to guide the development of AI systems that equitably serve diverse global health contexts. By releasing our dataset and providing a dual-metric evaluation approach, we offer practical tools for developing more culturally aware and clinically grounded AI systems for healthcare settings in the Global South.
WatchWithMe: LLM-Based Interactive Guided Watching of Review Videos
2025-07-05 · 1 citation
article · Open access
Leveraging Procedural Knowledge and Task Hierarchies for Efficient Instructional Video Pre-training
ArXiv.org · 2025-02-24
preprint · Open access · Senior author
Instructional videos provide a convenient modality for learning new tasks (e.g., cooking a recipe or assembling furniture). A viewer will want to find a video that both reflects the overall task they are interested in and contains the relevant steps they need to carry it out. To support this, an instructional video model should be capable of inferring both the tasks and the steps that occur in an input video. Doing this efficiently and in a generalizable fashion is key when compute or relevant video topics used to train this model are limited. To address these requirements, we explicitly mine task hierarchies and the procedural steps associated with instructional videos. We use this prior knowledge to pre-train our model, Pivot, for step and task prediction. During pre-training, we also provide video augmentation and early stopping strategies to optimally identify which model to use for downstream tasks. We test this pre-trained model on task recognition, step recognition, and step prediction tasks on two downstream datasets. When pre-training data and compute are limited, we outperform previous baselines on these tasks. Therefore, leveraging prior task and step structures enables efficient training of Pivot for instructional video recommendation.
Text Descriptions of Actions and Objects Improve Action Anticipation
2025-03-12 · 1 citation
article · Senior author
Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, additional and complementary information from different modalities helps narrow down plausible action choices. Going beyond typical sources such as video and audio, we primarily explore how text descriptions of actions and objects lead to more accurate action anticipation, as they provide additional contextual cues, e.g., about the environment and its contents. We propose Multi-modal Contrastive Anticipative Transformer (M-CAT), which is trained in two stages, where the model first learns to align video and other modalities with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Through extensive experimental evaluation, we demonstrate that M-CAT outperforms baselines on the EpicKitchens datasets, and show that explicit incorporation of object and action information via their text descriptions leads to more effective action anticipation. Code available at https://github.com/ApoorvaBeedu/M-CAT.
Proceedings of the AAAI Conference on Artificial Intelligence · 2025-04-11 · 4 citations
article · Open access
Cross-modal contrastive pre-training between natural language and other modalities, e.g., vision and audio, has demonstrated astonishing performance and effectiveness across a diverse variety of tasks and domains. In this paper, we investigate whether such natural language supervision can be used for wearable sensor-based Human Activity Recognition (HAR), and discover that, surprisingly, it performs substantially worse than standard end-to-end training and self-supervision. We identify the primary causes for this as sensor heterogeneity and the lack of rich, diverse text descriptions of activities. To mitigate their impact, we also develop strategies and assess their effectiveness through an extensive experimental evaluation. These strategies lead to significant increases in activity recognition, bringing performance closer to supervised and self-supervised training, while also enabling the recognition of unseen activities and cross-modal retrieval of videos. Overall, our work paves the way for better sensor-language learning, ultimately leading to the development of foundational models for HAR using wearables.
Enabling Controllable, Identity Preserving, Non-Rigid Edits in Human-Centric Images
2025-08-18
article · Senior author
We approach the problem of inserting a person into a novel scene and controlling their pose via text guidance. Given an image of a person, a masked image of a scene, and a text description of the target pose, our model generates realistic, highly controllable images. We validate the robustness of our model’s true-to-text accuracy and identity preservation via a user study on in-the-wild images. In addition, we present a novel dataset containing pairs of frames from human-centric and action-rich videos, with text captions of the difference in human pose between frames. We also explore the challenges of controllable identity preservation for in-the-wild scenes and the failure modes of similar models. Our methods achieve a 10% increase in pose adherence (PCKt@0.5) over comparable methods without compromising visual fidelity, and show a clear qualitative improvement.
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
2025-01-01 · 5 citations
article · Open access
Charles Nimo, Tobi Olatunji, Abraham Toluwase Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Ezinwanne C. Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood O. Yekini, Jonas Kemp, Katherine A Heller, Jude Chidubem Omeke, Chidi Asuzu Md, Naome A Etori, Aïmérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael L. Best, Irfan Essa, Stephen Edward Moore, Chris Fourie, Mercy Nyamewaa Asiedu. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
Recent grants
NRI: Representing and Anticipating Actions in Human-Robot Collaborative Assembly Tasks
NSF · $850k · 2014–2020
Frequent coauthors
- Dhruv Batra · 32 shared
- Vinay Bettadapura · Atlanta Technical College · 27 shared
- Aneeq Zia · 20 shared
- Gregory D. Abowd · Northeastern University · 20 shared
- Steven Hickson · Google (United States) · 18 shared
- Stefan Lee · Oregon State University · 17 shared
- Devi Parikh · 16 shared
- Vivek Kwatra · Georgia Institute of Technology · 16 shared
Education
- 1995 · PhD, MIT Media Lab, Massachusetts Institute of Technology
Awards & honors
- NSF CAREER