
Ziad Al-Halah
Assistant Professor (verified) · University of Utah · Computer Science
Active 2008–2026
About
Ziad Al-Halah is an Assistant Professor at the Kahlert School of Computing at the University of Utah. His research is in artificial intelligence, with a focus on computer vision, and he contributes to the field through both teaching and research. He can be reached at ziad.al-halah@utah.edu, and his office is MEB 2176.
Research topics
- Computer Science
- Artificial Intelligence
- Information Retrieval
- Natural Language Processing
- Human–computer interaction
- World Wide Web
- Computer graphics (images)
- Multimedia
- Computer vision
Selected publications
MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos
arXiv (Cornell University) · 2026-03-15
article · Open access
We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.
Interactive Episodic Memory with User Feedback
arXiv (Cornell University) · 2026-04-27
preprint · Open access · Senior author
In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video, captured from the user's perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.
How Would it Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes
2025-10-19
article · Senior author
2025-06-10 · 4 citations
article
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose LANGVIEW, a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video, with no language or camera poses, and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation. Project: https://vision.cs.utexas.edu/projects/which-view-shows-it-best.
Switch-a-View: View Selection Learned from Unlabeled In-the-Wild Videos
2025-10-19
article
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
2024-06-16 · 3 citations
article
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and Easy-Com. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
arXiv (Cornell University) · 2024-12-24
preprint · Open access
We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled (but human-edited) video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages. Project: https://vision.cs.utexas.edu/projects/switch_a_view/.
arXiv (Cornell University) · 2024-11-13
preprint · Open access
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose LangView, a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video, with no language or camera poses, and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation. Project page: https://vision.cs.utexas.edu/projects/which-view-shows-it-best.
Frequent coauthors
- 49 shared
Kristen Grauman
- 25 shared
Rainer Stiefelhagen
Karlsruhe Institute of Technology
- 16 shared
Santhosh Kumar Ramakrishnan
The University of Texas at Austin
- 13 shared
Sagnik Majumder
The University of Texas at Austin
- 13 shared
Changan Chen
Guangdong Medical College
- 6 shared
Carl Schissler
- 6 shared
Makarand Tapaswi
- 4 shared
Unnat Jain
Carnegie Mellon University
Education
- 2000
Ph.D., Computer Science
University of Utah
- 1996
M.S., Computer Science
University of Utah
- 1993
B.S., Computer Science
University of Jordan