Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to start · No credit card · Cancel anytime
Ziad Al-Halah

Assistant Professor · Verified

University of Utah · Computer Science

Active 2008–2026

h-index: 19
Citations: 1.5k
Papers: 73 (38 in the last 5 years)
Funding

About

Ziad Al-Halah is an Assistant Professor at the Kahlert School of Computing at the University of Utah. His research focuses on artificial intelligence, specifically computer vision, and he contributes to the field through both teaching and research. Email: ziad.al-halah@utah.edu. Office: MEB 2176.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Information Retrieval
  • Natural Language Processing
  • Human–computer interaction
  • World Wide Web
  • Computer graphics (images)
  • Multimedia
  • Computer vision

Selected publications

  • MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos

    arXiv (Cornell University) · 2026-03-15

    Article · Open access

    We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.

  • Interactive Episodic Memory with User Feedback

    arXiv (Cornell University) · 2026-04-27

    Preprint · Open access · Senior author

    In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video, captured from the user's perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.

  • How Would it Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes

    2025-10-19

    Article · Senior author

  • Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos

    2025-06-10 · 4 citations

    Article

    Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose LANGVIEW, a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video—no language or camera poses—and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation. Project: https://vision.cs.utexas.edu/projects/which-view-shows-it-best.

  • Switch-a-View: View Selection Learned from Unlabeled In-the-Wild Videos

    2025-10-19

    Article

  • Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

    2024-06-16 · 3 citations

    Article

    We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked autoencoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.

  • Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos

    arXiv (Cornell University) · 2024-12-24

    Preprint · Open access

    We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled -- but human-edited -- video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages. Project: https://vision.cs.utexas.edu/projects/switch_a_view/.

  • Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos

    arXiv (Cornell University) · 2024-11-13

    Preprint · Open access

    Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose LangView, a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video--no language or camera poses--and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation. Project page: https://vision.cs.utexas.edu/projects/which-view-shows-it-best.

Frequent coauthors

  • Kristen Grauman

    49 shared
  • Rainer Stiefelhagen

    Karlsruhe Institute of Technology

    25 shared
  • Santhosh Kumar Ramakrishnan

    The University of Texas at Austin

    16 shared
  • Sagnik Majumder

    The University of Texas at Austin

    13 shared
  • Changan Chen

    Guangdong Medical College

    13 shared
  • Carl Schissler

    6 shared
  • Makarand Tapaswi

    6 shared
  • Unnat Jain

    Carnegie Mellon University

    4 shared

Education

  • Ph.D., Computer Science

    University of Utah

    2000
  • M.S., Computer Science

    University of Utah

    1996
  • B.S., Computer Science

    University of Jordan

    1993