
Ehsan Elhamifar
Associate Professor, Affiliate Faculty with the College of Engineering · Northeastern University · Artificial Intelligence and Data Science
Active 2005–2026
About
Ehsan Elhamifar is an associate professor in the Khoury College of Computer Sciences at Northeastern University, based in Boston, and is affiliated with the College of Engineering. His research develops artificial intelligence systems that understand and learn from complex human activities and scenes using videos and multi-modal data. He focuses on enabling AI to learn tasks from fewer examples and less annotated data, as well as making real-time inferences as new data arrives. Elhamifar combines these AI systems with augmented reality (AR) and virtual reality (VR) technologies to assist people in performing complex procedural and physical tasks. His work pulls from a broad range of concepts and disciplines, including long-form and egocentric video understanding, action segmentation, low-shot learning, fine-grained recognition, video summarization, sequence alignment, adversarial attacks, manifold clustering, trajectory prediction, and subset selection. He is the director of the Mathematical Data Science (MCADS) Lab, which focuses on computer vision, machine learning, and AI. Elhamifar has a background that includes a PhD in Electrical and Computer Engineering from Johns Hopkins University, an MS in Engineering from Johns Hopkins University, an MS in Electrical Engineering from Sharif University of Technology in Iran, and a BS in Biomedical Engineering from Amirkabir University of Technology in Iran. He has been recognized with awards such as the DARPA Young Faculty Award and has contributed to numerous publications in the field.
Research topics
- Computer Science
- Machine Learning
- Artificial Intelligence
- Computer vision
- Human–computer interaction
Selected publications
EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms
arXiv (Cornell University) · 2026-04-23
preprint · Open access · Senior author
This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).
RegionAligner: Bridging Ego-Exo Views for Object Correspondence via Unified Text-Visual Learning
2026-03-06
article · Senior author
Establishing object correspondence between egocentric (ego) and exocentric (exo) views is a critical capability for robot learning and human-robot interaction. The core task involves segmenting an object in one view given a query mask from the opposing view. This is notoriously difficult due to cluttered scenes with many task-irrelevant objects and drastic appearance changes across perspectives. To address this, we introduce RegionAligner, a unified text-visual framework that strategically focuses learning on task-relevant regions. Our method first uses a large vision-language model to identify and name salient objects, effectively filtering out visual distractors. These object phrases are then fused with visual features from both views. We introduce a novel region-guided supervision strategy that promotes focus, enforces spatial alignment, and minimizes appearance disparity between the ego-exo views. Furthermore, our framework seamlessly adapts to unsupervised settings by automatically generating pseudo-labels from matched mask proposals, drastically reducing annotation costs. Extensive experiments on the challenging Ego-Exo4D dataset show RegionAligner significantly outperforms existing baselines, improving IoU by 10.16% (ego-to-exo) and 6.04% (exo-to-ego).
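For readers unfamiliar with the IoU metric quoted in this abstract, the following is a minimal sketch of intersection over union for two binary segmentation masks. It is a generic illustration, not the paper's evaluation code; the function name `mask_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are all assumptions made here.

```python
def mask_iou(pred, target):
    """Intersection over union of two same-shaped binary masks.

    Masks are nested lists of 0/1 values (an assumed format for illustration).
    """
    intersection = 0
    union = 0
    for row_pred, row_target in zip(pred, target):
        for p, t in zip(row_pred, row_target):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x3 masks: 2 pixels overlap, 4 pixels are covered in total.
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(mask_iou(predicted, ground_truth))  # 0.5
```

A higher value means the predicted mask in one view overlaps more closely with the annotated mask for the same object.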
Compositional Targeted Multi-Label Universal Perturbations
2025-06-10
article · Senior author
Generating targeted universal perturbations for multi-label recognition is a combinatorially hard problem that requires exponential time and space complexity. To address the problem, we propose a compositional framework. We show that a simple independence assumption on label-wise universal perturbations naturally leads to an efficient optimization that requires learning affine convex cones spanned by label-wise universal perturbations, significantly reducing the problem complexity to linear time and space. During inference, the framework allows generating universal perturbations for novel combinations of classes in constant time. We demonstrate the scalability of our method on large datasets and target sizes, evaluating its performance on NUS-WIDE, MS-COCO, and OpenImages using state-of-the-art multi-label recognition models. Our results show that our approach outperforms baselines and achieves results comparable to methods with exponential complexity. The code is available at https://github.com/hassanmahmood/UMLLAttacks.git
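The idea of composing a perturbation for a set of target labels from per-label perturbations can be illustrated with a small sketch. This is not the authors' algorithm or code: it only shows a non-negative (conic) combination of per-label vectors followed by clipping to an L-infinity budget. The function name `compose_perturbation`, the dictionary format, the default unit weights, and the budget `eps` are hypothetical choices made for illustration.

```python
def compose_perturbation(label_perturbations, target_labels, weights=None, eps=0.03):
    """Combine per-label perturbation vectors for a set of target labels.

    label_perturbations: dict mapping a label to a list of floats (assumed format).
    weights: optional dict of non-negative coefficients, one per target label.
    eps: assumed L-infinity budget applied by element-wise clipping.
    """
    if weights is None:
        weights = {label: 1.0 for label in target_labels}
    if any(weights[label] < 0 for label in target_labels):
        raise ValueError("a conic combination needs non-negative weights")

    dim = len(next(iter(label_perturbations.values())))
    combined = [0.0] * dim
    for label in target_labels:
        for i, value in enumerate(label_perturbations[label]):
            combined[i] += weights[label] * value

    # Keep every coordinate inside [-eps, eps].
    return [max(-eps, min(eps, value)) for value in combined]


# Hypothetical 3-dimensional perturbations for two labels.
perturbations = {"dog": [0.02, -0.01, 0.0], "car": [0.02, 0.03, -0.05]}
print(compose_perturbation(perturbations, ["dog", "car"]))
```

In this toy version the cost grows with the number of target labels and the vector dimension; how the paper parameterizes and learns the cones is not reproduced here.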
MOSCATO: Predicting Multiple Object State Change through Actions
2025-10-19 · 3 citations
article · Senior author
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
ArXiv.org · 2025-05-22
preprint · Open access
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to prohibitive computational costs of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.
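The selection step described in this abstract, where a cheap pass scores every clip and only the most salient clips go to the expensive encoder, can be sketched roughly as follows. This is an illustrative top-k selection under assumed inputs, not the DeCafNet implementation; `select_salient_clips`, `keep_ratio`, and the example scores are hypothetical.

```python
import math


def select_salient_clips(saliency, keep_ratio=0.5):
    """Return indices of the highest-saliency clips, in temporal order.

    saliency: one score per clip, e.g. produced by a lightweight encoder (assumed).
    keep_ratio: fraction of clips forwarded to the expensive encoder (assumed knob).
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not saliency:
        return []
    k = max(1, math.ceil(len(saliency) * keep_ratio))
    # Rank clip indices by score, highest first; ties keep their original order.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical saliency scores for six clips of a long video.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.6]
print(select_salient_clips(scores, keep_ratio=0.5))  # [1, 3, 5]
```

Only the returned clips would then be passed through the full-scale encoder, which is where the computational saving comes from in a scheme of this kind.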
Multi-Modal Few-Shot Temporal Action Segmentation
2025-10-19 · 3 citations
article · Senior author
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
2025-06-10 · 2 citations
article
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to prohibitive computational costs of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance.
Error Recognition in Procedural Videos Using Generalized Task Graph
2025-10-19 · 3 citations
article · Senior author
Understanding Multi-Task Activities from Single-Task Videos
2025-06-10 · 4 citations
article · Senior author
We introduce and develop a framework for Multi-Task Temporal Action Segmentation (MT-TAS), a novel paradigm that addresses the challenges of interleaved actions when performing multiple tasks simultaneously. Traditional action segmentation models, trained on single-task videos, struggle to handle task switches and complex scenes inherent in multi-task scenarios. To overcome these challenges, our MT-TAS approach synthesizes multi-task video data from single-task sources using our Multi-Task Sequence Blending and Segment Boundary Learning modules. Additionally, we propose to dynamically isolate foreground and background elements within video frames, addressing the intricacies of object layouts in multi-task scenarios and enabling a new two-stage temporal action segmentation framework with Foreground-Aware Action Refinement. Also, we introduce the Multi-Task Egocentric Kitchen Activities (MEKA) dataset, containing 12 hours of egocentric multi-task videos, to rigorously benchmark MT-TAS models. Extensive experiments demonstrate that our framework effectively bridges the gap between single-task training and multi-task testing, advancing temporal action segmentation with state-of-the-art performance in complex environments.
Recent grants
RI: Small: Learning Fine-Grained Instructions from Uncurated Complex Activity Videos
NSF · $498k · 2021–2026
CRII: RI: Towards a Comprehensive Dynamic Subset Selection Framework
NSF · $175k · 2017–2021
Frequent coauthors
- René Vidal · 16 shared
- Dat Huynh · 11 shared
- S. Shankar Sastry · 9 shared
- Mahdi Soltanolkotabi · 7 shared
- Nasser Sadati, Sharif University of Technology · 6 shared
- Yuhan Shen, Universidad del Noreste · 6 shared
- Allen Y. Yang · 5 shared
- Guillermo Sapiro, Duke University · 5 shared
Labs
Ehsan Elhamifar - Khoury College of Computer Sciences · PI
Education
- 2012 · PhD, Electrical and Computer Engineering, Johns Hopkins University
Awards & honors
- DARPA Young Faculty Award