Stan Sclaroff

Professor & Dean of the College of Arts & Sciences

Boston University · Computer Science

Active 1990–2025

h-index: 69
Citations: 19.2k
Papers: 396 (113 in the last 5 years)
Funding: $3.5M
About

Stan Sclaroff is a Professor and Dean of the College of Arts and Sciences at Boston University. He joined the BU Department of Computer Science in 1995 after completing his PhD at MIT. He served as Chair of the Department from 2007 to 2013, Associate Dean of the Faculty for Mathematical & Computational Sciences from 2015 to 2018, and Dean ad interim of the College of Arts and Sciences from August 2018 to May 2019. On May 17, 2019, he was appointed Dean of the College of Arts and Sciences. His research interests span computer vision, pattern recognition, and machine learning, with expertise in tracking, video-based analysis of human motion and gesture, deformable shape matching and recognition, and image and video database indexing, retrieval, and data mining methods. Sclaroff developed ImageRover, one of the first content-based image retrieval systems for the Internet, years before Google Image Search appeared. His recent work focuses on human tracking algorithms, analysis and identification of hand motion related to sign language, and filtering methods for multimedia retrieval. He co-leads the Image and Video Computing research group and is a Fellow of the IEEE and the IAPR.

Research topics

  • Artificial Intelligence
  • Computer Science
  • Natural Language Processing
  • Computer vision
  • Machine Learning
  • Algorithm
  • Mathematics

Selected publications

  • Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation

    Computer Vision and Image Understanding · 2025-12-17

    article
  • Practical Disruption of Image Translation Deepfake Networks

    Proceedings of the AAAI Conference on Artificial Intelligence · 2023-06-26 · 9 citations

    article · Open access · Senior author

    By harnessing the latest advances in deep learning, image-to-image translation architectures have recently achieved impressive capabilities. Unfortunately, the growing representational power of these architectures has prominent unethical uses. Among these, the threats of (1) face manipulation ("DeepFakes") used for misinformation or pornographic use (2) "DeepNude" manipulations of body images to remove clothes from individuals, etc. Several works tackle the task of disrupting such image translation networks by inserting imperceptible adversarial attacks into the input image. Nevertheless, these works have limitations that may result in disruptions that are not practical in the real world. Specifically, most works generate disruptions in a white-box scenario, assuming perfect knowledge about the image translation network. The few remaining works that assume a black-box scenario require a large number of queries to successfully disrupt the adversary's image translation network. In this work we propose Leaking Transferable Perturbations (LTP), an algorithm that significantly reduces the number of queries needed to disrupt an image translation network by dynamically re-purposing previous disruptions into new query efficient disruptions.

  • DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition With Limited Annotations

    IEEE Transactions on Pattern Analysis and Machine Intelligence · 2023-12-25 · 25 citations

    article

    Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.

  • Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

    arXiv (Cornell University) · 2023-06-30

    preprint · Open access

    Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and generalization performance. For example, ViTs have interesting properties with respect to early layer non-local feature dependence, as well as self-attention mechanisms which enhance learning flexibility, enabling them to ignore out-of-context image information more effectively. We hypothesize that this power to ignore out-of-context information (which we name patch selectivity), while integrating in-context information in a non-local manner in early layers, allows ViTs to more easily handle occlusion. In this study, our aim is to see whether we can have CNNs simulate this ability of patch selectivity by effectively hardwiring this inductive bias using Patch Mixing data augmentation, which consists of inserting patches from another image onto a training image and interpolating labels between the two image classes. Specifically, we use Patch Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their ability to ignore out-of-context patches and handle natural occlusions. We find that ViTs do not improve nor degrade when trained using Patch Mixing, but CNNs acquire new capabilities to ignore out-of-context information and improve on occlusion benchmarks, leaving us to conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess. We will release our Patch Mixing implementation and proposed datasets for public use. Project page: https://arielnlee.github.io/PatchMixing/

  • Video Frame Interpolation with Many-to-many Splatting and Spatial Selective Refinement

    arXiv (Cornell University) · 2023-10-29

    preprint · Open access

    In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from less effective representation capacity. To improve interpolation accuracy, we further extend an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows for trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method is able to improve the efficiency while maintaining competitive video interpolation quality, and it can be adjusted to use more or less compute as needed.

  • Video Frame Interpolation With Many-to-Many Splatting and Spatial Selective Refinement

    IEEE Transactions on Pattern Analysis and Machine Intelligence · 2023-10-24 · 8 citations

    article

    In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from less effective representation capacity. To improve interpolation accuracy, we further extend an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows for trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method is able to improve the efficiency while maintaining competitive video interpolation quality, and it can be adjusted to use more or less compute as needed.

  • DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations

    arXiv (Cornell University) · 2023-08-03

    preprint · Open access

    Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.

  • Many-to-many Splatting for Efficient Video Frame Interpolation

    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2022-06-01 · 64 citations

    article

    Motion-based video frame interpolation commonly relies on optical flow to warp pixels from the inputs to the desired interpolation instant. Yet due to the inherent challenges of motion estimation (e.g. occlusions and discontinuities), most state-of-the-art interpolation approaches require subsequent refinement of the warped result to generate satisfying outputs, which drastically decreases the efficiency for multi-frame interpolation. In this work, we propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Specifically, given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step, and then fuse any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context. This establishes a many-to-many splatting scheme with robustness to artifacts like holes. Moreover, for each input frame pair, M2M only performs motion estimation once and has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. We conducted extensive experiments to analyze M2M, and found that it significantly improves the efficiency while maintaining high effectiveness.

  • Leveraging Geometric Structure for Label-Efficient Semi-Supervised Scene Segmentation

    IEEE Transactions on Image Processing · 2022-01-01 · 6 citations

    article

    Label-efficient scene segmentation aims to achieve effective per-pixel classification with reduced labeling effort. Recent approaches for this task focus on leveraging unlabelled images by formulating consistency regularization or pseudo labels for individual pixels. Yet most of these methods ignore the 3D geometric structures naturally conveyed by image scenes, which is free for enhancing training segmentation models with better discrimination of image details. In this work, we present a novel Geometric Structure Refinement (GSR) framework to explicitly exploit the geometric structures of image scenes to enhance the semi-supervised training of segmentation models. In the training phase, we generate initial dense pseudo labels based on fast and coarse annotations, and then utilize the free unsupervised 3D reconstruction of the image scene to calibrate the dense pseudo labels with more reliable details. With the calibrated pseudo groundtruth, we are able to conveniently train any existing image segmentation models without increasing the costs of annotations or modifying the models' architectures. Moreover, we explore different strategies for allocating labeling effort in semi-supervised scene segmentation, and find that a combination of finely-labeled samples and coarsely-labeled samples performs better than the traditional dense-fine only annotations. Extensive experiments on datasets including Cityscapes and KITTI are conducted to evaluate our proposed methods. The results demonstrate that GSR can be easily applied to boost the performance of existing models like PSPNet, DeepLabv3+, etc with reduced annotations. With half of the annotation effort, GSR achieves 99% of the accuracy of its fully supervised state-of-the-art counterparts.

  • Image Analysis and Processing. ICIAP 2022 Workshops

    Lecture notes in computer science · 2022-01-01 · 11 citations

    book · Senior author
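
The abstract for "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing" above describes the augmentation as inserting patches from another image onto a training image and interpolating the labels of the two classes. The following is only a rough sketch of that idea, not the authors' implementation: the function name `patch_mix`, the regular patch grid, the use of plain nested lists for images, and the choice to weight labels by the fraction of replaced patches are all assumptions made for illustration.

```python
import random


def patch_mix(img_a, img_b, label_a, label_b, patch, n_swap, rng=None):
    """Hypothetical sketch of a Patch-Mixing-style augmentation.

    img_a, img_b: equal-sized 2D lists (height x width) of pixel values.
    label_a, label_b: one-hot (or soft) label lists of equal length.
    patch: side length of a square grid cell, in pixels.
    n_swap: number of grid cells of img_a to overwrite with img_b.
    Returns (mixed_image, mixed_label); the inputs are not modified.
    """
    rng = rng or random.Random()
    height, width = len(img_a), len(img_a[0])
    rows, cols = height // patch, width // patch
    cells = [(r, c) for r in range(rows) for c in range(cols)]

    # Pick which grid cells receive pixels from the second image.
    chosen = rng.sample(cells, n_swap)

    mixed = [row[:] for row in img_a]  # copy so img_a stays untouched
    for r, c in chosen:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                mixed[y][x] = img_b[y][x]

    # Assumption: the label weight equals the fraction of replaced cells.
    lam = n_swap / len(cells)
    mixed_label = [(1 - lam) * a + lam * b for a, b in zip(label_a, label_b)]
    return mixed, mixed_label
```

For instance, with a 4x4 image split into 2x2 patches and one patch swapped, a quarter of the pixels come from the second image and the label becomes a 75/25 blend of the two classes. The abstract itself does not specify the patch layout or the interpolation weights, so those details should be checked against the project page before relying on this sketch.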

Frequent coauthors

  • Vittorio Murino

    1174 shared
  • Giovanni Maria Farinella

    University of Catania

    1161 shared
  • Sérgio Escalera

    Computer Vision Center

    1160 shared
  • Cosimo Distante

    National Research Council

    1158 shared
  • Emanuele Frontoni

    University of Macerata

    1158 shared
  • Marcos Ortega

    1156 shared
  • Pierluigi Carcagnì

    1156 shared
  • Fausto Milletarì

    1156 shared

Education

  • Ph.D.

    MIT

Awards & honors

  • Fellow of the IEEE
  • Fellow of the IAPR