
Stan Sclaroff
Professor & Dean of the College of Arts & Sciences · Boston University · Computer Science
Active 1990–2025
About
Stan Sclaroff is a Professor and Dean of the College of Arts and Sciences at Boston University. He joined the BU Department of Computer Science in 1995 after completing his PhD at MIT. He served as Chair of the Department from 2007 to 2013, as Associate Dean of the Faculty for Mathematical & Computational Sciences from 2015 to 2018, and as Dean ad interim of the College of Arts and Sciences from August 2018 to May 2019. On May 17, 2019, he was appointed Dean of the College of Arts and Sciences. His research interests encompass computer vision, pattern recognition, and machine learning. He is an expert in tracking, video-based analysis of human motion and gesture, deformable shape matching and recognition, and image and video database indexing, retrieval, and data mining methods. Sclaroff developed one of the first content-based image retrieval systems for the Internet, called ImageRover, years before Google Image Search appeared. His recent work focuses on human tracking algorithms, analysis and identification of hand motion related to sign language, and filtering methods for multimedia retrieval. He co-leads the Image and Video Computing research group and has been recognized as a Fellow of the IEEE and of the IAPR.
Research topics
- Artificial Intelligence
- Computer Science
- Natural Language Processing
- Computer vision
- Machine Learning
- Algorithm
- Mathematics
Selected publications
Computer Vision and Image Understanding · 2025-12-17
article
Practical Disruption of Image Translation Deepfake Networks
Proceedings of the AAAI Conference on Artificial Intelligence · 2023-06-26 · 9 citations
article · Open access · Senior author
By harnessing the latest advances in deep learning, image-to-image translation architectures have recently achieved impressive capabilities. Unfortunately, the growing representational power of these architectures has prominent unethical uses. Among these are the threats of (1) face manipulation ("DeepFakes") used for misinformation or pornography, and (2) "DeepNude" manipulations of body images to remove clothes from individuals. Several works tackle the task of disrupting such image translation networks by inserting imperceptible adversarial attacks into the input image. Nevertheless, these works have limitations that may result in disruptions that are not practical in the real world. Specifically, most works generate disruptions in a white-box scenario, assuming perfect knowledge about the image translation network. The few remaining works that assume a black-box scenario require a large number of queries to successfully disrupt the adversary's image translation network. In this work we propose Leaking Transferable Perturbations (LTP), an algorithm that significantly reduces the number of queries needed to disrupt an image translation network by dynamically re-purposing previous disruptions into new query-efficient disruptions.
DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition With Limited Annotations
IEEE Transactions on Pattern Analysis and Machine Intelligence · 2023-12-25 · 25 citations
article
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
arXiv (Cornell University) · 2023-06-30
preprint · Open access
Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and generalization performance. For example, ViTs have interesting properties with respect to early layer non-local feature dependence, as well as self-attention mechanisms which enhance learning flexibility, enabling them to ignore out-of-context image information more effectively. We hypothesize that this power to ignore out-of-context information (which we name patch selectivity), while integrating in-context information in a non-local manner in early layers, allows ViTs to more easily handle occlusion. In this study, our aim is to see whether we can have CNNs simulate this ability of patch selectivity by effectively hardwiring this inductive bias using Patch Mixing data augmentation, which consists of inserting patches from another image onto a training image and interpolating labels between the two image classes. Specifically, we use Patch Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their ability to ignore out-of-context patches and handle natural occlusions. We find that ViTs do not improve nor degrade when trained using Patch Mixing, but CNNs acquire new capabilities to ignore out-of-context information and improve on occlusion benchmarks, leaving us to conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess. We will release our Patch Mixing implementation and proposed datasets for public use. Project page: https://arielnlee.github.io/PatchMixing/
Video Frame Interpolation with Many-to-many Splatting and Spatial Selective Refinement
arXiv (Cornell University) · 2023-10-29
preprint · Open access
In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from less effective representation capacity. To improve interpolation accuracy, we further extend an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows for trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method is able to improve the efficiency while maintaining competitive video interpolation quality, and it can be adjusted to use more or less compute as needed.
Video Frame Interpolation With Many-to-Many Splatting and Spatial Selective Refinement
IEEE Transactions on Pattern Analysis and Machine Intelligence · 2023-10-24 · 8 citations
article
In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from less effective representation capacity. To improve interpolation accuracy, we further extend an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows for trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method is able to improve the efficiency while maintaining competitive video interpolation quality, and it can be adjusted to use more or less compute as needed.
DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations
arXiv (Cornell University) · 2023-08-03
preprint · Open access
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
Many-to-many Splatting for Efficient Video Frame Interpolation
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2022-06-01 · 64 citations
article
Motion-based video frame interpolation commonly relies on optical flow to warp pixels from the inputs to the desired interpolation instant. Yet due to the inherent challenges of motion estimation (e.g. occlusions and discontinuities), most state-of-the-art interpolation approaches require subsequent refinement of the warped result to generate satisfying outputs, which drastically decreases the efficiency for multi-frame interpolation. In this work, we propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Specifically, given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step, and then fuse any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context. This establishes a many-to-many splatting scheme with robustness to artifacts like holes. Moreover, for each input frame pair, M2M only performs motion estimation once and has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. We conducted extensive experiments to analyze M2M, and found that it significantly improves the efficiency while maintaining high effectiveness.
Leveraging Geometric Structure for Label-Efficient Semi-Supervised Scene Segmentation
IEEE Transactions on Image Processing · 2022-01-01 · 6 citations
article
Label-efficient scene segmentation aims to achieve effective per-pixel classification with reduced labeling effort. Recent approaches for this task focus on leveraging unlabelled images by formulating consistency regularization or pseudo labels for individual pixels. Yet most of these methods ignore the 3D geometric structures naturally conveyed by image scenes, which are freely available for training segmentation models with better discrimination of image details. In this work, we present a novel Geometric Structure Refinement (GSR) framework to explicitly exploit the geometric structures of image scenes to enhance the semi-supervised training of segmentation models. In the training phase, we generate initial dense pseudo labels based on fast and coarse annotations, and then utilize the free unsupervised 3D reconstruction of the image scene to calibrate the dense pseudo labels with more reliable details. With the calibrated pseudo groundtruth, we are able to conveniently train any existing image segmentation models without increasing the costs of annotations or modifying the models' architectures. Moreover, we explore different strategies for allocating labeling effort in semi-supervised scene segmentation, and find that a combination of finely-labeled samples and coarsely-labeled samples performs better than the traditional dense-fine only annotations. Extensive experiments on datasets including Cityscapes and KITTI are conducted to evaluate our proposed methods. The results demonstrate that GSR can be easily applied to boost the performance of existing models such as PSPNet and DeepLabv3+ with reduced annotations. With half of the annotation effort, GSR achieves 99% of the accuracy of its fully supervised state-of-the-art counterparts.
Image Analysis and Processing. ICIAP 2022 Workshops
Lecture notes in computer science · 2022-01-01 · 11 citations
book · Senior author
Recent grants
Mining and Indexing Spatio-Temporal Patterns in Video Databases of Human Motion
NSF · $405k · 2003–2007
NSF · $404k · 2007–2011
Estimating and Recognizing 3D Articulated Motion via Uncalibrated Cameras
NSF · $403k · 2002–2006
NSF · $750k · 2010–2017
II-EN: Infrastructure for Gesture Interface Research Outside the Lab
NSF · $591k · 2009–2013
Frequent coauthors
- Vittorio Murino · 1174 shared
- Giovanni Maria Farinella · University of Catania · 1161 shared
- Sérgio Escalera · Computer Vision Center · 1160 shared
- Cosimo Distante · National Research Council · 1158 shared
- Emanuele Frontoni · University of Macerata · 1158 shared
- Marcos Ortega · 1156 shared
- Pierluigi Carcagnì · 1156 shared
- Fausto Milletarì · 1156 shared
Labs
Education
Ph.D.
MIT
Awards & honors
- Fellow of the IEEE
- Fellow of the IAPR