Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Kate Saenko

Kate Saenko

· ProfessorVerified

Boston University · Computer Science

Active 2004–2025

h-index89
Citations52.8k
Papers445209 last 5y
Funding$1.1M
See your match with Kate Saenko — sign in to PhdFit.Sign in

About

Professor Kate Saenko is a faculty member at the Department of Computer Science at Boston University, where she serves as a Professor and the director of the Computer Vision and Learning Group. She is also a member of the IVC Group. She received her PhD from MIT and has held various academic and research positions, including Assistant Professor at UMass Lowell, Postdoctoral Researcher at the International Computer Science Institute, Visiting Scholar at UC Berkeley EECS, and Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests encompass the broad area of Artificial Intelligence, with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Machine Learning
  • Natural Language Processing
  • Mathematics
  • Psychology
  • Cognitive psychology
  • Theoretical computer science
  • Computer vision

Selected publications

  • SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models

    2025-06-10

    article

    Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. We make two contributions. First, we find that VLM scores suffer from image- and prompt-specific biases, and that simple standardization is surprisingly effective at removing these and boosting MLR performance. And second, we introduce compound prompts grounded in realistic object combinations. Our analysis reveals "AND"/"OR" signal ambiguities that cause maximum compound scores to be surprisingly suboptimal compared to second-highest scores. We introduce an adaptive fusion method to address this issue. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR. Code can be found at https://github.com/kjmillerCURIS/SPARC.

  • SAM 3: Segment Anything with Concepts

    arXiv (Cornell University) · 2025-11-20 · 3 citations

    preprintOpen access

    We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

  • BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

    arXiv (Cornell University) · 2025-12-11

    preprintOpen access

    Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

  • ERM++: An Improved Baseline for Domain Generalization

    2025-02-26 · 8 citations

    article

    Domain Generalization (DG) aims to develop classifiers that can generalize to new, unseen data distributions, a critical capability when collecting new domain-specific data is impractical. A common DG baseline minimizes the empirical risk on the source domains. Recent studies have shown that this approach, known as Empirical Risk Minimization (ERM), can outperform most more complex DG methods when properly tuned. However, these studies have primarily focused on a narrow set of hyperparameters, neglecting other factors that can enhance robustness and prevent overfitting and catastrophic forgetting, properties which are critical for strong DG performance. In our investigation of training data utilization (i.e., duration and setting validation splits), initialization, and additional regularizers, we find that tuning these previously overlooked factors significantly improves model generalization across diverse datasets without adding much complexity. We call this improved, yet simple baseline ERM++. Despite its ease of implementation, ERM++ improves DG performance by over 5% compared to prior ERM baselines on a standard benchmark of 5 datasets with a ResNet-50 and over 15% with a ViT-B/16. It also outperforms all state-of-the-art methods on DomainBed datasets with both architectures. Importantly, ERM++ is easy to integrate into existing frameworks like DomainBed, making it a practical and powerful tool for researchers and practitioners. Overall, ERM++ challenges the need for more complex DG methods by providing a stronger, more reliable baseline that maintains simplicity and ease of use. Code is available at https://github.com/piotr-teterwak/erm_plusplus

  • SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models

    ArXiv.org · 2025-02-24

    preprintOpen access

    Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Using large language model insights on object co-occurrence, we introduce compound prompts grounded in realistic object combinations. Analysis of these prompt scores reveals VLM biases and ``AND''/``OR'' signal ambiguities, notably that maximum compound scores are surprisingly suboptimal compared to second-highest scores. We address these through a debiasing and score-fusion algorithm that corrects image bias and clarifies VLM response behaviors. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR.

  • Scaling Up Temporal Domain Generalization via Temporal Experts Averaging

    2025-01-01

    articleOpen access

    Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time.Prior work often addresses this by predicting future model weights.However, full model prediction is prohibitively expensive for even reasonably sized models.Thus, recent methods only predict the classifier layer, limiting generalization by failing to adjust other model components.To address this, we propose Temporal Experts Averaging (TEA), a novel and scalable TDG framework that updates the entire model using weight averaging to maximize generalization potential while minimizing computational costs.Our theoretical analysis guides us to two steps that enhance generalization to future domains.First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes.Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace.Expert's contributions are based on their projected proximity to future domains.Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings shows TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient 1 .

  • Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation

    Computer Vision and Image Understanding · 2025-12-17

    article
  • OP-LoRA: The Blessing of Dimensionality

    arXiv (Cornell University) · 2024-12-13

    preprintOpen access

    Low-rank adapters enable fine-tuning of large models with only a small number\nof parameters, thus reducing storage costs and minimizing the risk of\ncatastrophic forgetting. However, they often pose optimization challenges, with\npoor convergence. To overcome these challenges, we introduce an\nover-parameterized approach that accelerates training without increasing\ninference costs. This method reparameterizes low-rank adaptation by employing a\nseparate MLP and learned embedding for each layer. The learned embedding is\ninput to the MLP, which generates the adapter parameters. Such\noverparamaterization has been shown to implicitly function as an adaptive\nlearning rate and momentum, accelerating optimization. At inference time, the\nMLP can be discarded, leaving behind a standard low-rank adapter. To study the\neffect of MLP overparameterization on a small yet difficult proxy task, we\nimplement it for matrix factorization, and find it achieves faster convergence\nand lower final loss. Extending this approach to larger-scale tasks, we observe\nconsistent performance gains across domains. We achieve improvements in\nvision-language tasks and especially notable increases in image generation,\nwith CMMD scores improving by up to 15 points.\n

  • Koala: Key Frame-Conditioned Long Video-LLM

    2024-06-16 · 16 citations

    articleSenior author

    Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight and self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.

  • Learning to Compose SuperWeights for Neural Parameter Allocation Search

    2024-01-03 · 3 citations

    article

    Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not during training, resulting in reduced performance. To address this, we generate layer weights by learning to compose sets of SuperWeights, which represent a group of trainable parameters. These SuperWeights are created to be large enough so they can be used to represent any layer in the network, but small enough that they are computationally efficient. The second drawback we address is the method of measuring similarity between shared parameters. Whereas prior work compared the weights themselves, we argue this does not take into account the amount of conflict between the shared weights. Instead, we use gradient information to identify layers with shared weights that wish to diverge from each other. We demonstrate that our SuperWeight Networks consistently boost performance over the state-of-the-art on the ImageNet and CIFAR datasets in the NPAS setting. We further show that our approach can generate parameters for many network architectures using the same set of weights. This enables us to support tasks like efficient ensembling and anytime prediction, outperforming fully-parameterized ensembles with 17% fewer parameters <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sup> .

Recent grants

Frequent coauthors

Labs

Education

  • Ph.D.

    MIT

  • Other

    UMass Lowell

  • Other

    International Computer Science Institute

  • Other

    UC Berkeley EECS

  • Other

    School of Engineering and Applied Science at Harvard University

  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Kate Saenko

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup