
Andrew Owens
Verified · University of Michigan · Computer Science
Active 2008–2025
About
Andrew Owens is an associate professor of computer science at Cornell Tech and the Cornell Ann S. Bowers College of Computing and Information Science. His research aims to create multimodal systems that learn to see, hear, and touch without human-labeled training data. Instead, these systems learn from co-occurring sensory signals, such as the correlations between the visual and audio streams of a video. His work has enabled applications that include producing soundtracks for silent videos, robotic manipulation with vision and touch, detecting AI-generated images, and generating visual illusions. Owens is a recipient of a Sloan Research Fellowship and an NSF CAREER Award. Prior to joining Cornell, he was an assistant professor at the University of Michigan and a postdoctoral scholar at the University of California, Berkeley. He received a Ph.D. in computer science from the Massachusetts Institute of Technology in 2016 and a B.A. in computer science from Cornell University in 2010.
Research topics
- Computer Science
- Artificial Intelligence
- Computer vision
Selected publications
SSRN Electronic Journal · 2025-01-01
preprint · Open access
GPS as a Control Signal for Image Generation
2025-06-10 · 2 citations
article · Senior author
We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
Contrastive Touch-to-Touch Pretraining
2025-05-19 · 1 citation
article
Today's tactile sensors have a variety of different designs, making it challenging to develop general-purpose methods for processing touch signals. In this paper, we learn a unified representation that captures the shared information between different tactile sensors. Unlike current approaches that focus on reconstruction or task-specific supervision, we leverage contrastive learning to integrate tactile signals from two different sensors into a shared embedding space, using a dataset in which the same objects are probed with multiple sensors. We apply this approach to paired touch signals from GelSlim and Soft Bubble sensors. We show that our learned features provide strong pretraining for downstream pose estimation and classification tasks. We also show that our embedding enables models trained using one touch sensor to be deployed using another without additional training. Project details can be found at https://www.mmintlab.com/research/cttp/.
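As a rough illustration of what a shared embedding space trained with contrastive learning involves, the sketch below computes a symmetric InfoNCE loss over paired embeddings from two sensors. The abstract does not state which contrastive objective is used, so InfoNCE, the function name `symmetric_info_nce`, and the temperature value are all assumptions made for illustration, and the random arrays merely stand in for encoder outputs.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Hypothetical contrastive loss for paired embeddings.

    emb_a[i] and emb_b[i] are assumed to come from two different sensors
    touching the same object at the same contact (a positive pair); every
    other row in the batch is treated as a negative.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) scaled cosine similarities
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

# Stand-in embeddings; in practice these would be encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = symmetric_info_nce(emb, emb)
loss_mismatched = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))
```

With correctly paired rows the loss is low, and it rises when the pairing is scrambled, which is the signal that pulls the two sensors' embeddings together.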
Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
2025-06-10 · 3 citations
article · Senior author
One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior work. As part of creating this dataset, we systematically download thousands of text-to-image latent diffusion models and sample images from them. We also collect images from dozens of popular open-source and commercial models. The resulting dataset contains 2.7M images that have been sampled from 4803 different models. These images collectively capture a wide range of scene content, generator architectures, and image processing settings. Using this dataset, we study the generalization abilities of fake image detectors. Our experiments suggest that detection performance improves as the number of models in the training set increases, even when these models have similar architectures. We also find that increasing the diversity of the models improves detection performance, and that our trained detectors generalize better than those trained on other datasets. The dataset can be found at https://jespark.net/projects/2024/community_forensics
Supervising Sound Localization by In-the-wild Egomotion
2025-06-10
article · Senior author
We present a method for learning binaural sound localization using egomotion as a supervisory signal. Over the course of a video, the camera’s direction to a sound source will change as the camera moves. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using traditional methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this method, we propose a dataset of real-world audio-visual videos with egomotion. We show that our model can successfully learn from real-world data and that it performs well on sound localization tasks.
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
2024-06-16 · 13 citations
article · Senior author
We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image, and then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram: an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/.
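The noise-combination step described in this abstract can be sketched in a few lines. The abstract says only that noise estimates from different views are combined; mapping each estimate back to the original frame with the inverse view and then averaging is an assumption here, as are the names `combine_noise_estimates`, `rotate_90`, and the `noise_predictors` callables, which stand in for a diffusion model conditioned on one prompt per view.

```python
import numpy as np

def rotate_90(image):
    # A pixel permutation (hence an orthogonal transformation) of an H x W x C image.
    return np.rot90(image, k=1, axes=(0, 1))

def rotate_90_inverse(image):
    return np.rot90(image, k=-1, axes=(0, 1))

def combine_noise_estimates(noisy_image, views, inverse_views, noise_predictors):
    """Hypothetical single-step combination of per-view noise estimates.

    Each predictor sees the noisy image under its own view; the estimate is
    mapped back to the original frame and the results are averaged
    (averaging is an assumption for this sketch).
    """
    estimates = []
    for view, inverse, predict in zip(views, inverse_views, noise_predictors):
        eps_in_view = predict(view(noisy_image))   # estimate in the transformed frame
        estimates.append(inverse(eps_in_view))     # bring it back to the original frame
    return np.mean(estimates, axis=0)

# Toy usage with placeholder predictors in place of a real diffusion model.
rng = np.random.default_rng(0)
x_t = rng.normal(size=(4, 4, 3))
identity = lambda image: image
eps = combine_noise_estimates(
    x_t,
    views=[identity, rotate_90],
    inverse_views=[identity, rotate_90_inverse],
    noise_predictors=[identity, identity],
)
```

Rotations, flips, and jigsaw rearrangements only move pixels around, so they preserve the norm of the noise, which is consistent with the abstract's point that the method works for orthogonal views.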
Efficient Vision-Language Pre-Training by Cluster Masking
2024-06-16 · 4 citations
article · Senior author
We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
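One way to picture the masking step is the sketch below, which picks a few random anchor patches and masks every patch whose raw pixels are similar to an anchor. The abstract specifies only that clusters of visually similar patches are masked based on raw pixel intensities; the use of cosine similarity, the threshold, and the names `cluster_mask`, `num_anchors`, and `sim_threshold` are assumptions for illustration.

```python
import numpy as np

def cluster_mask(patches, num_anchors, sim_threshold, rng):
    """Hypothetical cluster masking over raw-pixel patches.

    patches: array of shape (N, ...) holding N image patches.
    Returns a boolean mask of shape (N,) (True = masked) and the anchor indices.
    """
    flat = patches.reshape(len(patches), -1).astype(float)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    anchors = rng.choice(len(flat), size=num_anchors, replace=False)
    similarity = flat @ flat[anchors].T            # (N, num_anchors) cosine similarity
    mask = (similarity >= sim_threshold).any(axis=1)
    return mask, anchors

# Toy usage: 16 random 4x4 RGB patches.
rng = np.random.default_rng(0)
patches = rng.random(size=(16, 4, 4, 3))
mask, anchors = cluster_mask(patches, num_anchors=2, sim_threshold=0.9, rng=rng)
visible_patches = patches[~mask]   # only these would be passed to the image encoder
```

Dropping the masked patches before encoding is what reduces the amount of data processed per image.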
Math Horizons · 2024-08-29
article · Senior author
Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
arXiv (Cornell University) · 2024-11-06
preprint · Open access · Senior author
One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior work. As part of creating this dataset, we systematically download thousands of text-to-image latent diffusion models and sample images from them. We also collect images from dozens of popular open-source and commercial models. The resulting dataset contains 2.7M images that have been sampled from 4803 different models. These images collectively capture a wide range of scene content, generator architectures, and image processing settings. Using this dataset, we study the generalization abilities of fake image detectors. Our experiments suggest that detection performance improves as the number of models in the training set increases, even when these models have similar architectures. We also find that detection performance improves as the diversity of the models increases, and that our trained detectors generalize better than those trained on other datasets. The dataset can be found at https://jespark.net/projects/2024/community_forensics
Contrastive Touch-to-Touch Pretraining
arXiv (Cornell University) · 2024-10-15
preprint · Open access
Today's tactile sensors have a variety of different designs, making it challenging to develop general-purpose methods for processing touch signals. In this paper, we learn a unified representation that captures the shared information between different tactile sensors. Unlike current approaches that focus on reconstruction or task-specific supervision, we leverage contrastive learning to integrate tactile signals from two different sensors into a shared embedding space, using a dataset in which the same objects are probed with multiple sensors. We apply this approach to paired touch signals from GelSlim and Soft Bubble sensors. We show that our learned features provide strong pretraining for downstream pose estimation and classification tasks. We also show that our embedding enables models trained using one touch sensor to be deployed using another without additional training. Project details can be found at https://www.mmintlab.com/research/cttp/.
Frequent coauthors
- Alexei A. Efros · 34 shared
- Sheng-Yu Wang · Carnegie Mellon University · 13 shared
- William T. Freeman · 11 shared
- Shiry Ginosar · 10 shared
- Przemysław Prusinkiewicz · University of Calgary · 8 shared
- Mikolaj Cieslak · University of Calgary · 8 shared
- Oliver Wang · Adobe Systems (United States) · 8 shared
- Richard Zhang · 8 shared
Awards & honors
- Sloan Research Fellowship
- NSF CAREER Award