
Roni Sengupta
3D Vision and Computational Photography · University of North Carolina at Chapel Hill · Computer Science
Active 2023–2026
About
Roni Sengupta is an Assistant Professor in the Department of Computer Science at UNC Chapel Hill and leads the SPIN Lab at UNC. His research lies at the intersection of Computer Vision and Computer Graphics, with a focus on 3D Vision and Computational Photography. His lab develops AI techniques for understanding spatial and physical properties from images and videos, including geometry, motion, material reflectance, material deformation properties, and lighting, through inverse physics. These methods are applied to advancing immersive media such as AR/VR, telepresence, and content creation, as well as healthcare, robotics, and physical sciences. Sengupta's work involves explicit estimation of scene properties via inverse problems leveraging foundation AI models, and implicit manipulation of these properties using generative AI, with a goal to improve both fundamental methods and practical applications across various domains.
Research topics
- Computer Science
- Artificial Intelligence
- Computer Security
- Computer vision
- Multimedia
- Computer graphics (images)
- Human–computer interaction
Selected publications
Understanding Model Behavior in Monocular Polyp Sizing
arXiv (Cornell University) · 2026-05-19
preprint · Open access · Senior author
Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.
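To illustrate why a missing metric reference matters for the 5 mm threshold, here is a minimal sketch based on the standard pinhole-camera relation (physical extent = pixel extent × depth / focal length). It is not taken from the paper's code; the function names and all numbers are hypothetical and chosen only for illustration.

```python
def metric_extent_mm(pixel_extent, depth_mm, focal_px):
    """Pinhole-camera estimate of an object's physical extent.

    pixel_extent: apparent size in pixels (e.g. the longest mask diameter)
    depth_mm:     distance from the camera to the object, in millimetres
    focal_px:     focal length expressed in pixels
    """
    return pixel_extent * depth_mm / focal_px


def size_class(extent_mm, threshold_mm=5.0):
    """Binary stratification from the abstract: <=5 mm vs. >5 mm."""
    return ">5 mm" if extent_mm > threshold_mm else "<=5 mm"


# Hypothetical values, for illustration only.
pixel_extent, focal_px = 120.0, 600.0
true_depth_mm = 20.0

# Relative depth is only known up to an unknown scale factor s,
# so the same image evidence can land on either side of the threshold.
for s in (1.0, 2.0):
    est = metric_extent_mm(pixel_extent, s * true_depth_mm, focal_px)
    print(f"scale={s}: {est:.1f} mm -> {size_class(est)}")
```

With these assumed numbers, doubling the unknown depth scale moves the estimate from 4 mm to 8 mm and flips the predicted class, which is the kind of ambiguity an oracle-scale experiment removes.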
Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy
arXiv (Cornell University) · 2026-04-30
preprint · Open access · Senior author
Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creating CT-to-body divergence that limits localization accuracy. In practice, this is mitigated through breath-hold protocols, which attempt to match the intraoperative anatomy to a static CT, but are difficult to reproduce and disrupt clinical workflow. We propose to eliminate the need for breath-hold protocols by leveraging patient-specific respiratory modeling. Paired inhale-exhale CT scans, already acquired for planning, implicitly define the patient-specific deformation space of the breathing airway. By registering these scans, we reduce respiratory motion to a single scalar breathing phase per frame, constraining all reconstructions to anatomically observed configurations. We embed this representation within a mesh-anchored Gaussian splatting framework, where a lightweight estimator infers breathing phase directly from endoscopic RGB, enabling continuous, deformation-aware reconstruction throughout the respiratory cycle without breath-holds or external sensing. To enable quantitative evaluation, we introduce RESPIRE, a physically grounded bronchoscopy simulation pipeline with per-frame ground truth for geometry, pose, breathing phase, and deformation. Experiments on RESPIRE show that our approach achieves geometrically faithful reconstruction, over 20x faster training, and 1.22 mm target localization accuracy (within the 3 mm clinically relevant tolerance), outperforming unconstrained single-CT baselines. Please check out our website for additional visuals: https://asdunnbe.github.io/RESPIRE/
TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection
2026-03-06
article · Senior author
The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a new benchmark designed to address this gap, featuring talking-head videos from six modern generators, with an additional two emerging generators used exclusively for testing generalization. The dataset is built on an expert-led curation process that filters over 60% of samples to remove videos with noticeable artifacts, presenting a more difficult challenge for detectors. Our evaluation protocols are designed to measure generalization across identity and generator shifts. Benchmarking seven state-of-the-art detectors reveals that models with high accuracy on older datasets like FaceForensics++ show a significant performance drop on our curated data, particularly at strict false positive rates (e.g., TPR@FPR=0.1%). In addition, we identify a trend where detectors focus on background cues instead of facial features using Grad-CAM visualization. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques. We release our benchmark and dataset with all data splits and protocols at https://anaxqx.github.io/talkingheadbench.github.io.
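The TPR@FPR metric mentioned in the abstract can be sketched as follows: among all score thresholds whose false positive rate stays within the budget, report the best true positive rate. This is a generic pure-Python sketch of that common definition, not the benchmark's evaluation code; the function name, the score convention (higher means more likely fake), and the toy data are assumptions.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Best true positive rate achievable with FPR <= max_fpr.

    scores: detector outputs, higher = more likely fake (assumed convention)
    labels: 1 for fake (positive), 0 for real (negative)
    A sample is flagged as fake when score >= threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    best = 0.0  # flagging nothing always satisfies the FPR budget
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(s >= t for s in pos) / len(pos))
    return best


# Toy scores, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(tpr_at_fpr(scores, labels, max_fpr=0.0))
print(tpr_at_fpr(scores, labels, max_fpr=0.34))
```

Note that a budget of 0.1% only becomes distinguishable from a budget of zero once the test set holds at least 1,000 real (negative) videos, since one false alarm among fewer negatives already exceeds it.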
MyTimeMachine: Personalized Facial Age Transformation
UNC Libraries · 2025-08-07
article · Open access · Senior author
Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior to predict aging for any individual accurately. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person's appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20~40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), a method that combines a global aging prior with a personalized photo collection (ranging from as few as 10 images, ideally 50) to learn individualized age transformations. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network with personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our method demonstrates strong performance on fair-use imagery of widely recognizable individuals, producing photorealistic and identity-consistent age transformations that generalize well across diverse appearances. It also extends naturally to video, delivering high-quality, temporally consistent results that closely resemble actual appearances at target ages, outperforming state-of-the-art approaches.
The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion
ArXiv.org · 2025-06-26
preprint · Open access · Senior author
We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.
ScribbleLight: Single Image Indoor Relighting with Scribbles
2025-06-10 · 1 citation
article · Senior author
Image-based relighting of indoor rooms creates an immersive virtual understanding of the space, which is useful for interior design, virtual staging, and real estate. Relighting indoor rooms from a single image is especially challenging due to complex illumination interactions between multiple lights and cluttered objects featuring a large variety in geometrical and material complexity. Recently, generative models have been successfully applied to image-based relighting conditioned on a target image or a latent code, albeit without detailed local lighting control. In this paper, we introduce ScribbleLight, a generative model that supports local fine-grained control of lighting effects through scribbles that describe changes in lighting. Our key technical novelty is an Albedo-conditioned Stable Image Diffusion model that preserves the intrinsic color and texture of the original image after relighting and an encoder-decoder-based ControlNet architecture that enables geometry-preserving lighting effects with normal map and scribble annotations. We demonstrate ScribbleLight’s ability to create different lighting effects (e.g., turning lights on/off, adding highlights, cast shadows, or indirect lighting from unseen lights) from sparse scribble annotations.
ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation
ArXiv.org · 2025-06-05
preprint · Open access · Senior author
Neural rendering has advanced significantly in 3D reconstruction and novel view synthesis, and integrating physics into these frameworks opens new applications such as physically accurate digital twins for robotics and XR. However, the inverse problem of estimating physical parameters from visual observations remains challenging. Existing physics-aware neural rendering methods typically require dense multi-view videos, making them impractical for scalable, real-world deployment. Under sparse-view settings, the sequential optimization strategies employed by current approaches suffer from severe error accumulation: inaccuracies in initial 3D reconstruction propagate to subsequent stages, degrading physical state and material parameter estimates. On the other hand, simultaneous optimization of all parameters fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually expands the set of jointly optimized parameters. This design enables physics-informed gradients to refine geometry while avoiding the instability of direct joint optimization over all parameters. Evaluations on synthetic and real-world datasets demonstrate that ProJo4D substantially outperforms prior work in 4D future state prediction and physical parameter estimation, achieving up to 10x improvement in geometric accuracy while maintaining computational efficiency. Please visit the project webpage: https://daniel03c1.github.io/ProJo4D/
PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation
ArXiv.org · 2025-04-23
preprint · Open access · Senior author
Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.
Frequent coauthors
- Marc Niethammer · 6 shared
- Jun Myeong Choi · 4 shared
- Ziheng Wang · 4 shared
- Shengze Wang · 4 shared
- Henry Fuchs (Columbia University) · 4 shared
- Ryan Schmelzle (University of North Carolina at Chapel Hill) · 4 shared
- Liujie Zheng (Huazhong University of Science and Technology) · 4 shared
- Akshay Paruchuri · 4 shared
Awards & honors
- NIH NIBIB Trailblazer Award for New and Early Stage Investig…
- UNC Junior Faculty Development Award (2024)
- UNC CS Student Association Excellence in Teaching Award (202…
- CVPR Best Student Paper Honorable Mentions (2021)