Ugur Cetintemel
Khosrowshahi University Professor of Computer Science · Brown University · Computer Science
Active 1998–2025
Research topics
- Artificial Intelligence
- Computer Science
- Data Mining
- Algorithm
- Theoretical computer science
- Natural Language Processing
- Programming language
- Medicine
- Mathematics
- Parallel computing
- Database
- Cardiology
- Radiology
Selected publications
Continuous Prompts: LLM-Augmented Pipeline Processing over Unstructured Streams
ArXiv.org · 2025-12-03
preprint · Open access · Senior author
Monitoring unstructured streams increasingly requires persistent, semantics-aware computation, yet today's LLM frameworks remain stateless and one-shot, limiting their usefulness for long-running analytics. We introduce Continuous Prompts (CPs), the first framework that brings LLM reasoning into continuous stream processing. CPs extend RAG to streaming settings, define continuous semantic operators, and provide multiple implementations, primarily focusing on LLM-based approaches but also reporting one embedding-based variant. Furthermore, we study two LLM-centric optimizations, tuple batching and operator fusion, to significantly improve efficiency while managing accuracy loss. Because these optimizations inherently trade accuracy for speed, we present a dynamic optimization framework that uses lightweight shadow executions and cost-aware multi-objective Bayesian optimization (MOBO) to learn throughput-accuracy frontiers and adapt plans under probing budgets. We implement CPs in the VectraFlow stream processing system. Using operator-level microbenchmarks and streaming pipelines on real datasets, we show that VectraFlow can adapt to workload dynamics, navigate accuracy-efficiency trade-offs, and sustain persistent semantic queries over evolving unstructured streams.
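The tuple-batching idea mentioned in the abstract can be illustrated with a minimal sketch: several stream tuples are packed into one prompt so that a single model call answers for all of them. This is not the VectraFlow implementation. The names `batched`, `continuous_filter`, and `keyword_stub` are hypothetical, and the model is replaced by a keyword-matching stand-in so the example runs without any LLM service.

```python
from typing import Callable, Iterable, Iterator, List


def batched(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group consecutive stream tuples into lists of at most batch_size."""
    batch: List[str] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def continuous_filter(
    stream: Iterable[str],
    question: str,
    llm: Callable[[str], List[bool]],
    batch_size: int = 4,
) -> Iterator[str]:
    """Yield tuples for which the model answers True, one call per batch.

    Assumption: `llm` takes one prompt and returns one boolean per
    numbered tuple in that prompt. Tuples must not contain newlines.
    """
    for batch in batched(stream, batch_size):
        numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(batch))
        answers = llm(question + "\n" + numbered)
        for text, keep in zip(batch, answers):
            if keep:
                yield text


def keyword_stub(prompt: str) -> List[bool]:
    """Stand-in for a model call: flags tuples that mention 'outage'."""
    tuple_lines = prompt.split("\n")[1:]  # skip the question line
    return ["outage" in line.lower() for line in tuple_lines]
```

With `batch_size=2`, five incoming tuples cost three calls instead of five. The accuracy side of the trade-off described in the abstract does not show up with a deterministic stub; it would only appear with a real model.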
Meta-Radiology · 2025-04-29 · 3 citations
article · Open access
In this study, we use large language models (LLMs) to integrate information from multi-source medical reports to enhance the accuracy of automated diagnostic classification and prognosis for brain tumors. Brain MRI reports from a cohort of 426 brain tumor patients were manually labeled for tumor presence and stability. Pathology reports from the same cohort were incorporated as an additional information source. A pre-trained LLM was used to extract features from the multi-source reports, and a multi-layer perceptron (MLP) was trained for the classification tasks. Model performance was evaluated on the test set using Micro F1 scores and AUROCs. The model’s zero-shot prognostic capability was validated on an independent cohort of 33 glioblastoma patients. The model reached Micro F1 scores of 0.849 (95% CI: 0.814, 0.880) for tumor presence classification and 0.929 (95% CI: 0.904, 0.954) for tumor stability classification. Compared to using radiology reports alone, the model improved Micro F1 by 10.4% for tumor presence and 5.6% for stability classification. Log-rank tests confirmed a significant distinction between the high- and low-risk patient groups stratified by the model-predicted “Tumor Stability” label (p-value = 0.017), confirming the prognostic value of the model-generated labels. This study developed a multi-source integration model based on LLMs for automated diagnostic classification and zero-shot prognosis of brain tumors. The integration of multi-source reports improved classification accuracy compared to single-source reports. Predicted tumor stability labels demonstrated survival prognostic capabilities. These findings confirm the potential of LLMs in brain tumor research, supporting precision diagnostics and prognosis.
- The study integrates MRI and pathology data, highlighting the value of multi-source cancer information.
- The study demonstrates LLMs' ability to bridge modalities in cancer diagnosis and prognosis applications.
- The study confirms the prognostic value of automated diagnostic labels using survival correlation and log-rank tests.
Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems
Proceedings of the VLDB Endowment · 2025-07-01 · 2 citations
article · Open access · Senior author
AI-augmented data processing systems (DPSs) integrate large language models (LLMs) into query pipelines, allowing powerful semantic operations on structured and unstructured data. However, the reliability (a.k.a. trust) of these systems is fundamentally challenged by the potential for LLMs to produce errors, limiting their adoption in critical domains. To help address this reliability bottleneck, we introduce semantic integrity constraints (SICs), a declarative abstraction for specifying and enforcing correctness conditions over LLM outputs in semantic queries. SICs generalize traditional database integrity constraints to semantic settings, supporting common types of constraints, such as grounding, soundness, and exclusion, with both reactive and proactive enforcement strategies. We argue that SICs provide a foundation for building reliable and auditable AI-augmented data systems. Specifically, we present a system design for integrating SICs into query planning and runtime execution and discuss its realization in AI-augmented DPSs. To guide and evaluate our vision, we outline several design goals (covering criteria around expressiveness, runtime semantics, integration, performance, and enterprise-scale applicability) and discuss how our framework addresses each, along with open research challenges.
Leveraging Large Language Models to Detect Protected Health Information: Does Context Matter?
SSRN Electronic Journal · 2024-01-01
preprint · Open access
Journal of Stroke and Cerebrovascular Diseases · 2022-09-14 · 10 citations
article
The Case for In-Memory OLAP on "Wimpy" Nodes
2021-04-01 · 4 citations
article · Senior author
Research projects will often use the latest hardware to achieve orders-of-magnitude performance improvements while ignoring the (usually hefty) associated price tag. Real-world deployments typically follow suit, requiring expensive computing infrastructures that cost even more to power and cool. In this paper, we challenge the conventional wisdom that high-end hardware is absolutely necessary for state-of-the-art performance and instead advocate for a radically different approach based on cheap single-board computers (SBCs). While others have previously explored similar ideas for computationally simple and easily partitionable use cases (e.g., key-value stores), so-called "wimpy" nodes have traditionally been rejected as unsuitable for more complex workloads. We believe, however, that recent hardware advancements driven by the mobile computing market call this orthodoxy into question. For example, our microbenchmarks show that one popular SBC, the Raspberry Pi 3B+, offers single-core compute performance that is surprisingly competitive with many server-grade Intel Xeon and ARM-based CPUs at a fraction of the cost and energy consumption. To make our case, we conducted an extensive experimental study, beginning with a series of microbenchmarks to identify the strengths and weaknesses of SBCs relative to server-grade CPUs. Then, to evaluate the ability of SBCs to handle more complex use cases, we analyzed the performance of an in-memory OLAP workload in both single-node and distributed settings. Overall, our results demonstrate up to several orders of magnitude in cost reductions coupled with substantial energy savings when compared to traditional on-premises and cloud deployments, all without a significant increase in absolute runtimes.
Odlaw: A Tool for Retroactive GDPR Compliance
2021-04-01 · 6 citations
article · Senior author
In this demo, we present ODLAW, a new tool for retroactive compliance with privacy laws like the European Union's General Data Protection Regulation (GDPR). The GDPR enumerates the explicit rights of individuals regarding the use of their personal data, and regulators can impose strict penalties for organizations that fail to comply. While others have advocated for a completely new class of systems to address these regulations, ODLAW takes a different approach by achieving GDPR compliance while allowing an organization to keep its existing data management infrastructure intact. Using a variety of realistic datasets, the demo will show the specific ways that ODLAW can help with GDPR compliance, as well as highlight some of the key challenges that arise in real-world settings.
Dynamic Query Refinement for Interactive Data Exploration
Movebank · 2020-01-01 · 1 citation
article · Open access
DeepSqueeze: Deep Semantic Compression for Tabular Data
2020 · 28 citations
Senior author · Corresponding
- Computer Science
- Data Mining
With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time. We propose DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.
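For readers unfamiliar with the columnar techniques the abstract names as the baseline, the following minimal sketch shows dictionary, run-length, and delta encoding applied to single columns. It illustrates only those per-column encodings, not DeepSqueeze or its autoencoder; the function names are chosen for illustration.

```python
from itertools import accumulate, groupby


def dictionary_encode(column):
    """Replace each value with a small integer code (first-seen order)."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in column]
    dictionary = list(codes)  # position in this list is the code
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    return [dictionary[code] for code in encoded]


def run_length_encode(column):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def run_length_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(column):
    """Store the first value, then each difference from its predecessor."""
    if not column:
        return []
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]


def delta_decode(deltas):
    """Running sums of the deltas restore the original values."""
    return list(accumulate(deltas))
```

Each function sees one column at a time, which is the limitation the abstract points out: none of these encodings can exploit a relationship such as one column determining another.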
The Case for a Learned Sorting Algorithm
2020 · 44 citations
- Computer Science
- Algorithm
Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge-join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38x performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49x improvement over sequential Radix Sort, and 5.54x improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.
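The two phases described in the abstract (predict an output position from an approximate CDF, then repair the nearly sorted result) can be sketched as follows. This is a simplified illustration, not the authors' algorithm: a sorted random sample stands in for the learned CDF model, keys that map to the same position are kept in small buckets, and the names `learned_sort`, `sample_size`, and `seed` are assumptions made for the example.

```python
import bisect
import random


def learned_sort(keys, sample_size=1000, seed=0):
    """Distribution sort driven by an approximate empirical CDF."""
    keys = list(keys)
    n = len(keys)
    if n < 2:
        return keys

    # Stand-in "model": a sorted sample approximates the empirical CDF.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, n)))
    m = len(sample)

    def cdf(key):
        # Fraction of sampled keys that are <= key, a value in [0, 1].
        return bisect.bisect_right(sample, key) / m

    # Phase 1: map each key to its predicted position in the output.
    buckets = [[] for _ in range(n)]
    for key in keys:
        position = min(int(cdf(key) * n), n - 1)
        buckets[position].append(key)
    out = [key for bucket in buckets for key in bucket]

    # Phase 2: insertion sort, which is cheap on nearly sorted input.
    for i in range(1, n):
        current = out[i]
        j = i - 1
        while j >= 0 and out[j] > current:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = current
    return out
```

Because the approximate CDF is monotone, a smaller key never lands in a later bucket than a larger one, so only keys within the same bucket can be out of order and the insertion-sort pass has little work to do. The speedups quoted in the abstract come from the authors' optimized implementation and are not something this sketch reproduces.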
Recent grants
III: Medium: Longview: Querying the Future Now
NSF · $1.2M · 2009–2013
III: Small: BigSolver: Data-Intensive Solver Support for Big Data Exploration and Mining
NSF · $500k · 2015–2019
CAREER: Infrastructures for Sensor-based Data-centric Monitoring Applications
NSF · $500k · 2005–2011
ITR Collaborative Proposal: Aurora - Enabling Stream-Based Monitoring Applications
NSF · $1.8M · 2003–2009
Frequent coauthors
- 47 shared
Yanif Ahmad
Johns Hopkins University
- 46 shared
Stan Zdonik
Brown University
- 34 shared
Tim Kraska
Amazon (United States)
- 28 shared
Mert Akdere
Brown University
- 28 shared
Eli Upfal
- 27 shared
Carsten Binnig
Technical University of Darmstadt
- 27 shared
Olga Papaemmanouil
- 26 shared
Jeong-Hyon Hwang
University at Albany, State University of New York