
Zachary Ives
Assistant Professor · University of Pennsylvania · Computer and Information Science
Active 1998–2025
Research topics
- Computer Science
- Information Retrieval
- Programming language
- Computer Security
- Data Mining
- Database
- Artificial Intelligence
- Natural Language Processing
- Archaeology
- Theoretical computer science
- Geology
- Mathematics
- Engineering
- Data science
Selected publications
Implementing Views for Property Graphs
ACM SIGMOD Record · 2025-04-28 · 1 citation
article · Senior author
Property graph databases are increasingly used to integrate heterogeneous data, motivating graph views to abstract, simplify, and unify the data, e.g., to capture individual-level vs. organization-level relationships. This paper considers the tasks of implementing such views using rewriting techniques, both using existing property graph DBMSs and converting to relational RDBMSs. We consider both virtual and materialized views, ways of rewriting queries, and structures for indexing data. We also note a common use case of graph views, which involves preserving a graph except for minor local transformations; we develop novel extensions and semantics for this. We evaluate and compare the performance of our techniques under a variety of workloads, and we compare existing graph and relational DBMS platforms.
Low Rank Learning for Offline Query Optimization
Proceedings of the ACM on Management of Data · 2025-06-17 · 2 citations
article · Open access
Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce LimeQO, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing "hot" path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS.
Data-Agnostic Cardinality Learning from Imperfect Workloads
Proceedings of the VLDB Endowment · 2025-04-01
article · Open access · Senior author
Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data access. Fortunately, query-driven cardinality estimation can learn CardEst models using query workloads. However, existing query-driven models often require access to data or summaries for best performance, and they assume perfect training workloads with complete and balanced join templates (or join graphs). Such assumptions rarely hold in real-world scenarios, in which join templates are incomplete and imbalanced. We present GRASP, a data-agnostic cardinality learning system designed to work under these real-world constraints. GRASP's compositional design generalizes to unseen join templates and is robust to join template imbalance. It also introduces a new per-table CardEst model that handles value distribution shifts for range predicates, and a novel learned count sketch model that captures join correlations across base relations. Across three database instances, we demonstrate that GRASP consistently outperforms existing query-driven models on imperfect workloads, both in terms of estimation accuracy and query latency. Remarkably, GRASP achieves performance comparable to, or even surpassing, traditional approaches built over the underlying data on the complex CEB-IMDb-full benchmark, despite operating without any data access and using only 10% of all possible join templates.
A Practical Theory of Generalization in Selectivity Learning
Proceedings of the VLDB Endowment · 2025-02-01
article · Senior author
Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
QuoteInspector: Gaining Insight about Social Media Discussions
Proceedings of the VLDB Endowment · 2024-08-01
article · Senior author
Our greatest source of insight into the real world today is via social media. Here, a major statement or quote by a public figure (world leader, politician, celebrity, scientist) can have wide-ranging impact, igniting extensive discussions and triggering reactions. It would be helpful to have tools for monitoring, querying, and inspecting the "flow" of social discourse. We introduce QuoteInspector, a system uniquely designed for efficient tracking and analysis of social media discussions around quotes. QuoteInspector leverages modern text embeddings and employs a clustering-based methodology for extracting topics from posts; it further integrates various NLP techniques for in-depth cluster analysis. Additionally, the system enhances the user experience by combining keyword- and relationship-based (structured) search for efficient and precise quote retrieval.
Searching Data Lakes for Nested and Joined Data
Proceedings of the VLDB Endowment · 2024-07-01
article · Senior author
Exploratory data science is driving new platforms that assist data scientists with everyday tasks, such as integration and wrangling, to assemble training datasets. Such tools take scientists' work-in-progress data as a search object (table or JSON) and find relevant supplementary data from an organizational data lake, which can be unioned or joined with the current data. Existing data lake search tools find single, relational tables to match or join with a search object. Yet many data science applications revolve around hierarchical data, which can only be matched by creating views that simultaneously join and transform several tables in the data lake. In this paper, we extend the Juneau data lake search system [46] for this broader class of matches at scale. Our contribution is a general framework for efficiently merging ranked results to match hierarchical data, leveraging novel techniques for indexing and sketching, and incorporating existing single-table search techniques and ranking functions. We experimentally validate our methods' benefits and broad applicability using real data from data science computational notebooks. Our results indicate that, with different ranking functions, our approach can return the optimal set of views up to 4.8x faster and 43% more related compared to heuristics, and increase the data domain coverage by up to 28%. In a case study to show the utility of our results to data science downstream tasks, we reduce regression error by up to 6.6%, and improve classification accuracy by up to 19.5%.
A Practical Theory of Generalization in Selectivity Learning
ArXiv.org · 2024-09-11
preprint · Open access · Senior author
Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
Low Rank Approximation for Learned Query Optimization
2024-05-17 · 1 citation
article
We present LimeQO, a learned steering query optimizer based on linear methods, such as matrix completion, for repetitive workloads. LimeQO can forgo expensive neural networks by taking advantage of the low-rank structure of query workloads. Using offline execution, LimeQO can accelerate workloads by up to 2x with zero regressions in just a few hours, while using 100-1000x fewer computational resources than deep learning techniques.
Modeling Shifting Workloads for Learned Database Systems
Proceedings of the ACM on Management of Data · 2024-03-12 · 10 citations
article · Senior author
Learned database systems address several weaknesses of traditional cost estimation techniques in query optimization: they learn a model of a database instance, e.g., as queries are executed. However, when the database instance has skew and correlation, it is nontrivial to create an effective training set that anticipates workload shifts, where query structure changes and/or different regions of the data contribute to query answers. Our predictive model may perform poorly with these out-of-distribution inputs. In this paper, we study how the notion of a replay buffer can be managed through online algorithms to build a concise yet representative model of the workload distribution, allowing for rapid adaptation and effective prediction of cardinalities and costs. We experimentally validate our methods over several data domains.
Implementation Strategies for Views over Property Graphs
Proceedings of the ACM on Management of Data · 2024-05-29 · 8 citations
article · Senior author
The need to query complex interactions and relationships has motivated interest in property graph database platforms. For some graph applications, graph views are required to abstract the data, e.g., to capture individual-level vs. organization-level relationships, or to show single computational steps vs. composite workflows. Emerging efforts to standardize graph query languages have developed semantics and language constructs for graph views. This paper considers the task of implementing such views using rewriting techniques, both using existing property graph DBMSs and converting to relational RDBMSs. We consider both virtual and materialized views, ways of rewriting queries, and structures for indexing data. We also note a common use case of graph views, which involves preserving a graph except for minor local transformations; we develop novel extensions and semantics for this. We evaluate and compare the performance of our techniques under a variety of workloads, and we compare existing graph and relational DBMS platforms.
Recent grants
III: EAGER: Data Integration as a Dialogue with the User
NSF · $150k · 2010–2012
NIH · $451k · 2016–2018
CICI: Data Provenance: Provenance-Based Trust Management for Collaborative Data Curation
NSF · $500k · 2015–2019
III: Small: Promoting Reuse and Retargeting in Data Science
NSF · $500k · 2019–2023
NeTS/NOSS: ASPEN: Abstraction-based Sensor Programming Environment
NSF · $450k · 2007–2012
Frequent coauthors
- 42 shared
Alon Halevy
- 22 shared
AnHai Doan
University of Wisconsin–Madison
- 21 shared
Boon Thau Loo
- 15 shared
Val Tannen
- 15 shared
Igor Tatarinov
- 11 shared
Andreas Haeberlen
University of Pennsylvania
- 11 shared
Todd J. Green
University of California, Davis
- 10 shared
Jonathan M. Smith
California University of Pennsylvania
Education
- 2002
PhD, Computer Science and Engineering
University of Washington