Zachary Ives

Assistant Professor · Verified

University of Pennsylvania · Computer and Information Science

Active 1998–2025

h-index: 40

Citations: 10.7k

Papers: 170 · 15 in last 5y

Funding: $4.4M

Faculty page


Research topics

Computer Science
Information Retrieval
Programming language
Computer Security
Data Mining
Database
Artificial Intelligence
Natural Language Processing
Archaeology
Theoretical computer science
Geology
Mathematics
Engineering
Data science

Selected publications

Implementing Views for Property Graphs
ACM SIGMOD Record · 2025-04-28 · 1 citation
article · Senior author
Property graph databases are increasingly used to integrate heterogeneous data, motivating graph views to abstract, simplify, and unify the data, e.g., to capture individual-level vs. organization-level relationships. This paper considers the tasks of implementing such views using rewriting techniques, both using existing property graph DBMSs and converting to relational DBMSs. We consider both virtual and materialized views, ways of rewriting queries, and structures for indexing data. We also note a common use case of graph views, which involves preserving a graph except for minor local transformations; we develop novel extensions and semantics for this. We evaluate and compare the performance of our techniques under a variety of workloads, and we compare existing graph and relational DBMS platforms.
Low Rank Learning for Offline Query Optimization
Proceedings of the ACM on Management of Data · 2025-06-17 · 2 citations
article · Open access
Recent deployments of learned query optimizers use expensive neural networks and ad-hoc search policies. To address these issues, we introduce LimeQO, a framework for offline query optimization leveraging low-rank learning to efficiently explore alternative query plans with minimal resource usage. By modeling the workload as a partially observed, low-rank matrix, we predict unobserved query plan latencies using purely linear methods, significantly reducing computational overhead compared to neural networks. We formalize offline exploration as an active learning problem, and present simple heuristics that reduce a 3-hour workload to 1.5 hours after just 1.5 hours of exploration. Additionally, we propose a transductive Tree Convolutional Neural Network (TCNN) that, despite higher computational costs, achieves the same workload reduction with only 0.5 hours of exploration. Unlike previous approaches that place expensive neural networks directly in the query processing "hot" path, our approach offers a low-overhead solution and a no-regressions guarantee, all without making assumptions about the underlying DBMS.
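The idea of treating a workload as a partially observed, low-rank matrix can be illustrated with a small sketch. This is not LimeQO's algorithm: it assumes a rank-1 model fitted by alternating least squares, and the function name `complete_rank1`, the toy latencies, and the layout (rows as queries, columns as candidate plans) are all hypothetical choices made for illustration.

```python
def complete_rank1(observed, n_rows, n_cols, iters=200):
    """Fit latency[i][j] ~ u[i] * v[j] to the observed entries.

    `observed` maps (query_index, plan_index) -> measured latency.
    Each update is a closed-form 1-D least-squares solve, so the
    whole procedure uses only linear methods.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        # Hold v fixed and solve for each row factor u[i].
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            if den > 0:
                u[i] = num / den
        # Hold u fixed and solve for each column factor v[j].
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in observed.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in observed if c == j)
            if den > 0:
                v[j] = num / den
    return u, v


# Toy workload: 3 queries x 3 plans. The hidden "true" matrix is exactly
# rank 1, with latency 4.0 at (0, 2) and 6.0 at (2, 0); both are unobserved.
observed = {
    (0, 0): 2.0, (0, 1): 1.0,
    (1, 0): 4.0, (1, 1): 2.0, (1, 2): 8.0,
    (2, 1): 3.0, (2, 2): 12.0,
}
u, v = complete_rank1(observed, n_rows=3, n_cols=3)

# Predicted latencies for the two plans that were never executed.
predicted = {(0, 2): u[0] * v[2], (2, 0): u[2] * v[0]}
```

In a setting like the one the abstract describes, predictions of this kind could guide which unobserved plan to execute next; how LimeQO actually selects plans is not specified here.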
Data-Agnostic Cardinality Learning from Imperfect Workloads
Proceedings of the VLDB Endowment · 2025-04-01
article · Open access · Senior author
Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data access. Fortunately, query-driven cardinality estimation can learn CardEst models using query workloads. However, existing query-driven models often require access to data or summaries for best performance, and they assume perfect training workloads with complete and balanced join templates (or join graphs). Such assumptions rarely hold in real-world scenarios, in which join templates are incomplete and imbalanced. We present GRASP, a data-agnostic cardinality learning system designed to work under these real-world constraints. GRASP's compositional design generalizes to unseen join templates and is robust to join template imbalance. It also introduces a new per-table CardEst model that handles value distribution shifts for range predicates, and a novel learned count sketch model that captures join correlations across base relations. Across three database instances, we demonstrate that GRASP consistently outperforms existing query-driven models on imperfect workloads, both in terms of estimation accuracy and query latency. Remarkably, GRASP achieves performance comparable to, or even surpassing, traditional approaches built over the underlying data on the complex CEB-IMDb-full benchmark, despite operating without any data access and using only 10% of all possible join templates.
A Practical Theory of Generalization in Selectivity Learning
Proceedings of the VLDB Endowment · 2025-02-01
article · Senior author
Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
QuoteInspector: Gaining Insight about Social Media Discussions
Proceedings of the VLDB Endowment · 2024-08-01
article · Senior author
Our greatest source of insight into the real world today is via social media. Here, a major statement or quote by a public figure (world leader, politician, celebrity, scientist) can have wide-ranging impact, igniting extensive discussions and triggering reactions. It would be helpful to have tools for monitoring, querying, and inspecting the "flow" of social discourse. We introduce QuoteInspector, a system uniquely designed for efficient tracking and analysis of social media discussions around quotes. QuoteInspector leverages modern text embeddings and employs a clustering-based methodology for extracting topics from posts; it further integrates various NLP techniques for in-depth cluster analysis. Additionally, the system enhances the user experience by combining keyword- and relationship-based (structured) search for efficient and precise quote retrieval.
Searching Data Lakes for Nested and Joined Data
Proceedings of the VLDB Endowment · 2024-07-01
article · Senior author
Exploratory data science is driving new platforms that assist data scientists with everyday tasks, such as integration and wrangling, to assemble training datasets. Such tools take scientists' work-in-progress data as a search object (table or JSON) and find relevant supplementary data from an organizational data lake, which can be unioned or joined with the current data. Existing data lake search tools find single, relational tables to match or join with a search object. Yet many data science applications revolve around hierarchical data, which can only be matched by creating views that simultaneously join and transform several tables in the data lake. In this paper, we extend the Juneau data lake search system [46] for this broader class of matches at scale. Our contribution is a general framework for efficiently merging ranked results to match hierarchical data, leveraging novel techniques for indexing and sketching, and incorporating existing single-table search techniques and ranking functions. We experimentally validate our methods' benefits and broad applicability using real data from data science computational notebooks. Our results indicate that, with different ranking functions, our approach can return the optimal set of views up to 4.8x faster and 43% more related compared to heuristics, and increase the data domain coverage by up to 28%. In a case study to show the utility of our results to data science downstream tasks, we reduce regression error by up to 6.6%, and improve classification accuracy by up to 19.5%.
A Practical Theory of Generalization in Selectivity Learning
ArXiv.org · 2024-09-11
preprint · Open access · Senior author
Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
Low Rank Approximation for Learned Query Optimization
2024-05-17 · 1 citation
article
We present LimeQO, a learned steering query optimizer based on linear methods, such as matrix completion, for repetitive workloads. LimeQO can forgo expensive neural networks by taking advantage of the low-rank structure of query workloads. Using offline execution, LimeQO can accelerate workloads by up to 2x with zero regressions in just a few hours, while using 100-1000x fewer computational resources than deep learning techniques.
Modeling Shifting Workloads for Learned Database Systems
Proceedings of the ACM on Management of Data · 2024-03-12 · 10 citations
article · Senior author
Learned database systems address several weaknesses of traditional cost estimation techniques in query optimization: they learn a model of a database instance, e.g., as queries are executed. However, when the database instance has skew and correlation, it is nontrivial to create an effective training set that anticipates workload shifts, where query structure changes and/or different regions of the data contribute to query answers. Our predictive model may perform poorly with these out-of-distribution inputs. In this paper, we study how the notion of a replay buffer can be managed through online algorithms to build a concise yet representative model of the workload distribution, allowing for rapid adaptation and effective prediction of cardinalities and costs. We experimentally validate our methods over several data domains.
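As a rough illustration of a replay buffer maintained by an online algorithm, the sketch below uses classic reservoir sampling (Algorithm R). This is a swapped-in, textbook technique and not necessarily what the paper proposes; the class name `ReservoirReplayBuffer` and the (query, cardinality) tuples are hypothetical.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size buffer holding a uniform random sample of a stream.

    Assumption: uniform sampling stands in for whatever policy a learned
    database system would actually use to keep the buffer representative.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the example.
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen,
            # evicting a uniformly chosen resident if it is kept.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example


# Stream 100 hypothetical (query, observed cardinality) training examples.
stream = [(f"q{i}", i * 10) for i in range(100)]
buffer = ReservoirReplayBuffer(capacity=5)
for example in stream:
    buffer.add(example)
```

The buffer never grows beyond its capacity, so a model could be retrained on `buffer.items` at a bounded cost as the workload shifts.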
Implementation Strategies for Views over Property Graphs
Proceedings of the ACM on Management of Data · 2024-05-29 · 8 citations
article · Senior author
The need to query complex interactions and relationships has motivated interest in property graph database platforms. For some graph applications, graph views are required to abstract the data, e.g., to capture individual-level vs. organization-level relationships, or to show single computational steps vs. composite workflows. Emerging efforts to standardize graph query languages have developed semantics and language constructs for graph views. This paper considers the task of implementing such views using rewriting techniques, both using existing property graph DBMSs and converting to relational DBMSs. We consider both virtual and materialized views, ways of rewriting queries, and structures for indexing data. We also note a common use case of graph views, which involves preserving a graph except for minor local transformations; we develop novel extensions and semantics for this. We evaluate and compare the performance of our techniques under a variety of workloads, and we compare existing graph and relational DBMS platforms.