
Susan Davidson
· ProfessorVerifiedUniversity of Pennsylvania · Computer and Information Science
Active 1943–2025
Research topics
- Artificial Intelligence
- Machine Learning
- Computer Science
- Business
Selected publications
SHARQ: Explainability Framework for Association Rules on Relational Data
Proceedings of the ACM on Management of Data · 2025-02-10 · 2 citations
articleAssociation rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e. attribute-value pairs). However, it is difficult to explain the relative importance of data elements with respect to the rules in which they appear. This paper develops a measure of an element's contribution to a set of association rules based on Shapley values, denoted SHARQ (ShApley Rules Quantification). As is the case with many Shapely-based computations, the cost of a naive calculation of the score is exponential in the number of elements. To that end, we present an efficient framework for computing the exact SHARQ value of a single element whose running time is practically linear in the number of rules. Going one step further, we develop an efficient multi-element SHARQ algorithm which amortizes the cost of the single element SHARQ calculation over a set of elements. Based on the definition of SHARQ for elements we describe two additional use-cases for association rules explainability: rule importance and attribute importance. Extensive experiments over a novel benchmark dataset containing 67 instances of mined rule sets show the effectiveness of our approach.
Holistic Query Approximation via RL Modeling
Proceedings of the VLDB Endowment · 2025-02-01
article1st authorCorrespondingIn data exploration, executing queries over a large database can be time-consuming. Previous work has proposed approximate query processing as a way to speed up aggregate queries in this context, but do not address non-aggregate queries. Our paper introduces a novel holistic approach to handle both types of queries by finding an optimized subset of data, referred to as an approximation set. The goal is to maximize query result quality while using a smaller set of data, thereby significantly reducing the query execution time. We formalize this problem as Holistic Approximate Query Processing and establish its NP-completeness. To tackle this, we propose an approximate solution using Reinforcement Learning , termed HARLM. While HARLM does not provide theoretical guarantees due to its reliance on Reinforcement Learning, it effectively overcomes challenges related to the large action space and the need for generalization beyond a known query workload. Experimental results on both non-aggregate and aggregate benchmarks show that HARLM significantly outperforms the baselines both in terms of accuracy (30% improvement) and efficiency (10–35X).
ASQP-RL Demo: Learning Approximation Sets for Exploratory Queries
2024-05-23
articleOpen access1st authorCorrespondingWe demonstrate the Approximate Selection Query Processing (ASQP-RL) system, which uses Reinforcement Learning to select a subset of a large external dataset to process locally in a notebook during data exploration. Given a query workload over an external database and notebook memory size, the system translates the workload to select-project-join (non-aggregate) queries and finds a subset of each relation such that the data subset - called the approximation set - fits into the notebook memory and maximizes query result quality. The data subset can then be loaded into the notebook, and rapidly queried by the analyst. Our demonstration shows how ASQP-RL can be used during data exploration and achieve comparable results to external queries over the large dataset at significantly reduced query times. It also shows how ASQP-RL can be used for aggregation queries, achieving surprisingly good results compared to state-of-the-art techniques.
SHARQ: Explainability Framework for Association Rules on Relational Data
arXiv (Cornell University) · 2024-12-24 · 1 citations
preprintOpen accessAssociation rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e. attribute-value pairs). However, it is difficult to explain the relative importance of data elements with respect to the rules in which they appear. This paper develops a measure of an element's contribution to a set of association rules based on Shapley values, denoted SHARQ (ShApley Rules Quantification). As is the case with many Shapely-based computations, the cost of a naive calculation of the score is exponential in the number of elements. To that end, we present an efficient framework for computing the exact SharQ value of a single element whose running time is practically linear in the number of rules. Going one step further, we develop an efficient multi-element SHARQ algorithm which amortizes the cost of the single element SHARQ calculation over a set of elements. Based on the definition of SHARQ for elements we describe two additional use cases for association rules explainability: rule importance and attribute importance. Extensive experiments over a novel benchmark dataset containing 45 instances of mined rule sets show the effectiveness of our approach.
Learning Approximation Sets for Exploratory Queries
arXiv (Cornell University) · 2024-01-30
preprintOpen access1st authorCorrespondingIn data exploration, executing complex non-aggregate queries over large databases can be time-consuming. Our paper introduces a novel approach to address this challenge, focusing on finding an optimized subset of data, referred to as the approximation set, for query execution. The goal is to maximize query result quality while minimizing execution time. We formalize this problem as Approximate Non-Aggregates Query Processing (ANAQP) and establish its NP-completeness. To tackle this, we propose an approximate solution using advanced Reinforcement Learning architecture, termed ASQP-RL. This approach overcomes challenges related to the large action space and the need for generalization beyond a known query workload. Experimental results on two benchmarks demonstrate the superior performance of ASQP-RL, outperforming baselines by 30% in accuracy and achieving efficiency gains of 10-35X. Our research sheds light on the potential of reinforcement learning techniques for advancing data management tasks. Experimental results on two benchmarks show that ASQP-RL significantly outperforms the baselines both in terms of accuracy (30% better) and efficiency (10-35X). This research provides valuable insights into the potential of RL techniques for future advancements in data management tasks.
Selecting Sub-tables for Data Exploration
2023-04-01 · 5 citations
articleData scientists frequently examine the raw content of large tables when exploring an unknown dataset. In such cases, small subsets of the full tables (sub-tables) that accurately capture table contents are useful. We present a framework which, given a large data table T, creates a sub-table of small, fixed dimensions by selecting a subset of T’s rows and projecting them over a subset of T’s columns. The question is: Which rows and columns should be selected to yield an informative sub-table?Our first contribution is an informativeness metric for sub-tables with two complementary dimensions: cell coverage, which measures how well the sub-table captures prominent data patterns in T, and diversity. We use association rules as the patterns captured by sub-tables, and show that computing optimal sub-tables directly using this metric is infeasible. We then develop an efficient algorithm that indirectly accounts for association rules using table embedding. The resulting framework produces sub-tables for the full table as well as for the results of queries over the table, enabling the user to quickly understand results and determine subsequent queries. Experimental results show that high-quality sub-tables can be efficiently computed, and verify the soundness of our metrics as well as the usefulness of selected sub-tables through user studies.
Selecting Sub-tables for Data Exploration
arXiv (Cornell University) · 2022-03-05 · 1 citations
preprintOpen accessWe present a framework for creating small, informative sub-tables of large data tables to facilitate the first step of data science: data exploration. Given a large data table table T, the goal is to create a sub-table of small, fixed dimensions, by selecting a subset of T's rows and projecting them over a subset of T's columns. The question is: which rows and columns should be selected to yield an informative sub-table? We formalize the notion of "informativeness" based on two complementary metrics: cell coverage, which measures how well the sub-table captures prominent association rules in T, and diversity. Since computing optimal sub-tables using these metrics is shown to be infeasible, we give an efficient algorithm which indirectly accounts for association rules using table embedding. The resulting framework can be used for visualizing the complete sub-table, as well as for displaying the results of queries over the sub-table, enabling the user to quickly understand the results and determine subsequent queries. Experimental results show that we can efficiently compute high-quality sub-tables as measured by our metrics, as well as by feedback from user-studies.
ShapGraph: An Holistic View of Explanations through Provenance Graphs and Shapley Values
Proceedings of the 2022 International Conference on Management of Data · 2022-06-10 · 10 citations
article1st authorCorrespondingExplaining query results is an essential tool for enhancing the transparency and quality of data processing, and has been extensively studied in recent years. In particular, Data Provenance -- the tracking of transformations that data undergoes in query evaluation -- has been shown to be a key component of explanations. A hurdle that remains is that data provenance itself is often too large and complex to be presented in its entirety. To that end, we propose to leverage novel advancements on quantifying and computing the contributions of individual input tuples to query answers, based on the game-theoretic notion of the Shapley value. Our proposed prototype solution, called ShapGraph, combines the global view of explanations through provenance graphs with a local quantification of contributions through Shapley values. The graphical interface allows users to switch between and combine these two views to obtain a deeper understanding of the most influential parts of the database and how they interact to yield query answers.
SubTab: Data Exploration with Informative Sub-Tables
Proceedings of the 2022 International Conference on Management of Data · 2022-06-10 · 7 citations
articleWe demonstrate SubTab, a framework for creating small, informative sub-tables of large data tables to speed up data exploration. Given a table with n rows and m columns where n and m are large, SubTab creates a sub-table T_sub with k<n rows and l<m columns, i.e. a subset of k rows of the table projected over a subset of l columns. The rows and columns are chosen as representatives of prominent data patterns within and across columns in the input table. SubTab can also be used for query results, enabling the user to quickly understand the results and determine subsequent queries.
Credit distribution in relational scientific databases
Information Systems · 2022-05-10 · 5 citations
articleOpen accessDigital data is a basic form of research product for which citation, and the generation of credit or recognition for authors, are still not well understood. The notion of data credit has therefore recently emerged as a new measure, defined and based on data citation groundwork. Data credit is a real value representing the importance of data cited by a research entity. We can use credit to annotate data contained in a curated scientific database and then as a proxy of the significance and impact of that data in the research world. It is a method that, together with citations, helps recognize the value of data and its creators. In this paper, we explore the problem of Data Credit Distribution, the process by which credit is distributed to the database parts responsible for producing data being cited by a research entity. We adopt as use case the IUPHAR/BPS Guide to Pharmacology (GtoPdb), a widely-used curated scientific relational database. We focus on Select-Project-Join (SPJ) queries under bag semantics, and we define three distribution strategies based on how-provenance, responsibility, and the Shapley value. Using these distribution strategies, we show how credit can highlight frequently used database areas and how it can be used as a new bibliometric measure for data and their curators. In particular, credit rewards data and authors based on their research impact, not only on the citation count. We also show how these distribution strategies vary in their sensitivity to the role of an input tuple in the generation of the output data and reward input tuples differently.
Recent grants
II: Data Cooperatives: Rapid and Incremental Data Sharing with Applications to Bioinformatics
NSF · $1.3M · 2005–2009
NSF · $616k · 2010–2014
III: Medium: Collaborative Research: Citing Structured and Evolving Data
NSF · $856k · 2013–2018
Preserving Constraints in XML Data Exchange
NSF · $305k · 2005–2009
NSF · $220k · 2006–2009
Frequent coauthors
- 67 shared
Sarah Cohen‐Boulakia
Laboratoire Interdisciplinaire des Sciences du Numérique
- 40 shared
Tova Milo
- 26 shared
Olivier Biton
University of Pennsylvania
- 22 shared
Christine Froidevaux
- 22 shared
Peter Buneman
- 22 shared
Sanjeev Khanna
- 21 shared
Sudeepa Roy
Duke University
- 19 shared
Julia Stoyanovich
New York University
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Susan Davidson
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup