Sanjay Krishnan

Assistant Professor of Computer Science · Verified

University of Chicago · Computer Science

Active 1978–2025

h-index: 35

Citations: 4.8k

Papers: 185 · 71 in last 5y

Funding: —



About

Sanjay Krishnan is an Assistant Professor of Computer Science at the University of Chicago. His research group studies corrupted, missing, or otherwise uncertain data in database and information retrieval systems. His current research focuses on systems that can provide certifiable accuracy guarantees in partially complete databases, query accuracy evaluation in corrupted databases, and automatic detection of data leakage. His research areas include Data Science, Databases, and Machine Learning, with a particular emphasis on managing and analyzing data at scale. He is involved in labs and groups such as ChiDATA and the Systems Group, conducting research on large-scale video analysis, efficient data processing systems, and the economics of data. His work aims to advance understanding and development of systems that handle uncertain data, ensuring reliability and security in data management and retrieval processes.

Research topics

Computer Science
Data Mining
Information Retrieval
Telecommunications
Algorithm
World Wide Web
Database
Real-time computing
Data science
Engineering
Computer network

Selected publications

Natural scenes are more compressible and less memorable than human-made scenes
2025-05-15 · 1 citation
preprint · Open access
Humans often cannot process all the information available within an environment, but instead filter out much of it. This study examines whether the extent of information filtering may differ between environments, specifically natural and human-made environments. Across three behavioral experiments and computational analysis of 108,754 scene images, we analyzed the spectral and edge content of scenes to quantify the proportion of noticeable information. Our findings reveal that natural scenes have a lower proportion of noticeable information compared to human-made scenes, resulting in higher compressibility. Furthermore, natural scenes were consistently less memorable than human-made scenes, suggesting that greater information filtering occurs during encoding into memory. The lower memorability of natural scenes was partially explained by their higher compressibility. Our results indicate that compressibility, or the density of noticeable information, could be a key feature distinguishing natural environments from human-made environments, potentially explaining the benefits of interacting with natural environments.
Metadata Management for AI-Augmented Data Workflows
ArXiv.org · 2025-08-09
preprint · Open access · Senior author
AI-augmented data workflows introduce complex governance challenges, as both human and model-driven processes generate, transform, and consume data artifacts. These workflows blend heterogeneous tools, dynamic execution patterns, and opaque model decisions, making comprehensive metadata capture difficult. In this work, we present TableVault, a metadata governance framework designed for human-AI collaborative data creation. TableVault records ingestion events, traces operation status, links execution parameters to their data origins, and exposes a standardized metadata layer. By combining database-inspired guarantees with AI-oriented design, such as declarative operation builders and lineage-aware references, TableVault supports transparency and reproducibility across mixed human-model pipelines. Through a document classification case study, we demonstrate how TableVault preserves detailed lineage and operational context, enabling robust metadata management, even in partially observable execution environments.
DeepScribe: Localization and Classification of Elamite Cuneiform Signs via Deep Learning
Journal on Computing and Cultural Heritage · 2025-03-07
article · Open access · Senior author
Twenty-five hundred years ago, the “paperwork” of the Achaemenid Empire was recorded on clay tablets. In 1933, archaeologists from the University of Chicago’s Institute for the Study of Ancient Cultures (ISAC, formerly Oriental Institute) found tens of thousands of these tablets and fragments during the excavation of Persepolis. Many of these tablets have been painstakingly photographed and annotated by expert cuneiformists, and now provide a rich dataset consisting of over 5,000 annotated tablet images and 100,000 cuneiform sign bounding boxes encoding the Elamite language. We leverage this dataset to develop DeepScribe, the first computer vision pipeline capable of localizing Elamite cuneiform signs and providing suggestions for the identity of each sign. We investigate the difficulty of learning subtasks relevant to Elamite cuneiform tablet transcription on ground-truth data, finding that a RetinaNet object detector achieves a localization mAP of 0.78 and a ResNet classifier achieves a top-5 sign classification accuracy of 0.89. The end-to-end pipeline achieves a top-5 classification accuracy of 0.80. As part of the classification module, DeepScribe groups cuneiform signs into morphological clusters. We consider how this automatic clustering approach differs from the organization of standard, printed sign lists and what we learn from it. These components, trained individually, are sufficient to produce a system that can analyze photos of cuneiform tablets from the Achaemenid period and provide useful transliteration suggestions to researchers. We evaluate the model’s end-to-end performance on locating and classifying signs, providing a roadmap to a linguistically aware transliteration system, then consider the model’s potential utility when applied to other periods of cuneiform writing.
Range (Rényi) Entropy Queries and Partitioning
Logical Methods in Computer Science · 2025-10-30
article · Open access
Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the Rényi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be accelerated if we have a data structure to compute the entropy in different subsets of data when the algorithm needs to decide what block to construct. Such a data structure will also be useful for data analysts exploring different subsets of data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution in the offline or streaming setting efficiently, we focus on the query setting where we aim to efficiently derive the entropy among a subset of data that satisfy some linear predicates. We solve this problem in a typical setting when we deal with real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$, where $d$ is a constant. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low space data structure, such that given a query (hyper)rectangle $R$, it computes the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$, in sublinear time. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for both the range S-entropy and R-entropy query problems. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near linear space data structures for returning either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.
Agentic Workflows for Extraction of Access Control Matrices from Policy Documents
2025-05-19
article · Senior author
Configuring access control on databases is a critical task that can be achieved by determining the intended access control permissions from a precise format, such as an access control matrix (ACM). However, database access control permissions are often expressed as parts of security policy documents. Currently, these permissions must be manually extracted as an access control matrix in order to correctly configure the intended access control permissions. To accurately automate this tedious extraction process over documents of varying structures, we envision an agentic workflow for training an ACM extraction agent. To this end, we identify ACM Extraction and ACM Comparison as key components of such a procedure, and develop LLM agents for these components. We generate a benchmark to train and evaluate ACM Extraction using a real-world security policy and systematic perturbations of it. We find that our ACM Extraction agent has near-perfect accuracy, indicating our proposed agentic workflow is likely to be highly accurate for documents similar to those in our benchmark. As future work, we identify the iterative addition of failure-mode-specific rules as a strategy that would generalize our agentic workflow to broader classes of documents.
Neighborhood sociome factors and pediatric asthma exacerbations: Protective role of tree crown density and importance of pharmacy access in Chicago's south side
Pediatric Allergy and Immunology · 2025-07-01 · 4 citations
article · Open access
BACKGROUND: Pediatric asthma exacerbations remain a critical public health concern, particularly in historically underserved urban settings. OBJECTIVE: This study investigates sociome factors (the social context of disease) associated with asthma exacerbations among children living in Chicago's South Side, leveraging clinical and publicly available generalizable census tract-level datasets from agencies including ChiVes, the City of Chicago Data Portal, EPA, Census Bureau, HUD, NOAA, and more. The aim is to uncover novel hypotheses for potential new interventions. METHODS: A generalized linear model assessed associations with the outcome of asthma exacerbations while accounting for clustering at the patient level. Predictors included all variables from the Sociome Data Commons, including social, environmental, behavioral, economic, housing, and school variables. RESULTS: Predictors of decreased risk included patient age (+4.8 years, -22%), tree crown density (+6% coverage, -17%), parks per acre (+0.41, -8%), and labor market engagement (+0.8 points, -9%). Conversely, predictors of increased risk included increased distance to the nearest pharmacy (+0.28 miles, +12%), limited English skills (+2.3%, +10%), higher inequality (+0.08 points, +8%), and visits in the Spring (+11%) and Fall (+20%). CONCLUSION: The results suggest that tree crown density, a novel finding in the context of asthma exacerbations, may play a protective role. Limited access to health care facilities such as pharmacies continues to complicate care. CLINICAL IMPLICATIONS: These findings provide hypotheses for future interventions for long-standing asthma disparities.
Author response for "Neighborhood sociome factors and pediatric asthma exacerbations: Protective role of tree crown density and importance of pharmacy access in Chicago's south side"
2025-05-28
peer-review
Author response for "Neighborhood sociome factors and pediatric asthma exacerbations: Protective role of tree crown density and importance of pharmacy access in Chicago's south side"
2025-05-22
peer-review
Fast Capture of Cell-Level Provenance in Numpy
2025-06-22
article · Open access · Senior author
Effective provenance tracking enhances reproducibility, governance, and data quality in array workflows. However, significant challenges arise in capturing this provenance, including: (1) rapidly evolving APIs, (2) diverse operation types, and (3) large-scale datasets. To address these challenges, this paper presents a prototype annotation system designed for arrays, which captures cell-level provenance specifically within the numpy library. With this prototype, we explore straightforward memory optimizations that substantially reduce annotation latency. We envision this provenance capture approach for arrays as part of a broader governance system for tracking for structured data workflows and diverse data science applications.
Data Makes Better Data Scientists
arXiv (Cornell University) · 2024-05-27
preprint · Open access · Senior author
With the goal of identifying common practices in data science projects, this paper proposes a framework for logging and understanding incremental code executions in Jupyter notebooks. This framework aims to allow reasoning about how insights are generated in data science and extract key observations into best data science practices in the wild. In this paper, we show an early prototype of this framework and run an experiment to log a machine learning project for 25 undergraduate students.

Frequent coauthors

Michael J. Franklin
University of Chicago
61 shared
Ken Goldberg
University of California, Berkeley
52 shared
Ken Goldberg
41 shared
Aaron J. Elmore
University of Chicago
38 shared
Arie van Deursen
Delft University of Technology
36 shared
Peter A. Raymond
36 shared
D. Bosscher
University of Aberdeen
36 shared
Marinus van Hulst
University of Groningen
36 shared

Labs

Systems Group · PI

Education

Ph.D., Computer Science
University of Chicago
2005
M.S., Computer Science
University of Illinois at Urbana-Champaign
2000
B.S., Computer Science
University of Illinois at Urbana-Champaign
1998

Awards & honors

CAREER Award for Resource-Efficient Databases (2021)
