Sanjay Krishnan

Assistant Professor of Computer Science · Verified

University of Chicago · Computer Science

Active 1978–2025

h-index: 35
Citations: 4.8k
Papers: 185 · 71 in last 5y

About

Sanjay Krishnan is an Assistant Professor of Computer Science at the University of Chicago. His research group studies corrupted, missing, or otherwise uncertain data in database and information retrieval systems. His current research focuses on systems that can provide certifiable accuracy guarantees in partially complete databases, query accuracy evaluation in corrupted databases, and automatic detection of data leakage. His research areas include Data Science, Databases, and Machine Learning, with a particular emphasis on managing and analyzing data at scale. He is involved in labs and groups such as ChiDATA and the Systems Group, conducting research on large-scale video analysis, efficient data processing systems, and the economics of data. His work aims to advance understanding and development of systems that handle uncertain data, ensuring reliability and security in data management and retrieval processes.

Research topics

  • Computer Science
  • Data Mining
  • Information Retrieval
  • Telecommunications
  • Algorithm
  • World Wide Web
  • Database
  • Real-time computing
  • Data science
  • Engineering
  • Computer network

Selected publications

  • Natural scenes are more compressible and less memorable than human-made scenes

    2025-05-15 · 1 citation

    preprint · Open access

    Humans often cannot process all the information available within an environment, but instead filter out much of it. This study examines whether the extent of information filtering may differ between environments, specifically natural and human-made environments. Across three behavioral experiments and computational analysis of 108,754 scene images, we analyzed the spectral and edge content of scenes to quantify the proportion of noticeable information. Our findings reveal that natural scenes have a lower proportion of noticeable information compared to human-made scenes, resulting in higher compressibility. Furthermore, natural scenes were consistently less memorable than human-made scenes, suggesting that greater information filtering occurs during encoding into memory. The lower memorability of natural scenes was partially explained by their higher compressibility. Our results indicate that compressibility, or the density of noticeable information, could be a key feature distinguishing natural environments from human-made environments, potentially explaining the benefits of interacting with natural environments.

  • Metadata Management for AI-Augmented Data Workflows

    arXiv.org · 2025-08-09

    preprint · Open access · Senior author

    AI-augmented data workflows introduce complex governance challenges, as both human and model-driven processes generate, transform, and consume data artifacts. These workflows blend heterogeneous tools, dynamic execution patterns, and opaque model decisions, making comprehensive metadata capture difficult. In this work, we present TableVault, a metadata governance framework designed for human-AI collaborative data creation. TableVault records ingestion events, traces operation status, links execution parameters to their data origins, and exposes a standardized metadata layer. By combining database-inspired guarantees with AI-oriented design, such as declarative operation builders and lineage-aware references, TableVault supports transparency and reproducibility across mixed human-model pipelines. Through a document classification case study, we demonstrate how TableVault preserves detailed lineage and operational context, enabling robust metadata management, even in partially observable execution environments.

  • DeepScribe: Localization and Classification of Elamite Cuneiform Signs via Deep Learning

    Journal on Computing and Cultural Heritage · 2025-03-07

    article · Open access · Senior author

    Twenty-five hundred years ago, the “paperwork” of the Achaemenid Empire was recorded on clay tablets. In 1933, archaeologists from the University of Chicago’s Institute for the Study of Ancient Cultures (ISAC, formerly Oriental Institute) found tens of thousands of these tablets and fragments during the excavation of Persepolis. Many of these tablets have been painstakingly photographed and annotated by expert cuneiformists, and now provide a rich dataset consisting of over 5,000 annotated tablet images and 100,000 cuneiform sign bounding boxes encoding the Elamite language. We leverage this dataset to develop DeepScribe, the first computer vision pipeline capable of localizing Elamite cuneiform signs and providing suggestions for the identity of each sign. We investigate the difficulty of learning subtasks relevant to Elamite cuneiform tablet transcription on ground-truth data, finding that a RetinaNet object detector achieves a localization mAP of 0.78 and a ResNet classifier achieves a top-5 sign classification accuracy of 0.89. The end-to-end pipeline achieves a top-5 classification accuracy of 0.80. As part of the classification module, DeepScribe groups cuneiform signs into morphological clusters. We consider how this automatic clustering approach differs from the organization of standard, printed sign lists and what we learn from it. These components, trained individually, are sufficient to produce a system that can analyze photos of cuneiform tablets from the Achaemenid period and provide useful transliteration suggestions to researchers. We evaluate the model’s end-to-end performance on locating and classifying signs, providing a roadmap to a linguistically aware transliteration system, then consider the model’s potential utility when applied to other periods of cuneiform writing.

  • Range (Rényi) Entropy Queries and Partitioning

    Logical Methods in Computer Science · 2025-10-30

    article · Open access

    Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the Rényi entropy is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be accelerated if we have a data structure to compute the entropy in different subsets of data when the algorithm needs to decide what block to construct. Such a data structure will also be useful for data analysts exploring different subsets of data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution in the offline or streaming setting efficiently, we focus on the query setting where we aim to efficiently derive the entropy among a subset of data that satisfy some linear predicates. We solve this problem in a typical setting when we deal with real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$, where $d$ is a constant. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low space data structure, such that given a query (hyper)rectangle $R$, it computes the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$, in sublinear time. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for both the range S-entropy and R-entropy query problems. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near linear space data structures for returning either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.

  • Agentic Workflows for Extraction of Access Control Matrices from Policy Documents

    2025-05-19

    article · Senior author

    Configuring access control on databases is a critical task that can be achieved by determining the intended access control permissions from a precise format, such as an access control matrix (ACM). However, database access control permissions are often expressed as parts of security policy documents. Currently, these permissions must be manually extracted as an access control matrix in order to correctly configure the intended access control permissions. To accurately automate this tedious extraction process over documents of varying structures, we envision an agentic workflow for training an ACM extraction agent. To this end, we identify ACM Extraction and ACM Comparison as key components of such a procedure, and develop LLM agents for these components. We generate a benchmark to train and evaluate ACM Extraction using a real-world security policy and systematic perturbations of it. We find that our ACM Extraction agent has near-perfect accuracy, indicating our proposed agentic workflow is likely to be highly accurate for documents similar to those in our benchmark. As future work, we identify the iterative addition of failure-mode-specific rules as a strategy that would generalize our agentic workflow to broader classes of documents.

  • Neighborhood sociome factors and pediatric asthma exacerbations: Protective role of tree crown density and importance of pharmacy access in Chicago's south side

    Pediatric Allergy and Immunology · 2025-07-01 · 4 citations

    article · Open access

    BACKGROUND: Pediatric asthma exacerbations remain a critical public health concern, particularly in historically underserved urban settings. OBJECTIVE: This study investigates sociome factors (the social context of disease) associated with asthma exacerbations among children living in Chicago's South Side, leveraging clinical and publicly available generalizable census tract-level datasets from agencies including ChiVes, the City of Chicago Data Portal, EPA, Census Bureau, HUD, NOAA, and more. The aim is to uncover novel hypotheses for potential new interventions. METHODS: A generalized linear model assessed associations with the outcome of asthma exacerbations while accounting for clustering at the patient level. Predictors included all variables from the Sociome Data Commons, including social, environmental, behavioral, economic, housing, and school variables. RESULTS: Predictors of decreased risk included patient age (+4.8 years, -22%), tree crown density (+6% coverage, -17%), parks per acre (+0.41, -8%), and labor market engagement (+0.8 points, -9%). Conversely, predictors of increased risk included increased distance to the nearest pharmacy (+0.28 miles, +12%), limited English skills (+2.3%, +10%), higher inequality (+0.08 points, +8%), and visits in the Spring (+11%) and Fall (+20%). CONCLUSION: The results suggest that tree crown density, a novel finding in the context of asthma exacerbations, may play a protective role. Limited access to health care facilities such as pharmacies continues to complicate care. CLINICAL IMPLICATIONS: These findings provide hypotheses for future interventions for long-standing asthma disparities.

  • Author response for "Neighborhood sociome factors and pediatric asthma exacerbations: Protective role of tree crown density and importance of pharmacy access in Chicago's south side"

    2025-05-28

    peer-review
  • Author response for "Neighborhood sociome factors and pediatric asthma exacerbations: Protective role of tree crown density and importance of pharmacy access in Chicago's south side"

    2025-05-22

    peer-review
  • Fast Capture of Cell-Level Provenance in Numpy

    2025-06-22

    article · Open access · Senior author

    Effective provenance tracking enhances reproducibility, governance, and data quality in array workflows. However, significant challenges arise in capturing this provenance, including: (1) rapidly evolving APIs, (2) diverse operation types, and (3) large-scale datasets. To address these challenges, this paper presents a prototype annotation system designed for arrays, which captures cell-level provenance specifically within the numpy library. With this prototype, we explore straightforward memory optimizations that substantially reduce annotation latency. We envision this provenance capture approach for arrays as part of a broader governance system for tracking for structured data workflows and diverse data science applications.

  • Data Makes Better Data Scientists

    arXiv (Cornell University) · 2024-05-27

    preprint · Open access · Senior author

    With the goal of identifying common practices in data science projects, this paper proposes a framework for logging and understanding incremental code executions in Jupyter notebooks. This framework aims to allow reasoning about how insights are generated in data science and extract key observations into best data science practices in the wild. In this paper, we show an early prototype of this framework and ran an experiment to log a machine learning project for 25 undergraduate students.

Education

  • Ph.D., Computer Science

    University of Chicago

    2005
  • M.S., Computer Science

    University of Illinois at Urbana-Champaign

    2000
  • B.S., Computer Science

    University of Illinois at Urbana-Champaign

    1998

Awards & honors

  • CAREER Award for Resource-Efficient Databases (2021)