
About
Raul Castro Fernandez is an Assistant Professor of Computer Science at the University of Chicago. His research focuses on data management, data science, databases, systems, and the value of data. He is interested in how to make the best use of data by building systems to share, discover, prepare, integrate, and process it, often using techniques from data management, statistics, and machine learning. Fernandez is part of ChiData, the data systems research group at the University of Chicago, where he conducts research on large-scale video analysis, efficient data processing systems, and the economics of data. His work aims to advance the understanding of data sharing markets and improve data utilization, contributing to the fields of systems, architecture, and networking.
Research topics
- Computer Science
- Data Mining
- Artificial Intelligence
- Information Retrieval
- Computer Security
- Machine Learning
- Data science
- Database
- World Wide Web
- Distributed computing
- Business
- Operating system
- Economics
- Finance
- Computer network
Selected publications
Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking
Lecture notes in computer science · 2025-11-23
Book chapter
What is the Value of Data? A Theory and Systematization
ACM / IMS Journal of Data Science · 2025-03-31
Article · Open access · 1st author · Corresponding
Data powers economies, shapes societies, and fuels decision-making, yet its value remains poorly understood. Despite its centrality, we lack a unified framework for defining, measuring, and reasoning about data’s worth. This article develops a theory and systematization of the value of data, explaining why, how, and when data generates value. We distinguish data from documents, separate objective value from subjective judgments, and identify key dimensions of data’s worth. Our framework reconciles disparate notions of information, knowledge, and utility, offering insights that validate known principles while uncovering new opportunities to extract value from data. More than a taxonomy, this work provides a conceptual foundation for integrating perspectives from computer science, economics, and beyond. The conceptual foundation clarifies data’s role in technology, markets, and governance, advancing our ability to systematically understand and harness its value.
Proceedings of the ACM on Management of Data · 2025-06-17 · 6 citations
Article · Open access · Senior author
Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its content but also on the context, which is usually tribal knowledge known to the individual or team. While tools like data catalogs and academic data discovery systems target this problem, they rely on keyword search or more complex interfaces, limiting non-technical users' ability to find relevant data. The advent of large language models (LLMs) offers a unique opportunity for users to ask questions directly in natural language, making dataset discovery more intuitive, accessible, and efficient. In this paper, we introduce Pneuma, a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. Pneuma leverages LLMs for both table representation and table retrieval. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search, harnessing the strengths of both to improve retrieval performance. To evaluate Pneuma, we generate comprehensive benchmarks that simulate table discovery workload on six real-world datasets including enterprise data, scientific databases, warehousing data, and open data. Our results demonstrate that Pneuma outperforms widely used table search systems (such as full-text search and state-of-the-art RAG systems) in accuracy and resource efficiency.
Where Does Academic Database Research Go From Here?
ArXiv.org · 2025-04-11
Preprint · Open access · Senior author
Panel proposal for an open forum to discuss and debate the future of database research in the context of industry, other research communities, and AI. Includes summaries of past panels, positions from panelists, as well as positions from a sample of the data management community.
Where Does Academic Database Research Go from Here?
Proceedings of the VLDB Endowment · 2025-08-01
Article · Senior author
An open forum to discuss and debate the future of database research in the context of industry, other research communities, and AI.
2025-06-22
Article · Open access · Senior author
Core Hours and Carbon Credits: Incentivizing Sustainability in HPC
2025-11-12 · 2 citations
Article · Open access
Efforts to reduce the environmental impact of HPC often focus on resource providers, but choices made by users, e.g., concerning where to run, can be equally consequential. Here we present evidence that new accounting methods that charge users for energy used can incentivize significantly more efficient behavior. We first survey 300 HPC users and find that fewer than 30% are aware of their energy consumption, and that energy efficiency is a low priority concern. We then propose two new multi-resource accounting methods that charge for computations based on their energy consumption or carbon footprint, respectively. Finally, we conduct both simulation studies and a user study to evaluate the impact of these two methods on user behavior. We find that while only providing users feedback on their energy use had no impact on their behavior, associating energy with cost incentivized users to select more efficient resources, and use 40% less energy.
Core Hours and Carbon Credits: Incentivizing Sustainability in HPC
arXiv (Cornell University) · 2025-01-16
Preprint · Open access
Realizing a shared responsibility between providers and consumers is critical to manage the sustainability of HPC. However, while cost may motivate efficiency improvements by infrastructure operators, broader progress is impeded by a lack of user incentives. We conduct a survey of HPC users that reveals fewer than 30 percent are aware of their energy consumption, and that energy efficiency is among users' lowest priority concerns. One explanation is that existing pricing models may encourage users to prioritize performance over energy efficiency. We propose two transparent multi-resource pricing schemes, Energy- and Carbon-Based Accounting, that seek to change this paradigm by incentivizing more efficient user behavior. These two schemes charge for computations based on their energy consumption or carbon footprint, respectively, rewarding users who leverage efficient hardware and software. We evaluate these two pricing schemes via simulation, in a prototype, and a user study.
Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior
ArXiv.org · 2025-06-03
Preprint · Open access · Senior author
As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the correlation value observed, then such correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model's raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs and it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
Where Does Academic Database Research Go From Here?
2025-06-17
Article · Senior author
Frequent coauthors
- Samuel Madden · 19 shared
- Bhuvana Ramabhadran · 18 shared
- Sainyam Galhotra (Cornell University) · 12 shared
- Peter Pietzuch · 10 shared
- Michael Stonebraker (Massachusetts Institute of Technology) · 10 shared
- Pranav Subramaniam · 10 shared
- Ron Hoory · 10 shared
- Rosalind W. Picard (Massachusetts Institute of Technology) · 9 shared
Education
- 2010 · Ph.D., Computer Science · University of Chicago
- 2006 · M.S., Computer Science · University of Chicago
- 2004 · B.S., Computer Science · University of Chicago
Awards & honors
- 2025 Sloan Fellowship
- 2024 NSF CAREER Award
- 2023 SIGMOD Test of Time Award